How Fast can Python Parse 1 Billion Rows of Data?
140,202 Views • Apr 13, 2024
To try everything Brilliant has to offer—free—for a full 30 days, visit brilliant.org/DougMercer .
You’ll also get 20% off an annual premium subscription.

———————————————————————————————
Sign up for 1-on-1 coaching at dougmercer.dev/
———————————————————————————————

The 1 billion row challenge is a fun challenge exploring how quickly we can parse a large text file and compute some summary statistics. The coding community created some amazingly clever solutions.
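As a rough illustration of the task (not the code from the video), here is a minimal baseline sketch. It assumes the input format described in the challenge repository: one `station;temperature` measurement per line. The function name `summarize` is hypothetical.

```python
from collections import defaultdict


def summarize(path):
    """Return {station: (min, mean, max)}, sorted by station name."""
    # Per-station accumulator: [min, max, sum, count]
    stats = defaultdict(lambda: [float("inf"), float("-inf"), 0.0, 0])
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:
                continue  # tolerate blank lines
            station, _, value = line.partition(";")
            temp = float(value)
            acc = stats[station]
            if temp < acc[0]:
                acc[0] = temp
            if temp > acc[1]:
                acc[1] = temp
            acc[2] += temp
            acc[3] += 1
    return {
        name: (lo, total / n, hi)
        for name, (lo, hi, total, n) in sorted(stats.items())
    }
```

Reading line by line keeps memory use flat, but every row still goes through the interpreter, which is why a loop like this is slow at a billion rows.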

In this video, I walk through some of the top strategies for writing highly performant code in Python. I start with the simplest possible approach, and work my way through JIT compilation, multiprocessing, and memory mapping. By the end, I have a pure Python implementation that is only one order of magnitude slower than the highly optimized Java challenge winner.
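The multiprocessing and memory-mapping ideas can be sketched roughly as follows. This is an illustrative sketch under stated assumptions, not the implementation from the video: it assumes `station;temperature` lines as in the challenge repository, and every function name here (`chunk_bounds`, `process_chunk`, `merge`, `finalize`, `summarize_parallel`) is hypothetical.

```python
import mmap
import os
from multiprocessing import Pool


def chunk_bounds(path, n_chunks):
    """Split the file into (start, end) byte ranges that end on line breaks."""
    size = os.path.getsize(path)
    bounds, start = [], 0
    with open(path, "rb") as f:
        for i in range(1, n_chunks + 1):
            if start >= size:
                break
            if i == n_chunks:
                end = size
            else:
                f.seek(size * i // n_chunks)
                f.readline()  # move forward to the end of the current line
                end = f.tell()
            if end > start:
                bounds.append((start, end))
                start = end
    return bounds


def process_chunk(task):
    """Aggregate one byte range; returns {station: [min, max, sum, count]}."""
    path, start, end = task
    stats = {}
    with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        mm.seek(start)
        while mm.tell() < end:
            line = mm.readline().rstrip(b"\n")
            if not line:
                continue
            station, _, value = line.partition(b";")
            temp = float(value.decode("ascii"))
            acc = stats.get(station)
            if acc is None:
                stats[station] = [temp, temp, temp, 1]
            else:
                acc[0] = min(acc[0], temp)
                acc[1] = max(acc[1], temp)
                acc[2] += temp
                acc[3] += 1
    return stats


def merge(partials):
    """Combine per-chunk accumulators into one dictionary."""
    total = {}
    for part in partials:
        for station, (lo, hi, s, n) in part.items():
            acc = total.get(station)
            if acc is None:
                total[station] = [lo, hi, s, n]
            else:
                acc[0] = min(acc[0], lo)
                acc[1] = max(acc[1], hi)
                acc[2] += s
                acc[3] += n
    return total


def finalize(total):
    """Return {station name: (min, mean, max)}, sorted by name."""
    return {
        key.decode("utf-8"): (lo, s / n, hi)
        for key, (lo, hi, s, n) in sorted(total.items())
    }


def summarize_parallel(path, workers=4):
    """Process each chunk in its own worker process, then merge the results."""
    tasks = [(path, s, e) for s, e in chunk_bounds(path, workers)]
    with Pool(workers) as pool:
        partials = pool.map(process_chunk, tasks)
    return finalize(merge(partials))
```

Each worker maps the file read-only and only touches its own byte range, so the only data sent between processes is the small per-station dictionary. When run as a script, call `summarize_parallel` under an `if __name__ == "__main__":` guard so that worker processes can import the module safely.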

On top of that, I show two much simpler, but just as performant, solutions that use the polars dataframe library and duckdb (an in-memory SQL database). In practice, you should use these, because they are incredibly fast and easy to use.

If you want to take a stab at speeding things up further, you can find the code here: github.com/dougmercer-yt/1brc

References
------------------
Main challenge - github.com/gunnarmorling/1brc
Ifnesi - github.com/ifnesi/1brc/tree/main
Booty - github.com/booty/ruby-1-billion/
Danny van Kooten C solution blog post - www.dannyvankooten.com/blog/2024/1brc/
Awesome duckdb blog post - rmoff.net/2024/01/03/1%EF%B8%8F%E2%83%A3%EF%B8%8F-…
PyPy vs CPython duel blog post - jszafran.dev/posts/how-pypy-impacts-the-performanc…

Chapters
----------------
0:00 Intro
1:09 Let's start simple
2:55 Let's make it fast
10:48 Third party libraries
13:17 But what about Java or C?
14:17 Sponsor
16:04 Outro

Music
----------
"4" by HOME, released under CC BY 3.0 DEED, home96.bandcamp.com/album/resting-state

Go buy their music!

Disclosure
-----------------
This video was sponsored by Brilliant.

#python #datascience #pypy #polars #duckdb #1brc
Metadata And Engagement

Views : 140,202
Genre: Education
Date of upload: Apr 13, 2024


Rating : 4.897 (121/4,576 LTDR)
RYD date created : 2024-05-22T03:36:27.231384Z

YouTube Comments - 368 Comments

Top Comments of this video!! :3

@dougmercer

1 month ago

To try everything Brilliant has to offer—free—for a full 30 days, visit brilliant.org/DougMercer . You’ll also get 20% off an annual premium subscription.

10 |

@eddie_dane

1 month ago

Are mustaches the new hoodies for programmers now?

313 |

@danieljakob1307

1 month ago

The Summoning Salt homage at 8:26 is brilliant. Fantastic video!

206 |

@guinea_horn

1 month ago

C can't be slower than Java, can it? The slowest C implementation would be to implement the entire JVM and then write bad Java code

476 |

@joker345172

1 month ago

8:24 Amazing trick! It reminds me of computer graphics class where we had to find a way to improve the DDA Line algorithm... No one could do it. Then, the professor showed us the Bresenham algorithm. It's such a simple concept - instead of working with floats, work with integers! - but it saves soooo much time. It goes to show that sometimes the data type you're working with can have a huge effect on how fast your code is. Drawing a parallel to Machine Learning, this is also why new GPUs have FP8 and FP16 as big selling points. Training with FP32, which is still the standard for a lot of applications, is just dog slow compared to using FP16 or even FP8.

55 |

@BosonCollider

1 month ago

The actual lessons from this are: 1: use duckdb 2: otherwise, use polars 3: use pypy more, and push back against libraries that are incompatible with it

93 |

@mathmaniac43

4 weeks ago

What did you not like about the index variables in booty's original code? I find named variable indexes more readable than "magic numbers". I would have probably used an enum with incrementing values instead.

12 |

@smol.bird42

1 month ago

your editing has so much taste, great video bro

32 |

@otty4000

1 month ago

wow this was a really great video. It's impressive to explain code/library differences that quickly and clearly.

15 |

@FirroLP

1 month ago

Dude, your production quality is so good it's criminal. Had to tell you

15 |

@shadamethyst1258

1 month ago

I'm impressed you did not do any profiling, nor any statistical test to rule out measurement fluctuations

73 |

@50shmekles

3 weeks ago

This is one of the most well-done, detailed and thorough yet clear, concise and to the point videos ever. Thank you for introducing me to new concepts and libraries!

2 |

@fatcats7727

4 weeks ago

Just wanted to say, all of your videos are incredibly clean and well edited, and although the algorithm isn’t picking it up rn, your efforts will not go unnoticed!

3 |

@richardrubin2192

1 month ago

This is great - thanks, Doug!

3 |

@artlenski8115

1 month ago

Highly optimised C with proper compiler specifiers taking almost double the time of Java implementation, even if GC is turned off.. hard to believe.

7 |

@skanderghamgui5039

1 month ago

I had a project last year where I had to automate a manual process using Python to extract data from an Excel file and auto-fill an XML file. After I finished the project, I reduced the process from 3 months of human work to a 20-minute code run, which made me and my boss very happy. I wish I had seen this video last year; we could have been even happier. Nevertheless, it's great to know that I can achieve such high levels of Python performance. I will ensure better time management for my future projects. Thanks.

72 |

@thahrimdon

3 weeks ago

This is amazing! I was in it with you for the long haul. Had me smiling and frowning the whole way! Great video!

1 |

@MakeDataUseful

1 month ago

Great video, thanks for taking the time to create 🤙

5 |

@nullzeon

1 month ago

how am I just finding out about this channel, editing, knowledge, this video was fantastic!

4 |

@gharren

1 month ago

As we all know, Python is the fastest programming language there is. By the time your program has done its job, the C++ developer is still busy fixing segfaults.

93 |
