I Parsed 1 Billion Rows Of Text (It Sucked)

The 1 Billion Row Challenge (1BRC) was a wonderful idea from the Java community that spread way further than I would have expected. Obviously I have to talk about it!

SOURCES

S/O Ph4se0n3 for the awesome edit 🙏
Comments

I FORGOT TO LINK THE BLOG POST THAT CARRIED THIS VIDEO I AM SO DUMB

tdotgg

Article includes optimization steps that reference relevant assembly
Theo: This is chaos, I hope no rust devs are actually ever doing this
Me: This is sick, I wish my job would let me do this sort of shit

soggy_dev

"824 kb, that's not bad for a billion rows"
I was like, what the fuck? That's not even a billion bytes.

Snollygoster-

NO, I'M RANTING
12:30 - dude starts looking for where splitting is happening - has PARSELOOP VISIBLE ALL THE TIME
13:02 - dude gets to the function, finally
13:12 - AFTER 10 SECONDS DUDE DROPS "ESOTERIC JAVA BULLSHIT" AND REFUSES TO ELABORATE FURTHER
IF THAT'S HOW JS/TS DEVS WORK THEN OMFG

xirate

"not a CSV if it doesn't use comma separators" -- absolute nonsense; in practice CSV files are so unstandardized you're lucky if they have consistent separators and values.

JoeTaber

9:51 "I f-ing hate JavaScript." I like to hate on JavaScript as much as the next guy, but that is just IEEE floating point math! It's the same in *every* language that implements IEEE floating point: C, Java, C#, Rust, Go, ...
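The point generalizes; here is a minimal Rust sketch (values chosen for illustration) of the classic rounding artifact the timestamp refers to:

```rust
fn main() {
    // 0.1 and 0.2 have no exact binary64 (IEEE 754) representation, so the
    // sum carries a rounding error in any language with IEEE floats.
    let sum = 0.1_f64 + 0.2_f64;
    assert_ne!(sum, 0.3);
    println!("{:.17}", sum); // prints 0.30000000000000004
}
```

The same expression misbehaves identically in Java, C#, Go, and the rest, because they all use the same binary64 arithmetic.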

bloody_albatross

As a Data Engineer TBH, unless you're planning to do this daily, you'd just throw it at whatever data ingest or database tools you have set up and eat the time it takes as a one off. More time would be spent checking that the data was all 'nice' and dealing with any weird errors that are usually inevitable in real world data.

I'd just be thankful that the data in the test is all UTF-8 and not some weird set of codepages from 40 years ago, where you have to figure out if it's big or little endian.

Elesario

Watching Theo not open the file with less was painful

AbstruseJoker

17:51 > why is it stored as an SVG when it's not actually an SVG?
but it is SVG, what do you mean?

RandomGeometryDashStuff

I like how Theo sees a scanner and says it's obscure java lol

lapissea

I write a bunch of Rust, and one of the reasons I love it so much is that parallelism is so easy. Arc/Mutex can be a bit to wrap your head around, but that's only if you need to share info between threads; otherwise it's a breeze.
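A minimal sketch of the Arc/Mutex pattern the comment alludes to (thread and iteration counts are arbitrary):

```rust
use std::sync::{Arc, Mutex};
use std::thread;

fn main() {
    // Arc shares ownership of the counter across threads;
    // Mutex serializes access to the inner value.
    let total = Arc::new(Mutex::new(0u64));
    let handles: Vec<_> = (0..4)
        .map(|_| {
            let total = Arc::clone(&total);
            thread::spawn(move || {
                for _ in 0..1000 {
                    *total.lock().unwrap() += 1;
                }
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    // Every increment happened under the lock, so none were lost.
    assert_eq!(*total.lock().unwrap(), 4000);
}
```

The borrow checker refuses to compile versions of this that forget the `Arc` or the `Mutex`, which is why sharing state across threads feels safe here.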

emeraldbonsai

16:56 what? There's literally no difference to the size of the number between a sum and (count * mean), which is what you're doing. Your approach isn't any less susceptible to overflowing.

tspander

for those asking about copilot, this video was recorded multiple months ago <3

KieranHolroyd

What do you mean "They are not calculating average the traditional way, they are calculating it via the sum plus count." That's literally the definition of average. Also, your solution doesn't prevent overflow, since `mean * visited` is the sum. You are just repeatedly dividing and multiplying the sum for some reason.
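The algebra behind this objection, as a small Rust sketch with made-up samples: the running-mean update carries `mean * n` around, which is exactly the sum, so it is exposed to overflow and rounding under the same conditions:

```rust
fn main() {
    let samples = [12.3_f64, -4.5, 7.0, 0.2];

    // Incremental update: mean' = (mean * n + x) / (n + 1).
    let mut mean = 0.0_f64;
    let mut n = 0u64;
    // Plain accumulation for comparison.
    let mut sum = 0.0_f64;

    for &x in &samples {
        mean = (mean * n as f64 + x) / (n as f64 + 1.0);
        n += 1;
        sum += x;
    }

    // Algebraically mean * n IS the sum, so both approaches track the
    // same quantity; the incremental form just divides and multiplies
    // on every step without gaining any overflow protection.
    assert!((mean - sum / n as f64).abs() < 1e-12);
}
```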

MichalMarsalek

BigInt cannot overflow; BigInts are arbitrary-precision integers. The only "limit" is that Safari imposes a theoretical cap of 1 million bits, but nothing in the spec specifies a maximum.

dealloc

6:15 😂 “bottom submission? Fuck, am I a bottom?” Came for 1 billion lines…stayed for this one.

10:40 well, it's test data…you aren't supposed to run your final code on test data. You test with a subset of it, maybe a few rows to get the parsing right, then up to a million to stress test, and only run the program on the full data once it's working as expected.

AvanaVana

It's faster, safer, and slightly more accurate when calculating means to save any division until the end. Just calculate visited like you were, and also calculate sum. Have mean be a calculated value - not a stored one - whenever you want to print it out.

The only catch is that you need to store the sum in a type that won't overflow. But the upsides are numerous.

No expensive division and multiplication. (Only one division to display final result.)

When multithreading with a single point of storage for results, you don't get incorrect answers when two threads both increment visited before either updates mean. All you need to guarantee is that all the increments and additions get processed exactly once, and order doesn't matter.

Fewer operations to compound floating point errors. Average error magnitude is approximately sqrt(N)*P where N is the number of operations capable of introducing a rounding error, and P is the precision of the stored value (i.e. 2^-[mantissa length]). This one is admittedly a nit, as the primitive number types all have way more precision than is required for the task. But if you were creating a custom number class to optimize for these calculations, this would allow you to save half a bit of precision in the mantissa.

Probably most importantly though, you can store total temperature as an integer value measured in decidegrees, avoiding the hassle of floating point arithmetic altogether.
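A sketch of that last suggestion, assuming 1BRC's one-decimal-digit temperature format; `parse_decidegrees` is a hypothetical helper for illustration, not code from the video:

```rust
fn main() {
    // "12.3" has exactly one decimal digit, so it can be stored exactly
    // as the integer 123 tenths of a degree (decidegrees) - no floats
    // anywhere in the hot loop.
    fn parse_decidegrees(s: &str) -> i64 {
        let neg = s.starts_with('-');
        let digits: i64 = s
            .chars()
            .filter(|c| c.is_ascii_digit())
            .fold(0, |acc, c| acc * 10 + (c as u8 - b'0') as i64);
        if neg { -digits } else { digits }
    }

    let readings = ["12.3", "-4.5", "0.0", "99.9"];
    let mut sum = 0i64;
    let mut count = 0i64;
    for r in &readings {
        sum += parse_decidegrees(r);
        count += 1;
    }

    // One floating-point division at the very end, as the comment suggests.
    let mean = sum as f64 / count as f64 / 10.0;
    assert!((mean - 26.925).abs() < 1e-9);
}
```

Because integer addition is exact and associative, partial sums from worker threads can be merged in any order without changing the result.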

G.Aaron.Fisher

16:52 yes, but your solution has floating point issues that will be much more prevalent since you're multiplying and dividing so much more often

wih

I loved that he somehow thought a BILLION rows would fit in 800 KB, uncompressed. 1 billion newline characters alone would necessarily HAVE TO be 1 GB. Adding a name, separators, and a number makes it dozens of GB. Still laughing when he did that.
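The back-of-envelope arithmetic, sketched in Rust (the ~14-byte average row size is an assumption based on rows like `Hamburg;12.0`):

```rust
fn main() {
    let rows: u64 = 1_000_000_000;

    // One '\n' per row: line terminators alone are already ~1 GB.
    let newline_bytes = rows;
    assert_eq!(newline_bytes, 1_000_000_000);

    // Assumed ~14 bytes per row ("Hamburg;12.0\n"): roughly 14 GB total,
    // four orders of magnitude beyond the 824 KB guess.
    let approx_total = rows * 14;
    println!("~{} GB", approx_total / 1_000_000_000);
}
```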

PhrontDoor

A developer not being able to open or search a 14GB file is peak humor for me as a SysOp.

IzzyIkigai