I Parsed 1 Billion Rows Of Text (It Sucked)

The 1 Billion Row Challenge (1BRC) was a wonderful idea from the Java community that spread way further than I would have expected. Obviously I have to talk about it!

SOURCES

S/O Ph4se0n3 for the awesome edit 🙏
Comments

I FORGOT TO LINK THE BLOG POST THAT CARRIED THIS VIDEO I AM SO DUMB

tdotgg

Article includes optimization steps that reference relevant assembly
Theo: This is chaos, I hope no rust devs are actually ever doing this
Me: This is sick, I wish my job would let me do this sort of shit

soggy_dev

"824 kb, that's not bad for a billion rows"
I was like, what the fuck? That's not even a billion bytes.

Snollygoster-

NO, I'M RANTING
12:30 - dude starts looking for where splitting is happening - has PARSELOOP VISIBLE ALL THE TIME
13:02 - dude gets to the function, finally
13:12 - AFTER 10 SECONDS DUDE DROPS "ESOTERIC JAVA BULLSHIT" AND REFUSES TO ELABORATE FURTHER
IF THAT'S HOW JS/TS DEVS WORK THEN OMFG

xirate

"not a CSV if it doesn't use comma separators" -- absolute nonsense; in practice CSV files are so unstandardized you're lucky if they have consistent separators and values.

JoeTaber

9:51 "I f-ing hate JavaScript." I like to hate on JavaScript as much as the next guy, but that is just IEEE floating point math! It's the same in *every* language that implements IEEE floating point: C, Java, C#, Rust, Go, ...
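The point generalizes; here is a minimal Rust sketch (values chosen for illustration) of the classic rounding artifact the timestamp refers to:

```rust
fn main() {
    // 0.1 and 0.2 have no exact binary64 (IEEE 754) representation, so the
    // sum carries a rounding error in any language with IEEE floats.
    let sum = 0.1_f64 + 0.2_f64;
    assert_ne!(sum, 0.3);
    println!("{:.17}", sum); // prints 0.30000000000000004
}
```

The same expression misbehaves identically in Java, C#, Go, and the rest, because they all use the same binary64 arithmetic.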

bloody_albatross

As a Data Engineer TBH, unless you're planning to do this daily, you'd just throw it at whatever data ingest or database tools you have set up and eat the time it takes as a one off. More time would be spent checking that the data was all 'nice' and dealing with any weird errors that are usually inevitable in real world data.

I'd just be thankful that the data in the test is all UTF-8 and not some weird set of codepages from 40 years ago, where you have to figure out if it's big or little endian.

Elesario

Watching Theo not open the file with less was painful

AbstruseJoker

17:51 > why is it stored as an SVG when it's not actually an SVG?
but it is SVG, what do you mean?

RandomGeometryDashStuff

I like how Theo sees a scanner and says it's obscure java lol

lapissea

I write a bunch of Rust, and one of the reasons I love it so much is that parallelism is so easy. Arc/Mutex can be a bit to wrap your head around, but that's only if you need to share info between threads; otherwise it's a breeze.
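A minimal sketch of the Arc/Mutex pattern the comment alludes to (thread and iteration counts are arbitrary):

```rust
use std::sync::{Arc, Mutex};
use std::thread;

fn main() {
    // Arc shares ownership of the counter across threads;
    // Mutex serializes access to the inner value.
    let total = Arc::new(Mutex::new(0u64));
    let handles: Vec<_> = (0..4)
        .map(|_| {
            let total = Arc::clone(&total);
            thread::spawn(move || {
                for _ in 0..1000 {
                    *total.lock().unwrap() += 1;
                }
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    // Every increment happened under the lock, so none were lost.
    assert_eq!(*total.lock().unwrap(), 4000);
}
```

The borrow checker refuses to compile versions of this that forget the `Arc` or the `Mutex`, which is why sharing state across threads feels safe here.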

emeraldbonsai

16:56 what? There's literally no difference to the size of the number between a sum and (count * mean), which is what you're doing. Your approach isn't any less susceptible to overflowing.

tspander

for those asking about copilot, this video was recorded multiple months ago <3

KieranHolroyd

What do you mean "They are not calculating average the traditional way, they are calculating it via the sum plus count." That's literally the definition of average. Also, your solution doesn't prevent overflow, since `mean * visited` is the sum. You are just repeatedly dividing and multiplying the sum for some reason.
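The algebra behind this objection, as a small Rust sketch with made-up samples: the running-mean update carries `mean * n` around, which is exactly the sum, so it is exposed to overflow and rounding under the same conditions:

```rust
fn main() {
    let samples = [12.3_f64, -4.5, 7.0, 0.2];

    // Incremental update: mean' = (mean * n + x) / (n + 1).
    let mut mean = 0.0_f64;
    let mut n = 0u64;
    // Plain accumulation for comparison.
    let mut sum = 0.0_f64;

    for &x in &samples {
        mean = (mean * n as f64 + x) / (n as f64 + 1.0);
        n += 1;
        sum += x;
    }

    // Algebraically mean * n IS the sum, so both approaches track the
    // same quantity; the incremental form just divides and multiplies
    // on every step without gaining any overflow protection.
    assert!((mean - sum / n as f64).abs() < 1e-12);
}
```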

MichalMarsalek

BigInt cannot overflow; BigInts are arbitrary-precision integers. The only "limit" is that Safari imposes a theoretical cap of 1 million bits, but nothing in the spec specifies a maximum.

dealloc

6:15 😂 “bottom submission? Fuck, am I a bottom?” Came for 1 billion lines…stayed for this one.

10:40 well, it's test data…you aren't supposed to run your final code on test data. You test with a subset of it, maybe a few rows to get the parsing right, then up to a million to stress test, and only run the program on the full data once it's working as expected.

AvanaVana

It's faster, safer, and slightly more accurate when calculating means to save any division until the end. Just calculate visited like you were, and also calculate sum. Have mean be a calculated value - not a stored one - whenever you want to print it out.

The only catch is that you need to store the sum in a type that won't overflow. But the upsides are numerous.

No expensive division and multiplication. (Only one division to display final result.)

When multithreading with a single point of storage for results, you don't get incorrect answers when two threads both increment visited before either updates mean. All you need to guarantee is that all the increments and additions get processed exactly once, and order doesn't matter.

Fewer operations to compound floating point errors. Average error magnitude is approximately sqrt(N)*P where N is the number of operations capable of introducing a rounding error, and P is the precision of the stored value (i.e. 2^-[mantissa length]). This one is admittedly a nit, as the primitive number types all have way more precision than is required for the task. But if you were creating a custom number class to optimize for these calculations, this would allow you to save half a bit of precision in the mantissa.

Probably most importantly though, you can store total temperature as an integer value measured in decidegrees, avoiding the hassle of floating point arithmetic altogether.
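A sketch of that last suggestion, assuming 1BRC's one-decimal-digit temperature format; `parse_decidegrees` is a hypothetical helper for illustration, not code from the video:

```rust
fn main() {
    // "12.3" has exactly one decimal digit, so it can be stored exactly
    // as the integer 123 tenths of a degree (decidegrees) - no floats
    // anywhere in the hot loop.
    fn parse_decidegrees(s: &str) -> i64 {
        let neg = s.starts_with('-');
        let digits: i64 = s
            .chars()
            .filter(|c| c.is_ascii_digit())
            .fold(0, |acc, c| acc * 10 + (c as u8 - b'0') as i64);
        if neg { -digits } else { digits }
    }

    let readings = ["12.3", "-4.5", "0.0", "99.9"];
    let mut sum = 0i64;
    let mut count = 0i64;
    for r in &readings {
        sum += parse_decidegrees(r);
        count += 1;
    }

    // One floating-point division at the very end, as the comment suggests.
    let mean = sum as f64 / count as f64 / 10.0;
    assert!((mean - 26.925).abs() < 1e-9);
}
```

Because integer addition is exact and associative, partial sums from worker threads can be merged in any order without changing the result.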

G.Aaron.Fisher

16:52 yes, but your solution has floating point issues that will be much more prevalent since you're multiplying and dividing so much more often

wih

I loved that he somehow thought a BILLION rows would fit in 800 KB, uncompressed. 1 billion newline characters alone would necessarily HAVE TO be 1 GB. Adding a name, separators, and a number makes it dozens of GB. Still laughing when he did that.
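The back-of-envelope arithmetic, sketched in Rust (the ~14-byte average row size is an assumption based on rows like `Hamburg;12.0`):

```rust
fn main() {
    let rows: u64 = 1_000_000_000;

    // One '\n' per row: line terminators alone are already ~1 GB.
    let newline_bytes = rows;
    assert_eq!(newline_bytes, 1_000_000_000);

    // Assumed ~14 bytes per row ("Hamburg;12.0\n"): roughly 14 GB total,
    // four orders of magnitude beyond the 824 KB guess.
    let approx_total = rows * 14;
    println!("~{} GB", approx_total / 1_000_000_000);
}
```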

PhrontDoor

A developer not being able to open or search a 14GB file is peak humor for me as a SysOp.

IzzyIkigai