How slow is iterating over a pandas DataFrame?

You can do it, but it's a lot slower than you think. Especially when you see how much we can speed it up.

00:00 - Intro
00:55 - Iterating with iterrows
02:59 - Vectorization explained
03:28 - Iterating with Series Apply (vectorized)
05:17 - PURE SPEED
06:02 - How could we make it even FASTER?

Theme: GitHub Dark Dimmed

#jupyterinvscode, #jupyter, #python
Comments

While vectorization does help, much of the extra time comes from writing each 'attempts' value into the DataFrame individually. If you start with an empty list, append to it inside the loop, and assign the list to the 'attempts' column only after the loop is done, the total time drops significantly, and there are probably better ways to optimize it. Bottom line: you are not really comparing the best version of each implementation. Either way, it was a good video that sheds some light on vectorization, which helps a lot, especially if you run 'built-in' pandas functions.
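
A minimal sketch of the list-then-assign approach described in this comment. The DataFrame `df_tweets`, its `tweet_text` column, the sample strings, and the parsing step are all illustrative assumptions; the actual data from the video is not reproduced here.

```python
import pandas as pd

# Hypothetical sample data standing in for the tweets used in the video.
df_tweets = pd.DataFrame(
    {"tweet_text": ["Wordle 250 3/6", "Wordle 250 4/6", "Wordle 250 2/6"]}
)

# Collect results in a plain Python list instead of writing into
# the DataFrame one row at a time.
attempts = []
for _, row in df_tweets.iterrows():
    text = row["tweet_text"]
    slash_index = text.index("/")
    # Assumed format: the single character before '/' is the attempt count.
    attempts.append(float(text[slash_index - 1]))

# Assign the whole column once, after the loop has finished.
df_tweets["attempts"] = attempts
```

Whether this closes the gap with `apply` depends on the data size, so it is worth timing both versions on your own DataFrame.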

unknown

Where was this 3 weeks ago when I was doing a data visualization project and needed this
<3 <3 <3

Thomas-givy

This is actually not a vectorized function! You're implying that with the ": pd.Series" annotation, but if you look inside the function you can see that the argument is actually just a string, and apply runs the function once for every row of the series. You can also see that there is no speed improvement from vectorization, because a loop gives the same performance when written in the following way:

cnt = 0
smm = 0
for index, row in ...:  # the iterable was cut off in the original comment
    slashIndex = row.index('/')
    attempts = float(row[slashIndex-1])
    smm += attempts
    cnt += 1

It just leaves out the slowest operations but the loop itself is not one of them (thus the improvement is not from vectorizing it).
Sometimes vectorization does lead to drastic speed improvements but this is not an example of that.
The pandas vectorized functions for strings are under df_tweets['tweet_text'].str. (e.g. .str.extract(...), .str.slice(...), .str.index(...))
Apply is usually no faster than a well written for loop / list comprehension and is NOT vectorized.
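
A small sketch contrasting a plain loop with the `.str` accessor mentioned above. The sample strings and the regular expression are assumptions made for illustration, and no timings are claimed; the point is only that both routes produce the same values.

```python
import pandas as pd

# Hypothetical sample data; the column name follows the comment above.
df_tweets = pd.DataFrame(
    {"tweet_text": ["Wordle 250 3/6", "Wordle 250 4/6", "Wordle 250 2/6"]}
)

# Plain Python loop over the strings in the column.
loop_total = 0.0
for text in df_tweets["tweet_text"]:
    loop_total += float(text[text.index("/") - 1])

# Same result through the .str accessor: capture the digit before '/'.
# expand=False returns a Series because the pattern has one group.
attempts = (
    df_tweets["tweet_text"].str.extract(r"(\d)/", expand=False).astype(float)
)
str_total = attempts.sum()
```

To compare speed on real data, wrap each variant in `%timeit` in a notebook rather than relying on a three-row example.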

datanasov

I've found that .itertuples has timings similar to an apply. So if anyone has a use case where they really need to iterate, .itertuples is the way to go.
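
A minimal sketch of iterating with `itertuples`, assuming the same hypothetical `df_tweets` DataFrame and score format as in the other examples on this page (neither is taken from the video).

```python
import pandas as pd

# Hypothetical sample data.
df_tweets = pd.DataFrame(
    {"tweet_text": ["Wordle 250 3/6", "Wordle 250 4/6", "Wordle 250 2/6"]}
)

attempts = []
# itertuples yields one namedtuple per row; columns are read as attributes.
for row in df_tweets.itertuples(index=False):
    text = row.tweet_text
    attempts.append(float(text[text.index("/") - 1]))

# Assign the collected values as a new column in one step.
df_tweets["attempts"] = attempts
```

Attribute access works here because `tweet_text` is a valid Python identifier; column names that are not valid identifiers are renamed positionally by `itertuples`.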

troysincomb

One thing I've run into is having conditionals in the function I'm applying. For example, say I want to apply some string functions, but only to rows where the string matches a regex. How might you do that?
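
One possible way to approach this, sketched under assumptions: build a boolean mask with `str.contains`, then apply the string function only to the masked rows through `.loc`. The DataFrame, the column names `text` and `text_upper`, the pattern, and the choice of `str.upper` are all hypothetical.

```python
import pandas as pd

# Hypothetical data: only some rows contain a score like "3/6".
df = pd.DataFrame(
    {"text": ["Wordle 250 3/6", "no score here", "Wordle 251 5/6"]}
)

# Boolean mask: True where the regex matches somewhere in the string.
pattern = r"\d/\d"
mask = df["text"].str.contains(pattern, regex=True, na=False)

# Apply the string function only to matching rows;
# rows outside the mask are left as missing values in the new column.
df.loc[mask, "text_upper"] = df.loc[mask, "text"].str.upper()
```

Passing `na=False` treats missing strings as non-matches, which keeps the mask strictly boolean.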

jackflitcroft

Hey bro, a vim tutorial would be appreciated too (;

adheeshsecondsago

You should quit Microsoft and launch a programming school... I'd totally join. Even though I know all this... it's weirdly ASMR-ish and therapeutic to see elegant code.

chinmayk

The terminal is still not working on Windows 10.

mustafahany