Loop / Iterate over pandas DataFrame (2020)

In this video we go over how to iterate (or loop) over the rows in a Pandas DataFrame using Python. There are many ways to accomplish this and we go over some of the most common methods and how we can improve our performance for even faster results. I also discuss how you can reassign values as we loop through the rows in a DataFrame.
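As a rough illustration of the kinds of approaches the description refers to, here is a minimal sketch that labels rows four ways: `iterrows`, a positional `iloc` loop, `apply` with a lambda, and a vectorized `np.where`. The DataFrame, the `score` column, and the pass/fail threshold are hypothetical and not taken from the video's notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical data; column names and the threshold of 60 are illustrative only.
df = pd.DataFrame({"score": [55, 72, 90, 48, 81]})

# 1) iterrows: easy to read, but builds a Series for every row.
labels_iterrows = []
for idx, row in df.iterrows():
    labels_iterrows.append("pass" if row["score"] >= 60 else "fail")

# 2) Positional loop with iloc: reassigns one cell at a time.
df["result_iloc"] = ""
score_pos = df.columns.get_loc("score")
result_pos = df.columns.get_loc("result_iloc")
for i in range(len(df)):
    df.iloc[i, result_pos] = "pass" if df.iloc[i, score_pos] >= 60 else "fail"

# 3) apply with a lambda: still a Python-level loop, but more compact.
df["result_apply"] = df["score"].apply(lambda s: "pass" if s >= 60 else "fail")

# 4) np.where: evaluates the condition on the whole column at once.
df["result_where"] = np.where(df["score"].to_numpy() >= 60, "pass", "fail")
```

On a frame this small the timings are indistinguishable; the difference between the looped and vectorized versions only becomes visible with many rows.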

Github link to annotated Jupyter Notebook
▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬

Other helpful links

YouTube Videos (Pycon Talks)
1000x faster data manipulation: vectorizing with Pandas and Numpy
Sofia Heisler: No More Sad Pandas: Optimizing Pandas Code for Speed and Efficiency (PyCon 2017)
▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬

Articles
Optimum approach for iterating over a DataFrame

Crude looping in Pandas, or That Thing You Should Never Ever Do
▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬

0:00 Intro
0:33 Setup
5:00 iloc
6:39 Lambda
Comments

Hi Braden - I've been wrestling with how to speed up some conditional changes to a large 1m x 100 data frame, where the entire df is transformed to True/False according to the condition. The condition is a function of a quantile for the column. I had to use iteritems so I could go column-wise. I did not need df.index but used np.where. It took 2.89 secs for the 1m x 100 data frame. It seemed more like 30 secs but still, wow! Thank you for your video - it's the best.

arnoldrosielle
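The column-wise quantile flagging described in this comment might look roughly like the sketch below. The data, the column names, and the 0.9 quantile are assumptions made for illustration. Note that `DataFrame.iteritems` was removed in pandas 2.0, so the sketch uses `DataFrame.items`, which iterates over columns the same way.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the large frame described in the comment.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(1000, 5)), columns=list("abcde"))

q = 0.9  # assumed quantile; the comment does not say which one was used

# Column-wise loop: one np.where call per column.
flags_loop = pd.DataFrame(index=df.index)
for name, col in df.items():
    flags_loop[name] = np.where(col.to_numpy() > col.quantile(q), True, False)

# Without any loop: df.quantile(q) gives one threshold per column, and the
# comparison lines those thresholds up with the matching column labels.
flags_vec = df > df.quantile(q)
```

Both versions should yield the same boolean frame; the second simply pushes the per-column loop into pandas.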

Thank you!
I've been struggling for weeks with filling in a recommendation column from a check on three other columns! Your iloc_example was the piece that I was missing.

chuckt

Very cool, thanks for taking the time to make this.

fraggotmode

Savage! Best loop explanation on youtube!

marcellosoto

this was a great video. Thanks for the clear demonstration of all these different options and the pros/cons

xingtinggong

Very cool showing the performance differences, waiting for your Lambda video

mikeyg

Thanks very much, just saved me a huge amount of time analysing :D

kolinsin

Great presentation. Exactly what I was after.

Zenoandturtle

Thanks a lot. I have two files with huge amounts of data to compare with one another, and it has taken me over 2 hours to loop and do all the checks. Tomorrow I'll try one of your techniques.

soreanustefanalexandru

Finally I get a good reason to master numpy, thank you so much!

juanbetancourt

Useful content nicely explained! Thank you for sharing.

wcemkfd

Great stuff! What approach would you recommend when comparing not to a static value but to another DataFrame, where I just want to test if df1['col'] == df2['col']?
Thanks again for sharing the knowledge....

jhillyt
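One possible way to approach the question above is sketched here, assuming both frames have the same number of rows in the same order. The frames and the `matches` column are hypothetical. Comparing the underlying NumPy arrays works by position; comparing the two Series directly with `==` also works, but pandas raises an error if their index labels are not identical.

```python
import pandas as pd

# Hypothetical frames with the same length and row order (an assumption).
df1 = pd.DataFrame({"col": [1, 2, 3, 4]})
df2 = pd.DataFrame({"col": [1, 5, 3, 0]})

# Element-wise comparison by position, ignoring index labels.
df1["matches"] = df1["col"].to_numpy() == df2["col"].to_numpy()
```

If the rows are not in the same order, the frames would first need to be aligned on a shared key, for example with a merge.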

not just improvements with regards to large data sets but also for operations that need to happen many times per second.

chapmansbg

Great video. Helped a bunch.
Any idea how to apply these methods for more columns with changing output - and applying these into the same col?

Borzacchinni
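For several conditions over more than one column that all write into the same output column, `np.select` is one option. This is only a sketch: the columns, conditions, and labels below are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data; none of these names come from the video.
df = pd.DataFrame({"price": [5, 20, 50, 120], "stock": [0, 3, 10, 1]})

# Conditions are checked in order; the first one that is True wins.
conditions = [
    df["stock"] == 0,
    df["price"] > 100,
    df["price"] > 10,
]
choices = ["out of stock", "premium", "standard"]

# Rows matching none of the conditions get the default value.
df["label"] = np.select(conditions, choices, default="budget")
```

Because the conditions are evaluated in list order, the most specific rule should come first.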

I will be sure to use np.where more often from now on
Thank you

immanuelsuleiman

Thanks for this great video! Would you be able to use one of these faster methods to loop over an array or dataframe a specific number of rows at a time, create an XML file, and then start the process again? What I am doing is building XML files from a large dataframe (can be an array or series or whatever). The data is just one column. Each row is being put into the XML file as a value. I want to do 50K at a time, write the file, and then continue, with the last file not having to have exactly 50K but whatever is left. Thanks,

TheCDLSchool
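A minimal sketch of the chunking described above, using only the standard library's `xml.etree.ElementTree`. The function name `write_xml_chunks`, the `rows` and `value` element names, and the file naming scheme are all hypothetical; a real XML layout would follow whatever schema the files need.

```python
import xml.etree.ElementTree as ET

import pandas as pd


def write_xml_chunks(values: pd.Series, chunk_size: int = 50_000, prefix: str = "chunk"):
    """Write `values` to numbered XML files, at most `chunk_size` rows per file.

    Hypothetical helper: element names and file naming are illustrative only.
    """
    paths = []
    # range() steps through the start offsets; the final slice is simply
    # shorter when the length is not a multiple of chunk_size.
    for n, start in enumerate(range(0, len(values), chunk_size)):
        root = ET.Element("rows")
        for v in values.iloc[start:start + chunk_size]:
            ET.SubElement(root, "value").text = str(v)
        path = f"{prefix}_{n}.xml"
        ET.ElementTree(root).write(path, encoding="utf-8", xml_declaration=True)
        paths.append(path)
    return paths
```

Calling `write_xml_chunks(df["some_column"])` on a 120,000-row column would produce three files, the last holding the remaining 20,000 values. The inner loop is still row by row, since each value has to become its own XML element.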

Finally a clear and concise explanation. Many thanks!!
By the way, if I need to make a calculation between two sequential rows, say A[i] + B[i] / C[i+1] * D[i+1], how could this be done with the fast methods?
Thanks again

oswaldocastro
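A calculation that mixes the current row with the next one can usually be vectorized with `Series.shift`. The sketch below uses made-up numbers for columns `A` to `D`; `shift(-1)` moves each value up one row so that row `i` sees the value from row `i+1`.

```python
import pandas as pd

# Hypothetical values for the four columns named in the comment.
df = pd.DataFrame({
    "A": [1.0, 2.0, 3.0],
    "B": [4.0, 6.0, 8.0],
    "C": [1.0, 2.0, 4.0],
    "D": [1.0, 3.0, 5.0],
})

# Row i of these Series holds the value from row i+1 of the original column.
next_c = df["C"].shift(-1)
next_d = df["D"].shift(-1)

# Same operator precedence as A[i] + B[i] / C[i+1] * D[i+1].
df["result"] = df["A"] + df["B"] / next_c * next_d
```

The last row has no following row, so its result is NaN.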

Very good tutorial. If we want to access the previous row inside the apply function, how can we do that? If I have an index defined, can I do row.index-1? And if I don't have an index defined, how do we know the location of the current row?

ItsCloudHub
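One way to approach this, sketched with a hypothetical `value` column: instead of reaching backwards from inside `apply`, bring the previous row's value alongside with `shift(1)` first. Inside `apply(..., axis=1)` each row arrives as a Series whose `name` attribute is that row's index label, while `row.index` holds the column names rather than the row position.

```python
import pandas as pd

# Hypothetical data with the default integer index.
df = pd.DataFrame({"value": [10, 13, 19, 18]})

# shift(1) moves values down one row, so each row sees the previous value.
df["prev_value"] = df["value"].shift(1)
df["change"] = df["value"] - df["prev_value"]

# Inside apply with axis=1, row.name is the index label of the current row.
df["label_seen"] = df.apply(lambda row: row.name, axis=1)
```

The first row has no predecessor, so `prev_value` and `change` are NaN there.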

You are making these time comparisons on different columns. Although the times may be similar, I feel that showing the calculation comparisons on one particular column would have been a better way.

tahiliani

Hey Brayden, super helpful video! I noticed though, after running some tests, that using "np.where" does not like to work when using int values in the dataframe. I found that initializing a numpy array beforehand and then using that instead of directly calling df['column_name'].values in the "np.where" method works, but I'm not exactly sure why. If you have a reasoning behind this I'd love to hear it!

TiR_official