How to Remove Duplicates in Pandas DataFrame | #16 of 53: The Complete Pandas Course


--------------------
Another very useful data wrangling activity is removing duplicates from your data frame.

Now, the definition of a duplicate can vary from use case to use case, but generally this is what a duplicate row in your DataFrame refers to: if a given row has all of its values exactly the same as another row, that is, the values in all the columns match, then those two rows can be considered duplicates.

These two rows, by the way, are not duplicates, of course, but that is the definition. Besides this, you might also want to consider a case based on only two columns: say there is another row where the State value is also KS and the Account Length is also 128. Then that row and the row I just mentioned are considered duplicates, even though the other values may differ.

Based on those two columns alone, the two rows will still be considered duplicates, right? These are the different cases we can handle using pandas, and all of it is possible using one single function called drop_duplicates.
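Here is a minimal sketch of both definitions, using a small made-up DataFrame whose column names mirror the ones mentioned in the video (the values themselves are invented):

import pandas as pd

# Small illustrative DataFrame; column names mirror the dataset in the
# video (State, Account Length, Area Code), values are made up.
df = pd.DataFrame({
    "State": ["KS", "KS", "OH", "KS"],
    "Account Length": [128, 128, 107, 128],
    "Area Code": [415, 510, 415, 415],
})

# Full-row duplicates: every column must match. Rows 0 and 3 are identical,
# so row 3 is dropped.
print(df.drop_duplicates())

# Subset duplicates: rows 0, 1 and 3 all have State "KS" and
# Account Length 128, so only the first of those rows survives.
print(df.drop_duplicates(subset=["State", "Account Length"]))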

And when dropping duplicates, there are a few other choices: you might want to keep the very first occurrence, you might want to keep the very last occurrence, or you might want to delete all the occurrences entirely. Keep the first, keep the last, or drop every occurrence.

These are the different ways of removing duplicates as well. Let's run the code and see these cases. So we have the DataFrame; we're using the same CSV file as before. Now we want to drop duplicates based on the State column alone, so let's run this and see the output.
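A sketch of this step; the filename below is a placeholder for whichever CSV the course actually loads, and the State column name is taken from the narration:

import pandas as pd

# Placeholder filename; substitute the CSV used in the course.
df = pd.read_csv("telecom_churn.csv")

# Keep only the first row encountered for each distinct State value.
by_state = df.drop_duplicates(subset=["State"])
print(by_state.shape)  # expected: 51 rows, one per state (50 states + DC)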

Now, here in this output, there should be as many rows as there are states in the United States. So there are 51 rows, and none of these rows has the same value in the State column as any other row; that's guaranteed here.

Likewise, suppose we want to consider two columns, say State and Area Code. In this output, none of the rows will have the same exact values for both State and Area Code. Alright, so that's the output of this.
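And a sketch of the two-column version, again with a placeholder filename:

import pandas as pd

# Same placeholder filename as above.
df = pd.read_csv("telecom_churn.csv")

# Treat rows as duplicates when both State and Area Code match.
by_state_area = df.drop_duplicates(subset=["State", "Area Code"])

# Sanity check: no (State, Area Code) combination appears more than once.
print(by_state_area.duplicated(subset=["State", "Area Code"]).any())  # False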

Now, about the keep argument: keep='first' keeps the very first occurrence of a duplicate row. You could change this to keep='last'. Notice the output here: in the first version there are 153 rows. Now I change it to 'last' and run it again. The output also has 153 rows, but different rows are selected, that is, the last occurrences are kept.
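A small illustration of first versus last, on made-up data:

import pandas as pd

# Made-up data: two rows share State "KS".
df = pd.DataFrame({"State": ["KS", "OH", "KS"],
                   "Account Length": [128, 107, 84]})

first = df.drop_duplicates(subset=["State"], keep="first")
last = df.drop_duplicates(subset=["State"], keep="last")

print(len(first) == len(last))  # True: the row count is the same either way
print(first.index.tolist())     # [0, 1] -> the first "KS" row is kept
print(last.index.tolist())      # [1, 2] -> the last "KS" row is kept instead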

Likewise, an alternative is that you may want to completely remove all occurrences of the duplicate rows, so that no such row exists in the resulting DataFrame at all. So this is another case.
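And a sketch of keep=False on the same kind of made-up data:

import pandas as pd

df = pd.DataFrame({"State": ["KS", "OH", "KS"],
                   "Account Length": [128, 107, 84]})

# keep=False drops every occurrence of a duplicated State, not just the extras.
print(df.drop_duplicates(subset=["State"], keep=False))
# Only the "OH" row remains; both "KS" rows are removed.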

Now, if you remove all of these arguments and run with the defaults, a row will be considered a duplicate only if all the columns in that row have an exact match with some other row. So those are the different ways we deal with duplicates.
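To close, a sketch of that default behaviour on invented data:

import pandas as pd

df = pd.DataFrame({"State": ["KS", "OH", "KS"],
                   "Account Length": [128, 107, 128]})

# With no arguments, a row counts as a duplicate only when every column
# matches another row exactly. Rows 0 and 2 are identical, so row 2 is dropped.
print(df.drop_duplicates())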

Want to learn end-to-end Data Science (Machine Learning & AI)?

machinelearningplus