Python Tutorial: Reindexing DataFrames

As a Data Scientist, you'll often find that the data you need is not in a single file. It may be spread across a number of text files, spreadsheets, or databases. You want to be able to import the data of interest as a collection of DataFrames and figure out how to combine them to answer your central questions. This course is all about the act of combining, or merging, DataFrames, an essential part of any working Data Scientist's toolbox. You'll hone your pandas skills by learning how to organize, reshape, and aggregate multiple data sets to answer your specific questions.

Now that we can import many files into individual DataFrames, let's investigate sharing information between DataFrames using their Indexes.

This is essential for combining DataFrames later, as Indexes are the means by which DataFrame rows are labelled.

Let's make a brief note on terminology.

The plural of "index" in English can be indexes or indices; both are acceptable.

Let's adopt the convention of using indices as the plural of index when referring to individual labels within an Index data structure.

By contrast, we can use indexes as the plural of index with reference to many Index data structures associated with several Pandas Series or DataFrames.

This is not a standard convention, but it helps us resolve ambiguity in thinking about, say, sets of indices within many indexes.

To start, let's load two DataFrames of temperature data recorded from Pittsburgh in 2013.

For both calls to read_csv(), we use the index_col argument to specify which column becomes the DataFrame Index ('Month' in both cases).

Remember, the Index is a privileged column in Pandas providing convenient access to Series or DataFrame rows.
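The loading step described above can be sketched as follows. The transcript does not give the file names or the recorded values, so this sketch reads from in-memory CSV text instead of files, and the temperatures are illustrative placeholders rather than the real Pittsburgh data.

```python
import io
import pandas as pd

# Stand-ins for the two CSV files; the numbers are made-up placeholders.
mean_csv = io.StringIO(
    "Month,Mean TemperatureF\nApr,61.9\nJan,32.1\nJul,68.9\nOct,43.4\n"
)
max_csv = io.StringIO(
    "Month,Max TemperatureF\nJan,68\nApr,89\nJul,91\nOct,84\n"
)

# index_col selects the column that becomes the DataFrame Index.
w_mean = pd.read_csv(mean_csv, index_col="Month")
w_max = pd.read_csv(max_csv, index_col="Month")

print(w_mean)
print(w_max)
```

With real files, the first argument to read_csv() would simply be the path to each CSV file.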

We can examine our DataFrames w_mean & w_max more closely.

The Mean TemperatureF & Max TemperatureF columns are respectively the average & maximum daily temperatures (in Fahrenheit) observed during three-month intervals or quarters.

For both DataFrames, the column Month is the DataFrame Index.

The month listed in each Index row is the first month of each quarter.

By virtue of how the CSV files are stored, the Index of w_mean is in alphabetical order while the Index of w_max is in chronological order; the former gives a distorted sense of time-dependent trends.

The DataFrame Indexes are accessed directly with the .index attribute.

Both w_mean & w_max have Indexes of dtype object because the index labels are strings.

Python's built-in type() function shows us the class of the Index object.
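A minimal sketch of inspecting an Index, assuming a w_mean DataFrame built by hand with placeholder values. The exact dtype printed for string labels depends on the installed pandas version, so no expected output is shown.

```python
import pandas as pd

# Placeholder data standing in for the quarterly mean temperatures.
w_mean = pd.DataFrame(
    {"Mean TemperatureF": [61.9, 32.1, 68.9, 43.4]},
    index=pd.Index(["Apr", "Jan", "Jul", "Oct"], name="Month"),
)

idx = w_mean.index        # the Index object that labels the rows
print(idx)
print(type(idx))          # built-in type() reports the Index class
print(idx.dtype)          # dtype of the labels themselves
```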

We define a list called ordered to impose a deliberate ordering for the Index labels of w_mean.

The DataFrame .reindex() method creates a new DataFrame w_mean2 with the same data as w_mean but with a new row ordering according to the input list ordered.

We can see that w_mean2 has the desired chronological ordering.

The original alphabetically-ordered DataFrame can be recovered with the DataFrame .sort_index() method.
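The reordering and sorting steps above might look like this. The temperatures are placeholder values; the names ordered and w_mean2 follow the transcript.

```python
import pandas as pd

# Placeholder data with an alphabetically ordered Index, as in w_mean.
w_mean = pd.DataFrame(
    {"Mean TemperatureF": [61.9, 32.1, 68.9, 43.4]},
    index=pd.Index(["Apr", "Jan", "Jul", "Oct"], name="Month"),
)

# The chronological order we want to impose on the rows.
ordered = ["Jan", "Apr", "Jul", "Oct"]

# .reindex() returns a new DataFrame; w_mean itself is left unchanged.
w_mean2 = w_mean.reindex(ordered)
print(w_mean2)

# .sort_index() brings back the alphabetical ordering.
w_mean_sorted = w_mean2.sort_index()
print(w_mean_sorted)
```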

Pandas Index labels are typically sortable data, such as numbers, strings, or datetimes.

The input argument to the .reindex() method can also be another DataFrame Index.

For instance, here, we use the Index from w_max to reindex w_mean in chronological order.

When a suitably indexed DataFrame is available, the .reindex() method spares us having to create a list manually or having to sort the Index.
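Reindexing with another DataFrame's Index can be sketched as below. Both DataFrames hold placeholder values, and the name w_mean_chrono is a hypothetical one chosen for this example.

```python
import pandas as pd

# Placeholder data: w_mean is alphabetical, w_max is chronological.
w_mean = pd.DataFrame(
    {"Mean TemperatureF": [61.9, 32.1, 68.9, 43.4]},
    index=pd.Index(["Apr", "Jan", "Jul", "Oct"], name="Month"),
)
w_max = pd.DataFrame(
    {"Max TemperatureF": [68, 89, 91, 84]},
    index=pd.Index(["Jan", "Apr", "Jul", "Oct"], name="Month"),
)

# Pass the Index of w_max directly instead of writing out a list.
w_mean_chrono = w_mean.reindex(w_max.index)
print(w_mean_chrono)
```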

The specific Index labels provided to the .reindex() method are important.

For instance, if we invoke .reindex() again using an input list containing a label that is not in the original DataFrame Index (Dec in this case), an entirely new row is inserted & filled with the null value NaN (not-a-number).
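Here is a small sketch of that behaviour. The transcript only tells us that the list contains Dec, so the other labels in the list are an assumption, as are the placeholder temperatures.

```python
import pandas as pd

# Placeholder data standing in for w_mean.
w_mean = pd.DataFrame(
    {"Mean TemperatureF": [61.9, 32.1, 68.9, 43.4]},
    index=pd.Index(["Apr", "Jan", "Jul", "Oct"], name="Month"),
)

# "Dec" is not a label in w_mean, so its row is filled with NaN.
w_mean3 = w_mean.reindex(["Jan", "Apr", "Dec"])
print(w_mean3)
```

Labels that are left out of the list (Jul and Oct here) are dropped from the result.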

We can also use .reindex() to see where DataFrame rows overlap.

For instance, here, we reindex w_max with the Index of w_mean3, showing that w_max does not have a row labelled Dec either.

Using .dropna() removes entire rows in which null values occur.

This is a common first step when merging DataFrames.
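The overlap check and the .dropna() step can be sketched together. The values are placeholders, the labels used to build w_mean3 are assumed, and w_max2 is a hypothetical name for the intermediate result.

```python
import pandas as pd

# Placeholder data standing in for w_mean and w_max.
w_mean = pd.DataFrame(
    {"Mean TemperatureF": [61.9, 32.1, 68.9, 43.4]},
    index=pd.Index(["Apr", "Jan", "Jul", "Oct"], name="Month"),
)
w_max = pd.DataFrame(
    {"Max TemperatureF": [68, 89, 91, 84]},
    index=pd.Index(["Jan", "Apr", "Jul", "Oct"], name="Month"),
)

# w_mean3 has a "Dec" row that exists in neither original DataFrame.
w_mean3 = w_mean.reindex(["Jan", "Apr", "Dec"])

# Reindexing w_max with that Index exposes the missing "Dec" row as NaN.
w_max2 = w_max.reindex(w_mean3.index)
print(w_max2)

# .dropna() keeps only the rows without null values.
w_max_clean = w_max2.dropna()
print(w_max_clean)
```

Note that the integer column becomes floating point once a NaN row is introduced.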

Finally, we should realize that order counts: compare w_max.reindex(w_mean.index) with w_mean.reindex(w_max.index).

The latter fixes the row order as desired in w_mean; the former replicates the misleading alphabetical row order in w_max, which is likely not desired.
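To see concretely why the direction of the call matters, this sketch runs .reindex() both ways on placeholder data. The names max_alpha and mean_chrono are hypothetical, chosen for this example.

```python
import pandas as pd

# Placeholder data: w_mean is alphabetical, w_max is chronological.
w_mean = pd.DataFrame(
    {"Mean TemperatureF": [61.9, 32.1, 68.9, 43.4]},
    index=pd.Index(["Apr", "Jan", "Jul", "Oct"], name="Month"),
)
w_max = pd.DataFrame(
    {"Max TemperatureF": [68, 89, 91, 84]},
    index=pd.Index(["Jan", "Apr", "Jul", "Oct"], name="Month"),
)

# w_max takes on the alphabetical order of w_mean (likely not desired).
max_alpha = w_max.reindex(w_mean.index)
print(max_alpha)

# w_mean takes on the chronological order of w_max (the desired fix).
mean_chrono = w_mean.reindex(w_max.index)
print(mean_chrono)
```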

Try reindexing some DataFrames now in the exercises.