Python Tutorial: Arithmetic with Series & DataFrames

Показать описание

As a Data Scientist, you'll often find that the data you need is not in a single file. It may be spread across a number of text files, spreadsheets, or databases. You want to be able to import the data of interest as a collection of DataFrames and figure out how to combine them to answer your central questions. This course is all about the act of combining, or merging, DataFrames, an essential part of any working Data Scientist's toolbox. You'll hone your pandas skills by learning how to organize, reshape, and aggregate multiple data sets to answer your specific questions.

Let's explore various arithmetic & mathematical operations between Pandas Series & DataFrames.

We load daily weather measurements for Pittsburgh from 2013.

We make 'Date' the Index & we use parse_dates=True to get datetime objects.

With datetime Indexes, we can use convenient strings to slice, say, the first week of July from the 'PrecipitationIn' column.

The Precipitation data are in inches; let's convert them to centimeters.

We use the asterisk to multiply a Series elementwise by 2.54.

Remember, we can broadcast standard scalar mathematical operations.

Here, broadcasting means the multiplication is applied to all entries in the DataFrame.

Let's find the percentage variation in temperature in the first week of July.

That is, the daily minimum & the daily maximum temperatures expressed as a percentage of the daily mean temperature.

We compute this by dividing both the 'Min TemperatureF' & the 'Max TemperatureF' columns by the Mean TemperatureF column and multiplying both by 100.

To begin, slice the 'Min TemperatureF' & 'Max TemperatureF' columns as a DataFrame week1_range.

Next, slice the Mean TemperatureF column as a Series week1_mean.

Dividing DataFrame week1_range by Series week1_mean doesn't quite work.

The column labels don't match so the result has all null values.

Instead, we want to use the DataFrame .divide() method with option axis='rows'.

The .divide() method provides more fine-grained control than the division operator by itself.

This broadcasts the Series week1_mean values across each row to produce the desired ratios.

We can see the temperature range varies by at most about 10% from the mean in that week.

A related computation is to compute a percentage change along a time series.

We do this by subtracting the previous day's value from the current day's value and dividing by the previous day's value.

The pct_change() method does precisely this computation for us.

Here, we also multiply the resulting series by 100 to yield a percentage value.

Notice the value in the first row is NaN because there is no previous entry.

Finally, let's examine how arithmetic operations work between distinct Series or DataFrames with non-aligned indexes, which happens often in practice.

We'll use Olympic medal data from 1896 to 2008.

Here are the top five Bronze-medal winning countries...

...the top five Silver-medal winning countries...

... and the top five Gold-medal winning countries.

All three DataFrames have the same indices for the first three rows ('United States', 'Soviet Union', & 'United Kingdom').

By contrast, the next two rows are either 'France', 'Germany', or 'Italy'.

Let's compute total medals awarded to each country.

We start by adding bronze and silver.

Here, we add two Series of 5 rows & get back a Series with 6 rows.

The index of the sum is the union of the row indices from the original two Series.

Arithmetic operations between Pandas Series are carried out for rows with common index values.

Since 'Germany' does not appear in silver & 'Italy' does not appear in 'bronze', those rows have NaN in the sum.

On examination, we see the value 2247 for the United States row is the sum of 1052 and 1195 from the corresponding rows of the bronze & silver Series respectively.

The null values occur in the same places.

The default fill value is NaN when summand rows fail to align.

We can modify this behavior using the fill_value option.

By specifying fill_value=0, the values of Germany & Italy are no longer null.

Just as the divide() method is more flexible than the slash operator for division, the add() method is more flexible than the plus operator for addition.

Adding all three series together yields six rows of output, but only three rows have non-null values.

That is, 'France', 'Germany' & 'Italy' are not Index labels in all three Series, so each of those rows is NaN in the sum.

We can chain multiple method calls to .add() with fill_value=0 to get rid of those null values in the triple sum.

Now you can get some experience with standard arithmetic operations & methods for Series & DataFrames in the exercises.