Python Tutorial: Handling errors and missing data

Показать описание

---
So far, we've imported flat files with minor tweaks to set column names and manage the amount of data loaded. This is enough if the data is already in great shape. But what if there are issues with the data or the import?

Common issues include incorrect column data types, which can hinder analysis, missing values indicated with custom designators, or records that pandas cannot read. Luckily, read CSV offers ways to address these issues during import, reducing the wrangling needed later on.

When importing data, pandas infers each column's data type. Sometimes, it guesses wrong.

Checking data types in the tax data, we see that pandas interpreted ZIP codes as integers. They're more accurately modeled as strings, though -- ZIP codes are not quantities and include meaningful leading zeros.

Instead of letting pandas guess, we can set the data type of any or all columns with read CSV's dtype keyword argument.

Dtype takes a dictionary, where each key is a column name and each value is the data type that column should be. Note that non-standard data types, like pandas categories, must be passed in quotations.

Here, we specify that the zipcode column should contain strings, leaving pandas to infer the other columns.

Printing the new data frame's dtypes, we see that zipcode's dtype is "object", which is the pandas counterpart to Python strings.

Missing data is another common issue. pandas automatically recognizes some values, like “N/A” or “null”, as missing data, enabling the use of handy data-cleaning functions. But sometimes missing values are represented in ways that pandas won't catch, such as with dummy codes.

In the tax data, records were sorted so that the first few have the ZIP code 0, which is not a valid code and should be treated as missing.

We can tell pandas to consider these missing data with the NA values keyword argument.

NA values accepts either a single value, a list of values, or a dictionary of columns and values in that column to treat as missing.

Let's pass a dictionary specifying that any zeros in zipcode should be treated as missing data.

Then we filter the data using the is NA method on the zipcode column to view rows with missing postal codes.

One last issue you may face are lines that pandas just can't parse. For example, a record could have more values than there are columns, like the second record in this corrupted version of the tax data. Let's try reading it.

By default, trying to load this file results in a long error, and no data is imported.

Luckily, we can change this behavior with two arguments, error bad lines and warn bad lines. Both take Boolean, or true/false values.

Setting error bad lines to False makes pandas skip bad lines and load the rest of the data, instead of throwing an error.

Warn bad lines tells pandas whether to display messages when unparseable lines are skipped.

Let's try importing the corrupted file again, this time with error bad lines set to False and warn bad lines equal to True. Success!

A word of caution: if lines were skipped, it's worth investigating what was left out to see if there are underlying issues that should be addressed.

In this video, you've learned about common issues encountered importing data from flat files, and how to handle those issues. Now, it's your turn to practice. Good luck!

#Python #PythonTutorial #DataCamp #Streamlined #Data #Ingestion #pandas #errors