Python Tutorial: Modifying flat file imports

Показать описание

---
You now know how to load a flat file into a data frame. Sometimes though, you only want part of the file. In this lesson, we'll look at ways to limit the amount of data imported, and how to make that data easier to work with by naming columns.

Let's revisit the U.S. tax data. Checking the data's shape attribute, we see that it has almost 180,000 rows and 147 columns. Unless your computer has very little memory or is running several other programs, pandas can easily handle data this size, but it is more than we need. We have some options to whittle it down.

The first way is to choose columns to load with the usecols keyword argument.

Usecols can take a list of either all column names or all column numbers to import. It can even accept a function to select columns, so that you can, say, import all columns starting with the letter N, but this is a more advanced technique.

To focus on the number of tax returns by income band and ZIP code, we can set up a list of either the relevant column names

or their numbers.

We supply the file path as a string to read CSV as usual. This time, we also pass usecols the list of column names

or numbers to load.

We can check to make sure the two ways produce the same result, and they do.

Another option is to reduce the number of rows imported with the nrows argument.

When reading the file, we pass an integer of the maximum number of rows we want, 1000 here.

Checking the data frame's shape, we see we have exactly the number of rows we asked for.

Nrows is especially handy when combined with the skiprows argument to process a file in chunks.

Skiprows accepts a list of row numbers to skip, a function to determine whether to skip a row, or the number of rows to skip.

Note that pandas automatically makes the first row imported the header, so if you skip the row with column names, you should also specify that header equals none.

Let's get rows 1000 to 1500. Like before, we use nrows to specify how many rows we want, but add that skiprows equals 1000. We skipped the header row, so we also specify there isn't one here.

Checking the head of the data frame, we see there are no column names. Let's fix this.

To assign column names when there aren't any, we use another read CSV argument: names. As you probably guessed, names takes a list of column names to use.

The list must include a name for every column in the data --

if you only want to rename some columns, it should be done after import.

For datasets with fewer columns, you might manually build the names list with the help of a data dictionary. We want to copy a long list of column names from an existing data frame, so we'll pass the first tax subset to the list function to get the column labels.

Then we read the CSV, using nrows and skiprows to get the next 500 records, setting header to none, and passing the list of column names to names.

Now our data frame has column names!

Whew! You just learned workflows to manage the amount of data loaded from a flat file. Now you'll get a chance to practice those techniques. Good luck!

#Python #PythonTutorial #DataCamp #Streamlined #Data #Ingestion #pandas #Modifying #flatfile