Python Tutorial: Introduction to flat files

Показать описание

---
Welcome to the course! I'm Amany Mahfouz, and I'll be your instructor. In this course, we'll focus on ingesting data, a fundamental step in any data science project -- after all, you can't analyze what you can't access. Along the way, you'll learn some techniques to clean messy data.

To do this, we'll use the Python library pandas. pandas was originally developed by Wes McKinney in 2008 for financial quantitative analysis, but today it has a large development community and is used in many disciplines. pandas makes it easy to load and manipulate data, and it integrates with loads of analysis and visualization libraries.

Central to pandas is the data frame.

Data frames are two-dimensional data structures.

This means they have columns, typically labeled with variable names and rows, which also have labels, known as an index in pandas. The default index is the row number, but you can specify a column as the index, and many types of data can be used.

While you can create data frames by hand, you'll usually want to load existing data. Pandas handles many data formats, but let's start with a basic one: flat files.

Flat files are simple, making them popular for storing and sharing data. They can be exported from database management systems and spreadsheet applications, and many online data portals provide flat file downloads.

In a flat file, data is stored as plain text, with no formatting like colors or bold type.

Each line in the file represents one row, with column values separated by a chosen character called a delimiter.

Usually, the delimiter is a comma, and such files are called CSVs, or comma-separated values, but other characters can be used.

A single pandas function loads all flat files, no matter the delimiter: read CSV.

Let's import some tax data published as a CSV by the Internal Revenue Service, the U.S. government's tax collection agency. This file has information about household composition and income by ZIP code, making it useful for social and economic analyses.

First, we import pandas as pd. "pd" is the conventional nickname for pandas.

Then, we pass the file path as a string to pd dot read CSV, and assign the resulting data frame to a variable -- "tax data" here.

Finally, we check the first few rows of the new data frame with tax data dot head.

But what if that file used a different delimiter, like a tab? Rather than have different functions for every possible delimiter, pandas lets you import any flat file with read CSV and its sep keyword argument.

Let's use a tab-separated version of the same tax file to see what this looks like.

Again, we import pandas as pd.

Then we pass the file path string to read CSV, but this time, we include another argument, sep. "Backslash t" represents a tab.

Last, we check out the data frame with the head method.

Now it's your turn to practice what you've learned. Good luck!

#Python #PythonTutorial #DataCamp #Streamlined #Data #Ingestion #pandas #flatfiles