Python Tutorial: DataFrames and Series

---
Welcome to Exploratory Data Analysis in Python! I'm Allen Downey and I'll be your instructor. The goal of exploratory data analysis is to use data to answer questions and guide decision making.
As a first example, we'll start with a simple question: what is the average birth weight of babies in the United States?
To answer a question like this, we have to find an appropriate dataset or run an experiment to collect it. Then we have to get the data into our development environment and prepare it for analysis, which involves cleaning and validation.
For this question we'll use data from the National Survey of Family Growth, which is available from the National Center for Health Statistics.
The 2013-2015 dataset includes information about a representative sample of women in the U.S. and their children.
The Python module we'll use to read and analyze data is Pandas, which we'll import as `pd`.
Pandas can read data in most common formats, including CSV, Excel, and the format the NSFG data is in, HDF5.
The result from `read_hdf()` is a DataFrame, which is the primary data structure Pandas uses to store data.
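As a rough sketch of what reading data into a DataFrame looks like, the snippet below parses a tiny in-memory CSV rather than the real NSFG file, which is not reproduced here. The column names and values are made up for illustration, and the HDF5 file name and key in the comment are assumptions, not the course's actual values.

```python
import io

import pandas as pd

# Hypothetical stand-in data; not taken from the NSFG dataset.
csv_text = """caseid,birthwgt_lb
1,5.0
2,4.0
3,
"""

# read_csv accepts a path or a file-like object and returns a DataFrame.
# The empty field in the last row is read as NaN.
df = pd.read_csv(io.StringIO(csv_text))

# For an HDF5 file the analogous call would look something like
#   df = pd.read_hdf("nsfg.hdf5", "nsfg")  # file name and key are assumptions
# which requires the optional PyTables dependency.
print(type(df))
```

Whichever reader is used, the object that comes back is the same kind of DataFrame, so the rest of the analysis does not depend on the file format.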
head() shows the first 5 rows of the DataFrame, which contains one row for each pregnancy for each of the women who participated in the survey, and one column for each variable.
The DataFrame has an attribute called `shape`, which is a tuple holding the number of rows and the number of columns; there are 9358 rows in this dataset, one for each pregnancy, and 10 columns, one for each variable.
The DataFrame also has an attribute called `columns`, which is an Index. That's another Pandas data structure, similar to a list; in this case it holds the variable names, which are strings.
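The three inspection tools described above, `head()`, `shape`, and `columns`, can be tried on a small hand-built DataFrame. This is only a sketch: the rows and column names below are invented for illustration and are not the NSFG variables.

```python
import numpy as np
import pandas as pd

# Made-up rows, one per pregnancy; names and values are illustrative only.
df = pd.DataFrame({
    "caseid": [1, 1, 2, 3, 3, 3, 4],
    "birthwgt_lb": [5.0, 4.0, 5.0, np.nan, 8.0, 7.0, 6.0],
    "agepreg": [21, 24, 30, 19, 22, 25, 33],
})

first_rows = df.head()     # first 5 rows by default
n_rows, n_cols = df.shape  # shape is a (rows, columns) tuple
names = list(df.columns)   # the Index converted to a plain list of strings

print(first_rows)
print(n_rows, n_cols)
print(names)
```

Note that `shape` and `columns` are attributes, so they are written without parentheses, while `head()` is a method call.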
Based on the names, you might be able to guess what some of the variables are, but in general you have to read the documentation.
In many ways a DataFrame is like a Python dictionary, where the variable names are the keys and the columns are the values. You can select a column from a DataFrame using the bracket operator, with a string as the key.
The result is a Series, which is another Pandas data structure. In this case the Series contains the birth weights, in pounds, of the live births (or in the case of multiple births, the first baby).
head() shows the first five values in the series, the name of the series, and the datatype; float64 means that these values are 64-bit floating-point numbers.
Notice that one of the values is NaN, which stands for "Not a Number". NaN is a special value that can indicate invalid or missing data. In this example, the pregnancy did not end in a live birth, so birth weight is inapplicable.
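To make the bracket operator and NaN handling concrete, here is a minimal sketch using an invented column; `birthwgt_lb` is a hypothetical name and the values are not real survey data.

```python
import numpy as np
import pandas as pd

# Illustrative data only; NaN marks a row where birth weight does not apply.
df = pd.DataFrame({"birthwgt_lb": [5.0, 4.0, np.nan, 8.0, 7.0]})

# Selecting one column with a string key returns a Series.
pounds = df["birthwgt_lb"]

n_missing = pounds.isna().sum()  # count of NaN entries
mean_pounds = pounds.mean()      # mean() skips NaN by default

print(pounds.head())
print(n_missing, mean_pounds)
```

Because `mean()` ignores NaN by default, the missing row does not distort the average, but it is still worth counting missing values before trusting a summary statistic.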
Let's start exploring this data by working on some exercises.
#DataCamp #PythonTutorial #ExploratoryDataAnalysisinPython