Python Tutorial: Importing flat files using pandas


---

Congrats! You're now able to import a bunch of different types of flat files into Python as NumPy arrays. Although arrays are incredibly powerful and serve a number of essential purposes, they cannot fulfill one of the most basic needs of a Data Scientist: to have "[two]-dimensional labeled data structure[s] with columns of potentially different types" that you can easily perform a plethora of Data Sciencey type things on: manipulate, slice, reshape, group by, join, merge, perform statistics in a missing-value-friendly manner, and deal with time series. The need for such a data structure, among other issues, prompted Wes McKinney to develop the pandas library for Python. Nothing speaks to the project of pandas more than the documentation itself: "Python has long been great for data munging and preparation, but less so for data analysis and modeling. pandas helps fill this gap, enabling you to carry out your entire data analysis workflow in Python without having to switch to a more domain-specific language like R." The data structure that pandas offers that is most relevant to the data manipulation and analysis workflow is the dataframe, the Pythonic analogue of R’s data frame. As Hadley Wickham tweeted, "A matrix has rows and columns. A data frame has observations and variables."

Manipulating dataframes in pandas can be useful in all steps of the data scientific method, from exploratory data analysis to data wrangling, preprocessing, building models and visualization. Here we will see its great utility in importing flat files, even merely in the way that it deals with missing data, comments, and the many other issues that plague working data scientists. For all of these reasons, it is now standard and best practice in Data Science to use pandas to import flat files as dataframes. Later in this course, we’ll see how many other types of data, whether they’re stored in relational databases, HDF5, MATLAB or Excel files, can easily be imported as dataframes.

To use pandas, you first need to import it. Then, to import a CSV in the most basic case, all we need to do is call the function read_csv() and supply it with a single argument, the name of the file. Having assigned the dataframe to the variable data, we can check the first 5 rows of the dataframe, including the header, with the command data.head(). We can also easily convert the dataframe to a NumPy array by accessing the dataframe attribute values.
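The steps just described can be sketched roughly as follows. The CSV contents and column names here are made up for illustration, and an in-memory io.StringIO buffer stands in for a file on disk so the snippet runs on its own; with a real file you would pass its name to read_csv() instead.

```python
import io

import pandas as pd

# Hypothetical CSV contents, used only so the example is self-contained.
csv_text = (
    "name,age,score\n"
    "Ada,36,9.5\n"
    "Grace,45,8.0\n"
    "Linus,28,7.5\n"
)

# Most basic case: read_csv() with a single argument.
# With a file on disk this would be pd.read_csv("your_file.csv").
data = pd.read_csv(io.StringIO(csv_text))

# Show the first rows (up to 5) together with the header.
print(data.head())

# Convert the dataframe to a NumPy array via the values attribute.
data_array = data.values
print(type(data_array))
```

Note that the current pandas documentation recommends data.to_numpy() over the values attribute for this conversion; both return a NumPy array here.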

Now it's your turn to play around importing flat files using Python. You'll get experience importing a flat file that is straightforward, and you'll also get experience importing a flat file that has a few issues, such as containing comments and strings that should be interpreted as missing values. Have fun importing!
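As a rough preview of the second kind of file, here is a minimal sketch of how read_csv() can handle comments and placeholder strings. The tab-separated data, the '#' comment character and the 'Nothing' placeholder are all assumptions chosen for illustration; the sep, comment and na_values parameters are real read_csv() arguments.

```python
import io

import pandas as pd

# Hypothetical tab-separated data with a comment line and the string
# "Nothing" standing in for missing values.
raw_text = (
    "# hypothetical export with a comment line\n"
    "name\tage\tcity\n"
    "Ada\t36\tLondon\n"
    "Grace\tNothing\tNew York\n"
    "Linus\t28\tNothing\n"
)

data = pd.read_csv(
    io.StringIO(raw_text),
    sep="\t",               # fields are separated by tabs
    comment="#",            # ignore everything after a '#' character
    na_values=["Nothing"],  # treat this string as a missing value
)

print(data.head())

# Count the missing values that pandas recognised in each column.
print(data.isna().sum())
```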