Python Tutorial: The importance of flat files in data science

---

Now that you know how to import plain text files, we're going to look at flat files, such as 'titanic dot csv', in which each row is a unique passenger onboard and each column is a feature or attribute, such as gender, cabin and 'survived or not'.

It is essential for any budding data scientist to know precisely what the term flat file means. Flat files are basic text files containing records, that is, table data, without structured relationships. This is in contrast to a relational database, for example, in which columns of distinct tables can be related. We'll get to these later. To be even more precise, flat files consist of records, where by a record we mean a row of fields or attributes, each of which contains at most one item of information. In the flat file 'titanic dot csv', each row or record is a unique passenger onboard and each column is a feature or attribute, such as name, gender, and cabin.

It is also essential to note that a flat file can have a header, as 'titanic dot csv' does. A header is the first row of the file, and it describes the contents of the data columns, that is, it states what the attribute or feature in each column is. It is important to know whether or not your file has a header, because it changes how you import the data: a header row that is read as data becomes a spurious first record.
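To make the point about headers concrete, here is a minimal sketch using Python's standard `csv` module. The three-line text below is a made-up stand-in for a file like 'titanic dot csv', not the real data, and the names `sample`, `rows_with_header` and `records` are only for illustration.

```python
import csv
import io

# Hypothetical excerpt: one header row followed by two records.
sample = "Name,Sex,Survived\nPassenger A,female,1\nPassenger B,male,0\n"

# csv.reader treats every line as data, so the header comes back as a row.
rows_with_header = list(csv.reader(io.StringIO(sample)))

# csv.DictReader uses the first line as field names and yields only records.
records = list(csv.DictReader(io.StringIO(sample)))

print(len(rows_with_header))  # 3 rows, the first of which is the header
print(len(records))           # 2 records
print(records[0]["Sex"])      # female
```

If you read the file the first way and forget about the header, you end up with one row too many, and that extra row holds column names rather than values.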

The reason that flat files are so important in data science is that we data scientists really, honestly like to think in records, or rows of attributes.

Now you may have noticed that the file extension was dot csv. You may be wondering what this is. Well, CSV stands for comma-separated values, and it means exactly what it says: the values in each row are separated by commas. Another common extension for a flat file is dot txt, which means a text file.

Values in flat files can also be separated by characters or sequences of characters other than commas, such as a tab. The character or sequence of characters that separates the values is called a delimiter.
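As a small illustration of why the delimiter matters, the sketch below parses the same tab-separated text twice with the standard `csv` module, once with the right delimiter and once with the default comma. The numbers are invented for the example.

```python
import csv
import io

# Hypothetical tab-delimited text: two rows of three values each.
tab_text = "5\t0\t128\n3\t255\t0\n"

# Telling the reader that the delimiter is a tab splits each row into fields.
rows = list(csv.reader(io.StringIO(tab_text), delimiter="\t"))

# With the default comma delimiter, each line stays as one unsplit field.
unsplit = list(csv.reader(io.StringIO(tab_text)))

print(rows[0])          # ['5', '0', '128']
print(len(unsplit[0]))  # 1
```

Note that the values come back as strings either way; converting them to numbers is a separate step.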

Here is an example of a tab-delimited file. The data consists of the famous MNIST digit recognition images, where each row contains the pixel values of a given image. Note that all fields in the MNIST data are numeric, while 'titanic dot csv' also contains strings.

How do we import such files? If they consist entirely of numbers and we want to store them as a NumPy array, we could use NumPy. If instead, we want to store the data in a dataframe, we could use pandas. Most of the time, you will use one of these options.
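A rough sketch of the two options, assuming NumPy and pandas are installed. It reads from small in-memory strings rather than real files, and the values and passenger names are made up for illustration; `np.loadtxt` and `pd.read_csv` are the standard loading functions in each library.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical all-numeric flat file: suitable for a NumPy array.
numeric_text = "5,0,128\n3,255,0\n"
digits = np.loadtxt(io.StringIO(numeric_text), delimiter=",")

# Hypothetical flat file mixing strings and numbers: suitable for a dataframe.
mixed_text = "Name,Sex,Survived\nPassenger A,female,1\nPassenger B,male,0\n"
passengers = pd.read_csv(io.StringIO(mixed_text))

print(digits.shape)              # (2, 3)
print(list(passengers.columns))  # ['Name', 'Sex', 'Survived']
```

With a real file, you would pass the file path instead of the `io.StringIO` object.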

In the rest of this chapter, you'll learn how to import flat files that contain only numerical data, such as the MNIST data, and import flat files that contain both numerical data and strings, such as 'titanic dot csv'.

But first, let's get you to do a couple of quick multiple-choice questions to test your knowledge of flat files.