Python Tutorial: Merging DataFrames with pandas (part 1)


As a Data Scientist, you'll often find that the data you need is not in a single file. It may be spread across a number of text files, spreadsheets, or databases. You want to be able to import the data of interest as a collection of DataFrames and figure out how to combine them to answer your central questions. This course is all about the act of combining, or merging, DataFrames, an essential part of any working Data Scientist's toolbox. You'll hone your pandas skills by learning how to organize, reshape, and aggregate multiple data sets to answer your specific questions.

In this chapter, you'll learn about different techniques you can use to import multiple files into DataFrames. Having imported your data into individual DataFrames, you'll then learn how to share information between DataFrames using their Indexes. Understanding how Indexes work is essential information that you'll need for merging DataFrames later in the course.

Welcome to "Merging DataFrames with pandas".

My name is Dhavide Aruliah.

I'm an applied mathematician and data scientist.

This course is all about merging and combining DataFrames for your data science needs.

Your data rarely exists as DataFrames from the outset: you generally have to deal with text files, spreadsheets, and databases.

Let's first check out how to read multiple files into a collection of DataFrames.

The primary tool we've used for data import is read_csv().

This function accepts the filepath of a comma-separated values file as input and returns a Pandas DataFrame directly.

read_csv() has about fifty optional calling parameters permitting very fine-tuned data import.
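As a minimal sketch of that workflow (the filename and data below are illustrative and created inline so the example actually runs):

```python
import pandas as pd

# Write a tiny sample file so this sketch is self-contained
# (the name 'sales-jan-2015.csv' is illustrative, not the course's data).
with open("sales-jan-2015.csv", "w") as fh:
    fh.write("month,units\njan,120\n")

# read_csv() takes a filepath and returns a DataFrame directly
df = pd.read_csv("sales-jan-2015.csv")
print(df.shape)
```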

Pandas has other convenient tools (with similar default calling syntax) that import various data formats like Excel, HTML, or JSON into DataFrames.
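As a sketch of that shared calling style, read_json() below parses a small inline JSON document (the data is illustrative, not from the course):

```python
from io import StringIO

import pandas as pd

# read_json() follows the same pattern as read_csv(): pass a source,
# get a DataFrame back. Here the source is an in-memory JSON string.
df = pd.read_json(StringIO('[{"month": "jan", "units": 120}]'))
print(df)
```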

To read multiple files using Pandas, we generally need separate DataFrames.

It's generally more efficient to iterate over a collection of file names.

With that goal, we can create a list filenames with the two filepaths from before.

We then initialize an empty list called dataframes and iterate through the list filenames.

Within each iteration, we invoke read_csv() to read a DataFrame from a file and we append the resulting DataFrame to the list dataframes.
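The loop just described might look like this (the two sales filenames are illustrative; the sample files are written inline so the sketch runs):

```python
import pandas as pd

# Create two tiny sample files so this sketch is self-contained
# (the 'sales-*.csv' names are assumptions, not the course's data).
for name in ("sales-jan.csv", "sales-feb.csv"):
    with open(name, "w") as fh:
        fh.write("month,units\n%s,100\n" % name[6:9])

filenames = ["sales-jan.csv", "sales-feb.csv"]

# Start with an empty list and append one DataFrame per file
dataframes = []
for f in filenames:
    dataframes.append(pd.read_csv(f))
```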

We can also do the preceding computation with a list comprehension.

Comprehensions are a convenient Python construction for exactly this kind of loop, where a list is built up by appending to it within each iteration.

You can check out DataCamp's Python programming courses for more details on comprehensions.
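The comprehension version collapses the loop to a single line (the filenames are again illustrative, with sample files created inline so the sketch runs):

```python
import pandas as pd

# Illustrative sample files, written inline so the example is runnable
for name in ("sales-jan.csv", "sales-feb.csv"):
    with open(name, "w") as fh:
        fh.write("month,units\njan,100\n")

filenames = ["sales-jan.csv", "sales-feb.csv"]

# One DataFrame per filename, built in a single expression
dataframes = [pd.read_csv(f) for f in filenames]
```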

When many filenames have a similar pattern, the glob module from the Python Standard Library is very useful.

Here, we start by importing the function glob() from the built-in glob module.

We use the pattern sales*.csv to match any filename that starts with the prefix "sales" and ends with the suffix ".csv".

The asterisk is a wildcard that matches zero or more characters (other than the path separator).

The function glob() uses the wildcard pattern to create an iterable object filenames containing all matching filenames in the current directory.

Finally, the iterable filenames is consumed in a list comprehension that makes a list called dataframes containing the relevant data structures.
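Putting the glob() step together with the comprehension (the matching sample files are created inline so this sketch is self-contained):

```python
from glob import glob

import pandas as pd

# Create two sample files matching the pattern
# (the 'sales*.csv' names are illustrative, not the course's data).
for name in ("sales-jan.csv", "sales-feb.csv"):
    with open(name, "w") as fh:
        fh.write("month,units\njan,100\n")

# glob() expands the wildcard into matching filenames
# in the current directory
filenames = glob("sales*.csv")

dataframes = [pd.read_csv(f) for f in filenames]
```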

Now it's your turn to practice reading multiple files into DataFrames.
Comments

Hi, I tried your code, but it only piled the DataFrames on top of each other instead of merging them into rows.

temiisaacaugustus

To my understanding, there is something wrong with the sample code here. The code should be like this:
dataframes = pd.concat([pd.read_csv(f) for f in filenames])
I hope this helps so you don't have to waste your time.

EkaAMaharta

What is f in both of the code examples shown in the video?

abhisheksaraswat

But this gives an error that the files do not exist. What should I do now?
Any solution?

rulebreaker