Introduction to Pandas | #1 of 53: The Complete Pandas Course


Let's understand what pandas is, and why you must absolutely master pandas on your data science journey.

Now, pandas is the most popular and widely used library for data analysis and data wrangling in Python. Almost anything that you want to do with tabular data is possible with pandas. And for that reason, pandas is my favorite library for data wrangling in Python.

And it's not just me. If you look at any data scientist working with Python, I would expect that person to know pandas really, really well, because in order to do hands-on data science work in Python, you need to be really comfortable manipulating data first.

So pandas is probably the very first library that you must master on your data science journey. And that is exactly what we will do in this course as well. So what does pandas do? As a practitioner, you will be working a lot with data.

And you will be reading data from a wide variety of sources. It could be CSV files, Excel files, JSON, HTML, or various types of databases. It could also be from statistical software, such as SAS, SPSS, and so on. pandas provides a nice API to read data from all of these sources and file formats. Once you get the data, you will be able to process it, visualize it, and do all sorts of data wrangling with it.
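To make the reading part concrete, here is a minimal sketch, assuming pandas is installed. The sample records and the names `csv_text`, `json_text`, `df_csv`, and `df_json` are made up for illustration, and in-memory text stands in for real files so that the example runs on its own.

```python
import io

import pandas as pd

# Hypothetical sample data: in-memory text stands in for files on disk.
csv_text = "name,score\nAda,90\nLinus,85\n"
json_text = '[{"name": "Ada", "score": 90}, {"name": "Linus", "score": 85}]'

# read_csv and read_json both accept a path or a file-like object.
df_csv = pd.read_csv(io.StringIO(csv_text))
df_json = pd.read_json(io.StringIO(json_text))

# Either way, the result is a DataFrame with the same rows and columns.
print(df_csv)
print(df_json)
```

With real data you would typically pass a file path instead of a `StringIO` object. Other readers such as `pd.read_excel`, `pd.read_html`, `pd.read_sql`, `pd.read_sas`, and `pd.read_spss` follow the same pattern, although several of them need extra packages installed.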

And pandas is also compatible with the libraries that you use to build machine learning models. So think of pandas as the Excel for Python. Using Microsoft Excel, you can do all sorts of data manipulation. You can do the same with pandas, with an added advantage: pandas has no fixed limit on the size of the data you can work with, other than the amount of RAM in your computer. So the more memory you have in your system, the larger the data you will be able to process.
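Since the data lives in memory, it helps to know how much memory a table takes. The following is a small sketch, assuming pandas is installed; the table and the name `bytes_used` are made up for illustration.

```python
import pandas as pd

# Hypothetical table: 1,000 rows with an integer and a float column.
df = pd.DataFrame({"id": range(1000), "value": [0.5] * 1000})

# memory_usage reports bytes per column (plus the index);
# deep=True also counts the contents of object columns such as strings.
per_column = df.memory_usage(deep=True)
bytes_used = per_column.sum()

print(per_column)
print(f"Total: {bytes_used} bytes")
```

Comparing this number with the free RAM on your machine gives a rough idea of whether a dataset will fit.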

Now, the creator of the pandas package is Wes McKinney, a highly revered figure in the field. He originally started the work in 2008, when he was working for AQR Capital Management. From 2009 onwards, the project became open source, and a lot more people began contributing to it. Today there is an active team that maintains the package and continually develops it. Now, why pandas? Because it is the default library, and the favorite library of a lot of data scientists in the field. If you want to handle tabular data, the default library is pandas. It is also compatible with machine learning libraries such as scikit-learn.

So if you are working with pandas, you are good to go to build your machine learning models as well. And since it is so widely adopted, it is very easy to find a solution for almost any problem that you face with pandas.

Besides this, pandas has excellent plotting capabilities. You can draw the most commonly used plots for data analysis very easily. And finally, it offers two main data structures, the DataFrame and the Series. Both of these data structures are highly optimized for performance, and the API that they provide for working with data is very intuitive. So for all these reasons, you must definitely learn and be very confident working with the pandas library. It will be very useful if you're going to pursue any type of hands-on role in the data science space.
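Here is a minimal sketch of those two data structures, assuming pandas is installed. The values and the names `prices`, `df`, and `column` are made up for illustration.

```python
import pandas as pd

# A Series is a one-dimensional labeled array.
prices = pd.Series([10.0, 12.5, 11.0], index=["a", "b", "c"], name="price")

# A DataFrame is a two-dimensional table whose columns share one index.
df = pd.DataFrame({"price": prices, "qty": [3, 1, 2]})

# Column arithmetic is vectorized: no explicit loop is needed.
df["total"] = df["price"] * df["qty"]

# Selecting a single column of a DataFrame gives back a Series.
column = df["price"]

print(df)
print(type(column))
```

You can think of a DataFrame as a collection of Series that are aligned on the same row labels.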
