Real-World Dataset Cleaning with Python Pandas! (Olympic Athletes Dataset)


I'm prepping a dataset for an upcoming tutorial and I figured walking through the process of cleaning it would work well for a livestream! We use various Python Pandas functions to accomplish our data cleaning goals.

We'll be working off of this repo:

Some topics that we cover:
- How you can use web scraping to collect data like this (Python beautifulsoup).
- Splitting strings into separate columns
- Using regular expressions (regexes) to extract specific details from columns
- Converting columns to datetime & numeric types
- Grabbing only a subset of our columns

Sorry that this was a bit last minute scheduling-wise, will try to give more advance notice in the future!

Video timeline!
0:00 - Livestream Overview
4:00 - About the Olympics dataset (source website and how it was scraped)
9:50 - Cleaning the dataset (getting started with code & data)
19:26 - What aspects of our data should be cleaned?
29:08 - Get rid of bullet points in Used name column
34:08 - How to split Measurements into two separate height/weight numeric columns.
1:05:00 - Parse out dates from Born & Died columns
1:25:43 - Parse out city, region, and country from Born column (working with regular expressions)
1:41:15 - Get rid of the extra columns
1:49:41 - Questions & Answers
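A few of the steps in the timeline can be sketched roughly as follows. This is only an illustration, not the code written on the stream: the sample rows are made up, and the layout of the "Used name" and "Born" values (bullet-separated names, "date in City, Region (CODE)") is an assumption about how the scraped data looks. The output column names are hypothetical too.

```python
import pandas as pd

# Hypothetical rows shaped like the scraped athlete bios (not real data).
df = pd.DataFrame({
    "Used name": ["Jane•Doe", "John•Q.•Public"],
    "Born": [
        "1 January 1900 in Springfield, Illinois (USA)",
        "15 March 1950 in Lyon, Rhône (FRA)",
    ],
    "Extra": ["x", "y"],
})

# Get rid of the bullet points in the name by swapping them for spaces.
df["name"] = df["Used name"].str.replace("•", " ", regex=False)

# Parse city, region and country code out of the text that follows "in".
# Named groups become the column names of the extracted frame.
place_pattern = (
    r"\bin (?P<born_city>[^,]+), (?P<born_region>.+) \((?P<born_country>\w+)\)$"
)
place = df["Born"].str.extract(place_pattern)
df = pd.concat([df, place], axis=1)

# Get rid of the extra columns by keeping only the cleaned ones.
clean = df[["name", "born_city", "born_region", "born_country"]]
print(clean)
```

Rows whose "Born" text does not follow the assumed layout simply get missing values in the three place columns, so it is worth checking how many rows fail to match before relying on the result.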

-------------------------
Follow me on social media!

-------------------------
Practice your Python Pandas data science skills with problems on StrataScratch!

Join the Python Army to get access to perks!

*I use affiliate links on the products that I recommend. I may earn a purchase commission or a referral bonus from the usage of these links.


Comments

Thank you everyone who tuned in today!!

KeithGalli

I really thank god that I found your channel. Thanks for sharing knowledge, and keep uploading!

rrrprogram

Such a great tutorial Keith. Please keep uploading such high quality videos on Pandas and many more

aishwaryapattnaik

I missed the live stream, but I am watching this video atm. This is the second upload of yours I have watched. I am a subscriber and wish to thank you very much for your uploads. Please, keep them coming. I am very new to Python. I am learning Python firstly to realise a knowledge graph 'index' for computational shells and shell scripting in the widest possible purview, for a Web app/website version of a dedicated work on computational shells and shell scripting that I have spent the last six months writing. I need to extract all the data from an archive of Markdown files (the book I have written), which involves cleaning and preserving the relationships of the data to inform the generation of an ontology of the computational shells and shell scripting domain through natural language processing. Establish a dataset. Export dataset into a directed graph. Visualise with NetworkX. I don't yet know how to do any of this. If you could cover some of the processes involved to realise a knowledge graph from a Markdown file, that would be brilliant! Thanks again for your uploads.

beauforda.stenberg

Thanks Keith! I know it takes some time to prepare and record such stuff, but please upload more Python coding!

marcinjagusz

What's your laptop? Cool videos BTW

AndyJagroom-urxh

Great stream this was very helpful! Keep up the good work!

danprovost

We need more videos like this that work on real-world data.

zahidmhd

Can you do an update on the numpy video? Thank you so much for these videos, they helped me a lot ❤

AndyJagroom-urxh

Hi Keith, watching this video and following along. Just wondering: when we got the fillna code from ChatGPT, should we have applied that to our original data frame? Loving the content!

brendanthorne

okay i need full course on data science

zahidmhd

Hi Keith,

This code handles the issue well:

# Split column 'Measurements' into height_cm and weight_kgs.
# The regex patterns are assumed; they expect values like "180 cm / 75 kg".

# Extract height and weight information
dfCpy['height_cm'] = dfCpy['Measurements'].str.extract(r'(\d+)\s*cm', expand=False).astype(float)
dfCpy['weight_kgs'] = dfCpy['Measurements'].str.extract(r'(\d+)\s*kg', expand=False).astype(float)

dfCpy

rrcr

Please Upload more videos related to data cleaning

-ashish

just a note, at 1:19:21 the format = "mixed" isn't really working for me, and it fills the date_born column with NaT values. So, I tried format = "%d %B %Y" and it works

Kira-vsnp
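A minimal sketch of the explicit-format approach from the comment above, assuming the date part of "Born" looks like "1 January 1900". The sample values and the extraction regex are made up for illustration. One possible reason for the NaT values is the pandas version: format="mixed" is only available from pandas 2.0 onward.

```python
import pandas as pd

# Made-up "Born" values: date plus place, date only, and place only.
born = pd.Series([
    "1 January 1900 in Springfield, Illinois (USA)",
    "15 March 1950",
    "in Lyon, Rhône (FRA)",
])

# Pull out just the "day month year" text; rows without one become NaN.
date_text = born.str.extract(r"(\d{1,2} [A-Za-z]+ \d{4})", expand=False)

# Parse with an explicit format. errors="coerce" turns anything that
# does not fit the format into NaT instead of raising an error.
date_born = pd.to_datetime(date_text, format="%d %B %Y", errors="coerce")
print(date_born)
```

Note that %B matches full English month names under the default locale, so abbreviated months or year-only values would still come out as NaT with this format.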

Hawaiian shirt and Twisted Tea! My man

chillydoog

Should i always drop the rows containing null values and then perform the further analysis???

vg

Hats off to you for covering real-life problems and end-to-end projects related to data science.

hassankhalid

I asked ChatGPT the same questions you framed and it gave the same solutions. I could have just downloaded the dataset, pasted a few rows into ChatGPT along with the questions, and gotten in 5 minutes what this video covers in 2 hours.

SAGAR-oxks

That's what I used:
# Parse out dates from Born and Died

df['Born Date'] = df['Born'].str.replace(r'in.*', '', regex=True)
df['Death Date'] = df['Died'].str.replace(r'in.*', '', regex=True)

youcefbouras-fs

Great! The height/weight part is a bit long; there may be a simpler solution:
measure_pattern =
df[['height', 'weight']] =

cnliving
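One way the single-pattern idea could look is sketched below. The pattern is hypothetical (the one in the comment above did not survive), and it assumes measurements are written like "180 cm / 75 kg", with either part possibly missing. The sample values are made up.

```python
import pandas as pd

# Made-up measurements: both values, height only, weight only, missing.
df = pd.DataFrame({"Measurements": ["180 cm / 75 kg", "165 cm", "60 kg", None]})

# Hypothetical pattern: an optional height, an optional " / " separator,
# and an optional weight. Named groups label the extracted columns.
measure_pattern = r"^(?:(?P<height>\d+) cm)?(?: / )?(?:(?P<weight>\d+) kg)?$"

# str.extract returns one column per group; a group that did not
# take part in the match comes back as NaN.
extracted = df["Measurements"].str.extract(measure_pattern).astype(float)
df[["height", "weight"]] = extracted
print(df)
```

Values that deviate from the assumed layout (ranges, decimals, other units) end up as NaN in both columns, so they would need their own handling.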
