Data Analyst Portfolio Project | Correlation in Python | Project 4/4

Today we continue our Data Analyst Portfolio Project Series. In this project we will be working in Python to find correlations between variables.

Please remember to save this project and add it to your GitHub once you are done!

____________________________________________

SUBSCRIBE!
Do you want to become a Data Analyst? That's what this channel is all about! My goal is to help you learn everything you need in order to start your career or even switch your career into Data Analytics. Be sure to subscribe to not miss out on any content!
____________________________________________

*All opinions or statements in this video are my own and do not reflect the opinion of the company I work for or have ever worked for*

0:00 Introduction
0:58 Download Dataset
1:45 Download Python IDE
3:16 Import Python Libraries
4:38 Read in Data using Pandas
8:43 Look for Missing Data
12:30 Data Cleaning
25:08 Finding Correlations in the Data
54:21 Saving and Uploading to GitHub
Comments

Update: Alex, I just accepted my first job as a junior data analyst. This completes my 6-month journey to learn data analytics and change careers, and I could not have done it without your excellent Portfolio videos. Thank you so much for making these available to your viewers for free. After I built my portfolio, companies started taking a second look at my resume and inviting me to interviews. BEFORE the portfolio, I received ONLY rejection emails. Thank you, thank you, thank you!

danielbristow

If anyone else is having issues due to IntCastingNaNError, I suggest trying the following:

df['budget'] = pd.to_numeric(df['budget'],
df['gross'] = pd.to_numeric(df['gross'],

it worked! :) Thank you Alex for your amazing videos!
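
The two lines above are cut off in this copy, so the arguments the commenter passed are unknown. A minimal sketch of one way to get past IntCastingNaNError, assuming the error comes from casting columns that contain missing values to a plain integer type: coerce the columns to numbers, then use pandas' nullable integer dtype. The sample values and the choice of errors="coerce" and "Int64" are assumptions, not necessarily what the commenter used.

```python
import numpy as np
import pandas as pd

# Made-up sample standing in for the movies data (not real dataset values).
df = pd.DataFrame({
    "budget": ["19000000", "unknown", "4500000"],
    "gross": [46000000.0, 58000000.0, np.nan],
})

for col in ["budget", "gross"]:
    # errors="coerce" turns entries that cannot be parsed into NaN instead of raising.
    df[col] = pd.to_numeric(df[col], errors="coerce")
    # "Int64" (capital I) is the nullable integer dtype: NaN becomes <NA>
    # rather than triggering IntCastingNaNError as astype("int64") would.
    df[col] = df[col].astype("Int64")

print(df.dtypes)
print(df)
```

Dropping or filling the missing rows first and then calling astype("int64") is the other common route; which one fits depends on whether the missing rows should stay in the analysis.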

neella

Hey all, just a "stats" heads up/correction you might want to make for your portfolio:

In this video, Alex wanted to see if the company was "correlated" with gross revenue. What he did was assign values (randomly, I think) to companies, countries, etc. Then he tried to see if those values were related to the gross revenue.

Those randomly assigned values are "measuring" the company, country, etc at the Nominal scale. In other words, they're essentially just being used as a numeric "name"—the values themselves don't mean anything. What that means is that one value being higher than the other doesn't represent an increase in the thing being measured (for example, the USA was assigned a 54 and the UK was assigned a 53. Those are just names... the USA isn't one more of something than the UK).

Because the values themselves don't represent anything, it doesn't make sense to do a correlation with them.

Correlations tell us, as one variable increases, what happens to the other? So in the first question, as the budget increased, what happened to revenue? It increased. But with country, company, or other categorical variables, correlations don't make sense. The values for country and company are random, so the numbers that represent them going up doesn't tell us anything. It's no wonder then, that the correlations weren't large.


Since this is a portfolio project and you want to show potential employers the result, maybe just take that part out—you wouldn't want to make a mistake like that in an application to a potential employer!

(Alex, thanks so much for doing these videos! They're super helpful and I'm very very grateful!)
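
A small sketch of the point made in this comment, using made-up numbers rather than the movie dataset: the same countries are given two equally arbitrary numeric "names", and the correlation with gross flips sign. The column names, values, and both codings are hypothetical.

```python
import pandas as pd

# Made-up data: three countries with clearly different revenue levels.
df = pd.DataFrame({
    "country": ["USA", "UK", "France", "USA", "UK", "France"],
    "gross": [100, 60, 20, 110, 50, 30],
})

# Two arbitrary ways of turning the country names into numbers.
coding_a = {"France": 0, "UK": 1, "USA": 2}
coding_b = {"France": 1, "UK": 2, "USA": 0}

# Pearson correlation between the numeric labels and gross, per coding.
corr_a = df["country"].map(coding_a).corr(df["gross"])
corr_b = df["country"].map(coding_b).corr(df["gross"])
print(corr_a, corr_b)  # same data, different labels, different "correlation"

# A summary that does not depend on the labels: mean gross per country.
mean_gross = df.groupby("country")["gross"].mean()
print(mean_gross)
```

Comparing group means, as in the last step, is one label-independent way to look at a categorical variable against revenue.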

woahnelly

I can't wait for the beginner, intermediate, and advanced Python series by Alex the Analyst. It's what the people want, besides a happy Alex.

darkavenger

Hello! The dataset appears to have been updated on Kaggle, so anyone new will run into some issues that need fixing in order to follow along.

1. Missing data. Unlike in the video, the dataset now has missing values, so you will need to handle them. There are many ways to handle missing values, but for the sake of time I decided to drop all rows that have missing data. You will have about 71% of your data remaining. Run the following if your dataframe is named df.

df = df.dropna()

2. Extracting the year is different as the formatting is different. Running the following should extract the correct year.

df['yearcorrect'] = = '([0-9]{4})').astype(int)

3. Duplicates. There aren't any in this dataset, so you should be fine there.
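
Steps 1 and 2 above can be sketched as follows. The line in step 2 is garbled in this copy, so the extraction call here is a guess at its intent: take the first four-digit run from the released column with str.extract. The sample rows and their date format are assumptions.

```python
import numpy as np
import pandas as pd

# Made-up rows in the assumed format of the updated 'released' column.
df = pd.DataFrame({
    "released": [
        "June 13, 1980 (United States)",
        "July 2, 1981 (United States)",
        "May 16, 1986 (United States)",
        np.nan,
    ],
    "budget": [19000000.0, np.nan, 15000000.0, 6000000.0],
})

rows_before = len(df)
# Step 1: drop every row with at least one missing value.
# copy() makes the result an independent frame before a column is added.
df = df.dropna().copy()
share_kept = len(df) / rows_before

# Step 2: pull the first run of four digits out of 'released'.
# expand=False returns a Series rather than a one-column DataFrame.
df["yearcorrect"] = df["released"].str.extract(r"([0-9]{4})", expand=False).astype(int)

print(f"kept {share_kept:.0%} of rows")
print(df[["released", "yearcorrect"]])
```

Printing the share of rows kept is a quick way to check how much data dropna() removed on the real file.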

I hope this helps anyone that is working on this and best of luck on your analytics journey!

izzyinsc

If df.corr() raises an error saying that a string can't be converted to a number, pass the parameter numeric_only=True: df.corr(numeric_only=True)
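
A minimal sketch of that fix on a made-up frame with one text column (the column names and values are hypothetical):

```python
import pandas as pd

# Made-up frame: one text column and two numeric columns.
df = pd.DataFrame({
    "name": ["Movie A", "Movie B", "Movie C", "Movie D"],
    "budget": [10, 20, 30, 40],
    "gross": [25, 35, 70, 90],
})

# numeric_only=True tells corr() to skip non-numeric columns such as 'name'.
correlation_matrix = df.corr(method="pearson", numeric_only=True)
print(correlation_matrix)

# An alternative that selects the numeric columns explicitly first.
same_matrix = df.select_dtypes(include="number").corr(method="pearson")
```

The select_dtypes route gives the same matrix and also runs on pandas versions that predate the numeric_only argument of corr().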

naincypushpad

The dataset has been updated and is not the same as the one in the video. If you guys have problems in the 'Create correct year' section, you can split the data to get only the year:
df['yearcorrect'] =
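
The right-hand side of that line is missing in this copy. A hedged sketch of what a split-based approach could look like, assuming every released value has the form "Month day, year (Country)"; the sample rows are made up and this is not necessarily the commenter's code.

```python
import pandas as pd

# Made-up rows in the assumed "Month day, year (Country)" format.
df = pd.DataFrame({
    "released": [
        "June 13, 1980 (United States)",
        "July 2, 1981 (United States)",
    ],
})

# Split on the comma: "June 13, 1980 (United States)" -> ["June 13", "1980 (United States)"]
after_comma = df["released"].astype(str).str.split(", ").str[-1]

# The year is then the first space-separated token of the second piece.
df["yearcorrect"] = after_comma.str.split(" ").str[0].astype(int)
print(df)
```

This only works when every value contains the comma; rows in a different format would need to be handled separately or with a regular expression.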

OmarJimenez-dqsr

At 11:08, instead of printing the null percentage, we can use:
for col in df.columns:
    print(df[col].isnull().value_counts(), "\n")

This prints how many values are null in each column. That matters because a column might have 1 missing value in 10,000, which a percentage only shows at high decimal precision.
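
A compact variant of the same idea, sketched on a made-up frame: count the nulls per column and keep the share alongside, so a single missing value is not lost to rounding. The column names and values are hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up frame with one missing budget value.
df = pd.DataFrame({
    "budget": [1.0, np.nan, 3.0, 4.0],
    "gross": [10.0, 20.0, 30.0, 40.0],
})

# Absolute number of missing values per column.
null_counts = df.isnull().sum()
# Fraction of missing values per column (mean of the True/False mask).
null_share = df.isnull().mean()

summary = pd.DataFrame({"missing": null_counts, "share": null_share})
print(summary)
```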

rickydonne

Honestly, you're an absolute legend. You really break down some of the technical barriers that exist for people entering the field of data science. You really are a gem to the community.

omashan

If you are facing an error in the datatype change, try the following:
df_copy = df.copy()
df_copy['budget'] =
df_copy['gross'] =
df_copy

Thank you Alex for this amazing video

rebekhathangam

Thank you, Alex! I learned so much.
Is anyone's correlation matrix not working? You need to add 'numeric_only = True'. The default is now False.
correlation_matrix = df.corr(method = 'pearson', numeric_only = True)

sarahcongcongyang

I really appreciate the fact that you did not edit out the parts where you made "mistakes" and actually fixed them.

gastonsuarez

There are some missing values in this dataset.
Alex, try this instead of that for loop statement:
df.isnull().sum()
This gives the total number of nulls for every column/variable.

moushmi_nishiganddha

At 13:40, if you are facing an error in the datatype change, try the following:


Hope it helps.

vishakhasingh

Man, thank you so much for this, I know you've put a lot of effort into this project series and I can definitely say that I'm a huge fan of your work! Thank you for sharing this, you're helping a lot of people pursue their dreams! Greetings from Brazil :)

Dpereira

Hey guys, at 46:24, we can simply assign a copy made with .copy() to our new variable if we want to loop over it without affecting the original dataframe, df:

df_numerized = df.copy()

for col_name in df_numerized.columns:
== 'object'):
df_numerized[col_name] =
df_numerized[col_name] =

df_numerized
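
Several lines of the loop above are cut off in this copy. A hedged sketch of what such a loop typically looks like, assuming the goal is to replace each text column with integer category codes on a copy; the sample data is made up, and the dtype check uses pandas' is_numeric_dtype helper rather than the commenter's (partly lost) condition.

```python
import pandas as pd

# Made-up sample with one text column and one numeric column.
df = pd.DataFrame({
    "company": ["Warner Bros.", "Universal Pictures", "Warner Bros."],
    "gross": [100, 60, 110],
})

# Work on a copy so the original frame keeps its text values.
df_numerized = df.copy()

for col_name in df_numerized.columns:
    # Only non-numeric columns need converting.
    if not pd.api.types.is_numeric_dtype(df_numerized[col_name]):
        # Categories are sorted alphabetically, then replaced by their integer codes.
        df_numerized[col_name] = df_numerized[col_name].astype("category").cat.codes

print(df_numerized)
print(df)  # unchanged
```

The codes are only labels assigned in alphabetical order, so their numeric size carries no meaning of its own.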

reezalzainudin

Man, thank you so much for this, I know you've put a lot of effort into this project series and I can definitely say that I'm a huge fan of your work! Thank you for sharing this, you're helping a lot of people pursue their dreams! Greetings from the UK :)

shanali

As The Rock says; "FINALLY!"

I'm a bit embarrassed by how excited I get when an ATA video clocks in at over an hour...

Major_Data

Hey guys, the data has been updated since this video was posted. While going through the project, I was able to google the problems as they came up.

In case you guys get stumped, here's what I found that works:


This will drop any rows with null values
df = df.dropna(how='any', axis=0)

This will put the year from the released date column into a separate column
df['yearcorrect'] = df['released'].astype(str).str.split(',


Let me know if that works for y'all

tylerlaquinta

To everyone getting an error from df.corr():

This was my fix:
# Since pandas 2.0.0 you need to pass numeric_only=True to avoid the error

df.corr(method='pearson', numeric_only=True) #pearson, kendall, spearman

---
correlation_matrix = df.corr(method='pearson', numeric_only=True)

sns.heatmap(correlation_matrix, annot=True)

plt.show()

mikeramirez