Tidy Tuesday live screencast: Analyzing employment and earnings in R

I'll analyze a dataset about employment and earnings over time, without looking at the dataset in advance.

Comments

2:00 Setup and downloading the dataset [use_tidytemplate(), tt <- tidytuesdayR::tt_load("2021-02-23"), tt_load(2021, week=9)]
2:50 "employed" dataset counting industries, years, major_occupation, race_gender [employed <- tt$employed]
3:50 add a dimension column holding race and gender [mutate(dimension = case_when(race_gender == "TOTAL" ~ "Total", ...))]
4:45 Bar chart of industries over time [geom_col(), fct_lump(industry, 8, w=employ_n), fct_reorder(industry, employ_n, sum)]
7:30 Faceted charts by industries [geom_line(), geom_col(), expand_limits(), facet_wrap(~ industry, scales = "free_y")]
9:10 Explaining how the "saw plot pattern" can be avoided and why it appears
10:30 Investigating the nested data structure (industry > major occupation > minor occupation)
11:10 Adding gender and race information to bar chart [ggplot(aes(fill = race_gender))]
12:20 Investigating the drops of industries by gender [group_by(), summarize(employ_n = sum(employ_n)), geom_line(), expand_limits(y = 0)]
16:00 Only focus on 2019 to 2020 changes across industries based on race and gender [geom_line(), fct_reorder(industry, employ_n, sum)]
19:30 Reverse the legend order [scale_color_discrete(guide = guide_legend(reverse = TRUE))]
21:28 Produce ratio 2019/2020 and change-% [group_by(industry, dimension, race_gender), summarize(ratio = last(employ_n) / first(employ_n), employed_2019 = first(employ_n), change = ratio - 1)]
22:48 What are the biggest drops in ratio from 2019 to 2020 by race and gender [geom_col(), scale_x_continuous(labels = percent)]
25:30 Add gender information [geom_col(position = "dodge"), scale_fill_discrete(guide = guide_legend(reverse = TRUE))]
27:10 Lollipop graph [geom_point(aes(size = employed_2019)), scale_size_continuous(labels = comma, guide = FALSE), geom_errorbarh(), geom_vline()]
30:10 Fill bar plot with race information
31:15 Revisit the lollipop graph with dodged lines [geom_point(position = position_dodge(width = 0.5)), geom_errorbarh(group = race_gender, position = position_dodge(width = .7))]
34:17 Scatterplot of employed_n vs change [geom_point(), = industry)), geom_hline()]
38:00 Data transformation interlude [comparison <- employed_cleaned %>% mutate(paste()) %>% gather(level, occupation, industry, major_occupation, minor_occupation), group_by(), summarize(), ungroup()]
45:50 Write function "compare_lollipop" to use the new comparison data set and visualize changes for major_occupation of a certain industry by gender and race automatically.
52:59 Add total n of jobs to the y-axis of lollipop chart [glue("{ occupation } ({ comma(total_2019 / 1000) }K)")]
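
The ratio and change step at 21:28 can be sketched on toy data roughly as follows. This is a minimal illustration, not the screencast's exact code: the `employed_toy` table and its numbers are made up, and the `dimension` grouping column from the original call is left out for brevity.

```r
library(dplyr)

# Toy data (invented for illustration; not the real Tidy Tuesday numbers)
employed_toy <- tibble(
  industry    = rep(c("Construction", "Retail trade"), each = 4),
  race_gender = rep(rep(c("Men", "Women"), each = 2), times = 2),
  year        = rep(c(2019, 2020), times = 4),
  employ_n    = c(1000, 900, 200, 150, 800, 720, 900, 810)
)

changes <- employed_toy %>%
  arrange(year) %>%                      # so first() is 2019 and last() is 2020
  group_by(industry, race_gender) %>%
  summarize(
    ratio         = last(employ_n) / first(employ_n),
    employed_2019 = first(employ_n),
    change        = ratio - 1,           # e.g. a ratio of 0.9 is a 10% drop
    .groups = "drop"
  )

changes
```

Sorting by year before grouping matters here, because `first()` and `last()` simply take the first and last row of each group.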

Thank you, David, for these entertaining and instructive screencasts.

If anyone wants to see summaries of previous screencasts feel free to visit my channel.

TheDataDigest

David, thanks a lot for this!
I've noticed that lately your screencasts increasingly deal with graphing and/or Shiny app or dashboard development.

Could you perhaps balance that with modelling (frequentist LM/GLM or Bayesian)? Those screencasts were very enlightening (for me, at least).

alikazmi

Thank you for this amazing analysis; I really liked this screencast in particular. That said, I came across some data issues that can be solved with minor cleaning.

1. employed <- employed %>%
mutate(minor_occupation = str_replace(minor_occupation, "Manage-ment", "Management"))

2. employed %>%
filter(!industry %in% c(NA, "Men", "Women", "White", "Black or African American", "Asian"))

3. Millions could look better than using labels = comma:

scale_y_continuous(labels = unit_format(unit = "M", scale = 1e-6))

vineetsansi
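
The three suggestions above can be tried on a small example like the one below. This is only a sketch: the `employed_toy` rows are invented to imitate the issues described, and `millions` is a hypothetical name for the labeller.

```r
library(dplyr)
library(stringr)
library(scales)

# Toy rows imitating the issues described above (values are made up)
employed_toy <- tibble(
  industry         = c("Construction", "Men", NA, "Retail trade"),
  minor_occupation = c("Manage-ment, business", "Sales", "Sales", "Service"),
  employ_n         = c(2500000, 100, 100, 1200000)
)

employed_clean <- employed_toy %>%
  # 1. repair the word that was hyphenated in the raw data
  mutate(minor_occupation = str_replace(minor_occupation, "Manage-ment", "Management")) %>%
  # 2. drop rows where NA or a demographic label ended up in the industry column
  filter(!industry %in% c(NA, "Men", "Women", "White", "Black or African American", "Asian"))

# 3. labeller that rescales counts to millions and appends "M"
millions <- unit_format(unit = "M", scale = 1e-6)
millions(employed_clean$employ_n)
```

In a plot, the same labeller would be passed as `scale_y_continuous(labels = millions)`.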

Sorry, how did you "load the data" at the beginning of the video? Is there a resource for how to do that? Sorry if that's a very basic question.

intellectualselfdefense
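
Going by the 2:00 and 2:50 entries in the timestamp summary above, the data is loaded with the tidytuesdayR package. A minimal sketch, wrapped in a helper whose name (`load_employed`) is made up here; it needs the package installed and an internet connection when called:

```r
# install.packages("tidytuesdayR")  # one-time install from CRAN

# Hypothetical helper: downloads one week of Tidy Tuesday data and
# returns the "employed" table used in the screencast.
load_employed <- function(date = "2021-02-23") {
  tt <- tidytuesdayR::tt_load(date)  # same as tt_load(2021, week = 9)
  tt$employed
}
```

Running `employed <- load_employed()` should then give the table explored from 2:50 onward.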