Alex Petralia - Analyzing Data: What pandas and SQL Taught Me About Taking an Average - PyCon 2018


Speaker: Alex Petralia

“So tell me,” my manager said, “what is an average?”

There’s probably nothing worse than that sinking feeling when you finish an analysis, email it to your manager or client to review, and they point out a mistake so basic you can’t even fathom how you missed it.

This talk is about mine: how to take an average.

In this talk, we follow my arduous and humbling journey of learning how to properly take an average with multidimensional data. We will cover how improperly calculating it can produce grossly incorrect figures, which can slip into publications, research analyses and management reports.


Comments

For the whole talk I was waiting for the speaker to spring the realization on his audience that there were no trades on Wednesday but unfortunately it didn't happen. There were no observed trades on Wednesday so the sum for Wednesday is 0 and would bring down the average. The set of days has to be an independent input as we aren't guaranteed to have trades every day. We would likely need the full set of exchanges as there may be an exchange that is infrequently used.

bigbred
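The point in the comment above, that the set of days has to come from somewhere other than the trades themselves, can be sketched in a few lines. This is a minimal illustration, not the speaker's code: the trade rows, the exchange names and the `trading_days` calendar are all made up for the example.

```python
from collections import defaultdict

# Hypothetical observed trades: (day, exchange, volume).
# There are deliberately no rows for Wednesday.
trades = [
    ("Mon", "NYSE", 100),
    ("Mon", "NASDAQ", 50),
    ("Tue", "NYSE", 70),
    ("Thu", "NYSE", 30),
    ("Fri", "NASDAQ", 50),
]

# Assumption: the calendar is an independent input, not derived from `trades`.
trading_days = ["Mon", "Tue", "Wed", "Thu", "Fri"]

# Collapse the trades to one total per day.
totals = defaultdict(int)
for day, _exchange, volume in trades:
    totals[day] += volume

# Average over the days that happen to appear in the data (Wednesday is invisible).
observed_avg = sum(totals.values()) / len(totals)

# Average over the full calendar, counting Wednesday as a zero-volume day.
calendar_avg = sum(totals.get(day, 0) for day in trading_days) / len(trading_days)

print(observed_avg, calendar_avg)
```

With these made-up numbers the first figure is 75.0 and the second is 60.0; which one is right depends on whether a day without trades should count as a zero or be left out of the question entirely.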

This has highlighted why a simple question turns into a complex task. I would argue that even the results shown here answer a different question than the one asked, as they do not take into account the null results (days where no apples were sold).

The question answered in the example was: for days on which a seller sold at least one apple, what was the daily average of apples sold for each seller?

This is where you need to be aware of business practices happening in the real world. (Does the apple shop open only 3 days a week, or is it 5, 6 or 7? Does it open on holidays? Or do you just want an average over a complete time period such as a month, quarter or year?) The main observation is that you can't rely on the original dataset to give you your 'collapsing key'. Sorry if this was obvious.

gileslangdon

A decent talk if you aren't familiar with SQL GROUP BY. For people comfortable with the concept, it's not very valuable.

zgrunschlag

Nice example of how analysis can go wrong if we do not get the basics right. So what basically goes wrong here?

The observational units were misidentified.

In this case the question is about an aggregate of the rows in the table, not about the individual rows themselves.
The solution is straightforward. Construct a table of aggregate data first, then answer the question using the new table.
To do that we do not need a magic formula with three subtly different keys.

Just spend some time on identifying the units you are trying to process.
Use aggregation, selection, joining, etc. to get a table that has the desired observational unit in each row.
Use this table to answer the question posed.

Hint: be sure you understand the concept of observational unit (IMHO, what was missing here is a clear understanding of this concept)

vilkoos
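The recipe in the comment above (build a table with one row per observational unit, then answer the question from that table) might look like this in SQL. This is only a sketch using Python's built-in `sqlite3`; the `sales` table, its columns and every number in it are invented for illustration and are not taken from the talk.

```python
import sqlite3

# Hypothetical raw data: one row per sale, not one row per seller per day.
rows = [
    ("2018-05-01", "Jane", 4),
    ("2018-05-01", "Jane", 2),
    ("2018-05-02", "Jane", 6),
    ("2018-05-03", "Jane", 4),
    ("2018-05-01", "Bob", 3),
    ("2018-05-02", "Bob", 5),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (day TEXT, seller TEXT, apples INTEGER)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)

# Averaging the raw rows gives apples per sale, not apples per day.
naive = dict(conn.execute(
    "SELECT seller, AVG(apples) FROM sales GROUP BY seller"
).fetchall())

# Step 1 (inner query): one row per seller per day, the observational unit.
# Step 2 (outer query): average those daily totals for each seller.
two_step = dict(conn.execute("""
    SELECT seller, AVG(daily_apples)
    FROM (
        SELECT seller, day, SUM(apples) AS daily_apples
        FROM sales
        GROUP BY seller, day
    )
    GROUP BY seller
""").fetchall())

conn.close()
print(naive, two_step)
```

In this invented dataset Jane makes two sales on the first day, so the row-level average and the daily average disagree for her, while they coincide for Bob, who has exactly one row per day.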

Jane sold 16 apples in 3 days. Why is her average daily amount 16?

mihalynemes

This surely could have been conveyed in 5 mins... brevity, people!!

adorablecheetah

Aside from presenting it nicely and giving the conceptual names 'collapsing key' and 'grouping key' to help explain the basics... honestly, I think it's a really basic concept, even for an entry-level analyst...

aiexplainai

Thanks a lot Alex

SQL/pandas -> Formula
Inner_CollapsingKey - Outer_GroupingKey = Implicit_ObservationKey

Amazing talk.

bharatggaikwad
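One way to read the formula in the comment above is as a set difference: the columns you collapse on, minus the columns you finally group by, are the columns you are implicitly averaging over. The sketch below is an interpretation of that summary, not the speaker's implementation; `grouped_average`, the column names and the sample rows are all hypothetical.

```python
from collections import defaultdict

def grouped_average(rows, value, collapsing_key, grouping_key):
    """Sum `value` per collapsing key, then average those sums per grouping key."""
    collapsing = sorted(collapsing_key)
    grouping = sorted(grouping_key)

    # Inner step: collapse the raw rows to one total per collapsing-key combination.
    sums = defaultdict(float)
    for row in rows:
        sums[tuple(row[k] for k in collapsing)] += row[value]

    # Outer step: average the collapsed totals within each grouping-key combination.
    buckets = defaultdict(list)
    for key, total in sums.items():
        record = dict(zip(collapsing, key))
        buckets[tuple(record[k] for k in grouping)].append(total)
    return {group: sum(vals) / len(vals) for group, vals in buckets.items()}

# Made-up sales rows: Jane has two sales on the same day.
sales = [
    {"day": "Mon", "seller": "Jane", "apples": 4},
    {"day": "Mon", "seller": "Jane", "apples": 2},
    {"day": "Tue", "seller": "Jane", "apples": 6},
    {"day": "Wed", "seller": "Jane", "apples": 4},
    {"day": "Mon", "seller": "Bob", "apples": 3},
    {"day": "Tue", "seller": "Bob", "apples": 5},
]

collapsing_key = {"day", "seller"}
grouping_key = {"seller"}

# The implicit observation key is what remains: here, the day.
observation_key = collapsing_key - grouping_key

averages = grouped_average(sales, "apples", collapsing_key, grouping_key)
print(observation_key, averages)
```

Here the result is an average of apples per day for each seller, because `day` is the only column in the collapsing key that is not in the grouping key.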

Who would have thought...
Enjoyed the talk - well executed and concise...

MMphego

Very good, thanks for sharing. A lil' more Pandas and it coulda' been great :)

tripkendall

I'm no expert, but I feel like the meth kicked in. @17:30

Also, it seems like you're unnecessarily reinventing 'df.set_index'.

mateuszszkubel

Amazing talk. These seemingly simple concepts rapidly get complicated in higher dimensions.

ronaldokun


Hard technical skills for modern-day research and business analysts (Alex Petralia)

The importance of exploratory data analysis and data visualization in machine learning - PyCon 2018

PyCon.DE 2017 Alexander Hendorf - Effective Data Analysis with Pandas Indexes

Alex Gaynor - Learning From Failure: Post Mortems - PyCon 2018

SQL vs Python Pandas

Christy Heaton - Intro to Spatial Analysis and Maps with Python - PyCon 2018

Why Combine Pandas and SQL

Markus Winand - The Mother of all Query Languages: SQL in the 21st Century

Jake VanderPlas - Performance Python: Seven Strategies for Optimizing Your Numerical Code

Visualising data with Python

[ENG] Alexander Hendorf: 'Efficient Data Mangling with Pandas Indexes'

Julie Qiu - Strategies to Edit Production Data - PyCon 2018

Code as Craft | Tanya Reilly

Skipper Seabold - Introduction to Python for Data Science - PyCon 2018

KISS: Keep it SQL, Stupid | Fishtown Analytics

Panel: Scaling Smart: Understanding Data Skill Needs for Any Organization (DataEDGE Conf. 2012)

Dimiter Naydenov - All You Need is Pandas: Unexpected Success Stories

Albert Sweigart, 'Logging and Testing and Debugging, Oh My!', PyBay2017

PyCon.DE 2017 Alex Conway - Deep Learning for Computer Vision

Create and Query SQL Database with Python | SQL and Pandas

Rae Knowler - Python, Locales and Writing Systems - PyCon 2018

Live Workshop: Intro to Data Analysis with Pandas

Anomalous Air Temperature Reading from 1979 to 2018