Alex Petralia - Analyzing Data: What pandas and SQL Taught Me About Taking an Average - PyCon 2018

preview_player
Показать описание
Speaker: Alex Petralia

“So tell me,” my manager said, “what is an average?”

There’s probably nothing worse than that sinking feeling when you finish an analysis, email it to your manager or client to review, and they point out a mistake so basic you can’t even fathom how you missed it.

This talk is about mine: how to take an average.

In this talk, we follow my arduous and humbling journey of learning how to properly take an average with multidimensional data. We will cover how improperly calculating it can produce grossly incorrect figures, which can slip into publications, research analyses and management reports.

Рекомендации по теме
Комментарии
Автор

For the whole talk I was waiting for the speaker to spring the realization on his audience that there were no trades on Wednesday but unfortunately it didn't happen. There were no observed trades on Wednesday so the sum for Wednesday is 0 and would bring down the average. The set of days has to be an independent input as we aren't guaranteed to have trades every day. We would likely need the full set of exchanges as there may be an exchange that is infrequently used.

bigbred
Автор

This has highlighted the problem of why a simple question turns into a complex task. I would argue that even the results shown here are answering different questions than asked as they do not take into account the null results.(days where no apples were sold).

The question answered in the example was: for days a seller sold at least one apple what was the daily average of apples sold for each seller?

This is where you need to be aware of business practices happening in the real world. (Does the apple shop only open 3 days or is it 5/6/7 does it open on holidays or do you just want an average over a complete time period month/qtr/year). The main observation is that you can't rely on the original dataset given to give your 'collapsing key'. Sorry if this was obvious.

gileslangdon
Автор

A decent talk if you aren't familiar with SQL GROUP BY. For people comfortable with the concept, it's not very valuable.

zgrunschlag
Автор

Nice example of how analysis can go wrong if we do not get the basics right. So what basically goes wrong here?

The observational units were misidentified.

In this case the question is about an aggregate of the rows in the table, not on the individual rows themselves.
The solution is straightforward. Construct a table of aggregate data first, then answer the question using the new table.
To do that we do not need a magic formula with three subtly different keys.

Just spent some time on identifying the units you are trying to process.
Use aggregation, selection and joining etc to get a table that has the desired observational unit in each row.
Use this table to answer the question posed.

Hint: be sure you understand the concept of observational unit (IMHO, what was missing here is a clear understanding of this concept)

vilkoos
Автор

Jane sold 16 apples in 3 days. Why is her average daily amount 16?

mihalynemes
Автор

This surely could have been conveyed in 5 mins.. brevity people!!

adorablecheetah
Автор

besides present it nicely to give the conceptual name collapsing key and grouping key to help understand the basic.. honestly, I think it's a really basic concept for even an entry-level analyst...

aiexplainai
Автор

Thanks a lot Alex

SQl/Pandas ->Formula
Inner_Collapsingkey - Outer_GroupingKey = Implicit_ObservationKey



Amazing talk.

bharatggaikwad
Автор

Who would have thought...
Enjoyed the talk - well executed and concise...

MMphego
Автор

Very good, thanks for sharing. A lil' more Pandas and it coulda' been great :)

tripkendall
Автор

Im no expert, but I feel like the meth kicked in. @17:30

Also it seems like you're unnecessarily reinventing ' df.set_index.'

mateuszszkubel
Автор

Amazing talk. This seemingly simple concepts rapidly get complicated up in greater dimensions.

ronaldokun