7 Tips To Structure Your Python Data Science Projects

In this video, I’ll cover 7 tips to streamline the structure of your Python data science projects. With the right setup and thoughtful software design, you'll be able to modify and enhance your projects more efficiently.

🔖 Chapters:
0:00 Intro
0:50 Tip #1: Use a common structure
1:55 Tip #2: Use existing libraries
4:59 Tip #3: Log your results
5:55 Tip #4: Use intermediate data representations
8:09 Tip #5: Move reusable code to a shared editable package
9:24 Tip #6: Move configuration to a separate file
11:45 Tip #7: Write unit tests
14:09 Final thoughts

#arjancodes #softwaredesign #python
Comments

If you use notebooks, I _highly_ recommend enabling autoreload. I find myself using notebooks / VS Code interactive sessions frequently. One of my biggest frustrations with notebooks was that if I changed a function, I would have to rerun that cell every time to update the function definition. It was also less conducive to separating my code out into submodules (which are quite convenient). It was a total game changer for me to add "%load_ext autoreload\n%autoreload 2" to my Jupyter startup commands. In a way, this workflow promotes putting functions in submodules, because any time you call a function, it will reload that file with any changes you have made.
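For anyone looking to set this up, here is a minimal sketch that enables autoreload for every kernel via an IPython startup file; the file path and name below are just the usual convention, not something from the video or the comment:

# ~/.ipython/profile_default/startup/00-autoreload.py
# Executed automatically whenever an IPython/Jupyter kernel starts.
from IPython import get_ipython

ip = get_ipython()
if ip is not None:  # only applies when actually running inside IPython/Jupyter
    ip.run_line_magic("load_ext", "autoreload")
    ip.run_line_magic("autoreload", "2")  # reload all modules before executing code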

andrewglick

One thing I learned long before I started learning Python: bugs seem to be inevitable, and when one happens I ask myself, "How did this get past my unit tests?" So I go back and modify the test suite to catch the bug. Not quite 'test driven development', but really helpful with any sort of iterative development or refactoring of a project.
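As an illustration of that workflow (a sketch with made-up names, not from the video): reproduce the bug in a failing test first, then fix the code until it passes.

# tests/test_cleaning.py
import pandas as pd

def drop_duplicate_customers(df: pd.DataFrame) -> pd.DataFrame:
    """Fixed version: duplicates are now matched case-insensitively."""
    return df[~df["customer"].str.lower().duplicated()]

def test_duplicates_with_different_casing_are_removed():
    # The bug that slipped through: "Alice" and "alice" counted as two customers.
    df = pd.DataFrame({"customer": ["Alice", "alice", "Bob"]})
    assert len(drop_duplicate_customers(df)) == 2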

mikefochtman

I rewrote cookiecutter and turned it into a programmable configuration language. It's called `tackle` and I am almost done with the last rewrite before I start promoting it. It does everything cookiecutter does plus a ton of other features, like making modular code generators (e.g. a license module you can integrate into your code generator), flow control within the prompts (e.g. if you select one option, a subset of other options is exposed), and schema validation (not really related to cookiecutter, but the main use case of the tool). It's more comparable to Jsonnet, CUE, and Dhall, but you can easily make code generators like cookiecutter as well. Almost done with it (a 4-year-long personal project that I've rewritten 5 times) and hopefully it gets traction.
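For reference, a minimal sketch of generating a project skeleton with cookiecutter's Python API; the data-science template shown is just one popular example, not something endorsed in the video:

# pip install cookiecutter
from cookiecutter.main import cookiecutter

# Prompts for project name etc., then writes the generated project folder.
cookiecutter("https://github.com/drivendata/cookiecutter-data-science")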

robertcannon

The fastest data storage I have found with Python is Arrow, but I usually use CSV or JSON, although I have used quite a few databases. Also, I have been slowly learning tip number 5 over the past 10 years or so. Once I force myself to make code that others can use, I find myself being much more proficient, and I can see why the top coders generally seem to make tools that others can use in one way or another. Thanks!
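To make the trade-off concrete, a small sketch assuming pandas with pyarrow installed (file names are illustrative):

import pandas as pd

df = pd.DataFrame({"id": range(1_000_000), "value": 3.14})

# Text formats: human-readable, but slow to parse and dtypes are not preserved.
df.to_csv("data.csv", index=False)

# Arrow-backed binary formats: much faster round-trips, dtypes preserved.
df.to_feather("data.feather")    # Arrow IPC / Feather
df.to_parquet("data.parquet")    # Parquet, also good for long-term storage
restored = pd.read_parquet("data.parquet")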

williamduvall

Biggest tip is to combine Python scripts with notebooks. Notebooks allow for fast and visual exploration of the data / problem. Then move pieces of code to a ./lib folder and import them from there. And start adding tests. Most of the time you are performing the same operations on the data, so you can end up with a shared lib across multiple projects. And that is very handy when starting a new project.
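A minimal sketch of that layout, with hypothetical file and function names (nothing here comes from the video):

# Suggested layout:
#   project/
#     notebooks/exploration.ipynb
#     lib/__init__.py
#     lib/cleaning.py          <- code promoted out of the notebook
#     tests/test_cleaning.py

# lib/cleaning.py
import pandas as pd

def drop_missing_prices(df: pd.DataFrame) -> pd.DataFrame:
    """Drop rows without a price so downstream steps can rely on it."""
    return df.dropna(subset=["price"])

# In the notebook the same logic is then just imported:
#   from lib.cleaning import drop_missing_prices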

sergioquijanorey

Great video! As a data scientist myself, I would love to see you work through an example that uses something like MLflow. It's a very common tool at every DS job I've worked at, since it's open source but also part of many managed solutions. Specifically, I'd love to see how you build an MLflow client, how you structure experiments/runs, and when/where you feel it is best in the flow of an ML pipeline to log to/load from MLflow. Most MLflow tutorials I've seen are notebook-based, which is great for learning the basics, but there isn't much guidance out there on how to structure a project that leverages it.
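For anyone curious what that logging looks like outside a notebook, a minimal sketch using the open-source mlflow package (experiment, parameter, and metric names are made up):

import mlflow

# Runs are stored locally under ./mlruns by default; point this at a shared
# tracking server in a team setup.
# mlflow.set_tracking_uri("http://localhost:5000")
mlflow.set_experiment("price-model")

with mlflow.start_run(run_name="baseline"):
    mlflow.log_param("model_type", "random_forest")
    mlflow.log_param("n_estimators", 200)
    # ... train and evaluate the model here ...
    mlflow.log_metric("rmse", 12.3)  # placeholder value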

Vanessa-vzcw

I have had a semi-emotional experience listening to this :D I have been absolutely solitary on a project for the last 8 months, having decided at the start that instead of tackling it with my old knowledge, Excel and SQL, I would tackle it while learning Python along the way, with GPT.
Hearing these tips, I realize that through trial and error and logic I arrived at the same tips myself! The feeling is that I have not done so badly, but also that there is lots to learn, because this is just the base. Still, I have for sure got a foot in the door, even if just slightly.
Tip 1 - Common structure - I totally tried to apply a common structure, but then under some external stresses and pressures I crumbled, and it was visible in my code thereafter :/ Next time this is a must.
Tip 2 - Existing libraries - learning with GPT can be limiting. Specifically asking GPT for alternative libraries for the same solution, asking for pros and cons, etc., helps you cover as many issues as possible with fewer packages, or find the right package directly. GPT didn't point that out as much as it could have, so it's on you. But learning in more depth now, after identifying which packages I use all over, would help me understand what's under the hood.
E.g. a pipeline tool - when I started, I wanted something like this. The head dev in the office said nothing like it exists. So I made a non-GUI module of functions for handling data load and export for CSV, XLSX, and SQL - and now to see Taipy... uff, can't wait.
Tip 3 - Log results - as in the Excel days: copy after copy, different folders, different filenames - tracking every transformation exactly as you said, so you can backtrack for verification - plus it makes it easier to answer external questions as to how and why.
Tip 4 - that was just forced along the way: given different data sources and data formats, I simply refused to hard-code sources, so I dedicated a phase just to data loading and to how outputs anywhere in the whole project will be presented, stored, and displayed - whether in memory, SQL, or CSV - and logged where and when everything is stored for easier recall downstream... it was like a matrix.
Tip 5 - exactly: once you are done with code in a Jupyter notebook, pop it into Spyder and call it, whether as imported functions or, as I learned last week, by running .py files and having the results loaded back into the notebook. But I don't know what else can be done - the notebook is the sandbox; .py files are the aim.
Tip 6 - exactly again, it just had to happen - anything you start seeing repeat, think about the context, the occurrences, the variations - pop them into a centralized place for easier management - like lookup tables in Excel :) Just trial and error - but to think of it in advance is a game changer for bigger projects...
Tip 7 - no idea - the biggest wow moment for me so far, because right now I am being asked to hand over the code I wrote, and I am certain I left quite a few test break points in the code, with comments on what I changed and why, so the break point condition could be reverted to a different approach. I didn't want to delete any of my test points, but I had a feeling it would look terrible handing that over... and I am sure the way you talk about it is much neater than what I have formulated, so if there is an 'official approach' I would gladly revamp mine, because I literally just used user-changeable boolean variables such as sample_mode = True to trigger df.head(1000) throughout the script, for example...
Next week I find out which team they move me to, as I came in just to clean data and present results, and instead I now have code that takes any address, validates it, and proceeds to secondary data point evaluations to consolidate data across foreign sources, to spot what is missing and shouldn't be, but also to validate more granular data, in this case ownership and tax rate. It's the most intense piece of work I have done alone - every single step needed something created and addressed - no consistent normalization anywhere, in any field or data point, crucial or otherwise. The reality is that in the end, whatever you do, it's only as good as the data you get; but the stories and scenarios the data can tell afterwards - if more people were interested in that and understood the influence of data diligence, it would be like a civilization booster in awareness :D
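On the sample_mode point in Tip 7, a common alternative (a sketch with made-up names, not the video's 'official approach') is to make sampling a single explicit option instead of booleans scattered through the script:

import argparse
import pandas as pd

def load_data(path: str, sample: int | None = None) -> pd.DataFrame:
    """Load the input data, optionally keeping only the first `sample` rows."""
    df = pd.read_csv(path)
    return df.head(sample) if sample is not None else df

def main() -> None:
    parser = argparse.ArgumentParser()
    parser.add_argument("--input", required=True)
    parser.add_argument("--sample", type=int, default=None,
                        help="Only process the first N rows, for quick runs.")
    args = parser.parse_args()
    df = load_data(args.input, sample=args.sample)
    # ... rest of the pipeline ...

if __name__ == "__main__":
    main()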

ChrisTian-uwtq

# 6 is such a great tip. Here's how I do it:
- a YAML file shipped with the source code for the defaults
- (maybe a user-/machine-dependent YAML file for some test system)
- a YAML file alongside the data to override any values for that evaluation

YAML is great because it's human-readable and supports comments. So you can tell the user "put a config.yaml with the keys X, Y and Z next to the data and off you go".
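A minimal sketch of that layering, assuming PyYAML and made-up file names (note this is a shallow merge; nested keys would need a recursive merge):

import yaml
from pathlib import Path

def load_config(*paths: str) -> dict:
    """Merge YAML files left to right; later files override earlier keys."""
    config: dict = {}
    for path in paths:
        file = Path(path)
        if file.exists():
            config.update(yaml.safe_load(file.read_text()) or {})
    return config

# Defaults shipped with the code, optionally overridden by a file next to the data.
config = load_config("src/defaults.yaml", "data/config.yaml")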

TheSwissGabber

Some good points in the video to get people thinking about different aspects. Scalability leads to a temptation for some to use multiple APIs and thereby an API management tool which in turn costs time and increases the probability of a ML library being used like a sledgehammer to crack a nut _(especially if time lack inclines a planner to be avoidant of dependency tree challenges)._ No matter the scalability of a software system _(whilst not seeing it as something to be seen as "regardless" of scale)_ databasing to keep track can become a bottleneck, and so retries and dead letter queues are worth it. Your mention of workflows is wise and jumping from one database to another _(e.g. MySQL to Postgres)_ is very likely to incur thoughts spent on workflows for that very task. You can optimise all you like _(which is noble)_ but these days people are more incentivised to "build-things" and so somebody might pip that "optimiser person" to the post by throwing computer horsepower at the challenge, thereby forcing something to be big rather than scalable in ways other than unidirectionally. Logs tend to mean hash tables. There are advantages to storing in a database like the choices available for DHT versus round robin. Environmental variables can be for ENV metadata to set up a virtual machine. If you own that server, like you suggest, it's an extra thing to secure _(for example against IGMP snooping)._ Containers and sandboxes are an extra layer of security rather than a replacement for security. Multiple BSD jails for example can be managed with ansible for instance.
My comment has no hate in it and I do no harm. I am not appalled or afraid, boasting or envying or complaining... Just saying. Psalms23: Giving thanks and praise to the Lord and peace and love. Also, I'd say Matthew6.

obsoletepowercorrupts

I could comment something like this on all your videos, but it's too time-consuming, so: you are easily one of the best Python teachers I've known. My data science game has skyrocketed ever since I found your channel. I actually feel like a professional now as opposed to a newbie. From the bottom of my heart: thank you!

PS: If you ever have time, I'd appreciate an end-to-end, rigorous machine learning workflow where you cover environment setup, folder structure, OOP concepts, Pydantic, and deployment. I've looked for a video like this among yours but haven't been able to find one.

Lirim_K

One thing I've begun doing is using a flatter directory hierarchy and using SQLite to catalog file paths along with useful, project-specific metadata. This way I write a SQL query to pre-filter data by only fetching relevant Parquet file paths to pass into Dask for reading and analyzing.
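A minimal sketch of that catalog pattern, with a hypothetical table schema (assuming dask[dataframe] and pyarrow are installed):

import sqlite3
import dask.dataframe as dd

# Hypothetical catalog schema: files(path TEXT, region TEXT, year INTEGER)
with sqlite3.connect("catalog.db") as conn:
    rows = conn.execute(
        "SELECT path FROM files WHERE region = ? AND year >= ?", ("EU", 2022)
    ).fetchall()

paths = [path for (path,) in rows]

# Only the pre-filtered Parquet files are handed to Dask.
ddf = dd.read_parquet(paths)
print(ddf.describe().compute())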

askii

This was a really great video. I have been a data scientist for over 2 years, and it was great to see that I had already developed a habit of using some of these points (probably because of other videos of yours 😂❤), but I also learned something new!

Could you do the same thing for data engineering? That would be awesome!

thomasbrothaler

I’m a big fan of validating data I’m bringing in via Pandera. It is like defining a contract, and if the data coming in breaks that contract (data types, column names, nullability, validation checks, etc.) I want to know about it BEFORE I start processing it. I also use Pandera typing heavily to define my function arguments and return types, to make it clear that the data going in and out of my functions validates against a schema, which is way better than the generic "pd.DataFrame", e.g.:
def myfunc(df: DataFrame[my_input_schema]) -> DataFrame[my_output_schema]:
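For context, a minimal runnable sketch of that pattern with made-up schemas (recent pandera versions use DataFrameModel; older ones call it SchemaModel):

import pandera as pa
from pandera.typing import DataFrame, Series

class InputSchema(pa.DataFrameModel):
    customer_id: Series[int] = pa.Field(ge=0)
    price: Series[float] = pa.Field(nullable=False)

class OutputSchema(InputSchema):
    price_with_tax: Series[float]

@pa.check_types  # validates input and output frames against the schemas
def add_tax(df: DataFrame[InputSchema]) -> DataFrame[OutputSchema]:
    return df.assign(price_with_tax=df["price"] * 1.21)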

JeremyLangdon

Love these videos! I found your channel from your Hydra config tutorial and all of your videos have been full of invaluable knowledge that I've already been using in my projects at work! Thank you and keep it up!

DaveGaming

Great vid! Would love to see your approach to doing a data project, e.g. download data, use Airflow to process it, train a model, and host it via an API endpoint.

chazzwhu

In defense of notebooks: the exploratory aspect they allow makes it really nice and quick to find problems within the data. You can look deeper at the objects or dataframes at the point where it halts. If there were a way to combine this ability with a well-structured set of scripts, it would be fantastic.

d_b_

Official request for a full Taipy video 🙌

haldanesghost

As an intermediate data storage format, I typically use DuckDB because I'm quite comfortable with SQL, and DuckDB allows me to query large sets of data very quickly.
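A minimal sketch of that workflow, assuming the duckdb package and an illustrative Parquet file name:

import duckdb

# DuckDB queries Parquet/CSV files in place, so intermediate results can stay on disk.
con = duckdb.connect("intermediate.duckdb")  # or duckdb.connect() for in-memory
result = con.execute(
    "SELECT region, avg(price) AS avg_price "
    "FROM 'cleaned_orders.parquet' "
    "GROUP BY region"
).df()  # fetch the result as a pandas DataFrame
print(result)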

spicybaguette

Great video. I feel data science projects are rarely examined in terms of design/structure quality, and I hope to see more videos about it in the future. Perhaps one on writing tests, as I sometimes lack ideas about how to test data science code.
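One common starting point (a sketch with hypothetical names, not from the video) is to unit test each transformation function on a tiny hand-written frame and compare against the expected output:

import pandas as pd
from pandas.testing import assert_frame_equal

def add_price_per_unit(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical transformation under test."""
    return df.assign(price_per_unit=df["price"] / df["quantity"])

def test_add_price_per_unit():
    df = pd.DataFrame({"price": [10.0, 9.0], "quantity": [2, 3]})
    expected = df.assign(price_per_unit=[5.0, 3.0])
    assert_frame_equal(add_price_per_unit(df), expected)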

FlisB

Amazing video! I am very happy to notice that these are the bits of advice I have been pushing for in my work environments. I hope these things become the norm soon.

rafaelagd