Scale EDA & ML Workloads To Clusters & Back With Dask | PyData Chicago January 2022 Meetup

Speaker: Gus Cavanaugh

Speaker's Bio: "Big Data" & "The Cloud" promised me infinite scale. But that's not what I found when I stumbled onto a Hadoop cluster after college. What seemed so simple when the architects at my big consulting employer got out the whiteboard became much less so when I had my hands on the keyboard. I found solace in Python, specifically the Anaconda distribution, which I could run on the most archaic Windows workstation or a cluster of Linux servers. Eventually, I switched from consulting to software, where I thought I was helping companies deploy data science platforms, but I really spent my time as an unpaid AWS/Azure consultant fighting with Kubernetes. I recently reunited with former Anaconda colleagues at Coiled, where we provide software and support for commercial and community users of Dask.

Abstract: While "Big Data" may be an overhyped buzzword, it's not uncommon for Python users to end up with more data than can fit on their laptops. Sampling is great, but sometimes you need to process everything. In the past, Python users didn't have much choice beyond Spark (and the fact that most data lakes were HDFS made it the standard option). But today, even the stodgiest enterprises have migrated a ton of data to cheap blob storage in the cloud. This has freed Python users from the misery of the JVM (I mean, hey, it's way better to see a Python error than a JVM stack trace, right?). As a result, tools like Dask make it much easier to scale the tools Python users love, e.g., NumPy, Pandas, Sklearn. In this talk, you'll learn how to scale your PyData workloads with minimal code changes using Dask so that you can focus on your work without having to learn a new API.

PyData is an educational program of NumFOCUS, a 501(c)(3) non-profit organization in the United States. PyData provides a forum for the international community of users and developers of data analysis tools to share ideas and learn from each other. The global PyData network promotes discussion of best practices, new approaches, and emerging technologies for data management, processing, analytics, and visualization. PyData communities approach data science using many languages, including (but not limited to) Python, Julia, and R.

PyData conferences aim to be accessible and community-driven, with novice to advanced level presentations. PyData tutorials and talks bring attendees the latest project features along with cutting-edge use cases.


Comments

I wish someone would make a good tutorial on advanced data manipulation using Dask, which is severely lacking in the community.

korrapatisrujan

Great talk, thanks. Shame there wasn't more time.

cerioscha