Scale EDA & ML Workloads To Clusters & Back With Dask | PyData Chicago January 2022 Meetup


Speaker: Gus Cavanaugh

Speaker's Bio: "Big Data" & "The Cloud" promised me infinite scale. But that's not what I found when I stumbled onto a Hadoop cluster after college. What seemed so simple when the architects at my big consulting employer got out the whiteboard became much less so when I had my hands on the keyboard. I found solace in Python, specifically the Anaconda distribution, which I could run on the most archaic Windows workstation or on a cluster of Linux servers. Eventually, I switched from consulting to software, where I thought I was helping companies deploy data science platforms, but I really spent my time as an unpaid AWS/Azure consultant fighting with Kubernetes. I recently reunited with former Anaconda colleagues at Coiled, where we provide software and support for commercial and community users of Dask.

Abstract: While "Big Data" may be an overhyped buzzword, it's not uncommon for Python users to end up with more data than can fit on their laptops. Sampling is great, but sometimes you need to process everything. In the past, Python users didn't have much choice beyond Spark (and the fact that most data lakes were HDFS made it the standard option). But today, even the stodgiest enterprises have migrated a ton of data to cheap blob storage in the cloud. This has freed Python users from the misery of the JVM (I mean, hey, it's way better to see a Python error than a JVM stack trace, right?). As a result, tools like Dask make it much easier to scale the tools Python users love, e.g., NumPy, pandas, and scikit-learn. In this talk, you'll learn how to scale your PyData workloads with minimal code changes using Dask, so that you can focus on your work without having to learn a new API.

PyData is an educational program of NumFOCUS, a 501(c)(3) non-profit organization in the United States. PyData provides a forum for the international community of users and developers of data analysis tools to share ideas and learn from each other. The global PyData network promotes discussion of best practices, new approaches, and emerging technologies for data management, processing, analytics, and visualization. PyData communities approach data science using many languages, including (but not limited to) Python, Julia, and R.

PyData conferences aim to be accessible and community-driven, with novice to advanced level presentations. PyData tutorials and talks bring attendees the latest project features along with cutting-edge use cases.

00:00 Welcome!
00:10 Help us add time stamps or captions to this video! See the description for details.


Comments

I wish someone made a good tutorial with advanced data manipulation using dask which is severely lacking in the community

korrapatisrujan

Great talk, thanks. Shame there wasn't more time.

cerioscha

