How to Use DVC for Applications in ML Drug Discovery Pipelines | Estefania Barreto-Ojeda | PyData NYC
Community member Estefania Barreto-Ojeda shares how they use DVC at Cyclica in ML drug discovery pipelines. This talk was originally given at PyData NYC (@PyDataTV) in the fall of 2022.
Developing Machine Learning (ML) pipelines for drug discovery poses different challenges from those in traditional software development. In addition to unique challenges during the data engineering stage, drug discovery pipelines require not only standard Git tracking for source code but also versioning of data and ML models. In this talk, we will discuss some of the main challenges of working with biological data and how Data Version Control (DVC) tools facilitate data and model tracking during the development of ML drug discovery pipelines.
PyData is an educational program of NumFOCUS, a 501(c)(3) non-profit organization in the United States. PyData provides a forum for the international community of users and developers of data analysis tools to share ideas and learn from each other. The global PyData network promotes discussion of best practices, new approaches, and emerging technologies for data management, processing, analytics, and visualization. PyData communities approach data science using many languages, including (but not limited to) Python, Julia, and R.
PyData conferences aim to be accessible and community-driven, with novice to advanced-level presentations. PyData tutorials and talks bring attendees the latest project features along with cutting-edge use cases.
00:00 Welcome!
01:18 Overview.
02:03 Part I: Biological data.
03:13 Complexity of biological data. What makes biological data different?
12:49 Overview of ML drug discovery pipelines.
15:20 Challenges in ML drug discovery pipelines.
17:12 Part II: Implementing data version control in Drug Discovery pipelines.
17:20 Introduction to DVC.
19:13 Installing and initializing DVC.
21:24 Set DVC remote.
22:36 Versioning files with DVC. What does dvc add do?
25:21 Implementing DVC in Drug Discovery pipelines - Demo.
27:47 Data versioning.
28:17 Build a DVC ML pipeline.
28:30 Build a DVC ML pipeline - Featurization stage.
32:28 Initial Directed Acyclic Graph (DAG).
32:50 Build a DVC ML pipeline - Processing stage.
34:12 Running ML pipelines with dvc repro.
35:48 Build a DVC ML pipeline - Training+Metrics stage.
38:42 Final DAG.
40:07 Highlights.
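
The 22:36 chapter asks what dvc add does. As a rough illustration of the underlying idea (content-addressed storage plus a small pointer file that Git can track), here is a minimal Python sketch. It is not DVC's implementation: the helper names (`file_md5`, `add_to_cache`), the `.pointer` suffix, the cache layout, and the sample file `compounds.csv` are all assumptions made for this example. The real tool writes a `.dvc` metafile and manages its own cache.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def file_md5(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large datasets need not fit in memory."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def add_to_cache(data_file: Path, cache_dir: Path) -> Path:
    """Copy data_file into a content-addressed cache and write a pointer file.

    Hypothetical helper: it mimics the idea behind `dvc add`, not its code.
    """
    checksum = file_md5(data_file)
    # Shard by the first two hex characters to keep directories small.
    cached = cache_dir / checksum[:2] / checksum[2:]
    cached.parent.mkdir(parents=True, exist_ok=True)
    if not cached.exists():  # identical content is stored only once
        shutil.copy2(data_file, cached)
    # The small pointer file is what would be committed to Git.
    pointer = data_file.with_name(data_file.name + ".pointer")
    pointer.write_text(f"md5: {checksum}\npath: {data_file.name}\n")
    return pointer


def demo() -> dict:
    """Version a made-up dataset twice and report what the cache holds."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        data = root / "compounds.csv"  # made-up file name and contents
        cache = root / "cache"
        data.write_text("smiles,label\nCCO,1\n")
        first = add_to_cache(data, cache).read_text()
        data.write_text("smiles,label\nCCO,1\nCCN,0\n")  # a new data version
        second = add_to_cache(data, cache).read_text()
        cached_files = sum(1 for p in cache.rglob("*") if p.is_file())
        return {"first": first, "second": second, "cached_files": cached_files}


if __name__ == "__main__":
    print(demo())
```

Each version of the data ends up as a separate object in the cache, while the pointer file changes by only one hash. That is why the pointer is cheap to commit to Git alongside the source code.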
To learn more about Iterative's open-source and SaaS tools please visit:
#dvc #machinelearning #datascience