Simplifying AI integration on Apache Spark

Spark is an ETL and data processing engine especially suited for big data. Most organizations have different teams working with different languages, frameworks, and libraries, which need to be integrated into ETL pipelines or general data processing. For example, a Spark ETL job may be written in Scala by the data engineering team, but there is a need to integrate a machine learning solution written in Python or R by the data science team. These kinds of solutions are not straightforward to integrate with the Spark engine, and they require a great amount of collaboration between teams, increasing overall project time and cost. Furthermore, these solutions keep changing and upgrading over time, adopting newer versions of the underlying technologies and improved designs and implementations; this is especially true in the machine learning domain, where ML models and algorithms keep improving with new data and new approaches. As a result, there is significant downtime involved in integrating each upgraded version.

In this talk we will discuss how Informatica integrates AI solutions into data processing pipelines executing on top of Spark, along with the following major features:
1. The data science team can easily share AI/ML solutions created using any library, language, or framework.
2. A shared AI/ML solution can be easily consumed in the Spark pipeline.
3. Using Informatica products, customers can build the Spark pipeline with the selected solution(s) via drag and drop.
4. Different teams can continuously integrate and deploy (CI/CD) their solutions with minimal downtime.

In conclusion, we will see how different teams (such as data scientists and data engineers) can integrate their work, thereby reducing the time and cost spent on collaboration.

We will also see how CI/CD is achieved on Spark with minimal downtime while integrating various projects, especially AI/ML projects, using Informatica products.

Thus, by using these features, such as drag-and-drop creation of Spark pipelines, minimal cross-team collaboration overhead, and CI/CD, organizations can drastically reduce overall project completion time and cost.

About:
Databricks provides a unified data analytics platform, powered by Apache Spark™, that accelerates innovation by unifying data science, engineering and business.

Comments
Author

In a nutshell, it's CI/CD in action with the DEI product (with the AI Transformation, which handles the lifecycle of AI code).

kanishkachauhan
Author

May I know how we can integrate an RL-based scheduler with Spark? Also, is there any way to submit a single node to the master node in Spark? Thank you.

UniverseGames