Applying Multiple ML Pipelines to Heterogeneous Data Streams


"Spark ML Pipelines provide a comprehensive framework for predictive modeling, including feature engineering, batch model training, and real-time predictions based on streams of data. For example, a model predicting the likelihood of cart abandonment may be trained periodically using features based on the Web activity of customers and applied to a stream of Web events to make real-time predictions for live users. However, in multi-tenant environments where streams contain events from different sources, applying ML Pipelines becomes difficult. Even though the pipeline paradigm can be applied to model training using datasets that contain events separated by source, generating real-time predictions in Spark Streaming poses multiple challenges, since a single micro-batch contains events that require evaluation of different pipelines.

In this talk we will show how Altocloud applies Spark Pipelines to train hundreds of predictive models and to enable real-time predictions on high-throughput heterogeneous data streams. In particular we will focus on:
1. Training multiple models for activity streams from different sources.
2. Applying these models in real time to a heterogeneous stream of events containing behavioural data for millions of users.
3. Automated training, validation, selection, and deployment of multiple predictive models in a multi-tenant environment at scale.

With Gevorg Soghomonyan (Altocloud)
Maciej Dabrowski (Altocloud)

Session hashtag: #EUds4"
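
The core difficulty named in the abstract, a single micro-batch whose events need different pipelines, can be illustrated with a minimal sketch. This is not Altocloud's implementation and it does not use Spark; it is plain Python, and the names `predict_micro_batch`, `source`, `pages_viewed`, `shop_a`, and `shop_b` are hypothetical, as are the toy scoring functions standing in for trained models. It only shows the routing idea: group events by source, score each group with that source's model, and return predictions in the original event order.

```python
from collections import defaultdict
from typing import Any, Callable, Dict, List, Optional

Event = Dict[str, Any]
# A "model" here is any callable that scores a list of events (hypothetical stand-in
# for a fitted pipeline).
Model = Callable[[List[Event]], List[float]]


def predict_micro_batch(
    events: List[Event], models: Dict[str, Model]
) -> List[Optional[float]]:
    """Score a mixed batch by routing each event to the model for its source."""
    # Remember the positions of each source's events so the output order is preserved.
    positions_by_source: Dict[str, List[int]] = defaultdict(list)
    for position, event in enumerate(events):
        positions_by_source[event["source"]].append(position)

    predictions: List[Optional[float]] = [None] * len(events)
    for source, positions in positions_by_source.items():
        model = models.get(source)
        if model is None:
            # No model exists for this source; its events stay unscored (None).
            continue
        scores = model([events[p] for p in positions])
        for position, score in zip(positions, scores):
            predictions[position] = score
    return predictions


# Toy per-source "models" (assumptions for illustration only).
models: Dict[str, Model] = {
    "shop_a": lambda batch: [min(1.0, e["pages_viewed"] / 10) for e in batch],
    "shop_b": lambda batch: [0.5 for _ in batch],
}

micro_batch: List[Event] = [
    {"source": "shop_a", "pages_viewed": 5},
    {"source": "shop_b", "pages_viewed": 2},
    {"source": "shop_a", "pages_viewed": 20},
    {"source": "shop_c", "pages_viewed": 1},  # no model for this source
]

result = predict_micro_batch(micro_batch, models)
print(result)
```

Scoring each source's events as one group keeps a single model invocation per source per batch, which is one plausible way to avoid evaluating every pipeline against every event.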

About: Databricks provides a unified data analytics platform, powered by Apache Spark™, that accelerates innovation by unifying data science, engineering and business.
