Real-time Feature Engineering with Apache Spark Streaming and Hof


Feature stores for machine learning (ML) are a new class of data platform for the organization, governance, and sharing of features within enterprises. A typical feature store is a dual-database architecture: pre-computed features for training are stored in a scalable SQL platform (Delta Lake, Apache Hudi, Apache Hive), while features served to online applications are stored in a low-latency database or key-value store (MySQL Cluster (NDB), Cassandra, or Redis). Feature stores, however, do not provide a solution for real-time features (such as user-entered data or machine-generated data) that cannot be pre-computed or cached. If the feature engineering code that transforms the raw data into features is embedded in applications, it may need to be duplicated outside the application, in the pipelines that generate training data.
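The dual-database idea can be illustrated with a minimal sketch. It uses plain Python containers as stand-ins: a list plays the role of the offline store and a dict plays the role of the online key-value store. The class name `DualFeatureStore`, its methods, and the feature `purchases_7d` are hypothetical and are not part of any feature store's API.

```python
from dataclasses import dataclass, field


@dataclass
class DualFeatureStore:
    """Toy stand-in for a dual-database feature store (hypothetical API)."""

    # Append-only history of feature rows, as an offline store would keep for training.
    offline_rows: list = field(default_factory=list)
    # Most recently written feature values per entity, as an online store would serve.
    online_latest: dict = field(default_factory=dict)

    def write(self, entity_id, features, event_time):
        """Write one set of pre-computed features to both stores."""
        self.offline_rows.append(
            {"entity_id": entity_id, "event_time": event_time, **features}
        )
        self.online_latest[entity_id] = dict(features)

    def training_rows(self):
        """Offline path: return the full history for building training data."""
        return list(self.offline_rows)

    def serve(self, entity_id):
        """Online path: return the latest features for one entity, or None."""
        return self.online_latest.get(entity_id)


store = DualFeatureStore()
store.write("user_1", {"purchases_7d": 3}, event_time=1)
store.write("user_1", {"purchases_7d": 5}, event_time=2)
```

Here `store.training_rows()` keeps both historical rows, while `store.serve("user_1")` returns only the most recently written values. Neither path can produce a feature that depends on data arriving with the request itself, which is the gap described above.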

In this talk, we introduce Hof (Hopsworks real-time feature engineering), which transforms raw data into features at low latency and at scale using Apache Spark Streaming, Pandas UDFs, and PyArrow. Applications use Hof by sending raw data to an HTTP or gRPC endpoint and receiving the engineered features in response, before sending the full feature vector to the model for prediction. Hof enables the real-time feature engineering pipeline to be reused across both real-time and offline use cases (when creating training data for the same features). Hof can also enrich real-time features and build complete feature vectors by joining real-time features with features from the online feature store. We will show how the core feature store principles can be extended to real-time feature engineering: code tracking, feature pipeline reuse, ensuring the consistency of features between training and serving, and automated metadata and statistics for features. Finally, we will show how the Hof architecture enables real-time features to be debugged, audited, and saved for reuse in training models.
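The reuse principle described above can be sketched in a few lines. This is a plain-Python illustration under stated assumptions, not Hof's implementation or API: it does not use Spark Streaming, Pandas UDFs, or an HTTP/gRPC endpoint, and the names `engineer_features`, `handle_request`, `build_training_rows`, `ONLINE_STORE`, and the example features are all hypothetical.

```python
import math


def engineer_features(raw):
    """Single transformation shared by serving and training (hypothetical features)."""
    amount = float(raw["amount"])
    return {
        "log_amount": math.log1p(amount),
        "is_foreign": int(raw["country"] != raw["home_country"]),
    }


# Stand-in for the online feature store: pre-computed features keyed by user id.
ONLINE_STORE = {"user_1": {"purchases_7d": 5}}


def handle_request(raw):
    """Real-time path: raw data in, complete feature vector out."""
    realtime = engineer_features(raw)
    precomputed = ONLINE_STORE.get(raw["user_id"], {})
    # Join real-time features with pre-computed ones into one feature vector.
    return {**precomputed, **realtime}


def build_training_rows(raw_events):
    """Offline path: the same function applied to historical raw events."""
    return [engineer_features(event) for event in raw_events]


event = {"user_id": "user_1", "amount": 0.0, "country": "SE", "home_country": "US"}
vector = handle_request(event)
training = build_training_rows([event])
```

Because both paths call the same `engineer_features` function, a feature computed at serving time matches the value computed for training from the same raw event, which is the training/serving consistency the talk refers to.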

About:
Databricks provides a unified data analytics platform, powered by Apache Spark™, that accelerates innovation by unifying data science, engineering and business.