Best Practices for Engineering Production-Ready Software with Apache Spark


Notebooks are a great tool for Big Data. They have drastically changed the way scientists and engineers develop and share ideas. However, most world-class Spark products cannot be easily engineered, tested and deployed just by modifying or combining notebooks. Taking a prototype to production with high quality typically requires proper software engineering. The code we develop on such larger-scale projects must be modular, robust, readable, testable, reusable and performant. At Montevideo Labs we have many years of experience helping our clients architect large Spark systems capable of processing data at petabyte scale. In previous Spark Summits, we described how we productionized an unattended Machine Learning system in Spark that trains thousands of ML models daily and deploys them for real-time serving at extremely low latency. In this session, we will share lessons learned from taking other Spark products to production at top US tech companies.

Throughout the session we will address the following questions along with the relevant best practices: How do you make your Spark code readable, debuggable, reusable and testable? How do you architect Spark components for different processing schemes, such as batch ETL, low-latency services and model serving? How do you package and deploy Spark applications to the cloud? In particular, we will take a deep dive into how to take advantage of Spark's laziness (and DAG generation) to structure our code according to sound software engineering practices, without letting efficiency concerns dictate that structure. Because transformations are only recorded until an action triggers execution, splitting logic into many small steps does not by itself add extra passes over the data. Instead of focusing only on code efficiency when structuring our Spark code, we can leverage this laziness to follow the best software patterns and principles and write elegant, testable and highly maintainable code. Moreover, we can encapsulate Spark-specific code in classes and utilities and keep our business rules cleaner. We will support the presentation with live demos that illustrate the concepts introduced.
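As a rough illustration of the laziness idea, here is a minimal sketch in plain Python, using only the standard library and no Spark at all. `LazyDataset`, `only_completed`, `with_total` and the sample order rows are hypothetical names invented for this example; they are not from the talk and not part of any Spark API. The `transform` method only loosely mirrors the shape of chaining small functions over a DataFrame: each call records a step, and nothing runs until `collect()` is called.

```python
from dataclasses import dataclass
from typing import Callable, Iterator, Tuple

Row = dict
Step = Callable[[Iterator[Row]], Iterator[Row]]


@dataclass(frozen=True)
class LazyDataset:
    """Toy stand-in for a lazily evaluated dataset (hypothetical, not Spark)."""
    source: Tuple[Row, ...]
    plan: Tuple[Step, ...] = ()

    def transform(self, step: Step) -> "LazyDataset":
        # Only records the step in the plan; no rows are processed here.
        return LazyDataset(self.source, self.plan + (step,))

    def collect(self) -> list:
        # The "action": run every recorded step, in order, over the source.
        rows: Iterator[Row] = iter(self.source)
        for step in self.plan:
            rows = step(rows)
        return list(rows)


# Business rules kept as small functions that can be unit tested on their own.
def only_completed(rows: Iterator[Row]) -> Iterator[Row]:
    return (r for r in rows if r["status"] == "completed")


def with_total(rows: Iterator[Row]) -> Iterator[Row]:
    return ({**r, "total": r["price"] * r["qty"]} for r in rows)


orders = LazyDataset((
    {"id": 1, "status": "completed", "price": 10.0, "qty": 2},
    {"id": 2, "status": "cancelled", "price": 5.0, "qty": 1},
))

# Composing the steps builds a plan; execution is deferred until collect().
pipeline = orders.transform(only_completed).transform(with_total)
result = pipeline.collect()
```

The design point is that the pipeline is organised around readable, named steps rather than one large hand-fused function. In this toy version the steps are chained generators, so the data is still traversed once; in Spark, the analogous work of combining steps is left to the engine when it builds the execution plan.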

About:
Databricks provides a unified data analytics platform, powered by Apache Spark™, that accelerates innovation by unifying data science, engineering and business.



Comments

Thank you for sharing the information.

howzattt

Great example! Does anyone know of something like this but using Python DataFrames and Delta tables, and maybe more focused on a data engineering perspective?

julsgranados
