Python Tutorial: Introduction to Spark SQL in Python

----
Hello and welcome to this lesson about Spark SQL.

Spark provides a programming abstraction called DataFrames and can also act as a distributed SQL query engine.

In this lesson, we will create a SQL table from a dataframe and then query it.

If the first line of the data gives the column names, set the header argument to True when loading the file.
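For example, here is a minimal sketch, assuming `spark` is the SparkSession that Spark environments conventionally provide (more on this below) and “trainsched.txt” is a hypothetical file name:

```python
# Read delimited text; the first row of the file supplies the column names.
df = spark.read.csv("trainsched.txt", header=True)
```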

One way to see the column names of a table is to use the query “SELECT * FROM table LIMIT 0”, which returns the columns but no rows.

Another way is the “DESCRIBE table” query.
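Both approaches look like this in practice; the table name “schedule” is hypothetical:

```python
# Returns a DataFrame with zero rows, but the column names are preserved.
cols = spark.sql("SELECT * FROM schedule LIMIT 0").columns

# Returns the column names along with their data types.
spark.sql("DESCRIBE schedule").show()
```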

The dataframe is a fundamental data abstraction in Spark.

A Spark DataFrame is a distributed collection of data organized into named columns.

It is conceptually equivalent to a table in a relational database; such data is often simply called “tabular” data.

We could have two dataframes that have the same columns and types but contain different data.

We could then concatenate the rows of data in these two tables into a single dataframe, as in the sketch below.
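Here is a minimal sketch of such a concatenation; the DataFrames and their contents are invented for illustration:

```python
# Two hypothetical DataFrames sharing the same column names and types.
df1 = spark.createDataFrame([(1, "express")], ["train_id", "service"])
df2 = spark.createDataFrame([(2, "local")], ["train_id", "service"])

# union concatenates the rows of the two DataFrames into one.
combined = df1.union(df2)
combined.show()
```

Recall that a Spark DataFrame is a distributed collection of data organized into named columns. What do we mean by “distributed”?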

Spark can split this dataset into parts and then store each part on a different server.

In this case, Spark is partitioning the data and distributing it automatically, on our behalf. This is one technique that Spark uses to handle large datasets, even though each server may not have enough storage to hold the entire dataset on its own. We can peek at this partitioning, as the sketch below shows.
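Here is a small illustration, assuming `spark` is the active SparkSession; the numbers are arbitrary:

```python
df = spark.range(1000)            # a simple one-column DataFrame of numbers
print(df.rdd.getNumPartitions())  # how many parts Spark split the data into
df8 = df.repartition(8)           # ask Spark to redistribute it into 8 parts
```

What’s more, Spark allows us to treat a dataframe like a table, and query it using SQL.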

SQL stands for “Structured Query Language”. A query tells the computer what to fetch.

What’s useful about a Spark SQL table is that it lets us take the data in a dataframe, namely a distributed collection of rows with named columns, treat it as a single table, and fetch data from it using a SQL query.
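In code, that looks something like the following sketch; the view name “table1” is illustrative, and `spark` is the SparkSession discussed next:

```python
# Expose the DataFrame to SQL under the (hypothetical) name "table1".
df.createOrReplaceTempView("table1")

# Fetch data from it with an ordinary SQL query.
spark.sql("SELECT * FROM table1").show()
```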

We often use an instance of a SparkSession object. By convention, this is provided in a variable called "spark". Some environments, such as the PySpark shell, automatically provide an instance of a SparkSession.
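Where a session is not provided automatically, we can create one, or reuse an existing one, like this:

```python
from pyspark.sql import SparkSession

# getOrCreate() returns the active session if one exists, else builds a new one.
spark = SparkSession.builder.getOrCreate()
```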

The following Spark command reads delimited text data into a dataframe from a file. One of its options allows it to use the first row to define the names of the columns.

It automatically splits each row into columns using the delimiter, which is a comma by default but can be changed.
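A sketch of such a command, with a hypothetical file name:

```python
df = spark.read.csv("trainsched.txt",
                    header=True,  # use the first row for column names
                    sep=",")      # the delimiter; a comma is the default
```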

Let’s load some data into a dataframe, convert it into a SQL table and query it.
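Putting the steps together, a minimal end-to-end sketch (the file and table names are hypothetical):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.read.csv("trainsched.txt", header=True)  # load into a dataframe
df.createOrReplaceTempView("schedule")              # convert it into a SQL table
spark.sql("SELECT * FROM schedule LIMIT 5").show()  # query it
```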