When we talk about data processing in Python, we immediately think of the pandas library. However, when dealing with very large datasets, computations become too slow. Fortunately, there is another Python library, very similar to pandas, that allows the processing of very large amounts of data: PySpark. In this article, we will present the central elements of Spark, starting with RDDs, the most basic Spark structure. We will then study the DataFrame type, a richer structure than RDDs that is optimized for machine learning.
What is Apache Spark?
Apache Spark is an open-source framework developed by UC Berkeley's AMPLab that allows the processing of large datasets using distributed computing, a technique in which the work of a single job is spread across multiple computation units organized in clusters in order to reduce the execution time of a query.
Spark was developed in Scala and is at its best in its native language. However, the PySpark library allows it to be used with the Python language while maintaining similar performance to Scala implementations.
PySpark is therefore a good alternative to the pandas library when the datasets to process are so large that computations become time-consuming.
How is PySpark structured?
First of all, it is important to understand the basis of how Spark works.
When you interact with Spark through PySpark, you send instructions to the Driver, which coordinates all operations. You communicate with the Driver through a SparkContext object, which distributes the different computations across the cluster.
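As a minimal sketch, here is how a SparkContext might be created for local development; the application name and the `local[*]` master URL are illustrative choices, not requirements:

```python
from pyspark import SparkConf, SparkContext

# Illustrative configuration: run locally using all available cores.
conf = SparkConf().setAppName("intro_pyspark").setMaster("local[*]")
sc = SparkContext(conf=conf)

print(sc.version)  # version of Spark the Driver is running

sc.stop()  # release the context once the job is finished
```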
The big advantage of Spark is that your code is completely independent of the cluster behind the SparkContext, so you can develop your code locally on any machine.
What is a Resilient Distributed Dataset (RDD)?
An RDD is Spark's representation of a data table. It is a distributed collection of elements that can contain tuples, dictionaries, lists…
The strength of an RDD lies in its lazy evaluation of the code: computations are postponed until they are absolutely necessary.
For example, when importing a file, only a pointer to it is created. It is really only at the last moment, when you are looking to display or use a result, that the calculation is done.
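To make this concrete, here is a small sketch of lazy evaluation with an RDD; the file path is purely hypothetical:

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "lazy_demo")

# Hypothetical file: at this point nothing is read,
# Spark only keeps a pointer to the file.
lines = sc.textFile("data/app_logs.txt")

# Transformations are lazy as well: this only extends the execution plan.
errors = lines.filter(lambda line: "ERROR" in line)

# The file is actually read only when an action is triggered.
print(errors.count())
```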
To go further in the handling of RDDs, you can refer to the official RDD documentation.
An RDD is read line by line, which makes it effective for processing text files (counting the number of occurrences of each word in the full text of Les Misérables, for example), but it is an unsuitable structure for column-wise computations.
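As an illustration, here is the classic word-count pattern on an RDD, reusing the SparkContext `sc` from the previous snippet; the file name is a placeholder for any plain-text file:

```python
# Each line of the text file becomes an element of the RDD.
word_counts = (
    sc.textFile("les_miserables.txt")
      .flatMap(lambda line: line.lower().split())   # split lines into words
      .map(lambda word: (word, 1))                  # pair each word with a count of 1
      .reduceByKey(lambda a, b: a + b)              # sum the counts per word
)

# takeOrdered is an action: the whole computation runs here.
print(word_counts.takeOrdered(10, key=lambda pair: -pair[1]))
```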
To do Machine Learning, we need to introduce a new structure: DataFrames.
The PySpark DataFrame
The PySpark DataFrame is the structure best optimized for machine learning. It is built on top of RDDs but is organized into columns as well as rows, like an SQL table. Its shape is inspired by the DataFrame of the pandas library.
Thanks to the DataFrame structure, we can perform efficient computations through a familiar syntax (similar to pandas), avoiding the cost of learning a new functional language such as Scala.
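As a brief sketch (the data and column names are invented for the example), a PySpark DataFrame can be created and queried with a pandas-like syntax:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("dataframe_demo").getOrCreate()

# Small in-memory dataset for illustration; in practice you would rather
# read a file, e.g. with spark.read.csv(...) or spark.read.parquet(...).
df = spark.createDataFrame(
    [("Alice", 34, 2500.0), ("Bob", 45, 3100.0), ("Chloe", 29, 2800.0)],
    ["name", "age", "salary"],
)

# Column-wise operations, expressed much like in pandas.
(df.filter(F.col("age") > 30)
   .select(F.avg("salary").alias("avg_salary"))
   .show())
```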
Spark SQL is a Spark module that allows you to work on structured data. It is therefore within this module that the Spark DataFrame was developed.
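For instance, still under the assumptions of the previous snippet, the same DataFrame can be registered as a temporary view and queried in plain SQL:

```python
# Expose the DataFrame to the SQL engine under a temporary name.
df.createOrReplaceTempView("employees")

# The query runs through the same optimized engine as the DataFrame API.
spark.sql("""
    SELECT name, salary
    FROM employees
    WHERE age > 30
    ORDER BY salary DESC
""").show()
```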
Spark SQL has fairly rich single-page documentation, with both examples and explanations. Unlike much of what you can find on the internet, this documentation is kept permanently up to date with the latest version of Spark.
This article is just an introduction to the main concepts of PySpark. Our training courses contain an entire module on learning this essential tool for handling big data. If you want to master it, let yourself be tempted by one of our data science training courses.