Full development lifecycle for PySpark data flows using Databricks on AWS.
In this blog we will focus on creating the project skeleton for a PySpark job, test framework, automating the build with GitLab CI, and deploying the jobs in production with Databricks Docker images on AWS. Phew! that's a lot!
This can be considered Part 2 of my earlier blog post where we discussed the collaborative development of a PySpark/ Keras data flow using Databricks Notebooks.
Now we will see how such a data flow can be made production-ready.
Install pip and virtualenv
python3 -m pip install --user --upgrade pip python3 -m pip install --user virtualenv
Creating the Job and test case
The notebook we discussed in Part 1 can be broken down into at least 2 Python jobs. For simplicity’s sake we will consider a small part of it until the first checkpoint discussed in Part1. This will just convert the CSV file to parquet.
Let's also consider the test case. Not going into detail of every aspect of the test class, it will do a simple assertion to check if data exist in the parquet file. Here the tmp_path_factory is a session-scoped fixture that can be used to create arbitrary temporary directories from any other fixture or test, this is automatically created by the test framework. The spark session is also a session-scoped fixture that we create.
Setting up the test framework is simple. Let's see the steps involved.
This is the contents of my requirements file for the test stage. Not all of these packages are used for this example but you might most probably use Pandas and mocks for some of your test cases.
# requirements-test.in pyspark~=2.4.2 mock==4.0.2 pytest~=5.2.2 pytest-mock~=1.11.2 pandas==0.25.3 wheel==0.34.2
A simple Makefile.txt looks like the following.
We can define the fixture functions in the conftest.py to make them accessible across multiple test files like the spark session defined here.
This is how the test package structure will look like. Notice here we have copied a small sample of the data to a file sample_data.csv.
Before running the test framework we need to first generate the requirements file. Have a look again at the freeze target in the Makefile again. Running the command make freeze in a command shell will generate the test and production requirements files. The newly generated requirements files will have all the dependencies embedded for the packages you require.
The test suit needs to run within a virtual environment to isolate the packages and dependencies. The following commands will create the virtual environment and install the test framework with all the package dependencies.
Then you can run the tests with this command make run-tests. If the tests all pass the output will look like this.
Building and packaging the project
Since 2018 python packaging is done using the wheel package. We have created a setup.py to facilitate the build. Let's see the contents of this file:
Running the following command make src_package will create the build. Check the contents of the Makefile given above to see the list of commands executed. As a result of this step now we can see the contents of the build and dist folders. The .wlh file in dist is the packaged distribution.
In addition, we can also see the .egg-info file. This will have metadata to let us install the package above with pip install.
Now let us see how all this will be packaged into a Docker image.
Configuring and building the Databricks Docker image
We have selected a Docker image with the Databricks runtime so that all the Spark and ML dependencies are already embedded. The Docker build is done in 2 stages.
The first stage will create the Python dependencies installing them from our requirements.txt file. Note that we don’t have to install Java, Scala, or PySpark because these distributions are already available with the Databricks runtime.
In the second stage we will initially copy the packages we build in the first stage and then copy out the codebase into the Docker image to install our package. Now let us have a look at the Docker file.
To build this image locally using the following command:
docker build -t pyspark-databricks-poc .
To run the docker container locally and to log in to it use this command:
docker run exec -it pyspark-databricks-poc /bin/bash
Creating a CI/CD pipeline and deploying to a Databricks cluster
Now we have done most of the heavy lifting with the codebase and it’s time for some dev-ops. We use GitLab as an integrated repository and dev-ops lifecycle management tool. This provides a GitHub like source repository coupled with many other cool features like a built-in CI/CD tool, artifact repository, wiki, etc..
GitLab CI was the intuitive choice for our CI/CD pipeline as it has all the features we were looking for and it's already integrated with our source repository so less extra work. Let us have a look at the stages involved with the use of the YAML file.
We have 3 stages. Each of these stages will run within a docker container executed by a GitLab CI Runner.
The build stage has two parts. The first stage will build and package the source code. The second stage will deploy it within the Docker container.
The output of the stages is shown below.
Test: Only if the tests are all passing we can progress into the next stage of the pipeline.
Build stage 1: Note that at the end of this stage the output ( build/ dist/ .agg-info/ folders ) will be copied to a temporary location to be staged again into the next step.
This is the second build stage where the Docker image with the Databrics runtime will be built with the Python dependencies and our source package installed. This image will then be uploaded to Docker hub.
In the deploy stage, we execute a curl command to invoke the Databricks REST API. This will create a cluster using the Docker image we pushed in the earlier stage as the base. This Docker image will have both the Databricks runtime and our source package. In the below image I have masked the authorisation tokens for both Dataricks and Docker hub.
We are also submitting the job in the “spark_python_task“ within the same curl command. This will result in the job getting executed soon as the cluster is ready.
Finally, login to the Databricks account and check the status of the cluster and the job.