Predictive analytics on large datasets with Databricks notebooks on AWS
Updated: May 13
The notebook for this blog can be downloaded here
Recently I had the opportunity to work with Databricks notebooks with one of our startup clients at Infinite Lambda. I was fascinated by how this service increased the productivity of our team and let us collaborate effectively with developers and also with other stakeholders like business analysts and data scientists. This was especially evident in a Covid 19 affected world, when everyone is working remotely.
This is Part 1 of a 2 part blog:
Part 1 is about demonstration of the capabilities of the Databricks platform with AWS to facilitate predictive analytics workloads on large datasets in a collaborative development setup. The dataset selected is from a Kaggle competition https://www.kaggle.com/c/new-york-city-taxi-fare-prediction. The dataset has over 55 million taxi trips and over 5GB in size.
Part 2 will focus on creating a job with the code in the notebook discussed in Part 1, automating the build and running the job in production with Databricks Docker images on AWS.
Setting up the Databricks environment
Register with Databricks. https://accounts.cloud.databricks.com/registration.html If you are trying out the platform for the first time or even working with a small team the options given by the Standard plan is sufficient to get started.
You will need an AWS IAM role and a S3 bucket to start with. The IAM role should be configured according to Databricks instructions provided in this link.
The S3 bucket should be configured with permissions inline with the following instructions. https://docs.databricks.com/administration-guide/account-settings/aws-storage.html
This is the minimum setup you will need to fire up a cluster with a Databricks runtime and start with a notebook.
Your resources such as Databricks notebooks will be saved in the workspace. The platform will internally save them in the S3 bucket you configured as part of the setup.
This is one of the most useful features of the platform. The runtime provides a highly optimised version of Spark combined with other frameworks to facilitate big data and machine learning use cases. Most of the python packages you will need for different stages of data science and data engineering pipelines are already embedded with different versions of the runtime. You will not need to worry about getting your hands dirty with messy dev-ops. The case study we have in this blog is developed on 6.5 ML runtime. This runtime is packaged with Scala, Spark, Hive and most of the ML frameworks like Keras, Tensorflow and PyTourch, also with python libraries like Numpy and Pandas and much more.
Occasionally you might need to install an additional python library like s3fs which I installed to let Python Pandas access the S3 bucket. This is easily done if the package is available in PyPI in the cluster configuration. You will need to restart the cluster after the install.
More information about the runtime can be found in this resource.https://databricks.com/product/databricks-runtime
Once you know which runtime you are going to use the next step is setting up a cluster to suit your workload. You can setup one or many clusters as you like depending on different use cases you need to work on. For example a 6.5 runtime with 2 worker nodes and a master with 16 G RAM each can be sufficient to run or develop some Spark ETL jobs with a few notebooks attached. However If you need to train a deep learning model you might need a cluster with 6.5 ML runtime with GPU support and more expensive GPU nodes.
Our use case has a PySpark ETL and Keras deep learning pipeline each. For the ETL part we only need a small cluster with limited vcpu and memory. To keep the cost low I created a small cluster with just one worker(16 G, 4 vcore ) and master node(8 G, 2 vcore). Surprisingly this was sufficient to get through all the ETL stages with PySpark. To keep the AWS cost further low an all spot configuration was used.
Databricks will charge you a fee in addition to the AWS cost. The two systems will bill you separately. Further information can be found here https://databricks.com/product/aws-pricing.
Ballpark cost estimate( don't hold me accountable on this ) Non GPU cluster - all spot - 1 master, 1 worker (30 G, 4vcore) - overall cost $6 for 3 hours
A really useful feature in the Databricks workspace is that the notebooks and clusters are inherently detached. All the notebooks you will create will be automatically saved in your workplace. You can attached and detach a notebook from a cluster. If you shut down a cluster you will not loose the notebooks attached to it. You can simply spin a new cluster and attach your notebooks and run again.
Its great for pair programming. You can share a notebook URL with a colleagues and work on the same data flow simultaneously. Changes to a notebook will be displayed realtime for everyone working on it. You can then run a completed notebook and share it with a business analyst with the data outputs to review your job.
Furthermore, you can even version the notebooks and save it to a Github repo. If you are premium user the notebooks can also be scheduled and run as a daily or weekly job.
ETL stage with PySpark
Now let's look at some code. The full notebook for this code can be downloaded from the link at the start of this blog. The Kaggle dataset is pre downloaded into a S3 bucket which is in csv format. To configure access to the S3 bucket here I have used key based access, however its more secure to use an IAM role. The S3 bucket is mounted into the Databricks native DBFS file system. The bucket can be accessed with the s3a url or with the reference to the new mount.
Convert the original cvs dataset to parquet as this format is much faster and memory efficient. This is a one time step and I have created a checkpoint at this place.
Then comes the heavy lifting ETL stage where the dataset is cleaned and new features are engineered.
At the end of the PySpark ETL stage the dataset is split in to Train, Validate and Test and saved back to disk. Also here we select a small sample of the original dataset before the split. In the next stage we will deal with Python Pandas data frames. While developing the Keras deep learning and prediction part of the data flow it is convenient and resource efficient to work with a small sample of the dataset. Once we have the Keras part of the code complete up to the point of making predictions then we can scale up the cluster and also the sample dataset.
Databricks magic functions and utilities
Another convenient feature you can find in Databricks are the utilities functions which can be used to manage files and objects from notebooks. A wide range of utility and magic functions are available on your disposal such as :
More of this can be found here. https://docs.databricks.com/dev-tools/databricks-utils.html
Deep Learning with Keras
Now that we have covered the boring data wrangling bit with Spark as efficiently as we can, focus is now on to more exciting deep learning with Keras. We will need a larger cluster with more memory on the driver node so let's change the profile of the cluster. We will be training the full dataset later in a GPU available cluster but while we are still working on to complete Keras code its better try to manage with a smaller cluster to keep cost low. Change the cluster configuration to have at least 16 Gb of driver memory with the runtime as 6.5 ML. This runtime will have Pandas, numpy, Keras and Tensorflow all pre-installed.
Read the train, validate and test datasets from the parquet files created by PySpark and pre-process by scaling to fit in to the Keras deep learning pipeline.
Define the neural network model
Train and evaluate
Predict with the test dataset and save the results
Now the end to end Keras pipeline is also complete but we are not done yet. We have so far worked on a sample dataset and we also know the end to end dataflow with Spark ETL and Keras is working. Now it's time to ramp up the cluster with a GPU enabled master+worker and run it on the full dataset. The runtime version should also support GPU as shown in the below image.
To run on the full or larger sized dataset change the sample size to larger fraction and re-run the full notebook from Checkpoint 1 onwards.
This might take an hour or 2 depending on the fraction of the dataset you have chosen.
Running the attached notebook
Try running the notebook in a cluster with 30G driver memory at least, unless you select a very small sample size. The Keras pipeline is quite memory intensive.
Databricks workspace, runtimes, clusters and notebooks together create a uniquely powerful environment to create and manage end to end ETL and machine learning data pipelines efficiently. Some of the features we discussed in this blog are:
Developer pair programming and collaboration.
Facilitate data and code reviews with different stakeholders.
Dynamic cluster configurations and detached notebooks make it easy to scale up and down to manage cost.
Runtime configurations with pre packaged software to suit your use cases -no need messy and time consuming dev-ops.
Ability to combine a wide range of packages like PySpark, Pandas, Keras and Tensorflow seamlessly in the same notebook and run on the same cluster configuration end to end.