Manage your big data project
I recently finished building an analytics platform for one of Infinite Lambda’s clients, which inspired me to write this blog on how to ‘manage your big data project’. Working with a number of large tech organisations has let me experience many different approaches to implementing new big data projects. As a hands-on data engineer with experience leading small data tech teams, I have seen the common challenges in the data world, and how organisations either overcame them or fell into the same traps.
By 2022, big data and analytics revenues are predicted to reach 274.3 billion dollars. In healthcare alone, projections show the industry could save as much as 300 billion dollars if it could only integrate its big data with other systems and business processes. While these projections on corporate investments and benefits are significant, organisations continue to lag when it comes to effectively managing big data projects. There are many reasons why these projects may not deliver the expected results, and the most common ones originate from the composition of the team and the tools available.
Assemble your team
You may come across different titles for data professionals, such as data scientist, BI engineer, data engineer and data analyst, which raises the question of how to assemble a team for your data project. That leads to the first big question you need to answer: what is it that you want to achieve with your data project?
Do you want to have a system that integrates CRM, Salesforce, Finance and other operational databases which enables you to report on trends and KPIs?
Do you want to integrate your in-house bespoke product with the above-mentioned systems and carry out reporting on it?
Do you want to trace activity on your website and analyse it?
Do you want to feed batches or stream data to your machine learning model?
I assume you have most of the above requirements, and they need to be delivered by different data professionals and different sets of tools. Let’s dive into the context of why you may need different data professionals.
You may hire a couple of smart data scientists who can build outstanding ML models and transform and feed data to those models. However, they often may not know how to scale services or how to integrate and model data. They might simply never have learnt it because, no matter what data science certificate an individual holds, the courses assume that the data is available on demand, which isn’t the case at most organisations. As a result, they might struggle to access the data they really need for their models.
You may also hire data engineers who can build batch and stream data pipelines and a data warehouse for the company’s data silos, with your favourite BI visualisation tool connected to it. However, data engineers may not be able to produce machine learning models, or to participate in all the business meetings needed to understand the intricacies of your organisation and carry out the sound analytics you need.
Lastly, there is the IT infrastructure, which as a discipline is orienting towards cloud computing and moving away from on-premises systems and data centres. So you may hire infrastructure engineers who are responsible for designing, building, deploying and maintaining the IT infrastructure, preferably in the cloud. Neither the analyst nor the BI dev would have adequate knowledge to contribute effectively to building the IT infrastructure. While collaboration is vital at the infrastructure level, you probably don’t want your developers to manage the infrastructure, as that goes against the central governance you really need for infrastructure, especially when it’s cloud-based.
Developers have so-called T-shaped skills where the vertical bar of the T refers to expert experience and understanding of a particular area, while the top of the T refers to an ability to collaborate with experts from other disciplines, gaining even greater understanding and knowledge from this collaboration.
When you assemble a team you need to define what skills your project will require and make sure you hire people with the right skill set. There is some leeway in how you fine-tune your team, but you should use it with caution. For instance, let’s say you don’t actually need most of the skills a BI dev has, only someone who knows your favourite BI tool, so the analysts might cover that too. This particular setting might work, but when you apply the same logic to roles that require extensive programming ability (left side and middle of the table), don’t expect better than entry-level delivery.
I have seen companies that hired 10 data scientists and only got round to hiring their first data engineer 3–4 years later. It also isn’t rare to have many data engineers who just happen to work within their own silo and don’t collaborate with the infrastructure engineers.
You may also be wondering what the size of your data squad should be. The best baseline measure is to look at the size of the engineering team responsible for server-side development, such as web apps, mobile apps and APIs, and compare it with your prospective data team. If you want to analyse your product data, then your data team needs to be in balance with the other engineers.
Working with data is a different discipline altogether from building, for instance, an API. There is not much in the horizontal top line of the T where data skills overlap with server-side development, so you can’t expect API developers or even full-stack developers to carry out data projects effectively.
We at Infinite Lambda employ infrastructure and data professionals with great T-shaped skills, and we train individuals to extend not only their vertical depth but also their horizontal line in order to provide an expert team for your project. We not only want to build something great from the ground up but also leave behind something that can easily be maintained and provides value for the long term. As a bonus, we are happy to consult on how you should assemble your data team for maintenance and for further incremental developments.
Have your tools ready for your solution
It’s hard to know where to start once you’ve decided that, yes, you want to dive into the fascinating world of data and machine learning. Just looking at all the technologies you have to understand and tools you’re supposed to master can be dizzying. You have a rough idea about what you want to build and what your data squad will look like and this is the time you need to choose your tools for your technical solution.
With the reduced cost of storage, the amount of data produced by systems has skyrocketed, and every piece of information may be considered important for your organisation. As a result, you will also need some high-powered CPUs/GPUs that can process that sheer volume of data. A commercialised, programmable quantum computer is still far from accessible today. Therefore, everything you build needs to scale horizontally, which includes your data pipeline and your stream and batch processes. Your next challenge will then be managing the infrastructure: regardless of whether you build your solution from open source tools or managed services, you will need to allocate the right number of people with infrastructure skills to your project, and those people are usually scarce at every organisation.
Cloud service providers identified this issue very early, and they provide many cloud resources for data solutions, such as pub/sub, message queues, distributed storage (HDFS) and distributed processing systems, which might be of interest for your data project. These are, however, individual components of your future solution. You may need to programmatically integrate many of these components to deliver a tool your team can use. Many companies capitalised on this fact, went further than the cloud providers and developed their own cloud-based managed services that can be used for data solutions.
Just to mention some:
Snowflake built a database which can scale effectively and has many useful built-in DB functions for data-warehousing that your data engineers can use.
Databricks built a scalable notebook experience with ML support which can interpret four programming languages well known to data scientists.
Confluent built a tool that can function as a service bus for streaming, provides a UI for monitoring and you can access its functionality through CLI and API in a simpler fashion.
Astronomer built an orchestration tool which scales with Kubernetes (K8s) and integrates with your CI/CD.
Fivetran built a data pipelining tool which can connect to social media platforms, databases, apps, you name it.
Don’t forget that all these managed services have their own IT infrastructure because scalability is part of the service. You may find it challenging to integrate these tools securely with your network.
For your data project, you will need most of the above-mentioned functionalities if not all of them. So the big dilemma is how you tread the line between using open source technologies, cloud resources (as components) and utilising managed services.
Let me give you an example of stream data engineering to highlight the complexity. For the sake of simplicity, let’s say your cloud service provider is AWS and you want to implement stream data engineering for low-latency processes, plus archive all the data that flows through your stream so that you can later run batch processes on it for analytics and reporting. You may use the Amazon MSK (Managed Streaming for Apache Kafka) or Kinesis cloud resources.
You want to monitor your data flow and transform data with low latency. In the case of MSK you need, for example, Elasticsearch and Kibana to monitor your stream and Apache Spark for transformation. So when you choose MSK you need to programmatically integrate it with other cloud resources and open source tools. Kinesis, on the other hand, integrates with other AWS components such as Amazon CloudWatch for monitoring, AWS Lambda for transformation and Kinesis Analytics for analysing the data in near real time.
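To make the Kinesis and Lambda route more concrete, here is a minimal Python sketch of a Lambda function that transforms records coming from a Kinesis stream. It relies on the event shape Lambda uses for Kinesis sources, where each payload arrives base64-encoded under record["kinesis"]["data"]. The transform function, the JSON payload format and the field names are assumptions made up for illustration, not part of any real project.

```python
import base64
import json


def transform(payload: dict) -> dict:
    """Hypothetical transformation: normalise the event name and tag the record."""
    out = dict(payload)
    out["event"] = str(payload.get("event", "")).strip().lower()
    out["processed"] = True
    return out


def handler(event, context):
    """Entry point for a Lambda function subscribed to a Kinesis stream."""
    results = []
    for record in event["Records"]:
        # Kinesis payloads are base64-encoded; we assume they contain JSON.
        raw = base64.b64decode(record["kinesis"]["data"])
        results.append(transform(json.loads(raw)))
    return results


# Local smoke test with a hand-built event in the Kinesis event shape.
encoded = base64.b64encode(json.dumps({"event": " PageView ", "user": 1}).encode())
sample_event = {"Records": [{"kinesis": {"data": encoded.decode()}}]}
print(handler(sample_event, None))
```

With MSK you would typically write the equivalent logic as a Spark job or a Kafka consumer and run it on infrastructure you manage yourself.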
Archive for batch processing:
Both MSK and Kinesis buffer data, each in its own way, but neither archives it for you. In order to archive your data, you need to connect to the data flow, capture the information and flush it to a shared medium, preferably distributed storage like HDFS. In the case of MSK you need to use a connector, for example Secor, the well-known open-source tool developed by Pinterest, which can capture data and flush it to S3, while Kinesis integrates with the Firehose cloud resource, which can capture the data and flush it to S3.
So when would you use MSK over Kinesis? The big difference between these two components is that Kinesis is a throughput provisioning system while MSK is a cluster provisioning system. Your solution architect needs to know the differences between MSK, Kinesis, Confluent and open source tools, and how to string the components together to deliver a robust and cost-effective solution. In addition, the general truth is that the more you build yourself with open source tools and/or cloud resources, the more flexibility you have to tailor your solution to your requirements.
When would you use a managed service over cloud resources and open source tools? When you don’t have the time and the resources to build and maintain it yourself. However, don’t forget you’ll need infrastructure engineers to securely connect your managed tool with your network. Also, don’t forget that you may not have as much flexibility with a managed service as you would if you built the tool yourself. A developer’s nightmare is to be constrained to a managed service with known limitations that management dismissed as edge cases at the time. When what was believed to be an edge case later becomes one of the crucial functions your tool must perform, it is the developers’ job to work around those limitations.
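The capture-and-flush pattern behind tools like Secor and Firehose can be sketched in a few lines of Python. This is only an illustration of the idea, not how either tool is implemented: the BufferedArchiver class, the key layout and the sink callable are all hypothetical. In a real setup the sink could, for example, wrap an S3 upload, and you would also flush on a timer.

```python
import json
from datetime import datetime, timezone
from typing import Callable, List, Optional


class BufferedArchiver:
    """Collects records and flushes them in batches as newline-delimited JSON."""

    def __init__(self, sink: Callable[[str, bytes], None], max_records: int = 500):
        self.sink = sink  # hypothetical writer, called as sink(key, body)
        self.max_records = max_records
        self.buffer: List[dict] = []
        self.batch_no = 0

    def add(self, record: dict) -> None:
        self.buffer.append(record)
        if len(self.buffer) >= self.max_records:
            self.flush()

    def flush(self, now: Optional[datetime] = None) -> None:
        if not self.buffer:
            return
        now = now or datetime.now(timezone.utc)
        # Date-partitioned key so batch jobs can read one day at a time.
        key = f"archive/dt={now:%Y-%m-%d}/batch-{self.batch_no:06d}.jsonl"
        body = "\n".join(json.dumps(r) for r in self.buffer).encode("utf-8")
        self.sink(key, body)
        self.buffer.clear()
        self.batch_no += 1


# Usage with an in-memory dict standing in for the object store.
store = {}
archiver = BufferedArchiver(sink=store.__setitem__, max_records=2)
for i in range(5):
    archiver.add({"id": i})
archiver.flush()  # flush the remaining partial batch
print(sorted(store))
```

Partitioning the keys by date is one common choice because the later batch processes usually read the archive one day at a time.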
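To illustrate what throughput provisioning means in practice, the sketch below estimates how many Kinesis shards a workload would need. It assumes the published per-shard write limits for provisioned mode (1 MB per second and 1,000 records per second), which you should check against the current AWS documentation. The function name and the 25% headroom are made-up defaults for illustration.

```python
import math

# Assumed per-shard write limits for Kinesis Data Streams (provisioned mode).
SHARD_WRITE_MB_PER_SEC = 1.0
SHARD_WRITE_RECORDS_PER_SEC = 1000


def shards_needed(records_per_sec: float, avg_record_kb: float,
                  headroom: float = 0.25) -> int:
    """Estimate the shard count from expected write throughput plus headroom."""
    mb_per_sec = records_per_sec * avg_record_kb / 1024
    by_bytes = mb_per_sec / SHARD_WRITE_MB_PER_SEC
    by_count = records_per_sec / SHARD_WRITE_RECORDS_PER_SEC
    # Whichever limit is hit first determines the number of shards.
    return max(1, math.ceil(max(by_bytes, by_count) * (1 + headroom)))


print(shards_needed(records_per_sec=5000, avg_record_kb=2))    # limited by bytes
print(shards_needed(records_per_sec=3000, avg_record_kb=0.2))  # limited by record count
```

With MSK there is no equivalent formula to plug numbers into: you choose the broker instance type, the number of brokers and the storage, and the throughput you get follows from that cluster.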
We at Infinite Lambda have a wealth of data engineering experience and have implemented end-to-end data solutions for many businesses. We know the ins and outs of these tools and can integrate the different components, regardless of whether they are open-source tools, cloud resources or managed services. We formulate different scenarios to evaluate the best combination of components in order to provide a robust and cost-effective long-term solution for your technical challenges.
Have a chat with us and see how we can help with your big data project!