Glovo manages 2 TB of data every day with AWS
2022
A world leader in the delivery sector, Glovo currently receives 2 TB of data per day from online orders and offers from its suppliers. For the Spanish multinational, it is essential that these high volumes of data have the greatest possible impact on delivering the best customer service, and using Amazon Web Services solutions has allowed it to achieve this.
“Thanks to the abundance of services oriented towards the ingestion, processing and use of data in AWS, at Glovo we have been able to build and scale a data platform that supports our entire business, and evolves with us as we move towards a Data Mesh organizational model.”
Oliver Fenton
Data Platform Director, Glovo
Glovo is building its data platform with AWS
Glovo was born in Barcelona in 2015, and is primarily engaged in the home delivery of online food orders. Today, the app has a presence in 25 countries and more than 1,300 cities, with more than 150,000 partner restaurants and establishments. In addition to the best restaurants, it offers users all kinds of other establishments, including supermarkets and electronics, health and beauty, and gift stores, among others.
Glovo has a strong technological component and a platform that connects customers, businesses and couriers through its website and mobile application. On a daily basis, it handles a data volume of 2 TB. “As a fast-growing company, with data demands (use cases and quantities) growing rapidly, it is critical to Glovo's success to have a strong structure around how we collect, collate and leverage data to have the greatest possible impact on the business”, explains Oliver Fenton, Data Platform Director at Glovo.
The protection of users' personal data is very important to Glovo, and it always ensures maximum security. “AWS allows us to meet the goal of securely storing data and accessing it when needed, while always complying with all applicable regulations,” says Oliver Fenton.
Glovo turned to Amazon Web Services for the first iteration of its data platform. This choice was made on the recommendation of some of their data team members, who were former users of AWS solutions. The product chosen to start building this data platform was Amazon Redshift, a cloud-managed data warehouse service. “It was very easy to get started with and powerful enough to meet all of our needs,” acknowledges Fenton. This data analytics layer running on top of Amazon Redshift was soon followed by ETL use cases: “The first was tabulating raw data (about 100 GB of raw data per day). To meet this need, we use the Amazon EMR big data platform. Specifically, Apache Spark technology on Amazon EMR ephemeral clusters running on Amazon EC2 instances, with the aim of processing the raw data in Amazon S3 and building the analytical tables, also stored in S3 and registered in the AWS Glue data catalog”.
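The first ETL use case Fenton describes is tabulating raw data into analytical tables. In production this runs as Apache Spark on ephemeral Amazon EMR clusters; the sketch below illustrates just the tabulation step in plain Python, with hypothetical event and field names (not Glovo's actual schema), flattening nested raw order events into one row per order item.

```python
import json

# Hypothetical raw order events as they might land in S3 as JSON lines
# (field names are illustrative, not Glovo's actual schema).
raw_events = [
    '{"order_id": 1, "city": "Barcelona", "items": [{"sku": "A", "qty": 2}]}',
    '{"order_id": 2, "city": "Madrid", "items": [{"sku": "B", "qty": 1}]}',
]

def tabulate(lines):
    """Flatten nested raw events into one flat row per order item."""
    rows = []
    for line in lines:
        event = json.loads(line)
        for item in event["items"]:
            rows.append({
                "order_id": event["order_id"],
                "city": event["city"],
                "sku": item["sku"],
                "qty": item["qty"],
            })
    return rows

rows = tabulate(raw_events)
print(rows)
```

In the real pipeline the equivalent transformation runs in Spark, with the resulting tables written back to Amazon S3 and registered in the AWS Glue data catalog.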
However, “shortly thereafter, people started building their own ETL processes on this data. The processing was also performed in Spark on Amazon EMR (EC2) and was orchestrated via the Luigi module, an open-source orchestrator developed by Spotify, deployed on an EC2 instance. Some results from these ETLs needed to be made available in our Looker BI tool, and due to certain limitations in our processes at the time, we also had to produce copies in Amazon Redshift.”
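Luigi's core idea is that each ETL step declares its upstream dependencies, and the orchestrator runs tasks in dependency order, skipping work already done. The toy scheduler below mimics that requires()/run() contract in plain Python; it is a conceptual sketch only, not Luigi's actual API, and the task names are illustrative.

```python
# Toy scheduler illustrating the dependency-driven model that Luigi
# (the open-source orchestrator Glovo deployed on EC2) is built around.

class Task:
    requires = []   # upstream Task classes, overridden by subclasses
    done = set()    # shared registry of completed task names

    @classmethod
    def build(cls):
        for dep in cls.requires:
            dep.build()                # run dependencies first
        if cls.__name__ not in Task.done:
            cls().run()                # run once, then mark complete
            Task.done.add(cls.__name__)

class TabulateRawData(Task):
    def run(self):
        print("tabulating raw data")

class BuildAnalyticsTable(Task):
    requires = [TabulateRawData]       # depends on the tabulation step
    def run(self):
        print("building analytics table")

BuildAnalyticsTable.build()            # runs TabulateRawData first
```

Real Luigi adds on top of this idempotent targets (e.g. files in S3), retries, and a central scheduler, which is what made it suitable for orchestrating the Spark ETLs on Amazon EMR.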
As Glovo's business grew exponentially, its data platform continued to scale and needed to cover an increasingly varied set of use cases. The company identified the need for a higher refresh rate to improve the availability of operational data, and built Importer, a data ingestion tool that takes advantage of the capabilities of Apache Spark on Amazon EMR (running on EC2) to extract data from transactional databases running on Amazon Aurora and other types of sources, and stream it to the data lake in Amazon S3, where it is made available as tables. “These tables are created using data in Delta format, and the entire extraction process in Importer is orchestrated through Luigi running on an EC2 instance,” continues Glovo's data platform director.
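The key to a tool like Importer improving the refresh rate is incremental extraction: each run pulls only the rows that changed since the last successful load. The sketch below shows that pattern with Python's built-in sqlite3 standing in for Amazon Aurora; the table, columns, and watermark scheme are illustrative assumptions, not Glovo's implementation (which uses Spark on EMR).

```python
import sqlite3

# sqlite3 stands in for the Aurora transactional source; schema is illustrative.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, updated_at TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(1, "2022-01-01"), (2, "2022-01-02"), (3, "2022-01-03")],
)

def extract_since(conn, watermark):
    """Return only rows changed after the last successful load (the watermark)."""
    cur = conn.execute(
        "SELECT id, updated_at FROM orders WHERE updated_at > ? ORDER BY id",
        (watermark,),
    )
    return cur.fetchall()

new_rows = extract_since(conn, "2022-01-01")
print(new_rows)
```

In the real platform the extracted increments are merged into Delta-format tables in S3, which is what lets the data lake serve near-current operational data.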
In parallel with the work of the data teams at Glovo, application developers began to adopt microservice architectures, generating new use cases for managing data: “The backend started to be divided into microservices, with inter-process communication done using Amazon Kinesis streams. These events, in Avro format, were required for certain analytics cases. To upload them into S3, we created a framework that we call Rivers, using Apache Beam technology and the Amazon Managed Service for Apache Flink infrastructure, which writes the events into a specific directory structure that allows a subsequent process to collect them and integrate them into analytical tables using Importer.”
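The "specific directory structure" Fenton mentions is what makes the landed events easy for downstream jobs to pick up by stream and time window. The sketch below shows one way such partitioned S3 keys could be built; the prefix scheme and names are assumptions for illustration, not Rivers' actual layout.

```python
from datetime import datetime, timezone

def event_key(stream, event_time, offset):
    """Build a partitioned S3 key for an Avro event (illustrative scheme)."""
    return (
        f"events/stream={stream}"
        f"/dt={event_time:%Y-%m-%d}/hour={event_time:%H}"
        f"/{offset}.avro"
    )

ts = datetime(2022, 3, 1, 14, 30, tzinfo=timezone.utc)
print(event_key("orders", ts, 42))
```

Partitioning by stream, date, and hour like this lets the subsequent Importer-based integration job scan only the prefixes for the window it needs, instead of listing the whole bucket.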
Towards a Data Mesh
All the cases described above run on a monolithic data platform, “which does not scale,” clarifies Oliver Fenton, slowing both the growth of new use cases and data-driven business decision-making across the company. For this reason, Glovo has begun a journey towards a Data Mesh that allows it to divide responsibilities and give teams greater autonomy over their use of data. “As part of this path, we have started to create what we call Self Service Data Pipelines (SSDP). For this, we still use Spark on Amazon EMR (EC2) for processing, but we also include an Amazon Managed Workflows for Apache Airflow (MWAA) instance per data domain, to give data teams more control over their own code and deployments. Both Spark and non-Spark applications are packaged in Docker containers and uploaded to Amazon Elastic Container Registry (Amazon ECR). Non-Spark applications run on Amazon Elastic Container Service (Amazon ECS) on Fargate, while Spark applications run on Amazon EMR (EC2), leveraging YARN's support for Docker.”
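The Spark/non-Spark split described above amounts to a routing decision made per containerized job. The toy router below makes that decision explicit; the runtime names come from the services in the text, but the dispatch function and job format are illustrative assumptions, not Glovo's SSDP code.

```python
# Route each Docker-packaged job to the runtime described in the text:
# Spark jobs to Amazon EMR (via YARN's Docker support), everything else
# to Amazon ECS on Fargate. The routing logic itself is illustrative.

def runtime_for(job):
    return "emr-yarn-docker" if job.get("spark") else "ecs-fargate"

jobs = [
    {"name": "sessionize-orders", "spark": True},
    {"name": "export-report", "spark": False},
]
for job in jobs:
    print(job["name"], "->", runtime_for(job))
```

In the MWAA setup each data domain's Airflow instance would trigger these containers, so teams deploy images to ECR and keep full control over their own code.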
Additionally, Glovo has established Starburst as the query engine for its Data Mesh. “We started using Amazon Elastic Kubernetes Service (Amazon EKS) within the data teams at Glovo to run our Starburst query engine, potentially removing the need to keep Amazon Redshift as a means of running certain user queries from Looker. We have also started to investigate Amazon EMR on EKS so that we don’t have to pay for the startup time of the EMR (EC2) clusters,” says Oliver Fenton.
By using AWS solutions, Glovo has gained flexibility in scaling its platform within the data mesh framework: “In the case of Amazon EMR, we can provision clusters that will be active only during the necessary processing time; each group can be sized independently. Having these multiple clusters also allows us to easily isolate the different use cases”, explains this Glovo technical manager.
Additionally, the data team has expressed interest in the new serverless capabilities available in AWS services such as Amazon EMR and Amazon EKS: “They allow us to get something up and running more easily, without having to know all the underlying details. The effort involved in deployment and operation also tends to be lower.” Even without yet adopting purely serverless models (since they have only recently been launched by AWS), the combination of services running on AWS and the adoption of Data Mesh “has allowed Glovo to reduce the time it takes to create data products from five weeks to two. Thanks to the abundance of services oriented towards the ingestion, processing and use of data in AWS, at Glovo we have been able to build and scale a data platform that supports our entire business and evolves as we move to a Data Mesh organizational model.”
In the future, the company wants to migrate its existing applications to Amazon Elastic Kubernetes Service (Amazon EKS) and is already investing in a layer of governance, security and data access control to better manage access to the data needed for each use case: “We are running Starburst on EKS and are now working to have the Privacera data access management tool running on Amazon EKS as well.” Glovo is also implementing a notebooks solution on Amazon EKS (based on JupyterHub) “for real-time collaboration in this integrated development environment. In this way, analysts and other profiles with access to data within Glovo will be able to work more efficiently,” concludes Fenton.
About Glovo
Glovo is a technological platform of reference in the delivery sector. Founded in 2015 and headquartered in Barcelona, it currently has a presence in more than 1,500 cities across 25 countries. In addition to connecting users with the best restaurants in their city, it includes services in other categories such as groceries, health and beauty, gifts and express deliveries. In Spain, Glovo is available in more than 280 cities.
Benefits with AWS
- Better availability of operational data
- More control over code and deployments by data teams
- Flexibility to scale the platform within the data mesh framework
- Less effort involved in deployment and operation
- Data product creation time reduced from 5 to 2 weeks
AWS Services Used
Amazon EMR
Amazon EMR is the industry-leading cloud big data solution for petabyte-scale data processing, interactive analytics, and machine learning using open-source frameworks such as Apache Spark, Apache Hive, and Presto.
Amazon EKS
Amazon Elastic Kubernetes Service (Amazon EKS) is a managed container service for running and scaling Kubernetes applications in the cloud or on premises.
Amazon Managed Service for Apache Flink
Set up and integrate data sources and destinations with minimal code, continuously process data with subsecond latencies, and respond to events in real time.
Amazon Redshift
Amazon Redshift uses SQL to analyze structured and semi-structured data across data warehouses, operational databases, and data lakes, using AWS-designed hardware and machine learning to deliver the best price performance at any scale.
Start now
Companies of all sizes in every industry are transforming their businesses with AWS every day. Contact our experts and start your own journey in the AWS Cloud today.