How Airtel Unified Data Mesh Principles and Practice with DataHub
Everyone talks about data mesh principles, and it’s easy to see why — organizations are dealing with growing amounts of data, and traditional approaches just can’t keep up.
The promise of data mesh is appealing: giving teams ownership of their own data while still ensuring everything is connected, discoverable, and secure.
But the reality is that setting up a data mesh is tough.
Between dealing with data silos, ensuring teams have the tools they need, and managing security at scale, there’s a lot to handle.
Today, we share how Airtel, one of the top three mobile service providers in the world, used DataHub to build a data mesh that works for them: in their own context, with their own tools, and for their specific needs.
Airtel’s Journey to a Data Mesh
Until 2018, Airtel’s data landscape was fragmented, with data scattered across different systems and teams, making it difficult to leverage effectively. Airtel started centralizing its data into a single data lake to address this.
Fast-forward to today, their operations have grown — they process over 30 petabytes of data and run over 10,000 daily jobs.
However, as their data scale expanded, the centralized model started to show its limitations — issues like scalability challenges, bottlenecks, and slower agility.
Recognizing these constraints led Airtel to embrace a decentralized data mesh approach, guided by four key data mesh principles:
- Data Domain Ownership: Ensures individual domains manage and take responsibility for their data.
- Data as a Product: Treats data as a usable and reliable product for business needs.
- Federated Governance: Creates a governance framework to support collaboration across decentralized teams.
- Self-Serve Data Infrastructure: Provides teams with a self-serve platform to independently manage and use their data — while ensuring consistency and control.
Understanding How Airtel’s Tech Stack Supports Their Data Mesh Implementation
Let’s break down the architecture of Airtel’s data mesh by walking through a typical user journey.
When a user joins a domain or system, their journey often begins with proposing a new data product. This typically involves outlining the specific requirements and purpose of the product they aim to create.
Once the proposal is approved, the next step is to create the data product. This involves several key phases:
- Data Ingestion: This involves ingesting raw data from external sources and writing it into storage.
- Data Transformation: After ingestion, the raw data undergoes transformation to shape it into a usable, meaningful format. This transformed data forms the foundation of the data product.
- Data Product Creation: With the transformed data in place, the data product is finalized and made available for consumption.
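In simplified form, the three phases above can be sketched as plain functions. The function names and record shapes here are hypothetical, for illustration only:

```python
# Illustrative sketch of the three phases: ingest raw records,
# transform them, and publish the result as a consumable data product.
# (Hypothetical names and shapes, not Airtel's actual code.)

def ingest(source_records):
    """Phase 1: land raw data from an external source into storage."""
    return [dict(r) for r in source_records]  # copy into the "raw" zone

def transform(raw_records):
    """Phase 2: shape raw data into a usable, meaningful format."""
    return [
        {"user_id": r["id"], "plan": r["plan"].upper()}
        for r in raw_records
        if r.get("plan")  # drop records missing required fields
    ]

def create_data_product(transformed_records):
    """Phase 3: finalize the data product and expose it for consumption."""
    return {"name": "active_subscribers", "rows": transformed_records}

raw = ingest([{"id": 1, "plan": "prepaid"}, {"id": 2, "plan": None}])
product = create_data_product(transform(raw))
print(product["name"], len(product["rows"]))  # → active_subscribers 1
```

In practice each phase is a separate orchestrated job rather than an in-process function call, but the hand-off from raw data to transformed data to published product follows the same shape.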
Core Architecture for Data Ingestion, Transformation, Consumption, & Governance
Airtel uses a combination of custom-built tools and open source solutions to support the data mesh flow. These include:
- A Hierarchy Management Tool: Users create and manage domains through a hierarchy management tool. This tool defines roles, assigns people to those roles, and establishes governance structures for each domain.
- BRF (Business Requirement Formalizer): The BRF allows users to propose data products within a domain. It captures requirements, formalizes them, and justifies the need for the data product.
- ICF (Ingestion Configuration Forms): These forms enable users to configure ingestion processes with ease by allowing data ingestion into storage through simple configurations.
- Apache Airflow for Ingestion: The architecture ingests data from streams, APIs, files, and databases. Apache Airflow orchestrates tasks such as ingestion, transformation, and modeling.
- DBT for Transformations: DBT communicates with the transformation engine, powered by Apache Spark, through the gateway. This setup provides scalability and delivers high performance for data transformations.
- Query Layer: Airtel’s query layer combines a multi-tenant distributed gateway (Tungsten Gateway) with a query engine (Stream). This layer allows users to run efficient queries and ensures quick access to data.
- Reverse ETL and Dashboards: A custom-built reverse ETL application operationalizes data for business use cases. Teams build dashboards in Tableau by routing data through the architecture for real-time visualization and decision-making.
- Governance Layer: DataHub supports:
- Data Discovery and Lineage: DataHub enables teams to discover datasets and track lineage, offering clear visibility into data usage and flow across domains.
- Data Quality: Airtel enforces data quality through custom rules and metrics. Teams integrate these metrics into DataHub for visibility and control.
- Access Control: Apache Ranger enforces access control based on metadata. Integration with DataHub allows Ranger to define security policies using metadata tags, such as PII classification, ensuring secure access to sensitive data.
Airtel also uses an ELK Stack (Elasticsearch, Logstash, Kibana) to track platform metrics, data metrics, and SLAs, and relies on a custom tool to track and optimize data-related costs.
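To illustrate how configuration-driven ingestion like the ICF might work, here is a minimal validation sketch. The field names and supported source types are assumptions based on the architecture described above, not Airtel's actual form schema:

```python
# Minimal sketch: validate a simple ingestion configuration before a
# pipeline is generated from it. (Hypothetical field names; the supported
# source types mirror the streams/APIs/files/databases mentioned above.)

REQUIRED_FIELDS = {"source_type", "source_uri", "target_table", "domain"}
SUPPORTED_SOURCES = {"stream", "api", "file", "database"}

def validate_ingestion_config(config: dict) -> list:
    """Return a list of validation errors; an empty list means accepted."""
    errors = [f"missing field: {f}" for f in sorted(REQUIRED_FIELDS - config.keys())]
    src = config.get("source_type")
    if src is not None and src not in SUPPORTED_SOURCES:
        errors.append(f"unsupported source_type: {src}")
    return errors

config = {
    "source_type": "file",
    "source_uri": "/landing/usage/2024-01-01.csv",
    "target_table": "usage.daily_raw",
    "domain": "network",
}
print(validate_ingestion_config(config))  # → []
```

Gating pipeline creation on a validated config like this is what lets users ingest data "through simple configurations" without touching orchestration code directly.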
How DataHub is Helping Implement Data Mesh Principles
Here’s how Airtel’s approach uses DataHub to implement the four key principles of data mesh:
1. Domain Ownership
At the core of Airtel’s data mesh approach lies the principle of domain ownership, where responsibility and accountability for data reside with the respective teams that produce it.
Each domain is empowered to manage its data effectively, ensuring it is well-documented, discoverable, and trustworthy for others within the organization. This means teams are tasked with cataloging their data and defining and monitoring its quality.
This decentralized approach to data ownership enables individual teams to enforce quality checks that are most relevant to their domain.
In addition to quality, domain teams are also responsible for the security and accessibility of their data. Airtel leverages Apache Ranger for robust access control, empowering teams to define and enforce security policies at the metadata level.
DataHub provides the infrastructure to reflect these validations and quality checks across the mesh.
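As an illustration of the kind of domain-level quality checks described above, here is a minimal sketch with one record-level and one aggregate-level check. The function names and result shape are hypothetical, not Airtel's actual assertion library:

```python
# Illustrative sketch of record-level and aggregate-level quality checks
# that a domain team might define and surface in DataHub.
# (Hypothetical names; Airtel's actual assertion library is custom.)

def record_check_not_null(records, field):
    """Record-level check: every record must have a non-null `field`."""
    failures = [i for i, r in enumerate(records) if r.get(field) is None]
    return {"check": f"not_null({field})", "passed": not failures,
            "failed_rows": failures}

def aggregate_check_min_rows(records, minimum):
    """Aggregate-level check: the dataset must have at least `minimum` rows."""
    return {"check": f"min_rows({minimum})", "passed": len(records) >= minimum,
            "row_count": len(records)}

dataset = [{"msisdn": "9800000001"}, {"msisdn": None}]
results = [record_check_not_null(dataset, "msisdn"),
           aggregate_check_min_rows(dataset, 1)]
print([r["passed"] for r in results])  # → [False, True]
```

Because each domain picks the checks relevant to its own data, the results are structured records like these, which can then be published to the catalog for organization-wide visibility.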
2. Data as a Product
Airtel’s approach to data as a product follows the foundational principles: discoverable, addressable, trustworthy, self-describing, interoperable, and secure.
Here’s how they brought these principles to life using DataHub:
- Discoverability: Airtel ensures that all required information is cataloged in DataHub, with clear business descriptions, before any data can be ingested. Making this metadata a prerequisite for ingestion makes it easy for users to locate and identify datasets.
- Addressability: Once metadata is ingested, it is linked to a URN. This identifier allows every dataset to be easily located within DataHub. The URN system integrates with ingestion forms, ensuring metadata remains consistent across the data pipeline.
Airtel uses DBT for transformations partly because DBT lineage is automatically reflected in DataHub; together with its connectors, this provides complete visibility into the data pipeline.
- Trustworthiness: This is where assertions come into play. In DBT, users can specify assertions, which are then reflected in DataHub, providing visibility into trends and data quality.
Similar to how DBT offers assertion capabilities for transformations, Airtel has implemented custom assertions for data ingestion. These assertions are part of a comprehensive library that users can apply at both the record and aggregate levels.
All of these assertions are also reflected in DataHub for transparency and data governance.
- Being Self-Descriptive: A key dimension of a data product is that it must be self-describing. Airtel chose DBT in part because it allows users to view their transformation scripts directly in DataHub.
In addition, data stewards play a crucial role in maintaining the understandability of the data generated within their domain. Their responsibilities are clearly defined, helping ensure that users can easily navigate and interpret the data.
Additionally, roles are tightly coupled with the data products created, with data product owners clearly identified. This structure ensures that users know exactly whom to contact if they have any questions or need clarification.
- Interoperability: Airtel ensures data interoperability by integrating DataHub with a variety of tools and systems, including DBT, Apache Ranger, and custom ingestion platforms. This approach allows datasets to work seamlessly with other systems, enabling users to combine and analyze data from different sources. Applying consistent metadata standards across tools further enhances this interoperability.
- Security: Another important dimension is creating secure data products with effective access control. Airtel adopts a tag-based approach for managing data security. Tags in DataHub serve as a mechanism to associate metadata with the data itself. Airtel leverages these tags to define and enforce tag-based access control policies in Apache Ranger.
This gives them precise control over who accesses specific data based on tags, such as PII classifications.
- Usability: Finally, making the data usable is a key focus. One challenge Airtel faces as an enterprise organization is empowering business users to perform exploratory data analysis. Given the vast volume of data, this essentially involves working with big data. A key gap Airtel needed to address was the SQL proficiency required for users to interact with the data effectively.
To bridge this gap, Airtel uses DataHub to provide business context through catalog enrichment. By integrating Elastic, another cornerstone of its data stack, Airtel applies large language models to this enriched metadata to create a knowledge graph. The knowledge graph processes and passes on the identified metadata, which is then used to generate SQL queries for users. These queries are designed for Oracle execution, enabling users to perform their exploratory data analysis.
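As a simplified illustration of that last step, here is a sketch that turns identified metadata into an Oracle-style SELECT statement. The metadata shape, table, and column names are hypothetical, and Airtel's knowledge-graph pipeline is of course far more sophisticated:

```python
# Simplified sketch: once relevant tables and columns have been identified
# from catalog metadata, generate a basic Oracle-compatible query for the
# user. (Hypothetical metadata shape and names, for illustration only.)

def generate_sql(metadata: dict, filters: dict) -> str:
    """Build a SELECT statement from identified metadata and user filters."""
    cols = ", ".join(metadata["columns"])
    # Oracle-style bind variables (:name) rather than inlined literals
    where = " AND ".join(f"{k} = :{k}" for k in sorted(filters))
    sql = f"SELECT {cols} FROM {metadata['table']}"
    if where:
        sql += f" WHERE {where}"
    return sql

meta = {"table": "SUBSCRIBER_USAGE", "columns": ["MSISDN", "DATA_MB", "USAGE_DATE"]}
print(generate_sql(meta, {"USAGE_DATE": "2024-01-01"}))
# → SELECT MSISDN, DATA_MB, USAGE_DATE FROM SUBSCRIBER_USAGE WHERE USAGE_DATE = :USAGE_DATE
```

The value for the business user is that the hard part, knowing which table and columns answer the question, is resolved from metadata rather than from SQL proficiency.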
Airtel’s use of DataHub shows what’s possible when you treat data as a product. They’ve created an ecosystem where data is easy to find, trust, and use.
3. Federated Governance
Here’s how DataHub serves as a centerpiece in Airtel’s data governance approach:
- Data Discovery and Lineage: DataHub helps users trace the flow of data across systems, ensuring transparency.
- Data Quality: While Airtel relies on custom solutions for quality checks, these are integrated into DataHub for centralized access.
- Access Control: As highlighted earlier, Apache Ranger works alongside DataHub to enforce role-based access policies, using metadata like PII tags to regulate permissions.
- Monitoring and Metrics: Governance also includes platform and data monitoring through the ELK stack alongside custom computational governance models.
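As a simplified illustration of the tag-based access control described above, here is a sketch in which DataHub-style tags on a dataset are checked against Ranger-style policies. The policy shape and role names are assumptions for illustration, not Airtel's actual configuration:

```python
# Minimal sketch of tag-based access control: metadata tags on a dataset
# (as they would appear in DataHub) are matched against policies keyed on
# those tags, in the spirit of Apache Ranger's tag-based policies.
# (Hypothetical policy shape and role names.)

POLICIES = {
    # tag -> roles allowed to access data carrying that tag
    "PII": {"data_steward", "compliance"},
    "FINANCE": {"finance_analyst", "data_steward"},
}

def can_access(user_roles: set, dataset_tags: set) -> bool:
    """Allow access only if, for every restricted tag on the dataset,
    the user holds at least one role permitted by that tag's policy."""
    for tag in dataset_tags:
        allowed = POLICIES.get(tag)
        if allowed is not None and not (user_roles & allowed):
            return False
    return True

tags = {"PII"}  # e.g. a PII classification applied in DataHub
print(can_access({"finance_analyst"}, tags))  # → False
print(can_access({"data_steward"}, tags))     # → True
```

The key design point is that the policy references the tag, not the individual dataset, so classifying a new dataset as PII in the catalog is enough to bring it under the existing security policy.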
4. A Self-Serve Platform at the Center
DataHub is at the core of Airtel’s decentralized data mesh, allowing teams to manage their data autonomously while maintaining consistency and governance across the organization.
- Centralized Metadata Management: DataHub serves as a hub for all metadata, making data easily discoverable, cataloged, and well-documented.
- Integration with Key Tools: DataHub connects seamlessly with DBT, Apache Ranger, and custom ingestion tools, streamlining workflows and ensuring data consistency.
- Support for Governance: DataHub helps enforce governance policies, such as data lineage, quality checks, and access control, ensuring data integrity and security.
- Empowerment through Self-Service: Teams independently manage their data products, reducing reliance on centralized data teams.
Wrapping Up
Airtel’s journey with DataHub is a great example of how embracing a data mesh can really shift the way an organization handles and shares data. It’s all about allowing teams to own their data while keeping things secure, consistent, and easy to navigate.
For more details, hear the story directly from the Airtel team.
If you’re working on something similar or are building something cool with DataHub, share your story with us so we can showcase your work to the amazing DataHub Slack community.