How Netflix is Collaborating with DataHub to Enhance its Extensibility

DataHub Community
Published in DataHub
4 min read · May 14, 2024

Netflix, the global entertainment powerhouse, has been at the forefront of leveraging data to enhance user experience and drive business decisions. In this blog post, we look at how Netflix’s Data Platform Team partnered with DataHub to build a truly self-serve data platform that optimizes data management and simplifies data discovery and governance.

In our January 2024 Town Hall, Ajoy Majumdar and Kevin Chun from Netflix’s Data Platform Team shared how they joined forces with us to unlock new dimensions of DataHub’s extensibility and harness the full potential of their data assets.

To truly understand the nature of this collaboration, we’ll need a quick look at Netflix’s current approach to its data platform architecture.

Netflix’s Approach to its Data Platform

The Netflix data platform architecture is structured into three fundamental layers:

  • Online Stores: Netflix’s online stores include Cassandra, CockroachDB, and more. They are unified through Netflix’s Data Gateway Abstraction Service. This service mesh handles the storage and security complexities, so application engineers simply declare a data storage type (e.g., key-value, graph) without having to think about the underlying data store.
  • Real-time Data Infrastructure: This layer facilitates the seamless flow of streaming data and houses essential tools for efficient data movement. From orchestrating data movement between various sources to managing event streams from devices, this infrastructure ensures the uninterrupted flow of critical data into the warehouse.
  • Big Data Warehouse: The warehouse tools serve as the interface for data analysis and decision-making. Leveraging technologies like Jupyter Notebooks, Spark, Presto, and more, these tools ensure efficient data processing and storage within the warehouse.

Figure: Netflix’s Data Platform Architecture (courtesy of Netflix’s Data Platform Team)
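
To make the Online Stores idea concrete, here is a minimal sketch of an abstraction in which application code asks for a storage type and never names the backend. Every name in it (`KeyValueStore`, `InMemoryBackend`, `DataGateway`, `provision`) is hypothetical and chosen for illustration; it is not Netflix’s Data Gateway API, and the in-memory backend merely stands in for a real store such as Cassandra.

```python
from abc import ABC, abstractmethod
from typing import Dict, Optional


class KeyValueStore(ABC):
    """The storage *type* an application requests (hypothetical interface)."""

    @abstractmethod
    def put(self, key: str, value: bytes) -> None: ...

    @abstractmethod
    def get(self, key: str) -> Optional[bytes]: ...


class InMemoryBackend(KeyValueStore):
    """Stand-in for a real backend; not an actual database driver."""

    def __init__(self) -> None:
        self._rows: Dict[str, bytes] = {}

    def put(self, key: str, value: bytes) -> None:
        self._rows[key] = value

    def get(self, key: str) -> Optional[bytes]:
        return self._rows.get(key)


class DataGateway:
    """Hypothetical gateway: maps a namespace to a backend picked by the platform team."""

    def __init__(self) -> None:
        self._namespaces: Dict[str, KeyValueStore] = {}

    def provision(self, namespace: str, backend: KeyValueStore) -> None:
        # Platform-side decision: which concrete store serves this namespace.
        self._namespaces[namespace] = backend

    def key_value(self, namespace: str) -> KeyValueStore:
        # Application-side call: only the storage type is visible here.
        return self._namespaces[namespace]


gateway = DataGateway()
gateway.provision("member-profiles", InMemoryBackend())

store = gateway.key_value("member-profiles")
store.put("user-1", b"en-US")
```

The point of the design is that swapping the backend behind `provision` would not change any application code that calls `key_value`.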

Netflix’s Journey Towards Building a Robust Data Catalog

Netflix’s journey to building a comprehensive data catalog began with Metacat, a cataloging solution within the Big Data Warehouse layer, aimed at:

  • Federating metadata across different data stores in the data warehouse layer.
  • Enabling users to enrich datasets with business context programmatically or through job executions.

While Metacat functioned well as a metadata aggregator within the Big Data Warehouse layer, the team kicked off an initiative to build a central data catalog service covering components from all three layers: Warehouse, Online Stores, and Real-Time.

Throughout this process, it became apparent that the burden of building and managing connectors fell upon the Data Platform Team rather than the teams responsible for the data itself. Additionally, there was a pressing need for a policy engine to enforce governance measures within the catalog.

“There was a need to evolve the product to become a self-serve platform to enable the relevant source system teams to define the asset or entity types, and start ingesting the data into the catalog,” shares Ajoy.

This marked a shift in the Netflix team’s strategy: the central data catalog needed to provide the expected catalog functionality while also serving as a central platform where source system teams could define asset or entity types and seamlessly integrate data into the central catalog.

A Deeper Look at Netflix’s Foundational Data Catalog Needs

In their quest for an ideal data catalog solution, Netflix identified three foundational capabilities that the data catalog must provide:

  • Scope for New Entity Types: Netflix’s data ecosystem called for entity types that are unique to Netflix, for instance a custom asset type to accommodate GraphQL schemas.
  • A Custom Ownership Model: The evolving ownership models for Netflix’s datasets required a custom ownership framework that offers finer granularity and better insight into data ownership.
  • Custom Properties: To align with privacy and legal standards, Netflix needed to define custom properties within the catalog that feed into its data governance processes. These properties, defined by Netflix’s privacy and legal teams based on glossaries relevant to their regulatory obligations, serve as guidelines for the terms under which data should be ingested into Netflix’s systems.
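
The three capabilities above can be pictured with a small, self-contained sketch. It is not DataHub’s metadata model or API; `EntityType`, `Asset`, `Catalog`, and the sample values (`graphql_schema`, `data_classification`, `schema_steward`) are all hypothetical names used only to show how a source team might declare a new entity type along with the properties and ownership roles it accepts.

```python
from dataclasses import dataclass, field
from typing import Dict, FrozenSet


@dataclass(frozen=True)
class EntityType:
    """A source-team-defined asset type (hypothetical shape)."""
    name: str
    allowed_properties: FrozenSet[str]  # custom properties, e.g. from legal/privacy glossaries
    ownership_types: FrozenSet[str]     # custom ownership roles


@dataclass
class Asset:
    entity_type: str
    asset_id: str
    properties: Dict[str, str] = field(default_factory=dict)
    owners: Dict[str, str] = field(default_factory=dict)  # owner -> ownership role


class Catalog:
    """Toy catalog that only accepts assets matching a registered entity type."""

    def __init__(self) -> None:
        self._types: Dict[str, EntityType] = {}
        self._assets: Dict[str, Asset] = {}

    def register_type(self, entity_type: EntityType) -> None:
        self._types[entity_type.name] = entity_type

    def ingest(self, asset: Asset) -> None:
        etype = self._types.get(asset.entity_type)
        if etype is None:
            raise ValueError(f"unknown entity type: {asset.entity_type}")
        unknown_props = set(asset.properties) - etype.allowed_properties
        if unknown_props:
            raise ValueError(f"undeclared properties: {sorted(unknown_props)}")
        bad_roles = set(asset.owners.values()) - etype.ownership_types
        if bad_roles:
            raise ValueError(f"undeclared ownership types: {sorted(bad_roles)}")
        self._assets[asset.asset_id] = asset

    def get(self, asset_id: str) -> Asset:
        return self._assets[asset_id]


catalog = Catalog()
catalog.register_type(EntityType(
    name="graphql_schema",
    allowed_properties=frozenset({"data_classification"}),
    ownership_types=frozenset({"schema_steward", "producer"}),
))
catalog.ingest(Asset(
    entity_type="graphql_schema",
    asset_id="graphql:member-profile",
    properties={"data_classification": "restricted"},
    owners={"team-a": "schema_steward"},
))
```

In this sketch the catalog rejects an asset whose properties or ownership roles were not declared on its entity type, which is one simple way to keep self-serve ingestion consistent.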

Leveraging DataHub’s Extensibility for a Tailored Data Catalog Solution for Netflix

Netflix evaluated several leading metadata management tools before selecting DataHub. “We were looking for a product that could be both a data catalog and a data platform,” explains Ajoy.

DataHub’s extensibility was at the core of its appeal, supported by its robust scalability and feature set, developer experience, and community support:

“DataHub gave us the extensibility features we needed to define new entity types easily and augment existing ones. During our evaluation, we assessed both functional and nonfunctional aspects, and DataHub performed exceptionally well in managing our traffic load and data volume.

“It offers a great developer experience, a well-documented taxonomy, and, very importantly, solid community support,” shares Ajoy.

What’s Next?

The Core DataHub Team is working with the Netflix Team to deliver the following features in our upcoming open-source release:

  • Strongly Typed Model-Driven OpenAPI: This approach leverages strongly typed models for robustness and consistency in API interactions. It streamlines development efforts and enhances interoperability across different systems.
  • Pluggable Validation and Mutation Processing: This enables the implementation of custom validation logic, ensuring data integrity and compliance with organizational standards. Additionally, support for mutation processing allows users to define custom data transformation pipelines tailored to their specific use cases.
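
As a rough illustration of what pluggable validation and mutation processing can look like, the sketch below registers plain functions as plugins and runs them over a metadata record. It is not DataHub’s plugin interface; `Pipeline`, `ValidationError`, `lowercase_owner`, and `require_owner` are hypothetical, and running mutators before validators is an assumption made for this example.

```python
from typing import Callable, Dict, List

Record = Dict[str, str]
Mutator = Callable[[Record], Record]   # returns a transformed copy of the record
Validator = Callable[[Record], None]   # raises ValidationError on a bad record


class ValidationError(Exception):
    """Raised by a validator plugin when a record breaks a rule."""


class Pipeline:
    """Hypothetical plugin pipeline: mutators run first, then validators."""

    def __init__(self) -> None:
        self._mutators: List[Mutator] = []
        self._validators: List[Validator] = []

    def add_mutator(self, mutator: Mutator) -> None:
        self._mutators.append(mutator)

    def add_validator(self, validator: Validator) -> None:
        self._validators.append(validator)

    def process(self, record: Record) -> Record:
        current = dict(record)  # never modify the caller's record
        for mutate in self._mutators:
            current = mutate(current)
        for validate in self._validators:
            validate(current)
        return current


def lowercase_owner(record: Record) -> Record:
    # Example mutation: normalize the owner field.
    return {**record, "owner": record.get("owner", "").lower()}


def require_owner(record: Record) -> None:
    # Example organizational rule: every record needs an owner.
    if not record.get("owner"):
        raise ValidationError("owner is required")


pipeline = Pipeline()
pipeline.add_mutator(lowercase_owner)
pipeline.add_validator(require_owner)

result = pipeline.process({"name": "playback_events", "owner": "Data-Platform"})
```

Because plugins are just callables with a fixed signature, a team could add its own rules without touching the pipeline itself.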

A huge shoutout to Ajoy, Kevin, and the amazing team at Netflix for collaborating with us to develop these extensibility features. We can’t wait to see how these capabilities empower more teams to innovate and adapt to changing data landscapes.

Dive deeper into the journey shared by Ajoy from the Netflix team and DataHub’s very own David Leifker here:


DataHub is an extensible metadata platform, enabling data discovery, data observability, and federated governance to tame the complexity of your data stack