Amazon DataZone: Automate Data Discovery

Overview

Reduce the time spent manually entering data attributes in the data catalog, a process that can also introduce errors. Generate business context and recommended analyses for datasets, which improves data discovery results. Understand where your data came from and which sources will be impacted by changes. More and richer data in the business data catalog also improves the search experience. Reduce the time you spend searching for and using data from weeks to days.

Key features

The Amazon DataZone business data catalog acts as a federated organizational registry where technical metadata can be published as assets, and you can add enriched business context. You can make data visible with business context for all your users to find, understand, and trust data quickly and easily.

Automate adding business descriptions and names to data, which helps you easily understand context and helps you avoid dealing with cryptic technical names. This automation is powered by large language models (LLMs) to increase accuracy and consistency. 

Faceted search works on top of the business data catalog to help data consumers and producers find data assets using familiar structural information, such as table and column names, as well as business terms.
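As a rough illustration of how faceted search combines free-text matching on names with filtering on business terms, here is a minimal in-memory sketch. It does not call Amazon DataZone; the `CatalogAsset` class, the `search` and `facet_counts` helpers, and the sample catalog are all hypothetical names made up for this example.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class CatalogAsset:
    """Hypothetical, simplified catalog entry (not the DataZone asset model)."""
    name: str
    columns: list[str]
    glossary_terms: list[str] = field(default_factory=list)
    asset_type: str = "table"


def search(assets, text="", facets=None):
    """Return assets that match the free text and every selected facet value."""
    facets = facets or {}
    needle = text.lower()
    results = []
    for asset in assets:
        # Free text is matched against table names, column names, and business terms.
        haystack = [asset.name, *asset.columns, *asset.glossary_terms]
        if needle and not any(needle in value.lower() for value in haystack):
            continue
        if "asset_type" in facets and asset.asset_type != facets["asset_type"]:
            continue
        if "glossary_term" in facets and facets["glossary_term"] not in asset.glossary_terms:
            continue
        results.append(asset)
    return results


def facet_counts(assets):
    """Count how many assets fall under each facet value."""
    return {
        "asset_type": Counter(a.asset_type for a in assets),
        "glossary_term": Counter(t for a in assets for t in a.glossary_terms),
    }


# Made-up sample data for illustration only.
CATALOG = [
    CatalogAsset("sales_orders", ["order_id", "cust_id"], ["Revenue", "Customer"]),
    CatalogAsset("crm_contacts", ["cust_id", "email"], ["Customer"]),
    CatalogAsset("web_clicks", ["session_id", "url"], ["Marketing"], asset_type="view"),
]
```

For example, `search(CATALOG, text="cust_id", facets={"glossary_term": "Revenue"})` narrows two column-name matches down to the one asset tagged with the selected business term, and `facet_counts` on a result set gives the counts a search page would show beside each facet.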

For each dataset, generate a list of the most valuable columns and the likely analytics uses. 

With data quality statistics in Amazon DataZone, data consumers can see data quality metrics from AWS Glue Data Quality or third-party systems. Data consumers can trust the data sources they use for decisions, and have data quality context as they search for assets. Producers and IT teams can also use APIs to incorporate data quality statistics from third-party systems into a unified, out-of-console portal. Data producers can bring in AWS Glue Data Quality results on a schedule so that the scores stay current, even as the data continues to change.
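To make the idea of a current data quality score concrete, the sketch below computes a score as the percentage of passing rules and picks the score from the most recent run. This is one simple way to define such a score and is an assumption, not the formula Amazon DataZone or AWS Glue Data Quality uses; `RuleResult`, `quality_score`, and `latest_score` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class RuleResult:
    """Hypothetical outcome of one data quality rule on one asset."""
    rule: str
    passed: bool


def quality_score(results):
    """Return the percentage of passing rules, or None when no rules ran."""
    if not results:
        return None
    passed = sum(1 for r in results if r.passed)
    return round(100.0 * passed / len(results), 1)


def latest_score(runs):
    """Return the score of the most recent run.

    `runs` maps an ISO 8601 UTC timestamp string to that run's rule results.
    Timestamps in one consistent format sort chronologically as plain strings.
    """
    if not runs:
        return None
    newest = max(runs)
    return quality_score(runs[newest])
```

Re-running the rules on a schedule and adding each run to `runs` keeps `latest_score` in step with the data as it changes.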

Understand the movement of data over time. Data lineage can raise trust and an organization’s data literacy by helping data consumers understand where data came from, how it changed, and how it is consumed. You can reduce the time spent mapping a data asset and its relationships, troubleshooting and developing pipelines, and asserting data governance practices.

Group data assets into defined packages (data products) tailored for specific business use cases to streamline cataloging and enable data consumers to easily discover and subscribe to the data. Data producers can curate a collection of relevant assets, add business context, and publish it as a data product unit. This simplifies the process for data consumers to locate all necessary data assets for particular use cases. Consumers can subscribe to all assets within a data product through a single approval workflow. Data producers can manage the product's lifecycle, including editing the asset collection, unpublishing, deleting it, and maintaining subscriptions. Amazon DataZone also offers API support for data product workflows, facilitating integration and automation.
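The data product lifecycle described above can be pictured with a small in-memory model: a producer curates assets, publishes the product, and one approval gives a consumer access to every asset in it. This is only a sketch of the workflow, not the Amazon DataZone API; the `DataProduct` class, its methods, and the status strings are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class DataProduct:
    """Hypothetical model of a data product and its subscriptions."""
    name: str
    assets: list[str]
    published: bool = False
    # Maps a consumer name to "PENDING" or "APPROVED".
    subscriptions: dict[str, str] = field(default_factory=dict)

    def publish(self):
        self.published = True

    def request_subscription(self, consumer):
        # Consumers can only subscribe to products that are published.
        if not self.published:
            raise ValueError("data product is not published")
        self.subscriptions[consumer] = "PENDING"

    def approve(self, consumer):
        # A single approval covers the whole product, not each asset.
        if self.subscriptions.get(consumer) != "PENDING":
            raise ValueError("no pending request for this consumer")
        self.subscriptions[consumer] = "APPROVED"

    def accessible_assets(self, consumer):
        """Return all assets for an approved consumer, otherwise nothing."""
        if self.subscriptions.get(consumer) == "APPROVED":
            return list(self.assets)
        return []
```

Because access is derived from the product's current asset list, editing that list changes what every approved subscriber can reach without a new approval per asset.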

Use cases

Reduce your time to insights by finding the right data, in the right context. Data can be trusted only when it is consistent, accurate, complete, timely, traceable, and has a transparent data quality score. With distributed ownership, each department or the analytics team maintains the fidelity of assets so that data consumers know that they are using the right data.

Build a business data catalog by crawling your assets and bringing in the technical metadata (not the actual data) to enrich with business context. The business context can be enriched with standardized glossaries and terms. You can also customize additional metadata with metadata forms.

Using the right data requires understanding the data context. Amazon DataZone helps build that context for all cataloged data by using glossaries and metadata forms. The data owner can share as much information as possible to set the data context, so that the data consumer can find, understand, and then subscribe to data. The data quality score helps data consumers understand whether a data asset is fit for purpose.

Reduce the time spent mapping data assets and their relationships, troubleshooting and developing pipelines, and asserting data governance practices. Through a graphical experience, data consumers understand the asset’s origin. Data producers can assess the effect of changes on a table or column by understanding which systems or data consumers use the data (impact analysis). Data producers can also troubleshoot data issues by reviewing snapshots of a data asset’s lineage to spot the error source. Amazon DataZone visualizes data lineage captured from events in OpenLineage, an open standard for lineage collection, and can also capture custom lineage mappings. Data producers can include lineage when they share data, which increases trust in the data sources.
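As a sketch of what lineage events and impact analysis look like in practice, the code below builds minimal OpenLineage-style run events as plain dictionaries and then walks them to find everything downstream of a dataset. The field names follow the OpenLineage RunEvent structure; the namespace, producer URL, dataset names, and the pinned `schemaURL` version are assumptions for illustration, and sending the events to Amazon DataZone is left out.

```python
import json
import uuid
from datetime import datetime, timezone


def build_run_event(job_name, inputs, outputs, namespace="example-namespace"):
    """Build a minimal OpenLineage-style RunEvent as a plain dict."""
    def dataset(name):
        return {"namespace": namespace, "name": name}

    return {
        "eventType": "COMPLETE",
        "eventTime": datetime.now(timezone.utc).isoformat(),
        # Hypothetical producer URI; a real pipeline would identify itself here.
        "producer": "https://example.com/hypothetical-pipeline",
        # Assumed spec version; use the version your tooling targets.
        "schemaURL": "https://openlineage.io/spec/2-0-2/OpenLineage.json#/$defs/RunEvent",
        "run": {"runId": str(uuid.uuid4())},
        "job": {"namespace": namespace, "name": job_name},
        "inputs": [dataset(n) for n in inputs],
        "outputs": [dataset(n) for n in outputs],
    }


def downstream(events, start):
    """Return every dataset reachable downstream of `start` (impact analysis)."""
    # Each event contributes edges from each of its inputs to each of its outputs.
    edges = {}
    for event in events:
        for source in event["inputs"]:
            targets = edges.setdefault(source["name"], set())
            targets.update(out["name"] for out in event["outputs"])
    seen, stack = set(), [start]
    while stack:
        for nxt in edges.get(stack.pop(), ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


# Made-up two-step pipeline: raw_orders -> sales_orders -> daily_revenue.
EVENTS = [
    build_run_event("load_orders", ["raw_orders"], ["sales_orders"]),
    build_run_event("aggregate_revenue", ["sales_orders"], ["daily_revenue"]),
]
```

Calling `downstream(EVENTS, "raw_orders")` returns both `sales_orders` and `daily_revenue`, which is the set a producer would review before changing the source table. `json.dumps(EVENTS[0])` gives the serialized form of one event.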

Videos

AWS re:Invent 2023 - How to build a business catalog with Amazon DataZone (21:37)
AWS re:Invent 2023 - Understand your data with business context (55:40)

FAQs

What kind of information is in the Amazon DataZone business data catalog?

In the Amazon DataZone business data catalog, business metadata provides information authored or used by business people and gives context to organizational data. This could include the following information:

  • Ownership: Modern data-centric organizations employ a distributed data stewardship process where lines of business (LOBs) are responsible for managing their own data. A catalog tracks that ownership so interested parties can find and request access to data as part of their business tasks.
  • Classification: Data discovery is a key task that business metadata can support. Data discovery uses centrally defined corporate ontologies and taxonomies to classify data sources and helps you find relevant data objects.
  • Relationships: You can use the Amazon DataZone business data catalog to add relationship information as metadata. As with a technical dataset schema, the business data catalog shows relationships between objects in the catalog, such as those between databases, datasets, and their columns.
  • Schema: AI recommendations can use the technical and business schema to generate recommended descriptions and usage for data.
  • Origin and consumption: Data lineage and impact analysis, as well as custom mappings from OpenLineage, are linked in the business data catalog.

What can I catalog with Amazon DataZone?

Amazon DataZone supports data assets published directly from the AWS Glue Data Catalog and Amazon Redshift. These two sources can be used to catalog data in the following locations:

  • Amazon Simple Storage Service (Amazon S3) data lakes
  • Many AWS purpose-built databases, such as Amazon Relational Database Service (Amazon RDS), through an AWS Glue crawler
  • More than 100 Amazon AppFlow connectors, to bring in data from third-party applications like Snowflake, Salesforce, and Google Analytics