What is data governance?
Data governance is a methodology that ensures data is in the proper condition to support business initiatives and operations. Aligning data governance to business initiatives has many benefits. It can:
- Justify funding for the data governance program
- Motivate participation by the business communities
- Drive the priority of data governance activities
- Drive the level of data integration required across participating business areas
- Help to determine the right operating model, especially the level of centralization and decentralization required.
What is analytics governance?
Analytics governance covers both governing data for use in analytics applications and governing the usage of analytics systems. Your analytics governance team can establish governance mechanisms, such as analytics report versioning and documentation. As always, keep track of regulatory requirements, establish company policy, and provide guardrails to the broader organization.
Why is data governance important?
According to Gartner, through 2025, 80% of organizations seeking to scale digital business will fail because they do not take a modern approach to data and analytics governance. It’s no wonder that Chief Data Officers identify data governance as a top priority for their data initiatives. In a 2023 survey of 350 CDOs and CDO-equivalent roles, MIT CDOIQ found that 45% of Chief Data Officers identify data governance as a top priority. These data leaders are looking to put a governance model in place that lets them make data available to the right people and applications when they need it – while keeping the data safe and secure, with appropriate controls in place.
Governance has historically been employed to lock down data in silos, with the goal of preventing data leakage or misuse. However, the consequence of data silos is that legitimate users must navigate barriers to get access to data when they need it. Inadvertently, data-driven innovation gets stifled.
You have two levers to make governance an enabler of innovation: access and control. The key to success is finding the right balance between access and control – and the balancing point is different for each organization. When you exercise too much control, the data gets locked up in silos and users are not able to access the data when they need it. This not only stifles creativity, but also leads to the creation of shadow IT systems that leave data out of date and unsecured. On the other hand, when you provide too much access, data ends up in applications and data stores that increase the risk of data leakage.
Establishing the right governance – one that balances access and control – gives people trust and confidence in the data by promoting appropriate discovery, curation, protection, and sharing of data. This encourages innovation, while safeguarding the data.
What is machine learning (ML) governance?
ML governance applies many of the same data governance practices to ML. Data quality and data integration need to provide the data required for model training and production deployment (feature stores are one important aspect of this). Responsible artificial intelligence (AI) means paying special attention to how sensitive data is used to build models. Additional ML governance capabilities include enabling people to participate in model building, deployment, and monitoring; documenting model training, versioning, and supported use cases; guiding ethical model use; and monitoring the model in production for accuracy, drift, overfitting, and underfitting.
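Monitoring a production model for drift can be illustrated with a small sketch. The example below computes a Population Stability Index (PSI) between a baseline sample of one numeric feature and a more recent sample. PSI is one common drift measure chosen here for illustration; the function name, bin count, and smoothing floor are assumptions, and both samples are assumed to be non-empty.

```python
import math
from typing import Sequence


def population_stability_index(
    baseline: Sequence[float],
    current: Sequence[float],
    bins: int = 10,
) -> float:
    """Compare two samples of one numeric feature.

    Returns 0.0 when the binned distributions are identical;
    larger values indicate a larger shift.
    """
    lo, hi = min(baseline), max(baseline)
    # Fall back to a width of 1.0 when the baseline feature is constant.
    width = (hi - lo) / bins or 1.0

    def bin_shares(values):
        counts = [0] * bins
        for v in values:
            # Values outside the baseline range are clamped into the edge bins.
            idx = min(max(int((v - lo) / width), 0), bins - 1)
            counts[idx] += 1
        # A small floor (an assumption) keeps the logarithm defined for empty bins.
        return [max(c / len(values), 1e-6) for c in counts]

    expected = bin_shares(baseline)
    actual = bin_shares(current)
    return sum((a - e) * math.log(a / e) for a, e in zip(actual, expected))
```

A governance team would typically compute a measure like this on a schedule for each monitored feature and alert when it crosses a threshold agreed with the model owners. The threshold itself is a policy decision and is not prescribed here.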
Generative AI requires additional data governance capabilities, such as ensuring the quality and integrity of the data used to adapt foundation models for training and inference, governing generative AI toxicity and bias, and managing foundation model (FM) operations, or FMOps.
You can support AI/ML with the same data governance program. Data preparation is necessary to transform data into a form that AI/ML models can use for training and production inference—but the most efficient data preparation is the preparation you don’t have to do. Data scientists spend too much time preparing data for each use case—your data governance team can help alleviate this undifferentiated heavy lifting. In addition, data governance can oversee the creation of shaped feature stores to be used across AI and ML use cases.
Finally, sensitive data needs to be protected appropriately, so your team can mitigate the risks of sensitive data being used to train the foundation models.
Much like analytics in general, you have to govern the use of AI/ML models that you build or customize. Ideally, this should be closely associated with analytics governance, because that function will know how to support various business areas.
What are the main challenges of data governance?
The most common strategic challenge for data governance is to align your program to business initiatives instead of proposing the value of data governance directly. For example, you might propose the value of making it easier for end users to find the data they’re looking for, or you might propose the value of resolving data quality issues. But these are solutions in search of a problem. If you do it this way, you’ll end up competing for funding and sponsorship with business initiatives you should be supporting. Instead, position data governance to support business initiatives. Every major business initiative requires data. Data governance should ensure that the data is in the right condition to support business initiative success. Don’t overlook reporting and auditing practices for how data governance supports these initiatives.
Another common strategic challenge for data governance is to avoid applying data governance too narrowly. A too-narrow definition could be aligning the program with individual business areas or use cases without taking a wider view across business areas. A narrow definition could also mean defining data governance by only one or two data governance capabilities. For example, having a data catalog does not constitute a data governance program.
What are styles of data governance?
Your data governance program should balance centralization and decentralization (including self-service). Throughout your organization, you’ll have a mix of centralized, federated, and decentralized governance—again, depending on the business requirements. You should empower domain teams as much as possible while maintaining coherence across domains (such as the ability to link data together).
- Centralized data governance: Central organizations are ultimately responsible for mission statements, policies, tool choices, and more. The day-to-day actions are often pushed into lines of business (LOBs).
- Federated data governance: Federated data governance empowers individual business units or business initiatives to operate in the way that best matches their needs. With federated data governance, there is still a smaller centralized team that focuses on the problems that recur most frequently, such as providing enterprise-wide data quality tools.
- Self-serve or decentralized data governance: Each LOB does what it needs for its specific project. Each project reuses tools or processes from other projects where they are fit for use. As topics like data mesh (itself decentralized) increase in popularity, so does self-service data governance.
Who builds data governance?
Building a business-centric data governance program requires many job functions.
- Executive sponsors understand many business initiatives on the corporate roadmap, and can help determine priorities for data governance support.
- Data stewards are from the business and are involved in the details of projects day to day. They help identify the data issues that are likely to cause challenges with targeted business initiatives.
- Data owners make policies about the data, including who should have access to the data and under what circumstances, how to interpret and apply regulations, and key term definitions.
- Data engineers are from IT (usually), and provide tools that help secure data, manage data quality, integrate data from a variety of sources, and find the right data.
How does data governance work?
Data governance requires people, process, and technology solutions across a range of capabilities.
Curate data at scale to limit data sprawl. Curating your data at scale means identifying and managing your most valuable data sources, including databases, data lakes, and data warehouses, so you can limit the proliferation and transformation of critical data assets. Curating data also means ensuring that the right data is accurate, fresh, and free of sensitive information so users can have confidence in data-driven decisions and in the data feeding applications.
Capabilities: data quality management, data integration, and master data management
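The three properties named above (accurate, fresh, and free of sensitive information) can be expressed as simple automated checks. The following is a minimal sketch, assuming each record is a dictionary with a timezone-aware timestamp field. The function name, the field names, and the email pattern used as a stand-in for sensitive data detection are all hypothetical, and completeness is used here as a simple proxy for accuracy.

```python
import re
from datetime import datetime, timedelta, timezone

# Hypothetical pattern for one kind of sensitive value (email addresses).
EMAIL_PATTERN = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")


def quality_report(rows, required_fields, timestamp_field, max_age, now=None):
    """Count rows that fail completeness, freshness, or sensitivity checks."""
    now = now or datetime.now(timezone.utc)
    report = {"rows": len(rows), "incomplete": 0, "stale": 0, "sensitive": 0}
    for row in rows:
        # Completeness: every required field must be present and non-empty.
        if any(row.get(field) in (None, "") for field in required_fields):
            report["incomplete"] += 1
        # Freshness: the record must have been updated within max_age.
        updated = row.get(timestamp_field)
        if updated is None or now - updated > max_age:
            report["stale"] += 1
        # Sensitivity: flag rows where any string value looks like an email.
        if any(isinstance(v, str) and EMAIL_PATTERN.search(v) for v in row.values()):
            report["sensitive"] += 1
    return report
```

In practice these rules would be defined by data stewards and data owners and run by a managed data quality tool. The sketch only shows the shape of the checks.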
Discover and understand your data in context to accelerate data-driven decisions. Understanding your data in context means that all users can discover and comprehend the meaning of their data so they can use it confidently to drive business value. With a centralized data catalog, data can be found easily, access can be requested, and data can be used to make business decisions.
Capabilities: data profiling, data lineage, and data catalogs
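Data profiling, the first capability listed above, produces the summary statistics that a catalog entry can display next to a dataset. The following is a minimal sketch of profiling a single column. The function name and the choice of statistics are assumptions rather than the output of any particular catalog product.

```python
from collections import Counter


def profile_column(values):
    """Summarize one column: row count, null share, distinct count, top value."""
    non_null = [v for v in values if v is not None]
    counts = Counter(non_null)
    return {
        "rows": len(values),
        "null_fraction": (len(values) - len(non_null)) / len(values) if values else 0.0,
        "distinct": len(counts),
        "most_common": counts.most_common(1)[0][0] if counts else None,
    }
```

For example, `profile_column(["US", "DE", "US", None])` reports four rows, a null fraction of 0.25, two distinct values, and `"US"` as the most common value.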
Protect and securely share your data with control and confidence. Protecting your data means being able to strike the right balance between data privacy, security, and access. It’s essential to be able to govern data access across organizational boundaries, with tools that are intuitive for both business and engineering users.
Capabilities: data lifecycle, data compliance, and data security
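The balance between access and protection can be made concrete with a small policy check. The sketch below assumes a hypothetical `AccessPolicy` record that a data owner maintains for each dataset: roles on the list may read the dataset, but columns marked as masked are withheld. It illustrates the idea only and does not represent the permission model of any specific service.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AccessPolicy:
    """Hypothetical per-dataset policy set by the data owner."""
    dataset: str
    allowed_roles: frozenset
    masked_columns: frozenset = frozenset()


def authorize(policy, role, requested_columns):
    """Return the columns the role may read; an empty list means access is denied."""
    if role not in policy.allowed_roles:
        return []
    return [c for c in requested_columns if c not in policy.masked_columns]
```

With a policy such as `AccessPolicy("orders", frozenset({"analyst"}), frozenset({"email"}))`, an analyst requesting `["order_id", "email"]` receives only `["order_id"]`, and any role not on the list receives nothing.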
Reduce business risk and improve regulatory compliance. Reducing risk means understanding how data is being used and by whom. AWS services help you monitor and audit data access (including access through ML models) to help ensure data security and regulatory compliance. Machine learning also requires auditing transparency to ensure responsible use and simplified reporting.
Capabilities: usage auditing for data and ML
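Usage auditing comes down to answering who accessed which data. The following is a minimal sketch that summarizes access events by dataset and principal. The event shape (dictionaries with `dataset` and `principal` keys) is an assumption made for illustration and is not the log format of any AWS service.

```python
from collections import defaultdict


def summarize_access(events):
    """Group access events into {dataset: {principal: access_count}}."""
    summary = defaultdict(lambda: defaultdict(int))
    for event in events:
        summary[event["dataset"]][event["principal"]] += 1
    # Convert nested defaultdicts to plain dicts for reporting.
    return {dataset: dict(principals) for dataset, principals in summary.items()}
```

The same grouping applies when the principal is an ML model or training job rather than a person, which is how access through models can be included in the audit.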
How can you make your data governance teams better?
The key to an effective data governance program is to attach to already-funded business initiatives. Make sure your team understands which data domains, sources, and elements are needed to support those initiatives.
- Build a data governance roadmap that shows support for targeted business initiatives. Then start to identify data overlap between chosen business initiatives.
- Identify applications and business intelligence use cases that the data needs to support and feed, including requirements for freshness and privacy.
- Understand what fit-for-purpose data looks like for each chosen business initiative.
- Sustain and expand the data governance program by embedding it in the enterprise operating model, so data planning and implementation becomes a natural part of the operation of the organization.
- Organize the analytics community for self-service and consistency.
- Support artificial intelligence (AI) and machine learning (ML) with data governance and ML governance. Use the same data governance program, but extend to feature stores and ML models.
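Identifying data overlap between chosen business initiatives, as suggested in the first step above, is a set intersection problem. The sketch below assumes that each initiative has already been mapped to the data domains it needs. The function name, initiative names, and domain names are hypothetical.

```python
from itertools import combinations


def shared_domains(initiatives):
    """Map each pair of initiatives to the data domains they both need."""
    overlaps = {}
    # Sort by initiative name so the pair keys are deterministic.
    for (a, domains_a), (b, domains_b) in combinations(sorted(initiatives.items()), 2):
        common = set(domains_a) & set(domains_b)
        if common:
            overlaps[(a, b)] = common
    return overlaps


initiatives = {
    "churn reduction": {"customer", "orders"},
    "fraud detection": {"orders", "payments"},
    "store expansion": {"locations"},
}
overlap = shared_domains(initiatives)
```

Here `overlap` contains a single entry showing that "churn reduction" and "fraud detection" both depend on the "orders" domain. Domains that appear in several initiatives are natural candidates to prioritize on the roadmap.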
What are the AWS offerings for data governance?
With end-to-end data governance on AWS, organizations have control over where their data sits, who has access to it, and what can be done with it at every step of the data workflow. Data governance with AWS helps organizations accelerate data-driven decisions by making it easy for the right people and applications to securely and safely find, access, and share the right data when they need it. You can curate data by automating data integration and data quality to limit the proliferation of data. You can discover and understand your data with centralized catalogs that boost data literacy. You can protect your data with precise permissions that let you share data with confidence. You can reduce risk and improve regulatory compliance by monitoring and auditing data access.
- Amazon DataZone – unlock data across organizational boundaries with built-in governance
- AWS Glue – discover, prepare, and integrate all your data at any scale
- AWS Lake Formation – build, manage, and secure data lakes in days
- Amazon QuickSight – unified business intelligence at hyperscale
- Amazon SageMaker – build, train, and deploy machine learning models for use cases with fully managed infrastructure, tools, and workflows
- ML governance web page
- Amazon Bedrock – build and scale generative AI applications with foundation models (FMs)
- Amazon Macie – discover and protect sensitive data at scale
- Amazon Simple Storage Service (Amazon S3) access points – object storage built to retrieve any amount of data from anywhere
- AWS Data Exchange – easily find, subscribe to, and use third-party data in the cloud
- AWS Clean Rooms – create clean rooms in minutes to collaborate with your partners without sharing raw data
Get started with Data Governance on AWS by creating a free account today.