In this blog post, we explore data governance as a necessary component of responsible AI governance, and outline existing best practices. Effective data governance can increase input data quality and help prevent issues that might otherwise emerge later on in the AI development process.
Executive Summary
Don’t forget about the data! Data governance is critical because datasets form the foundation upon which AI systems are built. When problems with datasets are not addressed early on, they cause data cascades: issues that propagate downstream throughout an AI system. Fortunately, an emerging set of data governance practices can help companies increase the quality of their input data and reduce data cascades. Specifically, companies should consider dataset documentation as a tool to combat data cascades. This post outlines two primary dataset documentation frameworks: the first helps decrease opacity surrounding existing datasets, and the second guides the development of new datasets. Both work to mitigate data cascades; however, it would be naive to rely on dataset documentation alone, and additional strategies for data governance must also be considered.
Introduction
When considering Responsible Artificial Intelligence (RAI) governance, it is important to adopt the perspective of AI “system” governance rather than the more limited AI “model” governance. System governance begins at the earliest stages of problem scoping for a new AI system, while model governance is isolated to the model training and testing process. The Credo AI Governance Platform takes a comprehensive system approach to AI governance, including governance of the datasets an AI system is built on, a critical piece of overarching AI system governance.
In machine learning, data work has historically been undervalued, leading to widespread use of low quality datasets in AI development. Increasingly, scholars have taken note, proposing new strategies for data governance to improve the quality of data work for machine learning.
This post presents two key strategies for enterprises based on a review of data governance literature:
- Documentation is an effective strategy for data governance, increasing communication and accountability throughout an AI system’s development pipeline. A growing body of research outlines a set of best practices for dataset documentation. One proposal, by Gebru et al. in their 2021 paper “Datasheets for Datasets,” encourages all datasets to be accompanied by “Datasheets.” Datasheets can be leveraged by developers evaluating whether to use an existing dataset, and as a tool for increasing cross-organizational communication. Next, Hutchinson et al., in their 2021 paper “Towards Accountability for Machine Learning Datasets,” propose a framework that builds on software development practices and can be leveraged by companies to guide internal processes related to dataset conception, creation, testing, and maintenance.
- The most effective forms of data governance are not limited to documentation alone, but include a culture that values data work and auditing of an entire AI system. A number of other strategies can be employed to promote data governance, including the intentional development of a company culture that values data work, as well as efforts to ensure that audits evaluate entire AI systems (rather than a model in isolation). A growing ecosystem of vendors offers tooling that companies can use to build out their data governance infrastructure.
The “How to” for Data Governance: Pay Attention to Documentation
In computer science, there is a popular refrain: “garbage in, garbage out.” Low quality input data produces equally low quality outputs. Nowhere is this observation more true than in the world of AI. Datasets are the foundation upon which AI models are built, and low quality datasets degrade the outputs of the models trained on them.
In fact, there’s even a term for this: “data cascades,” compounding negative events caused by issues with dataset quality (Sambasivan et al., 2021). Although data cascades are widespread, they are also preventable.
Dataset issues often emerge when developers make use of existing datasets that are poorly documented and maintained. In other cases, where datasets are developed from scratch, a lack of intention in the development process can lead to problems. Fortunately, emerging strategies seek to prevent data cascades.
Datasheets for Datasets: External Documentation Practices
When a developer begins working with an existing dataset that is new to them, it is important for them to understand:
- the purpose for which a dataset was created;
- the process by which it was developed; and,
- whether or not the dataset is actively maintained.
Historically, a key issue in dataset development has been a lack of standardized documentation processes to address these questions. Such standardized documentation is important both at the individual and enterprise levels:
- For individual dataset creators, documentation can serve as a reflective process.
- For individual dataset consumers, documentation ensures transparency and provides critical information about whether and how a dataset should be used.
- For enterprises, clear documentation is particularly important when proprietary datasets are being developed for and used in cross-organizational collaboration.
To address this lack of standardized dataset documentation processes, researchers have borrowed documentation practices from the electronics industry to propose “Datasheets.” Similar to “model cards,” Datasheets provide a set of questions designed to encourage documentation (the list below is a summary; the full set of questions is outlined in the aforementioned Gebru et al. 2021 paper, and a machine-readable sketch follows the list):
- Motivation. Why was a dataset created? Who created the dataset? How was its creation funded?
- Composition. Is the dataset comprehensive, or a sampling of instances from a larger set? What errors or noise might exist in the dataset? Does the dataset contain potentially confidential information? Does the dataset identify individuals, either indirectly or directly?
- Collection. How was the data acquired? What mechanisms were used for dataset collection (hardware, software, manual efforts), and how were those mechanisms validated? Over what timeframe was the data collected? If the dataset relates to people, were individuals notified about data collection? Has a data protection impact assessment been conducted?
- Pre-processing. Did any pre-processing/cleaning/labeling of the data occur? If so, is the original raw data still accessible?
- Uses. Has the dataset already been used for any tasks? What (other) tasks might it be appropriate for? Are there tasks for which the dataset should not be used?
- Distribution. Will the dataset be distributed to third parties outside of the company? If so: how, when, and what sort of copyright and IP licensing considerations have been made?
- Maintenance. Who is the active manager of the dataset? Will the dataset be updated and if so, how often?
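Datasheets are usually authored as prose documents, but teams sometimes also capture the answers in a machine-readable form so the documentation is versioned together with the data it describes. Below is a minimal, hypothetical sketch in Python; the field names and example answers are illustrative assumptions, not part of the Gebru et al. template.

```python
# Hypothetical sketch of a machine-readable datasheet. Field names and
# example values are illustrative; the authoritative question set is in
# Gebru et al. (2021), "Datasheets for Datasets."
import json
from dataclasses import asdict, dataclass, field

@dataclass
class Datasheet:
    motivation: dict = field(default_factory=dict)
    composition: dict = field(default_factory=dict)
    collection: dict = field(default_factory=dict)
    preprocessing: dict = field(default_factory=dict)
    uses: dict = field(default_factory=dict)
    distribution: dict = field(default_factory=dict)
    maintenance: dict = field(default_factory=dict)

sheet = Datasheet(
    motivation={
        "purpose": "Benchmark support-ticket triage models.",
        "creators": "Internal data engineering team",
        "funding": "Internal",
    },
    composition={
        "sampling": "Uniform 10% sample of 2022 tickets",
        "known_noise": "Free-text fields contain OCR artifacts",
        "contains_pii": True,
    },
    maintenance={
        "owner": "data-platform team",
        "update_cadence": "quarterly",
    },
)

# Store the datasheet next to the data so both are versioned together.
with open("datasheet.json", "w") as f:
    json.dump(asdict(sheet), f, indent=2)
```

A consumer who pulls the dataset then pulls its provenance along with it, and any empty sections make documentation gaps immediately visible.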
A Framework for Internal Dataset Development Processes
While Datasheets can be used to encourage intention in dataset development, these artifacts are often built to serve external documentation needs once a dataset has already been created. The aforementioned paper by Hutchinson et al. outlines documentation practices that can facilitate communication and decision-making at each stage of the dataset development process. Although this framework can also serve external documentation and auditing purposes, in the enterprise context it primarily provides structure for internal dataset development processes.
According to the framework, at a dataset’s conception, there are two important considerations to be made:
- First, a developer must consider what they aim to achieve in developing a new dataset. A Data Requirements Specification document provides detailed information on intended and unintended uses, as well as creation, distribution, performance, and maintenance requirements.
- Second, once a developer has established “what” they want to create, they must then consider “how” they will go about the process of creation. A Dataset Design Document does just this by providing a plan for meeting the specification requirements. This document should also include justifications for how trade-offs were navigated and what assumptions were made (a sketch of both documents follows this list).
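One lightweight way to operationalize these two artifacts is to keep them as structured files in the dataset’s repository. The schemas below are hypothetical sketches with assumed requirements; the Hutchinson et al. paper provides the full templates.

```python
# Hypothetical sketches of a Data Requirements Specification and a
# Dataset Design Document, versioned alongside the dataset they govern.
import json

requirements_spec = {
    "intended_uses": ["training a support-ticket triage classifier"],
    "unintended_uses": ["evaluating individual employee performance"],
    "creation": {"sources": ["internal ticket system"], "min_rows": 100_000},
    "performance": {"max_label_missingness": 0.02},
    "maintenance": {"review_cadence_days": 90},
}

design_doc = {
    "plan": "Sample 10% of 2022 tickets; label with two-pass annotation.",
    "tradeoffs": ["Two-pass annotation doubles cost but reduces label noise."],
    "assumptions": ["Ticket volume is roughly stationary within a quarter."],
}

# Committing these files with the data keeps the "what" and the "how"
# reviewable at every later stage of the dataset's life.
for name, doc in [("requirements_spec", requirements_spec),
                  ("design_doc", design_doc)]:
    with open(f"{name}.json", "w") as f:
        json.dump(doc, f, indent=2)
```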
During the actual process of dataset creation, dataset developers are encouraged to track the details of implementation in a Dataset Implementation Diary. Just as it is best practice in software development to provide comments explaining code, dataset developers should leave fine-grained documentation that is tightly coupled with the dataset. Issue-tracking systems (e.g., on GitHub and other software platforms) offer one option for tracking and storing this information. The Credo AI Platform provides a centralized repository for documentation for your AI systems, so you can make this information available to all relevant stakeholders throughout the AI development lifecycle.
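A diary entry can be as simple as a timestamped record appended to a file that lives next to the dataset. The helper below is a hypothetical illustration, not a prescribed format:

```python
# Hypothetical sketch: append timestamped implementation-diary entries
# to a newline-delimited JSON file stored next to the dataset.
import json
from datetime import datetime, timezone

def log_diary_entry(path: str, author: str, note: str) -> None:
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "author": author,
        "note": note,
    }
    with open(path, "a") as f:
        f.write(json.dumps(entry) + "\n")

log_diary_entry(
    "dataset_diary.jsonl",
    author="alice",
    note="Dropped rows with empty ticket bodies; rationale in issue tracker.",
)
```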
Once a dataset has been created, it should be evaluated for fitness for use. A Dataset Testing Report summarizes the evaluations performed and their results. Such a report should include both requirements testing and adversarial testing. Requirements testing answers the question: “does the dataset meet the specifications for which it was designed?” Adversarial testing is less straightforward; it stress-tests a dataset for unexpected harms that may emerge once it is put to use. Testers may evaluate specific risks such as privacy violations, data omissions, and the potential for the dataset to be misused.
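Requirements testing, in particular, lends itself to automation: checks can be derived directly from the Data Requirements Specification and run like unit tests before a dataset is released. Below is a minimal pandas-based sketch that reuses the hypothetical spec from earlier (a 100,000-row minimum and a 2% label-missingness budget); adversarial testing, by contrast, typically requires targeted manual probing.

```python
# Hypothetical requirements tests derived from the assumed spec above.
import pandas as pd

def check_requirements(df: pd.DataFrame) -> list[str]:
    """Return human-readable descriptions of any requirement violations."""
    failures = []
    if len(df) < 100_000:
        failures.append(f"Row count {len(df):,} is below the 100,000 minimum.")
    missing = df["label"].isna().mean()
    if missing > 0.02:
        failures.append(f"Label missingness {missing:.1%} exceeds the 2% budget.")
    if "ssn" in df.columns:  # crude PII screen; real checks go much deeper
        failures.append("Dataset contains a prohibited PII column.")
    return failures

df = pd.read_parquet("tickets.parquet")
report = check_requirements(df)
print("\n".join(report) if report else "All requirement checks passed.")
```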
Often, significant energy is put into the development of a new dataset without resources set aside for ongoing upkeep. To mitigate the issues that emerge in the absence of maintenance, once a dataset has passed requisite testing, developers must document a plan for ongoing upkeep in a Dataset Maintenance Plan. Three primary areas of maintenance should be included. A plan for corrective maintenance provides the process for addressing errors that are discovered once a dataset is in use (e.g., addressing data poisoning). Adaptive maintenance aims to ensure that critical qualities of a dataset are preserved in an environment that may be constantly changing. Finally, preventive maintenance works to preempt issues that may occur in the future.
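Parts of adaptive maintenance can also be automated: a scheduled job can compare newly collected data against a reference snapshot and escalate drift for human review. The sketch below uses a Population Stability Index check on a single numeric feature; the synthetic data and the 0.2 threshold (a common rule of thumb) are assumptions for illustration.

```python
# Hypothetical adaptive-maintenance check: flag distribution drift
# between a reference snapshot and newly collected data.
import numpy as np

def psi(reference: np.ndarray, current: np.ndarray, bins: int = 10) -> float:
    """Population Stability Index between two samples of one feature."""
    edges = np.histogram_bin_edges(reference, bins=bins)
    ref_pct = np.histogram(reference, bins=edges)[0] / len(reference)
    cur_pct = np.histogram(current, bins=edges)[0] / len(current)
    ref_pct = np.clip(ref_pct, 1e-6, None)  # avoid log(0) and division by zero
    cur_pct = np.clip(cur_pct, 1e-6, None)
    return float(np.sum((cur_pct - ref_pct) * np.log(cur_pct / ref_pct)))

# Synthetic stand-ins for a feature at snapshot time and in fresh data.
reference = np.random.default_rng(0).normal(size=10_000)
current = np.random.default_rng(1).normal(loc=0.3, size=10_000)

score = psi(reference, current)
# A common rule of thumb treats PSI above ~0.2 as significant drift.
print(f"PSI = {score:.3f}; escalate for review: {score > 0.2}")
```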
Templates for the documents discussed in this section can be found in the aforementioned paper proposing this framework.
Beyond Documentation: Further Strategies for Data Governance
Governance is most effective when a diverse set of strategies are employed. While the above documentation practices may help decrease the prevalence of data cascades, it is inevitable that documentation alone will not solve every data quality issue. In addition to documentation, there are a number of other best practices and resources companies can make use of in working towards the development of higher quality datasets.
Create a culture that values dataset development.
Data work has been described as “under-valued and de-glamorised.” A company culture that perpetuates this perception is more likely to run into data quality issues. Companies can address this by intentionally recognizing the efforts of those doing data work, making visible what is often overlooked.
This also involves ensuring that appropriate infrastructure exists for data governance. The aforementioned proposals may be effective in mitigating dataset harms, but their implementation can be burdensome. In work environments where little attention is currently given to data work, specific efforts should be made to educate employees on the importance of dataset quality and assign responsibility for data governance oversight.
Ensure that AI audits evaluate entire systems, not just models.
Auditing is an increasingly popular strategy in the world of RAI governance. New York City Local Law 144, enforcement of which began on July 5, 2023, requires firms that use automated employment decision tools to conduct bias audits. The CEOs of top AI companies like Anthropic and OpenAI are increasingly calling for the auditing of AI systems.
Whether this auditing is done externally or internally, formally or informally, AI audits should consider the entire end-to-end development of an AI system, with particular attention to the dataset development process. The Datasheets outlined in the previous section can serve as auditable artifacts for third-party auditors.
Engage with the broader data governance ecosystem.
There is already a rich ecosystem of organizations actively working to make it easier for companies to adopt effective data governance practices.
- Partnership on AI, a Credo AI partner, has compiled a resource that synthesizes research into a set of best documentation practices. Their recommendations are broken down into the categories of data specification, curation, and integration, and can be readily incorporated into existing internal system design and processes.
- Google Research published Data Cards, “a dataset documentation framework aimed at increasing transparency across dataset lifecycles.” The Data Cards template, along with example Data Cards, can be found here. The Data Cards Playbook is a self-guided toolkit for creating Data Cards, designed to adapt to individual work environments and contexts.
- The Data Nutrition Project built a tool for the easy generation of standardized labels for datasets. While these labels are intended to be public-facing, they can help organizations conceptualize how similar documentation could increase information-sharing internally. Here is the draft label for LAION-5B, a large-scale image-text dataset.
Conclusion
Data governance is a necessary component of RAI governance. Documentation strategies like Datasheets can increase communication between parties, while a framework for dataset artifact creation can guide the internal development process. However, documentation should not be the only tool in the data governance toolbox. Further strategies include developing an internal culture that values data work and ensuring that AI audits are comprehensive. By following these recommendations, companies can increase the quality of the datasets used to train their AI systems and decrease the prevalence of data cascades.
The Credo AI Platform is here to help you on your data and AI governance journey. To learn more about how the Credo AI Platform can help you unify your data governance and AI governance activities in a single, comprehensive Platform, reach out to our team today.
DISCLAIMER. The information we provide here is for informational purposes only and is not intended in any way to represent legal advice or a legal opinion that you can rely on. It is your sole responsibility to consult an attorney to resolve any legal issues related to this information.