Reflections on NIST Symposium in September 2024, Part 1

Published 10/04/2024

Written by Ken Huang, CEO of DistributedApps.ai and VP of Research at CSA GCR.

Yesterday (September 24, 2024), I had the privilege of attending the NIST symposium “Unleashing AI Innovation, Enabling Trust.” The first day of the event had two panels.

The first panel, titled “Advancing Generative AI Risk Management: Understanding Risks, AI Actors, and Mitigations Across the AI Value Chain,” was moderated by Chloe Autio (NIST Associate) and featured panelists Helen Toner (Center for Security and Emerging Technology), Taja Naidoo (Meta, Open Loop), Jessica Newman (University of California, Berkeley), and Chris Meserole (Frontier Model Forum). The conversation on risk management in generative AI was engaging, and I plan to write a separate article focusing on that discussion.

However, this article centers on the second panel, “Addressing Synthetic Content,” moderated by Rob Reich (Stanford professor and NIST AI senior advisor). The panelists included Katherine Lee (Google DeepMind), Rebecca Portnoff (Thorn), Leonard Rosenthol (Adobe), and William Corvey (DARPA Information Innovation Office). Their discussion on the complexities of synthetic content/data inspired me to think about a 3D matrix framework to better define and scope synthetic data so we can apply guardrails accordingly.

After the panel, I had the opportunity to discuss my 3D matrix idea with Rob Reich, Leonard Rosenthol, and William Corvey. They found the concept intriguing. Rob also encouraged me to write something about it. To define and scope synthetic data or content, which still lacks a clear definition, we can pragmatically use the following 3D matrix: Who Uses the Data, Who Generates the Data, and the Nature of the Synthetic Data Itself.

As a high-level overview, the following mindmap can be helpful:

3D Matrix Framework for Synthetic Data

Dimension 1: Who is Using the Data?

This dimension focuses on the actors using synthetic data, including human actors, organizations, AI agents, and downstream applications. The following are just some examples of users/actors; the list is not exhaustive.

Actor	Application	Key Considerations	Ethical Implications
AI Engineers	Training, fine-tuning, testing models	Model robustness, fairness, bias reduction	Ensuring models don't perpetuate or amplify existing biases
Businesses (e.g., Banks)	Simulations, stress tests, decision-making	Regulatory compliance, privacy, risk modeling accuracy	Protecting customer privacy, ensuring fair lending practices
Regulators	Model validation, compliance checks	Transparency, accountability, legal/ethical standards	Balancing innovation with consumer protection
Researchers	Experimentation, generalization studies	Ethical use, reproducibility, bias avoidance	Ensuring research integrity and societal benefit
Downstream Applications	Automated decision-making	Quality maintenance, error prevention, bias mitigation	Preventing cascading errors in AI systems
Software Developers	AI-generated code	Security, error prevention, standards compliance	Ensuring AI-generated code is safe and ethical
LLM Engineers	Synthetic prompt generation	Accuracy, realism, ethical considerations	Preventing harmful or biased language generation
Gaming Industry	Character/environment generation	Cultural/racial bias avoidance, realism	Creating inclusive and respectful gaming experiences
Autonomous Vehicle Developers	Driving scenario simulation	Safety testing, edge case consideration	Ensuring public safety in real-world deployment
Healthcare AI Systems	Patient data simulation, outcome prediction	Privacy compliance, clinical diversity	Protecting patient confidentiality, ensuring equitable healthcare
AI Agents in Multi-Agent Systems	Synthetic interaction generation	Realistic behaviors, ethical interactions	Developing AI systems that cooperate ethically

Dimension 2: Who Generates the Data?

This dimension explores the various actors (generators) and processes involved in generating synthetic data. In many cases, generators and users can be the same actors. For example, a user can use synthetically generated prompts to get results from an LLM; in this case, the user and the generator are the same actor. The following table lists only some examples. You can add more examples to this table to help you think about guardrails and ethical implications.

Generator	Description	Key Considerations	Ethical Implications
Data Engineers	Manually create or curate synthetic datasets	Data quality, representation, scalability	Ensuring diverse and inclusive data representation
AI Platforms	Automated generation of synthetic data	Algorithm transparency, data volume, variability	Preventing algorithmic bias in data generation
GANs (Generative Adversarial Networks), LLMs, or Other GenAI Models	Generate realistic synthetic data	Data fidelity, mode collapse, privacy preservation	Balancing realism with privacy protection
Federated Learning Systems	Generate aggregated insights from distributed data	Data privacy, model accuracy, communication efficiency	Protecting individual privacy while enabling collaborative learning
Simulation Engines	Create synthetic environments and scenarios	Physical accuracy, edge case coverage, computational efficiency	Ensuring simulations don't oversimplify complex real-world scenarios
Data Augmentation Tools	Modify existing data to create new synthetic samples	Data diversity, transformation validity, bias introduction	Avoiding introduction of artificial patterns or biases
Synthetic Data Marketplaces	Platforms for buying/selling synthetic datasets	Data quality assurance, usage rights, pricing models	Ensuring ethical sourcing and use of synthetic data
Crowdsourcing Platforms	Human-generated synthetic data through tasks	Task design, quality control, fair compensation	Protecting crowd workers and ensuring data quality
Domain Experts	Manual creation of specialized synthetic data	Accuracy, domain-specific nuances, scalability	Balancing expert knowledge with potential individual biases
Hybrid Human-AI Systems	Collaborative data generation between humans and AI	Human oversight, AI assistance, iterative refinement	Leveraging AI capabilities while maintaining human judgment

Dimension 3: The Nature of the Synthetic Data Itself

This dimension examines the characteristics and qualities of the synthetic data itself.

Characteristic	Description	Key Considerations	Ethical Implications
Fidelity	How closely the data resembles real-world data	High-fidelity vs. low-fidelity trade-offs	Balancing realism with privacy protection
Structure	The format and organization of the data	Structured vs. unstructured data handling	Ensuring fair representation across data types
Volume	The amount of synthetic data generated	Scalability, storage, processing requirements	Preventing data abundance from overshadowing quality (overfitting)
Variability	The diversity and range of the synthetic data	Coverage of edge cases, rare events	Ensuring inclusive representation of minorities
Temporal Aspects	Time-related characteristics of the data	Historical trends, future projections	Avoiding temporal biases or anachronisms
Multimodality	Combination of different data types or sources	Integration challenges, consistency	Ensuring fairness across different modalities
Privacy Level	Degree of anonymization or obfuscation	Re-identification risks, utility preservation	Balancing data utility with individual privacy
Augmentation Level	Extent of modification from original data	Originality, transformation validity	Clearly distinguishing synthetic from real data
Domain Specificity	How tailored the data is to a particular field	Generalizability, domain expert validation	Avoiding over-specialization that may limit applicability
Interoperability	Compatibility with different systems or formats	Standards compliance, data portability	Ensuring equitable access and use across platforms
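
One way to make the three dimensions concrete is to treat each use of synthetic data as a point in the matrix. The following is a minimal Python sketch under my own assumptions, not a definitive implementation: the names `User`, `Generator`, `Characteristic`, `MatrixCell`, and `all_cells` are hypothetical, and each enum carries only a few sample entries taken from the tables above.

```python
from dataclasses import dataclass
from enum import Enum


class User(Enum):
    """Dimension 1: who uses the data (sample entries only)."""
    AI_ENGINEERS = "AI Engineers"
    BUSINESSES = "Businesses"
    REGULATORS = "Regulators"


class Generator(Enum):
    """Dimension 2: who generates the data (sample entries only)."""
    DATA_ENGINEERS = "Data Engineers"
    GENAI_MODELS = "GenAI Models"
    SIMULATION_ENGINES = "Simulation Engines"


class Characteristic(Enum):
    """Dimension 3: the nature of the data (sample entries only)."""
    FIDELITY = "Fidelity"
    PRIVACY_LEVEL = "Privacy Level"
    VARIABILITY = "Variability"


@dataclass(frozen=True)
class MatrixCell:
    """One point in the 3D matrix: a user, a generator, and a characteristic."""
    user: User
    generator: Generator
    characteristic: Characteristic

    def describe(self) -> str:
        return f"{self.user.value} x {self.generator.value} x {self.characteristic.value}"


def all_cells() -> list:
    """Enumerate every combination of the sample entries."""
    return [MatrixCell(u, g, c) for u in User for g in Generator for c in Characteristic]


cell = MatrixCell(User.BUSINESSES, Generator.GENAI_MODELS, Characteristic.PRIVACY_LEVEL)
print(cell.describe())
print(len(all_cells()), "cells in this sample matrix")
```

Enumerating the cells this way is mainly a thinking aid: each combination prompts the question of which considerations and ethical implications from the three tables apply together.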

Guardrails and Policy Recommendations for Synthetic Data based on 3D Framework

I. Guardrails

Based on the 3D framework, we can begin to propose the following potential guardrails for synthetic data using a few examples:

A. Guardrails for Actors Using Synthetic Data (Dimension 1)

Actor	Guardrail	Description
AI Engineers	Bias Mitigation Protocol	Implement a standardized protocol for detecting and mitigating biases in models trained on synthetic data
Businesses	Synthetic Data Disclosure	Mandate clear disclosure when synthetic data is used in decision-making processes affecting customers or stakeholders
Regulators	Synthetic Data Audit Framework	Develop a comprehensive audit framework specifically for models and systems using synthetic data
Researchers	Reproducibility Standards	Establish standards for documenting synthetic data generation and use in research to ensure reproducibility
Downstream Applications	Continuous Monitoring System	Implement systems for continuous monitoring and quality assurance of applications using synthetic data

B. Guardrails for Generators of Synthetic Data (Dimension 2)

Generator	Guardrail	Description
Data Engineers	Diversity Checklist	Implement a mandatory diversity and inclusion checklist for manually created synthetic datasets
AI Platforms	Transparency Reports	Require regular transparency reports detailing the algorithms and processes used in synthetic data generation
GANs/LLMs/GenAI Systems	Privacy Preservation Metrics	Establish and adhere to strict privacy preservation metrics for synthetic data generated by GANs, LLMs, and other GenAI systems
Federated Learning Systems	Data Contribution Limits	Set limits on individual data contributions to federated learning systems to prevent over-representation
Simulation Engines	Reality Alignment Checks	Implement regular checks to ensure simulated scenarios align with real-world complexities
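
As an illustration of the Data Contribution Limits guardrail in the table above, the sketch below caps how many records any single contributor can supply. This is a deliberately simplified assumption on my part: the function name `cap_contributions` and the `(contributor_id, payload)` record shape are hypothetical, and a real federated learning system may bound individual influence in other ways.

```python
from collections import defaultdict


def cap_contributions(records, max_per_contributor):
    """Keep at most max_per_contributor records per contributor, preserving order.

    records is an iterable of (contributor_id, payload) pairs (a hypothetical shape).
    """
    if max_per_contributor < 0:
        raise ValueError("max_per_contributor must be non-negative")
    counts = defaultdict(int)
    kept = []
    for contributor_id, payload in records:
        if counts[contributor_id] < max_per_contributor:
            kept.append((contributor_id, payload))
            counts[contributor_id] += 1
    return kept


sample = [("site_a", 1), ("site_a", 2), ("site_a", 3), ("site_b", 4)]
print(cap_contributions(sample, max_per_contributor=2))
```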

C. Guardrails for the Nature of Synthetic Data (Dimension 3)

Characteristic	Guardrail	Description
Fidelity	Fidelity Scoring System	Develop a standardized scoring system to rate the fidelity of synthetic datasets
Structure	Structural Integrity Checks	Implement automated checks to ensure structural consistency between synthetic and real-world data
Volume	Volume Justification Requirement	Require justification for large-volume synthetic data generation to prevent unnecessary data proliferation
Variability	Variability Audits	Conduct regular audits to ensure synthetic data captures appropriate variability, including edge cases
Temporal Aspects	Temporal Consistency Verification	Implement verification processes to ensure temporal consistency in time-sensitive synthetic data

II. Policy Recommendations

Based on the 3D framework and proposed guardrails, we can conceive the following potential policies:

A. Policies for Regulating Actors Using Synthetic Data (Dimension 1)

Mandatory Synthetic Data Training: Require all professionals working with synthetic data to undergo training on ethical considerations and best practices.
Synthetic Data Impact Assessments: Mandate regular impact assessments for high-stakes applications using synthetic data.
Cross-Sector Collaboration Initiatives: Establish programs to facilitate knowledge sharing and best practices across different sectors using synthetic data.

B. Policies for Overseeing Synthetic Data Generation (Dimension 2)

Certification Program for Synthetic Data Generators: Develop a certification program for individuals and organizations generating synthetic data.
Open-Source Initiative for Synthetic Data Tools: Promote the development and use of open-source tools for synthetic data generation to increase transparency and accessibility.
Privacy-Preserving Technology Incentives: Create incentives for the development and adoption of advanced privacy-preserving technologies in synthetic data generation.

C. Policies for Governing the Nature of Synthetic Data (Dimension 3)

Synthetic Data Labeling Standards: Establish clear standards for labeling synthetic data, including information on its nature, fidelity, and intended use.
Data Quality Frameworks: Develop comprehensive frameworks for assessing and ensuring the quality of synthetic data across various dimensions.
Ethical Guidelines for Synthetic Data Use: Create sector-specific ethical guidelines for the use of synthetic data, addressing unique challenges in areas such as healthcare, finance, and criminal justice.
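
A labeling standard of the kind described above could take the form of a small machine-readable record attached to each dataset. The sketch below is one possible shape, assuming a hypothetical `SyntheticDataLabel` class and a hypothetical three-level fidelity scale; no existing standard is implied.

```python
import json
from dataclasses import asdict, dataclass

# Hypothetical fidelity scale, used only for this illustration.
FIDELITY_LEVELS = ("low", "medium", "high")


@dataclass(frozen=True)
class SyntheticDataLabel:
    """Label covering the nature, fidelity, and intended use of a synthetic dataset."""
    nature: str
    fidelity: str
    intended_use: str
    synthetic: bool = True

    def __post_init__(self):
        # Reject fidelity values outside the assumed scale.
        if self.fidelity not in FIDELITY_LEVELS:
            raise ValueError(f"fidelity must be one of {FIDELITY_LEVELS}")

    def to_json(self) -> str:
        """Serialize the label with sorted keys so the output is stable."""
        return json.dumps(asdict(self), sort_keys=True)


label = SyntheticDataLabel(
    nature="tabular, fully generated",
    fidelity="high",
    intended_use="stress testing",
)
print(label.to_json())
```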

D. Overarching Policy Recommendations

Adaptive Regulatory Framework: Implement an adaptive regulatory approach that can quickly respond to technological advancements in synthetic data.
International Cooperation on Synthetic Data Standards: Foster international collaboration to develop global standards for synthetic data generation and use.
Research Funding for Long-Term Impact Studies: Allocate funding for research into the long-term societal and ethical impacts of widespread synthetic data use.
Synthetic Data Ethics Board: Establish a national or international ethics board focused specifically on synthetic data issues.
Whistleblower Protection for Synthetic Data Misuse: Implement strong whistleblower protections for individuals reporting unethical practices in synthetic data generation or use.

These potential guardrails and policy recommendations may be used to create a comprehensive framework for the ethical and responsible use of synthetic data, addressing the multifaceted nature of its users, generators, and inherent characteristics. As the field evolves, these recommendations should be regularly reviewed and updated to reflect new technological advancements and emerging ethical considerations.

Find Part 2 of my reflection here.

References

This article was inspired by the NIST symposium, "Unleashing AI Innovation, Enabling Trust" (September 24, 2024), and conversations with industry leaders including Rob Reich, Leonard Rosenthol, and William Corvey.
