Training Your LLM Dragons: Why DSPM is the Key to AI Security

Published 01/17/2025

Originally published by Proofpoint DSPM (formerly Normalyze).

Written by Parag Bajaria.

AI’s transformative potential comes with a price—its complexity and reliance on sensitive data make it a prime target for security threats. For most organizations, the two primary use cases, custom large language models (LLMs) and tools like Microsoft Copilot, introduce unique challenges.

Custom LLMs often require extensive training on organizational data, creating risks of embedding sensitive information into models. Meanwhile, Microsoft Copilot integrates with enterprise applications and processes, potentially exposing personal, financial, and proprietary data if not properly governed. Whether through intentional attacks or accidental mishandling, these implementations demand a robust security approach to prevent data exposure and ensure compliance.

Key threats to AI implementations include:

Prompt Injection Attacks: Crafty prompts can manipulate models into disclosing sensitive information indirectly, bypassing traditional security measures.
Training Data Poisoning: Malicious actors or oversights can embed sensitive or biased data into training sets, leading to unethical or insecure model outputs.
Data Leakage in Outputs: Poorly configured models may inadvertently expose private data during user interactions or as part of their outputs.
Compliance Failures: AI systems that mishandle regulated data risk steep fines under laws like GDPR, CCPA, or HIPAA and erode customer trust.

In my recent webinar, Training Your LLM Dragons: Why DSPM is Foundational for Every AI Initiative, I discussed these risks with Vamsi Koduru, Director of Product Management at Normalyze, and together we walked through practical strategies and demos illustrating how to tackle them head-on.

Use Case 1: Securing Custom LLMs

Custom LLMs allow organizations to fine-tune AI models to meet specific business needs, but they also create significant risks. Sensitive data can enter the model during training or through interactions, potentially leading to inadvertent disclosures. Security teams can secure custom LLMs with these steps:

Audit and Sanitize Training Data:
- Regularly review datasets for sensitive or regulated information before using them in training.
- Implement data anonymization techniques, such as masking or encryption, to protect PII and other critical data.
Monitor Data Lineage:
- Use tools to map how data flows from ingestion to model training and outputs.
- Ensure traceability to maintain compliance and quickly address vulnerabilities.
Set Strict Access Controls:
- Enforce role-based permissions for data scientists and engineers interacting with training datasets.
- Limit access to sensitive datasets to only those who absolutely need it.
Proactively Monitor Outputs:
- Analyze model responses to ensure they don’t unintentionally reveal sensitive information, particularly after updates or retraining cycles.

Use Case 2: Mitigating Risks in Microsoft Copilot

Microsoft Copilot delivers accurate, contextually relevant responses through a process called grounding. By accessing Microsoft Graph and the Semantic Index, grounding pulls context from across your organizational applications to generate more specific and tailored prompts for its LLM. While this enhances response quality, it also introduces risks of data leakage or misuse if sensitive or poorly governed data sources are accessed during the process. Security teams can secure Copilot implementations with these steps:

Enforce Sensitivity Labels:
- Map sensitive data to Microsoft Information Protection (MIP) labels to ensure proper access restrictions.
- Assign labels consistently across files and applications to govern what data Copilot can access.
Curate Approved Data Sources:
- Consider using a curated set of approved SharePoint sites or datasets for Copilot to minimize exposure of unvetted data.
- Ensure all included datasets are sanitized for sensitive or regulated content.
Monitor Prompt Behavior and Outputs:
- Log and analyze prompts to identify unusual or malicious behavior.
- Use tools to monitor Copilot’s outputs and flag sensitive information in real time.
Limit Access by Role:
- Configure Copilot’s access based on user roles to ensure employees only see data relevant to their responsibilities.

Build a Secure AI Framework

Regardless of the use case, a proactive and layered approach is essential to securing AI infrastructure. Here’s a summary of the steps organizations should take:

Discover and Classify Sensitive Data: Use automated tools to identify PII, intellectual property, and regulated data across your cloud and on-premises environments.
Ensure Data Lineage Visibility: Track how sensitive data moves through your AI workflows, from ingestion to model training and beyond.
Establish Role-Based Access Controls: Limit access to sensitive data and ensure permissions align with employees’ responsibilities.
Audit and Anonymize Data: Sanitize training datasets and ensure outputs don’t inadvertently disclose sensitive information.
Continuously Monitor Interactions: Track user inputs, model prompts, and outputs to identify and mitigate risks as they arise.

The Path Forward

AI is a transformative tool, but its reliance on sensitive data creates unique challenges for security teams. By adopting a structured approach to securing AI infrastructure, organizations can unlock the potential of custom LLMs and tools like Microsoft Copilot without compromising data integrity, compliance, or trust.

For a deeper dive into these strategies watch the full webinar recording.

Frequently Asked Questions

1. What is DSPM, and why is it critical for AI implementations?

Data Security Posture Management (DSPM) is a strategy and set of tools designed to discover, classify, and monitor valuable and sensitive data as well as user access across an organization’s cloud and on-premises environments . For AI implementations like custom LLMs and Microsoft Copilot, DSPM is crucial for ensuring that sensitive or regulated data is properly governed, reducing the risk of data leakage, misuse, or compliance violations.

2. What are the main risks of using custom LLMs in organizations?

Custom LLMs can introduce risks such as:

Embedding sensitive data in models during training due to unsanitized datasets.
Inadvertent data leakage in model outputs.
Compliance failures if regulated data (e.g., PII) is mishandled.
Security vulnerabilities like training data poisoning or prompt injection attacks.

These risks highlight the importance of auditing training data, monitoring data flows, and enforcing strict access controls.

3. How does Microsoft Copilot use organizational data, and what risks does this create?

Microsoft Copilot uses a process called grounding, where it accesses data from Microsoft Graph and the Semantic Index to provide contextually relevant responses. While this improves accuracy, it also creates risks, such as:

Data leakage if sensitive files or emails are improperly governed.
Misuse of confidential information if role-based access controls are inadequate.
Exposure of regulated data if sensitivity labels are not consistently applied.

4. How can organizations secure sensitive data in AI workflows?

Organizations can secure AI workflows by:

Discovering and classifying data to identify sensitive or regulated information.
Enforcing role-based access controls to limit who can access what data.
Monitoring data lineage to track how data flows into and out of AI systems.
Auditing and anonymizing data to ensure sensitive information is masked or encrypted.
Continuously monitoring interactions to identify and mitigate risks in real time.

About the Author

author headshot Parag Bajaria brings over 15 years of seasoned leadership in product management, with expertise in cloud security, DevOps, and fostering the growth of startup companies. His strategic foresight has been pivotal in developing pioneering products in CIEM, CWPP, CSPM, and CNAPP, steering the market toward innovative security solutions. At the helm of product teams, Parag's approach blends cutting-edge technical insight with keen market instincts, leading to the development and launch of innovative, user-centric solutions. At Normalyze, Parag spearheads the development of the next generation of AI-powered Data Security Platforms to discover, classify and secure data no matter where it is.