SAGE: The Format STIX, OSCAL, and SARIF Don't Cover
Published 05/04/2026
Security research lives in PDFs. PDFs are good for humans and useless to machines. That mismatch was annoying a few years ago. It's expensive today.
Detection engineers are feeding those PDFs into RAG pipelines so their copilots can answer questions about threat actors, control mappings, and risk decisions. The pipelines are pulling text out of multi-column layouts, splitting paragraphs at chunk boundaries, and dropping the structure that made the prose comprehensible in the first place. A reasoning chain that took an analyst three pages to construct gets retrieved as a six-line snippet with no surrounding context, no provenance, and no way to verify whether the source has been tampered with. Then your agent confidently quotes it back to you.
The industry has tried to fix the machine-readability problem in narrow slices. STIX carries indicators. OSCAL carries controls. CycloneDX carries components. SARIF carries findings. They do their jobs well. None of them carries the analytical prose that explains a threat model, a risk decision, or a design tradeoff. The why and the how have no standard at all. They sit in PDFs, are poorly extracted, and travel through agent pipelines as low-fidelity text with no integrity guarantees.
Call it the structured narrative gap. It used to be a documentation problem. With agentic security tools shipping into production, it's an attack surface.
Poisoned Guidance With Legitimate Metadata
Wei Zou and his coauthors published PoisonedRAG at USENIX Security 2025. They achieved a 90% attack success rate by injecting five malicious texts into a knowledge base with millions of entries. Five corrupted documents in a corpus of millions. Take that as the floor.
Now, picture your security copilot's knowledge base consisting of CSA whitepapers, NIST drafts, vendor advisories, and internal threat models. PDFs scraped, chunked, and embedded with no integrity check, no signature, and no consistent way to authenticate the publisher. An attacker who places a single document into your retrieval index gets to influence what your incident responder sees at 3 AM.
Structured documents make this worse before they make it better. Most RAG systems give documents with rich metadata preferential retrieval treatment, because the metadata helps with filtering and ranking. A poisoned document that appears structurally legitimate commands higher implicit trust than a noisy PDF extract. The author thinks they're solving a discovery problem. They're handing the attacker a sharper weapon.
Anyone shipping a "structured AI-ready security document" without integrity protection is shipping a poisoning vector. The PoisonedRAG research dates to 2024, so the risk is established, and it is live today because the defenses the authors evaluated didn't hold.
What We Built And Why
Cloud Security Alliance is publishing SAGE today. The acronym stands for Security Analysis and Guidance Exchange. It was Jim Reavis’ brainchild, and I’m honored to co-author the spec with him. The release contains three artifacts: a normative specification, a blank document template, and a guide that walks an author from a blank page to a published document.
SAGE is valid CommonMark with valid YAML frontmatter. Existing renderers and parsers handle it correctly. New tooling extracts the metadata, verifies integrity, propagates trust markings, and chunks semantically along the heading boundaries the author intended.
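To make "chunks semantically along the heading boundaries" concrete, here is a minimal sketch using only the Python standard library. It is not the reference tooling: the function names are hypothetical, it only recognizes ATX-style (`#`) headings, and it returns the frontmatter as raw YAML text on the assumption that a real loader would hand that text to a YAML parser.

```python
import re

FENCE = "`" * 3                    # Markdown code-fence marker
HEADING = re.compile(r"^#{1,6} ")  # ATX headings only (an assumption of this sketch)

def split_frontmatter(text: str) -> tuple[str, str]:
    """Return (raw_yaml, body). The frontmatter is whatever sits between a
    leading '---' line and the next '---' line; empty string if absent."""
    lines = text.splitlines(keepends=True)
    if not lines or lines[0].strip() != "---":
        return "", text
    for i in range(1, len(lines)):
        if lines[i].strip() == "---":
            return "".join(lines[1:i]), "".join(lines[i + 1:])
    return "", text  # no closing delimiter: treat everything as body

def chunk_by_heading(body: str) -> list[str]:
    """Split a Markdown body so that each chunk starts at a heading.
    Lines inside fenced code blocks are never treated as headings."""
    chunks: list[str] = []
    current: list[str] = []
    in_fence = False
    for line in body.splitlines(keepends=True):
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence
        if not in_fence and HEADING.match(line) and current:
            chunks.append("".join(current))
            current = []
        current.append(line)
    if current:
        chunks.append("".join(current))
    return chunks
```

Each chunk keeps its heading, so a retriever returns a section the author actually wrote as a unit instead of a window cut at an arbitrary character count.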
The frontmatter is organized into three tiers. Ten required fields cover identity, classification, provenance, and integrity. Four conditional fields apply when AI is involved in authorship. An optional set covers taxonomy, relationships, and machine processing hints. Document identity uses a {ORG}-{YEAR}-{TYPE}-{SEQ} namespace, with self-assigned organization prefixes and CSA and OWASP reserved. The generation_metadata block replaces the old "ai_assisted" boolean with authorship mode, model identifier, version, and human review level. It's extensible for the tool-call traces and delegation chains we'll need as agent authorship matures.
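A consumer can check the {ORG}-{YEAR}-{TYPE}-{SEQ} shape before trusting anything else in the frontmatter. The sketch below is illustrative only: the character classes for each segment (uppercase alphanumeric ORG, four-digit YEAR, uppercase TYPE, numeric SEQ) are assumptions, the example identifiers are made up, and the specification is the authority on the real grammar.

```python
import re

# Assumed segment shapes; consult the SAGE specification for the real grammar.
DOC_ID = re.compile(
    r"(?P<org>[A-Z][A-Z0-9]*)-(?P<year>\d{4})-(?P<type>[A-Z]+)-(?P<seq>\d+)"
)

# The two prefixes the article names as reserved.
RESERVED_PREFIXES = {"CSA", "OWASP"}

def parse_doc_id(doc_id: str) -> dict[str, str]:
    """Split an identifier into its four segments, or raise ValueError."""
    match = DOC_ID.fullmatch(doc_id)
    if match is None:
        raise ValueError(f"not an ORG-YEAR-TYPE-SEQ identifier: {doc_id!r}")
    return match.groupdict()

def uses_reserved_prefix(doc_id: str) -> bool:
    """True when the organization prefix is one of the reserved ones."""
    return parse_doc_id(doc_id)["org"] in RESERVED_PREFIXES
```

A pipeline could use `uses_reserved_prefix` to require a stronger check, such as a verified signature, before accepting a document that claims a reserved publisher.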
Trust marking lives in frontmatter using FIRST TLP 2.0, with explicit handling rules for multi-agent workflows. Agents must not push TLP:RED or TLP:AMBER+STRICT content into shared context windows. Derivative documents must carry a TLP at least as restrictive as their most restrictive source. Related documents carry a relationship_type from a controlled vocabulary covering supplements, supersedes, references, implements, extends, contradicts, and updates. Consumers follow those relationships programmatically instead of inferring them from prose.
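The two TLP handling rules above reduce to an ordering check and a blocklist. This is a minimal sketch, not spec text: the function names are hypothetical, and the ordering is the standard TLP 2.0 ladder from least to most restrictive.

```python
# TLP 2.0 labels, ordered from least to most restrictive.
TLP_ORDER = ["TLP:CLEAR", "TLP:GREEN", "TLP:AMBER", "TLP:AMBER+STRICT", "TLP:RED"]
RANK = {label: i for i, label in enumerate(TLP_ORDER)}

# Labels an agent must not push into a shared context window.
BLOCKED_FROM_SHARED_CONTEXT = {"TLP:RED", "TLP:AMBER+STRICT"}

def minimum_derivative_tlp(source_tlps: list[str]) -> str:
    """Return the most restrictive label among the sources. A derivative
    document must carry this label or a stricter one.
    Raises KeyError on an unknown label."""
    if not source_tlps:
        raise ValueError("a derivative document needs at least one source")
    return max(source_tlps, key=RANK.__getitem__)

def derivative_tlp_is_valid(derivative: str, source_tlps: list[str]) -> bool:
    """True when the derivative is at least as restrictive as every source."""
    return RANK[derivative] >= RANK[minimum_derivative_tlp(source_tlps)]

def may_enter_shared_context(tlp: str) -> bool:
    """True when content with this label may go into a shared context window."""
    return tlp not in BLOCKED_FROM_SHARED_CONTEXT
```

For example, a summary built from one TLP:GREEN and one TLP:AMBER source has to be marked TLP:AMBER or stricter, and an orchestrator would call `may_enter_shared_context` before appending retrieved text to a context shared across agents.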
The piece doing the bulk of the work on integrity is content_hash. We hash the document body with SHA-256 after normalizing to UTF-8 NFC and LF line endings. The output is 64-character lowercase hex. The spec ships with normative test vectors, so two independent parsers produce identical hashes for the same input. An optional cryptographic signature block at v1 lets publishers sign over that hash. Tamper detection is required and publisher authentication is available from day one, because shipping structured AI-ready content without them is irresponsible.
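The hashing steps described above can be sketched in a few lines of standard-library Python. Treat this as an approximation under stated assumptions: it covers only the three steps named here (LF line endings, NFC, SHA-256 over UTF-8 bytes), the function name is hypothetical, and details such as exactly where the body begins or how trailing whitespace is handled are left to the specification. The spec's normative test vectors, not this sketch, decide whether an implementation is correct.

```python
import hashlib
import unicodedata

def content_hash(body: str) -> str:
    """SHA-256 of the body after LF and NFC normalization, as lowercase hex."""
    # Normalize CRLF and lone CR line endings to LF.
    text = body.replace("\r\n", "\n").replace("\r", "\n")
    # Normalize Unicode to NFC so composed and decomposed forms hash identically.
    text = unicodedata.normalize("NFC", text)
    # Hash the UTF-8 encoding; hexdigest() returns 64 lowercase hex characters.
    return hashlib.sha256(text.encode("utf-8")).hexdigest()
```

With this normalization, a draft saved with Windows line endings and the same draft saved with Unix line endings produce the same hash, which is the property that lets two parties recompute and compare.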
SAGE sits alongside STIX, OSCAL, CycloneDX, and SARIF, not against them. Indicators travel as STIX. Controls travel as OSCAL. Components travel as CycloneDX. Findings travel as SARIF. The reasoning that connects an indicator to a control to a risk decision travels as SAGE.
What You Do Today
CSA is retrofitting its corpus first. Reference tooling follows: a validator, LangChain and LlamaIndex loaders, and a PDF-to-SAGE converter for legacy content. Vendor adoption comes next, then OASIS formalization.
You don't have to wait for any of it. Read the specification end-to-end. Run the template through a document you'd otherwise publish as a PDF. Try to break the schema. Compute a content_hash over a draft, ship it to a colleague, have them recompute it, and confirm the hashes match. File an issue where the spec is ambiguous, contradictory, or wrong.
For converting existing content, any2md is an open-source CLI I wrote for my own RAG pipelines and soft-released at RSAC. It takes PDFs, DOCX, HTML, URLs, and plain text and emits SAGE-shaped Markdown with a reproducible content_hash that matches the spec's normalization rules. MIT-licensed, pip-installable, at github.com/rocklambros/any2md.
If you ship security content, start asking your tooling vendors when they'll consume SAGE. If you run a copilot or RAG pipeline that ingests security research, start asking your providers how they verify what they're indexing today. The structured narrative gap was a problem nobody owned. It has owners now.