Validating LLM-Generated Control Mappings Beyond Aggregate Accuracy

Published 07/02/2026

Written by Laura Voicu, Co-Founder and CDSO of the Enterprise Risk Quantification Institute.

Security control frameworks continue to grow in scope and complexity. The CSA AI Controls Matrix (AICM) alone has 243 control objectives. NIST CSF has more than a hundred subcategories. Organizations operating under multiple frameworks need to map between them for compliance cross-referencing, gap analysis, risk aggregation, and audit preparation.

LLMs have become a practical necessity for this work. An LLM can produce a complete mapping in minutes, with structured output, valid schema, and a plausible rationale for every assignment. For many organizations, the volume of work makes automation the only viable option.

The challenge is that LLMs are non-deterministic, and the output has no ground truth to validate against. There is no authoritative AICM-to-NIST crosswalk handed down from a trusted source. The mapping requires judgment, and an LLM's judgment may be subtly wrong in ways that are invisible on inspection.

When teams evaluate these mappings, they typically check accuracy: the percentage of labels that match a reference set. This is misleading on skewed taxonomies. If 60% of controls map to one category, a classifier that always outputs that category scores 60% while understanding nothing. Cohen's kappa corrects for this by measuring agreement above what chance alone would produce. An 85% accuracy score on a skewed taxonomy can mask a kappa of 0.4, meaning that once the easy cases are accounted for, the model achieves only 40% of the possible agreement beyond chance.
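
The gap between the two metrics is easy to reproduce. The sketch below is illustrative only: the label counts are invented to show the effect and are not taken from any real mapping, and `accuracy` and `cohen_kappa` are hypothetical helper names rather than part of any published tool.

```python
from collections import Counter


def accuracy(reference, predicted):
    """Fraction of items on which the two label lists agree."""
    return sum(r == p for r, p in zip(reference, predicted)) / len(reference)


def cohen_kappa(reference, predicted):
    """Cohen's kappa for two raters: (p_o - p_e) / (1 - p_e)."""
    n = len(reference)
    p_o = accuracy(reference, predicted)
    ref_counts = Counter(reference)
    pred_counts = Counter(predicted)
    # Chance agreement: probability that two independent raters with these
    # marginal label frequencies pick the same label.
    p_e = sum(ref_counts[c] * pred_counts[c] for c in ref_counts) / (n * n)
    if p_e == 1.0:
        return float("nan")  # undefined when chance agreement is already total
    return (p_o - p_e) / (1 - p_e)


# Invented example: 100 controls, 85 of which belong to category "A".
reference = ["A"] * 85 + ["B"] * 15
# 78 A's labelled A, 7 A's labelled B, 8 B's labelled A, 7 B's labelled B.
predicted = ["A"] * 78 + ["B"] * 7 + ["A"] * 8 + ["B"] * 7

print(f"accuracy = {accuracy(reference, predicted):.2f}")
print(f"kappa    = {cohen_kappa(reference, predicted):.2f}")

# A classifier that always outputs the majority class on a 60/40 taxonomy.
skewed_reference = ["A"] * 60 + ["B"] * 40
always_a = ["A"] * 100
print(f"majority-only accuracy = {accuracy(skewed_reference, always_a):.2f}")
print(f"majority-only kappa    = {cohen_kappa(skewed_reference, always_a):.2f}")
```

With these counts the first classifier reaches 85% accuracy but a kappa just under 0.4, and the majority-only classifier reaches 60% accuracy with a kappa of zero.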

Figure 1: An 85% accuracy score can mask a kappa of 0.4 (near-random classification) when the taxonomy is skewed.

Even chance-corrected accuracy, measured in aggregate, hides the failures that matter most. A model that classifies 85% of controls correctly but systematically misclassifies the ambiguous ones — the controls on boundaries between categories, where the mapping may actually affect risk decisions — is dangerous in ways an aggregate score cannot show.

Five dimensions of validation

Psychometrics has solved this problem before. The field has spent sixty years validating imperfect measurement instruments without ground truth, which is exactly the situation LLM classifications create. The framework described here applies five independent tests, each targeting a failure mode the others miss.

Each dimension produces a pass, marginal, or fail verdict against thresholds the organization sets based on the stakes of the downstream decision.

Figure 2: The five-dimension validation dashboard showing radar chart, configurable thresholds, and per-dimension verdicts for the AICM-to-FAIR-CAM demonstration.

Coherence

Coherence measures whether the LLM agrees with domain experts, using chance-corrected kappa rather than raw accuracy.

The goal of this validation is not perfection. The question is whether the LLM agrees with humans as well as humans agree with each other. If a panel of three analysts only reaches kappa 0.65 on a particular mapping task, the ceiling is set by the inherent ambiguity of the task. Holding an LLM to kappa 0.90 on work that humans find ambiguous demands a level of agreement the task itself cannot support.

This does not require classifying every control by hand. A stratified sample of items (for example 20–30 controls), covering all source and target framework categories, is enough to compute kappa with confidence intervals, although the intervals will be wide at this sample size. The sample must be designed to cover the full cross-tabulation, not just the common categories.
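
One way to build such a sample is sketched below. It is a minimal illustration under stated assumptions: each control is a dict with hypothetical `source` and `target` keys, where `target` is the provisional category assigned by the LLM, and the category names and counts are invented. The function guarantees one control per occupied cell of the cross-tabulation first, then fills the remaining slots at random, so coverage takes priority over the requested size.

```python
import random
from collections import defaultdict


def stratified_sample(controls, target_size, seed=0):
    """Pick one control per (source, target) cell, then fill up at random."""
    rng = random.Random(seed)  # seeded so the sample is reproducible
    cells = defaultdict(list)
    for control in controls:
        cells[(control["source"], control["target"])].append(control)

    sample, remainder = [], []
    for cell in sorted(cells):
        members = cells[cell][:]
        rng.shuffle(members)
        sample.append(members[0])      # guarantees the cell is covered
        remainder.extend(members[1:])  # everything else competes for the rest

    rng.shuffle(remainder)
    extra = max(0, target_size - len(sample))
    return sample + remainder[:extra]


# Invented, heavily skewed inventory of 60 controls in 5 occupied cells.
cell_counts = (
    [("governance", "DSC")] * 30
    + [("operations", "LEC")] * 20
    + [("operations", "VMC")] * 6
    + [("development", "LEC")] * 3
    + [("development", "DSC")] * 1
)
controls = [
    {"id": f"C{i:03d}", "source": src, "target": tgt}
    for i, (src, tgt) in enumerate(cell_counts)
]

sample = stratified_sample(controls, target_size=20)
covered = {(c["source"], c["target"]) for c in sample}
print(len(sample), "controls sampled,", len(covered), "cells covered")
```

A purely random draw of 20 from this inventory could easily miss the single development/DSC control, which is exactly the kind of rare cell the sample needs to include.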

When multiple raters are available, Fleiss' kappa extends the measurement to any number of raters simultaneously, with the LLM treated as one rater among several.
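
A minimal sketch of the calculation follows, using the standard Fleiss formula. The function name `fleiss_kappa`, the category abbreviations, and the ratings are assumptions made for illustration: four raters (for instance three LLM runs and one analyst) label six hypothetical controls.

```python
def fleiss_kappa(ratings, categories):
    """Fleiss' kappa for items that were each labelled by the same number of raters.

    ratings: list of per-item label lists, e.g. [["LEC", "LEC", "DSC"], ...]
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(item) != n_raters for item in ratings):
        raise ValueError("every item needs the same number (>= 2) of ratings")

    # counts[i][j] = number of raters who assigned category j to item i
    counts = [[item.count(c) for c in categories] for item in ratings]

    # Observed agreement: share of agreeing rater pairs, averaged over items.
    p_items = [
        (sum(x * x for x in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_items) / n_items

    # Chance agreement from the overall category proportions.
    p_cat = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(len(categories))
    ]
    p_e = sum(p * p for p in p_cat)
    if p_e == 1.0:
        return float("nan")  # undefined when only one category is ever used
    return (p_bar - p_e) / (1 - p_e)


CATEGORIES = ["LEC", "VMC", "DSC"]  # illustrative abbreviations
ratings = [
    ["LEC", "LEC", "LEC", "LEC"],
    ["DSC", "DSC", "DSC", "DSC"],
    ["VMC", "VMC", "VMC", "LEC"],  # one rater dissents
    ["LEC", "LEC", "LEC", "LEC"],
    ["DSC", "DSC", "DSC", "VMC"],  # one rater dissents
    ["VMC", "VMC", "VMC", "VMC"],
]
print(f"Fleiss' kappa = {fleiss_kappa(ratings, CATEGORIES):.3f}")
```

Because Fleiss' kappa pools all raters, a high value can come from the LLM runs agreeing with each other. Reporting the LLM-to-human agreement separately keeps that from hiding a disagreement with the analyst.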

Consistency

Every taxonomy has structural constraints. Categories require specific subcategories. Some combinations are logically impossible. Parent-child relationships must be respected. These checks are entirely deterministic.

LLMs can produce output that looks structurally valid to a human reviewer while quietly inventing taxonomy paths that do not exist — a hallucinated category, a subdomain that does not belong under the assigned domain, a function the taxonomy does not define. These violations are invisible on spot-check and trivial to catch with automated rules.
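
Such rules can be as simple as a lookup against the taxonomy definition. The sketch below assumes a simplified two-level taxonomy built from the FAIR-CAM domain names used later in this article. It is not the full standard, and the `TAXONOMY` table, the `find_violations` helper, and the example mappings are hypothetical.

```python
# Simplified, illustrative taxonomy: domain -> allowed subdomains.
# This is NOT the complete FAIR-CAM standard.
TAXONOMY = {
    "Loss Event Controls": {"prevention", "detection", "response"},
    "Variance Management Controls": {"identification", "correction"},
    "Decision Support Controls": {"governance", "expectations"},
}


def find_violations(mapping):
    """Return a list of structural rule violations for one LLM mapping."""
    violations = []
    domain = mapping.get("domain")
    subdomain = mapping.get("subdomain")
    if domain not in TAXONOMY:
        violations.append(f"unknown domain: {domain!r}")
    elif subdomain not in TAXONOMY[domain]:
        violations.append(f"{subdomain!r} is not a subdomain of {domain!r}")
    return violations


# Hypothetical LLM outputs: one valid, one invented domain, one misplaced subdomain.
mappings = [
    {"control": "CTRL-1", "domain": "Loss Event Controls", "subdomain": "detection"},
    {"control": "CTRL-2", "domain": "Threat Intelligence Controls", "subdomain": "detection"},
    {"control": "CTRL-3", "domain": "Decision Support Controls", "subdomain": "detection"},
]

for m in mappings:
    problems = find_violations(m)
    print(m["control"], "OK" if not problems else problems)
```

The third mapping is the kind that passes a spot-check: both the domain and the subdomain exist, they just do not belong together.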

Convergent validity

Convergent validity asks whether independent methods of measuring the same thing produce the same result.

For control mappings, this means comparing the LLM's direct classification against mappings derived through a different method. If published crosswalks exist between the source and target frameworks — even transitively, such as AICM → CIS Controls → target taxonomy — the LLM's direct mapping should converge with the independently derived path. When it doesn't, that's worth investigating.

This also checks the experts. If analysts all agree with the LLM but the classifications diverge from published consensus, the agreement may be groupthink rather than consensus.
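
The transitive comparison can be sketched as a composition of two lookup tables. Everything in the example is an assumption for illustration: the control identifiers and crosswalk entries are invented, and the rule that a control is only compared when every path leads to the same target label is one possible design choice, not a prescribed method.

```python
def transitive_labels(source_to_mid, mid_to_target):
    """Derive a target label per control by chaining two crosswalks.

    A control is kept only if all of its paths end in the same label.
    Controls with no path or with conflicting paths are left out.
    """
    derived = {}
    for control, mids in source_to_mid.items():
        labels = {mid_to_target[m] for m in mids if m in mid_to_target}
        if len(labels) == 1:
            derived[control] = labels.pop()
    return derived


def compare(direct, derived):
    """Agreement rate and list of disagreements on the shared controls."""
    shared = sorted(set(direct) & set(derived))
    disagreements = [c for c in shared if direct[c] != derived[c]]
    rate = 1 - len(disagreements) / len(shared) if shared else float("nan")
    return rate, disagreements


# Invented crosswalk fragments (identifiers are placeholders).
source_to_mid = {
    "CTRL-1": ["MID-A"],
    "CTRL-2": ["MID-B", "MID-C"],
    "CTRL-3": ["MID-D"],
    "CTRL-4": ["MID-A", "MID-D"],  # paths conflict, so it is excluded
    "CTRL-5": [],                  # no published path
}
mid_to_target = {
    "MID-A": "Loss Event",
    "MID-B": "Decision Support",
    "MID-C": "Decision Support",
    "MID-D": "Variance Management",
}
llm_direct = {
    "CTRL-1": "Loss Event",
    "CTRL-2": "Loss Event",
    "CTRL-3": "Variance Management",
    "CTRL-4": "Loss Event",
    "CTRL-5": "Decision Support",
}

derived = transitive_labels(source_to_mid, mid_to_target)
rate, disagreements = compare(llm_direct, derived)
print(f"compared {len(derived)} controls, agreement {rate:.0%}, check {disagreements}")
```

The list of disagreements is the useful output here. Each entry is a control where either the LLM or one of the crosswalks deserves a second look.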

Adversarial discrimination

This dimension separates a classifier that understands the taxonomy from one that pattern-matches on vocabulary.

The technique draws on minimal pairs from linguistics. Two control descriptions share most of their vocabulary but differ in the one dimension that determines the category, so their classifications should differ as well.

Consider the AICM's 243 control objectives. Mapping AICM controls to a risk taxonomy requires distinguishing between controls that assess a threat (a monitoring activity) and controls that resist a threat (a prevention activity). The vocabulary overlaps heavily — both mention adversarial threats and AI models — but the functional role of the control is fundamentally different.

Unlike human reviewers, who would immediately notice the difference, LLMs are most vulnerable to exactly this kind of confusion.
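
A minimal-pair test harness can be sketched in a few lines. The example below makes several assumptions: the control texts are invented rather than quoted from AICM, a pair counts as discriminated only when both members receive their expected label, and `keyword_classifier` is a deliberately naive stand-in for the LLM call, included to show what vocabulary matching looks like under this test.

```python
def discrimination_rate(classify, pairs):
    """Share of minimal pairs where BOTH members get their expected label."""
    hits = sum(
        classify(text_a) == label_a and classify(text_b) == label_b
        for (text_a, label_a), (text_b, label_b) in pairs
    )
    return hits / len(pairs)


def keyword_classifier(text):
    """Naive stand-in for an LLM: reacts to vocabulary, not functional role."""
    lowered = text.lower()
    return "resist" if "adversarial" in lowered or "block" in lowered else "assess"


# Invented minimal pairs: similar vocabulary, different functional role.
PAIRS = [
    (("Monitor model inputs for adversarial prompts", "assess"),
     ("Block adversarial prompts before they reach the model", "resist")),
    (("Review access logs for the training pipeline", "assess"),
     ("Restrict access to the training pipeline", "resist")),
    (("Audit firewall rules quarterly", "assess"),
     ("Block unapproved outbound traffic", "resist")),
]

rate = discrimination_rate(keyword_classifier, PAIRS)
print(f"{rate:.1%} of pairs discriminated")
```

The keyword classifier labels individual descriptions correctly four times out of six, yet only one of the three pairs is fully discriminated. Scoring by pair rather than by item is what exposes the weakness.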

Figure 3: Adversarial discrimination detail. Minimal pairs test whether the classifier distinguishes controls that assess threats from those that resist them. 3 of 8 pairs discriminated correctly.

Stability and sensitivity

Stability measures whether rephrasing a control without changing its meaning preserves the classification. "Establish governance policies supported by leadership" and "set up a governance programme backed by leadership" describe the same control. If the label changes, the model is responding to surface form, not semantics.

Sensitivity measures whether changing the control description in a way that should change the classification actually does so. Consider "establish policies and procedures for an information governance program" versus "implement automated configuration scanning to enforce governance baselines." Both contain the word "governance." But the first sets expectations, while the second automates enforcement. One verb shifts the functional domain, and the classifier should update accordingly.
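
Both measurements reduce to comparing labels across pairs of descriptions. The sketch below reuses the example sentences above and adds one invented paraphrase pair. The `stability` and `sensitivity` helpers, the two label names, and the surface-form classifier standing in for the LLM are assumptions for illustration.

```python
def stability(classify, paraphrase_pairs):
    """Share of (original, paraphrase) pairs that keep the same label."""
    same = sum(classify(a) == classify(b) for a, b in paraphrase_pairs)
    return same / len(paraphrase_pairs)


def sensitivity(classify, perturbation_pairs):
    """Share of (original, perturbed) pairs whose label changes as intended."""
    changed = sum(classify(a) != classify(b) for a, b in perturbation_pairs)
    return changed / len(perturbation_pairs)


def surface_classifier(text):
    """Naive stand-in for an LLM that keys on the word 'policies'."""
    return "sets-expectations" if "policies" in text.lower() else "enforces"


# Same meaning, different wording: the label should NOT change.
PARAPHRASES = [
    ("Establish governance policies supported by leadership",
     "Set up a governance programme backed by leadership"),
    ("Define policies for training pipeline security",          # invented pair
     "Write policies covering training pipeline security"),
]

# Different functional role: the label SHOULD change.
PERTURBATIONS = [
    ("Establish policies and procedures for an information governance program",
     "Implement automated configuration scanning to enforce governance baselines"),
]

print(f"stability   = {stability(surface_classifier, PARAPHRASES):.2f}")
print(f"sensitivity = {sensitivity(surface_classifier, PERTURBATIONS):.2f}")
```

The stand-in classifier detects the intended change but flips its label on the first paraphrase, which is the pattern of high sensitivity with imperfect stability that the two metrics are designed to separate. A stricter sensitivity check would also compare the new label against the expected one.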

Configurable thresholds

There is no universal threshold for any of these dimensions. The right threshold depends on what the classification feeds into. A mapping that supports a board-level risk report needs higher coherence than one that supports an internal inventory. A compliance mapping subject to audit needs near-perfect consistency. An exploratory cross-reference for gap analysis can tolerate lower convergent validity.
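
In practice this can be a small configuration table plus a verdict function. In the sketch below the scores are the ones reported in the AICM-to-FAIR-CAM demonstration in this article, while every threshold value is invented for illustration and is not the setting used in that demonstration. Consistency is expressed as the share of structural rules that passed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    pass_at: float      # score at or above this is a PASS
    marginal_at: float  # score at or above this (but below pass_at) is MARGINAL


def verdict(score, threshold):
    """Map a score to PASS, MARGINAL, or FAIL."""
    if score >= threshold.pass_at:
        return "PASS"
    if score >= threshold.marginal_at:
        return "MARGINAL"
    return "FAIL"


# Hypothetical thresholds; an organization would set these per use case.
THRESHOLDS = {
    "coherence": Threshold(pass_at=0.80, marginal_at=0.60),
    "consistency": Threshold(pass_at=1.00, marginal_at=0.95),
    "convergent_validity": Threshold(pass_at=0.70, marginal_at=0.50),
    "adversarial_discrimination": Threshold(pass_at=0.80, marginal_at=0.60),
    "stability": Threshold(pass_at=0.90, marginal_at=0.75),
}

# Scores as reported in the demonstration section of this article.
SCORES = {
    "coherence": 0.847,
    "consistency": 1.0,
    "convergent_validity": 0.463,
    "adversarial_discrimination": 0.375,
    "stability": 0.79,
}

verdicts = {name: verdict(SCORES[name], THRESHOLDS[name]) for name in SCORES}
for name, result in verdicts.items():
    print(f"{name:28s} {SCORES[name]:.3f}  {result}")
```

Tightening the stability threshold for an audit-facing mapping, or relaxing convergent validity for an exploratory one, only requires changing the table.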

Demonstration: AICM to FAIR-CAM

To test the framework in practice, 20 AICM v1.0.2 controls were mapped to the FAIR Controls Analytics Model (FAIR-CAM; FAIR Institute), an extension of the Factor Analysis of Information Risk (FAIR; FAIR Institute) framework that classifies controls by functional domain: Loss Event Controls (prevention, detection, response), Variance Management Controls (identification, correction), or Decision Support Controls (governance, expectations).

The classifier was an LLM using a structured prompt that returns a domain, subdomain, and function for each control description. Human expert domain mappings provided the coherence reference. The convergent validity reference was derived transitively through published crosswalks: AICM → CIS Controls v8 → FAIR-CAM.

The mapping challenge is that AICM and FAIR-CAM organize controls along different axes. AICM groups controls by security domain: model development security, AI system operations, governance. FAIR-CAM classifies by the functional role a control plays in risk management: does it prevent loss events, manage variance in the control environment, or support decision-making? A control like "define policies for training pipeline security" belongs to AICM's model development domain, but its functional role is governance: it sets expectations rather than preventing attacks directly. A classifier that sees "training pipeline security" and maps the control to loss prevention has been fooled by the subject of the policy rather than its role.

The results across all five dimensions:

Coherence: PASS. Fleiss' kappa across three LLM runs and one human expert rater: κ = 0.924 (90% CI: 0.873–0.975). LLM-to-human agreement reached κ = 0.847, exceeding what many expert panels achieve on ambiguous classification tasks.
Consistency: PASS. 82 structural rules checked, zero violations. Every mapping used valid FAIR-CAM taxonomy paths with no hallucinated domains, subdomains, or functions.
Convergent validity: FAIL. Of 11 controls with available transitive mappings through CIS Controls v8, 7 agreed with the LLM (κ = 0.463, accuracy = 64%). The four disagreements reflect ambiguity in how different crosswalks interpret the same control.
Adversarial discrimination: FAIL. 3 of 8 minimal pairs correctly discriminated (37.5%). The classifier could not reliably distinguish controls that assess threats from controls that resist threats: the vocabulary overlaps, but the functional role differs. This is the most important failure mode, because these are exactly the boundary cases where classification affects risk decisions.
Stability: MARGINAL. 79% paraphrase stability (CI: 0.67–0.92). Rephrasing control descriptions without changing their meaning caused classification shifts in roughly one in five cases. Sensitivity was perfect: all perturbations designed to shift the classification were correctly detected.

Practical implications

For organizations using LLM-generated control mappings today, coherence is the most actionable starting point. Take a stratified sample of controls (for example 20–30) covering all source and target categories, have a qualified analyst classify it independently, and compare the result with the LLM's output using chance-corrected kappa. This provides more information about whether to trust the output than unstructured manual review.

In the AICM-to-FAIR-CAM demonstration, the coherence score (κ = 0.847) indicates the LLM's mappings are trustworthy in aggregate. The adversarial score (37.5% discrimination) identifies exactly where they are not: boundary cases between threat assessment and threat resistance. Both numbers are needed; neither alone is sufficient.

The framework, including the AICM-to-FAIR-CAM demonstration and all five dimensions, is available as open source here.

References

[1] D. T. Campbell and D. W. Fiske, “Convergent and discriminant validation by the multitrait-multimethod matrix,” Psychological Bulletin, vol. 56, no. 2, pp. 81–105, 1959, doi: 10.1037/h0046016.

[2] J. Cohen, “A Coefficient of Agreement for Nominal Scales,” Educational and Psychological Measurement, vol. 20, no. 1, pp. 37–46, Apr. 1960, doi: 10.1177/001316446002000104.

[3] J. Jones, “FAIR Controls Analytics Model (FAIR-CAM) Standard V1.0,” 2025.

[4] J. L. Fleiss, “Measuring nominal scale agreement among many raters,” Psychological Bulletin, vol. 76, no. 5, pp. 378–382, Nov. 1971, doi: 10.1037/h0031619.

[5] J. Jones and J. Freund, Measuring and Managing Information Risk: A FAIR Approach, 2nd ed. Elsevier, 2025.

[6] D. Kaushik, E. Hovy, and Z. C. Lipton, “Learning the Difference that Makes a Difference with Counterfactually-Augmented Data,” arXiv.org. [Online]. Available: https://arxiv.org/abs/1909.12434

[7] A. M. Bean et al., “Measuring what Matters: Construct Validity in Large Language Model Benchmarks,” arXiv.org. [Online]. Available: https://arxiv.org/abs/2511.04703

[8] M. T. Ribeiro, T. Wu, C. Guestrin, and S. Singh, “Beyond Accuracy: Behavioral Testing of NLP Models with CheckList,” in Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Stroudsburg, PA, USA: Association for Computational Linguistics, 2020, pp. 4902–4912. Accessed: Jun. 29, 2026. [Online]. Available: https://doi.org/10.18653/v1/2020.acl-main.442

[9] A. Warstadt et al., “BLiMP: The Benchmark of Linguistic Minimal Pairs for English,” Transactions of the Association for Computational Linguistics, vol. 8, pp. 377–392, Dec. 2020, doi: 10.1162/tacl_a_00321.

About the Author

Laura Voicu is Co-Founder and CDSO of the Enterprise Risk Quantification Institute and lead author of the Cloud Security Alliance technical paper "Securing LLM-Backed Systems". She is a security data scientist (PhD, CISSP) and works at the intersection of data science, AI, and cybersecurity, specializing in empirical cyber risk quantification and the secure design of AI-powered systems, with 15 years of experience across banking, telecommunications, technology, research, and early-stage ventures.