GenAI Model Risk Management in Financial Institutions | SR 11-7 Framework Guide

📌 Key Takeaways

  • First SR 11-7 aligned GenAI framework: This research paper provides the first comprehensive model risk management framework for generative AI models aligned to the three pillars of the SR 11-7 supervisory guidance — Conceptual Soundness, Outcome Analysis, and Ongoing Monitoring.
  • 12 risk categories identified: Six model risks (data privacy, explainability, hallucination, fairness, toxicity, usage) and six non-model risks (reputation, regulatory, third-party, technology, cybersecurity, human capital) require distinct management approaches.
  • Three automated hallucination detection methods: Natural Language Inference, Self-check GPT, and Chain-of-Verification provide per-output hallucination scores, though human evaluation remains essential for calibrating error bounds.
  • Six governance controls defined: User, Usage, Human-in-the-Loop, Terms of Use Alert, Input, and Output controls create layered protection against GenAI risks in financial decision-making.
  • Explainability remains the hardest challenge: Traditional methods like SHAP and LIME are computationally expensive for LLMs, and global explainability requires a paradigm shift that current research has not yet achieved.

The GenAI Revolution in Financial Services

The rapid adoption of generative AI across financial institutions has created an urgent need for robust model risk management frameworks that can address the unique challenges posed by large language models. Since ChatGPT’s breakthrough launch in late 2022, banks and financial enterprises have accelerated their exploration of GenAI applications, from customer service chatbots to credit memo generation and regulatory compliance automation. However, these models — characterized by billions of parameters and trained on petabytes of data — present fundamentally different risk profiles from those of the traditional statistical models that existing regulatory frameworks were designed to govern.

A landmark research paper by Yu and Ye (2025) addresses this gap by providing the first end-to-end model risk management framework for generative AI models aligned to the Federal Reserve’s SR 11-7 supervisory guidance. This framework defines the incremental testing required for effective GenAI model risk management across three foundational pillars: Conceptual Soundness with six review areas, Outcome Analysis with five review areas, and Ongoing Monitoring with continuous performance tracking. The paper represents a critical contribution to an industry where regulatory expectations are rapidly evolving and the stakes of model failure include legal, regulatory, reputational, and financial consequences.

This interactive analysis from Libertify’s Interactive Library breaks down the complete framework, examining each component of GenAI model risk management and providing practical insights for risk officers, model validators, and technology leaders navigating this rapidly evolving landscape.

Regulatory Landscape: SR 11-7, EU AI Act and NIST

The regulatory framework governing GenAI model risk management in financial institutions rests on three primary pillars. The foundational guidance is SR 11-7, the “Supervisory Guidance on Model Risk Management” issued jointly by the Board of Governors of the Federal Reserve System and the Office of the Comptroller of the Currency in 2011. This guidance defines model risk management as a shared responsibility between the line of business, model development teams, and model risk management teams, and defines model validation as the set of processes and activities that verify models perform as expected.

While SR 11-7 predates the generative AI era, its three-pillar structure — Conceptual Soundness, Outcome Analysis, and Ongoing Monitoring — provides a robust foundation that can be extended to address GenAI-specific challenges. The research paper demonstrates how each pillar requires incremental testing methodologies when applied to large language models, creating a bridge between established regulatory expectations and emerging technology capabilities.

The EU AI Act (2024) adds a second regulatory dimension, establishing the first comprehensive AI legislation globally. For financial institutions operating in European markets, the AI Act creates additional compliance requirements around risk classification, transparency, and human oversight that intersect with SR 11-7 obligations. The NIST AI Risk Management Framework (AI 600-1, 2024) provides a third reference point, offering a generative AI-specific risk profile that complements both SR 11-7 and the EU AI Act. Together, these three frameworks create a comprehensive regulatory environment that financial institutions must navigate when deploying GenAI models.

GenAI Applications in Banking and Finance

The research identifies three primary categories of GenAI applications currently being deployed or piloted across financial institutions, each with distinct risk profiles and validation requirements. Understanding these applications is essential for applying the appropriate level of model risk management oversight.

Generative summarization represents the first category, where models condense information from specific contexts into actionable summaries. In banking, this includes enhancing the efficiency of complaint resolution by summarizing customer interactions and extracting pain point insights from phone call transcripts. These applications require validation focused on completeness, hallucination prevention, and fluency — ensuring that summaries accurately represent source material without introducing fabricated information.

Retrieval Augmented Generation (RAG) constitutes the second category, combining information retrieval with generative summarization. Financial institutions use RAG systems for chatbots that retrieve and summarize third-party research for credit memos, and for tools that help employees navigate internal policy documents. RAG validation requires additional evaluation dimensions including faithfulness to retrieved context, answer relevance, and context relevance — metrics that the RAGAS framework provides.

General content generation forms the third category, encompassing open-ended prompt-based generation for first drafts of internal communications, marketing materials, and internal legal or policy documents. This category presents the broadest risk surface because the freeform nature of both inputs and outputs allows large variation in model behaviour, making consistent validation particularly challenging.


Model Risks vs Non-Model Risks in Generative AI

The framework distinguishes between six categories of model risks and six categories of non-model risks, creating a comprehensive risk taxonomy that financial institutions can use to structure their GenAI governance programmes. This distinction is critical because model risks and non-model risks require fundamentally different mitigation strategies and oversight mechanisms.

Model risks begin with data and privacy risk, which is uniquely challenging for GenAI because it is virtually impossible to test data integrity for foundation models trained on petabytes of public data. Customizing these models for domain-specific financial use cases creates additional risks of exposing proprietary or confidential data. Explainability risk follows, driven by the billions of parameters and black-box nature of large language models that make explaining outcomes extraordinarily challenging. Performance and hallucination risk represents perhaps the most financially dangerous category — the paper notes that GenAI outputs may capture general trends but grossly misinterpret details, especially where financial numbers are involved.

Fairness and toxicity risk encompasses outputs that are factually correct but contain inappropriate language, or that reflect deep-rooted racial or gender biases propagated from pre-training datasets. Usage risk addresses the concern that out-of-box foundation models can be repurposed for multiple applications beyond their approved use cases, creating uncontrolled risk exposure.

Non-model risks include reputation risk from altered customer servicing, regulatory and compliance risk from innovations directly impacting consumers, third-party risks from vendor models and cloud hosting (including data leakage and copyright concerns), technology risk from sophisticated implementations that may change critical applications, cybersecurity risk from architectural vulnerabilities, and human capital risk from operational efficiency changes that may decrease or increase staffing needs.

The Six-Stage GenAI Model Lifecycle

The research paper presents a sequential six-stage model lifecycle that maps GenAI development and deployment to established risk management practices. Each stage involves specific actors, validation requirements, and decision gates that must be satisfied before progression to the next stage.

Stage one — Risk Rank Assessment — requires identification of the use case followed by thorough risk assessment of all inherent risks by multiple independent risk assessment bodies. The risk rating is determined by the reliance of business on model output, material impact from model errors (assessed along financial, customer/reputational, and regulatory dimensions), complexity of modelling choices, and feasibility of required controls. This rating then determines the minimum standards of model risk management activities throughout the remaining lifecycle stages.
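As a rough illustration of how the four rating dimensions might combine into a single risk rank, the sketch below scores each dimension on a 1–5 scale and maps the total to a tier. The weights, scales, and tier cut-offs are assumptions for illustration — the paper names the dimensions but does not prescribe this rubric.

```python
# Hypothetical risk-rank rubric; dimension names follow the paper, but the
# equal weighting and tier cut-offs are illustrative assumptions only.

def risk_rank(reliance: int, material_impact: int,
              complexity: int, control_feasibility: int) -> str:
    """Each dimension is scored 1 (low) to 5 (high); control feasibility is
    inverted because easier-to-implement controls reduce residual risk."""
    for s in (reliance, material_impact, complexity, control_feasibility):
        if not 1 <= s <= 5:
            raise ValueError("scores must be between 1 and 5")
    score = reliance + material_impact + complexity + (6 - control_feasibility)
    if score >= 16:
        return "high"
    if score >= 10:
        return "medium"
    return "low"

print(risk_rank(reliance=5, material_impact=5, complexity=4,
                control_feasibility=2))  # high
```

A high rank would then trigger the most stringent minimum standards across the remaining lifecycle stages.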

Stage two — Development — places responsibility with the first line of defence, where developers test the GenAI model comprehensively across dimensions of completeness, relevance, correctness, and alignment. Stage three — Initial Validation — shifts to the second line of defence, where validators review developer challenges and results while conducting independent testing for conceptual soundness and outcome analysis. This includes implementation approval, ensuring the model behaves consistently between development and production environments.

Stages four through six cover Model Deployment and Use (with procedural checks and control lockdown), Ongoing Performance Monitoring (periodic evaluation of performance deterioration and risk increases), and Periodic Validation Activities (second-line periodic reviews of monitoring effectiveness). The paper emphasises that additional testing is required whenever changes occur to the model, data source, or implementation environment.

Conceptual Soundness: Six Pillars of GenAI Validation

The Conceptual Soundness framework comprises six interconnected components that establish the theoretical and practical foundation for GenAI model validation. Each component addresses a distinct dimension of model risk that traditional validation approaches cannot adequately cover.

Literature Review and Selection Rationale (CS #1) requires thorough evaluation of the proposed foundation model for the target business use case, including discussion of abilities, weaknesses, and social impact. Data Quality Check (CS #2) encompasses three sub-areas: data privacy and leakage testing using RegEx suites for PII detection (credit card numbers, SSNs, emails) and jailbreaking prevention; data sampling and bias assessment using embedding models to cluster populations for stratified sampling; and annotation and labelling quality review for fine-tuning examples.
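The RegEx-based PII scan described under CS #2 can be sketched as follows. The patterns here are deliberately simplified illustrations of the credit card, SSN, and email detectors the paper mentions, not production-grade rules.

```python
import re

# Simplified PII patterns of the kind a CS #2 data-leakage suite would use;
# real suites need far more robust patterns plus checksum validation.
PII_PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "credit_card": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
}

def scan_pii(text: str) -> dict[str, list[str]]:
    """Return every match per PII category found in the text."""
    return {name: pat.findall(text)
            for name, pat in PII_PATTERNS.items() if pat.findall(text)}

hits = scan_pii("Contact jane.doe@bank.com, SSN 123-45-6789.")
print(sorted(hits))  # ['email', 'ssn']
```

The same scan can run over both fine-tuning data and model outputs, flagging leakage in either direction.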

Model Specification (CS #3) covers prompt engineering, hyperparameter tuning, evaluation of foundation model input context size limitations, and decisions between fine-tuning and prompting approaches, including post-training model compression techniques such as pruning, quantization, and distillation. Model Explainability (CS #4) remains the most challenging area — the paper acknowledges that traditional methods like SHAP and LIME suffer from computational complexity when applied to LLMs, and that global explainability requires “a paradigm shift” that current research has not achieved.

Bias and Fairness Testing (CS #5) uses guardrail models on output to detect bias, with quantification through metrics comparing protected versus non-protected groups, evaluated against established benchmark datasets. Benchmarking (CS #6) requires demonstrating alternate methodologies — including simpler methods that must be evaluated to justify the complexity of generative approaches, such as comparing extractive summarization against generative summarization.
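One common way to quantify the protected-versus-non-protected comparison CS #5 calls for is a demographic parity difference. The paper prescribes group-comparison metrics generally; this specific formula is a standard illustration, not the paper's mandated metric.

```python
# Demographic parity difference between a protected and a reference group;
# an illustrative fairness metric, assumed here for the CS #5 comparison.

def favourable_rate(outcomes: list[int]) -> float:
    """Share of favourable (1) outcomes within a group."""
    return sum(outcomes) / len(outcomes)

def demographic_parity_diff(protected: list[int], reference: list[int]) -> float:
    """Difference in favourable-outcome rates; 0.0 indicates parity."""
    return favourable_rate(reference) - favourable_rate(protected)

print(demographic_parity_diff([1, 0, 0, 0], [1, 1, 0, 0]))  # 0.25
```

Computed over benchmark datasets, large positive values would flag systematic disadvantage for the protected group.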


Outcome Analysis: Hallucination and Toxicity Testing

The Outcome Analysis framework defines five review areas that evaluate GenAI model performance through empirical testing rather than theoretical assessment. These areas address the most operationally significant risks of deploying generative AI in financial contexts.

Performance Metrics and Model Output Replication (OA #1) confronts a fundamental challenge: the auto-regressive nature of GenAI models makes exact output replication difficult because randomness introduced in the decoder prevents 100% reproducibility. The paper recommends documenting all decoding parameters including random number generation algorithms, and using semantic invariance techniques to quantify uncertainty when exact reproducibility is not achievable. Task-dependent evaluation dimensions are essential — summarization requires assessment of completeness, hallucination, and fluency, while RAG applications demand evaluation of faithfulness, answer relevance, and context relevance.

Robustness Testing (OA #2) applies controlled perturbations to input text with expected output to measure generalisation capability. Examples include synonym replacement and misspelling introduction for summarization tasks, and perturbations to input queries or retrieved context for RAG applications. Weakness Detection (OA #3) identifies segments where model performance falls below acceptable thresholds, using embedding models with clustering algorithms like BERTopic to discover weak performance regions defined by semantic or linguistic characteristics.
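The two perturbation types named for summarization tasks can be sketched directly. The synonym table and the adjacent-character-swap misspelling rule below are toy assumptions; a real OA #2 suite would draw on richer lexical resources.

```python
import random

# Toy OA #2 perturbations: a small assumed synonym table and a simple
# adjacent-character swap standing in for realistic misspellings.
SYNONYMS = {"increase": "rise", "decline": "drop", "profit": "earnings"}

def synonym_replace(text: str) -> str:
    """Swap known words for synonyms, leaving all other tokens intact."""
    return " ".join(SYNONYMS.get(w, w) for w in text.split())

def introduce_misspelling(text: str, rng: random.Random) -> str:
    """Swap two adjacent characters in one randomly chosen longer word."""
    words = text.split()
    i = rng.randrange(len(words))
    w = words[i]
    if len(w) > 3:
        j = rng.randrange(len(w) - 1)
        words[i] = w[:j] + w[j + 1] + w[j] + w[j + 2:]
    return " ".join(words)

print(synonym_replace("profit will increase"))  # earnings will rise
```

Running the model on both the original and perturbed inputs, then comparing outputs, gives the generalisation measurement the framework asks for.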

Hallucination Detection (OA #4) employs three automated methods. Natural Language Inference uses the source context as a premise and evaluates generated output chunks as hypotheses, converting contradiction logits into hallucination scores. Self-check GPT generates multiple stochastic responses and identifies hallucinations through inconsistency across generations. Chain-of-Verification has the LLM identify key facts, generate verification questions, and independently answer them — measuring consistency between original output and verification answers. All approaches produce per-output scores but require human evaluation for quality assurance and error bound establishment.
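The Self-check idea can be illustrated with a minimal consistency score: sample several stochastic responses, then flag claims in the primary output that no sample supports. Token overlap here is a crude stand-in for the semantic comparison (e.g. NLI scoring) a real implementation would use, and the sampled responses are assumed to come from repeated stochastic generation.

```python
# Crude Self-check-style sketch: a claim unsupported by any stochastic
# sample scores near 1.0 (likely hallucination). Token overlap stands in
# for a proper semantic comparison model.

def token_overlap(a: str, b: str) -> float:
    """Jaccard overlap between the lowercased token sets of two strings."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def selfcheck_score(claim: str, samples: list[str]) -> float:
    """1.0 = no sample supports the claim; 0.0 = fully supported."""
    return 1.0 - max(token_overlap(claim, s) for s in samples)

samples = ["Revenue grew 4% in Q3.", "Q3 revenue rose about 4%."]
print(selfcheck_score("Revenue grew 4% in Q3.", samples))  # 0.0
```

As the paper stresses, scores like this still need human evaluation to calibrate thresholds and establish error bounds before production use.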

Toxicity Detection (OA #5) addresses the alignment problem where factually correct outputs contain inappropriate language. The framework prescribes controls at three stages: development-stage controls through instruction fine-tuning and RLHF, implementation-stage controls through guardrail models like the Detoxify library, and validation-stage testing using large output sets evaluated by toxicity detection models.
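An implementation-stage toxicity gate can be sketched as a threshold check over a scorer's output. The blocklist scorer below is a toy stand-in for a real guardrail model such as the Detoxify classifiers the paper cites; only the gating pattern, not the scorer, is the point.

```python
from typing import Optional

# Toy toxicity scorer standing in for a real guardrail model (e.g. Detoxify);
# the blocklist and threshold are illustrative assumptions.
BLOCKLIST = {"idiot", "stupid"}

def score_toxicity(text: str) -> float:
    """Fraction of tokens on the blocklist (a real model scores semantics)."""
    tokens = text.lower().split()
    return sum(t.strip(".,!") in BLOCKLIST for t in tokens) / max(len(tokens), 1)

def gate_output(text: str, threshold: float = 0.1) -> Optional[str]:
    """Release the text only when it clears the toxicity threshold."""
    return text if score_toxicity(text) < threshold else None

print(gate_output("Your claim was approved."))  # Your claim was approved.
```

The same gate shape supports the validation-stage testing the framework describes: run large output sets through the scorer and audit the distribution of scores.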

Implementation Controls and the Governance Framework

The paper defines six types of implementation controls that create layered protection against GenAI risks in financial decision-making environments. These controls operate at different points in the model interaction pipeline, from user access through output screening, ensuring comprehensive risk mitigation.

User Controls ensure only authorised users interact with GenAI models appropriately, implementing access management IT controls and requiring certified training for heightened risk scenarios prior to use. Usage Controls ensure models are used responsibly for their intended purpose only, locking down functionality to specific approved tasks and maintaining records of all or sample model interactions for audit purposes.

Human-in-the-Loop Controls represent a critical governance mechanism, ensuring that model outputs are never fed directly into automated workflows without review by certified humans with subject matter expertise. This is particularly important in financial contexts where automated decision-making based on hallucinated outputs could have significant consequences. Terms of Use Alert Controls ensure users are aware of heightened risks through UI alerts and disclaimers within or alongside generated text.

Input Controls prevent queries that would manifest heightened risks through preprocessing to block harmful or unethical prompts, and through pre-designed prompt templates and examples that guide model behaviour within acceptable boundaries. Output Controls screen results for heightened risks before use, deploying guardrail models for toxicity and hallucination detection and imposing bounds on response length and complexity. Together, these six controls create a comprehensive governance architecture that addresses GenAI risks at every stage of the interaction lifecycle.
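The layering of these controls can be pictured as a request pipeline where each control either passes the request onward or rejects it with a reason. The individual checks below are simplified stand-ins for illustration, not the paper's concrete implementations.

```python
from typing import Callable, Optional

# Each control returns a rejection reason, or None to pass the request on.
# The concrete checks are simplified illustrative stand-ins.
Check = Callable[[dict], Optional[str]]

def user_control(req: dict) -> Optional[str]:
    return None if req.get("user_certified") else "user not certified"

def usage_control(req: dict) -> Optional[str]:
    approved_tasks = {"summarise_complaint"}  # assumed approved-task registry
    return None if req.get("task") in approved_tasks else "task not approved"

def input_control(req: dict) -> Optional[str]:
    bad = "ignore previous instructions"
    return "blocked prompt" if bad in req["prompt"].lower() else None

def run_controls(req: dict, checks: list[Check]) -> Optional[str]:
    """Apply controls in order; the first failure rejects the request."""
    for check in checks:
        reason = check(req)
        if reason:
            return reason
    return None

req = {"user_certified": True, "task": "summarise_complaint",
       "prompt": "Summarise this complaint transcript."}
print(run_controls(req, [user_control, usage_control, input_control]))  # None
```

Output controls and human-in-the-loop review would sit after generation in the same pipeline, screening the response before anyone acts on it.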

Ongoing Monitoring and Periodic Validation

Effective ongoing monitoring for GenAI models requires a fundamentally different approach than traditional model monitoring due to the dynamic nature of language model performance and the complexity of text-based outputs. The research paper specifies several essential components of a robust monitoring plan.

Monitoring frequency must be reasonable and justified, with sufficient and clearly defined KPI metrics that address key model risks. Acceptable thresholds for KPIs must capture deteriorating performance before it affects business outcomes, with decision-making guidance when multiple KPIs signal different conditions. Action plans for poor model performance must be predetermined and documented, ensuring rapid response when monitoring triggers are activated.

GenAI-specific monitoring additions include toxicity and hallucination KPIs measured continuously against production outputs, query domain stability metrics that detect when user queries drift beyond the model’s validated operating range, and operational KPIs including real-time user feedback mechanisms and tracking of successful versus attempted generation counts. The paper emphasises that automated metrics must be calibrated to human judgments to reduce annotation costs while maintaining quality standards.
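A minimal threshold check over these KPIs might look like the sketch below. The KPI names follow the paper's examples, but the threshold values are illustrative assumptions that each institution would calibrate to human judgments.

```python
# Illustrative KPI thresholds for GenAI monitoring; the limits are assumed
# values, to be calibrated against human-evaluated baselines in practice.
THRESHOLDS = {  # breach when the observed value exceeds the limit
    "hallucination_rate": 0.05,
    "toxicity_rate": 0.01,
    "query_domain_drift": 0.20,
}

def breached_kpis(observed: dict[str, float]) -> list[str]:
    """Return the KPIs whose observed values exceed their thresholds."""
    return sorted(k for k, limit in THRESHOLDS.items()
                  if observed.get(k, 0.0) > limit)

print(breached_kpis({"hallucination_rate": 0.08, "toxicity_rate": 0.005}))
# ['hallucination_rate']
```

Any breach would then trigger the predetermined action plan the monitoring framework requires, rather than an ad hoc response.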

Periodic Validation Activities (PVA) complement ongoing monitoring by having the second line of defence periodically review monitoring effectiveness, reassess risk ratings based on observed performance, and conduct independent testing that may reveal risks not captured by routine monitoring. The combination of continuous monitoring and periodic deep validation creates a defence-in-depth approach to GenAI model risk that aligns with SR 11-7’s expectations for ongoing vigilance.

The Future of GenAI Risk Management in Finance

The research paper acknowledges that generative AI models and their use cases are “in an early stage” and expects significant evolution “over the next few years.” This candid assessment underscores both the importance of establishing robust risk management frameworks now and the need for those frameworks to be adaptive enough to accommodate rapid technological change.

Several areas emerge as critical frontiers for future development. Explainability remains perhaps the most significant unsolved challenge — the paper notes that considerable research is required, particularly in global explainability that could provide systematic understanding of how LLMs arrive at their outputs across diverse inputs. Current approaches are limited to local explainability on a per-input-output basis, which is insufficient for the comprehensive model understanding that regulators and risk managers require.

The regulatory landscape itself is rapidly evolving, with the EU AI Act establishing new compliance requirements, the Bank for International Settlements actively researching AI governance in banking, and national regulators worldwide developing position papers on GenAI deployment in financial services. Financial institutions that invest in structured risk management frameworks now — aligned with SR 11-7’s enduring principles — will be better positioned to adapt as regulatory expectations crystallise.

The convergence of improved evaluation methodologies, better automated testing tools, and evolving regulatory guidance suggests that GenAI model risk management will mature significantly over the coming years. However, the fundamental tension between the transformative potential of generative AI in finance and the critical importance of controlling model risk will persist. The framework presented in this research provides a solid foundation that financial institutions can build upon as both the technology and the regulatory environment continue to evolve.


Frequently Asked Questions

How does SR 11-7 apply to generative AI models in banking?

SR 11-7 provides the foundational regulatory framework for model risk management in financial institutions. For generative AI models, the framework extends across three pillars: Conceptual Soundness (6 review areas covering literature review, data quality, model specification, explainability, bias testing, and benchmarking), Outcome Analysis (5 areas including performance metrics, robustness testing, weakness detection, hallucination detection, and toxicity detection), and Ongoing Monitoring with continuous performance tracking and periodic validation activities.

What are the main risks of deploying generative AI in financial services?

The paper identifies six model risks (data and privacy risk, explainability risk, performance and hallucination risk, fairness risk, toxicity risk, and usage risk) and six non-model risks (reputation risk, regulatory and compliance risk, third-party risks, technology risk, cybersecurity risk, and human capital risk). Hallucination is particularly dangerous in finance because factually incorrect outputs may capture trends but grossly misinterpret financial numbers.

How can financial institutions detect hallucinations in GenAI outputs?

The framework proposes three automated methods: Natural Language Inference (NLI) which uses source context as a premise and checks generated output for contradictions; Self-check GPT which generates multiple stochastic responses and measures inconsistency across generations; and Chain-of-Verification which has an LLM identify key facts, generate verification questions, and independently answer them to check consistency. All methods produce per-output hallucination scores but still require human evaluation for quality assurance.

What governance controls should banks implement for GenAI models?

The paper defines six types of controls: User Controls (access management and certified training), Usage Controls (locking functionality to approved tasks), Human-in-the-Loop Controls (requiring certified human review before decisions), Terms of Use Alert Controls (disclaimers and risk warnings), Input Controls (blocking harmful prompts and using pre-designed templates), and Output Controls (guardrail models for toxicity and hallucination screening with response bounds).

What is the recommended model lifecycle for GenAI in financial institutions?

The paper presents a six-stage lifecycle: Risk Rank Assessment (identification and risk evaluation by independent bodies), Development (first-line testing across completeness, relevance, correctness, and alignment), Initial Validation (second-line independent testing for conceptual soundness and outcome analysis), Model Deployment and Use (procedural checks and control lockdown), Ongoing Performance Monitoring (periodic KPI evaluation), and Periodic Validation Activities (second-line periodic reviews of monitoring effectiveness).
