AI Financial Risk Retrieval: How RiskEmbed Transforms Regulatory Compliance with RAG

📌 Key Takeaways

  • 43-Point NDCG Jump: RiskEmbed improves NDCG@10 from 43% to 86% over its base model, dramatically outperforming general-purpose embeddings for risk management queries.
  • Matches Top Commercial Models: Achieves 88% Hit Rate at 5, matching the best commercial finance embedding while using a smaller 768-dimensional vector space.
  • 94 OSFI Regulatory Documents: The RiskData dataset spans 33 years of Canadian regulatory guidelines aligned with international Basel standards, covering credit, market, and operational risk.
  • Open-Source Availability: Both RiskEmbed and RiskData are freely available on Hugging Face, enabling any financial institution to deploy or customize for their regulatory environment.
  • Domain Specificity Matters: Even the best finance-specific embedding only matches risk-specific RiskEmbed, and does so with a larger footprint, showing that sub-domain specialization within financial services yields measurable accuracy and efficiency gains.

The Regulatory Retrieval Challenge in Financial Risk Management

Financial risk management operates at the intersection of massive regulatory complexity and the critical need for accurate, timely information retrieval. Risk professionals must navigate thousands of pages of regulatory guidelines—from Basel Committee frameworks to jurisdiction-specific requirements—to ensure their institutions maintain compliance while managing credit risk, market risk, liquidity risk, and operational risk effectively.

A research team from TD Bank’s Model Development Innovation group has addressed this challenge head-on with a paper that introduces both a purpose-built dataset (RiskData) and a finetuned embedding model (RiskEmbed) designed specifically for financial risk management information retrieval. Authored by Amin Haeri, Jonathan Vitrano, and Mahdi Ghelichi, the research demonstrates that domain-specific finetuning can dramatically improve retrieval accuracy in Retrieval-Augmented Generation (RAG) systems—a finding with profound implications for how financial institutions build AI-powered compliance tools.

The core problem is straightforward but critical: when a risk analyst queries an AI system about regulatory requirements, the system must retrieve the most relevant passages from regulatory documents. General-purpose language models, even those trained on financial data, frequently retrieve tangentially related content rather than the precisely relevant regulatory guidance. In a domain where incorrect or incomplete information can lead to regulatory violations and financial losses, this imprecision is unacceptable.

Why General-Purpose Embeddings Fail for Risk Management

The paper makes a critical distinction between general financial terminology and risk management terminology. While general finance deals with concepts like P/E ratios, ROE, and market capitalization, risk management employs a highly specialized vocabulary—VaR (Value at Risk), PD (Probability of Default), LGD (Loss Given Default), RWA (Risk-Weighted Assets), and AML (Anti-Money Laundering)—that carries precise regulatory meaning fundamentally different from its surface-level interpretation.

This semantic gap means that embedding models trained on general financial corpora cannot adequately capture the nuanced relationships between risk management concepts. A query about “capital adequacy under stress scenarios” requires understanding not just the words but the entire regulatory framework of Basel III, OSFI guidelines, and institutional risk appetite frameworks. General-purpose models lack this contextual depth, leading to retrieval failures that compound through the RAG pipeline into inaccurate or misleading AI responses.

The researchers demonstrate this gap empirically. Their base model—Snowflake Arctic Embed-Medium, which ranks among the top performers on the Hugging Face MTEB leaderboard for general retrieval tasks—achieved only 43% NDCG@10 on risk management queries. This model, with 305 million parameters and strong general-purpose performance, fundamentally could not understand the domain-specific relationships in risk management data without finetuning. The gap between 43% and the finetuned 86% represents not just a statistical improvement but a qualitative leap from unreliable to production-ready retrieval quality.

Building RiskData: A Domain-Specific Regulatory Dataset

The creation of RiskData represents a methodical approach to building domain-specific AI training data for financial services. The researchers curated 94 regulatory guideline documents published by the Office of the Superintendent of Financial Institutions (OSFI) from 1991 to 2024, spanning more than three decades of regulatory evolution across credit risk, market risk, operational risk, and regulatory compliance frameworks.

To generate training pairs, the team employed Gemini 1 Pro to create 7,496 positive question-context pairs from these regulatory documents. Each pair consists of a question that a risk professional might ask and the relevant passage from the regulatory guidelines that answers it. Critically, all generated pairs underwent manual validation by specialized staff with expertise in risk management and regulatory compliance. Reviewers verified factual correctness, alignment with OSFI guidelines, and real-world relevance—a quality control step that distinguishes RiskData from synthetic datasets that lack human oversight.

The dataset’s coverage areas include insurance regulations and actuarial practices, securitization and risk management, regulatory compliance guidelines, and operational risk and financial modeling. This breadth ensures that the finetuned model can handle the full spectrum of risk management queries rather than excelling in only a narrow subset. The dataset is available on Hugging Face, enabling the broader financial services community to build upon this work.

RiskEmbed Architecture and Finetuning Methodology

RiskEmbed is built on the Snowflake Arctic Embed-Medium base model, chosen for its strong performance on general retrieval benchmarks and its 305 million parameter count that balances capability with computational efficiency. The model produces 768-dimensional embeddings—notably smaller than competitors like OpenAI’s Text-Embedding-3-Large (3,072 dimensions) and Cohere’s Embed-English-v3.0 (1,024 dimensions)—which reduces memory requirements and speeds up inference in production environments.

The finetuning process uses Multiple Negatives Ranking (MNR) loss, a contrastive training objective that maximizes similarity between each question and its matching context while minimizing similarity with the other contexts in the same batch, which serve as in-batch negatives. This approach is particularly effective for retrieval tasks because it teaches the model to separate relevant from irrelevant documents directly in the embedding space, without requiring explicitly labeled negative examples.
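A common formulation of MNR loss computes softmax cross-entropy over a batch similarity matrix, with each question's true context on the diagonal. The NumPy sketch below illustrates that formulation; it is not the paper's exact implementation, and the scale factor of 20 is a conventional default rather than a value the authors report:

```python
import numpy as np

def mnr_loss(q_emb: np.ndarray, c_emb: np.ndarray, scale: float = 20.0) -> float:
    """Multiple Negatives Ranking loss for a batch of (question, context) pairs.

    q_emb, c_emb: (batch, dim) L2-normalized embeddings. For row i, c_emb[i]
    is the positive context; every other row in the batch acts as a negative.
    The loss is softmax cross-entropy over each row of the similarity matrix,
    with the correct context on the diagonal.
    """
    scores = scale * q_emb @ c_emb.T              # (batch, batch) scaled cosine similarities
    row_max = scores.max(axis=1, keepdims=True)   # subtract max for numerical stability
    log_norm = np.log(np.exp(scores - row_max).sum(axis=1)) + row_max.ravel()
    return float(np.mean(log_norm - np.diag(scores)))
```

Minimizing this quantity pulls each question toward its own context while pushing it away from every other context in the batch, which is why larger batches implicitly supply more negatives per example.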

Training configuration was deliberately conservative: a batch size of 12, just 2 training epochs, and a 95/5 train-test split. The researchers found that additional epochs led to performance degradation due to overfitting—a finding that underscores the importance of domain-specific data quality over quantity. The dataset’s 7,496 carefully curated and validated pairs proved sufficient to transform a general-purpose retriever into a domain-specific expert with dramatically improved performance.

The evaluation framework employs four standard information retrieval metrics: MRR@10 (Mean Reciprocal Rank), MAP@100 (Mean Average Precision), NDCG@10 (Normalized Discounted Cumulative Gain), and HR@5 (Hit Rate at 5). Together, these metrics capture different aspects of retrieval quality—from how quickly the first relevant result appears to how well the entire ranked list serves the user’s information need.
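For the single-relevant-passage setup used here, three of these metrics reduce to short functions. The sketch below is an illustrative implementation of the standard definitions, not code from the paper:

```python
import math

def mrr_at_k(ranked_ids: list, relevant_id, k: int = 10) -> float:
    """Reciprocal rank of the first relevant hit within the top k (0 if absent)."""
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id == relevant_id:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(ranked_ids: list, relevant_id, k: int = 10) -> float:
    """Binary-relevance NDCG@k; with one relevant document, the ideal DCG is 1."""
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id == relevant_id:
            return 1.0 / math.log2(rank + 1)
    return 0.0

def hit_rate_at_k(ranked_ids: list, relevant_id, k: int = 5) -> float:
    """1 if the relevant document appears anywhere in the top k, else 0."""
    return 1.0 if relevant_id in ranked_ids[:k] else 0.0
```

For example, if the relevant passage lands at rank 2, MRR@10 is 0.5 and NDCG@10 is 1/log2(3) ≈ 0.63, while HR@5 is already a perfect 1.0; this is why HR@5 tends to saturate before the ranking-sensitive metrics do.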

Benchmark Results: 43-Point NDCG Improvement

The performance improvements achieved through finetuning are remarkable across every metric. MRR@10 jumped from 38% to 84%, a 46 percentage point increase indicating that relevant documents now appear dramatically higher in search results. MAP@100 improved from 39% to 84%, showing improved precision across the entire retrieval depth. And the headline NDCG@10 metric surged from 43% to 86%, demonstrating that finetuning not only finds the right documents but ranks them appropriately.

These improvements are not incremental—they represent a fundamental transformation of the model’s ability to understand and retrieve risk management content. A 43-point improvement in NDCG@10 means the difference between a system that frequently surfaces irrelevant regulatory passages and one that consistently delivers the precise guidance risk professionals need. In operational terms, this translates to faster compliance workflows, reduced risk of overlooking critical requirements, and more reliable AI-assisted decision-making.

The magnitude of improvement also validates the paper’s central thesis: that risk management represents a distinct sub-domain within finance where general-purpose and even general financial models cannot adequately serve. The base model’s 43% NDCG@10, despite being a top-performing general retriever, confirms that domain expertise cannot be substituted by general capability when the stakes are high. For institutions exploring AI-powered compliance tools, as discussed in the Bank of England analysis of AI in financial systems, these results provide a clear roadmap for building effective solutions.

RAG Pipeline Design for Financial Compliance

The paper describes a complete RAG architecture that illustrates how RiskEmbed fits into production compliance systems. The pipeline begins with document parsing, where regulatory guidelines are broken into semantically meaningful chunks. These chunks are then converted into dense vector representations using the embedding model and stored in a vector database for efficient similarity search.
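The chunking step can be approximated with an overlapping word-window splitter. This is an illustrative sketch rather than the paper's parser, and the 200-word window and 40-word overlap are assumed defaults, not values from the research:

```python
def chunk_document(text: str, max_words: int = 200, overlap: int = 40) -> list[str]:
    """Split a document into overlapping word-window chunks.

    The overlap keeps sentences that straddle a chunk boundary retrievable
    from at least one chunk; 200/40 are illustrative defaults only.
    """
    words = text.split()
    step = max_words - overlap
    return [" ".join(words[i:i + max_words])
            for i in range(0, max(len(words) - overlap, 1), step)]
```

Production systems often chunk on semantic boundaries (sections, clauses) instead of fixed windows, which matters for regulatory text where a requirement and its exceptions should stay together.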

At query time, the system employs a hybrid approach combining both lexical search (exact keyword matching) and semantic search (meaning-based similarity) to retrieve candidate passages. This hybrid strategy ensures that neither precise regulatory terminology nor conceptual understanding is sacrificed. The retrieved results are then combined into an augmented prompt that provides the LLM with relevant context for generating accurate, grounded responses.
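The hybrid retrieval idea can be sketched as a weighted blend of the two score types. The function below is a minimal stand-in: real systems typically use BM25 for the lexical side and an approximate-nearest-neighbor index for the semantic side, and the `alpha` weight is an assumed tuning knob, not a parameter from the paper:

```python
import numpy as np

def hybrid_search(query_vec, query_terms, doc_vecs, doc_term_sets,
                  alpha: float = 0.5, top_k: int = 5):
    """Rank documents by a blend of semantic and lexical scores.

    query_vec: (dim,) L2-normalized query embedding.
    doc_vecs: (n_docs, dim) L2-normalized chunk embeddings.
    query_terms / doc_term_sets: token sets for simple keyword overlap
    (a toy stand-in for a real lexical scorer such as BM25).
    alpha weights semantic similarity against lexical overlap.
    """
    semantic = doc_vecs @ query_vec                      # cosine similarity per chunk
    lexical = np.array([len(query_terms & terms) / max(len(query_terms), 1)
                        for terms in doc_term_sets])     # fraction of query terms matched
    blended = alpha * semantic + (1 - alpha) * lexical
    return np.argsort(-blended)[:top_k]                  # best-first chunk indices
```

The blend is what protects precise regulatory terminology: a chunk that contains the exact phrase "risk-weighted assets" gets lexical credit even when its embedding is only moderately close to the query.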

For financial institutions, this architecture addresses a critical challenge: how to leverage the powerful generation capabilities of large language models while grounding their outputs in authoritative regulatory sources. Without effective retrieval, LLMs are prone to hallucination—generating plausible-sounding but incorrect regulatory guidance. By dramatically improving retrieval accuracy, RiskEmbed reduces this hallucination risk at its source, ensuring that the LLM has access to the correct regulatory context before generating any response.

The practical importance of this design extends beyond individual queries. Risk management teams often need to cross-reference multiple regulatory requirements, trace the evolution of specific guidelines over time, and identify potential conflicts between different regulatory frameworks. A RAG system powered by accurate, domain-specific embeddings enables these complex analytical workflows in ways that keyword search alone cannot support, reflecting approaches outlined in Federal Reserve guidance on model risk management.

Comparing RiskEmbed Against Leading Embedding Models

The benchmarking study compares RiskEmbed against five leading closed-source API-based embedding models, providing a comprehensive view of where domain-specific finetuning sits relative to commercial alternatives. The results are striking: RiskEmbed achieves an 88% Hit Rate at 5, matching VoyageAI’s Voyage-Finance-2 (88%)—the only competitor specifically designed for financial applications.

Critically, RiskEmbed outperforms several models with significantly larger embedding dimensions. OpenAI’s Text-Embedding-3-Large, with a 3,072-dimensional embedding space (four times larger than RiskEmbed’s 768 dimensions), achieves only 86% HR@5. Google’s Text-Embedding-004 (768 dimensions) reaches 84%, and Cohere’s Embed-English-v3.0 (1,024 dimensions) achieves 85%. MistralAI’s Mistral-Embed (1,024 dimensions) performs better at 87% but still falls short of RiskEmbed’s 88%.

The efficiency advantage is significant for enterprise deployment. Smaller embedding dimensions translate directly to lower storage costs, faster similarity computations, and reduced infrastructure requirements. A financial institution processing millions of regulatory queries daily would see meaningful cost savings from using 768-dimensional embeddings versus 3,072-dimensional alternatives, without sacrificing retrieval accuracy. This efficiency-accuracy combination makes RiskEmbed particularly attractive for production deployments where both performance and cost matter.
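The storage arithmetic behind that claim is easy to verify. A back-of-the-envelope sketch, assuming raw float32 vectors with no compression or index overhead:

```python
def index_size_gb(n_vectors: int, dim: int, bytes_per_value: int = 4) -> float:
    """Raw storage for a dense float32 vector index, in gigabytes (10^9 bytes)."""
    return n_vectors * dim * bytes_per_value / 1e9

# One million chunks at 768 dimensions need about 3.1 GB of raw storage;
# the same corpus at 3,072 dimensions needs about 12.3 GB, a 4x gap
# before any similarity-computation speedup is even counted.
```

Real deployments add index structures and metadata on top of this, but the 4x ratio between the two dimensionalities carries through.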

Perhaps most revealing is the comparison with Voyage-Finance-2, a model explicitly finetuned on general financial data. Despite its financial domain training, it achieves the same HR@5 as RiskEmbed—but RiskEmbed does so with a smaller embedding dimension. This parity demonstrates that risk management truly constitutes a distinct sub-domain: general financial training helps, but specialized risk management finetuning achieves equivalent results more efficiently.

Practical Deployment for Financial Institutions

For financial institutions considering deployment of AI-powered compliance and risk management tools, RiskEmbed offers several practical advantages. First, its open-source availability on Hugging Face eliminates the dependency on closed-source API providers, addressing data sovereignty concerns that prevent many financial institutions from sending sensitive regulatory queries to external services.

Second, the model’s relatively modest size (305 million parameters) means it can be hosted on-premises with moderate computational resources, fitting within the infrastructure constraints of most mid-to-large financial institutions. This self-hosting capability is critical for organizations operating under strict data governance requirements that prohibit sending internal documents or queries to third-party APIs.

Third, the open availability of the RiskData dataset enables institutions to further customize the model for their specific regulatory environment. A European bank could supplement the OSFI-based training data with EBA regulatory guidelines, while a US institution could add Federal Reserve and OCC documents. This extensibility transforms RiskEmbed from a fixed model into a starting point for institution-specific optimization, a strategy consistent with the approaches outlined in the Oliver Wyman analysis of AI in financial services.

The paper’s methodology also serves as a template for building domain-specific retrieval systems in adjacent areas. Insurance companies could apply the same approach to actuarial regulatory documents, investment firms to securities regulation, and central banks to monetary policy guidelines. The demonstrated effectiveness of relatively small, curated datasets (7,496 pairs) with minimal training (2 epochs) lowers the barrier to entry for institutions that want to build similar capabilities for their specific domain.

Open-Source Impact and Future Research Directions

The decision to open-source both RiskData and RiskEmbed represents a significant contribution to the financial AI community. By making these resources freely available, the TD Bank researchers enable a virtuous cycle: other institutions can use and improve upon the model, contributing back enhanced datasets and finetuning insights that benefit the entire ecosystem. This collaborative approach contrasts with the proprietary model development that characterizes much of financial AI, where institutional knowledge remains siloed and duplicated.

The researchers identify several promising directions for future work. Exploration of triplet loss and negative mining techniques could improve model robustness by teaching the embedding model to make finer-grained distinctions between similar but non-identical regulatory concepts. Tokenizer vocabulary updates with risk-specific terminology could further improve the model’s understanding of domain language, addressing the fundamental encoding level rather than just the representation level.

Expansion of RiskData to include additional regulatory sources—particularly international banking guidelines beyond OSFI—would strengthen the model’s cross-jurisdictional generalization. For global financial institutions operating under multiple regulatory regimes simultaneously, a model that understands the relationships and differences between OSFI, EBA, Fed, and Basel Committee requirements would be invaluable. This research establishes the foundation for building such comprehensive regulatory AI systems, demonstrating that the path from general-purpose AI to domain-expert AI is shorter than many institutions assume.

For the broader financial services industry, RiskEmbed represents a proof of concept that specialized, efficient AI models can match or outperform much larger general-purpose systems on domain-specific tasks. As institutions continue their AI transformation journeys, this finding argues strongly for investing in domain-specific data curation and model customization rather than relying solely on the largest available commercial models. The future of financial AI lies not in ever-larger models but in precisely targeted ones—a lesson that extends well beyond risk management to every specialized function within financial services.

Frequently Asked Questions

What is RiskEmbed and how does it improve financial risk retrieval?

RiskEmbed is a finetuned embedding model built by TD Bank researchers specifically for financial risk management information retrieval. It improves NDCG@10 from 43% to 86%—a 43 percentage point increase—by training on 7,496 question-context pairs derived from 94 OSFI regulatory guidelines, enabling more accurate retrieval in RAG-based compliance systems.

Why do general-purpose embedding models fail for risk management queries?

General-purpose embedding models fail because financial risk management uses highly specialized terminology like VaR, PD, LGD, and RWA that differs significantly from general finance terms. Without domain-specific finetuning, even a top-ranked general retriever scored only 43% NDCG@10 on risk management queries, and a finance-specific model like VoyageAI Voyage-Finance-2 only matches RiskEmbed's hit rate while using a larger model and API dependency.

How was the RiskData dataset created for financial risk management?

RiskData was created by curating 94 regulatory guidelines from OSFI spanning 1991 to 2024, covering credit risk, market risk, operational risk, and compliance. Gemini 1 Pro generated 7,496 positive question-context pairs from these documents, which were then manually validated by specialized staff with expertise in risk management and regulatory compliance.

How does RiskEmbed compare to OpenAI and Google embedding models?

RiskEmbed achieves 88% Hit Rate at 5, matching VoyageAI Voyage-Finance-2 and outperforming OpenAI Text-Embedding-3-Large (86%), MistralAI Mistral-Embed (87%), Google Text-Embedding-004 (84%), and Cohere Embed-English-v3.0 (85%)—all while using a smaller 768-dimensional embedding compared to competitors using up to 3,072 dimensions.

Can RiskEmbed be used outside Canadian regulatory frameworks?

Yes. Although built on Canadian OSFI guidelines, the dataset aligns with international Basel Committee standards, making RiskEmbed applicable across jurisdictions including the United States and other countries following Basel accords. Both the model and dataset are open-sourced on Hugging Face for further customization.
