Agentic AI for Financial Modeling and Model Risk Management Crews
Table of Contents
- The Rise of Agentic AI Systems in Financial Services
- Multi-Agent Architecture for Financial Modeling
- Human-in-the-Loop Design for Agentic Financial AI
- The Modeling Crew: From Data to Trained Models
- Model Risk Management Crew: Validation and Compliance
- Credit Card Fraud Detection: 99.9% Accuracy Case Study
- Credit Card Approval and Portfolio Credit Risk Results
- Memory Streams and Agent Knowledge Transfer
- Future Directions: Self-Improving Financial AI Agents
📌 Key Takeaways
- Dual-Crew Architecture: Purpose-built modeling and model risk management crews collaborate with specialized agents performing distinct tasks from data extraction through compliance validation.
- 99.9% Fraud Detection Accuracy: The agentic system’s CatBoost model outperformed AutoML XGBoost on recall (81.6% vs 70.4%) and F1 score (88.9% vs 82.1%) for credit card fraud detection.
- Human-in-the-Loop Integration: Every step includes human expert oversight through a context-aware module enabling instruction, feedback, and corrective intervention at each agent task.
- MRM Robustness Testing: The MRM crew autonomously identifies model weaknesses through adversarial inputs and distribution shifts, catching vulnerabilities that standard testing misses.
- Multi-LLM Orchestration: Agents leverage Llama3, DeepSeek-R1, and GPT-3.5 Turbo through CrewAI, with different models assigned based on task requirements like reasoning depth.
The Rise of Agentic AI Systems in Financial Services
Large language models have evolved far beyond text generation into sophisticated autonomous agents capable of complex reasoning, planning, and collaborative problem-solving. In financial services—an industry built on accurate modeling, rigorous validation, and regulatory compliance—agentic AI systems offer a transformative approach to workflows that have traditionally required extensive manual effort from teams of data scientists, model validators, and compliance specialists.
A new research paper by Izunna Okpala, Ashkan Golgoon, and Arjun Ravi Kannan demonstrates how agentic AI crews can effectively perform end-to-end financial modeling and model risk management tasks. Their approach goes beyond theoretical frameworks to deliver working implementations tested on real financial datasets, providing concrete evidence that multi-agent systems can handle the complexity and rigor demanded by financial services institutions.
The research builds on the CrewAI framework—a cutting-edge platform for orchestrating role-playing, autonomous AI agents—to construct two interconnected crews. The modeling crew handles everything from data extraction and exploratory analysis through model training and evaluation. The model risk management (MRM) crew then validates the modeling crew’s outputs through compliance checking, replication, conceptual soundness testing, and adversarial robustness evaluation. Together, these crews represent a complete pipeline that mirrors the organizational structure of financial institutions, where modeling teams and validation teams operate as independent but interconnected functions. As the Bain guide on agentic AI enterprise architecture outlines, this crew-based approach represents the next evolution of enterprise AI deployment.
Multi-Agent Architecture for Financial Modeling
The system architecture comprises two interconnected crews with carefully defined roles that promote modularity and enable both independent and collaborative operation. The modeling crew includes eight specialized agents: a Data Extraction Agent that loads and splits datasets, an EDA Agent that performs exploratory data analysis, a Feature Engineering Agent that creates preprocessing pipelines, a Meta-Tuning Agent for model selection and hyperparameter optimization, a Model Training Agent, a Model Evaluation Agent, a Documentation Writer, and a Judge Agent that oversees the entire process.
Each agent takes on a specific persona that matches its function. The Data Extraction Agent operates as a “Data Analyst,” the EDA Agent as a “Data Scientist,” and the Meta-Tuning Agent as a “Senior Machine Learning Engineer.” These role designations are not cosmetic—they shape the agent’s approach to problem-solving, the vocabulary it uses in its outputs, and the level of technical depth it brings to its tasks. The Judge Agent takes the role of “Manager,” examining how well its coworkers performed and providing an additional validation layer before human review.
The architecture is deliberately designed to mirror how financial institutions actually organize their modeling teams. Data analysts prepare data, data scientists explore and engineer features, machine learning engineers select and train models, and senior engineers evaluate results. By mapping this organizational structure onto an agentic system, the researchers create a workflow that financial professionals can understand, trust, and integrate into existing processes. The multi-LLM approach adds further sophistication: Llama3 and GPT-3.5 Turbo handle most operational tasks, while DeepSeek-R1 powers the judge agents and documentation writers due to its superior reasoning capabilities.
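To make the role and model assignments concrete, here is a minimal Python sketch of how such a crew configuration might be represented. The personas for the Data Extraction, EDA, Meta-Tuning, and Judge agents follow the paper's descriptions; the remaining personas and the `AgentSpec` structure itself are illustrative assumptions, not the paper's actual CrewAI code.

```python
from dataclasses import dataclass

@dataclass
class AgentSpec:
    """Configuration for one specialized agent (illustrative sketch)."""
    name: str
    role: str   # persona that shapes vocabulary and technical depth
    llm: str    # model assigned by task requirements (e.g., reasoning depth)

# Assignments drawn from the paper where stated; personas marked
# "(assumed)" are placeholders, not confirmed by the source.
MODELING_CREW = [
    AgentSpec("data_extraction", "Data Analyst", "llama3"),
    AgentSpec("eda", "Data Scientist", "llama3"),
    AgentSpec("feature_engineering", "Data Scientist", "gpt-3.5-turbo"),     # (assumed)
    AgentSpec("meta_tuning", "Senior Machine Learning Engineer", "gpt-3.5-turbo"),
    AgentSpec("judge", "Manager", "deepseek-r1"),              # reasoning-heavy
    AgentSpec("documentation_writer", "Technical Writer", "deepseek-r1"),    # (assumed)
]

def agents_using(llm: str) -> list[str]:
    """List agent names routed to a given LLM."""
    return [a.name for a in MODELING_CREW if a.llm == llm]
```

Routing the reasoning-heavy roles to DeepSeek-R1 while cheaper models handle operational tasks mirrors the multi-LLM orchestration the paper describes.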
Human-in-the-Loop Design for Agentic Financial AI
Perhaps the most critical design decision in the system is its comprehensive human-in-the-loop (HITL) integration. In financial services, fully autonomous AI systems face both regulatory and practical barriers. Regulators require human oversight of model development and validation processes, and the consequences of AI errors in financial modeling—from incorrect risk assessments to compliance violations—demand that human experts remain actively engaged.
The HITL module allows human experts to provide instructions to agents at every step, select which agent performs each task, review intermediate results, and provide corrective feedback. This is not a simple approval gate at the end of the process but an active collaboration throughout. For example, during data extraction, a human expert might instruct the agent to “Drop the Time variable and split the original dataset using the 80/20 rule.” The agent executes the instruction, reports its results, and waits for further guidance before proceeding.
Knowledge transfer between agents is enabled through a context parameter inherited from the Task module. This allows each agent to access additional information about previous actions, ensuring continuity across the pipeline. The memory stream—an object that stores task delegations, execution timestamps, and information needed by collaborating agents—creates a persistent record that supports both agent collaboration and human oversight. This design ensures that no agent operates in isolation, and the human expert maintains a comprehensive view of the entire workflow’s progress, consistent with the Federal Reserve’s SR 11-7 guidance on model risk management requiring clear documentation and oversight throughout the model lifecycle.
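The review-and-retry pattern described above can be sketched as a small control loop. The function names and the feedback-folding convention are assumptions for illustration; the paper does not publish its HITL implementation.

```python
def run_task_with_hitl(agent_fn, instruction, review_fn, max_retries=3):
    """Execute one agent task under human-in-the-loop review (sketch).

    agent_fn:  callable that executes the instruction and returns a result
    review_fn: human review hook returning (approved, feedback)
    """
    for _ in range(max_retries):
        result = agent_fn(instruction)
        approved, feedback = review_fn(result)
        if approved:
            return result
        # Corrective feedback is folded into the next instruction,
        # so the agent retries with the expert's guidance attached.
        instruction = f"{instruction}\nHuman feedback: {feedback}"
    raise RuntimeError("Task not approved after corrective feedback")
```

The key property is that approval gates every step, not just the end of the pipeline: the agent waits for the reviewer's verdict before the workflow proceeds.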
The Modeling Crew: From Data to Trained Models
The modeling crew’s workflow follows a structured pipeline that mirrors industry best practices for machine learning model development. The process begins with the Data Extraction Agent, which loads datasets from specified sources, performs initial data splitting into training and test sets, and extracts representative samples when needed. The human expert provides specific instructions for data handling, such as sampling ratios and variable exclusions, ensuring the data preparation aligns with business requirements.
The EDA Agent then conducts in-depth exploratory data analysis, examining dataset shape, distributions, correlations, and potential anomalies. This step is critical for understanding the data characteristics that will influence model selection and performance. The Feature Engineering Agent follows, creating preprocessing pipelines that handle encoding, scaling, and transformation of variables. The interdependence between these agents illustrates the value of clear role designation and memory-based knowledge transfer—the Feature Engineering Agent builds on the EDA Agent’s findings, and the Meta-Tuning Agent subsequently uses the Feature Engineering Agent’s preprocessing pipeline.
The Meta-Tuning Agent represents a particularly sophisticated use of agentic AI. Tasked with model selection and hyperparameter tuning, this agent evaluates multiple machine learning algorithms—including CatBoost, AdaBoost, XGBoost, Gradient Boosting Machines, and Distributed Random Forests—to identify the optimal model and configuration for each specific dataset. The Model Training Agent then trains the selected model with the best-performing hyperparameters, followed by the Model Evaluation Agent, which calculates five key metrics: accuracy, F1 score, recall, precision, and area under the curve (AUC). Throughout this process, the memory stream chains results together, ensuring each agent can build on previous work toward the common goal.
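Four of the five metrics the Model Evaluation Agent reports can be derived directly from confusion-matrix counts; the sketch below shows those formulas (AUC needs ranked scores rather than counts, so it is omitted here). The function is illustrative, not the paper's code.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute accuracy, precision, recall, and F1 from confusion counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}
```

As a sanity check, plugging counts that reproduce the fraud study's 97.6% precision and 81.6% recall yields an F1 of roughly 88.9%, matching the reported figure.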
Model Risk Management Crew: Validation and Compliance
The MRM crew serves as a safeguard team ensuring that the modeling crew operates as intended while upholding regulatory rules, business objectives, and modeling standards. This crew consists of six specialized agents: a Documentation Compliance Checker, a Model Replication Agent, a Conceptual Soundness Agent, an Outcome Analyzer, a Judge Agent, and a Documentation Writer.
The Documentation Compliance Checker is a particularly innovative component. This agent uses a Cache-Augmented Generation (CAG) framework to compare the steps and tasks handled by the modeling crew against an organizational modeling guide. It verifies that every required step—from data preparation through model evaluation—was completed according to institutional standards. The agent takes the role of “Data Scientist” and utilizes DeepSeek-R1 for its operations, bringing both domain knowledge and strong reasoning capabilities to compliance verification.
The Model Replication Agent accurately replicates the model selected and trained by the modeling crew, verifying that performance metrics align with the reported results. This independent replication provides a crucial check against errors, overfitting, or data leakage that might not be apparent from the modeling crew’s own evaluation. The Conceptual Soundness Agent validates the model’s conceptual framework, examining feature importance, interpretability, and compliance with both statistical best practices and regulatory requirements.
Perhaps most valuable is the Outcome Analyzer, which tests the trained model using adversarial inputs that simulate extreme conditions. This agent perturbs test data through multiplication and addition of fixed or randomized values, creating distribution shifts and outlier scenarios that stress-test model robustness. By automating adversarial testing, the MRM crew identifies vulnerabilities that manual validation processes might miss—or that resource-constrained teams might skip. This approach reflects the rigorous validation standards advocated by the Basel Committee’s principles for sound management of operational risk.
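The perturbation strategy described above—multiplying and adding fixed or randomized values to test inputs—can be sketched as follows. The ranges used for the randomized mode are assumptions; the paper does not specify them.

```python
import random

def perturb(rows, scale=1.0, shift=0.0, randomized=False, seed=0):
    """Stress-test inputs by multiplying and adding fixed or randomized
    values, in the spirit of the Outcome Analyzer's perturbations (sketch)."""
    rng = random.Random(seed)  # seeded so runs are reproducible
    out = []
    for row in rows:
        new_row = []
        for x in row:
            # Fixed mode applies scale/shift directly; randomized mode
            # jitters them to simulate distribution shifts and outliers.
            s = scale * (1 + rng.uniform(-0.5, 0.5)) if randomized else scale
            b = shift * rng.uniform(0, 2) if randomized else shift
            new_row.append(x * s + b)
        out.append(new_row)
    return out
```

Scoring the trained model on `perturb(X_test, ...)` and comparing against its clean-data metrics is what surfaces the robustness gaps discussed in the credit risk results below.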
Credit Card Fraud Detection: 99.9% Accuracy Case Study
The first experimental use case applies the agentic framework to credit card fraud detection using a dataset of 284,807 transactions with 31 columns, where the “Class” column identifies fraudulent transactions. This is a classic imbalanced classification problem that tests the system’s ability to handle real-world complexity.
The agentic system selected CatBoost as the optimal model, achieving impressive performance metrics: 99.9% accuracy, 97.6% precision, 81.6% recall, and an 88.9% F1 score. These results are particularly noteworthy when compared against two benchmarks. The AutoML solution (using H2O with XGBoost) achieved 99.9% accuracy but only 70.4% recall and 82.1% F1 score—the agentic system’s CatBoost model outperformed on the metrics that matter most for fraud detection, where catching fraudulent transactions (recall) is more important than overall accuracy.
The researchers also compared against a top Kaggle solution, which achieved 100% accuracy, 100% precision, 99.4% recall, and 99.7% F1 score. However, they identified a critical flaw: the Kaggle solution applied SMOTE and random undersampling prior to the train-test split, resulting in data leakage that compromised the integrity of its results. This finding demonstrates a key advantage of the agentic approach—the MRM crew’s systematic validation process can identify methodological flaws that might go undetected in less rigorous workflows. The comprehensive guide to AI risk modeling frameworks provides additional context for understanding these validation challenges.
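The leakage flaw the MRM crew caught comes down to ordering: resampling must happen after the train-test split, and only on the training fold. The sketch below uses simple random oversampling instead of SMOTE to stay dependency-free; the ordering principle is the same.

```python
import random

def train_test_split(rows, labels, test_ratio=0.2, seed=42):
    """Shuffle and split BEFORE any resampling touches the data."""
    rng = random.Random(seed)
    idx = list(range(len(rows)))
    rng.shuffle(idx)
    cut = int(len(idx) * (1 - test_ratio))
    tr, te = idx[:cut], idx[cut:]
    return ([rows[i] for i in tr], [labels[i] for i in tr],
            [rows[i] for i in te], [labels[i] for i in te])

def oversample_minority(rows, labels, seed=0):
    """Random oversampling of the minority class (1), applied to the
    training fold only. SMOTE would interpolate new points instead."""
    rng = random.Random(seed)
    minority = [i for i, y in enumerate(labels) if y == 1]
    majority = [i for i, y in enumerate(labels) if y == 0]
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    keep = list(range(len(rows))) + extra
    return [rows[i] for i in keep], [labels[i] for i in keep]
```

Reversing the order—resampling the whole dataset and then splitting—plants duplicates (or interpolations) of test-fold minority points inside the training fold, which is precisely the inflation the researchers identified in the Kaggle solution's near-perfect scores.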
Credit Card Approval and Portfolio Credit Risk Results
The second use case tackles credit card approval prediction using merged application and credit record datasets. Data preparation involved removing duplicate IDs, calculating applicant age and years of work experience, filtering out applicants under 21 and those with null values, computing monthly loan payments, and categorizing the STATUS variable as “Good Debt” or “Bad Debt” based on days overdue. Pentaho Data Integration was used for initial processing, with the agentic system handling subsequent modeling.
The MRM crew’s Outcome Analyzer tested the trained model under both shifted inputs and adversarial conditions. The results were encouraging: the model exhibited no significant decline in performance with adversarial inputs, suggesting robust pattern capture. The top five performing features were Owned_Work_Phone, Owned_Mobile_Phone, Total_Family_Members, Housing_Type, and Job_Title, providing interpretable insights into the model’s decision-making process.
The third use case—portfolio credit risk modeling—produced the most revealing findings about the value of MRM crew validation. Using a dataset of 32,581 data points with 12 features, the agentic system and AutoML both achieved comparable initial performance (AutoML: 92.9% accuracy, 94.4% precision, 72.4% recall, 81.9% F1 using Distributed Random Forest). However, the MRM crew’s analysis revealed significant performance degradation under distribution shifts: accuracy dropped to 79.1%, F1 to 60.5%, and precision to 52.2%.
These findings demonstrate a critical vulnerability: the credit risk model may be susceptible to changes in input data distribution—a common real-world occurrence during economic shifts, policy changes, or population drift. Importantly, the model showed resilience to adversarial inputs (maintaining 92.8% accuracy), indicating that the vulnerability is specifically related to distributional changes rather than outlier sensitivity. This nuanced analysis—distinguishing between different types of model weakness—exemplifies the sophisticated validation that agentic MRM crews can provide, reflecting concerns raised in the Bank of England stability analysis of AI in financial systems.
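The degradation analysis above amounts to comparing baseline metrics against shifted-input metrics and flagging drops beyond a tolerance. The comparison logic and the 5-point tolerance below are illustrative assumptions; the numbers are the paper's reported portfolio credit risk results.

```python
def degradation_report(baseline, shifted, tolerance=0.05):
    """Flag any metric that drops by more than `tolerance` (absolute)
    when the model is scored on shifted inputs."""
    flags = {}
    for name, base in baseline.items():
        drop = base - shifted.get(name, 0.0)
        flags[name] = {"drop": round(drop, 3), "flagged": drop > tolerance}
    return flags

# Portfolio credit risk: comparable initial performance, but large
# drops under distribution shift (figures from the paper).
baseline = {"accuracy": 0.929, "f1": 0.819, "precision": 0.944}
shifted = {"accuracy": 0.791, "f1": 0.605, "precision": 0.522}
```

Running the same report with the adversarial-input metrics (accuracy holding at 92.8%) would leave accuracy unflagged, which is exactly how the MRM crew distinguished distributional fragility from outlier sensitivity.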
Memory Streams and Agent Knowledge Transfer
The memory architecture underpinning the agent system represents a sophisticated approach to maintaining context and enabling collaboration in multi-step workflows. The memory stream is a bounded object that stores task delegations in natural language, task execution timestamps, and information needed by collaborating agents. Its core function is storing interconnected interactions from different agents, creating a persistent knowledge graph of the entire modeling process.
When the Data Extraction Agent completes its task, its outputs—dataset locations, splitting ratios, preprocessing decisions—are stored in the memory stream and become accessible to subsequent agents. The EDA Agent can reference the exact data splits used, the Feature Engineering Agent can build on EDA findings, and the Meta-Tuning Agent can leverage the entire chain of decisions made before it. This chaining mechanism ensures consistency across the pipeline and eliminates the information loss that occurs when human teams pass work between members through informal channels.
The delegation mechanism adds another layer of sophistication. The human expert can delegate tasks to specific agents, redirect work when an agent’s output is unsatisfactory, and provide additional context that shapes how agents approach their tasks. This combination of automated knowledge transfer and human-directed delegation creates a flexible system that adapts to the specific requirements of each modeling project while maintaining the structured oversight that financial institutions require. The approach extends beyond what traditional Federal Reserve model governance guidance envisions, pointing toward a future where AI-human collaboration is the norm rather than the exception.
Future Directions: Self-Improving Financial AI Agents
The researchers identify several compelling directions for advancing agentic AI in financial services. Self-improving agents represent the next frontier—agents that enhance their initial prompts and adapt to roles not initially assigned to them through continuous learning from past interactions and feedback. An adaptive learning agent that improves its performance over time could eventually handle increasingly complex financial modeling tasks with decreasing human intervention, though always maintaining the human-in-the-loop safeguard for critical decisions.
Crew-generating agentic systems represent another promising direction. Rather than manually defining crew compositions and agent roles, future systems could dynamically create crews optimized for specific tasks. A credit risk modeling task might generate a different crew composition than a market risk assessment, with agents selected and configured based on the specific requirements of each project. This meta-level automation would further reduce the setup overhead for financial institutions deploying agentic systems.
The application of reinforcement learning to agentic workflows could enable agents to optimize their collaborative strategies over time. Currently, the workflow follows a predefined sequence; with RL, agents could learn when to seek human feedback, how to allocate computational resources across tasks, and how to prioritize competing objectives. Graph theory applications to agentic systems could model the complex relationships between agents, tasks, and knowledge dependencies, enabling more efficient workflow orchestration as explored in the Bain Technology Report on agentic AI transformation.
For the financial services industry, this research provides a practical blueprint for implementing agentic AI in core functions. The combination of specialized agent crews, human-in-the-loop oversight, memory-based knowledge transfer, and multi-LLM orchestration creates a system that is both powerful enough to handle complex modeling tasks and transparent enough to satisfy regulatory requirements. As institutions continue to explore AI-powered automation, the crew-based approach—with its clear role definitions, systematic validation, and human oversight—offers a path forward that balances innovation with the rigorous governance that financial services demands.
Frequently Asked Questions
What are agentic AI crews for financial modeling?
Agentic AI crews are teams of specialized LLM-powered agents that collaborate to perform complex financial modeling tasks. A modeling crew includes agents for data extraction, exploratory data analysis, feature engineering, model selection, hyperparameter tuning, training, evaluation, and documentation, all coordinated by a judge agent with human-in-the-loop oversight.
How does the model risk management crew validate AI models?
The MRM crew consists of specialized agents for documentation compliance checking, model replication to verify performance metrics, conceptual soundness validation examining feature importance and interpretability, outcome analysis using adversarial and shifted inputs to test robustness, and documentation writing. A judge agent and human expert provide final oversight.
What financial use cases were tested with agentic AI crews?
Three use cases were tested: credit card fraud detection with 284,807 transactions where CatBoost achieved 99.9% accuracy and 88.9% F1 score, credit card approval prediction using merged application and credit records, and portfolio credit risk modeling with 32,581 data points where the MRM crew identified performance degradation under distribution shifts.
Which LLMs power the financial modeling agents?
The agents are powered by multiple LLMs: Llama3 and GPT-3.5 Turbo handle most modeling and MRM tasks, while DeepSeek-R1 powers the judge agents and documentation writers due to its superior reasoning capabilities. The CrewAI framework orchestrates the multi-agent collaboration.
How does human-in-the-loop work in agentic financial modeling?
The human-in-the-loop module allows human experts to provide instructions to agents, select which agent performs each task, review intermediate results, and provide corrective feedback at each step. Knowledge transfer is enabled through a context parameter that lets agents access previous actions, while the judge agent provides an additional validation layer before human review.