Central Bank Language Models: How BIS AI Research Is Transforming Monetary Policy Analysis

📌 Key Takeaways

  • Domain adaptation works: CB-LMs correctly predicted 90 out of 100 central banking idioms versus just 60 for standard RoBERTa
  • Size isn’t everything: Smaller encoder-only CB-LMs outperform much larger generative LLMs on sentence-level monetary policy classification without fine-tuning
  • Task complexity matters: For longer, more complex texts with limited training data, large generative models like Llama-3 70B and ChatGPT-4 Turbo achieve superior results
  • Operational risks are real: Proprietary AI models raise confidentiality, replicability, and dependency concerns for central banks handling sensitive policy data
  • Collaboration is essential: The BIS urges central banks to share AI models, best practices, and development experiences to keep pace with rapidly evolving technology

What Are Central Bank Language Models?

Central bank language models represent a significant breakthrough in the application of artificial intelligence to monetary policy analysis. Published as BIS Working Paper No. 1215, this research introduces CB-LMs — specialized encoder-only language models retrained on vast central banking corpora that consistently outperform general-purpose foundation models on domain-specific natural language processing tasks. The implications for how central banks communicate, analyze policy sentiment, and process economic intelligence are profound and far-reaching.

At their core, CB-LMs are built upon two established foundational architectures: BERT, developed by Google with approximately 110 million parameters, and RoBERTa, developed by Meta with roughly 125 million parameters. What distinguishes CB-LMs from their parent models is the process of domain adaptation — unsupervised retraining on central bank-specific text that teaches these models the grammar, idioms, semantics, and structural patterns unique to monetary policy discourse. This approach mirrors successful domain-specific adaptations in other fields, such as BioBERT for biomedical literature and FinBERT for financial text analysis.

The training corpus assembled by BIS researchers is remarkable in scope: 37,037 research papers totalling 2.7 gigabytes and 18,345 speeches amounting to 0.34 gigabytes, all sourced through the BIS Central Bank Research Hub. Three distinct token sets were generated — speech-only, paper-only, and combined — resulting in six unique CB-LM variants that provide researchers with flexible tools for different analytical contexts. Understanding how AI is reshaping financial regulation helps contextualize the significance of this research within the broader fintech landscape.

BIS Working Paper 1215: Research Methodology

The methodology employed in BIS Working Paper 1215 reflects a rigorous, multi-layered approach to evaluating language model performance in central banking contexts. The researchers designed their study around masked language modelling, a training technique where tokens within text are randomly hidden and the model must predict them from surrounding context. This bidirectional understanding is precisely what makes encoder-only architectures particularly effective for classification and sentiment analysis tasks.
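To make the masking mechanic concrete, here is a minimal toy sketch of the masking step in masked language modelling — an illustration of the training objective, not the BIS implementation (the `mask_tokens` helper and the 15% masking rate mirror common practice, but all names here are hypothetical):

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, seed=42):
    """Randomly hide a fraction of tokens, as in masked language modelling.

    Returns the masked sequence plus the (position, original_token) pairs
    the model would be trained to predict from bidirectional context.
    """
    rng = random.Random(seed)
    masked, targets = [], []
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            masked.append(MASK)      # hide this token from the model
            targets.append((i, tok)) # ...but keep it as the training target
        else:
            masked.append(tok)
    return masked, targets

sentence = ("the committee decided to keep the target range "
            "for the federal funds rate unchanged").split()
masked, targets = mask_tokens(sentence)
print(" ".join(masked))
print(targets)
```

During pre-training the model sees only the masked sequence and is scored on recovering the hidden tokens — which is why encoder models learn rich bidirectional context rather than left-to-right generation.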

Six CB-LMs were created by combining two foundational models with three dataset configurations. Each BERT-based and RoBERTa-based model was trained separately on speech-only data, paper-only data, and the combined speech-plus-paper corpus. The selection of BERT and RoBERTa as base architectures was deliberate, driven by their broad acceptance in the NLP research community and computational feasibility given available GPU resources. This pragmatic approach ensures that the resulting models can be deployed by central banks without requiring prohibitively expensive computational infrastructure.

The evaluation framework consisted of three progressively complex benchmarks. First, a masked word test using 100 central banking idioms assessed basic domain comprehension. Second, sentence-level FOMC monetary policy stance classification tested practical analytical utility using 1,243 manually labelled sentences from Federal Open Market Committee statements spanning 1997 to 2010. Third, a complex news classification task using 237 items from the BIS internal daily newsletter evaluated performance on longer, more nuanced texts. Each benchmark was designed to isolate different aspects of model capability, from vocabulary recognition to contextual reasoning across extended passages.

The dataset for FOMC classification, drawn from the work of Gorodnichenko and colleagues published in 2023, provided a gold standard of human-labelled sentences categorized as hawkish, dovish, or neutral. An 80/20 train-test split was repeated 30 times to ensure statistical robustness, allowing researchers to report mean accuracy with confidence intervals rather than single-point estimates. This methodological rigor distinguishes the BIS research from many industry benchmarks that rely on single evaluation runs.

Domain Adaptation: Why Specialized Central Bank AI Outperforms

The most compelling evidence for the value of domain adaptation emerges from the masked word test, where CB-LMs correctly predicted 90 out of 100 central banking idioms. By comparison, the standard RoBERTa foundation managed only 60 correct predictions, and BERT achieved just 53. This dramatic improvement demonstrates that central banking language contains specialized vocabulary and contextual patterns that general-purpose training corpora — primarily Wikipedia and BookCorpus — simply cannot capture.

Consider the phrase “forward guidance” in a central banking context. A general-purpose model might associate these words with navigation or customer service instructions. A CB-LM, having processed thousands of central bank speeches and papers, understands that forward guidance refers to a specific monetary policy communication tool used by central banks to signal future interest rate intentions. This contextual precision cascades through every analytical task the model performs, from sentiment classification to topic extraction.

Performance improved with training dataset size, a finding consistent with scaling laws observed across the broader machine learning literature. Combined speech-plus-paper models performed best, followed by paper-only models, then speech-only variants. This hierarchy makes intuitive sense: research papers provide technical depth and formal argumentation structures, while speeches contribute the rhetorical patterns and communication strategies that central bankers employ when addressing markets and the public. The combination captures both dimensions of central banking discourse.

RoBERTa-based CB-LMs demonstrated statistically significant improvements over their foundation on the FOMC classification task, achieving approximately 84% mean accuracy compared to the foundation’s 81%. Interestingly, BERT-based CB-LMs did not show comparable improvement, likely due to overfitting caused by BERT’s smaller parameter space (15 million fewer parameters than RoBERTa) and its less extensive original training regime. This finding highlights an important nuance in domain adaptation: the base model’s capacity and pre-existing knowledge significantly influence the effectiveness of domain-specific retraining. For readers interested in how AI benchmarking extends to digital assets and portfolio management, these methodological insights have direct relevance.

Transform complex central banking research into interactive experiences your team will actually read.

Try It Free →

CB-LMs vs ChatGPT: Monetary Policy Classification

Perhaps the most striking finding of BIS Working Paper 1215 is the comparative performance of CB-LMs against state-of-the-art generative large language models on the FOMC sentence classification task. Without any fine-tuning, every generative LLM tested — including ChatGPT-4 Turbo, ChatGPT-3.5 Turbo, Llama-3, Mistral, and Mixtral — underperformed the encoder-only CB-LMs. Even ChatGPT-4 Turbo in zero-shot mode achieved only 71% accuracy, falling well below the RoBERTa foundation’s approximately 80% and the best CB-LM’s 84%.

This counterintuitive result challenges the prevailing narrative that larger, more expensive models always deliver superior performance. The architectural explanation is illuminating: generative decoder-only models are optimized for text generation, prioritizing broad context and coherence in producing next-token predictions. Encoder-only models, by contrast, are specifically designed to create rich contextual embeddings suited to classification tasks. When the goal is to categorize monetary policy sentiment rather than generate text, the encoder architecture holds a structural advantage.

Fine-tuning partially bridges the gap. Fine-tuned ChatGPT-3.5 Turbo achieved the highest overall accuracy at 88%, surpassing all CB-LMs. Llama-3 70B with Direct Preference Optimization and retrieval-augmented generation reached 85%. However, fine-tuning is neither straightforward nor universally beneficial. Mistral 7B’s accuracy actually dropped to 38% after supervised fine-tuning, and Llama-3 70B Instruct with random sampling similarly degraded to 38%. These unpredictable degradation patterns represent a significant operational risk for central banks that cannot afford unreliable analytical tools.

Model                           Configuration      Accuracy
------------------------------  -----------------  --------
Fine-tuned ChatGPT-3.5 Turbo    SFT                88%
Llama-3 70B (4-bit)             DPO + retrieval    85%
CB-LM RoBERTa+Paper+Speech      Fine-tuned         ~84%
ChatGPT-4 Turbo                 Retrieval          81%
RoBERTa foundation              Fine-tuned         ~80%
ChatGPT-4 Turbo                 Zero-shot          71%
ChatGPT-3.5 Turbo               Zero-shot          56%
Llama-3 8B                      Zero-shot          56%
Mistral 7B                      Zero-shot          50%

The retrieval-based in-context learning approach generally outperformed random sampling across models, suggesting that providing relevant examples within prompts helps generative models ground their responses in domain-appropriate patterns. However, combining fine-tuning with in-context learning did not conclusively enhance performance, indicating diminishing returns from stacking optimization strategies. This finding is directly relevant to central banks evaluating the cost-benefit tradeoff of increasingly complex model deployment pipelines.
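A minimal sketch of retrieval-based in-context learning follows, assuming a crude Jaccard word-overlap similarity in place of the embedding-based retrieval a production setup would use; the prompt wording and helper names are hypothetical, not the paper's:

```python
def similarity(a, b):
    """Jaccard word overlap -- a simple stand-in for embedding similarity."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb)

def build_prompt(query, labelled_examples, k=2):
    """Retrieve the k labelled sentences most similar to the query and
    prepend them as few-shot demonstrations before the sentence to classify."""
    ranked = sorted(labelled_examples,
                    key=lambda ex: similarity(query, ex[0]), reverse=True)
    lines = ["Classify each sentence as hawkish, dovish, or neutral."]
    for text, label in ranked[:k]:
        lines.append(f"Sentence: {text}\nStance: {label}")
    lines.append(f"Sentence: {query}\nStance:")
    return "\n\n".join(lines)

examples = [
    ("The Committee decided to raise the target range by 25 basis points.", "hawkish"),
    ("The Committee decided to lower the target range to support growth.", "dovish"),
    ("The Committee will continue to monitor incoming information.", "neutral"),
]
query = "The Committee judged that a further increase in the target range was warranted."
print(build_prompt(query, examples))
```

The intuition is exactly the one described above: by grounding the prompt in the most relevant labelled examples rather than random ones, the generative model sees domain-appropriate patterns before making its prediction.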

FOMC Sentiment Analysis and NLP Benchmarks

The FOMC sentiment analysis task serves as the cornerstone benchmark in BIS Working Paper 1215 and reflects a broader trend in applying natural language processing to central bank communication analysis. The dataset comprises 1,243 sentences from FOMC statements issued between 1997 and 2010, each manually classified as hawkish (indicating tightening monetary policy), dovish (signalling easing), or neutral. This granular sentence-level annotation enables precise evaluation of model capabilities in distinguishing subtle policy signals.

The research situates CB-LMs within a rich lineage of NLP approaches to central bank communication. Early work relied on bag-of-words methods, where researchers like Acosta and Meade in 2015 and Ehrmann and Talmi in 2020 used simple word frequency counting to gauge policy sentiment. Topic modelling followed, with Hansen and colleagues in 2017 applying Latent Dirichlet Allocation to FOMC transcripts. More recently, deep learning approaches have expanded beyond text to include visual and audio channels of central bank communication, with researchers analyzing FOMC videos using convolutional neural networks and audio using speech recognition models.

CB-LMs advance this progression by offering transformer-based models that capture context far more effectively than bag-of-words or word-counting approaches. A key insight from the BIS research is that these models are more flexible and applicable to measuring various economic variables beyond monetary policy stance. Whereas dictionary-based methods require manually curated word lists that may miss evolving language patterns, CB-LMs learn contextual relationships that generalize across different analytical applications.

The statistical robustness of the evaluation methodology deserves emphasis. By repeating the 80/20 train-test split 30 times, the researchers accounted for variance introduced by random seeds, weight initialization, and data ordering — all factors that Dodge and colleagues identified in 2020 as significant sources of evaluation noise. This approach yields reliable confidence intervals rather than potentially misleading point estimates, providing central bank practitioners with a trustworthy basis for model selection decisions.

Cross-validation against market data adds another layer of credibility. The news classification dataset was cross-checked against the futures-implied probability of rate decisions derived from Bloomberg data, using a market reaction indicator ranging from -376% to +368% of a standard 25 basis point move priced in. This empirical grounding connects model predictions to observable market behavior, bridging the gap between academic NLP research and practical policy analysis.

Complex Tasks: Where Generative LLMs Excel

While CB-LMs dominate on sentence-level classification, the competitive landscape shifts dramatically when tasks become more complex. The BIS researchers evaluated all models on a second benchmark: classifying 237 US monetary policy news items from the BIS internal daily newsletter published between January 2015 and June 2023. These items averaged five sentences and approximately 67 words each — substantially longer and more contextually nuanced than individual FOMC sentences.

On this complex task, Llama-3 70B Instruct achieved 81% accuracy and ChatGPT-4 Turbo reached 80%, both without any fine-tuning. The best CB-LM, RoBERTa trained on combined speech and paper data, managed only approximately 65%. This dramatic reversal highlights a fundamental tradeoff in AI model selection for central banking: encoder-only models excel when training data is abundant and text units are short, while generative models with extended context windows hold decisive advantages on longer, more complex texts with limited labelled training examples.

The dataset characteristics explain this divergence. With only 237 items split across three categories (89 hawkish, 83 neutral, 65 dovish), the training set was substantially smaller than the FOMC benchmark. CB-LMs, which rely on supervised fine-tuning, struggled to learn robust classification boundaries from such limited examples. Large generative models, by contrast, can leverage their vast pre-training knowledge and extended context windows to interpret longer passages without requiring extensive task-specific training data. This represents a fundamental architectural advantage when labelled data is scarce — a common reality in many central banking analytical contexts.

The implications for operational deployment are significant. Central banks working with well-established datasets and standardized analytical tasks can achieve excellent results with smaller, more controllable CB-LMs. For novel analytical challenges, emerging topics, or tasks where labelling effort has been minimal, larger generative models may provide superior initial performance. The optimal strategy likely involves maintaining both types of models and deploying them based on task characteristics — a recommendation the BIS paper explicitly endorses.

Make BIS research accessible — turn dense working papers into engaging interactive experiences.

Get Started →

Risks of AI Adoption in Central Banking

BIS Working Paper 1215 provides a thorough assessment of the operational risks associated with AI adoption in central banking, distinguishing between challenges specific to proprietary models, open-source alternatives, and the domain-adapted CB-LMs themselves. This risk framework is essential reading for any central bank technology committee evaluating AI deployment strategies.

Proprietary models like ChatGPT present five principal concerns. First, confidentiality and privacy: sending sensitive monetary policy deliberations, draft communications, or market intelligence to external servers operated by commercial entities risks data leakage or misuse and may violate information management policies or data provider agreements. Second, cost: input and output charges accumulate rapidly for large-scale analytical operations, and fine-tuning is particularly data-intensive and expensive. Third, transparency: users lack access to model architecture and parameters, making it difficult to understand why the model reached a particular classification. Fourth, replicability: generative models produce variable results for identical requests, which is deeply problematic for policy analysis that demands consistent, auditable outputs. Fifth, dependency: provider changes to model versions, pricing, or terms of service can force costly process adaptations with no recourse.

Open-source models like Llama-3 address some of these concerns but introduce others. Only the largest variants (70 billion parameters and above) consistently perform well, requiring substantial GPU infrastructure that most central banks do not currently maintain. Public cloud computing reintroduces the same confidentiality concerns as proprietary models. Maintaining open-source deployments demands specialized in-house expertise for updates, customization, and security compliance — capabilities that may be scarce in central bank IT departments historically focused on traditional enterprise systems. Additionally, quantization techniques that reduce computational requirements (4-bit and 8-bit precision) may introduce subtle performance trade-offs that are difficult to characterize in advance.

CB-LMs offer a more controlled risk profile. Their smaller size (110-125 million parameters) means they can run on modest hardware within central bank infrastructure, maintaining data sovereignty. Their deterministic encoder-only architecture produces consistent outputs for identical inputs, supporting auditability requirements. However, they require labelled training data for each new task, and their performance degrades on complex, long-form analytical challenges where generative models hold advantages. The BIS researchers acknowledge these trade-offs transparently, providing central banks with the information needed to make informed deployment decisions aligned with their specific risk tolerances and operational requirements.

Policy Recommendations for Central Bank AI

The BIS paper concludes with five actionable policy recommendations that deserve attention from central bank technology strategists, policymakers, and the broader financial innovation and regulatory community. These recommendations reflect the nuanced understanding of model capabilities and limitations demonstrated throughout the research.

First, strategic model selection should be driven by task characteristics rather than model prestige or marketing. Smaller encoder-only models are preferable when training data is ample and task complexity is moderate. Generative LLMs should be reserved for complex contexts with limited training data and longer text inputs. This principle of task-appropriate model selection challenges the tendency to default to the largest available model and encourages more efficient resource allocation.

Second, domain adaptation demonstrably improves performance and should be pursued systematically. The consistent superiority of CB-LMs over their general-purpose foundations validates the investment in curating domain-specific training corpora and running computationally intensive retraining processes. Central banks with access to large internal text archives — speeches, minutes, research papers, and communications — should view these as strategic assets for AI development.

Third, operational considerations must guide deployment decisions. Confidentiality requirements, privacy regulations, transparency standards, replicability needs, and cost constraints all factor into optimal model selection. The paper provides a clear framework for evaluating these dimensions that central bank technology committees can adapt to their specific institutional contexts and regulatory environments.

Fourth, central banks should actively participate in communities of practice for AI development. The BIS Innovation Hub and related collaborative forums provide platforms for sharing development experiences, trained models, best practices, and AI tools. Given the rapid pace of technological evolution, no single institution can maintain comprehensive expertise across all relevant developments. Collaborative knowledge-sharing distributes the burden of staying current and accelerates collective capability development.

Fifth, CB-LMs should be understood as complementary to, not replacements for, human expertise. These models provide quantitative measures that augment the qualitative judgment of central bank officials in evaluating communication sentiment, policy positioning, and market signals. The human-AI partnership model — where models handle volume and consistency while humans provide contextual judgment and strategic interpretation — represents the most promising path forward for central bank NLP deployment.

Future of NLP in Financial Market Infrastructure

The publication of BIS Working Paper 1215 marks an inflection point in how financial market infrastructure institutions approach artificial intelligence. Central bank language models demonstrate that domain-specific AI development is not merely an academic exercise but a practical pathway to more effective monetary policy analysis, communication monitoring, and economic intelligence processing. As the technology continues to evolve, several trends are likely to shape the future landscape.

Multi-modal analysis represents the next frontier. The BIS research notes that both visual and audio channels of central bank communication affect financial markets, not just text. Future CB-LMs may incorporate speech prosody analysis, facial expression recognition from press conference videos, and cross-modal fusion techniques that integrate text, audio, and visual signals into unified sentiment assessments. Early work by Curti and Kazinnik applying convolutional neural networks to FOMC videos and by Gorodnichenko and colleagues analyzing FOMC audio suggests these multi-modal approaches are already gaining traction.

The democratization of central banking NLP represents another significant development. By publishing their methodology and findings, BIS researchers enable central banks of all sizes and resource levels to benefit from domain-adapted AI tools. Smaller central banks that lack the capacity to develop proprietary models can leverage shared CB-LMs, while larger institutions can build upon the published foundation to create even more specialized variants. This collaborative model aligns with the BIS’s broader mission of supporting financial stability through knowledge sharing and institutional cooperation.

Real-time policy monitoring capabilities are within reach. Current CB-LMs operate in batch mode on historical data, but the underlying technology supports near-real-time text classification that could enable continuous monitoring of central bank communications, financial news flows, and market commentary. Combined with automated alert systems, such capabilities could provide policymakers with early warnings of shifts in market expectations or emerging narrative themes that require attention.

The scaling laws documented in the BIS research — particularly the relationship between training corpus size and model performance — suggest that continued investment in central banking text archives will yield compounding returns as models improve. Central banks that begin systematically curating and digitizing their communication archives today will be best positioned to leverage next-generation language models when they emerge. The convergence of domain expertise, curated data assets, and advancing AI capabilities creates a virtuous cycle that promises to transform monetary policy analysis in the years ahead.

For financial professionals and researchers seeking to engage with this transformative research in depth, Libertify’s interactive library offers an innovative approach to consuming dense technical content. Rather than scrolling through static PDFs, readers can experience the complete BIS working paper as an interactive, navigable resource that highlights key findings and enables focused exploration of the methodology, results, and policy implications that matter most to their work.

Stop losing insights in 33-page PDFs. Transform research papers into experiences people actually finish.

Start Now →

Frequently Asked Questions

What are central bank language models (CB-LMs)?

Central bank language models (CB-LMs) are specialized encoder-only AI models retrained on central banking corpora including over 37,000 research papers and 18,000 speeches. Developed by BIS researchers, they outperform general-purpose models like BERT and RoBERTa on domain-specific NLP tasks such as monetary policy stance classification.

How do CB-LMs compare to ChatGPT for monetary policy analysis?

Without fine-tuning, CB-LMs outperform ChatGPT and other generative LLMs on sentence-level FOMC classification. CB-LMs achieve approximately 84% accuracy versus ChatGPT-4 Turbo’s 71% in zero-shot mode. However, fine-tuned ChatGPT-3.5 Turbo reaches 88% accuracy, exceeding CB-LMs on the same task.

Why do domain-adapted models outperform general AI in central banking?

Central banking language contains specialized idioms, terminology, and contextual nuances poorly represented in general training corpora like Wikipedia. Domain adaptation through masked language modelling on central bank texts teaches the model bidirectional understanding of this specialized vocabulary, yielding higher accuracy on classification tasks.

What risks do central banks face when using proprietary AI models?

Central banks face confidentiality risks from sending sensitive monetary policy data to external servers, cost concerns from fine-tuning charges, transparency issues since proprietary model architectures are hidden, replicability problems due to variable outputs, and dependency risks when providers change model versions or terms of service.

What is the BIS recommendation for AI model selection in central banking?

The BIS recommends strategic model selection based on task complexity and data availability. Smaller encoder-only models like CB-LMs are preferable when training data is ample and tasks are straightforward. Larger generative LLMs excel in complex contexts with limited training data and longer texts. Central banks should also share models and best practices through collaborative communities.

How large is the training corpus used for CB-LMs?

CB-LMs were trained on a corpus of 37,037 central bank research papers totalling 2.7 GB and 18,345 central bank speeches totalling 0.34 GB. This data was sourced via the BIS Central Bank Research Hub. Three dataset configurations — speech-only, paper-only, and combined — were crossed with the BERT and RoBERTa architectures, yielding six model variants.

Your documents deserve to be read.

PDFs get ignored. Presentations get skipped. Reports gather dust.

Libertify transforms them into interactive experiences people actually engage with.

No credit card required · 30-second setup

Our SaaS platform, AI Ready Media, transforms complex documents and information into engaging video storytelling to broaden reach and deepen engagement. We spotlight overlooked and unread important documents. All interactions seamlessly integrate with your CRM software.