Central Bank Language Models: How BIS Research Is Reshaping Monetary Policy Analysis With AI

📌 Key Takeaways

  • 90% Domain Accuracy: Central bank language models correctly predicted masked central banking terms 90% of the time, versus just 53-60% for general-purpose models like BERT and RoBERTa.
  • Smaller Models Compete: Domain-adapted CB-LMs matched or outperformed ChatGPT and Llama-3 on FOMC sentiment classification tasks with adequate training data, achieving 84% accuracy.
  • Massive Training Corpus: The BIS trained CB-LMs on 37,037 research papers and 18,345 speeches — a 3 GB specialised corpus from the Central Bank Research Hub.
  • Confidentiality Advantage: Unlike proprietary AI services, CB-LMs run locally without sending sensitive monetary policy data to external servers.
  • Strategic Model Selection: The BIS recommends choosing between smaller domain-adapted models and larger generative LLMs based on task complexity, data availability, and institutional constraints.

Why Central Bank Language Models Matter Now

Central bank language models represent a paradigm shift in how monetary institutions process, analyse, and interpret their vast repositories of policy communications. The Bank for International Settlements (BIS) Working Paper No. 1215 introduces CB-LMs — specialised encoder-only language models retrained on central banking corpora — and demonstrates their remarkable superiority over general-purpose AI for domain-specific tasks.

The implications are profound. As central banks worldwide generate thousands of speeches, policy documents, and research papers annually, the ability to automatically classify monetary policy stance, detect hawkish or dovish signals, and analyse communication patterns becomes an essential institutional capability. Traditional bag-of-words approaches and manual analysis simply cannot keep pace with the volume and complexity of modern central bank communications.

This research arrives at a critical moment when financial institutions are grappling with fundamental questions about AI adoption: should they rely on commercial models like ChatGPT, deploy open-source alternatives, or invest in domain-specific solutions? The BIS paper provides empirical evidence that helps answer these questions, revealing that smaller, specialised models can match or surpass commercial giants on targeted tasks while offering substantial advantages in confidentiality, transparency, and cost-efficiency.

Inside CB-LMs: Architecture and Training Data

The architecture of central bank language models follows a two-phase development approach that mirrors established best practices in natural language processing. The first phase — domain adaptation — involves unsupervised learning on an extensive central banking text corpus. The second phase — fine-tuning — applies supervised learning on task-oriented datasets to refine model parameters for specific classification objectives.

The training corpus is substantial and carefully curated. The BIS assembled 37,037 research papers totalling 2.7 gigabytes and 18,345 speeches comprising 0.34 gigabytes, all sourced from the BIS Central Bank Research Hub. This repository, featured on RePEc, provides a comprehensive cross-section of central banking discourse spanning decades of policy deliberation, academic research, and institutional communication.

Six distinct CB-LMs were produced by combining two foundation models — BERT (developed by Google) and RoBERTa (developed by Meta) — with three dataset configurations: papers only, speeches only, and a combined papers-plus-speeches corpus. RoBERTa-based CB-LMs consistently outperformed their BERT-based counterparts, a finding attributable to RoBERTa’s 15 million additional parameters and its more robust pre-training approach that removes next-sentence prediction and uses dynamic masking.

The masked language modelling (MLM) technique underpins the domain adaptation phase. During training, tokens within the central banking corpus are randomly masked, and the model learns to predict them from surrounding context. This bidirectional understanding — analysing both left and right context simultaneously — is what gives encoder-only models their edge in classification tasks over decoder-only models that process text sequentially.
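The masking step itself is simple to sketch. The pure-Python function below mirrors BERT’s published recipe (select 15% of tokens; of those, replace 80% with `[MASK]`, 10% with a random token, and leave 10% unchanged); the tiny vocabulary and whitespace tokenisation are illustrative stand-ins, not the paper’s actual pipeline.

```python
import random

MASK = "[MASK]"
TOY_VOCAB = ["rate", "policy", "bank", "inflation"]  # illustrative only

def mlm_mask(tokens, mask_prob=0.15, seed=0):
    """Apply BERT-style MLM corruption. Of the selected tokens:
    80% become [MASK], 10% a random vocabulary token, 10% stay as-is.
    Returns (corrupted tokens, {position: original token}) so a model
    can be trained to recover the originals from context."""
    rng = random.Random(seed)
    out, targets = list(tokens), {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            targets[i] = tok
            r = rng.random()
            if r < 0.8:
                out[i] = MASK
            elif r < 0.9:
                out[i] = rng.choice(TOY_VOCAB)
            # else: token is kept unchanged but still predicted
    return out, targets

corrupted, targets = mlm_mask(
    "the committee decided to raise the policy rate".split()
)
```

During domain adaptation the model sees `corrupted` and is trained to predict every entry of `targets` from both left and right context, which is what instils the bidirectional understanding described above.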

Central Bank Language Models vs General-Purpose AI

The performance gap between central bank language models and their general-purpose foundations is striking. In the masked word test — where models must predict obscured terms within 100 central banking idioms — CB-LMs achieved 90% accuracy. The foundation RoBERTa model managed just 60%, while foundation BERT scored only 53%. This 30-37 percentage point improvement demonstrates the transformative power of domain adaptation.

Consider expressions like “accommodative ___ policy” or “Basel ___ on Banking Supervision.” General-purpose models, trained on broad internet text, frequently misidentify these domain-specific completions. Central bank language models, having absorbed millions of pages of monetary policy discourse, correctly identify “monetary” and “Committee” with near-perfect reliability. This specialised understanding extends beyond simple vocabulary recognition to encompass nuanced contextual relationships unique to central banking.
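The scoring behind the masked word test reduces to top-1 accuracy over gold completions. The sketch below uses four idioms and a toy dictionary predictor as illustrative stand-ins (the actual benchmark used 100 expressions and a real fill-mask model); the near-miss on “zero lower ___” is the kind of error a general-purpose model makes.

```python
# Illustrative idioms with their gold completions; not the paper's list.
IDIOMS = [
    ("accommodative ___ policy", "monetary"),
    ("Basel ___ on Banking Supervision", "Committee"),
    ("zero lower ___", "bound"),
    ("lender of last ___", "resort"),
]

def masked_word_accuracy(predict, idioms):
    """Top-1 accuracy: fraction of idioms whose blank the model fills
    with the gold completion."""
    return sum(predict(p) == gold for p, gold in idioms) / len(idioms)

# Toy predictor standing in for a fill-mask model's top-ranked token.
toy_predictions = {
    "accommodative ___ policy": "monetary",
    "Basel ___ on Banking Supervision": "Committee",
    "zero lower ___": "limit",      # plausible general-model miss
    "lender of last ___": "resort",
}
accuracy = masked_word_accuracy(toy_predictions.get, IDIOMS)  # 3 of 4
```

Swapping `toy_predictions.get` for a domain-adapted fill-mask model is what lifts this score from the 53-60% range to 90% in the BIS results.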

The combined speech-and-paper training dataset produced the highest-performing models at 90% accuracy, followed closely by papers-only models at 89%. Speech-only models achieved 84-86%, suggesting that the more formal, technical language of research papers provides stronger training signal for domain adaptation than the comparatively varied style of public speeches.

These findings align with precedents in other specialised fields. BioBERT demonstrated similar gains in biomedical text processing, while FinBERT showed comparable improvements for financial sentiment analysis. The central banking domain, with its highly specialised terminology and consequential implications, represents an ideal candidate for domain adaptation.


FOMC Sentiment Analysis: A Real-World Benchmark

The Federal Open Market Committee (FOMC) sentiment classification task serves as the primary real-world benchmark for evaluating central bank language models. Using a dataset of 1,243 sentences from FOMC statements spanning 1997 to 2010, manually labelled by domain experts as hawkish, dovish, or neutral, the BIS researchers tested whether domain adaptation translates into meaningful improvements on practical monetary policy tasks.

The results confirmed the value of specialisation. The best-performing CB-LM — RoBERTa trained on the combined paper-and-speech corpus — achieved approximately 84% mean accuracy across 30 randomised train-test splits. Foundation RoBERTa reached roughly 81%, and foundation BERT approximately 80%. Interestingly, BERT-based CB-LMs did not improve over their foundation model, hovering around 78-79% — a finding that underscores the importance of choosing the right architectural foundation for domain adaptation.

FOMC sentiment analysis matters because monetary policy signals directly influence global financial markets. A single word change in an FOMC statement — from “patient” to “vigilant,” for example — can trigger billions of dollars in market movements. Automated, accurate classification of policy stance enables faster institutional response, more consistent analysis across time periods, and the ability to process volumes of communication that would overwhelm human analysts.

The 80/20 train-test split methodology, repeated with 30 different random dataset partitions, ensures statistical robustness. This approach accounts for the sensitivity of classification results to specific data configurations, a common concern when working with relatively small labelled datasets in specialised domains. The consistency of RoBERTa-based CB-LM superiority across these variations strengthens confidence in the domain adaptation approach.
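The evaluation protocol is straightforward to reproduce in outline. The sketch below implements the repeated randomised 80/20 partitioning in plain Python; the majority-class baseline stands in for an actual fine-tuned CB-LM, which is far too heavy to inline here.

```python
import random
from collections import Counter

def repeated_split_eval(data, fit_score, n_splits=30, test_frac=0.2, seed=0):
    """Mean test accuracy over n_splits randomised train/test partitions,
    mirroring the paper's 30x 80/20 protocol. `data` is a list of
    (sentence, label) pairs; `fit_score(train, test)` returns accuracy."""
    rng = random.Random(seed)
    accuracies = []
    for _ in range(n_splits):
        shuffled = data[:]
        rng.shuffle(shuffled)
        cut = int(len(shuffled) * (1 - test_frac))
        accuracies.append(fit_score(shuffled[:cut], shuffled[cut:]))
    return sum(accuracies) / len(accuracies)

def majority_baseline(train, test):
    # Toy stand-in for fine-tuning a CB-LM: predict the most common
    # training label (hawkish/dovish/neutral) for every test sentence.
    top_label = Counter(y for _, y in train).most_common(1)[0][0]
    return sum(y == top_label for _, y in test) / len(test)
```

Reporting the mean (and spread) across the 30 splits, rather than a single partition, is what guards against the split-sensitivity the paper flags for small labelled datasets.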

Central Bank Language Models and Generative LLMs Compared

Perhaps the most consequential finding from the BIS research is the detailed comparison between central bank language models and state-of-the-art generative large language models. The study evaluated ChatGPT-3.5 Turbo, ChatGPT-4 Turbo, Llama-3 (8B and 70B variants), Mistral 7B, and Mixtral 8x7B across multiple configurations including zero-shot, fine-tuned, and in-context learning approaches.

On the FOMC sentiment task, ChatGPT-3.5 Turbo without fine-tuning achieved just 56% accuracy — far below the CB-LM’s 84%. However, when fine-tuned using OpenAI’s proprietary pipeline, ChatGPT-3.5 jumped to 88%, the highest score recorded. Llama-3 70B with DPO fine-tuning and retrieval-based in-context learning reached 85%, while ChatGPT-4 Turbo with retrieval-based learning achieved 81%.
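To make the zero-shot and in-context configurations concrete, here is a minimal sketch of the kind of classification prompt sent to a generative LLM in such comparisons; the wording is illustrative, not the paper’s exact prompt template.

```python
def stance_prompt(sentence, few_shot=()):
    """Build a zero-shot (empty few_shot) or in-context (non-empty
    few_shot) prompt asking a generative LLM to classify FOMC stance.
    few_shot is an iterable of (example_sentence, label) pairs."""
    lines = [
        "Classify the monetary policy stance of the sentence as "
        "hawkish, dovish, or neutral. Answer with one word."
    ]
    for ex_sentence, ex_label in few_shot:
        lines.append(f"Sentence: {ex_sentence}\nStance: {ex_label}")
    lines.append(f"Sentence: {sentence}\nStance:")
    return "\n\n".join(lines)

prompt = stance_prompt(
    "The Committee will act as appropriate to sustain the expansion.",
    few_shot=[("Rates were left unchanged.", "neutral")],
)
```

In the retrieval-based in-context variants described above, the `few_shot` examples are the labelled training sentences most similar to the query, which is precisely where random example selection went wrong for Llama-3 70B.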

These results reveal a nuanced picture. For organisations with access to quality fine-tuning infrastructure and sufficient labelled data, generative LLMs can match or slightly exceed CB-LMs on standard classification tasks. But the advantages of central bank language models become clear when considering practical deployment constraints: CB-LMs require no external API calls, protect data confidentiality, produce deterministic outputs, and cost a fraction of generative model inference.

The study also documented significant failure modes. Mistral 7B with supervised fine-tuning saw performance degrade from 50% to 38%, barely above the 33% random-chance baseline for a three-class problem. Llama-3 70B with random in-context learning dropped from 71% to 38%. These results demonstrate that sophisticated AI techniques can actively harm performance when applied incorrectly, making the relative simplicity and reliability of domain-adapted encoder models particularly attractive for risk-sensitive institutions.

Domain Adaptation: Why Specialisation Wins

The success of central bank language models illuminates a broader principle in artificial intelligence: domain adaptation consistently delivers disproportionate returns relative to its computational cost. Rather than training a model from scratch — which requires billions of parameters and enormous compute budgets — domain adaptation fine-tunes existing model weights on specialised text, achieving superior in-domain performance at a fraction of the cost.

The mechanism is straightforward. Foundation models like BERT and RoBERTa develop general linguistic understanding during pre-training on broad web corpora. Domain adaptation then refocuses this understanding on the specific vocabulary, syntactic patterns, and semantic relationships that characterise central banking discourse. The result is a model that retains general language competence while developing deep expertise in its target domain.

The BIS findings quantify this advantage precisely. On central banking idiom prediction, domain adaptation improved accuracy by 30 percentage points for RoBERTa (60% to 90%) and 37 points for BERT (53% to 90%). On FOMC classification, the improvement was more modest but still meaningful: approximately 3 percentage points for RoBERTa-based models. The larger gain on idiom prediction reflects the direct impact of vocabulary specialisation, while the classification gain reflects subtler improvements in contextual understanding.

This pattern holds across multiple specialised domains. Research on Federal Reserve communications has consistently shown that domain-specific approaches outperform general methods for monetary policy text analysis. The convergence of evidence from biomedicine, finance, law, and now central banking suggests that domain adaptation should be a default strategy for any institution seeking to apply NLP to specialised text collections.


Deploying Central Bank Language Models: Risks and Considerations

The BIS paper devotes significant attention to the practical challenges of deploying central bank language models and AI systems within institutional environments. These considerations extend well beyond model accuracy to encompass confidentiality, transparency, replicability, cost, and organisational capacity — factors that often determine whether a technically superior solution can actually be deployed.

Confidentiality emerges as the paramount concern. Central banks routinely process market-sensitive information, unreleased policy decisions, and proprietary economic data. Sending such material to external API providers — even encrypted in transit — creates data sovereignty risks that many institutions find unacceptable. CB-LMs, running entirely on local infrastructure, eliminate this exposure. No monetary policy text ever leaves the institution’s control perimeter.

Transparency and replicability present additional challenges for proprietary models. Commercial AI providers frequently update their models without notice, potentially altering outputs for identical inputs. The probabilistic nature of generative models means that the same query can produce different results across runs. For central banks, where analytical consistency and audit trails are regulatory requirements, this non-determinism is a significant liability. CB-LMs produce consistent, reproducible outputs that can be validated and audited systematically.

Cost considerations favour domain-adapted models for high-volume applications. While individual API calls to commercial models are inexpensive, central banks processing thousands of documents daily face substantial recurring costs for inference, fine-tuning data uploads, and custom model access. CB-LMs, once trained, run on institutional hardware with marginal cost approaching zero per inference — an important consideration for organisations with tight and publicly scrutinised budgets.
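A back-of-envelope calculation makes the recurring-cost point concrete. All figures below (document volume, token counts, per-token price) are illustrative assumptions, not quoted rates from any provider.

```python
def annual_api_cost(docs_per_day, tokens_per_doc, usd_per_1k_tokens):
    """Rough annual inference spend for a hosted API: documents per day,
    average tokens per document, and price per 1,000 tokens are all
    hypothetical inputs chosen for illustration."""
    yearly_tokens = docs_per_day * 365 * tokens_per_doc
    return yearly_tokens / 1000 * usd_per_1k_tokens

# e.g. 5,000 documents/day at 2,000 tokens each, $0.01 per 1k tokens
cost = annual_api_cost(5000, 2000, 0.01)  # about 36,500 USD per year
```

Against that recurring figure, a CB-LM’s one-off adaptation cost plus near-zero marginal inference on institutional hardware is easy to compare line-by-line in a budget submission.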

Open-Source vs Proprietary Models for Central Banking

The choice between open-source and proprietary central bank language models involves complex trade-offs that the BIS paper examines in detail. Open-source models like Llama-3 and Mistral offer full architectural transparency and the ability to run on institutional infrastructure, but they demand substantial computational resources and in-house expertise.

The BIS research found that only the largest open-source models — specifically Llama-3 70B — performed consistently well across varied tasks. The 8-billion-parameter variant and Mistral 7B showed highly variable results, with some configurations producing accuracy below random chance. This scaling dynamic means that institutions choosing open-source paths must invest in GPU infrastructure capable of running models with tens of billions of parameters, a non-trivial hardware commitment.

Internal cloud infrastructure requirements compound the challenge. Building stable, high-capacity computing environments for large language model inference requires specialised engineering talent, robust security configurations, and ongoing maintenance capacity. Many central banks, particularly in smaller economies, lack the IT infrastructure and staffing to support such deployments. The alternative — using public cloud services — reintroduces the confidentiality concerns that motivated moving away from proprietary APIs in the first place.

The BIS recommends a pragmatic approach: institutions should select models based on task complexity and data availability rather than defaulting to the largest available option. For well-defined classification tasks with adequate training data, smaller domain-adapted models offer the best balance of performance, cost, and operational simplicity. For more complex analytical challenges involving longer texts and limited labelled examples, larger generative models justify their additional resource requirements.

Future of Central Bank Language Models in Policy Analysis

The trajectory of central bank language models points toward increasingly sophisticated applications across monetary policy analysis, financial stability assessment, and institutional knowledge management. The BIS explicitly calls for a “community of practice” among central banks — a collaborative framework for sharing model development experiences, training datasets, and deployment best practices.

Several technical frontiers beckon. Multi-task learning could enable a single CB-LM to simultaneously classify policy stance, extract key economic indicators, and generate summary briefings. Cross-lingual models could extend analysis to central bank communications in languages beyond English, opening vast repositories of policy text from the European Central Bank, Bank of Japan, and dozens of other institutions that publish in multiple languages.

The integration of central bank language models with real-time data feeds presents perhaps the most transformative near-term opportunity. By connecting CB-LMs to live news streams, market data, and social media sentiment, central banks could build early warning systems that detect emerging monetary policy narratives before they crystallise into market movements. The BIS research on classifying US monetary policy news items — where CB-LMs achieved 65% accuracy on a substantially harder task involving longer texts and fewer training examples — suggests this direction is technically feasible even with current model capabilities.

Non-verbal communication analysis represents another frontier. The BIS paper references research showing that financial markets respond to non-verbal cues from Federal Reserve chairs during press conferences. Future CB-LMs could integrate text analysis with audio and video processing to provide holistic assessments of communication impact — though such multimodal approaches remain technically challenging and computationally expensive.

Key Lessons for Financial Institutions Adopting AI

The BIS Working Paper 1215 offers critical lessons that extend well beyond central banking to any financial institution evaluating AI adoption strategies. The overarching message is clear: domain adaptation delivers outsized returns, and institutional constraints must weigh as heavily as benchmark accuracy in technology selection decisions.

First, organisations should resist the temptation to equate model size with model quality. The CB-LM research demonstrates that a carefully adapted model with millions of parameters can match commercial models orders of magnitude larger on targeted tasks. The key variable is not computational scale but alignment between training data and deployment context. A model trained on central banking text understands central banking language — no amount of general training on internet text can fully substitute for this specialisation.

Second, fine-tuning is not a guaranteed improvement. The BIS documented cases where sophisticated fine-tuning techniques actively degraded model performance. Mistral 7B’s accuracy dropped 12 percentage points after supervised fine-tuning, and Llama-3 70B lost 33 percentage points with random in-context learning. These failures highlight the importance of rigorous evaluation protocols and the danger of assuming that more complex approaches automatically yield better results.

Third, the deployment environment matters as much as the model itself. An AI system that achieves state-of-the-art accuracy but requires sending confidential data to external servers, produces non-reproducible outputs, or incurs unpredictable costs may be unsuitable for regulated financial institutions. The BIS framework of evaluating confidentiality, transparency, replicability, and cost-efficiency provides a practical template for technology assessment that prioritises institutional requirements alongside technical performance.

Finally, collaboration accelerates progress. The fast-evolving nature of AI technology makes it difficult for any single institution — even one as well-resourced as the BIS — to maintain cutting-edge capabilities independently. The call for a community of practice reflects a recognition that shared investment in foundational tools, datasets, and methodologies benefits the entire central banking ecosystem. Financial institutions considering AI adoption would be wise to seek similar collaborative frameworks within their own sectors.


Frequently Asked Questions

What are central bank language models (CB-LMs)?

Central bank language models (CB-LMs) are specialised AI models retrained on central banking corpora including over 37,000 research papers and 18,000 speeches. Built on BERT and RoBERTa foundations, they achieve 90% accuracy on domain-specific tasks compared to 53-60% for general-purpose models.

How do CB-LMs compare to ChatGPT for monetary policy analysis?

CB-LMs match or outperform ChatGPT on simpler classification tasks like FOMC sentiment analysis, achieving 84% accuracy versus ChatGPT-3.5’s 56% without fine-tuning. However, fine-tuned ChatGPT-3.5 reached 88%, and larger generative models such as ChatGPT-4 Turbo (81%) and Llama-3 70B (85%) remain competitive and excel on more complex, longer-text scenarios.

What training data do central bank language models use?

CB-LMs are trained on a corpus of 37,037 research papers (2.7 GB) and 18,345 speeches (0.34 GB) sourced from the BIS Central Bank Research Hub. This domain-specific data enables the models to understand specialised monetary policy terminology and context.

Why are domain-adapted models better than general AI for central banking?

Domain-adapted models understand specialised terminology like “accommodative monetary policy” or “Basel Committee on Banking Supervision” that general models frequently misinterpret. They also offer advantages in confidentiality, transparency, replicability, and cost-efficiency for central bank operations.

What are the risks of central banks using proprietary AI models?

Risks include data confidentiality concerns when sending sensitive monetary policy information to external servers, lack of transparency in proprietary model architectures, replicability issues due to probabilistic outputs, and significant costs for fine-tuning and API usage.

What is FOMC sentiment analysis and why does it matter?

FOMC sentiment analysis classifies Federal Open Market Committee communications as hawkish, dovish, or neutral. It matters because monetary policy signals directly influence financial markets, and automated classification enables faster, more consistent analysis of policy direction.
