Chain-of-Retrieval: How Step-by-Step AI Reasoning Achieves 72.5% Accuracy on Complex Multi-Hop Questions
Key Takeaways
- State-of-the-art performance: Achieves 72.5% EM on 2WikiMultihopQA, outperforming 32B models with just 8B parameters
- KILT benchmark leader: Sets new state-of-the-art on 8 of 9 knowledge-intensive tasks
- o1-like reasoning for RAG: First approach to bring step-by-step reasoning paradigms to retrieval systems
- Dramatic retrieval improvements: Increases retrieval recall by up to 28 points on challenging datasets
- Predictable scaling: Log-linear performance-compute tradeoffs enable efficient resource allocation
- Cross-architecture compatibility: Works across different model architectures and retriever types
The Multi-Hop Reasoning Challenge
Imagine asking an AI system: “Where did the star of Dark Hazard study?” This seemingly simple question requires multiple steps of reasoning: first identifying who starred in Dark Hazard, then finding information about that person’s education. Traditional retrieval-augmented generation (RAG) systems struggle with such multi-hop reasoning tasks because they perform only a single retrieval step before attempting to generate an answer.
A groundbreaking new approach called Chain-of-Retrieval Augmented Generation (CoRAG) changes this paradigm entirely. Published in a recent research paper, CoRAG introduces “o1-like” reasoning to information retrieval, training language models to retrieve and reason step by step rather than relying on single-shot retrieval.
The results are remarkable: using just an 8B parameter model, CoRAG achieves 72.5% exact match accuracy on the challenging 2WikiMultihopQA dataset — significantly outperforming Search-o1-32B, a 32B parameter model with access to Bing Search API that manages only 58.0% accuracy.
CoRAG’s Core Innovation
The fundamental insight behind CoRAG addresses a critical limitation in current RAG systems. As the researchers explain: “In multi-hop reasoning tasks, it is often unclear what information should be retrieved initially; decisions must be made based on the progressively evolving state of the reasoning process.”
Step-by-Step Information Retrieval
Consider how CoRAG handles the Dark Hazard example:
- Query Decomposition: “What was the name of the star of Dark Hazard?” → Edward G. Robinson
- Initial Retrieval Attempt: “Where did Edward G. Robinson go to college?” → No relevant information found
- Dynamic Reformulation: “What college did Edward G. Robinson attend?” → City College of New York
- Final Answer: City College of New York
This demonstrates three critical capabilities that distinguish CoRAG from conventional RAG systems: query decomposition, failure detection, and dynamic query reformulation.
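To make the retrieve-and-reason loop concrete, here is a minimal Python sketch of greedy chain-of-retrieval decoding. The `llm` and `retriever` objects and their methods (`generate`, `search`, `should_stop`) are hypothetical stand-ins for a trained CoRAG model and a text retriever, not interfaces from the paper.

```python
# Minimal sketch of CoRAG-style greedy decoding. `llm` and `retriever`
# are hypothetical stand-ins; neither API comes from the paper.

def corag_greedy(llm, retriever, question: str, max_hops: int = 6) -> str:
    chain = []  # accumulated (sub-query, sub-answer) pairs
    for _ in range(max_hops):
        # Ask the model for the next sub-query given the chain so far.
        sub_query = llm.generate(task="sub_query", question=question, chain=chain)
        docs = retriever.search(sub_query, top_k=5)
        # Extract a sub-answer from the retrieved documents; the model may
        # answer "No relevant information found", which prompts a
        # reformulated sub-query on the next iteration.
        sub_answer = llm.generate(task="sub_answer", sub_query=sub_query, docs=docs)
        chain.append((sub_query, sub_answer))
        if llm.should_stop(question, chain):  # hypothetical stopping check
            break
    # Synthesize the final answer from the whole retrieval chain.
    return llm.generate(task="final_answer", question=question, chain=chain)
```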
Technical Implementation
CoRAG’s technical innovation centers on training language models to perform iterative retrieval explicitly, rather than relying on in-context learning or few-shot prompting as previous approaches do.
Retrieval Chain Generation
The core challenge is that most RAG datasets provide only a query Q and a final answer A — no intermediate retrieval steps exist. CoRAG solves this through rejection sampling (a minimal sketch follows this list):
- For each training instance, up to 16 retrieval chains are sampled
- Each chain consists of sub-queries Q₁, Q₂, …, Q_L and corresponding sub-answers A₁, A₂, …, A_L
- Chain quality is assessed via log-likelihood of the correct answer conditioned on the chain
- The chain with the highest log-likelihood score is selected for training
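A hedged sketch of this selection step, assuming hypothetical `sample_chain` and `answer_log_likelihood` helpers that wrap the LLM used for data generation:

```python
import math

def select_best_chain(question, answer, sample_chain, answer_log_likelihood,
                      num_samples: int = 16):
    # Rejection sampling: roll out up to 16 candidate chains and keep the
    # one under which the known-correct answer is most likely.
    best_chain, best_score = None, -math.inf
    for _ in range(num_samples):
        chain = sample_chain(question)  # one rollout of sub-queries/sub-answers
        score = answer_log_likelihood(answer, question, chain)  # log P(A | Q, chain)
        if score > best_score:
            best_chain, best_score = chain, score
    return best_chain  # kept as the training target for this instance
```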
Multi-Task Training Framework
The model learns three tasks simultaneously, using standard next-token prediction:
- Sub-query prediction: Learning to generate appropriate follow-up questions
- Sub-answer prediction: Extracting information from retrieved documents
- Final answer prediction: Synthesizing information across the entire retrieval chain
This multi-task approach enables the model to develop sophisticated reasoning capabilities while remaining trainable with a single, standard next-token prediction objective.
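The sketch below shows how one selected chain could be flattened into these three kinds of next-token-prediction examples. The prompt templates are invented for illustration and are not the paper’s actual formats.

```python
def build_examples(question, chain, final_answer):
    """chain: list of (sub_query, sub_answer, retrieved_docs) triples."""
    examples, history = [], ""
    for sub_query, sub_answer, docs in chain:
        # Task 1: sub-query prediction from the question and history so far.
        examples.append({"prompt": f"{question}\n{history}Next query:",
                         "target": sub_query})
        # Task 2: sub-answer prediction from the sub-query and its documents.
        examples.append({"prompt": f"{sub_query}\nDocs: {docs}\nAnswer:",
                         "target": sub_answer})
        history += f"Q: {sub_query} A: {sub_answer}\n"
    # Task 3: final answer prediction from the full retrieval chain.
    examples.append({"prompt": f"{question}\n{history}Final answer:",
                     "target": final_answer})
    return examples
```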
Training Methodology
CoRAG’s training process is notably efficient, requiring modest computational resources while achieving state-of-the-art results.
Training Specifications
- Base model: Llama-3.1-8B-Instruct
- Training approach: Full-parameter fine-tuning for 1 epoch
- Hardware requirements: 8 A100 GPUs
- Training time: Under 6 hours for multi-hop QA, ~30 hours for KILT benchmark
- Maximum sequence length: 3,072 tokens
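A minimal fine-tuning configuration along these lines, using Hugging Face `transformers`: the epoch count and sequence length come from the article, while the learning rate and batch size are placeholders rather than reported values.

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="corag-8b",
    num_train_epochs=1,              # full-parameter fine-tuning, 1 epoch
    per_device_train_batch_size=4,   # assumption: tune to fit 8x A100
    learning_rate=1e-5,              # assumption: typical for 8B full fine-tuning
    bf16=True,                       # mixed precision on A100s
    logging_steps=50,
)
# The 3,072-token limit is applied when tokenizing the training data,
# e.g. tokenizer(..., truncation=True, max_length=3072).
```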
Key Training Parameters
The researchers carefully tuned several critical parameters (collected into a config sketch after this list):
- Sampling temperature: 0.7 for sub-query generation, 0 for sub-answer generation
- Maximum chain length: Randomly selected from [1, 5]
- Retrieval setup: E5-large retriever with top-5 documents per sub-query
- Knowledge base: 36-million passage Wikipedia corpus
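Collected into one illustrative config (the names are assumptions, not taken from the paper’s codebase):

```python
import random

CHAIN_GEN_CONFIG = {
    "sub_query_temperature": 0.7,   # diverse candidate sub-queries
    "sub_answer_temperature": 0.0,  # deterministic answer extraction
    "retriever": "e5-large",
    "top_k": 5,                     # documents retrieved per sub-query
    "corpus": "wikipedia-36M",      # 36-million passage knowledge base
}

def sample_max_chain_length() -> int:
    # Maximum chain length drawn uniformly from 1..5 per training instance.
    return random.randint(1, 5)
```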
Performance Benchmarks
CoRAG’s performance across multi-hop QA datasets demonstrates significant improvements over existing methods, even when compared to much larger models with access to commercial search engines.
Multi-Hop QA Results
Using CoRAG-8B with L=10 and best-of-8 sampling:
- 2WikiMultihopQA: 72.5% EM (vs. 58.0% for Search-o1-32B) — +14.5 point improvement
- HotpotQA: 56.3% EM (vs. 46.9% for DRAG) — +9.4 point improvement
- MuSiQue: 30.9% EM (vs. 26.1% for ITER-RETGEN) — +4.8 point improvement
- Bamboogle: 54.4% EM (vs. 56.0% for Search-o1-32B) — a 1.6-point shortfall, attributed to the dataset’s small size and recency requirements
Remarkably, even with much cheaper greedy decoding at L=6, CoRAG achieves 70.6% EM on 2WikiMultihopQA — still outperforming the 32B Search-o1 model by 12.6 percentage points.
Efficiency Without Sacrificing Performance
When compared to a fine-tuned baseline without chain-of-retrieval augmentation using the same training data and retriever:
- 2WikiMultihopQA: 55.1% → 70.6% (+15.5 EM improvement)
- MuSiQue: 17.4% → 27.7% (+10.3 EM improvement)
These results demonstrate that the improvement comes specifically from the chain-of-retrieval mechanism, not from better training data or retrieval systems.
Test-Time Scaling Strategies
One of CoRAG’s most innovative aspects is its approach to test-time scaling, offering three distinct strategies that provide predictable performance-compute tradeoffs.
Scaling Strategy Options
Greedy Decoding: The simplest approach, generating sub-queries and sub-answers sequentially. This method provides good performance with minimal computational overhead.
Best-of-N Sampling: Multiple retrieval chains are sampled at temperature 0.7, and the best chain is selected using a penalty score based on the conditional log-likelihood of “No relevant information found.” Lower penalty scores indicate better information retrieval (a selection sketch follows below).
Tree Search: A sophisticated breadth-first search variant with rollouts. At each step, the search state expands by sampling several sub-queries, performing multiple rollouts per expansion, and retaining the state with the lowest average penalty score.
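A hedged sketch of the best-of-N variant, assuming hypothetical `sample_chain` and `log_likelihood` helpers; each chain element is a (sub-query, sub-answer, documents) triple:

```python
def best_of_n(question, sample_chain, log_likelihood, n: int = 8):
    def penalty(chain):
        # Sum over hops of log P("No relevant information found" | sub-query, docs).
        # A higher value means the chain repeatedly failed to find evidence.
        return sum(log_likelihood("No relevant information found", sub_query, docs)
                   for sub_query, _, docs in chain)

    chains = [sample_chain(question, temperature=0.7) for _ in range(n)]
    return min(chains, key=penalty)  # lower penalty = better-grounded chain
```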
Performance-Compute Relationship
The researchers discovered that performance follows an approximately log-linear trajectory with token consumption, following the pattern y = a × log(x + b) + c. This relationship holds for up to 128k tokens and provides several practical insights (a curve-fitting sketch follows this list):
- Predictable scaling: Organizations can estimate performance gains for given compute budgets
- Diminishing returns: Increasing chain length yields substantial gains when L is small, with diminishing returns as L increases
- Dataset-dependent optimization: Different datasets benefit from different scaling strategies
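As a worked example, the scaling law can be fit to measured (token budget, accuracy) pairs with `scipy`; the data points below are made-up placeholders, not numbers from the paper.

```python
import numpy as np
from scipy.optimize import curve_fit

def scaling_law(x, a, b, c):
    # The reported performance-compute pattern: y = a * log(x + b) + c.
    return a * np.log(x + b) + c

tokens = np.array([1e3, 4e3, 16e3, 64e3, 128e3])  # placeholder token budgets
em = np.array([55.0, 62.0, 67.0, 70.5, 72.0])     # placeholder EM scores
(a, b, c), _ = curve_fit(scaling_law, tokens, em, p0=[5.0, 1.0, 30.0])

# Estimate EM at an unseen compute budget of 32k tokens:
print(scaling_law(32e3, a, b, c))
```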
KILT Benchmark Dominance
Perhaps the most impressive demonstration of CoRAG’s capabilities comes from its performance on the KILT (Knowledge Intensive Language Tasks) benchmark, where it achieves state-of-the-art results on 8 of 9 tasks.
KILT Hidden Test Set Results
CoRAG-8B sets new records across diverse knowledge-intensive tasks:
- AIDA (Entity Linking): 93.9% (vs. previous best 90.6%) — +3.3 point improvement
- WnWi: 88.2% (vs. 87.4%) — +0.8 point improvement
- WnCw: 76.7% (vs. 71.2%) — +5.5 point improvement
- T-REx (Slot Filling): 88.0% (vs. 87.7%) — +0.3 point improvement
- zsRE: 87.2% (vs. 85.3%) — +1.9 point improvement
- Natural Questions (Open QA): 63.1% (vs. 62.3%) — +0.8 point improvement
- HotpotQA: 60.6% (vs. 50.6%) — +10.0 point improvement
- TriviaQA: 88.3% (vs. 84.6%) — +3.7 point improvement
The only task where CoRAG doesn’t achieve the top score is FEVER (Fact Verification), where it achieves 93.1% compared to Atlas-11B’s 93.5% — a marginal difference considering Atlas uses an 11B parameter model.
Retrieval Quality Improvements
CoRAG’s benefits extend beyond final answer accuracy to fundamental improvements in retrieval recall — the ability to find relevant information in the first place.
Retrieval Recall Improvements
Comparing CoRAG with standard E5-large retrieval (a recall@k helper is sketched below):
- 2WikiMultihopQA: R@10 improves from 54.9% to 81.4% (+26.5 points)
- Bamboogle: R@10 improves from 31.2% to 59.2% (+28.0 points)
- MuSiQue: R@10 improves from 29.0% to 47.1% (+18.1 points)
- HotpotQA: R@100 improves from 76.8% to 84.3% (+7.5 points)
The researchers note: “The improvements are particularly pronounced on more challenging datasets like MuSiQue and Bamboogle, where single-step retrieval struggles most.”
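For reference, a minimal recall@k helper of the kind behind these numbers; the usage example is illustrative:

```python
def recall_at_k(retrieved_ids, gold_ids, k: int) -> float:
    # Fraction of questions whose top-k retrieved passages contain
    # at least one gold passage.
    hits = sum(1 for ret, gold in zip(retrieved_ids, gold_ids)
               if set(ret[:k]) & set(gold))
    return hits / len(gold_ids)

# Two questions: one hit ("d7" retrieved) and one miss -> R@10 of 0.5.
print(recall_at_k([["d1", "d7"], ["d2"]], [["d7"], ["d9"]], k=10))
```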
Cross-Retriever Compatibility
CoRAG demonstrates strong plug-and-play compatibility across different retrieval systems:
- E5-base (weaker dense retriever): 2WikiMultihopQA improves from 53.1% to 70.8% EM
- BM25 (sparse retriever): 2WikiMultihopQA improves from 49.1% to 62.6% EM
This compatibility suggests that “improvements to text retriever quality represent an orthogonal dimension that can further amplify CoRAG’s performance gains.”
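In practice, this plug-and-play property only requires the retriever to expose a single search method. A minimal sketch of that interface, with the `Protocol` name and signature as assumptions:

```python
from typing import Protocol

class Retriever(Protocol):
    def search(self, query: str, top_k: int = 5) -> list[str]:
        """Return the top_k most relevant passages for `query`."""
        ...

# Any dense (E5) or sparse (BM25) backend that provides `search` can be
# swapped in; the decoding loop never depends on how passages are scored.
```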
Practical Applications
CoRAG’s architecture and performance characteristics make it particularly valuable for several categories of real-world applications.
Knowledge-Intensive Applications
Research and Fact-Checking Systems: CoRAG’s ability to trace reasoning steps and detect retrieval failures makes it ideal for applications requiring transparent, verifiable information gathering.
Educational Q&A Platforms: The step-by-step reasoning approach aligns well with educational goals, helping users understand how complex questions are broken down and answered.
Legal Document Analysis: Legal research often requires connecting information across multiple documents and precedents — exactly the type of multi-hop reasoning where CoRAG excels.
Scientific Literature Review: Researchers can leverage CoRAG’s iterative retrieval capabilities to systematically explore connections between different studies and findings.
Weak-to-Strong Generalization
CoRAG also demonstrates promising characteristics for cost-effective deployment:
- Reduced data generation costs: Using Llama-3.2-3B for chain generation with Llama-3.1-8B for training achieves nearly identical performance (69.9 vs. 70.6 EM)
- GPT-4o distillation benefits: Higher-quality retrieval chains from stronger models can improve performance to 75.1 EM on 2WikiMultihopQA
- Cross-architecture compatibility: Validation on Qwen3-4B and Qwen3-8B shows consistent improvements of over 10 EM points across all datasets
Comparison with Existing Methods
CoRAG’s approach represents a significant departure from existing RAG methodologies, addressing fundamental limitations in current approaches.
Conventional RAG Limitations
Single-Step Retrieval: Traditional RAG performs one retrieve-then-generate cycle, which cannot handle queries requiring multiple pieces of information.
Few-Shot Dependence: Methods like FLARE, IRCoT, and Auto-RAG rely on few-shot prompting rather than explicit training, limiting their ability to learn complex retrieval strategies.
Model Size Requirements: Systems like DRAG/IterDRAG and Search-o1 require much larger models (32B+ parameters) or commercial search engines to achieve competitive performance.
CoRAG’s Distinguishing Advantages
As the researchers emphasize: “Rather than solely relying on the model’s in-context learning capability or distillation from proprietary models, we advocate for explicitly training language models to retrieve step by step.”
This explicit training approach enables:
- Efficient parameter utilization: Achieving superior performance with smaller models
- Predictable scaling behavior: Clear performance-compute tradeoffs
- Robust cross-domain performance: Consistent improvements across diverse knowledge-intensive tasks
Limitations and Future Directions
Despite its impressive performance, CoRAG acknowledges several important limitations that point toward future research directions.
Current Limitations
Short-Form Focus: CoRAG has been primarily tested on short, easy-to-verify answers. Real-world applications often require long-form generation with more complex evaluation metrics.
Evaluation Scope: The scaling analysis makes simplifying assumptions (equal weighting of prompt and generated tokens, ignoring retrieval costs) that warrant more rigorous treatment for production deployments.
Hallucination Risk: As the authors note, “The inherent risk of hallucination persists and warrants careful monitoring in practical deployments.”
Future Research Directions
Dynamic Compute Allocation: The observation that single-hop tasks show minimal gains from increased chain length suggests opportunities for adaptive compute allocation based on query complexity.
Advanced Search Strategies: Tree search during data generation using weaker LLMs could leverage the weak-to-strong generalization properties while reducing costs.
Reinforcement Learning Integration: The concurrent work on Search-R1 demonstrates the potential for RL-based approaches to retrieval tool use, which could complement CoRAG’s explicit training methodology.
Long-Form Generation: Extending CoRAG to support long-form content generation while maintaining factual accuracy represents a crucial next step for practical applications.
Implementation Considerations
For organizations considering CoRAG adoption, several practical factors merit attention:
Resource Requirements
- Training infrastructure: 8 A100 GPUs for efficient training
- Inference scaling: Configurable compute-performance tradeoffs
- Storage needs: Large-scale document corpora for retrieval
Integration Strategies
- Retriever compatibility: Works with existing dense and sparse retrievers
- Architecture flexibility: Proven across multiple model architectures
- Scaling adaptability: Three distinct strategies for different use cases
The combination of strong performance, reasonable resource requirements, and flexible deployment options positions CoRAG as a practical advancement for knowledge-intensive AI applications requiring sophisticated reasoning capabilities.
Frequently Asked Questions
What is Chain-of-Retrieval Augmented Generation (CoRAG)?
CoRAG is an AI approach that trains language models to retrieve and reason over relevant information step by step, rather than performing a single retrieval before generating an answer. It introduces o1-like reasoning paradigms to retrieval-augmented generation, enabling models to decompose complex queries, detect retrieval failures, and dynamically reformulate search queries.
How does CoRAG compare to standard RAG methods?
CoRAG significantly outperforms standard RAG by achieving 72.5% EM on 2WikiMultihopQA compared to 58.0% for Search-o1-32B (a 32B parameter model with Bing Search access). It achieves state-of-the-art results on 8 of 9 KILT benchmark tasks using only 8B parameters, demonstrating superior efficiency and effectiveness.
What makes CoRAG’s test-time scaling unique?
CoRAG offers three scaling strategies: greedy decoding (fastest), best-of-N sampling (balances performance and cost), and tree search (highest quality). Performance follows a log-linear relationship with token consumption, providing predictable performance-compute tradeoffs up to 128k tokens.
How does CoRAG handle multi-hop reasoning tasks?
CoRAG excels at multi-hop reasoning by breaking complex questions into sub-queries, retrieving information for each step, and adapting based on what it finds. For example, when asked “Where did the star of Dark Hazard study?”, it first identifies the star (Edward G. Robinson), then searches for his educational background, reformulating queries until it finds the answer (City College of New York).
What are the practical applications of CoRAG?
CoRAG is particularly valuable for knowledge-intensive applications requiring complex reasoning: research and fact-checking systems, educational Q&A platforms, legal document analysis, scientific literature review, and any application where questions require combining information from multiple sources. It’s also compatible with different retrievers and shows strong cross-architecture generalizability.