How to Add Fill-in-the-Middle Capability to Large Language Models Without Sacrificing Performance

📌 Key Takeaways

  • FIM-for-Free Property: Training with up to 90% FIM-transformed data preserves all autoregressive capabilities at zero measurable cost across all benchmarks and model scales.
  • SPM Format Superiority: Suffix-prefix-middle ordering outperforms prefix-suffix-middle, enabling better key-value caching and delivering consistently higher benchmark scores.
  • Context-Level Implementation: Applying FIM transformation after document chunking prevents fragmentation and significantly improves infilling quality over document-level approaches.
  • Character-Level Robustness: Character-based span selection handles arbitrary cursor positions gracefully, critical for production coding assistants and document editing.
  • Finetuning Inefficiency: Retrofitting FIM onto existing models costs up to 50% of original pretraining compute—build it in from the start instead.

Why Infilling Matters for Production AI Systems

Enterprise AI deployment faces a critical gap between theoretical model capabilities and real-world workflow demands. Traditional autoregressive language models excel at left-to-right text generation, but enterprise applications require bidirectional completion—generating missing content between existing text fragments.

Consider code completion systems where developers position cursors mid-function to add documentation. Or document editing workflows where teams need to insert explanations between existing paragraphs. Traditional language models force awkward workarounds—concatenating prefix and suffix as prompts, hoping the model generates appropriate middle content.

Fill-in-the-Middle (FIM) capability addresses this directly by teaching models to understand and generate within bidirectional context. Unlike specialized infilling models that sacrifice general capabilities, the breakthrough FIM-for-free approach enables both autoregressive and infilling competence in a single model. This research demonstrates that enterprise teams can add FIM capability without compromising existing performance, fundamentally changing the cost-benefit analysis for production AI systems.

The implications extend beyond convenience. Recent studies show FIM-capable models reduce inference complexity by eliminating the need for multiple specialized models. A single FIM model handles import statement generation, function completion, and inline documentation—replacing model zoos with unified serving infrastructure.

The FIM-for-Free Property — Adding a Capability at Zero Cost

The most counterintuitive finding in transformer optimization research is that adding infilling capability costs nothing. Across 8 model scales from 50M to 6.9B parameters, training with up to 90% FIM-transformed data preserves all autoregressive benchmarks within measurement error. This challenges fundamental assumptions about capability trade-offs in model training.

Traditional machine learning intuition suggests that dividing training attention between two tasks—left-to-right generation and infilling—should degrade performance on at least one. The FIM-for-free property demolishes this assumption through systematic evaluation across HumanEval, HellaSwag, LAMBADA, StoryCloze, PIQA, Winograd, WinoGrande, DROP, and QuAC benchmarks.

The mechanism behind this “free” capability lies in the structural relationship between autoregressive and infilling tasks. During FIM training, models learn to attend to both prefix and suffix context when generating middle content. This bidirectional attention pattern enhances understanding of document structure and context dependencies, potentially improving autoregressive performance through better context modeling.

For enterprise training decisions, this finding represents a paradigm shift. AI teams selecting model architectures no longer face capability trade-offs. FIM training becomes standard practice, not an optimization decision. The research validates this across domains—code and natural language—suggesting broad applicability beyond specialized use cases.

How the FIM Data Transformation Works

FIM training transforms standard documents through systematic rearrangement of text segments. The process begins with original document segmentation into three components: prefix (beginning), middle (content to be generated), and suffix (ending). These segments are then rearranged with sentinel tokens marking boundaries.

The transformation algorithm selects span boundaries through configurable strategies. Document-level FIM applies transformation before text chunking, while context-level FIM transforms after chunking—a distinction that significantly impacts training effectiveness. The span selection method determines which text becomes the “middle” to be generated: line-level selection targets complete lines, token-level selects arbitrary token sequences, and character-level enables subtoken boundaries.

Sentinel token design varies between implementations. The standard approach uses three special tokens, for example `<PRE>` marking the prefix, `<SUF>` marking the suffix, and `<MID>` marking the middle content. Advanced implementations use numeric sentinels or learned embeddings. The key requirement is unambiguous segment identification during both training and inference.

Two primary formatting variants emerge: PSM (prefix-suffix-middle) and SPM (suffix-prefix-middle). PSM follows the intuitive ordering: show the prefix, show the suffix, generate the middle. SPM swaps the prefix and suffix: show the suffix, show the prefix, generate the middle. While PSM seems more natural, SPM delivers higher benchmark performance and additional computational advantages during inference.
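
As a concrete illustration, the sketch below applies a character-level FIM transform to a single training example and emits either PSM or SPM ordering. The `<PRE>`, `<SUF>`, and `<MID>` sentinel strings and the uniform span-selection policy are illustrative assumptions for this sketch, not the exact tokens or sampling scheme of any particular implementation; the SPM branch follows the common variant that keeps the PSM sentinel order while moving the suffix ahead of the prefix.

```python
import random

# Illustrative sentinel strings; real pipelines reserve dedicated special
# tokens in the tokenizer vocabulary for these markers (assumption here).
PRE, SUF, MID = "<PRE>", "<SUF>", "<MID>"

def fim_transform(text: str, mode: str = "spm") -> str:
    """Split `text` at two random character positions and reorder the pieces."""
    # Character-level span selection: the cut points may fall anywhere,
    # including inside a word or a token.
    lo, hi = sorted(random.randrange(len(text) + 1) for _ in range(2))
    prefix, middle, suffix = text[:lo], text[lo:hi], text[hi:]

    if mode == "psm":   # prefix-suffix-middle
        return f"{PRE}{prefix}{SUF}{suffix}{MID}{middle}"
    if mode == "spm":   # suffix first, so its cache survives later prefix edits
        return f"{PRE}{SUF}{suffix}{MID}{prefix}{middle}"
    raise ValueError(f"unknown mode: {mode}")

print(fim_transform("def add(a, b):\n    return a + b\n", mode="spm"))
```

At inference time the same layout becomes the prompt: the model receives the sentinels plus the known prefix and suffix, and generation of the middle continues from the final sentinel.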

SPM vs PSM — Why Token Ordering Matters for Inference Efficiency

The choice between SPM and PSM formatting carries profound implications for production deployment efficiency. While both approaches achieve similar training loss, SPM consistently outperforms PSM in sampling-based benchmarks and enables significant inference optimizations through key-value caching strategies.

During real-time code completion, users continuously modify the prefix text while the suffix remains static. In PSM format, any prefix change invalidates the cached keys and values for everything that follows the edit, including the entire suffix, forcing recomputation. SPM format processes the suffix first, so prefix changes leave the suffix's cache intact. This architectural advantage translates to measurable latency reductions in production systems.
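
A minimal way to see the caching difference is to measure how much of the serialized prompt stays identical when only the prefix changes between two requests. The sketch below uses shared leading characters as a stand-in for reusable key-value cache entries; the prompt formats mirror the hypothetical sentinels used earlier, and a real server would count cached tokens rather than characters.

```python
def psm_prompt(prefix: str, suffix: str) -> str:
    return f"<PRE>{prefix}<SUF>{suffix}<MID>"

def spm_prompt(prefix: str, suffix: str) -> str:
    return f"<PRE><SUF>{suffix}<MID>{prefix}"

def shared_prefix_len(a: str, b: str) -> int:
    """Length of the common leading run of two prompts."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

suffix = "\n    return total\n"                       # static suffix below the cursor
before = "def total_price(items):\n    total = 0\n"   # prefix before the keystroke
after = before + "    for item in items:\n"           # prefix after the keystroke

for name, fmt in [("PSM", psm_prompt), ("SPM", spm_prompt)]:
    reuse = shared_prefix_len(fmt(before, suffix), fmt(after, suffix))
    print(f"{name}: {reuse} leading characters unchanged")
```

Under SPM the entire suffix sits inside the reusable leading region, so a long static suffix never needs to be re-encoded while the user types; under PSM everything that follows the edited prefix, including the suffix, must be recomputed.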

Benchmark evidence confirms SPM superiority across all three infilling evaluation tasks: single-line completion, multi-line completion, and random span infilling. At the 6.9B parameter scale, SPM achieves 0.751 single-line pass rate versus PSM’s approximately 0.72 (interpolated from joint training results). The performance gap widens with model scale, suggesting SPM benefits compound with increased capacity.

Joint SPM+PSM training reveals positive transfer effects. Models trained with 50% SPM and 50% PSM achieve roughly equivalent performance to pure SPM training at 90% FIM rate. This suggests that exposure to both formatting variants enhances the model’s understanding of bidirectional dependencies, though SPM remains the preferred inference format for production deployment.

For enterprise inference infrastructure, SPM format enables sophisticated caching strategies. Suffix content can be pre-processed and cached across sessions, reducing computational load when users modify prefix content repeatedly—a common pattern in coding assistants and document editing applications.

Choosing the Right FIM Rate for Your Training Pipeline

The FIM rate—percentage of training examples subjected to infilling transformation—represents the primary hyperparameter for enterprise FIM deployment. Research across 8 model scales provides definitive guidance: 50-90% FIM rates optimize both autoregressive preservation and infilling capability development.

Lower FIM rates (15%) underutilize infilling potential. While preserving autoregressive performance, these conservative rates produce models with limited infilling competence—defeating the purpose of FIM training. The research demonstrates that aggressive FIM rates up to 90% maintain autoregressive capabilities while maximizing infilling performance.

The 90% threshold appears universal across model scales and domains. At 100% FIM rate, autoregressive performance begins degrading as models lose exposure to standard left-to-right patterns. This suggests an optimal training mixture: 90% FIM-transformed data with 10% standard autoregressive examples to maintain left-to-right competence.
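
In a training pipeline the FIM rate is simply the probability with which each example is transformed. A minimal sketch of the 90/10 mixture, reusing the hypothetical `fim_transform` helper sketched earlier:

```python
import random

FIM_RATE = 0.9  # 90% of examples FIM-transformed, 10% kept as plain left-to-right text

def prepare_example(text: str) -> str:
    """Apply the FIM transform with probability FIM_RATE; otherwise keep the raw text."""
    if random.random() < FIM_RATE:
        return fim_transform(text, mode="spm")  # helper from the earlier sketch
    return text
```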

Domain-specific considerations may warrant rate adjustment. Code completion applications might benefit from higher FIM rates due to the structured nature of programming languages. Natural language applications might prefer lower rates to maintain fluency in standard generation tasks. However, the research validates 90% as a robust default across both domains.

Training schedule design enables dynamic FIM rate adjustment. Some implementations start with lower FIM rates and increase gradually, allowing models to learn basic autoregressive patterns before introducing infilling complexity. Others maintain constant FIM rates throughout training. Evidence suggests constant 90% FIM rates achieve optimal results without schedule complexity.

Context-Level FIM — A Simple Implementation Change With Outsized Impact

The distinction between document-level and context-level FIM implementation appears subtle but produces dramatic quality differences. Document-level FIM applies transformation before text chunking for model context windows. Context-level FIM applies transformation after chunking, ensuring prefix, middle, and suffix segments coexist within the same context window.

Document-level implementation creates fragmented FIM examples when long documents exceed context window limits. The prefix might appear in one chunk, the suffix in another, with the middle spanning multiple chunks. This fragmentation confuses models during training—they cannot learn proper prefix-suffix relationships when segments appear in isolation across training steps.

Context-level FIM guarantees coherent training examples. By applying transformation after chunking, each training step presents complete prefix-middle-suffix relationships within the model’s attention span. This simple implementation change produces consistent improvements across all model scales and benchmarks, despite minimal differences in perplexity metrics.

The implementation requires minimal code changes. Instead of transforming entire documents and then chunking, the training pipeline chunks documents first, then applies FIM transformation to selected chunks. This preserves document structure while ensuring training coherence. Most enterprise training frameworks can accommodate this change without architectural modifications.
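
The difference between the two implementation levels is only the order of two pipeline stages. Here is a minimal sketch that chunks by characters for simplicity (real pipelines chunk tokenized streams and reserve room for the sentinel tokens) and reuses the hypothetical `prepare_example` helper from the earlier sketch:

```python
CONTEXT_CHARS = 2048  # stand-in for the model's context window

def document_level_fim(doc: str) -> list[str]:
    """Transform first, then chunk: prefix, middle, and suffix can land in different chunks."""
    transformed = prepare_example(doc)
    return [transformed[i:i + CONTEXT_CHARS] for i in range(0, len(transformed), CONTEXT_CHARS)]

def context_level_fim(doc: str) -> list[str]:
    """Chunk first, then transform: each chunk keeps its prefix, middle, and suffix together."""
    chunks = [doc[i:i + CONTEXT_CHARS] for i in range(0, len(doc), CONTEXT_CHARS)]
    return [prepare_example(chunk) for chunk in chunks]
```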

Performance improvements from context-level FIM appear in sampling benchmarks rather than perplexity metrics—a pattern that recurs throughout FIM research. This highlights the importance of evaluation methodology in assessing model improvements and suggests that traditional language modeling metrics may inadequately capture practical model capabilities.

Character-Level Span Selection for Production Robustness

Enterprise production systems demand robustness to arbitrary user inputs—including cursor placement mid-word or mid-token. Traditional token-level or line-level span selection creates brittleness when users position cursors at unexpected locations. Character-level span selection addresses this fundamental deployment challenge.

Token-level span selection aligns with model tokenization but fails when users place cursors within tokens. Line-level selection handles structured text well but performs poorly on arbitrary spans. Character-level selection introduces natural subtoken boundaries during training, teaching models to handle cursor placement anywhere within text.
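
To make the granularities concrete, the sketch below contrasts character-level and line-level boundary selection on raw text; token-level selection works the same way with boundaries taken from a tokenizer's offsets instead. The uniform sampling is an illustrative assumption.

```python
import random

def char_level_span(text: str) -> tuple[int, int]:
    """Boundaries may fall anywhere, even mid-word or mid-token."""
    return tuple(sorted(random.randrange(len(text) + 1) for _ in range(2)))

def line_level_span(text: str) -> tuple[int, int]:
    """Boundaries snap to line starts, so the middle is always whole lines."""
    starts = [0] + [i + 1 for i, ch in enumerate(text) if ch == "\n"] + [len(text)]
    return tuple(sorted(random.choice(starts) for _ in range(2)))
```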

Benchmark results validate character-level superiority for random span infilling. At the Medium model scale with PSM formatting, character-level selection achieves 0.321 pass rate on random spans versus 0.102 for token-level and 0.015 for line-level. This dramatic improvement comes with minimal cost on structured tasks—character-level selection remains competitive on single-line and multi-line completion.

The robustness benefit extends beyond numerical performance to user experience quality. Models trained with character-level spans handle typos, partial words, and mid-token edits gracefully. This eliminates common failure modes in production coding assistants where models generate malformed completions when users edit within tokens.

Implementation complexity for character-level selection varies by framework. Some tokenizers require modification to support character-aligned boundaries. Others handle character-level spans naturally through byte-pair encoding alignment. The engineering investment typically pays dividends through reduced production debugging and improved user satisfaction in real deployment scenarios.

Why Finetuning Existing Models for FIM Is Surprisingly Expensive

The economics of adding FIM capability to existing models challenges conventional finetuning assumptions. While general capability addition through finetuning often requires modest computational investment, FIM retrofitting demands extraordinary resources—up to 50% of original pretraining compute at aggressive learning rates.

This inefficiency stems from the fundamental nature of FIM capability. Unlike task-specific skills that can be learned through limited exposure, infilling requires deep changes to the model's learned attention patterns and context modeling. Existing models must unlearn ingrained left-to-right biases while developing bidirectional understanding, which is an expensive retraining process.

The research evaluated 16 finetuning configurations across learning rates, FIM rates, and token budgets. Only the most aggressive combination—90% FIM rate, 1.0× learning rate, and 50B tokens (half the pretraining budget)—achieved parity with baseline pretrained FIM performance. Conservative finetuning approaches failed to develop meaningful infilling competence.

This finding has direct implications for enterprise model development strategies. Teams should integrate FIM capability during initial pretraining rather than attempting post-hoc addition. The computational cost differential makes building FIM from scratch more economical than retrofitting existing models, especially when considering the operational complexity of managing multiple model versions.

The “ossification hypothesis” explains this inefficiency. Models develop deep architectural biases during pretraining that resist modification through finetuning. These biases appear beneficial for primary tasks but create resistance to capability addition. Understanding this pattern helps enterprise teams make informed build-versus-buy decisions for model development.

Perplexity Lies — Why You Need Sampling-Based Evaluation

Traditional language modeling evaluation relies heavily on perplexity metrics—the average negative log probability of held-out sequences. While perplexity provides useful signal for model development, FIM research reveals a critical limitation: perplexity differences often fail to predict practical performance differences in infilling tasks.

Across multiple ablations in the research, negligible perplexity differences (often <0.001 nats/token) correspond to large gaps in sampling-based benchmarks. Context-level versus document-level FIM shows minimal perplexity separation but consistent sampling advantages. SPM versus PSM formatting exhibits similar patterns—equivalent perplexity with divergent practical performance.

This evaluation gap has profound implications for enterprise model selection and optimization. Teams relying primarily on perplexity metrics may miss significant practical improvements or select suboptimal configurations. Comprehensive evaluation frameworks must incorporate task-specific sampling benchmarks alongside traditional metrics.

The research establishes three sampling-based evaluation tasks: single-line infilling (completing individual code lines), multi-line infilling (generating code blocks), and random span infilling (arbitrary text completion). These benchmarks use unit-test verification for objective scoring, eliminating subjective evaluation challenges common in generation tasks.
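
A sampling-based infilling metric reduces to generating candidate middles, splicing them back between the prefix and suffix, and executing the result against the problem's tests. The sketch below keeps the model call and the sandboxed test runner as caller-supplied functions, since both are assumptions outside the scope of this example:

```python
from typing import Callable

def infill_pass_rate(
    problems: list[dict],                         # each: {"prefix": str, "suffix": str, "tests": ...}
    generate_middle: Callable[[str, str], str],   # model call: (prefix, suffix) -> middle (assumed)
    passes_tests: Callable[[str, object], bool],  # sandboxed execution of a spliced program (assumed)
    samples_per_problem: int = 1,
) -> float:
    """Fraction of problems where at least one sampled middle passes its unit tests."""
    solved = 0
    for p in problems:
        for _ in range(samples_per_problem):
            # Splice the candidate middle back into the surrounding code and run the tests.
            program = p["prefix"] + generate_middle(p["prefix"], p["suffix"]) + p["suffix"]
            if passes_tests(program, p["tests"]):
                solved += 1
                break
    return solved / len(problems)
```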

For scaling laws research, this finding suggests caution in extrapolating from perplexity trends. Models that appear equivalent under perplexity scaling may exhibit different practical capabilities. Future model development should weight sampling-based metrics more heavily when optimizing for production deployment rather than research benchmarks.

Inference Strategies for Reliable Infilling in Production

Production FIM deployment requires sophisticated inference strategies to handle common failure modes. The most frequent issue involves models failing to generate the end-of-text (EOT) token, creating runaway generation or content that doesn’t properly connect to the suffix. EOT-aware best-of-n sampling addresses this fundamental challenge.

The EOT-aware strategy generates multiple completion candidates and preferentially selects those that produce the EOT token, indicating natural completion. Candidates are then reranked by likelihood to select the highest-quality completion among valid options. This approach directly addresses the primary failure mode while maintaining generation quality.
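
A minimal sketch of EOT-aware best-of-n selection, assuming a hypothetical `sample` callable that returns the generated middle, whether the model stopped by emitting the EOT token, and a mean log-probability for reranking:

```python
from typing import Callable, NamedTuple

class Candidate(NamedTuple):
    text: str             # generated middle
    ended_with_eot: bool  # did the model stop by emitting EOT?
    logprob: float        # mean token log-probability, used for reranking

def eot_aware_best_of_n(sample: Callable[[], Candidate], n: int = 8) -> Candidate:
    """Prefer candidates that terminated naturally with EOT, then rerank by likelihood."""
    candidates = [sample() for _ in range(n)]
    finished = [c for c in candidates if c.ended_with_eot]
    pool = finished if finished else candidates  # fall back if no candidate emitted EOT
    return max(pool, key=lambda c: c.logprob)
```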

Prompt engineering significantly impacts infilling reliability. Structured formatting with numbered items, clear sections, and explicit boundaries constrains model output and improves EOT generation rates. The research demonstrates dramatic improvement when prompts include organizational structure rather than unstructured text blocks.

Token budget management requires careful consideration in production systems. FIM models must balance prefix and suffix context with generation budget. Strategies include dynamic prefix truncation, suffix prioritization for context retention, and adaptive generation limits based on content type. These optimizations prevent context overflow while preserving essential bidirectional information.

Failure mode detection enables graceful degradation in production systems. When FIM generation fails or produces low-quality output, systems can fallback to autoregressive completion or flag content for human review. This hybrid approach maintains system reliability while leveraging FIM capabilities when effective.

Recommended Configuration for Enterprise FIM Training

Based on comprehensive research across scales, domains, and configurations, enterprise teams should adopt the following FIM training specifications for optimal results. These recommendations balance implementation simplicity with performance optimization, providing a robust starting point for production deployments.

FIM Rate: 90% of training examples should undergo FIM transformation, with 10% remaining as standard autoregressive examples. This aggressive rate maximizes infilling capability while preserving left-to-right generation competence. Lower rates underutilize FIM potential; higher rates risk autoregressive degradation.

Formatting: Use SPM (suffix-prefix-middle) format exclusively, or combine with 50% PSM for joint training if computational budget allows. SPM provides superior caching efficiency and consistently higher benchmark performance. Joint training offers minimal additional benefit at increased complexity.

Implementation Level: Apply FIM transformation at the context level, after document chunking rather than before. This ensures coherent prefix-middle-suffix relationships within model attention spans and provides consistent quality improvements over document-level implementation.

Span Selection: Use character-level span selection for maximum production robustness. While token-level and line-level approaches work for structured content, character-level selection handles arbitrary cursor positions and partial token edits gracefully—essential for real-world deployment scenarios.
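
Expressed as a configuration object, the four recommendations above might look like the sketch below; the field names are illustrative and not tied to any particular training framework.

```python
from dataclasses import dataclass

@dataclass
class FIMTrainingConfig:
    fim_rate: float = 0.9               # 90% of examples transformed, 10% plain autoregressive
    fim_format: str = "spm"             # suffix-prefix-middle ordering
    fim_level: str = "context"          # apply the transform after chunking, not before
    span_selection: str = "character"   # character-level boundaries for robustness

config = FIMTrainingConfig()
```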

Training pipeline integration should minimize implementation complexity while maximizing effectiveness. Most enterprise frameworks can accommodate these settings through configuration changes rather than architectural modifications. The resulting models handle both autoregressive and infilling tasks effectively, simplifying inference infrastructure and reducing operational complexity.

What FIM-for-Free Tells Us About Multi-Capability Model Training

The FIM-for-free phenomenon extends beyond infilling to fundamental questions about capability acquisition in large language models. The finding that models can acquire complex bidirectional understanding without sacrificing existing capabilities challenges traditional assumptions about training trade-offs and suggests new research directions.

This principle may generalize to other capability pairs. If autoregressive and infilling competence can coexist without conflict, other seemingly orthogonal capabilities might also combine beneficially. Future research directions include steerable generation, reasoning under uncertainty, and multi-modal understanding—all areas where capability combination could reduce model complexity.

The methodology demonstrated here provides a template for evaluating capability combination. Systematic ablation across model scales, comprehensive benchmark evaluation, and careful attention to evaluation metric limitations create a replicable framework for future capability research. This approach helps distinguish genuine capability synergies from measurement artifacts.

For enterprise AI strategy, the FIM-for-free finding suggests aggressive capability integration during initial training phases. Rather than developing specialized models for distinct tasks, teams might pursue unified models with multiple capabilities acquired simultaneously. This approach reduces model zoo complexity while potentially improving individual capability performance through synergistic effects.

The broader implications extend to alignment research and safety considerations. If models can acquire multiple capabilities simultaneously, alignment techniques must ensure robustness across all capability modes. Similarly, safety evaluation must consider interaction effects between capabilities rather than treating them as independent concerns. Ongoing alignment research increasingly recognizes these multi-capability challenges as fundamental to safe AI deployment.

Frequently Asked Questions

What is Fill-in-the-Middle (FIM) capability in language models?

FIM is a training technique that teaches autoregressive language models to infill text by rearranging document segments during training. Instead of only generating left-to-right, FIM models can generate missing content between existing prefix and suffix text, making them ideal for code completion and document editing.

Does adding FIM capability hurt the model’s original performance?

No. The FIM-for-free property shows that training with up to 90% FIM-transformed data preserves all original autoregressive capabilities at zero measurable cost across benchmarks like HumanEval, HellaSwag, and LAMBADA.

What’s the difference between PSM and SPM formatting for FIM?

PSM arranges segments as prefix-suffix-middle, while SPM uses suffix-prefix-middle. SPM is superior because it enables better key-value caching when the prefix changes during typing, reducing inference latency for real-time applications like coding assistants.

Can I add FIM capability to an existing trained model?

While possible, finetuning existing models for FIM is extremely expensive, requiring up to 50% of original pretraining compute. It’s more cost-effective to include FIM capability during pretraining from the start using the FIM-for-free approach.

What’s the optimal FIM rate for enterprise model training?

Research shows 50-90% FIM rate is optimal. This preserves all autoregressive capabilities while maximizing infilling performance. Higher rates (100%) begin to degrade left-to-right generation, while lower rates (15%) underutilize the FIM capability.
