DOCREWARD: How a New AI Reward Model Outperforms GPT-5 at Judging Document Professionalism

📌 Key Takeaways

  • Superior Performance: DocReward achieves 82.3% accuracy, outperforming GPT-5 by 14.6 points on document professionalism evaluation
  • Innovative Training: Uses textual-quality-agnostic framework with 117K paired documents sharing identical content but different formatting
  • Practical Applications: Integrates with reinforcement learning to improve both open-source and closed-source document generation models
  • Cross-Domain Robustness: Maintains high performance across 32 domains and multiple languages with minimal degradation
  • Visual Focus: Analyzes rendered document images rather than text, learning structural and stylistic patterns for professional presentation

Why Document Structure and Style Matter More Than You Think

In the rapidly evolving landscape of AI-powered document generation, we’ve achieved remarkable breakthroughs in content quality. Large language models can now produce sophisticated research papers, comprehensive reports, and technical documentation that rivals human expertise. Yet there’s a critical gap that most AI researchers have overlooked: the visual structure and style that make documents professional, readable, and engaging.

Consider two documents with identical text content—one featuring proper headings, consistent formatting, balanced whitespace, and professional layout, while the other presents the same information in a wall of text with inconsistent styling. The difference in readability, perceived authority, and user engagement is dramatic. This isn’t just about aesthetics; it’s about communication effectiveness and professional credibility.

The challenge becomes even more pronounced in agentic workflows, where AI agents autonomously generate complex documents. While these systems excel at producing high-quality textual content, they consistently neglect the visual structure and formatting that transform a good document into a great one. This oversight represents a significant limitation in current AI document generation pipelines.

Human-factors research has consistently shown that document presentation quality significantly impacts reader comprehension and engagement: well-formatted documents are markedly easier to scan, retain, and trust than poorly formatted counterparts, making document professionalism not just an aesthetic concern but a crucial factor in communication effectiveness.

The Core Problem: AI Can Write Well but Can’t Format Well

The fundamental issue lies in how current language models approach document generation. Most AI systems treat formatting as an afterthought, focusing primarily on textual coherence and factual accuracy. While these are undoubtedly important, they represent only part of what makes a document truly professional and effective.

Existing evaluation methods for document quality typically rely on textual metrics—BLEU scores, semantic similarity, factual consistency—but completely ignore structural and stylistic elements. This creates a blind spot in AI training: models optimize for content quality while remaining oblivious to presentation quality. The result is AI-generated documents that may be factually correct and well-written but appear unprofessional due to poor formatting choices.

Traditional approaches to addressing this gap have involved rule-based post-processing or template-driven systems. However, these solutions are rigid, domain-specific, and fail to capture the nuanced understanding of document professionalism that human readers intuitively possess; in practice, rule-based formatting checks fall well short of human evaluators at judging professional documents. What’s needed is a more sophisticated approach that can evaluate and guide document presentation with the same rigor applied to content generation.

The Textual-Quality-Agnostic Framework: A Clever Training Paradigm

The breakthrough innovation of DocReward lies in its textual-quality-agnostic training framework. This approach elegantly solves the challenge of isolating structural and stylistic evaluation from content quality assessment. The key insight is to train the model using document pairs that share identical textual content but differ only in their visual presentation and formatting.

In mathematical terms, this framework can be expressed as a preference learning problem where the model learns to distinguish between high-professionalism and low-professionalism versions of the same content. By keeping the text constant while varying the formatting, the model is forced to learn purely visual and structural signals rather than relying on textual quality cues.
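Concretely, this kind of pairwise setup is usually written as a Bradley-Terry model over document pairs. The notation below is illustrative rather than taken verbatim from the paper: r_θ is the learned reward over a rendered document, and d⁺, d⁻ denote the more and less professional versions of the same text.

```latex
% r_\theta scores a rendered document; d^+ and d^- share identical text
P(d^+ \succ d^-) = \sigma\big(r_\theta(d^+) - r_\theta(d^-)\big),
\qquad
\mathcal{L}(\theta) = -\log \sigma\big(r_\theta(d^+) - r_\theta(d^-)\big)
```

Because the two documents differ only in formatting, minimizing this loss can only reward visual and structural signal.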

This approach represents a significant departure from traditional document evaluation methods. Instead of trying to disentangle content quality from presentation quality in naturally occurring documents, the framework explicitly controls for content quality during training. The result is a model that develops a sophisticated understanding of document professionalism based solely on visual structure, typography, layout, and formatting choices.

Building DOCPAIR: A 117K Dataset of Professional Document Pairs

The creation of the DOCPAIR dataset represents a massive undertaking in data engineering and curation. With 117,108 document pairs spanning 32 domains and 267 document types, this dataset provides the foundation for training a robust document professionalism evaluation model.

The dataset construction process involved three carefully orchestrated phases. Phase 1 focused on curating high-quality source documents from government corpora and CommonCrawl, ensuring a diverse foundation of professional content. The selection criteria emphasized documents that already demonstrated high professional standards, providing reliable examples of quality formatting and structure.

Phase 2 employed AI agents to expand textual content into full documents and then refine these documents through iterative improvement processes. This expansion phase ensured that the dataset captured a wide range of document types and formatting scenarios while maintaining content consistency within each pair.

Phase 3 implemented a sophisticated ranking system using human-verified heuristics and oracle-based annotation to establish clear professionalism preferences. The resulting dataset covers diverse domains including government documents (32.2%), educational materials (28.6%), non-profit communications (9.6%), medical documentation (5.7%), and scientific publications (5.0%), among others. This diversity ensures that the trained model can generalize across different professional contexts and document types.

How DOCREWARD Works: Architecture and Training

DocReward builds upon the robust foundation of Qwen2.5-VL, a state-of-the-art vision-language model, by adding a specialized regression head optimized for document professionalism evaluation. This architectural choice enables the model to process rendered document page images directly, capturing visual and structural information that would be lost in text-only approaches.

The training process employs Bradley-Terry preference loss, a mathematical framework specifically designed for pairwise comparison tasks. This loss function enables the model to learn relative professionalism rankings rather than absolute scores, which aligns well with the inherently comparative nature of document quality assessment.
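As a rough sketch of how such a pairwise loss behaves (a plain-Python stand-in, not the authors' training code):

```python
import math

def bradley_terry_loss(score_preferred: float, score_rejected: float) -> float:
    """Negative log-likelihood that the preferred (more professional)
    document outranks the rejected one under a Bradley-Terry model."""
    margin = score_preferred - score_rejected
    # equivalent to -log(sigmoid(margin))
    return math.log1p(math.exp(-margin))

# The loss is small when the reward model ranks the pair correctly
# and grows when the ranking is flipped:
correct = bradley_terry_loss(2.0, 0.5)   # positive margin, small loss
flipped = bradley_terry_loss(0.5, 2.0)   # negative margin, large loss
```

Only the score *difference* matters, which is why the model learns relative rankings rather than calibrated absolute scores.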

One of the most interesting findings from the development process was that image-only input significantly outperformed approaches that included OCR text and bounding box information. With image-only input, the 7B parameter model achieved 87.94% accuracy compared to 84.41% when additional textual features were included. This result confirms that the model successfully learns to evaluate visual and structural cues rather than relying on text-based semantic understanding.

The training methodology emphasizes the model’s ability to focus on elements like heading hierarchy, font consistency, whitespace utilization, alignment patterns, and overall visual balance—all crucial components of professional document presentation that are typically overlooked by text-focused AI systems.

Benchmark Results: Crushing GPT-5 by 14.6 Points

The performance results on DOCPAIRBENCH, a human-annotated benchmark of 1,443 document pairs, demonstrate DocReward’s significant superiority over existing approaches. The model achieves an impressive 82.3% accuracy, substantially outperforming all tested baselines including the most advanced closed-source models.

The performance gap is particularly striking when compared to leading commercial models. DocReward outperforms GPT-5 by 14.6 percentage points (82.3% vs 67.7%), GPT-4o by 27.4 percentage points (82.3% vs 54.9%), and Claude Sonnet 4 by 28.1 percentage points (82.3% vs 54.2%). Even the smaller DocReward-3B variant achieves 80.6% accuracy, still surpassing all commercial alternatives.

The best-of-N evaluation provides additional validation of the model’s practical utility. In scenarios where multiple document versions are generated and DocReward selects the best option, the win rate reaches 60.8% compared to only 16.9% losses and 22.3% ties. This represents a substantial improvement over random selection (24.6% win rate) and competitive performance against human-guided selection methods.

These results demonstrate that specialized training on document professionalism can achieve superior performance compared to general-purpose models, even when those models have significantly more parameters and training data. The focused approach of DocReward proves that domain-specific optimization can overcome the advantages of scale in certain specialized tasks.

Practical Impact: Best-of-N Selection and Reinforcement Learning

The practical applications of DocReward extend far beyond evaluation benchmarks. The model serves as a powerful tool for improving document generation in real-world AI workflows through two primary mechanisms: best-of-N selection and reinforcement learning integration.

In best-of-N scenarios, DocReward enables AI systems to generate multiple document versions and automatically select the one with the highest professionalism score. This approach provides immediate quality improvements without requiring model retraining or architectural changes. Organizations can integrate DocReward into existing document generation pipelines to enhance output quality with minimal implementation overhead.
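In pseudocode, best-of-N selection reduces to scoring each candidate and keeping the argmax. Here `score` is a hypothetical stand-in for a call to the reward model:

```python
from typing import Callable, Sequence, TypeVar

Doc = TypeVar("Doc")

def best_of_n(candidates: Sequence[Doc], score: Callable[[Doc], float]) -> Doc:
    """Return the candidate with the highest professionalism score."""
    if not candidates:
        raise ValueError("need at least one candidate document")
    return max(candidates, key=score)

# Toy illustration with a stand-in scorer (here: string length):
drafts = ["v1", "v2 with headings", "v3 ok"]
chosen = best_of_n(drafts, score=len)  # picks the longest string
```

Because selection happens entirely at inference time, this slots into an existing generation pipeline without retraining anything.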

The reinforcement learning applications demonstrate even more impressive results. When integrated with Group Relative Policy Optimization (GRPO), DocReward serves as a reward signal to train models for better document formatting. For Qwen2.5-Coder, this integration improved success rates from 30% to 100% while improving average human quality rankings from 4.58 to 2.84 (lower is better).
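GRPO's defining step is to normalize each sampled completion's reward against the rest of its sampling group; a minimal sketch of that normalization (not the authors' code):

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Center and scale each rollout's reward within its sampling group,
    so the policy gradient pushes toward above-average documents."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Documents scored above the group mean get positive advantage:
advantages = group_relative_advantages([0.2, 0.9, 0.4])
```

With DocReward supplying the per-document rewards, the policy is steered toward the more professional renderings within each group.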

Perhaps most remarkably, DocReward can improve even closed-source models through training-free GRPO techniques. When applied to GPT-4o, the approach increased success rates from 52% to 78% and improved human rankings from 3.18 to 2.02. This capability enables organizations to enhance commercially available models without requiring access to model parameters or training infrastructure.

The combined reward formulation, which integrates rule-based constraints with DocReward scoring, provides a balanced approach that ensures both correctness and professionalism. This hybrid methodology addresses the common challenge of maintaining content quality while improving presentation quality in AI-generated documents.
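The exact formulation is not spelled out here, but hybrid rewards of this kind typically gate the learned score behind hard rule-based checks. A hypothetical sketch, with all names and the gating scheme being assumptions:

```python
def combined_reward(passes_rules: bool, professionalism_score: float,
                    weight: float = 1.0) -> float:
    """Hypothetical hybrid reward: hard constraints act as a gate,
    the reward model's score provides the shaped signal."""
    if not passes_rules:
        return 0.0  # e.g. document failed to render or lost required content
    return weight * professionalism_score
```

Gating (rather than summing) ensures a beautifully formatted but broken document can never out-score a correct one.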

Robustness: Cross-Domain and Cross-Lingual Generalization

One of the most impressive aspects of DocReward is its robustness across different domains and languages. The model demonstrates remarkable generalization capabilities that suggest deep understanding of universal professionalism principles rather than memorization of training data patterns.

In out-of-domain evaluations, DocReward shows only a 4.8 percentage point performance drop, maintaining 77.5% accuracy compared to its 82.3% in-domain performance. This level of robustness significantly exceeds that of commercial alternatives, with GPT-5 achieving only 68.4% accuracy in the same out-of-domain scenarios.

The cross-lingual robustness results are equally impressive. When evaluated on French, Spanish, and Russian documents, DocReward maintains 77.9% accuracy, only a 4.4 percentage point drop from its English performance. This minimal degradation compares favorably to GPT-4o’s 7.4-point and GPT-5’s 7.3-point drops in the same cross-lingual scenarios.

These robustness metrics suggest that DocReward has learned fundamental principles of document professionalism that transcend specific domains and languages. The model appears to capture universal visual design principles, typography standards, and layout conventions that apply across different professional contexts and linguistic traditions.

The strong inter-annotator agreement of 83.4% (Cohen’s Kappa) provides additional validation that the model’s professionalism assessments align with consistent human judgments. This consistency supports the model’s practical deployment in diverse organizational contexts where reliable, objective document evaluation is essential.
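For reference, Cohen's Kappa corrects raw agreement for the agreement two annotators would reach by chance. A small self-contained implementation of the standard formula:

```python
def cohens_kappa(labels_a: list, labels_b: list) -> float:
    """Chance-corrected agreement between two annotators over the same items."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    categories = set(labels_a) | set(labels_b)
    # expected chance agreement from each annotator's label frequencies
    expected = sum((labels_a.count(c) / n) * (labels_b.count(c) / n)
                   for c in categories)
    return (observed - expected) / (1 - expected)
```

Kappa ranges from -1 (systematic disagreement) through 0 (chance-level) to 1 (perfect agreement), which is why it is a stricter check than raw percentage agreement.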

What the Model Actually Looks At: Attention Map Analysis

Understanding how DocReward makes its professionalism assessments provides valuable insights into document design principles and model interpretability. Attention map analysis reveals that the model focuses on specific visual elements that human readers also associate with professional presentation.

The model demonstrates strong attention to heading hierarchies and numbering systems, recognizing that clear information architecture is fundamental to document professionalism. It also shows sensitivity to consistent typography choices, proper spacing between elements, and balanced use of whitespace—all elements that contribute to visual appeal and readability.

Table formatting receives particular attention from the model, which learns to evaluate grid alignment, border consistency, and header differentiation. Similarly, the model focuses on page headers and footers, recognizing their role in establishing document structure and navigation aids for readers.

Perhaps most importantly, the attention analysis confirms that the model processes structural rather than semantic information. The model’s focus areas align with visual design principles rather than content topics, validating the textual-quality-agnostic training approach and demonstrating that the model has successfully learned to evaluate presentation independent of content.

This visual processing capability positions DocReward as a powerful tool for understanding and improving document design. Organizations can leverage the model’s insights to develop better formatting guidelines, template designs, and automated document improvement systems.

Limitations and Future Directions

While DocReward represents a significant breakthrough in document professionalism evaluation, it’s important to acknowledge current limitations and areas for future development. The model currently provides only scalar professionalism scores without natural language explanations or specific improvement recommendations.

Future research directions include developing interpretable reward models that can provide diagnostic feedback about specific formatting issues and improvement suggestions. Such capabilities would enhance the model’s utility in educational contexts and automated document improvement systems.

Another area for expansion involves extending the evaluation framework beyond traditional document types to include presentations, web pages, and interactive content. As digital communication evolves, the principles of visual professionalism must adapt to new mediums and interaction paradigms.

Additionally, research into personalized professionalism models could account for industry-specific standards, organizational style guides, and cultural preferences. Such customization would enable more targeted applications in diverse professional contexts while maintaining the core benefits of automated quality assessment.

The integration of DocReward with emerging AI workflows presents exciting opportunities for comprehensive document generation systems that seamlessly combine content creation, fact-checking, formatting optimization, and quality assurance into unified pipelines.

Why This Matters for the Future of AI Document Generation

The implications of DocReward extend far beyond academic research into practical applications that will shape the future of AI-powered communication. As organizations increasingly rely on AI for document generation, the ability to ensure professional presentation becomes a critical competitive advantage.

In enterprise contexts, DocReward enables automated quality assurance for AI-generated reports, proposals, and communications. This capability reduces the need for manual formatting review while ensuring consistent professional standards across all AI-generated content.

The model’s integration potential with existing enterprise AI systems promises to revolutionize document workflow automation. Organizations can build comprehensive pipelines that generate, format, review, and optimize documents with minimal human intervention while maintaining high professional standards.

Looking forward, the principles demonstrated by DocReward will likely influence the development of next-generation AI models that inherently consider presentation quality alongside content quality. This evolution represents a maturation of AI systems from tools that simply generate text to comprehensive communication assistants that understand the full spectrum of effective document creation.

For professionals working with AI-generated content, DocReward provides a glimpse into a future where document quality encompasses both what is said and how it is presented. As AI capabilities continue to advance, the integration of sophisticated formatting evaluation and improvement will become an essential component of any serious document generation system.

The success of DocReward demonstrates that specialized AI models can achieve superior performance in focused domains, even when competing against much larger general-purpose systems. This finding fits a broader pattern in machine learning: domain-specific optimization often outperforms scale-based improvements on specialized tasks. It has implications for AI development strategy, suggesting that targeted optimization will remain valuable even as general AI capabilities continue to improve, particularly in professional applications where precision and reliability are paramount.

Frequently Asked Questions

What is DocReward and how does it work?

DocReward is an AI reward model built on Qwen2.5-VL that evaluates document professionalism by analyzing rendered page images. It uses a textual-quality-agnostic framework, meaning it focuses purely on visual structure and style rather than content quality. The model outputs scalar scores to assess formatting, layout, and professional presentation.

How does DocReward outperform GPT-5?

DocReward achieves 82.3% accuracy on the DOCPAIRBENCH benchmark, outperforming GPT-5 by 14.6 percentage points (GPT-5: 67.7%). It also beats GPT-4o by 27.4 points and Claude Sonnet 4 by 28.1 points. This superior performance comes from specialized training on document pairs with identical content but different formatting.

What is the textual-quality-agnostic framework?

This innovative training approach uses document pairs that share identical textual content but differ only in formatting and visual presentation. By removing textual quality as a variable, the model learns to evaluate purely structural and stylistic elements like headings, spacing, layouts, and professional presentation standards.

What applications does DocReward have in AI workflows?

DocReward can be integrated into agentic workflows for document generation, used for best-of-N selection to choose the best formatted version from multiple AI-generated documents, and employed in reinforcement learning pipelines to improve both open-source and closed-source models’ document formatting capabilities.

How robust is DocReward across different domains and languages?

DocReward shows excellent robustness with only a 4.8 percentage point drop in out-of-domain evaluation (82.3% → 77.5%) and minimal degradation across languages, dropping just 4.4 percentage points on French, Spanish, and Russian documents. This outperforms the cross-lingual robustness of GPT-4o and GPT-5.
