Apple Intelligence Foundation Models 2025: On-Device and Server AI Architecture Explained

📌 Key Takeaways

  • Dual model architecture: Apple deploys a ~3B-parameter on-device model optimized for Apple silicon and a PT-MoE server model for Private Cloud Compute, both supporting 16 languages and image understanding.
  • KV-cache sharing innovation: The on-device model reduces memory usage by 37.5% through a two-block architecture where the second block reuses key-value caches from the first.
  • 2-bit quantization: Quantization-aware training compresses on-device model weights to 2-bit precision while maintaining competitive quality against larger open baselines.
  • New developer framework: A Swift-centric Foundation Models API gives developers guided generation, constrained tool calling, and LoRA adapter fine-tuning in a few lines of code.
  • Privacy by design: No user personal data used for training, on-device processing by default, and Private Cloud Compute with end-to-end encryption for server requests.

Apple Foundation Models 2025 Technical Report Overview

The Apple Intelligence Foundation Models Tech Report 2025, published alongside the Worldwide Developers Conference in June 2025, represents Apple’s most detailed public disclosure of the AI architecture powering Apple Intelligence across iPhone, iPad, Mac, and Apple’s cloud infrastructure. This 27-page technical paper—authored by nearly 400 Apple researchers—documents the engineering decisions, architectural innovations, and training methodologies behind the next generation of Apple’s foundation language models.

At its core, the Apple Foundation Models 2025 report describes two complementary AI systems designed for fundamentally different deployment contexts. The first is a compact ~3 billion parameter on-device model engineered to run efficiently on Apple silicon, enabling AI features without requiring an internet connection or sending user data to external servers. The second is a scalable server-based model built on a novel Parallel-Track Mixture-of-Experts (PT-MoE) transformer architecture, designed specifically for Apple’s Private Cloud Compute platform.

Both models represent significant advances over their 2024 predecessors. They now support 16 languages (expanded from English-only initial deployment), understand images and text inputs, can execute tool calls to interact with apps and services, and demonstrate improved reasoning capabilities. In public benchmarks and human evaluations, both the server model and the on-device model match or surpass comparably sized open baselines—a remarkable achievement given Apple’s aggressive optimization for efficiency and privacy.

On-Device Model Architecture: 3B Parameters on Apple Silicon

The on-device component of the Apple Foundation Models 2025 represents a masterclass in computational efficiency. Running a ~3 billion parameter language model on a mobile device with limited memory and battery presents engineering challenges that differ fundamentally from cloud deployment. Apple’s approach centers on two architectural innovations: KV-cache sharing and 2-bit quantization-aware training.

KV-Cache Sharing addresses one of the primary memory bottlenecks in transformer inference. Apple divided the on-device model into two blocks: Block 1 contains 62.5% of the total transformer layers with full key-value projections, while Block 2 contains the remaining 37.5% of layers with the key and value projections removed entirely. Instead of computing its own caches, Block 2 directly reuses the key-value caches generated by Block 1, reducing the memory footprint of the attention mechanism by more than a third without proportional quality degradation.
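The reported saving follows directly from the layer split. A back-of-the-envelope sketch, using hypothetical layer, head, and dimension values (the report does not publish the model’s full dimensions):

```python
# KV-cache sizing for the two-block design. Layer count, head count, and
# dims below are illustrative assumptions, not figures from the report.

def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """Bytes for the K and V caches across all layers that keep them."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

n_layers = 32                      # hypothetical total layer count
block1 = int(n_layers * 0.625)     # layers that keep their own KV projections
baseline = kv_cache_bytes(n_layers, 8, 128, 4096)
shared   = kv_cache_bytes(block1,   8, 128, 4096)  # Block 2 reuses Block 1's caches

saving = 1 - shared / baseline
print(f"{saving:.1%}")  # 37.5%
```

Because the saving is a ratio of layer counts, it holds regardless of the specific head and dimension values chosen above.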

This architectural decision reflects a sophisticated understanding of attention patterns in language models. Research has shown that deeper layers in transformers often exhibit attention patterns similar to earlier layers, making it possible to reuse cached representations rather than computing new ones. Apple’s implementation demonstrates that this insight can be exploited at production scale to deliver meaningful memory savings on devices where every megabyte matters.

2-Bit Quantization-Aware Training takes model compression further by training the model to perform well at extremely low numerical precision. Traditional models use 16-bit or 32-bit floating-point numbers; Apple’s approach trains the model with the knowledge that inference will occur at 2-bit precision, allowing the training process to learn weight representations that are robust to quantization noise. The result is a model roughly 8x smaller than its full-precision equivalent with minimal quality loss—essential for fitting into the memory constraints of iPhone and iPad devices.
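The core of quantization-aware training is “fake quantization”: the forward pass rounds weights to one of 2² = 4 levels, so gradients steer the model toward weights that survive rounding. A minimal sketch assuming a uniform grid over each weight group; Apple’s actual scheme (per-group scales, the straight-through gradient trick) is not detailed here:

```python
# Fake-quantize a group of weights to 2-bit precision: each value snaps
# to the nearest of 4 uniformly spaced levels spanning the group's range.
# This is an illustrative uniform scheme, not Apple's exact method.

def fake_quant_2bit(weights):
    """Round each weight to the nearest of 2**2 = 4 levels."""
    lo, hi = min(weights), max(weights)
    n_levels = 2 ** 2
    step = (hi - lo) / (n_levels - 1)
    return [lo + round((w - lo) / step) * step for w in weights]

w = [-0.9, -0.2, 0.1, 0.45, 0.8]
print(fake_quant_2bit(w))  # at most 4 distinct values survive
```

During training, the loss is computed on the quantized weights while gradients update the full-precision copies, so the learned representations become robust to this rounding.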

The on-device model also incorporates an efficient vision backbone based on ViTDet-L with 300 million parameters, enabling image understanding capabilities directly on the device. Apple introduced a novel Register-Window (RW) mechanism that allows a global register token to interact with distinct local windows of an image before contributing to overall context aggregation, capturing both fine-grained details and broader visual information efficiently.

PT-MoE Server Architecture for Private Cloud Compute

For tasks requiring more computational power than on-device processing can provide, Apple developed the PT-MoE (Parallel-Track Mixture-of-Experts) architecture—a novel transformer design that combines three complementary innovations to deliver high quality at competitive computational cost on Private Cloud Compute infrastructure.

Track Parallelism organizes the model’s capacity into parallel processing tracks that can handle different aspects of input processing simultaneously. Unlike sequential architectures where each layer must fully process before the next begins, track parallelism enables concurrent computation paths that converge at strategic points in the network. This design is particularly well-suited to Apple’s custom server hardware, where parallel execution units can be fully utilized.

Mixture-of-Experts Sparse Computation ensures that only a subset of the model’s total parameters activate for any given input. Rather than running every computation through the entire model, expert routing mechanisms direct each input token to the most relevant expert modules. This means the model can maintain the knowledge capacity of a much larger dense model while requiring only the computational cost of activating a fraction of its parameters per inference step.
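The routing idea can be sketched in a few lines: a router scores every expert for a token, only the top-k experts run, and their outputs are blended by normalized router weights. The expert functions and router scores below are toy stand-ins, not anything from the report:

```python
# Toy mixture-of-experts layer: only the top-k scoring experts execute,
# so per-token compute is a fraction of the total parameter count.

def route(scores, k=2):
    """Return indices of the k highest-scoring experts."""
    return sorted(range(len(scores)), key=lambda i: -scores[i])[:k]

def moe_layer(token, experts, router, k=2):
    scores = router(token)
    picked = route(scores, k)
    total = sum(scores[i] for i in picked)
    # Weighted sum over the selected experts only; the others never run.
    return sum(scores[i] / total * experts[i](token) for i in picked)

experts = [lambda x, s=s: s * x for s in (1.0, 2.0, 3.0, 4.0)]  # 4 toy experts
router = lambda x: [0.1, 0.4, 0.2, 0.3]  # fixed scores for illustration
print(moe_layer(5.0, experts, router))   # only experts 1 and 3 execute
```

With 4 experts and k=2, half the expert parameters sit idle for this token; at production scale the active fraction is far smaller, which is where the compute savings come from.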

Interleaved Global-Local Attention alternates between full-context global attention layers and sliding-window local attention layers. The global attention layers provide the model with awareness of the entire input context, while local attention layers focus on nearby tokens with lower computational cost. By omitting positional embeddings (NoPE) in the global attention layers, the model achieves better length generalization, avoiding out-of-distribution position issues when processing long contexts. This interleaved design substantially reduces KV cache size for long-context inference while maintaining quality.
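The interleaving pattern determines what each query position can see per layer. A sketch with an assumed window size and an assumed 1-in-4 global-layer cadence (the report’s exact ratio and window are not given here):

```python
# Which prefix positions a query may attend to in each layer type.
# Window size and the global-layer cadence are illustrative assumptions.

def visible_positions(layer_idx, query_pos, window=4, global_every=4):
    """Positions a query at `query_pos` can attend to in layer `layer_idx`."""
    if layer_idx % global_every == 0:          # global layer: full prefix
        return list(range(query_pos + 1))
    start = max(0, query_pos - window + 1)     # local layer: sliding window
    return list(range(start, query_pos + 1))

print(visible_positions(0, 9))  # global layer: all of 0..9
print(visible_positions(1, 9))  # local layer: only the last 4 positions
```

The KV-cache benefit follows from the same logic: local layers only ever need the last `window` cached entries, so for long contexts most layers carry a small, fixed-size cache.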

The server model uses a standard Vision Transformer (ViT-g) with 1 billion parameters for its vision backbone, providing higher-fidelity image understanding than the on-device model. The combination of PT-MoE architecture with this larger vision encoder enables the server model to handle more complex multimodal tasks that exceed the on-device model’s capabilities.

Multimodal Vision and Language Understanding

The Apple Foundation Models 2025 mark a significant expansion in multimodal capabilities. Both the on-device and server models now natively process images alongside text, enabling features like visual search, document understanding, image-based calendar event creation, and contextual image analysis across Apple’s ecosystem.

The vision encoder architecture follows a two-stage training pipeline. In the first stage, Apple applies CLIP (Contrastive Language-Image Pre-training) to pre-train the vision backbone on more than 6 billion image-text pairs, including synthetic captions and web-crawled alt-text. This provides the vision backbone with robust visual grounding capabilities. The second stage jointly trains the vision backbone with a vision-language adaptation module and a compact 302M parameter language model decoder to align image features with the language model’s representation space.

The vision-language adaptation module compresses visual features into a fixed number of image tokens matching the language model’s token dimension. It combines a transformer layer, a linear projection layer, and a 3×3 convolutional layer to capture both global and local visual information. An average pooling layer further compresses the features, ensuring efficient integration with the language processing pipeline without overwhelming the model’s context window.
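The final pooling step is the easiest piece to make concrete: a grid of visual features is averaged down to a fixed, smaller number of image tokens. A toy sketch with made-up grid and pool sizes (the report does not give these dimensions):

```python
# Average-pool a 2D grid of per-patch feature scalars down to a smaller
# fixed grid of "image tokens". Sizes here are illustrative only.

def avg_pool_tokens(grid, pool=2):
    """Average-pool a 2D grid by a pool x pool factor."""
    out = []
    for r in range(0, len(grid), pool):
        row = []
        for c in range(0, len(grid[0]), pool):
            patch = [grid[r + dr][c + dc]
                     for dr in range(pool) for dc in range(pool)]
            row.append(sum(patch) / len(patch))
        out.append(row)
    return out

grid = [[1.0, 3.0, 5.0, 7.0],
        [1.0, 3.0, 5.0, 7.0],
        [2.0, 2.0, 6.0, 6.0],
        [2.0, 2.0, 6.0, 6.0]]
pooled = avg_pool_tokens(grid)  # 4x4 feature grid -> 2x2 image tokens
print(pooled)
```

In the real adapter each grid cell is a feature vector rather than a scalar, and the transformer, linear, and convolutional layers run before this compression, but the token-count reduction works the same way.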

Training data for visual understanding spans multiple categories: over 10 billion high-quality image-text pairs from web crawling, 175 million interleaved image-text documents containing over 550 million images, more than 5 billion synthetically captioned image pairs, text-rich image data including PDFs and documents, and domain-specific datasets for fields like science and healthcare. This diversity ensures the models can handle the wide range of visual inputs users encounter daily.

Training Data Strategy and Responsible AI Sourcing

Apple’s approach to training data for the Apple Foundation Models 2025 reflects the company’s distinctive position on privacy and intellectual property. The tech report details a multi-source data strategy that explicitly excludes user personal data while building comprehensive training datasets from web content, licensed corpora, and high-quality synthetic data.

The web data pipeline uses Applebot, Apple’s web crawler, to source pre-training data spanning hundreds of billions of pages across an extensive range of languages, locales, and topics. Apple emphasizes adherence to robots.txt protocols and provides fine-grained opt-out controls for web publishers. The 2025 models introduce several pipeline improvements: enhanced crawling with headless rendering for dynamic websites, expanded scale covering more mathematical and programming content, LLM-assisted content extraction for complex documents, and refined filtering that replaces aggressive heuristic rules with model-based quality signals.

Licensed data partnerships complement web-sourced content, providing high-quality curated datasets that meet Apple’s quality and legal standards. The report explicitly states that Apple does not use users’ private personal data or user interactions when training foundation models—a differentiating claim in an industry where many competitors train on user-generated content.

Synthetic data generation plays an increasingly important role. Apple developed an in-house image captioning model capable of producing captions at different detail levels, from keyword lists to paragraph-length descriptions. For text-rich image data, the team generated transcription and question-answer pairs from PDFs, documents, infographics, tables, and charts. For domain-specific knowledge, teacher models synthesize training examples based on curated image sets. This synthetic data strategy addresses the fundamental challenge that web-crawled data, while vast, often lacks the structured quality needed for specific capabilities.

The tokenizer was expanded from 100,000 to 150,000 vocabulary items to better support the 16 languages now covered by Apple Intelligence, achieving adequate representation quality for many additional languages with just 50% more tokens—an efficient scaling of multilingual capability.

Pre-Training and Post-Training Pipeline

The Apple Foundation Models 2025 undergo a sophisticated multi-stage training pipeline that progresses from broad pre-training through increasingly focused post-training phases. This pipeline is designed to first build general language and visual understanding, then refine behavior for Apple Intelligence’s specific use cases.

Pre-training establishes the models’ foundational capabilities across language understanding, generation, reasoning, and visual comprehension. The recipe has evolved to support more languages and a wider array of features. For the vision encoder, pre-training occurs in two stages: contrastive CLIP pre-training on 6+ billion image-text pairs at 448×448 resolution, followed by joint training with a language model decoder using enriched data including interleaved image-text documents and domain-specific visual data.

Supervised Fine-Tuning (SFT) adapts the pre-trained models for instruction following and task-specific behaviors. Apple uses carefully curated datasets that demonstrate desired response patterns across the range of Apple Intelligence features, from text composition and summarization to tool calling and image analysis.

Reinforcement Learning from Human Feedback (RLHF) further aligns model outputs with human preferences. The 2025 report introduces a new asynchronous reinforcement learning platform that enables more efficient training at scale. This platform supports parallel policy optimization across multiple reward signals, allowing Apple to simultaneously optimize for response quality, safety, helpfulness, and adherence to Apple’s responsible AI guidelines.

The post-training pipeline incorporates direct preference optimization techniques that enable more stable and efficient alignment than traditional RLHF approaches. Apple’s implementation integrates safety training throughout the post-training process rather than treating it as a separate phase, ensuring that safety behaviors are deeply embedded in the models’ learned representations rather than superficially applied.

Foundation Models Framework for Apple Developers

Perhaps the most significant announcement accompanying the Apple Foundation Models 2025 tech report is the new Foundation Models framework—a Swift-centric API that gives third-party developers direct access to Apple’s on-device language model. This framework transforms Apple Intelligence from a closed system into a platform that any iOS, iPadOS, or macOS developer can build upon.

The framework exposes three core capabilities. Guided Generation allows developers to specify output schemas that constrain the model’s generation to follow structured formats. Rather than parsing free-form text output, developers define the expected response structure and the model generates output that conforms to it—dramatically simplifying integration into typed Swift applications.

Constrained Tool Calling maps natural language user inputs to predefined function calls. Developers define a set of available tools with parameter schemas, and the model determines which tool to invoke and with what parameters based on the user’s request. This enables natural language interfaces for any app functionality without developers needing to build their own intent classification systems.
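The mechanism can be illustrated with a toy tool registry and a validation step. Everything below is hypothetical: the tool names, schemas, and dispatch logic are invented for illustration, and the real Foundation Models framework exposes this through Swift APIs rather than anything like this Python sketch:

```python
# Conceptual sketch of constrained tool calling: the app registers tools
# with parameter schemas, and a model-proposed call is validated against
# the schema before it is dispatched. All names here are hypothetical.

tools = {
    "create_event": {"params": {"title": str, "date": str}},
    "send_message": {"params": {"recipient": str, "body": str}},
}

def dispatch(call):
    """Validate a proposed call against the registered schemas, then run it."""
    schema = tools.get(call["tool"])
    if schema is None:
        raise ValueError(f"unknown tool {call['tool']!r}")
    for name, typ in schema["params"].items():
        if not isinstance(call["args"].get(name), typ):
            raise ValueError(f"bad or missing argument {name!r}")
    return f"ok: {call['tool']}"

# A call the model might produce for "add lunch with Sam on Friday":
print(dispatch({"tool": "create_event",
                "args": {"title": "Lunch with Sam", "date": "2025-06-13"}}))
```

The point of constraining generation to the schema is that the validation step almost never fires in practice: the model can only emit calls that the registry already describes.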

LoRA Adapter Fine-Tuning allows developers to customize the on-device model’s behavior for their specific use cases without modifying the base model weights. Low-Rank Adaptation (LoRA) adds small trainable parameter matrices alongside the frozen base model, enabling task-specific behavior with minimal additional memory overhead. This means developers can create specialized AI capabilities—a medical terminology assistant, a legal document analyzer, or a code review bot—that run entirely on-device.
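The arithmetic behind LoRA is compact enough to sketch directly: the frozen base weight W is augmented by a low-rank product B·A scaled by alpha/r, and only the small A and B matrices are trained. Dimensions and values below are toy choices for illustration:

```python
# Minimal LoRA forward pass: output = x @ W + (alpha / r) * (x @ A @ B),
# where W is frozen and only the tiny A and B matrices are trainable.
# All dimensions and values here are toy examples.

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def lora_forward(x, W, A, B, alpha=8):
    r = len(A[0])                        # rank = number of columns of A
    base = matmul(x, W)                  # frozen pretrained path
    delta = matmul(matmul(x, A), B)      # low-rank trainable path
    scale = alpha / r
    return [[b + scale * d for b, d in zip(rb, rd)]
            for rb, rd in zip(base, delta)]

W = [[1.0, 0.0], [0.0, 1.0]]   # frozen 2x2 base weight
A = [[0.1], [0.0]]             # 2x1 down-projection (rank 1)
B = [[0.0, 0.5]]               # 1x2 up-projection
print(lora_forward([[2.0, 3.0]], W, A, B))
```

For a rank-1 adapter on a 2×2 weight, A and B together hold 4 values against W’s 4; at real model scale the adapter is a tiny fraction of the base, which is why many task-specific adapters can coexist on one device.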

The framework’s Swift-native design reflects Apple’s commitment to making AI accessible to the broader developer community. Rather than requiring Python expertise or machine learning knowledge, developers can integrate foundation model capabilities using familiar Swift patterns and a few lines of code. This democratization of on-device AI could catalyze an ecosystem of AI-powered applications that maintain Apple’s privacy-first principles by processing all data locally.

Privacy-First AI: Private Cloud Compute Architecture

Apple’s approach to the Apple Foundation Models 2025 is inseparable from its privacy architecture. While many competitors process AI requests on standard cloud infrastructure, Apple designed Private Cloud Compute (PCC) as a purpose-built platform that extends the security properties of Apple devices into the cloud.

The privacy architecture operates on a tiered principle. When possible, AI processing happens entirely on-device using the 3B parameter model, with no data leaving the user’s device. When tasks require more computational power, requests are routed to Private Cloud Compute servers running the PT-MoE model. These servers process requests with several critical guarantees: user data is not stored after processing, Apple cannot access the content of user requests, and independent security researchers can verify the integrity of the system.

The on-device processing tier handles a remarkable range of tasks given the model’s compact size. Text composition, email summarization, notification prioritization, basic image analysis, and many tool-calling operations run entirely locally. Only when the on-device model determines that a request exceeds its capabilities does processing escalate to PCC—and the user is informed when this occurs.
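The tiering decision described above can be sketched as a simple routing function. The capability check below is hypothetical: the report does not specify the exact criteria by which a request is judged to exceed the on-device model’s abilities.

```python
# Toy sketch of the tiered processing decision: run locally when the task
# fits the on-device model, otherwise escalate to Private Cloud Compute
# and surface that to the user. Task names and limits are made up.

ON_DEVICE_TASKS = {"summarize", "compose", "prioritize_notifications"}

def process(request):
    if request["task"] in ON_DEVICE_TASKS and request["tokens"] <= 4096:
        return {"tier": "on-device", "informed_user": False}
    # Escalation path: encrypted request to PCC, no data retained afterward.
    return {"tier": "private-cloud-compute", "informed_user": True}

print(process({"task": "summarize", "tokens": 800}))
print(process({"task": "deep_research", "tokens": 12000}))
```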

Apple’s Responsible AI approach encompasses content filtering and locale-specific evaluation, ensuring models behave appropriately across different cultural and regulatory contexts. The 2025 tech report describes safety measures integrated throughout the training and deployment pipeline rather than applied as a final post-processing step. This design philosophy—building safety into the foundation rather than bolting it on—resonates with broader industry trends toward responsible AI development.

For organizations evaluating enterprise AI deployment, Apple’s privacy architecture provides an instructive model of how powerful AI capabilities can be delivered while maintaining stringent data protection standards. The separation of on-device and private cloud processing, combined with the Foundation Models framework, suggests a future where AI-powered applications routinely process sensitive data without compromising user privacy.

Benchmark Performance and Human Evaluations

The Apple Foundation Models 2025 tech report provides extensive benchmark results demonstrating that both the on-device and server models match or surpass comparably sized open baselines. This is a notable achievement given the aggressive optimization for efficiency and privacy that constrains Apple’s design choices.

The on-device model, despite its 2-bit quantization and KV-cache sharing optimizations, performs competitively against open models of similar parameter counts running at full precision. The report presents results across standard NLP benchmarks including reading comprehension, reasoning, code generation, and mathematical problem-solving. The model also demonstrates strong multilingual performance across the 16 supported languages, with quality metrics that approach larger models on many tasks.

The server model, leveraging the PT-MoE architecture’s ability to maintain large knowledge capacity with efficient inference, achieves results competitive with significantly larger dense models. The mixture-of-experts approach means the model can match the quality of models requiring 2-3x more compute per inference, translating to lower latency and cost on Private Cloud Compute infrastructure.

Human evaluations complement automated benchmarks by assessing qualities that metrics struggle to capture: response helpfulness, fluency, safety, and adherence to user intent. The tech report describes evaluation protocols involving human raters across multiple languages and task types, with results indicating strong preference for Apple’s models over comparably sized alternatives on subjective quality dimensions.

The multimodal capabilities are evaluated separately, with benchmarks covering image captioning, visual question answering, document understanding, and image-grounded reasoning. Both models demonstrate the ability to extract meaningful information from images and integrate visual understanding with language processing, with the server model showing particular strength on complex multimodal tasks.

Perhaps most significantly, the benchmark results validate Apple’s approach of optimizing for the deployment constraint first and then engineering quality within those constraints. Rather than building the largest possible model and hoping to compress it later, Apple designed architectures that are inherently efficient, then trained them to maximize capability within those efficiency boundaries. This approach yields models that may not top every leaderboard but deliver consistently strong performance across all deployment scenarios that matter for Apple Intelligence users.

Frequently Asked Questions

What are the Apple Intelligence Foundation Models 2025?

The Apple Intelligence Foundation Models 2025 consist of two multilingual, multimodal language models: a ~3 billion parameter on-device model optimized for Apple silicon using KV-cache sharing and 2-bit quantization-aware training, and a scalable server model built on a novel Parallel-Track Mixture-of-Experts (PT-MoE) transformer architecture running on Apple’s Private Cloud Compute platform. Both support 16 languages, image understanding, and tool calling.

How does Apple’s on-device AI model work?

Apple’s ~3B parameter on-device model runs directly on Apple silicon through two key innovations: KV-cache sharing, which divides the model into two blocks where the second block reuses key-value caches from the first (reducing memory by 37.5%), and 2-bit quantization-aware training that compresses model weights while maintaining quality. It uses a ViTDet-L vision backbone with 300M parameters for image understanding.

What is Apple’s PT-MoE server architecture?

PT-MoE (Parallel-Track Mixture-of-Experts) is Apple’s novel server model architecture that combines three innovations: track parallelism for processing different data streams, mixture-of-experts sparse computation where only relevant expert modules activate for each input, and interleaved global-local attention that alternates between full context and sliding window attention to reduce KV cache size while maintaining quality.

How does Apple protect user privacy with AI models?

Apple protects privacy through multiple mechanisms: the on-device model processes requests locally without sending data to servers; Private Cloud Compute handles server-side requests with end-to-end encryption and no data retention; Apple does not use users’ private personal data or interactions for training; and content filtering and locale-specific evaluation provide safety guardrails.

What is the Foundation Models framework for developers?

The Foundation Models framework is a new Swift-centric API that gives developers direct access to Apple’s on-device language model. It exposes guided generation (structured output following schemas), constrained tool calling (mapping natural language to function calls), and LoRA adapter fine-tuning (customizing model behavior for specific tasks), all accessible with just a few lines of Swift code.
