NVIDIA RTX Blackwell GPU Architecture: Built for Neural Rendering

📌 Key Takeaways

  • 21,760 CUDA Cores: The RTX 5090 packs 170 SMs with 5th-gen Tensor Cores delivering unprecedented AI throughput for neural rendering workloads.
  • DLSS 4 Multi-Frame Generation: Generates multiple AI frames per rendered frame, boosting performance up to 2x over DLSS 3 with transformer-based models.
  • 1,792 GB/s Memory Bandwidth: 32 GB of GDDR7 memory with PAM3 signaling delivers nearly 2x the bandwidth of the RTX 4090.
  • Mega Geometry: New cluster-based acceleration structures reduce BVH processing by 100x, enabling full-fidelity ray tracing with Nanite-style engines.
  • Neural Shaders: Small neural networks run directly inside programmable shaders, powering RTX Neural Materials, Neural Texture Compression, and RTX Neural Faces.

Understanding the RTX Blackwell Architecture

The NVIDIA RTX Blackwell GPU architecture represents a fundamental shift in how graphics processors handle rendering workloads. Named after David H. Blackwell, the pioneering American mathematician known for contributions to probability theory and game theory, this architecture places neural rendering at the center of its design philosophy. Unlike previous generations that treated AI acceleration as a supplementary feature, RTX Blackwell was engineered from the ground up to make neural networks a first-class citizen in the graphics pipeline.

At its core, the RTX Blackwell architecture builds upon foundational AI technologies introduced in Turing, Ampere, and Ada Lovelace GPUs, but takes a decisive leap forward. The architecture enables next-generation AI-powered gaming and professional applications to reach new levels of graphical realism, interactivity, and design capability. NVIDIA’s vision is clear: the age of neural rendering has arrived, and Blackwell is the hardware platform purpose-built to deliver it. As explored in our guide to AI-accelerated computing, the convergence of AI and graphics processing represents one of the most significant paradigm shifts in computing history.

The key features of the NVIDIA RTX Blackwell architecture include new SM features built for neural shading, MaxQ power efficiency improvements, 4th-generation RT Cores with Mega Geometry support, 5th-generation Tensor Cores with FP4 capabilities, DLSS 4 with multi-frame generation, RTX Neural Shaders, an AI Management Processor (AMP), GDDR7 memory, and a suite of new video and display features. Each of these innovations works in concert to deliver what NVIDIA describes as a “significant AI TOPS increase per frame,” marking an inflection point where neural rendering techniques generate the majority of on-screen pixels.

The GeForce RTX 5090, RTX 5080, RTX 5070 Ti, and RTX 5070 are the first graphics cards based on the RTX Blackwell architecture. The flagship RTX 5090 uses the GB202 GPU, while the RTX 5080 employs the GB203, and the RTX 5070 series utilizes the GB205. All three GPU variants share the same underlying architecture but are configured to serve different market segments and usage models, from enthusiast gaming to professional content creation.

GB202 GPU and Streaming Multiprocessor Design

The GB202 GPU sits at the apex of the RTX Blackwell family, representing NVIDIA’s most powerful consumer graphics processor to date. The full chip includes 12 Graphics Processing Clusters (GPCs), 96 Texture Processing Clusters (TPCs), 192 Streaming Multiprocessors (SMs), and a 512-bit memory interface with sixteen 32-bit memory controllers. This massive silicon design houses 24,576 CUDA Cores, 192 RT Cores, 768 Tensor Cores, and 768 Texture Units in its full configuration.

Each GPC contains a dedicated Raster Engine, two Raster Operations (ROP) partitions with eight individual ROP units each, and eight TPCs. Every TPC includes one PolyMorph Engine and two SMs. The full GB202 GPU features 128 MB of L2 cache, while the GeForce RTX 5090 implementation specifically includes 96 MB of L2 cache—a substantial pool of fast cache memory that benefits all applications, particularly complex operations like path tracing.

The Blackwell SM architecture has been redesigned with neural shading as a primary design objective. Each SM contains 128 CUDA Cores, one 4th-generation RT Core, four 5th-generation Tensor Cores, 4 Texture Units, a 256 KB Register File, and 128 KB of configurable L1/Shared Memory. A critical architectural change in Blackwell involves the full unification of INT32 cores with FP32 cores, effectively doubling throughput for many integer instructions compared to Ada. This enhancement directly accelerates address generation workloads crucial for neural shading operations.

The number of texture units has increased significantly from 512 in the GeForce RTX 4090 to 680 in the GeForce RTX 5090. This increase drives bilinear-filtered texel rates to 1,636.76 Gigatexels per second, up from 1,290.2 Gigatexels per second in the previous generation. Additionally, the Blackwell SM doubles the performance of point-sampling textures per cycle compared to Ada, which accelerates texture access algorithms such as Stochastic Texture Filtering used with new Neural Texture Compression methods. According to NVIDIA’s developer documentation, this architectural evolution represents the transition from SMs “designed and optimized for standard shaders” to SMs “designed and optimized for neural shaders.”
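The texel-rate figures above follow directly from multiplying texture units by clock speed. A minimal sketch of that arithmetic (the 2,520 MHz RTX 4090 boost clock used here is an assumption, as this article only states the RTX 5090's 2,407 MHz):

```python
def bilinear_texel_rate_gt(texture_units: int, boost_clock_ghz: float) -> float:
    """Peak bilinear-filtered texel rate in Gigatexels/sec,
    assuming each texture unit filters one texel per clock."""
    return texture_units * boost_clock_ghz

print(bilinear_texel_rate_gt(680, 2.407))  # RTX 5090 -> ~1636.76 GT/s
print(bilinear_texel_rate_gt(512, 2.520))  # RTX 4090 -> ~1290.24 GT/s
```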

GDDR7 Memory Subsystem and Bandwidth

The NVIDIA RTX Blackwell GPU architecture introduces GDDR7, a new ultra-low-voltage memory standard that fundamentally changes how GPU memory subsystems operate. GDDR7 employs PAM3 (Pulse Amplitude Modulation with 3 levels) signaling, a significant departure from the PAM4 signaling used in GDDR6X. Although moving from four signal levels to three may seem like a step backward, the wider spacing between levels improves the signal-to-noise ratio, which in turn enables higher clock speeds and greater overall bandwidth.

The GeForce RTX 5090 ships with 32 GB of GDDR7 memory running at 28 Gbps across a 512-bit interface, delivering 1,792 GB/sec of peak memory bandwidth. This represents a 78% increase over the RTX 4090’s 1,008 GB/sec, while also increasing the frame buffer from 24 GB to 32 GB. The GeForce RTX 5080 utilizes 30 Gbps GDDR7 memory, achieving 960 GB/sec of peak bandwidth. NVIDIA worked with JEDEC, the memory standards body, to establish PAM3 as the foundational high-frequency signaling technology for the GDDR7 standard.
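These bandwidth figures come from a simple formula: per-pin data rate times bus width, converted to bytes. A quick sketch (the RTX 5080's 256-bit bus is inferred from its stated numbers, and the RTX 4090's 21 Gbps GDDR6X rate is assumed, as neither is given in this article):

```python
def memory_bandwidth_gbs(data_rate_gbps: float, bus_width_bits: int) -> float:
    """Peak memory bandwidth in GB/sec: per-pin data rate times bus width in bytes."""
    return data_rate_gbps * bus_width_bits / 8

print(memory_bandwidth_gbs(28, 512))  # RTX 5090 -> 1792.0 GB/s
print(memory_bandwidth_gbs(30, 256))  # RTX 5080 -> 960.0 GB/s
print(memory_bandwidth_gbs(21, 384))  # RTX 4090 -> 1008.0 GB/s
```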

The move from PAM4 to PAM3 signaling, combined with an innovative pin-encoding scheme, allows GDDR7 to achieve a significantly enhanced signal-to-noise ratio. The standard also doubles the number of independent channels with minimal I/O density overhead. With increased channel density, improved PAM3 SNR, advanced equalization schemes, a reengineered clocking architecture, and enhanced I/O training, GDDR7 delivers substantially higher bandwidth while simultaneously improving energy efficiency—a critical factor for power-constrained systems like gaming laptops.

An important addition in the GDDR7 implementation is built-in ECC (Error Correction Code) capability directly on the DRAM die. ECC is always enabled on GeForce RTX GPUs with GDDR7 memory, supporting single-bit error correction with no performance penalty. This eliminates the need for a software toggle, as the protection is always active. RTX Blackwell GPUs with GDDR7 also support EDR (Error Detection and Replay) technology, similar to previous GPUs with GDDR6X, further enhancing data integrity. For professionals working with large datasets, this combination of bandwidth and reliability is transformative—explore our enterprise data processing guide for more on high-performance computing workflows.


5th-Generation Tensor Cores and FP4 Support

Tensor Cores have been the backbone of AI acceleration in NVIDIA GPUs since their introduction in the Volta architecture, and the 5th-generation Tensor Cores in RTX Blackwell deliver the most significant upgrade yet. These specialized compute cores handle the matrix multiply and accumulate operations fundamental to deep learning, and the Blackwell implementation supports FP16, BF16, TF32, INT8, and FP8, while adding new FP4 and FP6 precisions. FP4 support can effectively double AI throughput while halving memory requirements compared to FP8.

The practical impact of FP4 support is substantial. Consider Black Forest Labs’ FLUX.dev model: at FP16 precision, it requires over 23 GB of VRAM, limiting it to only the RTX 4090, RTX 5090, and professional GPUs. With FP4 quantization, the same model requires less than 10 GB, making it accessible on a much broader range of GeForce RTX GPUs. Performance improvements are equally dramatic—the RTX 4090 generates images with FLUX.dev in 15 seconds using FP16, while the RTX 5090 with FP4 completes the same task in just over five seconds.
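The VRAM figures above follow from basic arithmetic on weight storage. A minimal sketch (the ~12-billion-parameter size for FLUX.dev is an assumption for illustration, and real totals are higher because activations and other buffers add overhead on top of the weights):

```python
def weight_vram_gb(params_billion: float, bits_per_weight: int) -> float:
    """Approximate VRAM needed just for model weights, in GB."""
    return params_billion * bits_per_weight / 8

# Assumed ~12B parameters for FLUX.dev; overhead pushes real totals above these.
print(weight_vram_gb(12, 16))  # FP16 -> 24.0 GB of weights
print(weight_vram_gb(12, 4))   # FP4  ->  6.0 GB of weights
```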

RTX Blackwell also includes support for the new Second-Generation FP8 Transformer Engine, mirroring capabilities found in NVIDIA’s datacenter-class Blackwell GPUs. This cross-pollination between consumer and data center architectures ensures that models developed and optimized for cloud inference can run efficiently on desktop hardware. The NVIDIA TensorRT Model Optimizer enables advanced quantization methods that maintain virtually no quality loss when converting models from FP16 to FP4, making the performance gains practical for real-world applications.

The 5th-generation Tensor Cores deliver double the FP8 throughput of the previous generation, a specification that directly translates to faster AI inference for DLSS, neural shaders, and local AI model execution. With 680 Tensor Cores in the RTX 5090 (compared to 512 in the RTX 4090), the aggregate AI compute throughput reaches levels previously achievable only in data center environments. This democratization of AI compute power has profound implications for game developers, content creators, and researchers alike.

4th-Generation RT Cores and Mega Geometry

Ray tracing has been central to NVIDIA’s GPU strategy since the Turing architecture, and the 4th-generation RT Cores in RTX Blackwell deliver the most comprehensive upgrade to ray tracing hardware to date. The fundamental improvement is straightforward: Blackwell provides double the throughput for ray-triangle intersection testing compared to Ada’s 3rd-generation RT Cores. This directly translates to higher ray tracing performance in every application, from simple reflections to full path tracing.

Beyond raw intersection performance, the Blackwell RT Cores introduce Mega Geometry—a transformative technology aimed at dramatically increasing the geometric detail possible in ray-traced applications. The core problem Mega Geometry solves is the incompatibility between modern level-of-detail systems like Unreal Engine 5’s Nanite and real-time ray tracing. Nanite updates geometry by incrementally replacing small batches of approximately 128 triangles called clusters, but the frequent BVH (Bounding Volume Hierarchy) rebuilds this requires have historically overwhelmed ray tracing implementations.

Mega Geometry introduces Cluster-Level Acceleration Structures (CLAS): new BVH construction capabilities that treat clusters of up to 256 triangles as first-class primitives. Because each cluster groups on the order of a hundred triangles, the number of primitives the BVH builder must process drops by roughly two orders of magnitude compared to traditional per-triangle methods. A game engine can cache CLAS and rebuild the affected BVHs from them when LOD switches occur, enabling smooth level-of-detail transitions in fully ray-traced environments.
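The order-of-magnitude claim can be checked with back-of-the-envelope arithmetic: grouping triangles into fixed-size clusters divides the primitive count the builder sees by the cluster size. A minimal sketch using Nanite's ~128-triangle clusters:

```python
def bvh_primitive_count(triangles: int, cluster_size: int = 128) -> int:
    """Number of top-level primitives the BVH builder must process when
    triangles are grouped into fixed-size clusters (CLAS-style)."""
    return -(-triangles // cluster_size)  # ceiling division

tris = 1_000_000
print(bvh_primitive_count(tris, 1))    # per-triangle BVH: 1,000,000 primitives
print(bvh_primitive_count(tris, 128))  # cluster BVH: 7,813 primitives, ~128x fewer
```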

For high object count scenes, Mega Geometry introduces the Partitioned Top-Level Acceleration Structure (PTLAS). Instead of rebuilding the entire TLAS from scratch every frame, PTLAS exploits the fact that most objects are static between frames. The application manages partitions and updates only those that have changed, delivering massive runtime savings for large open-world games. Mega Geometry is supported across DirectX 12 (via NVAPI), Vulkan vendor extensions, and OptiX 9.0, and remarkably, the API-level features work on all RTX GPUs back to Turing, while Blackwell’s specialized cluster engines provide hardware acceleration with up to 2x intersection rates.
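The partition-update idea can be illustrated with a toy data structure. This is purely a conceptual sketch, not the real NVAPI or Vulkan interface: instances live in named partitions, and only partitions whose instances changed since the last frame are rebuilt.

```python
class PartitionedTLAS:
    """Toy sketch of the PTLAS concept: rebuild only changed partitions,
    not the entire top-level acceleration structure, each frame."""

    def __init__(self):
        self.partitions: dict[str, dict[int, tuple]] = {}
        self.dirty: set[str] = set()

    def add_instance(self, partition: str, instance_id: int, transform: tuple):
        self.partitions.setdefault(partition, {})[instance_id] = transform
        self.dirty.add(partition)

    def move_instance(self, partition: str, instance_id: int, transform: tuple):
        self.partitions[partition][instance_id] = transform
        self.dirty.add(partition)  # only this partition needs a rebuild

    def build_frame(self) -> list[str]:
        rebuilt = sorted(self.dirty)  # in a real engine: rebuild these BVH regions
        self.dirty.clear()
        return rebuilt

tlas = PartitionedTLAS()
tlas.add_instance("static_world", 0, (0.0, 0.0, 0.0))
tlas.add_instance("vehicles", 1, (5.0, 0.0, 0.0))
tlas.build_frame()                               # initial build touches everything
tlas.move_instance("vehicles", 1, (6.0, 0.0, 0.0))
print(tlas.build_frame())                        # only the "vehicles" partition rebuilds
```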

The Blackwell RT Core also introduces hardware-accelerated Linear Swept Spheres (LSS)—a new primitive for rendering hair, fur, grass, and strand-like objects. LSS constructs geometry by sweeping spheres across space in linear segments with variable radii, replacing the computationally expensive custom intersection shaders previously required for curve primitives. Hair rendering with LSS is approximately 2x faster than the Disjoint Orthogonal Triangle Strips (DOTS) method while requiring about 5x less VRAM to store the geometry. As noted in our real-time rendering techniques overview, these hardware primitives eliminate the quality-versus-performance trade-offs that previously limited strand rendering in real-time applications.
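The geometry of the LSS primitive itself is simple: a sphere whose radius interpolates between two endpoint radii as it sweeps along a line segment. The sketch below computes an approximate signed distance to such a primitive (using the closest point on the axis, a common approximation; the exact surface for unequal radii is a sphere-capped cone, and real intersection runs in the RT Core, not shader code):

```python
import math

def lss_signed_distance(p, a, b, r0, r1):
    """Approximate signed distance from point p to a linear swept sphere:
    a sphere whose radius lerps from r0 to r1 along segment a->b."""
    ab = [b[i] - a[i] for i in range(3)]
    ap = [p[i] - a[i] for i in range(3)]
    denom = sum(x * x for x in ab) or 1e-12
    t = max(0.0, min(1.0, sum(ap[i] * ab[i] for i in range(3)) / denom))
    c = [a[i] + t * ab[i] for i in range(3)]    # closest point on the axis
    r = r0 + t * (r1 - r0)                      # swept radius at that point
    return math.dist(p, c) - r                  # negative -> inside the strand

# A point 2 units above the start of a unit-radius strand is 1 unit outside it.
print(lss_signed_distance((0, 0, 2), (0, 0, 0), (1, 0, 0), 1.0, 1.0))  # -> 1.0
```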

DLSS 4 Multi-Frame Generation Technology

DLSS has been NVIDIA’s flagship AI rendering technology since its introduction, progressively evolving from basic super resolution to include frame generation and ray reconstruction. DLSS 4, exclusive to RTX Blackwell, introduces multi-frame generation—the ability to generate multiple AI frames for every single traditionally rendered frame. This technology boosts frame rates up to 2x over DLSS 3/3.5 while maintaining or exceeding native image quality with low system latency.

The multi-frame generation approach represents a paradigm shift in how DLSS operates. Frame Generation on RTX 40-series GPUs produced one additional frame between each pair of rendered frames. DLSS 4 on Blackwell can generate up to three additional frames per rendered frame, multiplying the output frame rate accordingly. The key enabler is the increased Tensor Core throughput in Blackwell, which provides enough AI compute headroom to run the generation models multiple times within a single frame interval.
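The frame-rate arithmetic is straightforward. A minimal sketch of the upper bound (real output is lower because generation itself consumes some frame time):

```python
def mfg_output_fps(rendered_fps: float, generated_per_rendered: int) -> float:
    """Upper-bound output frame rate when N AI-generated frames follow
    every rendered frame (ignores generation overhead)."""
    return rendered_fps * (1 + generated_per_rendered)

print(mfg_output_fps(60, 1))  # single frame generation: 120 fps ceiling
print(mfg_output_fps(60, 3))  # multi-frame generation, 4x mode: 240 fps ceiling
```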

DLSS 4 also transitions from Convolutional Neural Network (CNN) models to transformer-based AI models for both Super Resolution (SR) and Ray Reconstruction (RR). Transformer models deliver superior image quality with better temporal stability, sharper details, and more accurate reconstruction of ray-traced lighting effects. The transformer-based DLSS Super Resolution produces higher quality upscaling with fewer artifacts, while the transformer-based Ray Reconstruction more accurately denoises and reconstructs missing details in path-traced scenes, as documented in NVIDIA’s official DLSS technology page.

Deep Learning Anti-Aliasing (DLAA) also benefits from the transformer model upgrade, providing improved edge quality at native resolution. The combination of multi-frame generation with transformer-based SR and RR means that RTX Blackwell GPUs can deliver dramatically higher frame rates with better visual quality than any previous generation—a seemingly contradictory achievement made possible by the maturation of neural rendering techniques.


Neural Shaders and AI-Powered Graphics

Neural Shaders represent perhaps the most forward-looking innovation in the RTX Blackwell architecture. This technology brings small neural networks directly into programmable shaders, enabling a new era of graphics innovation where AI inference happens at the per-pixel level during rendering. Rather than running AI as a post-process or separate pass, Neural Shaders integrate machine learning models into the real-time graphics pipeline itself.

The immediate applications of Neural Shaders span several groundbreaking technologies. RTX Neural Materials enables film-quality material representation in real time by using compact neural networks to encode complex material properties that would be prohibitively expensive to compute with traditional shader code. Instead of storing massive texture sets for physically-based materials, a small neural network can represent the complete material response, dramatically reducing VRAM usage while increasing visual fidelity.

RTX Neural Texture Compression (NTC) uses neural networks to compress texture data far more efficiently than traditional block compression methods. The compressed textures require significantly less VRAM while maintaining higher visual quality, and the decompression happens in real-time within the shader. This technology pairs with the Blackwell SM’s doubled point-sampling texture performance and Stochastic Texture Filtering to deliver a complete neural texture pipeline.
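The shape of the computation is worth seeing concretely: a compact latent vector is stored instead of texels, and a tiny network reconstructs the color at sample time. The miniature decoder below is entirely hypothetical, with made-up layer sizes and random weights, purely to illustrate the idea of in-shader MLP evaluation:

```python
import math
import random

random.seed(0)

# Hypothetical miniature decoder: a small latent per texel region plus a
# tiny MLP that reconstructs RGB on demand. Sizes and weights are made up.
LATENT_DIM, HIDDEN = 8, 16
W1 = [[random.gauss(0, 0.5) for _ in range(LATENT_DIM)] for _ in range(HIDDEN)]
W2 = [[random.gauss(0, 0.5) for _ in range(HIDDEN)] for _ in range(3)]

def decode_texel(latent):
    """Evaluate the tiny MLP for one texel: latent -> ReLU hidden -> sigmoid RGB."""
    h = [max(0.0, sum(w * x for w, x in zip(row, latent))) for row in W1]
    return [1 / (1 + math.exp(-sum(w * x for w, x in zip(row, h)))) for row in W2]

rgb = decode_texel([random.gauss(0, 1) for _ in range(LATENT_DIM)])
print(rgb)  # three channel values, each in [0, 1]
```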

The Neural Radiance Cache (NRC) applies neural networks to global illumination, learning and storing the radiance field of a scene to provide realistic indirect lighting without the massive ray counts traditionally required. NRC dramatically reduces the computational cost of high-quality global illumination, making path-traced quality lighting feasible in real-time applications. RTX Skin uses neural networks to render incredibly lifelike translucent materials, particularly human skin, where subsurface scattering effects are notoriously difficult to approximate with traditional techniques.

RTX Neural Faces takes neural rendering to the final frontier of visual realism—human faces. By combining neural networks with traditional rendering, this technology produces highly detailed facial rendering that captures the subtle play of light across skin pores, fine hair, and subsurface structures. Shader Execution Reordering (SER) 2.0, enhanced in Blackwell with twice the efficiency of Ada’s implementation, plays a critical role in Neural Shader performance by organizing parallel threads for maximum hardware utilization during neural workloads. According to research published by leading computer graphics researchers, these neural shader techniques represent a fundamental shift from hand-authored rendering algorithms to learned representations.

AI Management Processor and Power Efficiency

The AI Management Processor (AMP) is a fully programmable context scheduler built directly into the GPU, designed to offload scheduling of GPU contexts from the system CPU. Implemented using a dedicated RISC-V processor located at the front of the GPU pipeline, AMP provides faster scheduling with lower latency than prior CPU-driven methods. This is crucial for modern AI-enhanced gaming, where multiple AI models—speech recognition, translation, vision, animation, behavior systems—need to share GPU resources simultaneously with graphics workloads.

AMP aligns with the Microsoft Hardware-Accelerated GPU Scheduling (HAGS) architecture, allowing the GPU to manage its own memory and task scheduling more efficiently. By reducing the back-and-forth communication between CPU and GPU, AMP delivers smoother frame rates, reduced stuttering, and better multitasking. For Large Language Models, AMP reduces time-to-first-response; for games, it prioritizes rendering work to prevent frame drops. The result is a significantly improved quality-of-service across diverse workload combinations.

Power efficiency receives equally significant attention in RTX Blackwell through new MaxQ technologies. Advanced power gating with split power rails provides fine-grained control and delivery of power to different on-chip subsystems, allowing unused portions of the GPU to be effectively powered down at a granular level. Accelerated frequency switching enables clocks to adjust to dynamic workloads 1,000 times faster than previous architectures—a capability that ensures the GPU operates at optimal frequency for each workload phase without the latency of traditional clock transitions.

Low Latency Sleep is another MaxQ innovation that improves battery life in laptop implementations by allowing the GPU to enter deep sleep states more quickly and wake more responsively. The combination of these power management features is particularly significant given the RTX 5090’s 575W TGP (Total Graphics Power)—a substantial increase from the RTX 4090’s 450W. The MaxQ technologies ensure that this power is used efficiently, with every watt directed toward productive computation. The manufacturing process remains TSMC 4nm (4N NVIDIA Custom Process), while the PCIe interface upgrades to Gen 5 for increased host communication bandwidth.

RTX 5090 Specifications and Performance

The GeForce RTX 5090 stands as the flagship implementation of the RTX Blackwell architecture, and its specification sheet tells a compelling story of generational advancement. With 21,760 CUDA cores across 170 SMs (up from 16,384 cores and 128 SMs in the RTX 4090), the RTX 5090 delivers a 33% increase in shader processing resources. The GPU boost clock of 2,407 MHz, combined with the architectural improvements to the SM, delivers substantially higher effective throughput per clock cycle.

Ray tracing performance sees a dramatic leap, with 170 4th-generation RT Cores delivering 317.5 RT TFLOPS compared to 191 RT TFLOPS in the RTX 4090—a 66% improvement in raw ray tracing compute. The 680 5th-generation Tensor Cores, up from 512 in Ada, provide the AI compute throughput necessary for DLSS 4 multi-frame generation, neural shaders, and local AI model execution. The combination of higher core counts and architectural improvements means that real-world gaming performance gains are typically even larger than the raw specification increases suggest.

| Specification | RTX 3090 | RTX 4090 | RTX 5090 |
| --- | --- | --- | --- |
| GPU | GA102 (Ampere) | AD102 (Ada) | GB202 (Blackwell) |
| CUDA Cores | 10,496 | 16,384 | 21,760 |
| Tensor Cores | 328 (3rd Gen) | 512 (4th Gen) | 680 (5th Gen) |
| RT Cores | 82 (2nd Gen) | 128 (3rd Gen) | 170 (4th Gen) |
| RT TFLOPS | 69.5 | 191 | 317.5 |
| Memory | 24 GB GDDR6X | 24 GB GDDR6X | 32 GB GDDR7 |
| Memory Bandwidth | 936 GB/s | 1,008 GB/s | 1,792 GB/s |
| Memory Interface | 384-bit | 384-bit | 512-bit |
| L2 Cache | 6 MB | 72 MB | 96 MB |
| TGP | 350W | 450W | 575W |
| PCIe | Gen 4 | Gen 4 | Gen 5 |

The memory subsystem upgrade from 24 GB GDDR6X to 32 GB GDDR7 with a wider 512-bit interface delivers nearly double the bandwidth (1,792 vs 1,008 GB/sec) while adding 33% more capacity. The L2 cache grows from 72 MB to 96 MB, and the PCIe interface upgrades from Gen 4 to Gen 5. Video encoding and decoding capabilities also advance, with the 9th-generation NVENC and 6th-generation NVDEC adding hardware support for 4:2:2 chroma subsampling in H.264 and H.265—a critical feature for professional video workflows where color accuracy during editing is paramount.

Implications for Gaming and Professional Workflows

The NVIDIA RTX Blackwell GPU architecture’s impact extends far beyond raw performance metrics. For game developers, Mega Geometry eliminates the long-standing compromise between geometric detail and ray tracing quality. Engines like Unreal Engine 5 can now ray trace Nanite geometry at full fidelity, enabling shadows, reflections, and indirect illumination at quality levels previously impossible in real-time. Linear Swept Spheres transform hair and fur rendering from an approximation exercise into a high-fidelity, hardware-accelerated operation.

For creative professionals, the combination of 32 GB GDDR7, FP4 Tensor Core support, and the AI Management Processor creates a workstation-class AI compute platform in a consumer form factor. Local execution of large generative AI models becomes practical—models that previously required cloud inference or professional GPUs can now run on a desktop RTX 5090. The 9th-generation NVENC with 4:2:2 support, combined with DisplayPort 2.1b output, serves video professionals who demand maximum color fidelity throughout their pipeline.

Neural Shaders open entirely new creative possibilities. Game developers can implement film-quality materials, photorealistic skin rendering, and neural radiance caching without the massive memory and compute budgets these techniques traditionally required. The transition from hand-authored shader code to learned neural representations will likely accelerate as the tooling and developer ecosystem around Neural Shaders matures. NVIDIA’s investment in making these capabilities available across DirectX 12, Vulkan, and OptiX ensures broad adoption potential.

The broader industry trend that RTX Blackwell exemplifies is the convergence of traditional graphics rendering with AI inference. Each generation of RTX GPUs has shifted more of the rendering workload from deterministic shader programs to AI models, and Blackwell represents the tipping point where neural rendering techniques become the primary driver of visual quality improvement. With neural rendering, image quality is now improving faster than Moore’s Law alone would allow, and the Blackwell architecture provides the hardware foundation for this trend to accelerate in the years ahead.


Frequently Asked Questions

What is the NVIDIA RTX Blackwell GPU architecture?

The NVIDIA RTX Blackwell GPU architecture is NVIDIA’s latest graphics processor design built for neural rendering. It features 5th-generation Tensor Cores, 4th-generation RT Cores, DLSS 4 with multi-frame generation, GDDR7 memory, and the new AI Management Processor (AMP) for coordinating AI and graphics workloads simultaneously.

How many CUDA cores does the RTX 5090 have?

The GeForce RTX 5090, powered by the GB202 GPU, features 21,760 CUDA cores across 170 Streaming Multiprocessors. The full GB202 chip contains 24,576 CUDA cores across 192 SMs, but the RTX 5090 uses a slightly trimmed configuration with 170 active SMs.

What is DLSS 4 multi-frame generation?

DLSS 4 multi-frame generation is an AI-powered technology exclusive to RTX Blackwell GPUs that generates multiple frames for every traditionally rendered frame. It boosts frame rates up to 2x over DLSS 3 while maintaining native image quality and low system latency, using new transformer-based AI models.

What is Mega Geometry in RTX Blackwell?

Mega Geometry is an RTX technology that dramatically increases geometric detail in ray-traced applications. It enables game engines like Unreal Engine 5 with Nanite to ray trace geometry at full fidelity using Cluster-level Acceleration Structures (CLAS) and Partitioned Top-Level Acceleration Structures (PTLAS), reducing BVH processing by two orders of magnitude.

What memory does the RTX 5090 use and how much bandwidth does it provide?

The GeForce RTX 5090 features 32 GB of GDDR7 memory with a 512-bit interface running at 28 Gbps, delivering 1,792 GB/sec of peak memory bandwidth. GDDR7 uses PAM3 signaling technology for improved signal-to-noise ratio and energy efficiency compared to previous GDDR6X memory.

What are Neural Shaders in RTX Blackwell?

Neural Shaders bring small neural networks directly into programmable shaders, enabling real-time AI-powered graphics techniques. They power RTX Neural Materials for film-quality assets, Neural Texture Compression (NTC) for reduced VRAM usage, Neural Radiance Cache for realistic lighting, RTX Skin for lifelike translucent materials, and RTX Neural Faces for detailed facial rendering.
