NVIDIA's Design of AI Factory and Agentic AI

NVIDIA's Strategic Shift from GPU Company to AI Factory Designer, as Seen in the GTC 2025 and GTC Taipei 2026 Keynotes


This article explores the themes of AI Factory and Agentic AI, which featured prominently in the GTC 2025 and GTC Taipei 2026 keynotes.

The core message from NVIDIA's GTC 2025 keynote was not merely the introduction of faster GPUs. It was a declaration that artificial intelligence has transitioned from being an experimental technology that consumes resources to becoming a production infrastructure that generates real revenue.

While GTC 2025 showcased the direction of AI infrastructure, Jensen Huang explained at GTC Taipei 2026 what kind of computing architecture would operate on it. AI has now entered the stage of Agentic AI, where it plans, uses tools, and performs tasks autonomously.

The change is more significant than anticipated. The computing architecture itself is shifting from app-centric to agent-centric, and as a result, data centers are being redefined from server rooms holding multiple GPUs to 'AI Factories'. The term covers the entire industrial setup for mass-producing tokens, encompassing power, cooling, networking, and scheduling software.

In the past, the focus of computing infrastructure competition was on securing higher computational performance (FLOPS). In the AI era, the key KPI has shifted to 'producing response tokens quickly and cheaply'.

As global data center capital expenditure (CAPEX) heads towards a trillion-dollar market, NVIDIA is no longer just a chip manufacturer. Beyond GPUs, it designs networks, cooling systems, optical communications, data center software, and robotic platforms, positioning itself as the architect of the next-generation AI production line.


1. Paradigm Shift in Computing

The Limits of General Computing and the Shift to Accelerated Computing

For decades, the focus of computing was on general-purpose CPUs, executing human-written code sequentially and calculating according to set rules. However, with the explosion of large-scale data and AI workloads, traditional computing methods have reached their limits in terms of efficiency and power consumption.

AI models require massive parallel computation. They need to process thousands or tens of thousands of operations simultaneously, making accelerators like GPUs much more suitable than CPUs. Consequently, data centers are rapidly transitioning from CPU-centric structures to GPU-based Accelerated Computing.

Tokens per Watt

Simply put, this concept is about "how many tokens AI can produce per unit of power". The biggest constraint facing AI data centers is power supply. Large AI clusters consume as much power as a city, yet the maximum power that can be legally and physically delivered to a single data center building from a power plant or national grid is limited.

  • Past Approach: If more AI computation was needed, more servers and power were added.
  • Current Constraint: The supply of watts itself is capped by grid constraints.

Ultimately, the competition in AI infrastructure has shifted to "how many tokens can be produced within limited power". From this perspective, the competition in GPU performance is not just about benchmark performance but is akin to a productivity competition in a token production factory.
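The arithmetic behind this framing can be sketched in a few lines. This is a minimal illustration, not a measurement: the function names and every number below are hypothetical placeholders, and "tokens per watt" is treated as tokens per second per watt, which is the same as tokens per joule.

```python
def tokens_per_joule(tokens_per_second: float, power_watts: float) -> float:
    """Energy efficiency: tokens generated per joule (1 W = 1 J/s)."""
    if power_watts <= 0:
        raise ValueError("power_watts must be positive")
    return tokens_per_second / power_watts


def factory_output(power_budget_watts: float, efficiency: float) -> float:
    """Tokens per second a site can produce when power is the binding limit."""
    return power_budget_watts * efficiency


# Hypothetical systems, for illustration only.
old_eff = tokens_per_joule(tokens_per_second=1_000, power_watts=1_000)
new_eff = tokens_per_joule(tokens_per_second=3_000, power_watts=1_500)

budget = 100e6  # an assumed fixed grid allocation of 100 MW
print(factory_output(budget, old_eff))  # output with the older system
print(factory_output(budget, new_eff))  # output with the more efficient system
```

With the power budget fixed, total output scales only with efficiency, which is why the second system doubles the factory's output even though it draws more power per unit.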

At GTC 2026, Jensen Huang repeatedly referred to AI data centers as Token Factories. The purpose of a data center is now to produce as many valid tokens (output) as possible within limited power and cost. FLOPS still matters, but it has shifted from being the goal to being a means of producing more tokens.

The Rise of Agentic AI

Three-Stage Evolution of the AI Paradigm

The role of AI is rapidly changing.

  • Information Retrieval: The stage of 'finding' existing information from databases.
  • Inference/Generation: The stage of understanding context and 'creating' new words and images.
  • Action/Execution (Agentic AI): The stage of autonomously planning and using tools to 'solve' for goals.

Agentic AI requires much more computation because it involves multiple reasoning processes or repeated tool usage.

Explosive Increase in Inference Computation

With the emergence of reasoning models that solve complex problems by dividing them into multiple stages, the inference computation required at the service stage has increased dramatically. While past AI services operated at the question-and-answer level, they now repeat a cycle of planning, searching, verifying, re-reasoning, and executing.

NVIDIA stated at GTC 2025 that, with the emergence of these reasoning-based AI workloads, the computation needed for inference is now estimated at up to 100 times previous expectations. This reflects not merely larger models but a change in how AI is used.

In the early AI industry, most costs were concentrated on model training, but now real-time inference performance has become more important.

Key Metrics That Have Gained Importance

  • TTFT (Time To First Token)
    The time it takes for a user to receive the first response token after making a request.

  • Throughput
    How many tokens can be generated per second.

  • Tokens per Watt
    The number of tokens that can be produced per unit of power.
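As a rough sketch of how these three metrics relate, the snippet below derives them from a single request trace. The `RequestTrace` structure, its field names, and the sample values are assumptions made for illustration; real serving stacks expose these measurements in their own ways.

```python
from dataclasses import dataclass


@dataclass
class RequestTrace:
    sent_at: float            # seconds, when the request was sent
    token_times: list[float]  # seconds, arrival time of each output token
    avg_power_watts: float    # average power attributed to the request (assumed known)


def ttft(trace: RequestTrace) -> float:
    """Time To First Token: delay until the first output token arrives."""
    return trace.token_times[0] - trace.sent_at


def throughput(trace: RequestTrace) -> float:
    """Output tokens per second over the whole request."""
    duration = trace.token_times[-1] - trace.sent_at
    return len(trace.token_times) / duration


def tokens_per_watt(trace: RequestTrace) -> float:
    """Tokens per second per watt, i.e. tokens per joule."""
    return throughput(trace) / trace.avg_power_watts


# Hypothetical trace: 5 tokens, the first after 0.5 s, the last after 1.0 s.
trace = RequestTrace(sent_at=0.0,
                     token_times=[0.5, 0.6, 0.7, 0.8, 1.0],
                     avg_power_watts=500.0)
print(ttft(trace), throughput(trace), tokens_per_watt(trace))
```

TTFT describes what a single user feels, while throughput and tokens per watt describe what the operator gets out of the hardware, so the three can pull in different directions.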

From Apps to Agents

At GTC 2026, Jensen Huang described the future computing architecture quite directly. In the past, computers were Application + Operating System structures. Users would run programs and repeatedly click and input.

In contrast, future computing will have the following structure:

Model (LLM)

Harness (orchestration)

Tools (DB, Python, Browser, CAD)

Memory (Short / Long-term)

Users no longer run apps to manipulate them directly. They convey their intentions, and agents autonomously call tools, set intermediate plans, and perform tasks. This signifies a shift in the interface itself from a computing environment where apps are clicked to an environment where tasks are delegated. From NVIDIA's perspective, the more agents there are, the more inference is required, which ultimately leads to increased demand for GPUs.
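The Model, Harness, Tools, and Memory structure can be sketched as a small loop. This is a toy under stated assumptions: `stub_model` is a hard-coded stand-in for an LLM, the `calculator` tool and the `run_agent` harness are hypothetical names, and no real agent framework is being reproduced here.

```python
from typing import Callable

# Tools: plain functions the harness may call on the model's behalf.
TOOLS: dict[str, Callable[[str], str]] = {
    "calculator": lambda expr: str(sum(int(x) for x in expr.split("+"))),
}


def stub_model(goal: str, memory: list[str]) -> dict:
    """Stand-in for an LLM: picks the next step from the goal and memory."""
    if not memory:  # nothing has been done yet, so request a tool call
        return {"action": "tool", "name": "calculator", "input": goal}
    return {"action": "final", "answer": f"The result is {memory[-1]}"}


def run_agent(goal: str, max_steps: int = 5) -> tuple[str, int]:
    """Harness: loops model -> tool -> memory until the model finishes."""
    memory: list[str] = []  # short-term memory for this task
    for model_calls in range(1, max_steps + 1):
        step = stub_model(goal, memory)
        if step["action"] == "final":
            return step["answer"], model_calls
        result = TOOLS[step["name"]](step["input"])
        memory.append(result)  # tool output becomes context for the next call
    raise RuntimeError("agent did not finish within max_steps")


answer, calls = run_agent("2+3+4")
print(answer, calls)
```

Even this trivial task needs two model calls instead of one, which hints at why agent workloads multiply inference demand.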


2. Innovation in Hardware Architecture

Blackwell and AI Factory-Centric Design

The key emphasis from NVIDIA at GTC 2025 was on keeping GPUs busy. In large-scale AI infrastructure, GPUs are the most expensive resource, but in actual data centers, bottlenecks occur due to data loading, GPU communication delays, heat, and cooling limitations, causing GPUs to often idle.

NVIDIA redesigned the entire system in the Blackwell generation to reduce these bottlenecks. The major change is that the design focus has shifted towards improving the overall efficiency of the AI Factory.

Enhanced AI-Based Rendering in Consumer GPUs

The changes in the Blackwell architecture are affecting not only data centers but also consumer GPUs.

Notably, in the GeForce RTX 50 series, the proportion of AI rendering based on DLSS 4 (Multi Frame Generation) has significantly expanded.

Traditional graphics rendering had the GPU directly calculate every pixel. Recently, however, the graphics pipeline has been moving toward AI predicting and filling in intermediate frames and details from a subset of the frame and pixel information.

This not only improves FPS but also shows that the method of graphic processing itself is changing towards more efficient use of computing resources.

Overcoming Power Limits with System Design

In an AI Factory, the largest cost is often not the GPU but power and cooling. Large-scale reasoning model inference in particular pushes GPU utilization to extremes, making system efficiency very important.

In the Blackwell generation, NVIDIA has set producing more tokens within the same power budget as a core goal.

Limitations of the Hopper Generation

The Hopper-based infrastructure of the previous generation achieved great results in large-scale AI training and inference, but as the scale increased, the following problems arose:

  • Increased cost of data movement between GPUs
  • Network bottlenecks
  • Limitations of air cooling
  • Increased power waste when expanding system scale

In particular, as the number of GPUs increases, moving data can cost more than the computation itself.

Approach of Blackwell NVL72

The representative system of the Blackwell generation, NVL72, focuses on reducing these bottlenecks.

  • Connects 72 GPUs like a single large accelerator computer
  • Ultra-fast interconnect based on NVLink
  • Large-scale liquid cooling
  • Integration with inference orchestration software

NVIDIA claims that when the Blackwell-based system is combined with NVIDIA Dynamo, it can improve inference throughput by up to 40 times compared to the Hopper generation.


3. Evolution of Software and Infrastructure

NVIDIA Dynamo: The Orchestration Layer of the AI Factory

When AI infrastructure is viewed as a giant factory (AI Factory), it needs an operating system to control it efficiently. NVIDIA has developed and open-sourced NVIDIA Dynamo, a data center-scale OS that dynamically distributes and manages the two core stages of inference workloads on AI accelerator infrastructure: Prefill [1] and Decode [2].
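The idea of handling Prefill and Decode as separate workloads can be illustrated with a toy scheduler. This is not Dynamo's API or its actual scheduling policy: the `Request` type, the `assign` helper, and the greedy least-loaded rule are assumptions chosen only to show why the two stages balance differently.

```python
import heapq
from dataclasses import dataclass


@dataclass
class Request:
    rid: str
    prompt_tokens: int  # processed in one pass during prefill
    output_tokens: int  # generated one by one during decode


def assign(requests, n_workers, cost):
    """Greedy least-loaded assignment; returns ({rid: worker}, per-worker load)."""
    heap = [(0, w) for w in range(n_workers)]  # (accumulated load, worker id)
    placement = {}
    for req in requests:
        load, worker = heapq.heappop(heap)  # pick the least-loaded worker
        placement[req.rid] = worker
        heapq.heappush(heap, (load + cost(req), worker))
    loads = [0] * n_workers
    for load, worker in heap:
        loads[worker] = load
    return placement, loads


# Hypothetical requests: long prompt, long answer, and something in between.
requests = [Request("a", 8000, 100), Request("b", 200, 2000), Request("c", 4000, 300)]

# Separate pools: prefill workers are loaded by prompt length,
# decode workers by the number of tokens to generate.
prefill_place, prefill_load = assign(requests, 2, lambda r: r.prompt_tokens)
decode_place, decode_load = assign(requests, 2, lambda r: r.output_tokens)
print(prefill_place, prefill_load)
print(decode_place, decode_load)
```

The same three requests land on different workers in each pool, because a request that is heavy in prefill can be light in decode and vice versa.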

Dynamo orchestrates large-scale computation workloads so they are evenly distributed across the infrastructure, preserving the token generation speed perceived by users while pushing the throughput of the entire AI factory toward the Pareto frontier [3]. NVIDIA says that, combined with the Blackwell NVL72 system, Dynamo can deliver up to 40 times the inference performance of Hopper under the same power conditions.
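The trade-off between per-user speed and factory-wide throughput can be made concrete with a small Pareto filter. The configuration names and numbers below are invented for illustration and do not come from NVIDIA benchmarks.

```python
def pareto_frontier(configs):
    """Return names of configs that no other config dominates.

    Each config is (name, tokens_per_sec_per_user, tokens_per_sec_per_megawatt).
    A config is dominated if another one is at least as good on both metrics
    and strictly better on at least one.
    """
    frontier = []
    for name, speed, tput in configs:
        dominated = any(
            s >= speed and t >= tput and (s > speed or t > tput)
            for _, s, t in configs
        )
        if not dominated:
            frontier.append(name)
    return frontier


# Hypothetical operating points of one system under different batch sizes.
configs = [
    ("small-batch", 200, 1_000),    # fast per user, low total throughput
    ("medium-batch", 100, 5_000),
    ("large-batch", 20, 9_000),     # slow per user, high total throughput
    ("misconfigured", 80, 3_000),   # worse than medium-batch on both metrics
]
print(pareto_frontier(configs))
```

Every point on the frontier is a defensible choice. Only the dominated configuration is simply wasteful, and an orchestration layer aims to keep the factory away from such points.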

Silicon Photonics and Eliminating Network Bottlenecks

Communication Issues Between GPUs

To train the latest AI or run agents on a large scale, tens of thousands of GPUs need to be connected like a giant computer. Previously, copper wires (electrical signals) were mainly used as communication lines to connect GPUs and exchange data, but as the data volume exploded, problems arose.

Electricity generates heat and weakens signals due to resistance as the distance increases. Connecting tens of thousands of GPUs with copper wires wastes hundreds of megawatts (enough power for an entire small city) on running communication lines and cooling fans.

Co-Packaged Optics (CPO)

To solve this, NVIDIA unveiled a network strategy based on Silicon Photonics. This is a technology that transmits data using light (optical signals) instead of electricity.

The focus of the photonics strategy unveiled at GTC 2025 is closer to network switches rather than the GPU itself.

In the existing structure, signals had to pass through the Switch ASIC, transceiver, and optical cable stages in turn. Co-Packaged Optics (CPO), by contrast, integrates the optical communication modules close to the Switch ASIC [4] to reduce signal conversion costs, power loss, and heat generation while securing higher bandwidth.

This indicates that NVIDIA has started to focus more on making connections between GPUs more efficient than on making GPUs faster.

MRM (Micro Ring Resonator Modulator)

One of the core technologies of silicon photonics is MRM.

Simply put, it is like a miniature optical switch that turns light signals on and off very quickly. A very small ring structure is created on silicon, and specific wavelengths of light are allowed to pass or blocked using minute voltage changes. This allows ultra-high-speed data transmission with very low power.

NVIDIA emphasizes a 1.6 Tb/s connection speed in its photonics-based network and presents it as one of the key technologies for future AI Factory scale expansion.

Advanced Packaging and Supply Chain

This photonics strategy requires close integration not only of GPUs but also of semiconductor manufacturing, optical components, advanced packaging, and cooling systems. In particular, TSMC's advanced packaging technology is expected to play an important role in NVIDIA's next-generation system design.

The 'AI Factory' that NVIDIA repeatedly mentions is ultimately a competition of the entire supply chain system.


4. Three-Year Roadmap

At GTC 2025, NVIDIA went beyond simple product announcements to reveal a very specific roadmap for infrastructure supply over the coming years.

This holds significance beyond mere marketing announcements. AI infrastructure investment is a business involving CAPEX in the tens to hundreds of billions of dollars. Cloud providers and data center operators place great importance on long-term supply predictability. Knowing when and what performance infrastructure will be available is directly linked to customers' investment plans.

For this reason, NVIDIA is making its annual cadence, a new architecture every year, even clearer.

[Second half of 2025]
Blackwell Ultra (enhanced memory and inference performance)

[Second half of 2026]
Vera Rubin (HBM4, expanded NVLink)

[Second half of 2027]
Rubin Ultra (very large-scale NVLink expansion)

[Next generation]
Feynman

1) Connecting GPUs on a Larger Scale

As AI model sizes grow, improving the performance of a single GPU is not enough. The ultimate goal is to make many GPUs operate like a single giant computer. In other words, the competition is shifting from individual GPUs to cluster architecture.

2) Vera CPU

The role of the CPU is changing in the AI agent era. Existing CPUs were designed around human-driven workloads such as web servers, VMs, and general applications, where responses on the order of seconds were acceptable. Agentic AI is different. Agents call tools, read databases, search memory, and then reason again. The longer the GPU waits during this process, the higher the overall cost.

The Vera CPU unveiled at GTC 2026 is designed to address this issue. While the GPU performs inference, the Vera CPU handles agent runtime orchestration, data access, and retrieval. The role of the CPU is shifting from general-purpose computing to maximizing GPU utilization.
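Why waiting time matters can be shown with simple arithmetic. The sketch below is an assumption-laden illustration: the function names, the timings, and the hourly-style cost figure are hypothetical, and it models one agent step as GPU compute time plus time spent waiting on tools and data.

```python
def gpu_utilization(gpu_seconds: float, wait_seconds: float) -> float:
    """Fraction of wall-clock time the GPU spends computing in one agent step."""
    total = gpu_seconds + wait_seconds
    if total <= 0:
        raise ValueError("step must take positive time")
    return gpu_seconds / total


def cost_per_step(gpu_seconds: float, wait_seconds: float,
                  gpu_cost_per_second: float) -> float:
    """The GPU is paid for while it waits, so idle time adds to the cost."""
    return (gpu_seconds + wait_seconds) * gpu_cost_per_second


# Hypothetical step: 2.0 s of inference, then waiting on tool calls.
slow_tools = gpu_utilization(gpu_seconds=2.0, wait_seconds=2.0)
fast_tools = gpu_utilization(gpu_seconds=2.0, wait_seconds=0.5)
print(slow_tools, fast_tools)
print(cost_per_step(2.0, 2.0, 0.001), cost_per_step(2.0, 0.5, 0.001))
```

The GPU does the same work in both cases. Only the time spent on the CPU side changes, which is the part of the step a faster orchestration and data-access path can shorten.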


5. Redefining Personal Computers

From Computers Running Apps to Computers Running Agents

At GTC Taipei 2026, it was repeatedly emphasized that the computing interface itself is changing.

For decades, PCs were structured for users to run applications. Users would open browsers, write documents, and modify code in IDEs. Operating systems managed multiple apps, and users directly worked through GUIs.

However, in the Agentic AI era, users no longer worry about which app to open. They describe their goals, and agents autonomously plan, call various tools, and perform tasks. This means that computers are starting to shift from an app execution environment to an agent execution environment.

RTX Spark and Personal AI Computer

In line with these changes, NVIDIA unveiled a new computing direction called RTX Spark, which has a structure that allows for continuous agent inference locally.

By unveiling a system based on the N1X SoC, developed in collaboration with MediaTek, NVIDIA suggested a direction for processing AI workloads locally, such as local LLM inference, code generation and execution, and a personal memory-based AI assistant. In particular, it emphasized an environment where personal agents operate continuously without cloud calls by drawing on the large VRAM and CUDA ecosystem of RTX GPUs.

This strategy structurally resembles the data center strategy. If the AI Factory is a large-scale token production factory, RTX Spark can be considered a personal AI workstation scaled down to the individual level.

Why NVIDIA is Talking About PCs Again

The core of NVIDIA's current strategy remains the AI Factory. Most of the actual revenue and CAPEX occur in data center infrastructure, with Vera Rubin, networking, optical communications, and inference optimization being the focus of announcements.

Nevertheless, there is a reason why PCs were brought to the forefront again in 2026. Agentic AI requires much more interaction than expected. If all requests are processed in the cloud, latency and inference costs increase. Ultimately, some inference is likely to be processed locally. Especially for AI that needs to maintain context continuously, like personal assistants, code agents, and creative tools, local execution has significant advantages.

NVIDIA seems to see this as the starting point of a new PC replacement cycle. In this direction, future personal computing will compete based on the ability to run competitive AI Agents locally.


6. Physical AI and Humanoid Robots

A Physical AI Ecosystem That Understands Inertia and Friction

Until now, most AI has learned from digital data such as text, images, and videos. However, the real world is much more complex. Robots need to understand physical laws like inertia, friction, and gravity.

NVIDIA is now emphasizing Physical AI, which understands and acts in the real world, as the next stage of AI.

NVIDIA as a Robot Platform

Interestingly, NVIDIA does not aim to become a direct robot manufacturer. Like iOS or Android in the smartphone industry, it aims to dominate the common platform of the robotics industry.

What NVIDIA intends to provide is a simulation environment, training data generation, physics engine, inference infrastructure, etc. Manufacturers create the robot's body, while NVIDIA creates the brain.

Providing Virtual Infrastructure for Data Collection

The most challenging problem in robot learning is data. Collecting data from the real world is slow, expensive, and risky. NVIDIA offers a way to let robots practice billions of times in a virtual world before they go through trial and error in reality.

  • Omniverse: A digital twin simulation platform with a physical environment similar to reality.
  • Cosmos: Generates high-quality virtual data needed for robot training on a large scale.

This can be applied across autonomous driving, logistics, manufacturing, and humanoid robots.
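The principle of generating training data in simulation can be shown with a toy example. The sketch below is not Omniverse or Cosmos: it is a hypothetical one-dimensional simulator in which a block slides to a stop under friction, with the starting speed and friction coefficient randomized (a technique commonly called domain randomization). The parameter ranges are assumptions.

```python
import random

G = 9.81  # gravitational acceleration, m/s^2


def slide_distance(v0: float, mu: float, dt: float = 0.001) -> float:
    """Simulate a block sliding on a flat floor until friction stops it."""
    if mu <= 0:
        raise ValueError("mu must be positive, or the block never stops")
    x, v = 0.0, v0
    while v > 0.0:
        a = -mu * G                    # kinetic friction decelerates the block
        v_next = max(v + a * dt, 0.0)  # speed cannot go negative
        x += 0.5 * (v + v_next) * dt   # trapezoidal position update
        v = v_next
    return x


def generate_dataset(n: int, seed: int = 0):
    """Randomize speed and friction to produce diverse (v0, mu, distance) samples."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        v0 = rng.uniform(0.5, 3.0)  # initial speed in m/s (assumed range)
        mu = rng.uniform(0.1, 0.9)  # friction coefficient (assumed range)
        data.append((v0, mu, slide_distance(v0, mu)))
    return data


samples = generate_dataset(1000)
print(len(samples), samples[0])
```

A thousand labeled trials take a moment to generate here and risk no hardware. Production simulators apply the same idea to far richer physics and sensor data.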

Cosmos 3

At GTC Taipei 2026, NVIDIA highlighted Cosmos 3, which predicts changes in the physical environment and generates action plans.

Cosmos 3 uses a Mixture-of-Transformers structure. It combines a reasoning transformer that understands and infers the environment with a generation transformer that creates actions and world states. It also features an omnimodel structure that handles multiple data types, including text, video, and actions.

GR00T N1

At GTC 2025, NVIDIA open-sourced GR00T N1, a general-purpose Foundation Model [5] for humanoid robots trained in virtual environments. While Cosmos 3 is a foundation model for understanding and simulating physical environments, GR00T N1 is closer to the action execution layer of humanoid robots.

Robot manufacturers can perform fine-tuning on this foundation to suit characteristics such as logistics or home use. This is similar to the trend of Base Model + Fine-tuning structure in the language model market.

GPU-Based Physics Simulation

NVIDIA also unveiled the Newton physics engine, developed in collaboration with Google DeepMind, Disney Research, and others. This engine allows robots to learn in environments closer to real-world physics.

Existing game engine-level physics models struggled to handle issues like tactile feedback, collision response, and balance maintenance. The Newton physics engine enables precise tactile feedback and fine motor control learning for robots.


Conclusion

What NVIDIA showcased at GTC 2025 was the beginning of an industrial structure called the AI Factory. It defined itself as a company that designs the entire production line of a revenue-generating factory: the data center as a whole, power and cooling architecture, networking and storage, infrastructure software and OS, and the robot ecosystem.

And at GTC Taipei 2026, it became clear that the application layer to be built on top of it is Agentic AI. AI is evolving into a structure that plans on its own, calls tools, and calls other AI models in turn. This change increases inference computation and is changing the structure of data centers themselves.

Therefore, NVIDIA is no longer just a GPU company. It designs GPUs, CPUs, networks, storage, cooling, data center operating software, robot simulation, and enterprise agent runtimes, because the entire way AI makes money is changing.


Footnotes

  1. Prefill
    The initial stage where the LLM reads the entire input prompt and calculates the internal state.

  2. Decode
    The stage where the model generates actual response tokens one by one.

  3. Pareto frontier
    The optimal boundary where no metric can be improved without sacrificing another. Here, it means the ideal balance point between 'speed' and 'throughput'.

  4. ASIC (Application-Specific Integrated Circuit)
    A chip designed for one dedicated function. Here, it refers to the chip inside a network switch that forwards packets at very high speed.

  5. Foundation Model
    A general-purpose model that can be utilized for various downstream tasks through large-scale pre-training.

NVIDIA's Design of AI Factory and Agentic AI | Code & Chain