LoongForge · VLA · GR00T · Training

Training Cycle Halved: LoongForge End-to-End Optimization for GR00T N1.6 Delivers 2.3× Throughput

2026-06-02 · The LoongForge Team

To address IO stalls, communication overhead, and inefficient operator scheduling in GR00T N1.6 VLA model training, Baidu Baige's LoongForge delivers end-to-end system-level optimization, achieving up to 2.3× training throughput and shortening the overall training cycle by 56.6%.

Official website: https://baidu-baige.github.io/LoongForge/

GitHub: https://github.com/baidu-baige/LoongForge


1. Background: the capability leap and challenges of GR00T N1.6 as an embodied-intelligence foundation

As humanoid robots accelerate toward industrialization, Vision-Language-Action (VLA) models have become a core technical path for embodied intelligence, thanks to their ability to connect perception, understanding, and action end-to-end. Among the embodied-intelligence foundation models, NVIDIA's open-source GR00T N series stands out as a representative core technology stack for humanoid-robot scenarios and is widely used in robotic intelligence training and R&D deployment.

Released in 2025, GR00T N1.6 further revamps both the model architecture and the action-generation paradigm, significantly strengthening end-to-end intelligent control of humanoid robots. The model uses Cosmos-Reason-2B as its multimodal vision-language perception core, and introduces a 32-layer DiT backbone for action generation, jointly modeling first-person robot video, proprioceptive state, and natural-language instructions as a shared policy representation—unifying perception, understanding, and action control.

The deep DiT enables high-precision modeling of long action sequences and substantially improves intelligent-control quality, but it also makes training both compute- and communication-intensive, which raises training cost and makes the workload harder to run efficiently.

According to the official configuration, the pre-training stage uses a global batch size of 16,384 and runs roughly 300K steps on 1,024 H100 GPUs. Even fine-tuning on a downstream task on a single node takes several days. Data IO stalls, multi-GPU communication overhead, and inefficient training scheduling all combine to make GR00T N1.6 training expensive and slow, hindering rapid model iteration.
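To put those figures in perspective, here is a small back-of-the-envelope calculation. It assumes the global batch is split evenly across all GPUs under plain data parallelism, which the text does not state, and it treats "roughly 300K steps" as exactly 300,000.

```python
# Figures quoted in the paragraph above.
GLOBAL_BATCH_SIZE = 16_384
TRAINING_STEPS = 300_000   # "roughly 300K steps", treated as exact here
NUM_GPUS = 1_024


def pretraining_scale(global_batch, steps, gpus):
    """Return (samples per GPU per step, total samples processed).

    Assumes an even split of the global batch across GPUs.
    """
    per_gpu_per_step = global_batch // gpus
    total_samples = global_batch * steps
    return per_gpu_per_step, total_samples


per_gpu, total = pretraining_scale(GLOBAL_BATCH_SIZE, TRAINING_STEPS, NUM_GPUS)
print(f"{per_gpu} samples per GPU per step, about {total / 1e9:.1f} billion samples in total")
```

Under these assumptions each GPU handles 16 samples per step, and the run processes close to five billion samples, so even small per-step stalls add up over the whole run.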

2. Solution overview: LoongForge end-to-end system-level optimization

To further improve GR00T N1.6 training efficiency, the Baidu Baige team applied system-level optimization and deep refactoring across the full training pipeline, on top of the in-house, open-source full-modal training framework LoongForge.

Targeting the characteristics of VLA training, LoongForge focuses on three directions: the data IO pipeline, communication-computation overlap, and training scheduling.

Compared to the official training implementation, LoongForge ultimately delivers up to 2.3× training throughput and reduces the overall training cycle by 56.6%. So how exactly does LoongForge free up more GPU compute and accelerate GR00T N1.6 training? Below we break down the key ideas and their technical implementation.

3. Inside the 2.3× speedup: three engineering optimizations

To unlock GR00T N1.6's training potential, we did not stop at simple parameter tuning, but performed system-level optimization at three layers: data IO pipeline, communication-computation overlap, and training scheduling.

Optimization 1: IO pipeline — asynchronous data prefetch

GR00T N1.6 data preprocessing involves CPU-heavy operations such as video decoding, image augmentation, and multimodal encoding. In the LeRobot framework, the GPU spends a large fraction of its time waiting on data, which is a classic IO stall.

Baseline: data processing and forward executed serially

LoongForge decouples data production from GPU training via a three-level asynchronous pipeline:

While the GPU computes the current batch, the next batch is being transferred, the one after is being preprocessed, and an even later batch is being read—forming a complete pipeline overlap.
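The overlap described above can be sketched with plain threads and bounded queues. This is a minimal illustration and not LoongForge's implementation: the three stages (read, preprocess, transfer) are our reading of the paragraph above, and `stage`, `prefetching_batches`, and the lambda stage functions are hypothetical names introduced for this example.

```python
import queue
import threading

_DONE = object()  # sentinel marking the end of the stream


def stage(fn, inbox, outbox):
    """Start a worker thread that applies fn to every item from inbox."""
    def worker():
        while True:
            item = inbox.get()
            if item is _DONE:
                outbox.put(_DONE)  # pass the end marker downstream
                return
            outbox.put(fn(item))
    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    return thread


def prefetching_batches(sample_ids, read, preprocess, transfer, depth=2):
    """Yield ready batches while later batches advance through the stages."""
    # Bounded queues give backpressure: a fast stage cannot run far ahead.
    q_ids, q_raw, q_cpu, q_dev = (queue.Queue(maxsize=depth) for _ in range(4))
    stage(read, q_ids, q_raw)
    stage(preprocess, q_raw, q_cpu)
    stage(transfer, q_cpu, q_dev)

    def feeder():
        for sample_id in sample_ids:
            q_ids.put(sample_id)
        q_ids.put(_DONE)
    threading.Thread(target=feeder, daemon=True).start()

    while True:
        batch = q_dev.get()
        if batch is _DONE:
            return
        yield batch


# Hypothetical stand-ins for disk reads, video decoding, and host-to-device copies.
read = lambda i: {"id": i, "raw": f"frame-{i}"}
preprocess = lambda s: {**s, "decoded": s["raw"].upper()}
transfer = lambda s: {**s, "on_device": True}

batches = list(prefetching_batches(range(5), read, preprocess, transfer))
print([b["id"] for b in batches])
```

Each stage has a single worker and FIFO queues, so batch order is preserved. A real pipeline would likely run CPU-heavy preprocessing in separate processes rather than threads and use pinned memory with asynchronous copies for the transfer stage.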

Optimized full data pipeline

With this IO-pipeline optimization, GPU data-wait time is dramatically compressed and IO stalls are largely hidden.

Optimization 2: communication-computation overlap — fine-grained overlap driven by the Megatron Distributed Optimizer

When training GR00T N1.6 in the LeRobot framework, the following bottlenecks are typical:

To resolve these bottlenecks, LoongForge integrates the Megatron Distributed Optimizer. Through contiguous gradient buffers and hook-based communication scheduling, parameter sync, gradient reduction, and model computation are deeply interleaved, achieving fine-grained communication-computation overlap, hiding communication overhead, and lifting training throughput.
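To make the idea of contiguous gradient buffers and hook-driven communication concrete, here is a toy model in plain Python. It is not the Megatron API: `BucketedGradBuffer`, its methods, and the four-layer model are hypothetical, and the "reduce" is only logged rather than executed. The point is the ordering: a bucket's reduction can be launched as soon as its last gradient arrives, while backward is still running on earlier layers.

```python
class BucketedGradBuffer:
    """Toy model: one flat gradient buffer split into buckets."""

    def __init__(self, param_sizes, bucket_size):
        # Lay parameters out contiguously in reverse order, because backward
        # produces gradients for the last layers first.
        self.flat = [0.0] * sum(param_sizes.values())
        self.slices = {}      # param name -> (start, end) in the flat buffer
        self.bucket_of = {}   # param name -> bucket index
        self.pending = []     # per bucket: params still waiting for a gradient
        self.launch_log = []  # order in which bucket reductions were launched
        offset, bucket_fill = 0, 0
        for name in reversed(list(param_sizes)):
            size = param_sizes[name]
            if not self.pending or bucket_fill >= bucket_size:
                self.pending.append(set())  # open a new bucket
                bucket_fill = 0
            self.slices[name] = (offset, offset + size)
            self.bucket_of[name] = len(self.pending) - 1
            self.pending[-1].add(name)
            offset += size
            bucket_fill += size

    def on_grad_ready(self, name, grad):
        """Hook called when backward has produced a parameter's gradient."""
        start, end = self.slices[name]
        self.flat[start:end] = grad
        bucket = self.bucket_of[name]
        self.pending[bucket].discard(name)
        if not self.pending[bucket]:
            # A real system would start an asynchronous gradient reduction
            # here and let backward continue in the meantime.
            self.launch_log.append(bucket)


# Hypothetical four-layer model with 4 gradient elements per layer.
sizes = {"layer0": 4, "layer1": 4, "layer2": 4, "layer3": 4}
buf = BucketedGradBuffer(sizes, bucket_size=8)

events = []
for name in ["layer3", "layer2", "layer1", "layer0"]:  # backward order
    events.append(f"backward {name}")
    launched_before = len(buf.launch_log)
    buf.on_grad_ready(name, [1.0] * sizes[name])
    if len(buf.launch_log) > launched_before:
        events.append(f"launch reduce for bucket {buf.launch_log[-1]}")
print(events)
```

In the event list, the reduction for bucket 0 is launched before backward reaches `layer1`, which is the window in which communication can hide behind computation.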

Comparison of baseline vs. LoongForge core optimization features:

Baseline vs. LoongForge core optimization features

Specific measures include:

Per-layer parameter prefetch in forward
Bucket-granularity communication-computation overlap

Optimization 3: training-scheduling optimization — Per-Microbatch CUDA Graph for GR00T N1.6

Python scheduling and GPU kernel launch overhead are often invisible bottlenecks in large-model training. In VLA training such as GR00T N1.6, where the forward pass, backward pass, and loss computation trigger a large number of tiny kernels, eager-mode launching keeps amplifying CPU overhead and prevents the GPU from being fully utilized.

CUDA Graph is a widely used optimization in the GPU ecosystem that uses graph replay to cut Python scheduling and kernel launch overhead.

Rather than reusing a generic CUDA Graph flow as-is, LoongForge adapts it for the actual GR00T N1.6 VLA training pipeline: stable, repeated forward/backward compute paths are captured into a CUDA Graph, while logic that benefits from flexibility—such as random-noise sampling and dynamic input handling—stays on the eager path. We also redraw the capture-and-replay boundaries to fit gradient accumulation across multiple microbatches and DDP overlap timing, enabling CUDA Graph to run stably in real multi-GPU GR00T N1.6 training.

CUDA Graph execution modes: Eager vs. Full-Iteration vs. Per-Microbatch

In LoongForge, GR00T N1.6 training keeps three execution paths—Eager, Full-Iteration CUDA Graph, and Per-Microbatch CUDA Graph—each serving a different stage and scenario:

Compared to Full-Iteration, Per-Microbatch is no longer about chasing "a bigger graph" but about "drawing the right boundaries": random-number logic such as beta.sample and torch.randn stays on the eager path, so randomness is not frozen by capture and loss alignment is preserved; gradient sync points stay on appropriate microbatch boundaries, so the CUDA Graph optimization does not break the original communication-computation overlap rhythm.
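The boundary described above can be illustrated with a deliberately simplified model. The sketch below does not use CUDA or PyTorch: `ToyGraph` is a hypothetical stand-in that records a fixed list of operations and replays them over static buffers, and the loss function is invented for the example. What it shows is the split of responsibilities: noise is sampled eagerly for each microbatch and copied into a static buffer, the captured forward/backward path is replayed, and the gradient sync point sits outside the graph on the last microbatch boundary.

```python
import random


class ToyGraph:
    """Stand-in for a captured graph: a fixed op list over static buffers."""

    def __init__(self):
        self.ops = []

    def capture(self, op):
        self.ops.append(op)

    def replay(self, buffers):
        for op in self.ops:
            op(buffers)


# Static buffers: every replay reads and writes these same slots.
buffers = {"x": 0.0, "noise": 0.0, "w": 0.5, "loss": 0.0, "grad_w": 0.0}


def forward(b):
    # Hypothetical loss: squared error between w * x and a noised target.
    b["loss"] = (b["w"] * b["x"] - (b["x"] + b["noise"])) ** 2


def backward(b):
    # d(loss)/dw, accumulated across microbatches.
    b["grad_w"] += 2 * (b["w"] * b["x"] - (b["x"] + b["noise"])) * b["x"]


graph = ToyGraph()
graph.capture(forward)   # stable compute path, captured once
graph.capture(backward)


def train_step(microbatches, rng, sync_log):
    buffers["grad_w"] = 0.0
    noises = []
    for i, x in enumerate(microbatches):
        # Eager path: fresh noise per microbatch, copied into the static buffer.
        buffers["x"] = x
        buffers["noise"] = rng.gauss(0.0, 1.0)
        noises.append(buffers["noise"])
        graph.replay(buffers)
        if i == len(microbatches) - 1:
            # Gradient sync stays outside the graph, on the last microbatch.
            sync_log.append(buffers["grad_w"])
    return noises


sync_log = []
noises = train_step([1.0, 2.0, 3.0], random.Random(0), sync_log)
print(len(set(noises)), len(sync_log))
```

Each microbatch sees a different noise value even though the same recorded operations are replayed, and only one sync is logged per step. In an actual PyTorch setup the captured region would be built with the framework's CUDA Graph facilities, and the static buffers would be preallocated device tensors.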

To make CUDA Graph robust on the real GR00T N1.6 workload, we performed graph-safe refactoring on the vision encoder, language backbone, and action DiT components: static buffer reuse, fixed-shape padding, cached positional encodings and window indices, avoidance of dynamic allocation during capture, and replacement of certain non-capturable operators—so CUDA Graph runs reliably in real multi-GPU training.

With this optimization, GR00T N1.6 training gains close to 1.5× throughput in multi-GPU settings. Per-Microbatch mode keeps performance close to Full-Iteration while offering better loss alignment, providing a more efficient and stable execution path for GR00T N1.6 training.

4. Performance results: training cycle halved, time reduced by 56.6%

On an 8×A800 (80G) node, we trained GR00T N1.6 on the Libero dataset. Across the full training task, the overall training cycle is reduced by 56.6%, significantly speeding up model R&D and experimental iteration.
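As a quick sanity check on how the two headline numbers relate, the snippet below converts between a throughput multiple and a reduction in wall-clock time. It assumes a fixed amount of work and that training time is inversely proportional to throughput; how the two figures were measured is not stated here, so treat this as a consistency check rather than a derivation of the reported results.

```python
def cycle_reduction(speedup):
    """Fraction of wall-clock time saved when throughput rises by `speedup`."""
    return 1.0 - 1.0 / speedup


def implied_speedup(reduction):
    """Throughput multiple implied by a fractional cut in wall-clock time."""
    return 1.0 / (1.0 - reduction)


print(f"{cycle_reduction(2.3):.1%}")     # time saved at exactly 2.3x throughput
print(f"{implied_speedup(0.566):.2f}x")  # throughput implied by a 56.6% cut
```

Under that assumption, a 2.3× throughput gain corresponds to roughly a 56.5% shorter run, and a 56.6% reduction corresponds to about 2.30×, so the two reported figures agree to within rounding.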

Performance comparison across optimization stages:

GR00T N1.6 performance comparison across optimization stages

5. Summary: raising effective GPU utilization

By applying system-level optimization across training scheduling, communication-computation overlap, and the data IO pipeline, we significantly cut Python scheduling overhead, communication waits, and data-supply idling, moving the GPU from "passive waiting" to "continuous computation". Without changing the model architecture, we deliver up to 2.3× training throughput and a 56.6% shorter training cycle, substantially improving model iteration efficiency and the pace of R&D.

These optimizations are now integrated into the full-modal training framework LoongForge. We welcome researchers and developers in embodied intelligence to explore more efficient VLA training together.

🔗 Original article: https://mp.weixin.qq.com/s/OxDi4424J5Oy1sv_fuLkVA