Case Study · VLM · LLaVA

128 GPUs, 4 days — Baidu Baige powers LLaVA-OneVision-1.5, a new record in multimodal training efficiency

2025-10-15 · The LoongForge Team

Editor's note: This article documents how LLaVA-OneVision-1.5 was trained on Baidu Baige's AIAK training framework — the predecessor of LoongForge. LoongForge grew out of real production workloads like this one, and today's multimodal training capabilities — including ViT × LLM heterogeneous parallelism — come directly from the engineering practice described here.

Training a high-performance vision large model has historically meant high cost and heavy engineering lift. In late September, LLaVA-OneVision-1.5, released jointly by Inspiration Lab and LMMs-Lab, changed that picture.

This 8B multimodal model, capable of visual understanding and image-text dialogue, completed pretraining in just 4 days on 128 A800 GPUs, while matching top-tier large models on public benchmarks — proving the value of non-brute-force performance gains.

Behind this breakthrough stands not only the high-performance AI infrastructure of Baidu Baige's AI compute platform, but also the AIAK training framework built into it. From adapting mainstream model architectures to multi-dimensional distributed-training acceleration, AIAK supplies the engineering productivity that made it possible to train and ship the model this efficiently.

Even more notable for the industry: LLaVA-OneVision-1.5 breaks through the "weights-only" limitation of traditional open-source. Under that traditional release model, shipping only weights is like handing developers a sealed black box: no visibility into where the training data came from or how it was filtered, no clarity on hyperparameters or parallelism strategy, no reference for data cleaning and evaluation. Developers can only use the model, not optimize or iterate on it. They often cannot even reproduce the reported performance, let alone innovate on top, and small and mid-sized teams have nowhere to start.

As one of the first multimodal models in the industry to achieve end-to-end open source, LLaVA-OneVision-1.5 fully opens its 85M-sample pretraining and 22M-sample instruction datasets spanning a wide range of scenarios, its training configuration (hyperparameters, parallelism strategy), and its optimization details (data-cleaning and evaluation logs), plus a one-click reproducible training path.

This kind of openness lets researchers, enterprises, and university teams directly restructure, validate, and extend the model, genuinely turning multimodal AI from "the preserve of giants" into a public asset the whole industry can reuse and build on.

1. LLaVA-OneVision-1.5: high performance at low cost, democratizing multimodal AI with end-to-end open source

LLaVA-OneVision-1.5's lead is not only in metrics. It builds a new "high performance + low cost" paradigm for multimodal models through high-quality data, a clean and efficient architecture, a compact training recipe, and extreme engineering optimization — all amplified by fully open-sourcing the entire pipeline.

High-quality data: coverage, balance, and task generalization

An 85M-sample pretraining + 22M-sample instruction data matrix integrates 8 heterogeneous sources spanning images, documents, OCR, mathematical reasoning, and more. Concept-balanced sampling supplements rare concepts and removes noise, preventing the model from becoming lopsided and ensuring generalization across modalities and tasks.
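The exact sampling procedure ships with the open-source data pipeline. Purely as an illustration of the idea behind concept-balanced sampling, the sketch below draws samples with probability scaled down by concept frequency so that rare concepts are not drowned out; the function name and smoothing exponent are assumptions for this example, not the project's actual code.

```python
import random
from collections import Counter

def concept_balanced_sample(samples, target_size, smoothing=0.5):
    """Resample a dataset so that rare concepts are not drowned out.

    Each sample carries a "concept" label; draw probability is proportional
    to freq(concept) ** -smoothing, which boosts rare concepts relative to
    plain frequency-proportional sampling.
    """
    freq = Counter(s["concept"] for s in samples)
    weights = [freq[s["concept"]] ** -smoothing for s in samples]
    return random.choices(samples, weights=weights, k=target_size)

# Toy usage: "chart" is rare, so it ends up oversampled relative to raw frequency.
data = (
    [{"concept": "photo", "id": i} for i in range(900)]
    + [{"concept": "chart", "id": i} for i in range(100)]
)
balanced = concept_balanced_sample(data, target_size=1000)
print(Counter(s["concept"] for s in balanced))
```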

Both datasets are released alongside the model, so developers don't have to collect and label data from scratch. They can be used directly for training or iteration, removing the classic "no data available" pain point of weights-only releases.

A clean, efficient architecture: RICE-ViT visual encoder balances detail perception and training efficiency

The in-house RICE-ViT visual encoder precisely captures fine-grained information such as table cells and small text in documents. Paired with a lightweight vision-language alignment layer, it simplifies the cross-modal fusion path — keeping perception sharp while reducing training load, delivering both accuracy and efficiency.
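The RICE-ViT implementation itself is part of the open-source release. As a generic sketch of what a lightweight vision-language alignment layer usually looks like in this family of models, the snippet below shows a small MLP that projects visual features into the LLM's embedding space; the dimensions and module names are placeholders, not the released architecture.

```python
import torch
import torch.nn as nn

class VisionLanguageProjector(nn.Module):
    """Minimal two-layer MLP that maps visual-encoder patch features into the
    LLM embedding space, so image tokens can be concatenated with text tokens.
    Hidden sizes here are placeholders, not the released config."""

    def __init__(self, vit_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vit_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
        # patch_features: [batch, num_patches, vit_dim]
        return self.proj(patch_features)  # [batch, num_patches, llm_dim]

# Illustrative shapes only.
projector = VisionLanguageProjector()
image_tokens = projector(torch.randn(2, 576, 1024))
print(image_tokens.shape)  # torch.Size([2, 576, 4096])
```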

Architecture details and code implementation are part of the open-source release. Unlike weights-only releases where architectural details are opaque, even small and mid-sized teams can now quickly build a multimodal baseline — no need to reinvent complex structures from scratch.

A compact three-stage training strategy: grow the model efficiently

A three-stage recipe — "image-text base alignment → balanced knowledge injection → instruction-time reinforcement" — with clear objectives and no redundant iteration. This accelerates the model's growth from "seeing" to "doing" and underpins the low-cost training outcome.
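The published scripts record the real settings; the snippet below is only an illustrative way to express such a staged recipe as plain configuration. Stage names, which modules are trainable, and all numbers are placeholders, not the released hyperparameters.

```python
# Illustrative staged-training configuration; values are placeholders,
# not the hyperparameters published in the LLaVA-OneVision-1.5 scripts.
STAGES = [
    {
        "name": "stage1_alignment",      # image-text base alignment
        "trainable": ["projector"],      # e.g. freeze ViT and LLM, train the alignment layer
        "data": "caption_pairs",
        "lr": 1e-3,
        "epochs": 1,
    },
    {
        "name": "stage2_knowledge",      # balanced knowledge injection
        "trainable": ["projector", "vit", "llm"],
        "data": "concept_balanced_pretrain",
        "lr": 2e-5,
        "epochs": 1,
    },
    {
        "name": "stage3_instruction",    # instruction-time reinforcement
        "trainable": ["projector", "llm"],
        "data": "instruction_data",
        "lr": 1e-5,
        "epochs": 1,
    },
]

for stage in STAGES:
    print(f"{stage['name']}: train {stage['trainable']} on {stage['data']}")
```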

Hyperparameters, task split, and iteration cadence of each stage are fully recorded in the open-source scripts. In contrast to traditional weights-only releases where the training process is opaque, developers can reproduce every step, or fine-tune the strategy for their own needs — significantly cutting R&D cycles.

Extreme engineering optimization: efficiency as the cost breakthrough

Offline data packing (11x padding compression), hybrid parallelism, and other compute-allocation optimizations let 128 A800 GPUs complete 8B pretraining in just 4 days — evidence that algorithm-engineering co-design can deliver both high performance and low cost.
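The released packing scripts implement the production pipeline. The sketch below only illustrates the core idea of offline packing: grouping variable-length samples into fixed-length training sequences ahead of time so padding stops dominating each batch. It is a greedy first-fit bin packer with illustrative names, not the project's actual script.

```python
def pack_sequences(sample_lengths, max_seq_len):
    """Greedy first-fit-decreasing packing: group variable-length samples into
    bins of at most max_seq_len tokens, so each training sequence carries
    several real samples instead of one sample plus padding."""
    bins = []          # each bin is a list of sample indices
    bin_space = []     # remaining token budget per bin
    for idx, length in sorted(enumerate(sample_lengths), key=lambda x: -x[1]):
        for b, space in enumerate(bin_space):
            if length <= space:
                bins[b].append(idx)
                bin_space[b] -= length
                break
        else:
            bins.append([idx])
            bin_space.append(max_seq_len - length)
    return bins

# Toy example: 6 samples that would otherwise be 6 padded sequences
# fit into 3 sequences of at most 2048 tokens each.
lengths = [900, 300, 250, 1800, 120, 700]
print(pack_sequences(lengths, max_seq_len=2048))  # [[3, 4], [0, 5, 1], [2]]
```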

The core tools that enabled this (data packing scripts, parallelism configs, etc.) are all open-sourced. Unlike weights-only releases where engineering know-how can't be reused, developers can apply this playbook directly on their own compute — no need to re-solve the engineering puzzles.

2. Baidu Baige AI compute platform: the power source of extreme efficiency

LLaVA-OneVision-1.5's breakthrough was only possible thanks to Baidu Baige's full-stack support, spanning high-performance AI infrastructure and extreme engineering productivity: end-to-end capability from compute to the training system that helps teams achieve extreme efficiency on a limited budget.

High-performance infrastructure: a stable foundation for large-model training

Training of LLaVA-OneVision-1.5 was hosted on the GPU compute clusters provided by Baige. In the distributed setting of 128 A800 GPUs, Baige's high-bandwidth interconnect and elastic scheduling maximize utilization and throughput, making it realistic for an 8B model to complete full-parameter training on 85M samples in 4 days.
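As a rough back-of-envelope check using only the headline numbers (85M samples, 128 GPUs, 4 days), the sustained throughput the cluster had to hold works out as follows:

```python
samples = 85_000_000
gpus = 128
days = 4

seconds = days * 24 * 3600                       # 345,600 s
cluster_throughput = samples / seconds           # ~246 samples/s across the cluster
per_gpu_throughput = cluster_throughput / gpus   # ~1.9 samples/s per A800

print(f"{cluster_throughput:.0f} samples/s cluster-wide, "
      f"{per_gpu_throughput:.2f} samples/s per GPU")
```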

Full-pipeline engineering productivity: the AIAK framework accelerates multimodal training

On the training side, the LLaVA-OneVision-1.5 team made deep use of Baige's AIAK-Training-LLM framework. It fully supports multimodal models across training phases, and through hybrid parallelism, communication-computation overlap, data packing, and other acceleration techniques it improves end-to-end training throughput and resource utilization across the board. As a result, LLaVA-OneVision-1.5 saw severalfold improvements in training efficiency and a significant cost reduction, a solid foundation for fast iteration.

AIAK-Training-LLM is a purpose-built AI acceleration toolkit from Baidu Baige for large-model training, built on Megatron and designed to help developers run large-scale distributed training efficiently, significantly improving training performance and resource utilization.

AIAK supports pretraining and fine-tuning for mainstream model scenarios — LLMs, multimodal understanding models, video generation models, etc. — with compatibility across Qwen, LLaMA, DeepSeek, QwenVL, InternVL, QianfanVL, LLaVA-OneVision (LLaVAOV), and Wan families. On top of that, users can flexibly build custom model architectures on AIAK and run training efficiently.

On the performance side, AIAK is deeply optimized per model structure, shipping hybrid parallelism, communication-computation overlap, cost-effective memory management, FP8 low-precision training, operator fusion, high-performance optimizers, and more. Across model types, MFU (Model FLOPs Utilization) improves by an average of 30%, delivering industry-leading training performance.
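MFU itself is easy to estimate for your own runs. The sketch below uses the common 6·N FLOPs-per-token approximation for a dense transformer and the A800's roughly 312 TFLOPS BF16 peak; both are assumptions for illustration (the visual encoder's FLOPs are ignored), not AIAK's internal accounting.

```python
def mfu(params_b, tokens_per_sec_per_gpu, peak_tflops=312.0):
    """Model FLOPs Utilization: achieved training FLOPs / hardware peak.
    Uses the common ~6 * N FLOPs-per-token estimate for a dense transformer
    (forward + backward), ignoring attention's quadratic term."""
    flops_per_token = 6 * params_b * 1e9
    achieved = flops_per_token * tokens_per_sec_per_gpu
    return achieved / (peak_tflops * 1e12)

# Illustrative numbers only: an 8B model pushing 3,000 tokens/s per GPU.
print(f"MFU ~ {mfu(8, 3000):.1%}")  # ~46.2%
```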

AIAK is deeply integrated into the Baige platform — users can pull prebuilt training images directly. The complete training code and configuration around LLaVA-OneVision-1.5 is fully open-sourced, and more tools and optimizations will keep being released to further lower the bar for large-model R&D.

3. Every team can build its own AI model

LLaVA-OneVision-1.5's success is a direct expression of Baidu Baige's "fast, stable, efficient" core value — high-performance AI infrastructure anchors the compute foundation, while extreme engineering productivity connects "data processing → model training → efficiency tuning" end-to-end, cutting training cycles and R&D cost.

Today, whether you're a resource-constrained research institute, an efficiency-focused enterprise team, or a startup developer exploring innovation, you can spin up AI model R&D on Baige: quickly stand up a high-performance training environment, accelerate multimodal training with the AIAK toolchain, and even reuse the LLaVA-OneVision-1.5 open-source recipe — iterating a custom AI model tailored to your business at controlled cost.

Note: The AIAK-Training-LLM framework referenced in this article is the predecessor of the now open-source LoongForge. LoongForge carries forward AIAK's core capabilities around ViT × LLM heterogeneous parallelism, data packing, and communication-computation overlap, and evolves them further for multimodal / embodied scenarios.

📖 Source (WeChat, Chinese): 128 卡 4 天时间!百度百舸助力 LLaVA-OneVision-1.5 刷新多模态大模型训练效率纪录
Project page: LLaVA-OneVision-1.5 on GitHub