๐Ÿ‰ Part of the Baidu-Baige Loong open-source series

LoongForge

A modular, scalable, high-performance training framework for LLMs, VLMs, diffusion, and embodied models, built on Megatron-LM with native NVIDIA GPU & Kunlun XPU support

🧩 Easy
One framework, broad coverage
Full coverage of mainstream open-source LLMs, VLMs, MoE, diffusion, and VLA models. Ready-to-run configs and launch scripts included.
⚡ Efficient
Up to ~5× training speedup
Deep performance optimizations: fused kernels, adaptive FP8, MoE A2A overlap, and multimodal pipeline scheduling.
💎 Multi-chip
NVIDIA GPU & Kunlun XPU
Native heterogeneous hardware support: one framework, minimal migration between GPU and XPU.

✨ Key Features

A quick tour of what sets LoongForge apart

Tri-Stream Overlap (MoE): MoE EP comm × compute × offload in parallel, for higher throughput than upstream.
Heterogeneous Parallelism (Multimodal): Independent TP / PP / DP per model component.
Disaggregated Training (Multimodal): Decoupled ViT / LLM scheduling kills pipeline bubbles.
DP Load Balancing (Multimodal): Fixes packing-induced imbalance at cluster scale.
Adaptive FP8 (Performance): Per-operator FP8 decisions by GEMM shape.
Fused Operators (Performance): FusedDSA / Sparse MLA kernels for end-to-end speedup.
ChunkPipe (Performance): Chunked long-sequence pipelining toward million-length contexts.
Pretrain + SFT + LoRA (Training): One codebase covers key training stages.
Model Composition (Usability): Swap ViT × LLM for VLMs via YAML.
HF ↔ Megatron (Usability): Bidirectional checkpoint conversion + online HF load/save.
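
To make the DP Load Balancing idea concrete, here is a minimal sketch of one generic way to even out work across data-parallel ranks. It is not LoongForge's algorithm: the function name `balance_across_dp_ranks`, the greedy longest-first strategy, and the length-squared cost model are all assumptions chosen for illustration.

```python
import heapq


def balance_across_dp_ranks(sample_lengths, num_ranks):
    """Greedy longest-first assignment of packed samples to DP ranks.

    Cost per sample is approximated as length squared (an assumed,
    attention-dominated cost model used only for illustration).
    Returns {rank: (total_cost, [sample indices])}.
    """
    # Min-heap keyed by accumulated cost; rank id breaks ties.
    heap = [(0, rank, []) for rank in range(num_ranks)]
    heapq.heapify(heap)

    # Place the most expensive samples first, each on the least-loaded rank.
    order = sorted(range(len(sample_lengths)),
                   key=lambda i: sample_lengths[i], reverse=True)
    for idx in order:
        cost, rank, assigned = heapq.heappop(heap)
        assigned.append(idx)
        heapq.heappush(heap, (cost + sample_lengths[idx] ** 2, rank, assigned))

    return {rank: (cost, assigned) for cost, rank, assigned in heap}


lengths = [8192, 1024, 1024, 4096, 4096, 2048, 512, 512]
plan = balance_across_dp_ranks(lengths, num_ranks=2)
for rank in sorted(plan):
    print(rank, plan[rank])
```

With one very long sample in the batch, the greedy pass places it alone on one rank and groups the shorter samples on the other, which is the kind of imbalance a naive round-robin split would miss.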

📊 Benchmark

Measured in v0.1.0 on A800 across LLM, VLM, and VLA workloads

Speedup relative to the 1.0× baseline:
Pi0.5 (VLA): 1.65×
GR00T N1.6 (VLA): 1.60×
Qwen3-VL-30B-A3B (VLM): 1.45×
Qwen3-30B-A3B (MoE): 1.16×
DeepSeek-V3.2 (DSA operator-level optimizations, validated on a reduced-layer configuration): ~5×

💎 Hardware Compatibility

One codebase, two silicon stacks: production-ready on NVIDIA GPU and Baidu Kunlun XPU

NVIDIA GPU

Built on the community Megatron + TransformerEngine ecosystem, with LoongForge optimizations layered on top.

Kunlun XPU

An XPU plugin mechanism shields the upper stack from hardware adaptation differences and integrates an XPU-specific optimization toolchain.

๐Ÿ›๏ธ Supported Models

From compact SLMs to large-scale MoE giants โ€” all batteries-included

DeepSeek-V2: v2-lite, v2
DeepSeek-V3: v3
DeepSeek-V3.2: v3.2
LLaMA2: 7B, 13B, 70B
LLaMA3: 8B, 70B
LLaMA3.1: 8B, 70B, 405B
Qwen: 1.8B → 72B
Qwen1.5: 0.5B → 72B
Qwen2: 0.5B → 72B
Qwen2.5: 0.5B → 72B
Qwen3: 0.6B → 480B-A35B, Coder-30B-A3B
Qwen3-Next: 80B-A3B
MiniMax: m2.1, m2.5, m2.7
MIMO: mimo-7b
GLM: glm5

🚀 Quick Start

YAML-driven: a few steps from install to launch

Docker is the recommended path: a single image bundles CUDA/XPU toolchains, the patched Megatron submodule, and TransformerEngine, so every developer and every node trains from the same environment. Source install is also fully supported for advanced setups.

$ # NVIDIA GPU (Docker)
$ git clone --recurse-submodules https://github.com/baidu-baige/LoongForge.git
$ docker build --build-arg COMPILE_ENV=hopper --build-arg ENABLE_LEROBOT=false -t loongforge:latest -f ./LoongForge/docker/Dockerfile .

$ # Kunlun XPU (Docker)
$ docker build --build-arg BASE_IMAGE=loongforge/loongforge_kunlun:py310_torch25 --build-arg ENABLE_LEROBOT=false -t loongforge-kunlun:latest -f LoongForge/docker/Dockerfile.xpu .

Optional: only needed for custom combinations. LoongForge uses declarative configs to compose different modality components into a full multimodal model. Take qwen3_vl_30b_a3b as an example: a single YAML assembles the vision encoder, projector, and language backbone. To swap the language backbone to DeepSeek V3, change one line under model.foundation. Mainstream open-source models are already built in under configs/models/ for you to pick from.

# configs/models/qwen3_vl/qwen3_vl_30b_a3b.yaml
defaults:
  - ../../models/image_encoder@model.image_encoder: qwen3_vit
  - ../../models/image_projector@model.image_projector: qwen_mlp_adapter
  - ../../models/qwen3@model.foundation: qwen3_30b_a3b
  - _self_

model:
  model_type: qwen3_vl
  ...

# Swap the language backbone (one-line change):
-  - ../../models/qwen3@model.foundation: qwen3_30b_a3b
+  - ../../models/deepseek3@model.foundation: deepseek_v3
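
The one-line swap above can also be pictured as a small transformation on the loaded defaults list. The sketch below is only an illustration of the `group@package: option` entry shape shown in the YAML. The helper `swap_component` is hypothetical and not part of LoongForge or Hydra, and the defaults list is written as the plain Python structure a YAML loader would typically produce.

```python
def swap_component(defaults, package, new_group, new_option):
    """Return a copy of a Hydra-style defaults list with one component replaced.

    `package` is the target slot (e.g. "model.foundation"); entries whose key
    ends in "@<package>" are replaced, everything else is kept as-is.
    """
    swapped = []
    for entry in defaults:
        # Mapping entries look like {"<group>@<package>": "<option>"};
        # "_self_" is a plain string and is passed through unchanged.
        if isinstance(entry, dict):
            (key, _option), = entry.items()
            _group, _, entry_package = key.partition("@")
            if entry_package == package:
                entry = {f"{new_group}@{package}": new_option}
        swapped.append(entry)
    return swapped


defaults = [
    {"../../models/image_encoder@model.image_encoder": "qwen3_vit"},
    {"../../models/image_projector@model.image_projector": "qwen_mlp_adapter"},
    {"../../models/qwen3@model.foundation": "qwen3_30b_a3b"},
    "_self_",
]

new_defaults = swap_component(
    defaults, "model.foundation", "../../models/deepseek3", "deepseek_v3"
)
for entry in new_defaults:
    print(entry)
```

Only the entry bound to `model.foundation` changes. The vision encoder and projector entries are untouched, which is what makes the swap a one-line edit in the YAML.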

LoongForge supports offline conversion of HuggingFace weights into the Megatron training format, and can also load HuggingFace-format weights directly at startup, skipping the conversion step. On completion, weights can be exported back to HF format with one flag, for seamless hand-off to the downstream community ecosystem.

TRAINING_ARGS=(
    --load $CHECKPOINT_PATH            # point directly at the HF model directory
    --save $CHECKPOINT_PATH            # high-performance training checkpoints
    --save-interval 40
    --save-hf true                     # export HF weights on finish
    --save-hf-path /path/to/output
    ...
)

LoongForge ships a built-in data preprocessing toolchain that converts your raw data into the framework-compatible format. Below is a multimodal data preprocessing example; refer to the per-model-family user guide for details.

$ python tools/data_preprocess/vlm/convert_to_webdataset.py \
    --output_dir /workspace/wds_data/ \
    --json_file tests/datasets/vlm/mllm_demo.json \
    --image_dir tests/datasets/vlm/ \
    --video_dir tests/datasets/vlm/ \
    --media mix \
    --columns_messages messages \
    --maxcount 10000 \
    --maxsize 3000000000 \
    --sample_type multi_mix_qa
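
The `--maxcount` and `--maxsize` flags in the command above suggest that output is split into shards by sample count and byte size. The sketch below shows one plausible rollover rule. It is an assumption about the semantics, not the tool's actual implementation, and `plan_shards` is a hypothetical helper. Under this rule the limits are checked before each write, so a shard can exceed `maxsize` by at most one sample.

```python
def plan_shards(sample_sizes, maxcount, maxsize):
    """Group sample indices into shards.

    A new shard is started once the current one already holds `maxcount`
    samples or at least `maxsize` bytes (assumed rollover rule).
    """
    shards, current, current_bytes = [], [], 0
    for index, size in enumerate(sample_sizes):
        if current and (len(current) >= maxcount or current_bytes >= maxsize):
            shards.append(current)
            current, current_bytes = [], 0
        current.append(index)
        current_bytes += size
    if current:
        shards.append(current)
    return shards


# Five 400-byte samples with a 1000-byte size limit and a generous count limit.
print(plan_shards([400] * 5, maxcount=10, maxsize=1000))
# Five tiny samples with a count limit of two per shard.
print(plan_shards([1] * 5, maxcount=2, maxsize=10**9))
```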

The outer layer is fully Megatron-compatible: familiar training arguments can be reused as-is.

TRAINING_ARGS=(
    --training-phase sft
    --seq-length 32768
    --micro-batch-size 1
    ...
)

MODEL_PARALLEL_ARGS=(
    --tensor-model-parallel-size 1
    --pipeline-model-parallel-size 2
    --expert-model-parallel-size 8
    ...
)
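
As a sanity check on the parallel sizes above, the sketch below derives the data-parallel size from the world size. It assumes the common Megatron-style layout where world size = TP × PP × CP × DP and where the expert-parallel size must divide DP. The exact validation in the Megatron version LoongForge pins may differ, and `data_parallel_size` is a hypothetical helper.

```python
def data_parallel_size(world_size, tp, pp, ep=1, cp=1):
    """Derive DP from the world size under an assumed Megatron-style layout."""
    model_parallel = tp * pp * cp
    if world_size % model_parallel != 0:
        raise ValueError(
            f"world size {world_size} is not divisible by TP*PP*CP = {model_parallel}"
        )
    dp = world_size // model_parallel
    # Assumed constraint: expert-parallel groups are carved out of DP ranks.
    if dp % ep != 0:
        raise ValueError(f"DP size {dp} is not divisible by EP size {ep}")
    return dp


# Two 8-GPU nodes with TP=1, PP=2, EP=8: 16 // (1 * 2) = 8 data-parallel replicas.
print(data_parallel_size(world_size=16, tp=1, pp=2, ep=8))

# A single 8-GPU node gives DP=4, which cannot host EP=8 under this assumption.
try:
    data_parallel_size(world_size=8, tp=1, pp=2, ep=8)
except ValueError as err:
    print("invalid layout:", err)
```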

The inner layer uses Hydra overrides to assign parallelism (TP / PP / EP / DP) and freeze behavior per model component, which suits heterogeneous VLM training where ViT and LLM backbones have very different compute profiles.

# Per-component overrides (Hydra): different TP / freeze per component
+model.image_encoder.tensor-model-parallel-size=1
+model.foundation.tensor-model-parallel-size=4
+model.image_encoder.freeze=True
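
To show how dotted overrides like these map onto per-component settings, here is a minimal stand-in parser. It is not Hydra's implementation (Hydra's override grammar is much richer). `parse_overrides` and `_coerce` are hypothetical helpers that handle only the `+a.b.c=value` shape used above.

```python
def _coerce(raw):
    """Turn 'True'/'False' into booleans and digit strings into ints."""
    if raw.lower() in ("true", "false"):
        return raw.lower() == "true"
    try:
        return int(raw)
    except ValueError:
        return raw


def parse_overrides(overrides):
    """Build a nested dict from '+dotted.key=value' override strings."""
    config = {}
    for item in overrides:
        key, sep, raw = item.lstrip("+").partition("=")
        if not sep:
            raise ValueError(f"override {item!r} has no '='")
        *parents, leaf = key.split(".")
        node = config
        for part in parents:
            node = node.setdefault(part, {})
        node[leaf] = _coerce(raw)
    return config


overrides = [
    "+model.image_encoder.tensor-model-parallel-size=1",
    "+model.foundation.tensor-model-parallel-size=4",
    "+model.image_encoder.freeze=True",
]
print(parse_overrides(overrides))
```

Each component ends up with its own sub-dictionary, so the image encoder and the language backbone can carry different tensor-parallel sizes and freeze flags.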

LoongForge ships example launch scripts for open-source models that you can run as-is. Browse examples/ (NVIDIA GPU) and examples_xpu/ (Kunlun XPU). The snippet below shows the common torchrun launch pattern shared across examples.

$ # e.g. examples/qwen3_vl/finetuning/sft_qwen3_vl_30b_a3b.sh
$ PYTHONPATH=$MEGATRON_PATH:$LOONGFORGE_PATH:$PYTHONPATH \
    torchrun --nproc_per_node 8 --nnodes $NNODES \
        $LOONGFORGE_PATH/loongforge/train.py \
        --config-file $LOONGFORGE_PATH/configs/models/qwen3_vl/qwen3_vl_30b_a3b.yaml \
        "${DATA_ARGS[@]}" "${TRAINING_ARGS[@]}" "${MOE_ARGS[@]}" "${MODEL_PARALLEL_ARGS[@]}" \
        +model.image_encoder.freeze=True

🌟 Powered by LoongForge

Open-source projects trained on LoongForge, ordered from newest to earliest

๐Ÿค Community

Built in the open โ€” join discussions, report issues, and contribute

License: Apache 2.0
🐉 Loong: Baige Loong Series