LoongForge

A modular, scalable, high-performance training framework for LLMs, VLMs, diffusion, and embodied models — built on Megatron-LM with native NVIDIA GPU & Kunlun XPU support

🧩 Easy

One framework, broad coverage

Full coverage of mainstream open-source LLMs, VLMs, MoE, diffusion, and VLA models. Ready-to-run configs and launch scripts included.

⚡ Efficient

Up to ~5× training speedup

Deep performance optimizations — fused kernels, adaptive FP8, MoE A2A overlap, and multimodal pipeline scheduling.

💎 Multi-chip

NVIDIA GPU & Kunlun XPU

Native heterogeneous hardware support — one framework, minimal migration between GPU and XPU.

🚀 Quick Start

YAML-driven — a few steps from install to launch

Docker is the recommended path — a single image bundles CUDA/XPU toolchains, the patched Megatron submodule, and TransformerEngine, so every developer and every node trains from the same environment. Source install is also fully supported for advanced setups.

$ # NVIDIA GPU (Docker)
$ git clone --recurse-submodules https://github.com/baidu-baige/LoongForge.git
$ docker build --build-arg COMPILE_ENV=hopper --build-arg ENABLE_LEROBOT=false -t loongforge:latest -f ./LoongForge/docker/Dockerfile .

$ # Kunlun XPU (Docker)
$ docker build --build-arg BASE_IMAGE=loongforge/loongforge_kunlun:py310_torch25 --build-arg ENABLE_LEROBOT=false -t loongforge-kunlun:latest -f LoongForge/docker/Dockerfile.xpu .

This step is optional and only needed for custom model combinations. LoongForge uses declarative configs to compose different modality components into a full multimodal model. Take qwen3_vl_30b_a3b as an example: a single YAML assembles the vision encoder, projector, and language backbone. To swap the language backbone to DeepSeek V3, change the one defaults entry that targets model.foundation. Mainstream open-source models are already built in under configs/models/ for you to pick from.

# configs/models/qwen3_vl/qwen3_vl_30b_a3b.yaml
defaults:
  - ../../models/image_encoder@model.image_encoder: qwen3_vit
  - ../../models/image_projector@model.image_projector: qwen_mlp_adapter
  - ../../models/qwen3@model.foundation: qwen3_30b_a3b
  - _self_

model:
  model_type: qwen3_vl
  ...

# Swap the language backbone — one line change:
-  - ../../models/qwen3@model.foundation: qwen3_30b_a3b
+  - ../../models/deepseek3@model.foundation: deepseek_v3
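
When many backbone variants are needed, the one-line swap above can be scripted. The following is a minimal sketch and not part of LoongForge: `swap_foundation` is a hypothetical helper that rewrites the `@model.foundation` entry in the YAML text, and it assumes the entry keeps the `../../models/<family>@model.foundation: <name>` shape shown above.

```python
import re

# Hypothetical helper (not a LoongForge API): matches the defaults entry
# that binds a config group to model.foundation.
_FOUNDATION = re.compile(
    r"^(?P<indent>\s*-\s*)\.\./\.\./models/[^@\s]+@model\.foundation:\s*\S+\s*$"
)


def swap_foundation(yaml_text: str, family: str, name: str) -> str:
    """Return yaml_text with the model.foundation default replaced."""
    out, hits = [], 0
    for line in yaml_text.splitlines():
        m = _FOUNDATION.match(line)
        if m:
            line = f"{m.group('indent')}../../models/{family}@model.foundation: {name}"
            hits += 1
        out.append(line)
    if hits != 1:
        # Refuse to guess when the entry is missing or duplicated.
        raise ValueError(f"expected exactly one model.foundation entry, found {hits}")
    return "\n".join(out) + "\n"


# The defaults list from the qwen3_vl_30b_a3b example above.
EXAMPLE = """\
defaults:
  - ../../models/image_encoder@model.image_encoder: qwen3_vit
  - ../../models/image_projector@model.image_projector: qwen_mlp_adapter
  - ../../models/qwen3@model.foundation: qwen3_30b_a3b
  - _self_
"""

if __name__ == "__main__":
    print(swap_foundation(EXAMPLE, "deepseek3", "deepseek_v3"))
```

The vision encoder and projector entries are left untouched, so only the language backbone changes.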

LoongForge supports offline conversion of HuggingFace (HF) weights into the Megatron training format. It can also load HF-format weights directly at startup, which skips the conversion step. When training finishes, setting --save-hf true exports the weights back to HF format under --save-hf-path, ready for hand-off to the downstream community ecosystem.

TRAINING_ARGS=(
    --load $CHECKPOINT_PATH            # point directly at the HF model directory
    --save $CHECKPOINT_PATH            # high-performance training checkpoints
    --save-interval 40
    --save-hf true                     # export HF weights on finish
    --save-hf-path /path/to/output
    ...
)
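
Because --load can point directly at an HF model directory, a quick sanity check before launch can catch a wrong path early. The sketch below is a heuristic and not part of LoongForge: `looks_like_hf_checkpoint` is a hypothetical helper, and it assumes the common HF layout of a `config.json` next to one or more `*.safetensors` or `*.bin` weight files.

```python
import tempfile
from pathlib import Path


def looks_like_hf_checkpoint(path: str) -> bool:
    """Heuristic check (assumption: config.json plus at least one weight file)."""
    root = Path(path)
    if not (root / "config.json").is_file():
        return False
    weights = [p for pattern in ("*.safetensors", "*.bin") for p in root.glob(pattern)]
    return bool(weights)


if __name__ == "__main__":
    # Demonstrate on a throwaway directory with placeholder files.
    with tempfile.TemporaryDirectory() as tmp:
        print(looks_like_hf_checkpoint(tmp))  # empty directory
        (Path(tmp) / "config.json").write_text("{}")
        (Path(tmp) / "model.safetensors").write_bytes(b"")
        print(looks_like_hf_checkpoint(tmp))  # config plus one weight file
```

The check only inspects file names. It does not validate the weights or confirm that LoongForge supports the model architecture.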

LoongForge ships a built-in data preprocessing toolchain that converts your raw data into the framework-compatible format. Below is a multimodal data preprocessing example — refer to the per-model-family user guide for details.

$ python tools/data_preprocess/vlm/convert_to_webdataset.py \
    --output_dir /workspace/wds_data/ \
    --json_file tests/datasets/vlm/mllm_demo.json \
    --image_dir tests/datasets/vlm/ \
    --video_dir tests/datasets/vlm/ \
    --media mix \
    --columns_messages messages \
    --maxcount 10000 \
    --maxsize 3000000000 \
    --sample_type multi_mix_qa
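
The --maxcount and --maxsize options bound how many samples and how many bytes go into each WebDataset shard. The sketch below estimates the resulting shard count. It assumes the script follows the rollover rule of the webdataset library's `ShardWriter`, which starts a new shard once the current one has reached either limit; that is an assumption about the script, not something confirmed here. `count_shards` is a hypothetical helper, the sample sizes are made up, and tar header overhead is ignored.

```python
from typing import Iterable


def count_shards(sample_sizes: Iterable[int], maxcount: int, maxsize: int) -> int:
    """Count shards under a 'roll over once either limit is reached' rule."""
    shards = 0
    count = size = 0
    open_shard = False
    for nbytes in sample_sizes:
        # A new shard is opened before writing if none is open yet or the
        # current shard has already hit the sample or byte limit.
        if not open_shard or count >= maxcount or size >= maxsize:
            shards += 1
            count = size = 0
            open_shard = True
        count += 1
        size += nbytes
    return shards


if __name__ == "__main__":
    # 25,000 made-up samples of 200 KB each, with the limits from the command above.
    sizes = [200_000] * 25_000
    print(count_shards(sizes, maxcount=10_000, maxsize=3_000_000_000))
```

With small samples the count limit triggers first, and with large media files the byte limit does, so both values are worth tuning to the dataset.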

Training is configured in two layers. The outer layer is fully Megatron-compatible, so familiar training arguments can be reused as-is.

TRAINING_ARGS=(
    --training-phase sft
    --seq-length 32768
    --micro-batch-size 1
    ...
)

MODEL_PARALLEL_ARGS=(
    --tensor-model-parallel-size 1
    --pipeline-model-parallel-size 2
    --expert-model-parallel-size 8
    ...
)
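
The parallel sizes above constrain how many GPUs a job needs. As a worked sketch, the code below uses the standard Megatron-LM relation DP = world_size / (TP × PP × CP). It also assumes the classic constraint that the expert-parallel size divides the data-parallel size; newer Megatron-Core releases decouple expert parallelism and may accept other layouts. `check_layout` is a hypothetical helper, not a LoongForge or Megatron function.

```python
def data_parallel_size(world_size: int, tp: int, pp: int, cp: int = 1) -> int:
    """DP = world_size / (TP * PP * CP); the division must be exact."""
    model_parallel = tp * pp * cp
    if world_size % model_parallel != 0:
        raise ValueError(
            f"world_size={world_size} is not divisible by TP*PP*CP={model_parallel}"
        )
    return world_size // model_parallel


def check_layout(world_size: int, tp: int, pp: int, ep: int, cp: int = 1) -> int:
    """Return the DP size, or raise if the layout does not fit."""
    dp = data_parallel_size(world_size, tp, pp, cp)
    # Assumption: expert-parallel groups are carved out of the data-parallel
    # ranks, so EP must divide DP.
    if dp % ep != 0:
        raise ValueError(f"EP={ep} does not divide DP={dp}")
    return dp


if __name__ == "__main__":
    # TP=1, PP=2, EP=8 as in MODEL_PARALLEL_ARGS above, on 2 nodes x 8 GPUs.
    print(check_layout(world_size=16, tp=1, pp=2, ep=8))
```

Under these assumptions, a single 8-GPU node yields DP=4, which EP=8 does not divide, so the example layout needs at least 16 GPUs.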

The inner layer uses Hydra overrides to assign parallelism (TP / PP / EP / DP) and freeze behavior per model component — ideal for heterogeneous VLM training where ViT and LLM backbones have very different compute profiles.

# Per-component overrides (Hydra): different TP / freeze per component
+model.image_encoder.tensor-model-parallel-size=1
+model.foundation.tensor-model-parallel-size=4
+model.image_encoder.freeze=True
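
When launch scripts are generated programmatically, the override strings above can be built from a per-component mapping. This is a minimal sketch: `hydra_overrides` is a hypothetical helper, and it assumes every override follows the `+model.<component>.<key>=<value>` pattern shown above.

```python
def hydra_overrides(per_component: dict[str, dict[str, object]]) -> list[str]:
    """Build '+model.<component>.<key>=<value>' override strings."""
    args = []
    for component, settings in per_component.items():
        for key, value in settings.items():
            args.append(f"+model.{component}.{key}={value}")
    return args


if __name__ == "__main__":
    overrides = hydra_overrides(
        {
            # A small ViT rarely needs tensor parallelism and is frozen here.
            "image_encoder": {"tensor-model-parallel-size": 1, "freeze": True},
            # The language backbone gets a larger TP degree.
            "foundation": {"tensor-model-parallel-size": 4},
        }
    )
    print(" ".join(overrides))
```

The resulting strings are appended after the Megatron-style argument arrays on the launch command line.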

LoongForge ships example launch scripts for open-source models that you can run as-is. Browse examples/ (NVIDIA GPU) and examples_xpu/ (Kunlun XPU). The snippet below shows the common torchrun launch pattern shared across examples.

$ # e.g. examples/qwen3_vl/finetuning/sft_qwen3_vl_30b_a3b.sh
$ PYTHONPATH=$MEGATRON_PATH:$LOONGFORGE_PATH:$PYTHONPATH \
    torchrun --nproc_per_node 8 --nnodes $NNODES \
        $LOONGFORGE_PATH/loongforge/train.py \
        --config-file $LOONGFORGE_PATH/configs/models/qwen3_vl/qwen3_vl_30b_a3b.yaml \
        "${DATA_ARGS[@]}" "${TRAINING_ARGS[@]}" "${MOE_ARGS[@]}" "${MODEL_PARALLEL_ARGS[@]}" \
        +model.image_encoder.freeze=True

📖 Full runnable tutorials: LLM · VLM · VLA · Diffusion · Kunlun XPU

Browse configs/models/ · examples/ · examples_xpu/

💎 Hardware Compatibility

NVIDIA GPU

Kunlun XPU

🏛️ Supported Models

DeepSeek-V2

DeepSeek-V3

DeepSeek-V3.2

LLaMA2

LLaMA3

LLaMA3.1

Qwen

Qwen1.5

Qwen2

Qwen2.5

Qwen3

Qwen3-Next

MiniMax

MIMO

GLM

Qwen2.5-VL

Qwen3-VL

Qwen3.5

Qwen3.6

ERNIE4.5-VL

LLaVA-OneVision-1.5

InternVL2.5

InternVL3.5

CustomCombinedModel

WAN2.2

Pi

GR00T

🌟 Powered by LoongForge

LLaVA-OneVision-2.0

LLaVA-OneVision-1.5

Qianfan-VL

🤝 Community

GitHub Issues

Discussions

Contributing

Join us on WeChat