A modular, scalable, high-performance training framework for LLMs, VLMs, diffusion, and embodied models, built on Megatron-LM with native NVIDIA GPU & Kunlun XPU support
A quick tour of what sets LoongForge apart
Measured in v0.1.0 on A800 across LLM, VLM, and VLA workloads
One codebase, two silicon stacks: production-ready on NVIDIA GPU and Baidu Kunlun XPU
Built on the community Megatron + TransformerEngine ecosystem, with LoongForge optimizations layered on top.
The XPU plugin mechanism shields the upper stack from hardware adaptation differences and integrates an XPU-specific optimization toolchain.
From compact SLMs to large-scale MoE giants, all batteries-included
Compose any ViT + any LLM backbone via a YAML file.
YAML-driven: a few steps from install to launch
Docker is the recommended path: a single image bundles the CUDA/XPU toolchains, the patched Megatron submodule, and TransformerEngine, so every developer and every node trains from the same environment. Source install is also fully supported for advanced setups.
$ # NVIDIA GPU (Docker)
$ git clone --recurse-submodules https://github.com/baidu-baige/LoongForge.git
$ docker build --build-arg COMPILE_ENV=hopper --build-arg ENABLE_LEROBOT=false -t loongforge:latest -f ./LoongForge/docker/Dockerfile .
$ # Kunlun XPU (Docker)
$ docker build --build-arg BASE_IMAGE=loongforge/loongforge_kunlun:py310_torch25 --build-arg ENABLE_LEROBOT=false -t loongforge-kunlun:latest -f LoongForge/docker/Dockerfile.xpu .
Optional: only needed for custom combinations. LoongForge uses declarative configs to compose different
modality components into a full multimodal model. Take qwen3_vl_30b_a3b as an
example: a single YAML assembles the vision encoder, projector, and language backbone. To swap the language
backbone to DeepSeek V3, change one line under model.foundation. Mainstream
open-source models are already built in under configs/models/ for you to pick from.
# configs/models/qwen3_vl/qwen3_vl_30b_a3b.yaml
defaults:
  - ../../models/image_encoder@model.image_encoder: qwen3_vit
  - ../../models/image_projector@model.image_projector: qwen_mlp_adapter
  - ../../models/qwen3@model.foundation: qwen3_30b_a3b
  - _self_

model:
  model_type: qwen3_vl
  ...

# Swap the language backbone with a one-line change:
-  - ../../models/qwen3@model.foundation: qwen3_30b_a3b
+  - ../../models/deepseek3@model.foundation: deepseek_v3
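Each entry in the defaults list follows the Hydra pattern `group@package: option`, so swapping a component means replacing the entry whose package is `model.foundation`. The sketch below illustrates that idea on plain Python data. The helper names `parse_defaults_entry` and `swap_component` are hypothetical and are not part of LoongForge or Hydra; in practice you would simply edit the YAML file.

```python
def parse_defaults_entry(entry):
    """Split a one-key defaults entry into (config_group, package, option)."""
    (key, option), = entry.items()
    group, _, package = key.partition("@")
    return group, package, option


def swap_component(defaults, package, new_group, new_option):
    """Return a copy of `defaults` with the entry for `package` replaced."""
    swapped = []
    for entry in defaults:
        # Plain strings such as "_self_" are passed through unchanged.
        if isinstance(entry, dict) and parse_defaults_entry(entry)[1] == package:
            swapped.append({f"{new_group}@{package}": new_option})
        else:
            swapped.append(entry)
    return swapped


# The defaults list from qwen3_vl_30b_a3b.yaml, as Python data.
defaults = [
    {"../../models/image_encoder@model.image_encoder": "qwen3_vit"},
    {"../../models/image_projector@model.image_projector": "qwen_mlp_adapter"},
    {"../../models/qwen3@model.foundation": "qwen3_30b_a3b"},
    "_self_",
]

swapped = swap_component(
    defaults, "model.foundation", "../../models/deepseek3", "deepseek_v3"
)
for entry in swapped:
    print(entry)
```

Only the `model.foundation` entry changes; the vision encoder and projector entries are left untouched, which is what makes the swap a single-line edit.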
LoongForge supports offline conversion of HuggingFace weights into the Megatron training format, and also supports loading HuggingFace-format weights directly at startup, skipping the conversion step. On completion, weights can be exported back to HF format with one flag, for seamless hand-off to the downstream community ecosystem.
TRAINING_ARGS=(
--load $CHECKPOINT_PATH # point directly at the HF model directory
--save $CHECKPOINT_PATH # high-performance training checkpoints
--save-interval 40
--save-hf true # export HF weights on finish
--save-hf-path /path/to/output
...
)
LoongForge ships a built-in data preprocessing toolchain that converts your raw data into the framework-compatible format. Below is a multimodal data preprocessing example; refer to the per-model-family user guide for details.
$ python tools/data_preprocess/vlm/convert_to_webdataset.py \
    --output_dir /workspace/wds_data/ \
    --json_file tests/datasets/vlm/mllm_demo.json \
    --image_dir tests/datasets/vlm/ \
    --video_dir tests/datasets/vlm/ \
    --media mix \
    --columns_messages messages \
    --maxcount 10000 \
    --maxsize 3000000000 \
    --sample_type multi_mix_qa
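The `--maxcount` and `--maxsize` values look like WebDataset shard limits. Assuming the script applies them the way WebDataset's `ShardWriter` does (an assumption, not something this page confirms), a new shard is opened once the current one already holds `maxcount` samples or `maxsize` bytes. The sketch below models only that rollover rule; `plan_shards` is a hypothetical helper, not part of the toolchain.

```python
def plan_shards(sample_sizes, maxcount=10000, maxsize=3_000_000_000):
    """Group sample sizes (in bytes) into shards using a ShardWriter-style rule.

    The limits are checked before each write, so a shard may end up slightly
    larger than `maxsize`: the sample that crosses the limit still lands in it.
    """
    shards = []
    count = size = 0
    for nbytes in sample_sizes:
        if not shards or count >= maxcount or size >= maxsize:
            shards.append([])  # roll over to a new shard
            count = size = 0
        shards[-1].append(nbytes)
        count += 1
        size += nbytes
    return shards


# Tiny limits so the rollover is visible.
by_count = plan_shards([4, 4, 4, 4, 4], maxcount=2, maxsize=100)
by_size = plan_shards([60, 60, 10], maxcount=10, maxsize=100)
print(by_count)
print(by_size)
```

Under this rule the defaults in the command above would cap each shard at 10,000 samples or roughly 3 GB, whichever is reached first.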
The outer layer is fully Megatron-compatible: familiar training arguments can be reused as-is.
TRAINING_ARGS=(
--training-phase sft
--seq-length 32768
--micro-batch-size 1
...
)
MODEL_PARALLEL_ARGS=(
--tensor-model-parallel-size 1
--pipeline-model-parallel-size 2
--expert-model-parallel-size 8
...
)
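As a rough check on how these sizes interact, the sketch below derives the data-parallel size from the GPU count. It assumes the usual Megatron relationship (world size = TP x PP x DP, with context parallelism left at 1) and assumes the expert-parallel size must divide the data-parallel size; verify both against the Megatron version you run. The function name `data_parallel_size` is hypothetical.

```python
def data_parallel_size(world_size, tp, pp, ep=1):
    """Return the data-parallel size implied by world size, TP, and PP.

    Assumptions (not taken from the LoongForge docs): context parallelism
    is 1, and expert parallelism is carved out of the data-parallel ranks.
    """
    model_parallel = tp * pp
    if world_size % model_parallel != 0:
        raise ValueError(
            f"world_size={world_size} is not divisible by TP*PP={model_parallel}"
        )
    dp = world_size // model_parallel
    if dp % ep != 0:
        raise ValueError(f"EP={ep} does not divide the data-parallel size {dp}")
    return dp


# The settings above (TP=1, PP=2, EP=8) on a hypothetical 2 nodes x 8 GPUs.
dp = data_parallel_size(world_size=16, tp=1, pp=2, ep=8)
print(dp)
```

Under these assumptions the same settings would not fit on a single 8-GPU node, because the data-parallel size would be 4 and could not host 8 expert-parallel ranks.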
The inner layer uses Hydra overrides to assign parallelism (TP / PP / EP / DP) and freeze behavior per model component, ideal for heterogeneous VLM training where ViT and LLM backbones have very different compute profiles.
# Per-component overrides (Hydra): different TP / freeze per component
+model.image_encoder.tensor-model-parallel-size=1
+model.foundation.tensor-model-parallel-size=4
+model.image_encoder.freeze=True
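To make the effect of these overrides concrete, the sketch below folds `+key=value` strings into a nested dictionary, so each component ends up with its own settings. It is a simplified stand-in for Hydra's override parser, which handles many more cases, and `apply_overrides` and `parse_value` are hypothetical helper names.

```python
def parse_value(raw):
    """Interpret booleans and integers; leave anything else as a string."""
    if raw.lower() in ("true", "false"):
        return raw.lower() == "true"
    try:
        return int(raw)
    except ValueError:
        return raw


def apply_overrides(overrides):
    """Fold dotted '+key=value' overrides into a nested dict."""
    cfg = {}
    for item in overrides:
        # A leading '+' in Hydra means "add this key"; here it is just stripped.
        key, _, raw = item.lstrip("+").partition("=")
        *parents, leaf = key.split(".")
        node = cfg
        for name in parents:
            node = node.setdefault(name, {})
        node[leaf] = parse_value(raw)
    return cfg


cfg = apply_overrides([
    "+model.image_encoder.tensor-model-parallel-size=1",
    "+model.foundation.tensor-model-parallel-size=4",
    "+model.image_encoder.freeze=True",
])
print(cfg)
```

The result groups the settings by component: the image encoder gets TP 1 and is frozen, while the foundation model gets TP 4.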
LoongForge ships example launch scripts for open-source models that you can run as-is. Browse
examples/ (NVIDIA GPU) and examples_xpu/ (Kunlun XPU). The
snippet below shows the common torchrun launch pattern shared across examples.
$ # e.g. examples/qwen3_vl/finetuning/sft_qwen3_vl_30b_a3b.sh
$ PYTHONPATH=$MEGATRON_PATH:$LOONGFORGE_PATH:$PYTHONPATH \
    torchrun --nproc_per_node 8 --nnodes $NNODES \
    $LOONGFORGE_PATH/loongforge/train.py \
    --config-file $LOONGFORGE_PATH/configs/models/qwen3_vl/qwen3_vl_30b_a3b.yaml \
    "${DATA_ARGS[@]}" "${TRAINING_ARGS[@]}" "${MOE_ARGS[@]}" "${MODEL_PARALLEL_ARGS[@]}" \
    +model.image_encoder.freeze=True
Open-source projects trained on LoongForge, ordered from newest to oldest
Built in the open: join discussions, report issues, and contribute
File bug reports and feature requests.
Ask questions and share experiences.
Read the guide and send your first PR.
Scan the QR code in our README to join the developer group.