LoongForge v0.1.0 Release Notes
We are excited to announce the first public release of LoongForge, a modular, scalable, and highly efficient training framework for large-scale transformer models across diverse modalities and architectures. This initial release covers four modality types (LLM, VLM, VLA, Diffusion), with out-of-the-box support for Pretrain, MidTrain, SFT, and LoRA workflows.
📖 Documentation: loongforge.readthedocs.io/en/v0.1.0
Supported Models
LoongForge ships with a wide set of production-ready model configurations. Highlights are listed below; the complete list is in the Supported Models docs.
LLM
- LLaMA 2 / 3
- Qwen 2 / 2.5 / 3
- DeepSeek V2 / V3 / R1
- MiniMax
- GLM-4
- MIMO
VLM
- Qwen-VL / Qwen2-VL / Qwen2.5-VL
- InternVL 2 / 2.5
- ERNIE-4.5-VL
- LLaVA-OneVision-1.5
Diffusion & VLA
- WAN 2.2 (video diffusion)
- Pi0.5 (vision-language-action)
Key Features
⚡ Adaptive FP8 Training
End-to-end FP8 training for both LLMs and VLMs, with per-operator decisions based on GEMM shape: FP8 is enabled where it actually helps throughput, and higher precision is kept where FP8 would hurt numerics.
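As a rough illustration of shape-based dispatch, a per-GEMM rule might look like the sketch below. The function name, thresholds, and alignment constant are hypothetical, not LoongForge's actual policy; the idea is simply that FP8 pays off only when all GEMM dimensions are tensor-core aligned and large enough to amortize quantization overhead.

```python
# Hypothetical shape-based FP8 dispatch rule (illustrative only).

def use_fp8(m: int, n: int, k: int,
            min_dim: int = 1024, align: int = 16) -> bool:
    """Return True if a GEMM of shape (m, k) x (k, n) is likely to
    benefit from FP8: all dims aligned for FP8 tensor cores, and the
    problem large enough that cast/scale overhead amortizes."""
    aligned = all(d % align == 0 for d in (m, n, k))
    large_enough = min(m, n, k) >= min_dim
    return aligned and large_enough

# Large, aligned projection GEMM: FP8 pays off.
assert use_fp8(4096, 4096, 11008)
# Small or misaligned GEMMs: stay in BF16.
assert not use_fp8(128, 4096, 4096)
assert not use_fp8(4096, 4096, 1000)
```

A real policy would also consult measured kernel timings, but a static shape filter like this already avoids the worst FP8 regressions.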
🚀 MoE Optimization
Overlapped All2All communication, activation offload, and compute, designed to hold trillion-parameter MoE training within a tight memory budget while improving throughput over upstream Megatron-LM on the same workload.
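The core scheduling idea behind comm/compute overlap can be sketched in a few lines: issue the All2All dispatch for the next chunk asynchronously (on a separate stream) before running expert compute for the current one, so communication hides behind compute. The event names below are illustrative, not LoongForge's scheduler API.

```python
def overlap_schedule(n_chunks: int) -> list:
    """Build an event order where All2All dispatch for chunk i+1 is
    issued (async, on a comm stream) before expert compute for chunk i,
    so the communication is hidden behind the compute."""
    events = ["a2a_dispatch(0)"]
    for i in range(n_chunks):
        if i + 1 < n_chunks:
            events.append(f"a2a_dispatch({i + 1})")  # prefetch next chunk's tokens
        events.append(f"expert_compute({i})")
        events.append(f"a2a_combine({i})")
    return events

sched = overlap_schedule(3)
# Dispatch for chunk 1 is already in flight while chunk 0 computes.
assert sched.index("a2a_dispatch(1)") < sched.index("expert_compute(0)")
```

In a real implementation the two streams are synchronized with events so compute for chunk i+1 only starts once its tokens have arrived.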
🖼️ VLM Training
OmniCombinationModel decouples the vision encoder and the LLM backbone into composable components. Each side can choose its own TP/PP/DP sizes, and a pipeline scheduler orchestrates them. Swapping a ViT or an LLM becomes a YAML change, not a code change.
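A composable config along these lines might look like the fragment below. The key names and model identifiers are hypothetical, shown only to illustrate per-component parallelism settings; consult the docs for the actual schema.

```yaml
# Hypothetical config sketch: each component picks its own parallelism.
model:
  vision_encoder:
    name: qwen2-vl-vit        # swap the ViT here
    tensor_parallel: 1
    pipeline_parallel: 1
  llm_backbone:
    name: qwen2.5-7b          # swap the LLM here
    tensor_parallel: 4
    pipeline_parallel: 2
```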
🧠 Memory & Compute
System-wide memory optimizations, including activation offloading, recomputation planning, and fused kernels (e.g. FusedDSA, Sparse MLA), reduce peak memory footprint while improving compute utilization.
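Recomputation planning can be pictured as a simple trade: given per-layer activation sizes and a memory budget, mark activations for recomputation (largest first, since they free the most memory per extra forward pass) until the stored total fits. This greedy sketch is illustrative; it is not LoongForge's actual planner.

```python
def plan_recompute(act_bytes, budget):
    """Greedily mark the largest activations for recomputation until
    the stored total fits the budget. Returns layer indices to recompute."""
    recompute = set()
    stored = sum(act_bytes)
    # Largest activations first: most memory saved per recomputed layer.
    for idx in sorted(range(len(act_bytes)), key=lambda i: -act_bytes[i]):
        if stored <= budget:
            break
        recompute.add(idx)
        stored -= act_bytes[idx]
    return recompute

# Total 15 units stored, budget 6: drop the 8-unit and 4-unit activations.
plan = plan_recompute([4, 1, 8, 2], budget=6)
assert plan == {0, 2}
```

A production planner would also weigh the recompute FLOP cost of each layer, not just its activation size.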
💾 Checkpoint & LoRA
Native HuggingFace ↔ Megatron checkpoint conversion, plus online HuggingFace load/save via --save-hf true. LoRA adapters are first-class: attach, train, merge, ship.
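The "merge" step folds the low-rank adapter back into the base weight, which is the standard LoRA update W' = W + (alpha/r) * B A. The plain-Python sketch below (tiny matrices, no framework) shows the arithmetic; it is not LoongForge's merge code.

```python
def lora_merge(W, A, B, alpha, r):
    """Fold a LoRA adapter into the base weight: W' = W + (alpha/r) * B @ A.
    W is (out, in), B is (out, r), A is (r, in). Plain-Python sketch."""
    scale = alpha / r
    merged = [row[:] for row in W]
    for i in range(len(W)):
        for j in range(len(W[0])):
            delta = sum(B[i][k] * A[k][j] for k in range(r))
            merged[i][j] += scale * delta
    return merged

W = [[1.0, 0.0], [0.0, 1.0]]   # base weight (2x2)
B = [[1.0], [2.0]]             # (out=2, r=1)
A = [[3.0, 4.0]]               # (r=1, in=2)
merged = lora_merge(W, A, B, alpha=2, r=1)
assert merged == [[7.0, 8.0], [12.0, 17.0]]
```

After merging, the adapter can be discarded and the checkpoint shipped as an ordinary dense model.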
🔧 Custom Operators
High-performance CUDA/C++ fused operators live in ops/ and are dispatched through a hardware-abstracted layer, so the same training script can use GPU-native kernels or XPU-native equivalents transparently.
📦 Data Pipeline
Multimodal data pipeline with image/video processing plugins, sequence packing, length balancing, and data-parallel load balancing tuned for large multi-node training.
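Sequence packing is commonly implemented as a bin-packing heuristic such as first-fit-decreasing: sort sequences longest first, place each into the first pack with room, and open a new pack when none fits. The sketch below illustrates the heuristic; it is not LoongForge's actual packer.

```python
def pack_sequences(lengths, max_len):
    """First-fit-decreasing packing: place each sequence (longest first)
    into the first bin with room, opening a new bin when none fits."""
    bins = []  # each bin: [remaining_capacity, [seq_lengths]]
    for L in sorted(lengths, reverse=True):
        for b in bins:
            if b[0] >= L:
                b[0] -= L
                b[1].append(L)
                break
        else:
            bins.append([max_len - L, [L]])
    return [b[1] for b in bins]

packed = pack_sequences([700, 300, 512, 512, 200], max_len=1024)
assert all(sum(p) <= 1024 for p in packed)  # every pack fits the context
assert len(packed) == 3
```

Length balancing then distributes the resulting packs across data-parallel ranks so each rank sees a similar token count per step.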
🖥️ Heterogeneous Hardware
The same codebase runs on NVIDIA GPU and Baidu Kunlun XPU (P800). The initial release validates the flagship LLM and VLM families on both platforms, with a minimally intrusive plugin design under models/dispatch.py.
Contributors
Thanks to everyone who shipped this release. In addition to the listed GitHub contributors (@nullnonenilNULL, @VVsssssk, @XueSongTap, @Zachary-wW, @pengxiangyu, @NeverlanD0829, @Dreamspr22, @Yangsx-1, @kaimo455), we are grateful to 20+ internal developers and early contributors whose work laid the foundation for LoongForge.
Feedback & Bug Reports
This is our first public release, and we actively welcome community feedback. Open an issue on GitHub, file a PR, or reach us through the WeChat developer group linked from our README.
🚀 v0.1.0 is a beginning, not an arrival. Tell us what to build next.