LoongForge v0.1.0 Release Notes
We are excited to announce the first public release of LoongForge, a modular, scalable, and highly efficient training framework for large-scale transformer models across diverse modalities and architectures. This initial release covers four modality types (LLM, VLM, VLA, Diffusion), with out-of-the-box support for Pretrain, MidTrain, SFT, and LoRA workflows.
📖 Documentation: loongforge.readthedocs.io/en/v0.1.0
Supported Models
LoongForge ships with a wide set of production-ready model configurations. Highlights are listed below; the complete list is in the Supported Models docs.
LLM
- LLaMA 2 / 3
- Qwen 2 / 2.5 / 3
- DeepSeek V2 / V3 / R1
- MiniMax
- GLM-4
- MIMO
VLM
- Qwen-VL / Qwen2-VL / Qwen2.5-VL
- InternVL 2 / 2.5
- ERNIE-4.5-VL
- LLaVA-OneVision-1.5
Diffusion & VLA
- WAN 2.2 (video diffusion)
- Pi0.5 (vision-language-action)
Key Features
⚡ Adaptive FP8 Training
End-to-end FP8 training for both LLMs and VLMs, with per-operator decisions based on GEMM shape: FP8 is enabled where it actually helps throughput, and higher precision is kept where FP8 would hurt numerics.
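As a rough illustration of shape-based dispatch, a per-GEMM rule might look like the sketch below. The function name, thresholds, and alignment constant are hypothetical, not LoongForge's actual policy; the idea is simply that FP8 pays off only when all GEMM dimensions are tensor-core aligned and large enough to amortize quantization overhead.

```python
# Hypothetical shape-based FP8 dispatch rule (illustrative only).

def use_fp8(m: int, n: int, k: int,
            min_dim: int = 1024, align: int = 16) -> bool:
    """Return True if a GEMM of shape (m, k) x (k, n) is likely to
    benefit from FP8: all dims aligned for FP8 tensor cores, and the
    problem large enough that cast/scale overhead amortizes."""
    aligned = all(d % align == 0 for d in (m, n, k))
    large_enough = min(m, n, k) >= min_dim
    return aligned and large_enough

# Large, aligned projection GEMM: FP8 pays off.
assert use_fp8(4096, 4096, 11008)
# Small or misaligned GEMMs: stay in BF16.
assert not use_fp8(128, 4096, 4096)
assert not use_fp8(4096, 4096, 1000)
```

A real policy would also consult measured kernel timings, but a static shape filter like this already avoids the worst FP8 regressions.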
🚀 MoE Optimization
Overlapped All2All communication, activation offload, and compute, designed to hold trillion-parameter MoE training within a tight memory budget while improving throughput over upstream Megatron-LM on the same workload.
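The core scheduling idea behind comm/compute overlap can be sketched in a few lines: issue the All2All dispatch for the next chunk asynchronously (on a separate stream) before running expert compute for the current one, so communication hides behind compute. The event names below are illustrative, not LoongForge's scheduler API.

```python
def overlap_schedule(n_chunks: int) -> list:
    """Build an event order where All2All dispatch for chunk i+1 is
    issued (async, on a comm stream) before expert compute for chunk i,
    so the communication is hidden behind the compute."""
    events = ["a2a_dispatch(0)"]
    for i in range(n_chunks):
        if i + 1 < n_chunks:
            events.append(f"a2a_dispatch({i + 1})")  # prefetch next chunk's tokens
        events.append(f"expert_compute({i})")
        events.append(f"a2a_combine({i})")
    return events

sched = overlap_schedule(3)
# Dispatch for chunk 1 is already in flight while chunk 0 computes.
assert sched.index("a2a_dispatch(1)") < sched.index("expert_compute(0)")
```

In a real implementation the two streams are synchronized with events so compute for chunk i+1 only starts once its tokens have arrived.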
🖼️ VLM Training
OmniCombinationModel decouples the vision encoder and the LLM backbone into composable components. Each side can choose its own TP/PP/DP sizes, and a pipeline scheduler orchestrates them. Swapping a ViT or an LLM becomes a YAML change, not a code change.
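A composable config along these lines might look like the fragment below. The key names and model identifiers are hypothetical, shown only to illustrate per-component parallelism settings; consult the docs for the actual schema.

```yaml
# Hypothetical config sketch: each component picks its own parallelism.
model:
  vision_encoder:
    name: qwen2-vl-vit        # swap the ViT here
    tensor_parallel: 1
    pipeline_parallel: 1
  llm_backbone:
    name: qwen2.5-7b          # swap the LLM here
    tensor_parallel: 4
    pipeline_parallel: 2
```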
🧠 Memory & Compute
System-wide memory optimizations, including activation offloading, recomputation planning, and fused kernels (e.g. FusedDSA, Sparse MLA), reduce peak memory footprint while improving compute utilization.
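Recomputation planning can be pictured as a simple trade: given per-layer activation sizes and a memory budget, mark activations for recomputation (largest first, since they free the most memory per extra forward pass) until the stored total fits. This greedy sketch is illustrative; it is not LoongForge's actual planner.

```python
def plan_recompute(act_bytes, budget):
    """Greedily mark the largest activations for recomputation until
    the stored total fits the budget. Returns layer indices to recompute."""
    recompute = set()
    stored = sum(act_bytes)
    # Largest activations first: most memory saved per recomputed layer.
    for idx in sorted(range(len(act_bytes)), key=lambda i: -act_bytes[i]):
        if stored <= budget:
            break
        recompute.add(idx)
        stored -= act_bytes[idx]
    return recompute

# Total 15 units stored, budget 6: drop the 8-unit and 4-unit activations.
plan = plan_recompute([4, 1, 8, 2], budget=6)
assert plan == {0, 2}
```

A production planner would also weigh the recompute FLOP cost of each layer, not just its activation size.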
💾 Checkpoint & LoRA
Native HuggingFace ↔ Megatron checkpoint conversion, plus online HuggingFace load/save via --save-hf true. LoRA adapters are first-class: attach, train, merge, ship.
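The "merge" step folds the low-rank adapter back into the base weight, which is the standard LoRA update W' = W + (alpha/r) * B A. The plain-Python sketch below (tiny matrices, no framework) shows the arithmetic; it is not LoongForge's merge code.

```python
def lora_merge(W, A, B, alpha, r):
    """Fold a LoRA adapter into the base weight: W' = W + (alpha/r) * B @ A.
    W is (out, in), B is (out, r), A is (r, in). Plain-Python sketch."""
    scale = alpha / r
    merged = [row[:] for row in W]
    for i in range(len(W)):
        for j in range(len(W[0])):
            delta = sum(B[i][k] * A[k][j] for k in range(r))
            merged[i][j] += scale * delta
    return merged

W = [[1.0, 0.0], [0.0, 1.0]]   # base weight (2x2)
B = [[1.0], [2.0]]             # (out=2, r=1)
A = [[3.0, 4.0]]               # (r=1, in=2)
merged = lora_merge(W, A, B, alpha=2, r=1)
assert merged == [[7.0, 8.0], [12.0, 17.0]]
```

After merging, the adapter can be discarded and the checkpoint shipped as an ordinary dense model.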
🔧 Custom Operators
High-performance CUDA/C++ fused operators live in ops/ and are dispatched through a hardware-abstracted layer, so the same training script can use GPU-native kernels or XPU-native equivalents transparently.
📦 Data Pipeline
Multimodal data pipeline with image/video processing plugins, sequence packing, length balancing, and data-parallel load balancing tuned for large multi-node training.
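Sequence packing is commonly implemented as a bin-packing heuristic such as first-fit-decreasing: sort sequences longest first, place each into the first pack with room, and open a new pack when none fits. The sketch below illustrates the heuristic; it is not LoongForge's actual packer.

```python
def pack_sequences(lengths, max_len):
    """First-fit-decreasing packing: place each sequence (longest first)
    into the first bin with room, opening a new bin when none fits."""
    bins = []  # each bin: [remaining_capacity, [seq_lengths]]
    for L in sorted(lengths, reverse=True):
        for b in bins:
            if b[0] >= L:
                b[0] -= L
                b[1].append(L)
                break
        else:
            bins.append([max_len - L, [L]])
    return [b[1] for b in bins]

packed = pack_sequences([700, 300, 512, 512, 200], max_len=1024)
assert all(sum(p) <= 1024 for p in packed)  # every pack fits the context
assert len(packed) == 3
```

Length balancing then distributes the resulting packs across data-parallel ranks so each rank sees a similar token count per step.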
🖥️ Heterogeneous Hardware
The same codebase runs on NVIDIA GPU and Baidu Kunlun XPU (P800). The initial release validates the flagship LLM and VLM families on both platforms, with a minimally intrusive plugin design under models/dispatch.py.
Contributors
Thanks to everyone who shipped this release. In addition to the listed GitHub contributors (@nullnonenilNULL, @VVsssssk, @XueSongTap, @Zachary-wW, @pengxiangyu, @NeverlanD0829, @Dreamspr22, @Yangsx-1, @kaimo455), we are grateful to 20+ internal developers and early contributors whose work laid the foundation for LoongForge.
Feedback & Bug Reports
This is our first public release, and we actively welcome community feedback. Open an issue on GitHub, file a PR, or reach us through the WeChat developer group linked from our README.
🚀 v0.1.0 is a beginning, not an arrival. Tell us what to build next.