โ† Back to blog
Releasev0.1.0

LoongForge v0.1.0 Release Notes

2026-05-09 · The LoongForge Team

We are excited to announce the first public release of LoongForge: a modular, scalable, and highly efficient training framework for large-scale transformer models across diverse modalities and architectures. This initial release covers four modality types (LLM, VLM, VLA, Diffusion), with out-of-the-box support for Pretrain, MidTrain, SFT, and LoRA workflows.

📖 Documentation: loongforge.readthedocs.io/en/v0.1.0

Supported Models

LoongForge ships with a wide set of production-ready model configurations, grouped into three categories: LLM, VLM, and Diffusion & VLA. The complete list is in the Supported Models docs.

Key Features

⚡ Adaptive FP8 Training

End-to-end FP8 training for both LLMs and VLMs, with per-operator precision decisions based on GEMM shape: FP8 is enabled where it actually helps throughput, and higher precision is kept where FP8 would hurt numerics.
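To make the per-operator idea concrete, here is a hypothetical shape-based precision heuristic. The function name, thresholds, and rules are invented for illustration; LoongForge's actual policy is not described in these notes.

```python
# Hypothetical sketch of a per-GEMM precision heuristic. The thresholds
# and alignment rule are illustrative, not LoongForge's real policy.

def choose_gemm_precision(m: int, n: int, k: int,
                          min_dim: int = 512,
                          align: int = 16) -> str:
    """Pick FP8 only when the GEMM is large and well-aligned.

    Small or misaligned GEMMs tend to lose more to quantization
    overhead and numerics than they gain in throughput.
    """
    if min(m, n, k) < min_dim:
        return "bf16"          # too small: FP8 overhead dominates
    if any(d % align for d in (m, n, k)):
        return "bf16"          # misaligned: tensor cores underutilized
    return "fp8"

print(choose_gemm_precision(4096, 4096, 4096))  # fp8
print(choose_gemm_precision(128, 4096, 4096))   # bf16
```

A real implementation would consult measured kernel timings rather than fixed thresholds, but the dispatch shape is the same: a pure function of the GEMM dimensions chosen per operator.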

🔀 MoE Optimization

Overlapped All2All communication, activation offload, and compute, designed to keep trillion-parameter MoE training within a tight memory budget while delivering higher throughput than upstream Megatron-LM on the same workload.
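The overlap pattern can be sketched in plain Python: while chunk i is being computed, chunk i+1's all-to-all runs on a background worker. The function names and the stand-in "collective" are invented; real MoE training would use CUDA streams and NCCL collectives instead of threads.

```python
# Illustrative double-buffered overlap of (fake) all-to-all communication
# with (fake) expert compute. All names here are hypothetical stand-ins.
from concurrent.futures import ThreadPoolExecutor

def all_to_all(chunk):          # stand-in for the real collective
    return [x * 2 for x in chunk]

def expert_compute(chunk):      # stand-in for expert MLP compute
    return [x + 1 for x in chunk]

def overlapped_moe(chunks):
    out = []
    with ThreadPoolExecutor(max_workers=1) as comm:
        fut = comm.submit(all_to_all, chunks[0])   # prefetch chunk 0
        for i in range(len(chunks)):
            ready = fut.result()                   # wait for comm of chunk i
            if i + 1 < len(chunks):
                fut = comm.submit(all_to_all, chunks[i + 1])
            out.append(expert_compute(ready))      # overlaps with that comm
    return out

print(overlapped_moe([[1, 2], [3, 4]]))  # [[3, 5], [7, 9]]
```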

๐Ÿ–ผ๏ธ VLM Training

OmniCombinationModel decouples vision encoder and LLM backbone into composable components. Each side can choose its own TP/PP/DP sizes, and a pipeline scheduler orchestrates them. Swapping a ViT or an LLM becomes a YAML change, not a code change.
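As an illustration of "a YAML change, not a code change", a composable config under this design might look like the following. The key names and schema here are hypothetical; LoongForge's actual configuration format may differ.

```yaml
# Hypothetical OmniCombinationModel config: each component picks its own
# parallelism, and swapping a model is a one-line edit.
model:
  vision_encoder:
    name: vit-large          # swap the ViT here
    parallel: {tp: 1, pp: 1, dp: 8}
  llm_backbone:
    name: example-llm-7b     # swap the LLM here
    parallel: {tp: 4, pp: 2, dp: 2}
```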

🧠 Memory & Compute

Systemic memory optimizations, including activation offloading, recomputation planning, and fused kernels (e.g. FusedDSA, Sparse MLA), reduce peak memory footprint while improving compute utilization.
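Recomputation planning can be sketched as a small optimization problem: choose which layers' activations to recompute so that peak memory fits a budget, preferring layers that free the most memory per unit of recompute cost. The greedy policy and all numbers below are illustrative, not LoongForge's actual planner.

```python
# Toy recomputation planner: greedily mark layers for recomputation,
# best memory-saved-per-recompute-cost first, until memory fits budget.
# Policy and units are invented for illustration.

def plan_recompute(act_mem, recompute_cost, budget):
    total = sum(act_mem)                 # memory if nothing is recomputed
    order = sorted(range(len(act_mem)),
                   key=lambda i: act_mem[i] / recompute_cost[i],
                   reverse=True)
    chosen = []
    for i in order:
        if total <= budget:
            break
        total -= act_mem[i]              # recomputed layers free activations
        chosen.append(i)
    return sorted(chosen), total

layers, peak = plan_recompute([4, 1, 3], [2, 1, 3], budget=4)
print(layers, peak)  # [0] 4
```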

💾 Checkpoint & LoRA

Native HuggingFace ↔ Megatron checkpoint conversion, plus online HuggingFace load/save via --save-hf true. LoRA adapters are first-class: attach, train, merge, ship.
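The "merge" step follows the standard LoRA math, W' = W + (alpha/r) * B @ A. A minimal sketch using plain lists (real checkpoints hold framework tensors, but the arithmetic is identical); the function names here are for illustration only:

```python
# Minimal LoRA merge: fold a rank-r adapter (B @ A, scaled by alpha/r)
# back into the base weight matrix. Plain-Python sketch, not a real API.

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col))
             for col in zip(*b)] for row in a]

def merge_lora(w, lora_a, lora_b, alpha, r):
    delta = matmul(lora_b, lora_a)      # (out, r) @ (r, in) -> (out, in)
    s = alpha / r
    return [[wij + s * dij for wij, dij in zip(wr, dr)]
            for wr, dr in zip(w, delta)]

w = [[1.0, 0.0], [0.0, 1.0]]
a = [[1.0, 1.0]]           # A: (r=1, in=2)
b = [[2.0], [0.0]]         # B: (out=2, r=1)
print(merge_lora(w, a, b, alpha=2, r=1))  # [[5.0, 4.0], [0.0, 1.0]]
```

After merging, the adapter adds zero inference overhead: the shipped checkpoint is just a dense weight matrix again.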

🔧 Custom Operators

High-performance CUDA/C++ fused operators live in ops/ and are dispatched through a hardware-abstracted layer, so the same training script can use GPU-native kernels or XPU-native equivalents transparently.
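A hardware-abstracted dispatch layer of this kind is commonly a registry keyed by (operator, device). The sketch below is a generic illustration of the pattern; the registry, decorator, and device names are invented, not LoongForge's API.

```python
# Illustrative operator-dispatch registry: the same call site runs a
# GPU-native or XPU-native kernel depending on the device key.
# All names here are hypothetical.

_OP_REGISTRY = {}

def register_op(name, device):
    def deco(fn):
        _OP_REGISTRY[(name, device)] = fn
        return fn
    return deco

def dispatch(name, device, *args):
    fn = _OP_REGISTRY.get((name, device))
    if fn is None:
        raise NotImplementedError(f"{name} has no {device} kernel")
    return fn(*args)

@register_op("fused_add_relu", "gpu")
def _gpu_fused_add_relu(x, y):
    return [max(a + b, 0) for a, b in zip(x, y)]

@register_op("fused_add_relu", "xpu")
def _xpu_fused_add_relu(x, y):       # same semantics, different backend
    return [max(a + b, 0) for a, b in zip(x, y)]

print(dispatch("fused_add_relu", "gpu", [1, -5], [2, 3]))  # [3, 0]
```

Because the training script only ever calls `dispatch`, porting to a new accelerator means registering kernels, not editing model code.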

📦 Data Pipeline

Multimodal data pipeline with image/video processing plugins, sequence packing, length balancing, and data-parallel load balancing tuned for large multi-node training.
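Sequence packing is essentially bin packing: fit variable-length sequences into fixed-size windows to minimize padding. A first-fit-decreasing toy version, purely for illustration (LoongForge's actual packer may use a different strategy):

```python
# Toy first-fit-decreasing sequence packing: place each sequence
# (longest first) into the first window with room, else open a new one.

def pack_sequences(lengths, max_len):
    bins = []  # each bin: [remaining_capacity, [sequence lengths]]
    for L in sorted(lengths, reverse=True):
        for b in bins:
            if b[0] >= L:
                b[0] -= L
                b[1].append(L)
                break
        else:
            bins.append([max_len - L, [L]])
    return [b[1] for b in bins]

print(pack_sequences([700, 300, 512, 512, 200], max_len=1024))
# [[700, 300], [512, 512], [200]]
```

Length balancing then assigns packed windows across data-parallel ranks so each rank sees a similar token count per step.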

๐Ÿ–ฅ๏ธ Heterogeneous Hardware

The same codebase runs on NVIDIA GPU and Baidu Kunlun XPU (P800). The initial release validates the flagship LLM and VLM families on both platforms, with a minimally intrusive plugin design under models/dispatch.py.

Contributors

Thanks to everyone who shipped this release. In addition to the listed GitHub contributors (@nullnonenilNULL, @VVsssssk, @XueSongTap, @Zachary-wW, @pengxiangyu, @NeverlanD0829, @Dreamspr22, @Yangsx-1, @kaimo455), we are grateful to 20+ internal developers and early contributors whose work laid the foundation for LoongForge.

Feedback & Bug Reports

This is our first public release, and we actively welcome community feedback. Open an issue on GitHub, file a PR, or reach us through the WeChat developer group linked from our README.

๐Ÿ‰ v0.1.0 is a beginning, not an arrival. Tell us what to build next.
🔗 View on GitHub: LoongForge v0.1.0 release page
โ† Prev: Announcing LoongForge All posts โ†’