
LoongForge Multimodal Heterogeneous Parallel Training Acceleration: From Problems to Solutions

2026-05-21 · LoongForge Team

This article introduces LoongForge's heterogeneous parallel acceleration solution for multimodal large-model training. It covers three progressive strategies (heterogeneous TP, heterogeneous DP, and full separation parallelism) and their deep integration with MoE A2A Overlap.

Official website: https://baidu-baige.github.io/LoongForge/

GitHub: https://github.com/baidu-baige/LoongForge

1. Background: the era of multimodal large models

1.1. From language to multimodality: the capability leap of large models

Since 2023, large models have entered a new stage of development, moving from pure language understanding toward multimodal perception and reasoning. The release of GPT-4V marked multimodal large language models (MLLMs) as a mainstream industry direction. Gemini, Claude 3, Qwen-VL, and many other models quickly followed, integrating understanding of images, video, audio, and other modalities into language models.

The force behind this trend is clear: real-world information is inherently multimodal. A truly general-purpose agent must understand text, images, speech, and video together, as humans do, before it can complete complex real-world tasks, from document understanding, UI operation, and robot control to scientific research assistance.

1.2. Mainstream architecture: the Encoder-Projector-Decoder paradigm

After broad exploration by the community, current multimodal large models have converged on a relatively unified architecture paradigm:

Typical multimodal large-model architecture

Image / Video
  ↓
[Vision Encoder (ViT)]
  ↓
[Projector]
  ├──→ [LLM Decoder] → Output
  ↑
Text tokens

1.3. The encoder-decoder scale gap keeps growing, and MoE architectures are becoming common

The parameter count of the LLM decoder in multimodal large models is rising quickly, while the ViT encoder stays at roughly the same size:

[Figure: parameter-scale comparison between ViT encoders and LLM decoders in multimodal models]

Two notable trends deserve attention:

2. Limitations of the current training paradigm

The decoder is much larger than the encoder, usually by one to two orders of magnitude. In distributed training, this scale difference causes a series of efficiency problems, while the system must also account for characteristics of mainstream MoE models.

Problem 1: unified TP causes encoder communication waste

The traditional approach uses a single Tensor Parallel (TP) configuration for the entire model. For example, fitting a 72B decoder in memory may require TP=4 or even TP=8. For a 0.6B ViT encoder, however, TP=4 leaves only about 150M encoder parameters on each GPU. Per-GPU compute becomes very small, while the AllReduce communication that TP introduces takes up most of the encoder's runtime. The encoder becomes communication-bound rather than compute-bound.

Traditional solution (unified TP=4):
GPU0: ViT_shard_0 → AllReduce → LLM_shard_0
GPU1: ViT_shard_1 → AllReduce → LLM_shard_1
GPU2: ViT_shard_2 → AllReduce → LLM_shard_2
GPU3: ViT_shard_3 → AllReduce → LLM_shard_3
                ↑
    Encoder TP communication becomes the bottleneck
    (compute is too small and communication share is too high)
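The per-GPU arithmetic behind this problem can be checked with a few lines of Python. This is a rough sketch that assumes weights are sharded evenly across the TP group and ignores replicated parameters such as norms; params_per_rank is a hypothetical helper, not a LoongForge function.

```python
def params_per_rank(total_params: float, tp_size: int) -> float:
    """Approximate parameters held by each GPU under even TP sharding (assumption)."""
    return total_params / tp_size


encoder_params = 0.6e9  # ViT encoder size from the example above
decoder_params = 72e9   # LLM decoder size from the example above

for tp in (1, 2, 4, 8):
    enc = params_per_rank(encoder_params, tp)
    dec = params_per_rank(decoder_params, tp)
    print(f"TP={tp}: encoder {enc / 1e6:.0f}M per GPU, decoder {dec / 1e9:.1f}B per GPU")
```

At TP=4 the decoder shard is still 120 times larger than the encoder shard, yet both pay for the same number of collective calls per layer.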

Problem 2: PP pipeline bubbles

In Pipeline Parallelism (PP), the encoder usually exists only in the first pipeline stage. This means that when decoder layers are evenly divided, the first pipeline stage has more parameters and compute than the other stages, causing pipeline bubbles.

Solving this usually requires manually configuring the number of decoder layers in the first pipeline stage to reach compute balance. But because the encoder's compute characteristics are very different from the decoder's, manual configuration is cumbersome and difficult, and can still lead to imbalanced pipeline computation across stages.

Traditional solution (PP=4, encoder in stage 0):
Stage 0: [ViT Forward] [LLM layers 0-7 fwd] ... [bwd] ...
Stage 1: [  IDLE   ]  [LLM layers 8-15 fwd] ... [bwd] ...
Stage 2: [  IDLE   ]  [  wait ] [layers 16-23 fwd] ... [bwd] ...
Stage 3: [  IDLE   ]  [  wait ] [  wait ] [layers 24-31 fwd] ...
              ↑
    Stage 1/2/3 are completely idle during encoding
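To make the imbalance concrete, the following sketch models per-stage forward cost with the encoder pinned to stage 0 and then searches for the stage-0 layer count that minimizes the slowest stage. The cost units, the 3:1 encoder-to-layer ratio, and both function names are illustrative assumptions; real encoder cost also varies from batch to batch, which is what makes a fixed manual split hard to tune.

```python
def stage_costs(encoder_cost, layer_cost, layers_per_stage):
    """Per-stage forward cost when the encoder lives only on stage 0."""
    costs = [n * layer_cost for n in layers_per_stage]
    costs[0] += encoder_cost
    return costs


def balance_first_stage(encoder_cost, layer_cost, num_layers, pp_size):
    """Pick the stage-0 layer count that minimizes the slowest stage (requires pp_size >= 2).

    Remaining layers are spread as evenly as possible over the other stages.
    """
    best = None
    for first in range(1, num_layers - (pp_size - 1) + 1):
        base, extra = divmod(num_layers - first, pp_size - 1)
        split = [first] + [base + (1 if i < extra else 0) for i in range(pp_size - 1)]
        worst = max(stage_costs(encoder_cost, layer_cost, split))
        if best is None or worst < best[0]:
            best = (worst, split)
    return best[1]


# Hypothetical costs: the encoder costs as much as 3 decoder layers.
print(stage_costs(3.0, 1.0, [8, 8, 8, 8]))     # even split, stage 0 is the slowest
print(balance_first_stage(3.0, 1.0, 32, 4))    # rebalanced split
```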

Problem 3: MoE communication stacks on top

When the decoder uses a Mixture-of-Experts (MoE) architecture, such as Qwen3-VL-30B-A3B with top-8 routing, All-to-All communication introduced by Expert Parallelism (EP) becomes another significant bottleneck. In the traditional solution, encoder inefficiency stacks with MoE communication overhead and worsens overall training efficiency. This requires encoder optimization and decoder optimization to be enabled together so that end-to-end efficiency improves.

An ideal multimodal-training parallel solution should satisfy the following:

3. Solution design: three progressive levels of heterogeneous parallelism

LoongForge designs a three-level progressive heterogeneous parallel solution. Each level expands the encoder's effective data parallelism beyond the previous level, progressively improving multimodal training efficiency. The design draws on Kimi's separated-parallelism design (https://arxiv.org/abs/2602.02276).

3.1. Level 1: Heterogeneous Tensor Parallelism

Design idea

The most direct optimization is: since the encoder does not need such a large TP degree, let the encoder and decoder use different TP sizes.

LoongForge's heterogeneous TP treats the model as multiple independent submodules, such as encoder and decoder. Each submodule can configure an independent TP process group, and the system automatically switches parallel context at module execution boundaries.

Implementation mechanism

The core implementation is based on a parallel-state snapshot and switching mechanism:

# Pseudo-code: parallel-state switching (simplified)
class OmniEncoderModel:
    def _pre_forward_hook(self, module, input):
        change_parallel_state("image_encoder")  # switch to encoder TP group

    def _post_forward_hook(self, module, input, output):
        change_parallel_state("text_decoder")   # switch back to decoder TP group
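The snapshot-and-switch idea can be illustrated with a small self-contained sketch. Only the name change_parallel_state comes from the pseudo-code above; the ParallelState record, the registry, and EncoderWrapper are hypothetical stand-ins, and a real implementation would hold process-group handles rather than plain rank tuples.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ParallelState:
    """Hypothetical snapshot of the TP settings for one submodule."""
    name: str
    tp_size: int
    tp_ranks: tuple


_STATES = {}
_current = None


def register_parallel_state(state):
    _STATES[state.name] = state


def change_parallel_state(name):
    """Make the named snapshot the active parallel context."""
    global _current
    _current = _STATES[name]


def get_tp_size():
    return _current.tp_size


class EncoderWrapper:
    """Switches to the encoder context before forward and back to the decoder afterwards."""

    def __init__(self, encoder_fn):
        self.encoder_fn = encoder_fn

    def __call__(self, x):
        change_parallel_state("image_encoder")
        try:
            return self.encoder_fn(x)
        finally:
            change_parallel_state("text_decoder")


register_parallel_state(ParallelState("text_decoder", tp_size=4, tp_ranks=(0, 1, 2, 3)))
register_parallel_state(ParallelState("image_encoder", tp_size=2, tp_ranks=(0, 1)))
change_parallel_state("text_decoder")

seen = []


def fake_encoder(x):
    seen.append(get_tp_size())  # TP size visible inside the encoder
    return x


EncoderWrapper(fake_encoder)("pixels")
print(seen, get_tp_size())
```

Code inside the encoder sees the encoder's TP size, and the decoder context is restored even if the encoder raises.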

Usage

Only the encoder TP size needs to be specified in the YAML configuration:

# configs/models/image_encoder/qwen3_vit.yaml
_target_: loongforge.models.encoder.Qwen3VisionModelConfig
num_layers: 27
hidden_size: 1152
# ... other config ...
tensor_model_parallel_size: 2   # encoder TP=2 (decoder TP is controlled by CLI args)

Performance

With Qwen3-VL at 32k sequence length in a 4-machine A-card environment, heterogeneous TP reaches a throughput of about 800 tokens/second. This figure serves as the baseline for the levels that follow.

3.2. Level 2: Heterogeneous Data Parallelism

Design idea

Heterogeneous TP solves the encoder communication-waste problem, but it does not fully utilize compute. When encoder TP=1 and decoder TP=4, each of the four GPUs in a TP group holds a complete encoder replica, but only one GPU runs the encoder; the encoder replicas on the other three GPUs are idle during forward.

The core insight of heterogeneous DP is: since every GPU has a complete encoder, let them process different data simultaneously.

Implementation mechanism

When heterogeneous DP is enabled, each GPU inside a TP group independently processes a different microbatch during the encoding phase:

Heterogeneous DP (decoder TP=4, encoder TP=1):
┌─── TP Group ────────────────────────────────┐
│ GPU0: ViT(batch_0) → embed_0                │
│ GPU1: ViT(batch_1) → embed_1                │  encoding: 4-way parallel
│ GPU2: ViT(batch_2) → embed_2                │
│ GPU3: ViT(batch_3) → embed_3                │
│                                             │
│ broadcast embed_i → all GPUs                │  embedding distribution
│                                             │
│ GPU0-3: standard TP=4 decoder fwd/bwd       │  decoding: normal TP
└──────────────────────────────────────────────┘

Key implementation points:

# Pseudo-code: heterogeneous DP forward logic (simplified)
def forward(self, batch_list, forward_group_id, inner_group_id):
    # Each rank processes its corresponding batch
    my_batch = batch_list[forward_group_id * tp_size + inner_group_id]
    # Run encoder independently
    embedding = self.encoder(my_batch)
    # Store in context for later broadcast
    self.vit_contexts[forward_group_id] = embedding
    # Decoding phase: broadcast embedding from each corresponding rank
    for i in range(tp_size):
        emb_i = hetero_dp_get_tensor(self.vit_contexts, src=i)
        # concatenate into decoder input
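The indexing in the pseudo-code, forward_group_id * tp_size + inner_group_id, can be traced with a short simulation. This sketch only tracks which microbatch index each rank encodes and which embeddings a TP group holds after the per-rank broadcasts; encoder_assignment and decoder_inputs are hypothetical names, and no tensors or communication are involved.

```python
def encoder_assignment(num_microbatches, tp_size):
    """Map (forward_group_id, inner_group_id) to the microbatch index that rank encodes."""
    if num_microbatches % tp_size != 0:
        raise ValueError("this sketch assumes the microbatch count is a multiple of tp_size")
    return {
        (group, rank): group * tp_size + rank
        for group in range(num_microbatches // tp_size)
        for rank in range(tp_size)
    }


def decoder_inputs(plan, forward_group_id, tp_size):
    """Microbatch indices whose embeddings every TP rank holds after broadcasting from each src rank."""
    return [plan[(forward_group_id, src)] for src in range(tp_size)]


plan = encoder_assignment(num_microbatches=8, tp_size=4)
for (group, rank), mb in sorted(plan.items()):
    print(f"forward group {group}, TP rank {rank} encodes microbatch {mb}")
print(decoder_inputs(plan, forward_group_id=1, tp_size=4))
```

Each microbatch is encoded exactly once, and each rank runs the encoder once per forward group instead of once per microbatch.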

Usage

MODEL_PARALLEL_ARGS=(
    --tensor-model-parallel-size 4
    --pipeline-model-parallel-size 2
    --enable-encoder-hetero-dp    # one argument enables it
)

The encoder YAML must set tensor_model_parallel_size: 1.

Performance

With Qwen3-VL at 32k sequence length in a 4-machine A-card environment, heterogeneous DP reaches about 855 tokens/second, about 6.8% higher than the heterogeneous TP baseline of about 800 tokens/second.

3.3. Level 3: Full Heterogeneous DP (full separation parallelism)

Design idea

Heterogeneous DP raises the encoder's effective DP degree to TP size, usually 4 to 8, but it is still limited to the TP dimension. When PP is used, GPUs in non-first stages remain idle during encoding.

The core idea of full separation parallelism is: expand the encoder's data parallelism to the entire model-parallel group (TP × PP × CP), so all GPUs participate in visual encoding during the encoding phase. For a complete iteration, full separation first uses all cards to compute encoder results for all microbatches. The encoder no longer runs only on PP stage 0; all stages have an Encoder and can perform encoder computation simultaneously. After computation, intermediate results are stored, then the full decoder execution runs and uses the stored intermediate results as decoder input. This completely separates encoder and decoder computation.

This means the encoding phase and decoding phase are fully decoupled and separated in time, rather than interleaving encoder and decoder execution as in traditional training. This is why it is called “full separation.”
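The difference in encoder data-parallel degree between the two modes is simple arithmetic, sketched here with hypothetical helper names. The degrees follow the description above (TP size for heterogeneous DP, TP × PP × CP for full separation); the microbatch count is an illustrative assumption.

```python
import math


def encoder_dp_degree(mode, tp, pp=1, cp=1):
    """GPUs in one model-parallel group that encode different microbatches at the same time."""
    if mode == "hetero_dp":
        return tp
    if mode == "full_hetero_dp":
        return tp * pp * cp
    raise ValueError(f"unknown mode: {mode}")


def encoding_rounds(num_microbatches, degree):
    """Sequential encoder passes each GPU runs to cover all microbatches."""
    return math.ceil(num_microbatches / degree)


for mode in ("hetero_dp", "full_hetero_dp"):
    degree = encoder_dp_degree(mode, tp=4, pp=2)
    print(mode, "degree:", degree, "rounds for 16 microbatches:", encoding_rounds(16, degree))
```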

Implementation mechanism

Full separation parallelism splits one training iteration into three explicit phases:

Full separation parallelism (TP=4, PP=2, model-parallel group size=8):
════════ Phase 1: Encoding (all GPUs independently encode) ═══════════
│ Stage0-GPU0: ViT(batch_0)  Stage0-GPU1: ViT(batch_1) │
│ Stage0-GPU2: ViT(batch_2)  Stage0-GPU3: ViT(batch_3) │
│ Stage1-GPU4: ViT(batch_4)  Stage1-GPU5: ViT(batch_5) │  encoder DP=8
│ Stage1-GPU6: ViT(batch_6)  Stage1-GPU7: ViT(batch_7) │
│ → gather_variable_shape_embeddings() → rank 0         │  gather all embeddings
════════ Phase 2: Decoding (standard PP+TP pipeline) ═════════════
│ Stage 0 (TP=4): LLM layers 0-15 [1F1B schedule]       │
│ Stage 1 (TP=4): LLM layers 16-31 [1F1B schedule]      │  normal pipeline
│ PreProcessNode: rank0 broadcast embedding → all ranks │  distribute embeddings on demand
════════ Phase 3: Encoder backward (gradients return to GPUs) ═══════════
│ scatter_variable_shape_embeddings(grads) → GPUs       │  scatter gradients
│ GPUs: torch.autograd.backward(local_embedding)        │  independent encoder backward
│ DDP bucket sync for encoder params                    │  encoder gradient sync
════════════════════════════════════════════════════════════

Key implementation details:

Gather/Scatter for variable-length embeddings

Because different samples contain different numbers of image tokens, especially with highly variable videos, encoder-output tensor shapes are inconsistent across GPUs. LoongForge implements two communication primitives, gather_variable_shape_embeddings and scatter_variable_shape_embeddings, to support collection and distribution of variable-length tensors across ranks:

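# Simplified sketch: exchange shape information first, then gather variable-length tensors.
# all_gather, pad_to, and unpad stand for helper wrappers that are not spelled out here;
# all_gather is assumed to return a list with one tensor per rank.
import torch

def gather_variable_shape_embeddings(embedding, model_parallel_group):
    # 1. all_gather shapes so every rank knows every other rank's embedding shape
    local_shape = torch.tensor(embedding.shape)
    all_shapes = all_gather(local_shape, group=model_parallel_group)
    # 2. pad to the element-wise max shape (collectives need equal sizes), then all_gather data
    max_shape = torch.stack(all_shapes).max(dim=0).values
    padded = pad_to(embedding, max_shape)
    all_embeddings = all_gather(padded, group=model_parallel_group)
    # 3. strip the padding and return one embedding per rank
    return [unpad(emb, shape) for emb, shape in zip(all_embeddings, all_shapes)]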
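The exchange-shapes, pad, gather, unpad sequence can be tried without any GPUs. The sketch below simulates it for one group using plain Python lists in place of tensors and a list of per-rank outputs in place of a collective; simulate_variable_length_gather is a hypothetical name, and padding only along the token dimension is an assumption.

```python
def simulate_variable_length_gather(per_rank_embeddings, hidden_size, pad_value=0.0):
    """Simulate shape exchange, padding, gathering, and unpadding for one group.

    per_rank_embeddings[r] is rank r's encoder output as a list of token vectors.
    Returns (gathered, wasted_tokens): what every rank holds afterwards, and how
    many padding tokens had to be transferred.
    """
    # 1. every rank learns every other rank's token count
    lengths = [len(tokens) for tokens in per_rank_embeddings]
    max_len = max(lengths)
    # 2. pad each rank's output to the longest one so all buffers have equal size
    padded = [
        tokens + [[pad_value] * hidden_size for _ in range(max_len - len(tokens))]
        for tokens in per_rank_embeddings
    ]
    # 3. after the gather, strip the padding using the exchanged lengths
    gathered = [rows[:length] for rows, length in zip(padded, lengths)]
    wasted_tokens = sum(max_len - length for length in lengths)
    return gathered, wasted_tokens


rank_outputs = [
    [[1.0, 1.0]] * 3,  # rank 0: 3 visual tokens
    [[2.0, 2.0]] * 1,  # rank 1: 1 visual token
    [[3.0, 3.0]] * 2,  # rank 2: 2 visual tokens
]
gathered, wasted = simulate_variable_length_gather(rank_outputs, hidden_size=2)
print([len(g) for g in gathered], wasted)
```

The wasted-token count shows the cost of padding: the more the per-rank token counts differ, the more padding is transferred.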

Delayed encoder backward

In traditional interleaved forward-backward execution, encoder backward is triggered immediately when the first decoder microbatch runs backward. In full separation mode, encoder backward must wait until all decoder microbatches finish forward-backward and then run uniformly. The system uses full_hetero_dp_grad_hook_factory to capture gradients and temporarily store them in grad_list, then returns them uniformly in Phase 3.
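The capture-then-replay pattern can be sketched as a closure factory. This is not the signature of full_hetero_dp_grad_hook_factory, which the article does not show; make_grad_hook is a hypothetical stand-in whose inner function has the hook(grad) shape accepted by torch.Tensor.register_hook, and strings stand in for gradient tensors.

```python
def make_grad_hook(grad_list, microbatch_id):
    """Return a hook that stores the embedding gradient instead of running encoder backward."""
    def hook(grad):
        grad_list[microbatch_id] = grad
        return grad
    return hook


events = []
grad_list = {}
hooks = [make_grad_hook(grad_list, mb) for mb in range(4)]

# Decoder phase: each microbatch finishes forward-backward; the hook only records the gradient.
for mb in range(4):
    events.append(f"decoder_fwd_bwd[{mb}]")
    hooks[mb](f"d_embedding_{mb}")

# Phase 3: replay the stored gradients through the encoder, one microbatch at a time.
for mb in sorted(grad_list):
    events.append(f"encoder_bwd[{mb}]")

print(events)
```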

Instantiate the encoder only in the first VPP chunk

With virtual pipeline parallelism, each PP rank holds several model chunks, and instantiating the encoder in every chunk would waste memory. Full separation therefore creates the encoder only in the first Virtual Pipeline Parallel (VPP) chunk of each rank:

# omni_model_provider.py
if enable_full_hetero_dp:
    add_encoder = (vp_stage == 0)  # only the first VPP chunk
else:
    add_encoder = (pp_stage == 0)  # only the first PP stage

Mock microbatch padding

When the actual number of microbatches is not divisible by the model-parallel group size, the system automatically pads empty mock microbatches to keep gather/scatter operations aligned.
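The padding rule amounts to rounding the microbatch count up to the next multiple of the group size. A minimal sketch, with hypothetical function names and None standing in for an empty mock microbatch:

```python
def pad_microbatches(microbatches, group_size, mock=None):
    """Append mock microbatches until the count is a multiple of group_size."""
    num_mock = (group_size - len(microbatches) % group_size) % group_size
    return list(microbatches) + [mock] * num_mock, num_mock


def split_into_rounds(padded, group_size):
    """Group the padded list into encoding rounds, one microbatch per GPU per round."""
    return [padded[i:i + group_size] for i in range(0, len(padded), group_size)]


padded, num_mock = pad_microbatches(list(range(10)), group_size=8)
rounds = split_into_rounds(padded, group_size=8)
print(num_mock, [len(r) for r in rounds])
```

With 10 real microbatches and a group of 8 GPUs, the second round contains 2 real and 6 mock microbatches, so every rank still takes part in each gather and scatter.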

Full-separation intermediate-result offload

When full separation is enabled, the encoder and decoder run independently. To ensure each decoder microbatch can obtain its corresponding visual features, the system must compute encoder results for all data in the entire GBS (Global Batch Size) before decoder forward and temporarily store the results in GPU memory. When input sequence length is large or GBS is high, these intermediate results, including visual embedding, visual_pos_masks, and deepstack_visual_embeds, can occupy a large amount of memory and may cause OOM during the decoder phase.

To solve this, full separation supports encoder-result offload. After encoder forward finishes, intermediate results are asynchronously moved to CPU memory. When each decoder microbatch actually needs them, they are loaded back to GPU on demand. In this way, with a small cost in CPU memory and H2D transfer overhead, peak GPU memory usage is significantly reduced, supporting longer sequences or larger batches.
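A back-of-the-envelope estimate shows why these stored results can become large. The numbers below (microbatch count, visual tokens per microbatch, hidden size, number of extra deepstack copies) are illustrative assumptions, not measurements, and stored_embedding_bytes is a hypothetical helper.

```python
def stored_embedding_bytes(num_microbatches, visual_tokens, hidden_size,
                           extra_copies=0, bytes_per_element=2):
    """Bytes needed to keep encoder outputs for all microbatches until the decoder uses them.

    extra_copies counts additional embedding-sized tensors per microbatch
    (for example deepstack_visual_embeds); bytes_per_element=2 assumes bf16.
    """
    per_microbatch = visual_tokens * hidden_size * bytes_per_element * (1 + extra_copies)
    return num_microbatches * per_microbatch


GIB = 2 ** 30
base = stored_embedding_bytes(num_microbatches=64, visual_tokens=16384, hidden_size=4096)
with_deepstack = stored_embedding_bytes(64, 16384, 4096, extra_copies=3)
print(f"{base / GIB:.0f} GiB without extra copies, {with_deepstack / GIB:.0f} GiB with 3 extra copies")
```

Under these assumptions the stored results reach several GiB, which is the kind of footprint that offloading to CPU memory removes from the decoder phase.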

Usage

MODEL_PARALLEL_ARGS=(
    --tensor-model-parallel-size 4
    --pipeline-model-parallel-size 2
    --enable-full-hetero-dp         # enable full separation
    --full-hetero-dp-cpu-offload    # enable full-separation intermediate offload
    --use-distributed-optimizer
)

Performance

With Qwen3-VL at 32k sequence length in a 4-machine A-card environment, full separation parallelism reaches about 900 tokens/second, about 12.5% higher than the heterogeneous TP baseline of about 800 tokens/second.

Constraints

[Figure: constraints of full separation parallelism]

3.4. Comparison of the three-level solution

[Figure: comparison of heterogeneous TP, heterogeneous DP, and full separation]

4. Deep integration: full separation + MoE A2A Overlap

4.1. Communication bottleneck in MoE training

For MoE-architecture multimodal models such as Qwen3-VL-30B-A3B, Expert Parallelism (EP) introduces a large amount of All-to-All communication. In each Transformer layer, the MoE operation involves two All-to-All operations:

Token Dispatch (A2A) → Expert Computation → Token Combine (A2A)
       ↑                                          ↑
 tokens are dispatched to ranks with experts      expert outputs are collected back to original ranks

When EP=8, the communication volume of two A2A operations per layer is proportional to sequence length and expert hidden size, becoming a significant bottleneck in long-sequence training.
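The proportionality can be written out as a rough upper bound. The sketch assumes every routed token copy leaves its rank (in practice some experts are local), bf16 activations, and illustrative sizes; a2a_bytes_per_layer is a hypothetical helper rather than a LoongForge function.

```python
def a2a_bytes_per_layer(num_tokens, hidden_size, top_k, bytes_per_element=2):
    """Upper bound on forward All-to-All traffic for one MoE layer (dispatch plus combine)."""
    one_way = num_tokens * top_k * hidden_size * bytes_per_element
    return 2 * one_way


GIB = 2 ** 30
short_seq = a2a_bytes_per_layer(num_tokens=16384, hidden_size=2048, top_k=8)
long_seq = a2a_bytes_per_layer(num_tokens=32768, hidden_size=2048, top_k=8)
print(f"16k tokens: {short_seq / GIB:.0f} GiB per layer, 32k tokens: {long_seq / GIB:.0f} GiB per layer")
```

Doubling the sequence length doubles the traffic, and the cost is paid in every MoE layer, once more in backward.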

4.2. Principle of 1F1B A2A Overlap

LoongForge's A2A Overlap solution borrows from the idea of DeepSeek-V3 DualPipe. It decomposes a Transformer layer into fine-grained sub-operation nodes and interleaves scheduling across different microbatches:

Fine-grained node decomposition:
TransformerLayer = [Attn] → [PostAttn/Router] → [Dispatch A2A] → [Experts] → [Combine A2A] → [PostCombine]
                    compute     compute           comm           compute      comm          compute

Cross-microbatch interleaved scheduling:
MB0: [Attn] [PostAttn] [Dispatch─A2A] [Experts] [Combine─A2A] [PostCombine]
MB1:                 [Attn─────────] [PostAttn] [Dispatch─A2A] [Experts] ...
                       ↑                         ↑
                    MB1 compute hides MB0 A2A communication

In implementation, each sub-operation is encapsulated as a ScheduleNode and assigned to a compute stream or communication stream. The 1F1B scheduler automatically interleaves execution.
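The benefit of placing nodes on separate streams can be estimated with a toy simulation. The sketch below uses a simple greedy list scheduler as a stand-in for the 1F1B scheduler, covers forward only, and uses made-up durations; LAYER, serial_time, and overlapped_time are hypothetical names.

```python
# (node name, stream, duration in arbitrary time units); durations are illustrative
LAYER = [
    ("attn", "compute", 4),
    ("post_attn", "compute", 1),
    ("dispatch_a2a", "comm", 3),
    ("experts", "compute", 4),
    ("combine_a2a", "comm", 3),
    ("post_combine", "compute", 1),
]


def serial_time(num_microbatches):
    """Everything runs back to back on one stream."""
    return num_microbatches * sum(duration for _, _, duration in LAYER)


def overlapped_time(num_microbatches):
    """Greedy schedule on two streams; nodes within a microbatch stay in order."""
    stream_free = {"compute": 0, "comm": 0}
    next_node = [0] * num_microbatches   # index of the next node per microbatch
    ready_at = [0] * num_microbatches    # finish time of the previous node per microbatch
    remaining = num_microbatches * len(LAYER)
    while remaining:
        best = None
        for mb in range(num_microbatches):
            if next_node[mb] == len(LAYER):
                continue
            _, stream, _ = LAYER[next_node[mb]]
            start = max(stream_free[stream], ready_at[mb])
            if best is None or start < best[0]:
                best = (start, mb)
        start, mb = best
        _, stream, duration = LAYER[next_node[mb]]
        stream_free[stream] = ready_at[mb] = start + duration
        next_node[mb] += 1
        remaining -= 1
    return max(ready_at)


print("serial:", serial_time(2), "overlapped:", overlapped_time(2))
```

With two microbatches, part of one microbatch's All-to-All time is covered by the other microbatch's compute, so the overlapped schedule finishes earlier than the serial one.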

4.3. Fusion challenge and solution

Integrating full separation parallelism with A2A Overlap faces a key challenge: the fine-grained A2A Overlap scheduler needs to obtain encoder outputs through PreProcessNode, but in full separation mode encoder outputs are precomputed in Phase 1 of train_step rather than computed in real time inside PreProcessNode.

The two execution modes must be made compatible by design.

LoongForge's solution is to pass the full heterogeneous-DP context into the fine-grained scheduler:

# PreProcessNode extension (supports full separation mode)
class PreProcessNode:
    def __init__(self, ..., enable_full_hetero_dp=False,
                 enable_encoder_hetero_dp=False,
                 batch_list=None, forward_group_id=None, inner_group_id=None):
        self.enable_full_hetero_dp = enable_full_hetero_dp
        # ... cache context ...

    def forward_impl(self, ...):
        if self.enable_full_hetero_dp:
            # obtain precomputed embedding from embedding_list
            embedding = retrieve_precomputed_embedding(forward_group_id, inner_group_id)
            # broadcast to all TP ranks
            hetero_dp_get_tensor(embedding, src=0)
            # register gradient hook for Phase 3 backward
            register_grad_hook(embedding, grad_list)
        elif self.enable_encoder_hetero_dp:
            # independent encoding inside TP group + broadcast
            embedding = run_encoder_independently(batch_list, inner_group_id)
            hetero_dp_get_tensor(embedding)
        else:
            # standard path: directly run encoder
            embedding = self.encoder(input)
        return embedding

With this design, TransformerModelChunkSchedulePlan carries all heterogeneous-DP context correctly when constructing the schedule plan, allowing A2A Overlap microbatch interleaving to work seamlessly with precomputed embeddings from full separation mode.

4.4. Fine-Grained Activation Offload

A2A Overlap depends on block-level interleaved execution and is incompatible with traditional full-layer recomputation. To save memory, LoongForge provides module-level selective recomputation + fine-grained activation offload:

--recompute-granularity selective \
--recompute-modules a2a_overlap_attn a2a_overlap_post_attn a2a_overlap_mlp \
--fine-grained-activation-offloading \
--offload-tensors dispatched_input pre_mlp_layernorm_output

This approximates the memory savings of full-layer recomputation without sacrificing A2A overlap effectiveness.

5. Feature compatibility

5.1. Training-phase compatibility

[Figure: compatibility between training phases and heterogeneous parallelism modes]

The two training phases use the same model architecture and forward logic and are registered uniformly through @register_model_trainer(family, training_phase). The heterogeneous-parallel implementation is completely transparent to the training phase.

5.2. Encoder Freeze compatibility

In SFT scenarios, freezing the visual encoder is common: it preserves pretrained visual-understanding capability and trains only the decoder's multimodal-alignment capability. LoongForge supports configuration-level freeze:

# Hydra override style
+model.image_encoder.freeze=True

In heterogeneous DP and full separation modes, encoder-freeze behavior is fully correct:

5.3. ViT DP Load Balancing compatibility

Different samples contain different numbers of images or video frames, causing encoder compute imbalance across DP ranks. LoongForge's use-vit-dp-balance feature rearranges samples during data loading to balance encoder compute across ranks. This feature runs correctly after switching to the encoder parallel context and is fully compatible with heterogeneous parallelism.
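The article does not describe the algorithm behind use-vit-dp-balance. As an illustration of the general idea only, the sketch below balances samples across ranks with a greedy longest-first heuristic, using the visual token count of each sample as a proxy for encoder compute; balance_samples and the token counts are hypothetical.

```python
import heapq


def balance_samples(token_counts, num_ranks):
    """Assign sample indices to ranks so that per-rank visual token totals are close.

    Greedy heuristic: take samples from largest to smallest and give each one
    to the rank with the smallest load so far.
    """
    heap = [(0, rank) for rank in range(num_ranks)]
    heapq.heapify(heap)
    assignment = [[] for _ in range(num_ranks)]
    for idx in sorted(range(len(token_counts)), key=lambda i: -token_counts[i]):
        load, rank = heapq.heappop(heap)
        assignment[rank].append(idx)
        heapq.heappush(heap, (load + token_counts[idx], rank))
    return assignment


counts = [900, 100, 400, 300, 800, 200, 500, 0]  # visual tokens per sample (made up)
assignment = balance_samples(counts, num_ranks=2)
loads = [sum(counts[i] for i in idxs) for idxs in assignment]
round_robin = [sum(counts[0::2]), sum(counts[1::2])]
print("balanced loads:", loads, "round-robin loads:", round_robin)
```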

5.4. Full compatibility matrix

[Figure: full compatibility matrix for heterogeneous parallelism modes]

6. Quick start

Heterogeneous TP, heterogeneous DP, and full separation are all enabled through LoongForge training arguments: --enable-encoder-hetero-dp enables heterogeneous DP, and --enable-full-hetero-dp enables full separation. When heterogeneous DP or full separation is enabled, the encoder YAML must set tensor_model_parallel_size: 1.

6.1. Heterogeneous TP

Specifying the ViT TP size (tensor_model_parallel_size) in the encoder YAML is enough to enable heterogeneous TP.

_target_: loongforge.models.encoder.Qwen2VisionRMSNormConfig
num_layers: 32
hidden_size: 1280
kv_channels: 80
ffn_hidden_size: 3420
patch_size: 14
num_attention_heads: 16
num_query_groups: 16
image_size: [1344, 1344]
activation_func: ${act:silu}
add_bias_linear: true
add_qkv_bias: true
swiglu: true
gated_linear_unit: true
position_embedding_type: "none"
bias_activation_fusion: False
hidden_dropout: 0
attention_dropout: 0
normalization: "RMSNorm"
apply_rope_fusion: true
tensor_model_parallel_size: 1
recompute_granularity: full
recompute_method: uniform
recompute_num_layers: 1
model_type: "qwen2_5_vit"

6.2. Heterogeneous DP

#!/bin/bash
# Qwen2.5-VL-7B SFT with Heterogeneous DP
LOONGFORGE_PATH=/path/to/LoongForge
MEGATRON_PATH=$LOONGFORGE_PATH/third_party/Loong-Megatron
MODEL_PARALLEL_ARGS=(
    --attention-backend flash
    --tensor-model-parallel-size 4
    --pipeline-model-parallel-size 1
    --use-distributed-optimizer
    --enable-encoder-hetero-dp
)
PYTHONPATH=$MEGATRON_PATH:$LOONGFORGE_PATH:$PYTHONPATH \
torchrun --nproc_per_node 8 --nnodes 1 \
    $LOONGFORGE_PATH/loongforge/train.py \
    --model-name qwen2_5_vl_7b \
    --training-phase sft \
    "${MODEL_PARALLEL_ARGS[@]}" \
    +model.image_encoder.freeze=True \
    --micro-batch-size 1 \
    --global-batch-size 32 \
    --lr 1e-5 \
    # ... other training args

6.3. Full separation parallelism for encoder and decoder

#!/bin/bash
# Qwen3-VL-30B-A3B SFT with Full Separation + A2A Overlap
LOONGFORGE_PATH=/path/to/LoongForge
MEGATRON_PATH=$LOONGFORGE_PATH/third_party/Loong-Megatron
MODEL_PARALLEL_ARGS=(
    --attention-backend flash
    --tensor-model-parallel-size 1
    --pipeline-model-parallel-size 2
    --expert-model-parallel-size 4
    --moe-token-dispatcher-type alltoall
    --num-virtual-stages-per-pipeline-rank 2
    --use-distributed-optimizer
    --enable-full-hetero-dp
    --overlap-moe-expert-parallel-comm
    --delay-wgrad-compute
)
# Optimize A2A overlap
export CUDA_DEVICE_MAX_CONNECTIONS=32
PYTHONPATH=$MEGATRON_PATH:$LOONGFORGE_PATH:$PYTHONPATH \
torchrun --nproc_per_node 8 --nnodes 4 \
    $LOONGFORGE_PATH/loongforge/train.py \
    --model-name qwen3_vl_30b_a3b \
    --training-phase sft \
    "${MODEL_PARALLEL_ARGS[@]}" \
    +model.image_encoder.freeze=True \
    --micro-batch-size 1 \
    --global-batch-size 64 \
    --lr 1e-5 \
    --recompute-granularity selective \
    --recompute-modules a2a_overlap_attn a2a_overlap_post_attn a2a_overlap_mlp \
    --fine-grained-activation-offloading \
    --offload-tensors dispatched_input pre_mlp_layernorm_output \
    # ... other training args

7. Summary

LoongForge's heterogeneous parallel solution systematically solves the compute-heterogeneity challenge between encoder and decoder in multimodal large-model training. Through three progressive levels, it gradually releases training efficiency:

The design follows these principles:

[Figure: summary of benefits from each heterogeneous parallelism level]

For next-generation multimodal large models using MoE architectures, such as Qwen3-VL-235B-A22B, the combination of full separation + A2A Overlap is the most efficient end-to-end training configuration that LoongForge currently provides.

🔗 Original WeChat post: https://mp.weixin.qq.com/s/ZFWSxNZvDZJ38fyyb8t7vg