NVIDIA Improves Training Throughput with Megatron-Core in NeMo-RL

Ted Hisokawa
August 20, 2025 16:26

NVIDIA introduces Megatron-Core support in NeMo-RL v0.3, optimizing training throughput for large models through GPU-optimized techniques and improved parallelism.

NVIDIA has presented the latest iteration of its NeMo-RL framework, version 0.3, which adds support for Megatron-Core. The addition aims to optimize training throughput for large language models through GPU-optimized techniques and advanced parallelism strategies, according to the official NVIDIA blog.

Challenges with the previous backend

The first release of NVIDIA NeMo-RL used PyTorch DTensor (FSDP2), which offers native integration with the Hugging Face ecosystem and enables rapid experimentation through PyTorch's native parallelisms. However, as model sizes scaled to hundreds of billions of parameters, the DTensor path fell short due to significant activation recomputation overhead and the lack of optimized NVIDIA CUDA kernels, leading to inefficient step times.

Introduction of Megatron-Core

The Megatron-Core library addresses these limitations by offering a more efficient solution for training very large models. It employs a 6D parallelism strategy to improve communication and computation patterns and to support a variety of model architectures. This backend enables seamless training of massive language models and delivers significant gains in throughput and performance.
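To make the idea of composing parallelism dimensions concrete, the following Python sketch shows, in simplified form, how parallelism degrees multiply into an overall GPU count. The function and the specific degrees are illustrative assumptions rather than values taken from NVIDIA's material, and expert and sequence parallelism are omitted because they share GPUs with the dimensions shown rather than multiplying the total.

```python
# Illustrative only: a simplified view of how parallelism degrees compose
# into a total GPU count. The dimension names mirror Megatron-style
# parallelism (tensor, pipeline, context, data). Expert and sequence
# parallelism, the remaining dimensions of the 6D scheme, are left out
# because they reuse GPUs from the dimensions shown; the exact composition
# rules are defined by Megatron-Core itself.

def required_gpus(tensor_parallel: int,
                  pipeline_parallel: int,
                  context_parallel: int,
                  data_parallel: int) -> int:
    """Return the GPU count implied by the chosen parallelism degrees."""
    return tensor_parallel * pipeline_parallel * context_parallel * data_parallel


if __name__ == "__main__":
    # An assumed layout for a large dense model, not a recommended setting.
    gpus = required_gpus(tensor_parallel=8,
                         pipeline_parallel=4,
                         context_parallel=1,
                         data_parallel=4)
    print(f"This layout occupies {gpus} GPUs")  # 8 * 4 * 1 * 4 = 128
```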

Getting started with Megatron-Core

Megatron-based training is enabled by adding specific settings to the YAML configuration. NeMo-RL streamlines the process by handling the complex tuning automatically and exposing simple configuration options, which makes adopting Megatron-Core more accessible to developers and lets them concentrate on optimizing their model training workflows.
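As a rough illustration, the Python sketch below renders a hypothetical set of configuration overrides as YAML using PyYAML. The key names, such as policy.megatron_cfg.enabled and the parallel-size fields, are assumptions made for this example rather than a verified NeMo-RL schema; consult the NeMo-RL documentation for the exact options.

```python
# A minimal sketch of what enabling a Megatron backend in a NeMo-RL-style
# YAML configuration could look like. The key names below are assumptions
# for illustration, not a confirmed schema.
import yaml  # PyYAML

config_overrides = {
    "policy": {
        "megatron_cfg": {
            "enabled": True,  # hypothetical switch from DTensor to Megatron-Core
            "tensor_model_parallel_size": 4,
            "pipeline_model_parallel_size": 2,
        }
    }
}

# Render the overrides in YAML, the format the article says NeMo-RL uses.
print(yaml.safe_dump(config_overrides, sort_keys=False))
```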

Performance improvements

Megatron-based training supports both dense and mixture-of-experts (MoE) models. In performance tests across model configurations such as Llama 3.1 8B and 70B, Megatron-Core delivered superior training performance compared to PyTorch DTensor. The improvements are evident in faster step times and better convergence properties.

Additional features and future prospects

NeMo-RL v0.3 also expands its capabilities with features such as async rollouts and non-colocated generation. Looking ahead, NVIDIA plans to support larger MoE models and introduce further optimizations, including FP8 generation support and non-colocated generation with the Megatron-Core backend.

The advances in NeMo-RL with the Megatron-Core backend mark a significant step forward in optimizing reinforcement learning for large-scale language models, ensuring both efficiency and scalability in model training.

Image source: Shutterstock