Researchers have achieved impressive gains by scaling vision and language models with self-supervised learning, but reinforcement learning has yet to see comparable success. Traditional reinforcement learning models use small architectures, often only 2-5 layers deep. This paper shows that a much deeper architecture, up to 1000 layers, generates dramatic gains in both performance and efficiency.
How CRL is different
Standard self-supervised learning models have been fueled by the vast amount of data available on the internet. Reinforcement learning (RL), in contrast, relies on data gathered through interaction with an environment. Large-scale simulation and robotics environments can generate that data synthetically during training, providing the abundance needed to scale up models for locomotion, navigation, and manipulation tasks.
Contrastive Reinforcement Learning (CRL) is a form of reinforcement learning that takes advantage of this shift. Instead of relying solely on explicit reward signals, CRL constructs learning signals from the agent's own experience. It draws on Hindsight Experience Replay (HER), in which agents learn not just from successful outcomes but from entire trajectories, including failures. By contrasting different outcomes within the same experience, CRL enables agents to learn useful representations of goals and behavior without explicit supervision.
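To make the idea concrete, here is a minimal sketch of an InfoNCE-style contrastive objective of the kind CRL uses, written in PyTorch. The encoder names (`sa_encoder`, `goal_encoder`) and the exact formulation are illustrative assumptions rather than the paper's implementation: each state-action pair is scored against every future state in the batch, with the pair's own (hindsight-relabeled) future acting as the positive example and the rest as negatives.

```python
# Illustrative sketch of a contrastive RL objective (InfoNCE-style); not the paper's exact code.
import torch
import torch.nn.functional as F


def contrastive_rl_loss(sa_encoder, goal_encoder, states, actions, future_states):
    """Score each (state, action) pair against every future state in the batch."""
    phi = sa_encoder(torch.cat([states, actions], dim=-1))    # (B, D) state-action embeddings
    psi = goal_encoder(future_states)                         # (B, D) future-state (goal) embeddings
    logits = phi @ psi.T                                       # (B, B) similarity matrix
    labels = torch.arange(len(states), device=logits.device)   # diagonal entries are the positives
    return F.cross_entropy(logits, labels)
```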
To make training at extreme depths possible, the approach also relies on modern architectural techniques such as residual connections and layer normalization, which help maintain stability as networks scale from hundreds to thousands of layers.
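The sketch below shows one common way to combine residual connections and layer normalization into a block that can be stacked to arbitrary depth. It is again PyTorch, and the block layout, activation, and dimensions are assumptions for illustration, not the authors' exact architecture.

```python
# Illustrative residual + layer-norm block for very deep MLPs; layout is assumed, not the paper's.
import torch.nn as nn


class ResidualBlock(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(dim)      # normalization keeps activations well-scaled at depth
        self.linear = nn.Linear(dim, dim)
        self.act = nn.ReLU()

    def forward(self, x):
        # Skip connection: the block only learns a correction to the identity,
        # which keeps gradients flowing through hundreds of stacked layers.
        return x + self.linear(self.act(self.norm(x)))


def make_deep_mlp(in_dim: int, hidden_dim: int, depth: int) -> nn.Sequential:
    layers = [nn.Linear(in_dim, hidden_dim)]
    layers += [ResidualBlock(hidden_dim) for _ in range(depth)]
    return nn.Sequential(*layers)
```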
Study results
The scaled networks were tested on locomotion, navigation, and manipulation tasks. Compared with standard shallow RL baselines, performance roughly doubled on most tasks, while humanoid maze tasks saw improvements of up to 50×. Interestingly, the improvements were not gradual; performance spiked at critical depths, for example around 8 layers on an ant maze and 64 layers on a humanoid maze.
The study found that with more layers, the models acquired new capabilities, generalized better, and learned richer representations.
Depth vs. width
Earlier research has shown that making a network wider can improve performance. While that remains true, this study shows that deeper networks can be more effective. In particular, smaller models can achieve the same or better results by growing deeper rather than wider.
For example, in a Humanoid environment test, a wide network with thousands of units per layer performed worse than a smaller model that simply doubled its depth.
This advantage becomes even clearer as the problem gets more complex. When the agent has to process higher-dimensional observations, deeper networks consistently outperform wider ones. This suggests that depth helps the model build understanding step by step, rather than trying to do everything at once.
The study also examined batch size, the amount of experience the model processes per training step, and found that increasing it only starts to make a difference once the network reaches a certain depth.
With smaller or shallower models, increasing the batch size leads to small gains or no noticeable improvement. But when the network is larger and deeper, it can make good use of the extra data. In those cases, training with larger batches leads to clear performance improvements.
In other words, deeper networks are better at learning from more data at once. They can absorb the additional information instead of wasting it. This suggests that depth does not just improve performance on its own, but also enables other scaling techniques, like larger batch sizes, to become effective in reinforcement learning.
What are the limits?
The study tested depths up to 1024 layers and found an important nuance: additional layers pay off most on more complex tasks.
In one environment, Humanoid Big Maze, performance eventually stopped improving. More layers did not help the model beyond a certain depth.
However, in Humanoid U-Maze, deeper networks continued to improve performance from 256 layers up to 1024 layers, without plateauing. In the U-Maze task, the humanoid has to learn a more complex behavior than simple navigation. It appears deeper networks are crucial for its success.
The study also ran into stability limits at extreme depths. When every part of the model used the full 1024 layers, training became unstable early on. The solution was to cap the action-selecting part of the model (the policy, or actor) at 512 layers, while reserving the full 1024 layers for the parts of the network responsible for evaluating outcomes (the critic). This preserved the benefits of extreme depth while keeping training stable.
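As a rough illustration of that asymmetric setup, reusing the `make_deep_mlp` helper from the earlier sketch, the policy can simply be built shallower than the two critic encoders. The depths come from the prose above; the widths and input sizes are placeholder assumptions.

```python
# Hypothetical dimensions for illustration only; not taken from the paper.
obs_dim, act_dim, goal_dim = 268, 17, 3

actor = make_deep_mlp(obs_dim + goal_dim, hidden_dim=256, depth=512)        # policy capped at 512 blocks
sa_encoder = make_deep_mlp(obs_dim + act_dim, hidden_dim=256, depth=1024)   # critic side uses full depth
goal_encoder = make_deep_mlp(goal_dim, hidden_dim=256, depth=1024)
```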
Key takeaways
- Contrastive Reinforcement Learning (CRL) successfully scales to depths that other RL algorithms do not.
- Both width and depth matter for performance, but depth delivers more efficient gains.
- Deep networks exhibit emergent capabilities, benefit from batch-size scaling, and learn difficult maze topologies that shallow networks cannot.
Questions that remain
- How far can these networks scale before hitting diminishing returns?
- Does scaling depth help on simpler single-goal tasks or only on complex ones?
- Can depth, width, and batch size be combined for even greater gains?
- Why, from a theoretical perspective, does extreme depth help so much in CRL?
- Can deep CRL models transfer skills learned in one environment to new tasks with minimal retraining?
Summary
This research shows that making CRL networks far deeper than usual can lead to significant improvements in behavior learning and generalization. By unlocking deeper representations without depending on explicit rewards, this approach points toward more flexible RL systems capable of discovering complex behaviors on their own.
The research was conducted by a team consisting of Kevin Wang, Ishaan Javali, and Benjamin Eysenbach of Princeton University, Michał Bortkiewicz and Tomasz Trzcinski of Warsaw University of Technology, as well as Tooploox and the IDEAS Research Institute.