Cooperative Multi-Agent Reinforcement Learning in Sparse-Reward, Partially Observable 3D Environments

Reinforcement learning (RL) already outperforms elite players in board games and classic video games. Turning those solo successes into genuine team play is the next big step. Yet many algorithms built for single-agent tasks break down when several agents must work together, especially in 3D worlds where each player sees only a slice of the action.

De Wet Denkema’s master’s research, supervised by Professor Herman Engelbrecht, closes that gap. It adapts recent single-agent advances to multi-task, cooperative multi-agent reinforcement learning (MARL) inside a Portal 2-inspired simulation. Two virtual robots must press buttons, create portals, and walk through them in the right sequence before time runs out. Sparse rewards, partial views, and a strict ordering of subtasks make this a formidable test bench for teamwork.

Building the Testbed

The research evaluated available game engines, settled on a Portal 2 environment, and built a suite of puzzles of increasing complexity. Each scenario keeps the action space small (move, jump, place portal) yet demands long-range memory and joint planning. The setup provides a challenging benchmark that mirrors real-world problems such as warehouse robots sharing narrow aisles or drone fleets coordinating deliveries.
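
A rough sketch helps make this setting concrete. The interface below is hypothetical: PortalPuzzleEnv, the action list, and the subtask names are illustrative stand-ins, assuming only what the text states, i.e. a small discrete action space, per-agent partial views, strictly ordered subtasks, and a sparse team reward.

    # Hypothetical interface sketch; names and shapes are illustrative,
    # not taken from the thesis.
    from dataclasses import dataclass, field

    ACTIONS = ("move", "turn_left", "turn_right", "jump", "place_portal")

    @dataclass
    class PortalPuzzleEnv:
        """Two agents, first-person partial views, one sparse team reward."""
        max_steps: int = 500
        subtask_order: tuple = ("press_button", "open_portal", "walk_through")
        completed: list = field(default_factory=list)
        t: int = 0

        def reset(self):
            self.completed, self.t = [], 0
            return self._observations()

        def step(self, joint_action):
            self.t += 1
            # A subtask only counts when every earlier one is already done,
            # which enforces the strict ordering of the puzzle.
            for event in self._world_events(joint_action):
                if (len(self.completed) < len(self.subtask_order)
                        and event == self.subtask_order[len(self.completed)]):
                    self.completed.append(event)
            solved = len(self.completed) == len(self.subtask_order)
            reward = 1.0 if solved else 0.0   # sparse: one signal at the very end
            done = solved or self.t >= self.max_steps
            return self._observations(), reward, done

        def _observations(self):
            # Each agent receives only its own camera view (partial observability).
            return [f"pixels_agent_{i}" for i in range(2)]

        def _world_events(self, joint_action):
            return []  # placeholder for the engine's physics and triggers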

The core learner is QMIX, a value-based MARL algorithm that learns one action-value network per agent and fuses their outputs through a monotonic mixing network. To raise its ceiling, the study added the following refinements (the mixing step and the training pipeline are sketched after the list):

  • Decoupled actor processes that gather experience in parallel, multiplying learning samples.
  • Prioritised experience replay, so the network trains more often on rare but important memories.
  • Recurrent burn-in to give the long short-term memory (LSTM) layer enough context before gradients flow.
  • Reward standardisation and n-step returns to keep updates stable and informative.
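
QMIX itself is published work (Rashid et al., 2018), so its core can be sketched independently of the thesis. Below is a minimal PyTorch version of the mixing step, with hyperparameters chosen for illustration only:

    import torch
    import torch.nn as nn

    class QMixer(nn.Module):
        """Monotonic mixing network at the heart of QMIX.

        Hypernetworks conditioned on the global state emit the mixing
        weights; taking their absolute value keeps dQ_tot/dQ_i >= 0, so
        each agent acting greedily on its own Q also maximises the team's.
        """
        def __init__(self, n_agents: int, state_dim: int, embed_dim: int = 32):
            super().__init__()
            self.n_agents, self.embed_dim = n_agents, embed_dim
            self.hyper_w1 = nn.Linear(state_dim, n_agents * embed_dim)
            self.hyper_b1 = nn.Linear(state_dim, embed_dim)
            self.hyper_w2 = nn.Linear(state_dim, embed_dim)
            self.hyper_b2 = nn.Sequential(
                nn.Linear(state_dim, embed_dim), nn.ReLU(), nn.Linear(embed_dim, 1))

        def forward(self, agent_qs, state):
            # agent_qs: (batch, n_agents); state: (batch, state_dim)
            bs = agent_qs.size(0)
            w1 = torch.abs(self.hyper_w1(state)).view(bs, self.n_agents, self.embed_dim)
            b1 = self.hyper_b1(state).view(bs, 1, self.embed_dim)
            hidden = torch.relu(agent_qs.view(bs, 1, self.n_agents) @ w1 + b1)
            w2 = torch.abs(self.hyper_w2(state)).view(bs, self.embed_dim, 1)
            b2 = self.hyper_b2(state).view(bs, 1, 1)
            return (hidden @ w2 + b2).view(bs)   # Q_tot for the whole team

Because the mixing weights are constrained to be non-negative, decentralised greedy action selection by each agent stays consistent with the single centralised training signal.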

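The last three refinements follow the R2D2 recipe (Kapturowski et al., 2019) and can be sketched together. Everything below is illustrative, using hypothetical names (RecurrentQNet, n_step_targets, burn_in_loss) rather than the thesis's own code:

    import torch
    import torch.nn as nn

    class RecurrentQNet(nn.Module):
        """Toy per-agent network: LSTM over observation features, linear Q head."""
        def __init__(self, obs_dim: int, n_actions: int, hidden: int = 64):
            super().__init__()
            self.lstm = nn.LSTM(obs_dim, hidden, batch_first=True)
            self.head = nn.Linear(hidden, n_actions)

        def forward(self, obs_seq, state=None):
            out, state = self.lstm(obs_seq, state)
            return self.head(out), state

    def n_step_targets(rewards, q_boot, gamma=0.99, n=3):
        """r_t + g*r_{t+1} + ... + g^(n-1)*r_{t+n-1} + g^n * q_boot[t], where
        rewards are standardised ((r - mean) / std over a running window) and
        q_boot[t] is the target network's value n steps ahead of step t."""
        B, T = rewards.shape
        targets = torch.zeros(B, T)
        for t in range(T):
            h = min(n, T - t)
            disc = gamma ** torch.arange(h, dtype=rewards.dtype)
            targets[:, t] = (rewards[:, t:t + h] * disc).sum(-1) + gamma ** h * q_boot[:, t]
        return targets

    def burn_in_loss(net, obs_seq, actions, targets, burn_in=40):
        """Recurrent burn-in: replay the first `burn_in` steps only to warm up
        the LSTM state, then compute the loss (and gradients) on the rest."""
        with torch.no_grad():                  # no gradients through the warm-up
            _, state = net(obs_seq[:, :burn_in])
        q, _ = net(obs_seq[:, burn_in:], state)
        chosen = q.gather(-1, actions[:, burn_in:].unsqueeze(-1)).squeeze(-1)
        td_error = chosen - targets[:, burn_in:]
        # |td_error| would set each sequence's priority in the replay buffer,
        # which decoupled actor processes keep filling in parallel.
        return (td_error ** 2).mean()
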
Noisy networks for exploration and value-function rescaling were also tested, but both cut performance in every trial and were dropped. An exhaustive ablation confirmed that each of the remaining ingredients is required for success in this setting.
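
For context, "value-function rescaling" usually refers to the invertible squashing transform of Pohlen et al. (2018), later used in R2D2. Whether the thesis used exactly this form is not stated here, but the standard version is:

    import torch

    def h(x, eps=1e-2):
        """Squash large targets so the network regresses on a compressed scale."""
        return torch.sign(x) * (torch.sqrt(torch.abs(x) + 1.0) - 1.0) + eps * x

    def h_inv(y, eps=1e-2):
        """Exact inverse of h, applied when forming Bellman targets."""
        return torch.sign(y) * (
            ((torch.sqrt(1.0 + 4.0 * eps * (torch.abs(y) + 1.0 + eps)) - 1.0)
             / (2.0 * eps)) ** 2 - 1.0)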

Figure 1: Demonstration images of features in the Portal 2 game.

Curriculum-Transfer Learning: From Simple to Intricate

Agents first practised on stripped-down tasks (press one button, place one portal) before facing harder puzzles. The knowledge carried over: transfer raised completion rates across the board and became indispensable in the toughest maps.

The same curriculum strategy worked even when the final challenge reshuffled the goals while keeping a related action and observation space, showing that the agents had picked up general skills rather than narrow scripts.
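
Mechanically, the transfer is simple: finish training on one task, keep the network weights, and continue on the next. A schematic sketch, where the task names and the make_env and train callables are hypothetical stand-ins for the thesis's setup:

    # Schematic curriculum-transfer loop; task names and callables are
    # illustrative, not the thesis's actual configuration.
    TASKS = ("press_one_button", "place_one_portal", "button_then_portal", "full_puzzle")

    def run_curriculum(agents, make_env, train, steps_per_task=1_000_000):
        """Train on each task in order. Skills carry over because the same
        `agents` (and their weights) move from one stage to the next."""
        for task in TASKS:
            env = make_env(task)   # related action/observation spaces throughout
            train(agents, env, steps_per_task)
        return agents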

Key outcomes:

  • First complete solution of a sparse-reward, first-person 3D cooperative puzzle with independent visual agents.
  • Clear recipe for bringing single-agent refinements into MARL without loss of stability.
  • Evidence that progressive training lets virtual teammates adapt their play when tasks evolve.

Value of this Research

Robots on factory floors, autonomous vehicles in busy traffic, and non-player characters in game QA all need smooth collaboration. This work shows that off-policy MARL methods, once tuned with the right ingredients, can learn those behaviours inside rich simulations, paving the way for safer and more reliable real-world deployments.

Download and read the full research: https://scholar.sun.ac.za/items/45204343-faba-4f4b-8374-e9732bfdb9b2