## Overview

We evolve environments at the frontier of a reinforcement learning agent's capabilities, leading to self-supervised teacher-student processes with strong zero-shot generalization results for agents learning to walk through challenging terrain and navigating complex human-designed mazes.

• Roughness
1
• Stump height
1 - 3
• Pit gap
3 - 5
• Stair steps
3 - 5
Show all seeds

## Introduction

Deep reinforcement learning (RL) has seen tremendous success over the past decade. However, agents trained on fixed environments are brittle, often failing the moment the environment changes even slightly, thus limiting the real-world applicability of current RL methods. A common remedy is to introduce more training data diversity by randomizing the environment’s parameters in every episode—a process called domain randomization (DR). For example, these parameters might control the friction coefficient and lighting conditions in a robotic arm simulation or the position of obstacles in a maze (we call each such environment variation a level). Procedurally generating training levels to expose RL agents to more diverse experiences has quickly become a common method for improving the robustness of RL agents. However, DR is often not enough to train robust agents in domains where the agent struggles to make progress on many challenging levels.

Adaptive curriculum methods match the complexity of training levels to an agent’s current capabilities, and have been shown to produce more robust policies in fewer training steps than domain randomization. These methods can be viewed as a game between a teacher that designs challenging levels and a student that solves them. This game is potentially open-ended, leading to the co-evolution of generally-capable students. By tailoring the distribution of entire levels throughout training, adaptive curricula perform unsupervised environment design (UED). However, training an environment-designing teacher is difficult, and the prior state-of-the-art UED method, Prioritized Level Replay (PLR), simply finds challenging levels through random search, making it unable to build off of previously found structures that were useful for training agents in the past. We can also expect its performance to degrade as the size of the design space grows, limiting the potential for open-ended co-evolution between teacher and student. In contrast, the evolutionary processes between organisms and their environments, which UED resembles, efficiently search the design space by successively mutating a population and selecting the fittest individuals.

We combine evolution with a principled regret-based environment designer in an algorithm called Adversarially Compounding Complexity by Editing Levels (ACCEL). As in PLR, our new method uses the difference between how well an agent did in an environment and how well it could have done (a metric known in decision theory as regret) to quantify the learning potential of revisiting this environment in the future. The important difference from PLR is that ACCEL evolves a curriculum by making small edits (e.g. mutations) to previously high regret levels, thus constantly producing new levels at the frontier of the student agent's capabilities. By compounding complexity over time, ACCEL harnesses the potential for efficient, open-ended search inherent to evolution. Our results show that ACCEL’s adaptive curricula lead to RL agents that outperform existing auto-curriculum learning methods.

The rest of this article describes ACCEL in more technical detail. We first provide the necessary background on the UED formalism, followed by a comprehensive description of the ACCEL algorithm. We then present our experimental results in two challenging domains consisting of maze navigation and continuous-control bipedal locomotion, where ACCEL evolves levels with higher complexity and produces policies exhibiting significantly improved generalization compared to other adaptive curriculum methods.

## Background

### Unsupervised Environment Design

The RL problem is typically formalized as an agent trying to solve a Markov Decision Process (MDP), which is defined as a tuple $$\langle S, A, \mathcal{T}, \mathcal{R}, \gamma \rangle$$, where $$S$$ and $$A$$ stand for the sets of states and actions respectively and $$\mathcal{T}: S \times A \rightarrow \mathbf{\Delta}(S)$$ is a transition function representing the probability that the system transitions from a state $$s_t \in S$$ to $$s_{t+1} \in S$$ given action $$a_t \in A$$. Each transition also induces an associated reward $$r_t$$ generated by a reward function $$\mathcal{R}: S \rightarrow \mathbb{R}$$, and $$\gamma$$ is a discount factor. When provided with an MDP, the goal of RL is to learn a policy $$\pi$$ that maximizes expected discounted reward, i.e. $$\mathbb{E}\left[\sum_{i=0}^{T} r_t\gamma^t\right]$$.

Despite the generality of the MDP framework, it is often an unrealistic model for real-world environments. First, it assumes full observability of the state, which is often impossible in practice. This is addressed in partially-observable MDPs, or POMDPs, which include an observation function $$\mathcal{I}: S \rightarrow O$$ which maps the true state (which is unknown to the agent) to a potentially noisy set of observations $$O$$. Secondly, the traditional MDP framework assumes a single reward and transition function, which are fixed throughout learning. Instead, in the real world, agents may experience variations not seen during training, which makes it crucial that policies are capable of robust transfer.

To address these issues, we use the recently introduced Underspecified POMDP, or UPOMDP, given by $$\mathcal{M} = \langle A, O, \Theta, S^{\mathcal{M}}, \mathcal{T}^{\mathcal{M}},\mathcal{I}^{\mathcal{M}},\mathcal{R}^{\mathcal{M}},\gamma \rangle$$. This definition is identical to a POMDP with the addition of $$\Theta$$ to represent the free parameters of the environment, similar to the context in a Contextual MDP. These parameters can be distinct at every time step and incorporated into the transition function $$\mathcal{T}^{\mathcal{M}}: S \times A \times \Theta \rightarrow \mathbf{\Delta}(S)$$. Following Jiang et al, 2021 we define a level $$\mathcal{M}_{\theta}$$ as an environment resulting from a fixed $$\theta \in \Theta$$. We define the value of $$\pi$$ in $$\mathcal{M}_{\theta}$$ to be $$V^{\theta}(\pi) = \mathbb{E}\left[\sum_{i=0}^{T} r_t\gamma^t\right]$$ where $$r_t$$ are the rewards achieved by $$\pi$$ in $$\mathcal{M}_{\theta}$$. UPOMDPs benefit from their generality, since $$\Theta$$ can represent possible dynamics or changes to observations (e.g. in sim2real), different reward functions, or differing game maps in procedurally generated environments.

### Methods for Unsupervised Environment Design

Unsupervised Environment Design (UED) seeks to produce a series of levels that form a curriculum for a student agent, such that the student agent is capable of systematic generalization across all possible levels. UED typically views levels as produced by a generator (or teacher) maximizing some utility function $$U_t(\pi, \theta)$$, for example constant utility for a DR generator:

\begin{aligned} U_t^U(\pi, \theta) = C. \tag{1} \end{aligned}

for any constant $$C$$. Recent approaches proposed to use objectives seeking to maximize regret, defined as the difference between the expected return of the current policy and the optimal policy, i.e:

\begin{align} U_t^R(\pi, \theta) & =\underset{\pi^* \in \Pi}{\arg\max}\;{\text{Regret}^{\theta}(\pi,\pi^*)} \tag{2} \\ & = \underset{\pi^* \in \Pi}{\arg\max}\;\{V^\theta(\pi^*)-V^\theta(\pi)\}. \tag{3} \end{align}

Regret-based objectives are desirable as they have been shown to promote the simplest possible levels that the agent cannot currently solve in a range of settings. More formally, if $$S_t= \Pi$$ is the strategy set of the student and $$S_t = \Theta$$ is the strategy set of the teacher, and if the learning process reaches a Nash equilibrium, then the resulting student policy $$\pi$$ provably converges to a minimax regret policy, defined as:

\pi = \underset{{\pi_A \in \Pi}}{\arg\min}\{\max_{\theta,\pi_B \in \Theta , \Pi}\{\text{Regret}^{\theta}(\pi_A,\pi_B)\}\}. \tag{4}

However, without access to $$\pi^*$$, UED algorithms must approximate the regret. PAIRED estimates regret as the difference in return attained by the main student agent and a second agent. By maximizing this difference, the teacher maximizes an approximation of the student's regret. Furthermore, multi-agent learning systems may not always converge in practice . Indeed, the Achilles' heel of prior UED methods, like PAIRED, is the difficulty of training the teacher, typically entailing an RL problem with sparse rewards and long-horizon credit assignment. An alternative regret-based UED approach is Prioritized Level Replay (PLR). PLR trains the student on challenging levels found by curating a rolling buffer of the highest-regret levels surfaced through random search over possible level configurations. In practice, PLR has been found to outperform other UED methods that directly train a teacher. PLR approximates regret using the positive value loss, given by

\frac{1}{T}\sum_{t=0}^{T} \max \left(\sum_{k=t}^T(\gamma\lambda)^{k-t}\delta_k, 0\right), \tag{5}

where $$\lambda$$ and $$\gamma$$ are the Generalized Advantage Estimation (GAE) and MDP discount factors respectively, and $$\delta_t$$, the TD-error at timestep $$t$$. Equipped with this method for approximating regret, Corollary 1 in Jiang et al, 2021 finds that if the student agent only trains on curated levels, then it will follow a minimax regret strategy at equilibrium. Thus, counterintuitively, the student learns more effectively by training on less data.

Empirically PLR has been shown to produce policies with strong generalization capabilities, but remains limited in being able to only curate randomly sampled levels. PLR's inability to build off of previously discovered structures may make it challenging to sample complex structures with the frequency required to train agents capable of solving them. Further, random search will suffer from the curse-of-dimensionality in higher-dimensional design spaces, where randomly encountering levels at the frontier of the agent's current capabilities can be highly unlikely.them.

## ACCEL: Adversarially Compounding Complexity by Editing Levels

In this section we introduce a new algorithm for UED, combining an evolutionary environment generator with a principled regret-based curator. Unlike PLR which relies on random sampling to produce new batches of training levels, we instead propose to make edits (e.g. mutations) to previously curated ones. Evolutionary methods have been effective in a variety of challenging optimization problems, yet typically rely on handcrafted, domain-specific rules. For example, POET manually filters BipedalWalker levels to have a return in the range $$[50,300]$$. The key insight in this work is that regret serves as a domain-agnostic fitness function for evolution, making it possible to consistently produce batches of levels at the frontier of agent capabilities across domains. Indeed, by iteratively editing and curating the resulting levels, the levels in the level replay buffer quickly increase in complexity. As such, we call our method Adversarially Compounding Complexity by Editing Levels, or ACCEL. A high-level illustration of ACCEL is depicted in Figure 3.

ACCEL does not rely on a specific editing mechanism, which could be any mutation process used in other open-ended evolutionary approaches. In this paper, editing involves making a handful of changes (e.g. adding or removing obstacles in a maze), which can operate directly on environment elements within the level or on a more indirect encoding such as the latent-space representation of the level under a generative model of the environment. In general, editing may rely on more advanced mechanisms, such as search-based methods, but in this work we predominantly make use of simple, random mutations. ACCEL makes the key assumption that regret varies smoothly with the environment parameters $$\Theta$$, such that the regret of a level is close to the regret of others within a small edit distance. If this is the case, then small edits to a single high-regret level should lead to the discovery of entire batches of high-regret levels---which could be an otherwise challenging task in high-dimensional design spaces.

Following Robust PLR we do not immediately train on edited levels. Instead, we first evaluate them and only add them to the level replay buffer if they have high regret, estimated by positive value loss (Equation 5). We consider two different criteria for selecting which replayed levels to edit: Under the "easy" criterion, we edit those levels which the agent can currently solve with low regret, approximated as the agent's return minus its regret. Under the "batch" criterion, we simply edit the entire batch of replayed levels.The full is procedure detailed in Algorithm 1 below:

ACCEL can be seen as a UED algorithm taking a step toward open-ended evolution, where the evolutionary fitness is estimated regret, as levels only stay in the population (that is, the level replay buffer) if they meet the high-regret criterion for curation. However, ACCEL avoids some important weaknesses of evolutionary algorithms such as POET: First, ACCEL maintains a population of levels, but not a population of agents. Thus, ACCEL requires only a single desktop GPU for training. In contrast, evolutionary approaches typically require a CPU cluster. Moreover, forgoing an agent population allows ACCEL to avoid the agent selection problem. Instead, ACCEL directly trains a single generally-capable agent. Finally, since ACCEL uses a minimax regret objective (rather than minimax as in POET), it naturally promotes levels at the frontier of agent's capabilities, without relying on domain-specific knowledge (such as reward ranges). Training on high regret levels also means that ACCEL inherits the robustness guarantees in equilbrium from PLR (from Corrollary 1 in Jiang et al 2021):

Remark 1. If ACCEL reaches a Nash equilibrium, then the student follows a minimax regret strategy.

In stark contrast, other evolutionary approaches primarily justify their applicability solely via empirical results on specific domains. As we will show next, a key strength of ACCEL is its generality, as it can produce highly capable agents in a diverse range of environments, without domain knowledge.

## Experiments

We compare ACCEL to the top-performing UED baselines in a diverse range of environments spanning both discrete and continuous control. First we show ACCEL can produce agents that can solve complex mazes, in a fully observable lava grid requiring safe exploration and in a larger, partially-observable maze domain. In both environments, ACCEL's curriculum begins with a simple empty room, and the editor makes incremental changes by adding or removing tiles at random. As we see in Figure 4, this approach leads to the rapid emergence of highly complex structures and an agent that can solve them.

Interestingly, the agent not only learns to successfully navigate levels featuring complex structures, but also is able to perform zero-shot to a series of challenging human designed tasks in the same domain. In the maze domain, we evaluate agents on the full set of human-designed test levels from previous works. The results summarized in Figure 5 show that ACCEL produces a highly capable agent that consistently outperforms all other methods in zero-shot transfer.

Given ACCEL's strong performance on previous zero-shot transfer environments, we were curious to see if the robust behaviors it induces would generalize to even more complex settings. To test this, we used a $$51 \times 51$$ procedurally-generated maze environment, representing a significant increase in size from the $$15 \times 15$$ mazes seen during training. This larger maze environment also subjects the agent to a $$50\times$$ longer episode length. Once again the ACCEL agents outperforms the other methods, achieving over $$50\%$$ solved rate over 100 trials. The next best agent was PLR achieves a $$25\%$$, while all other baselines fail to solve the maze. A trajectory taken by an ACCEL agent in this large procedurally-generated maze is provided in Figure 6, where the agent seems to be approximately follow the left-hand rule, a algorithm that consistently solves single-component "perfect mazes" like this one.

Next, we move to a continuous-control BipedalWalker environment, popularized by the influential series of papers on POET. In this domain, the environment design space is an 8-dimensional indirect encoding representing the minimum and maximum intensity of four kinds of terrain-based obstacles for a two-legged robot: the roughness of the ground, height of stump obstacles, width of pit gap obstacles, and the size of ascending and descending flights of stairs. We use the same values as in POET. In this domain, ACCEL starts its evolutionary process from perfectly flat tracks without any obstacles. As we see in Figure 7, ACCEL quickly compounds multiple edits leading to training levels that rapidly increase in complexity while targeting the agent's current weaknesses, resulting in a robust agent capable of solving levels with highly challenging terrain.

Our primary objective is to produce a generally-capable agent robust to all challenges within a specific domain. To analyze our agents' strengths and weaknesses along the main kinds of challenges in this domain, we evaluate all agents on separate environments that present each type of terrain obstacle in insolation. We also evaluate each agent on the BipedalWalkerHardcore environment, which presents a difficult configuration of the 8D environment parameter that produces a combination of type of terrain obstacle. These results, summarized in Figure 8, show that ACCEL produces agents with more robust behaviors across the full spectrum of environment challenges compared to other methods. The interactive demo at the start of this article allows you to play the role of environment designer, by adjusting the environment parameters to produce new levels for the ACCEL and baseline agents to solve. The javascript RL environment in this demo is a heavily modified version of the excellent implementation of BipedalWalker from TeachMyAgent.

Despite their differing objectives, we compared the performance of agents trained with ACCEL to those produced by POET. We performed this comparison using the lower-dimensional design space used by the original POET experiments, which does not include stairs in the encoding. We trained the ACCEL agent for 50,000 student PPO updates, equivalent to approximately 3B environment steps—well under the approximately 500B environment steps used to train POET. We then looked at compared the proportion of levels in ACCEL's level replay buffer that fall into each of the difficulty settings defined in the original POET experiments: easy, challenging, very challenging, and extremely challenging levels satisfy zero, one, two, and three of the criteria in Table 1. The proportion of ACCEL's level replay buffer falling into each difficulty class at various stages of training is summarized in Figure 9. The composition of the level replay buffer shows ACCEL is capable of discovering "extremely challenging" levels like those discovered by POET, alongside an agent that can solve them, despite using neither a population nor a domain-specific novelty criteria, as required by POET.

Max stump height Max pit gap width Roughness
$$\ge 2.4$$ $$\ge 6.0$$ $$\ge 4.5$$

Importantly, the goal of our work differs from that of POET. The primary motivation of ACCEL is to produce a single robust agent that can solve a wide range of challenges. ACCEL's regret-based curriculum seeks to prioritize the simplest levels the agent cannot currently solve. In contrast, POET co-evolves agent-environment pairs in order to find specialized policies for solving a single highly specialized task. POET's "specialist" agents may likely learn to solve challenging environments outside the capabilities of ACCEL's "generalist" agents, but at the cost of potentially overfitting to their paired levels. Thus, unlike ACCEL, the policies produced by POET should not be expected to be robust across the full distribution of levels. The relative strengths of POET and ACCEL highlight an important trade-off between specialization and generalization. While demonstrated that adaptive curricula resulting from regret-based evolutionary search can lead to highly robust generalist agents within specific domains, these results certainly do not cast doubt on the value of agent populations. It is likely that as we move to evermore challenging environments, population of specialists will become increasingly important. A population of specialist agents may be more effective at discovering diverse and complex behaviors. The interplay between generalists and specialists serves as an intringuing open question that warrants much further thought and exploration.

## Related Work

Algorithm Generation strategy Generator objective Curator objective Setting
POET Evolution Minimax MCC Population-based
PAIRED RL Minimax regret None Single agent
PLR Random None Minimax regret Single agent
ACCEL Random + Evolution None Minimax regret Single agent

In this section we discuss related work, with a summary of the most closely related methods in Table 2. Our paper focuses on testing agents on distributions of environments, long known to be crucial for evaluating the generality of RL agents. The inability of deep RL agents to reliably generalize has drawn considerable attention, with policies often failing to adapt to changes in the observation, dynamics, or reward.

This work focuses on the unsupervised environment design (UED) paradigm, which aims to design environment directly, such that the they induce experiences that result in learning more robust policies. Domain Randomization (DR) can be viewed as the most basic form of UED. DR has been particularly successful in areas such as robotics, with extensions that actively update the DR distribution. This paper directly extends PLR, a method for curating DR levels such that those with high learning potential can be revisited by the student agent for training. PLR is related to TSCL, self-paced learning, and ALP-GMM, which seek to maintain and update distributions over environment parameterizations based on the recent performance of the agent. Recently, a method similar to PLR was shown to be capable of producing generally capable agents in a simulated game world with a smooth space of levels.

Dennis et al, 2020 first formalized UED and introduced the PAIRED algorithm, a minimax regret UED algorithm whereby an environment adversary learns to present levels that maximize regret, approximated as the difference in performance between the main student agent and a second agent. PAIRED produces agents with improved zero-shot transfer to unseen environments and has been extended to train agents that can robustly navigate websites. Adversarial objectives have also been used in robotics and in directly searching for situations in which the agent sees the weakest performance. POET considers co-evolving a popualation of minimax environments alongside the agents. We take inspiration from the evolutionary nature of POET but train a single agent, which is beneficial as it takes significantly fewer resources, while also eliminating the agent selection problem.

UED is inherently related to the field of Automatic Curriculum Learning (ACL), which seeks to provide an adaptive curriculum of increasingly challenging tasks or goals given a (typically) fixed environment. This setting differs from a general UPOMDP, where a task or goal specifier is not provided. Furthermore, UED approaches seek to \emph{fully specify} environments, rather than just goals within a fixed environment. Asymmetric Self-Play takes the form of one agent proposing goals for another, which was shown to be effective for challenging robotic manipulation tasks. AMIGo and APT-Gen provide solutions to problems where the target task is known, providing a curriculum of increasing difficulty. Indeed, many ACL methods emphasize learning to reach goals or states with high uncertainty, either using generative or world models.

Environment design has also been considered in the symbolic AI community as a means to shape an agent's decisions. Automated environment design seeks to redesign specific levels to improve the agent's performance within them. Unlike these works, ACCEL automatically designs environments to produce a curriculum that improves the student agent's performance across the full range of levels.

Our work also relates to the field of procedural content generation (PCG), which seeks to algorithmically generate diverse levels. Popular PCG environments used in RL include the Procgen Benchmark, MiniGrid, Obstacle Tower, GVGAI, and the NetHack Learning Environment. This work uses the recently proposed MiniHack environment, which provides a flexible framework for creating diverse environments. Within the PCG community, automatically generating game levels has been of interest for more than a decade, with recent methods making use of machine learning. PCGRL frames level design as an RL problem, whereby the design policy makes incremental changes to the level to maximize some arbitrary objective subject to game-specific constraints. Unlike ACCEL, it makes use of hand-designed dense rewards and focuses on the design of levels for human players, rather than as an avenue to training highly-robust agents.

## Conclusion and Future Work

This article presented ACCEL, a new method for Unsupervised Environment Design (UED) that evolves a curriculum by editing previously discovered high-regret levels. Editing high-regret levels leads to a wide variety of levels at the frontier of the agent's capabilities, producing curricula that start simple and quickly compound in complexity. Thus, ACCEL provides a principled regret-based curriculum that does not require domain-specific heuristics, alongside an evolutionary process that produces a broad spectrum of complexity matched to the agent's current capabilities. In our experiments, we showed that ACCEL is capable of producing robust agents in several challenging design spaces, where ACCEL agents outperform the best-performing baselines.

We are excited by the many possibilities for extending how ACCEL edits and maintains its population of high-regret levels. The editing mechanism could encompass a wide variety of approaches, such as Neural Cellular Automata, controllable editors, or perturbing the latent space of a generative model. Another possibility is to actively search for levels that are likely to lead to useful levels in the future. There is also much promise in methods that directly encourage level diversity, which we expect to play a more important role as the level design space grows. We believe the intersection of evolution and adaptive curricula presents many promising directions that may take us closer to producing a truly open-ended learning process between the agent and the environment.

### Acknowledgements

We thank Kenneth Stanley, Alessandro Lazaric, Ian Fox, and Iryna Korshunova for useful discussions and feedback on this work, while we are also grateful to anonymous reviewers for their recommendations that helped improve the paper.

### Citation

Parker-Holder et al., "Evolving Curricula with Regret-Based Environment Design", 2022.
@article{parkerholder2022evolving,
}