We evolve environments at the frontier of a reinforcement learning agent's capabilities, leading to self-supervised teacher-student processes with strong zero-shot generalization results for agents that learn to walk across challenging terrain and to navigate complex human-designed mazes.
Deep reinforcement learning (RL) has seen tremendous success over the past decade. However, agents trained on fixed environments are brittle, often failing the moment the environment changes even slightly, thus limiting the real-world applicability of current RL methods. A common remedy is to introduce more training data diversity by randomizing the environment’s parameters in every episode—a process called domain randomization (DR). For example, these parameters might control the friction coefficient and lighting conditions in a robotic arm simulation or the position of obstacles in a maze (we call each such environment variation a level). Procedurally generating training levels to expose RL agents to more diverse experiences has quickly become a common method for improving the robustness of RL agents. However, DR is often not enough to train robust agents in domains where the agent struggles to make progress on many challenging levels.
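To make the mechanics of domain randomization concrete, here is a minimal sketch of a DR training loop; the environment interface, the agent methods, and the `sample_level_parameters` helper are illustrative assumptions rather than an API from any particular library.

```python
import random

def sample_level_parameters(rng=random):
    """Draw one random level configuration (illustrative parameters only)."""
    return {
        "friction": rng.uniform(0.5, 1.5),
        "lighting": rng.uniform(0.2, 1.0),
        "num_obstacles": rng.randint(0, 10),
    }

def domain_randomization_training(env, agent, num_episodes):
    """Train on a freshly randomized level every episode, regardless of the agent's progress."""
    for _ in range(num_episodes):
        level = sample_level_parameters()
        obs = env.reset(level=level)  # assumed to accept a level configuration
        done = False
        while not done:
            action = agent.act(obs)
            obs, reward, done, info = env.step(action)
            agent.observe(obs, reward, done)
```

Because each level is drawn independently of how well the agent is doing, DR provides diversity but no notion of appropriate difficulty, which is the gap adaptive curricula aim to fill.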
Adaptive curriculum methods match the complexity of training levels to an agent’s current capabilities, and have been shown to produce more robust policies in fewer training steps than domain randomization. These methods can be viewed as a game between a teacher that designs challenging levels and a student that solves them. This game is potentially open-ended, leading to the co-evolution of generally-capable students. By tailoring the distribution of entire levels throughout training, adaptive curricula perform unsupervised environment design (UED). However, training an environment-designing teacher is difficult, and the prior state-of-the-art UED method, Prioritized Level Replay (PLR), simply finds challenging levels through random search, leaving it unable to build on structures that previously proved useful for training. We can also expect its performance to degrade as the size of the design space grows, limiting the potential for open-ended co-evolution between teacher and student. In contrast, the evolutionary processes between organisms and their environments, which UED resembles, efficiently search the design space by successively mutating a population and selecting the fittest individuals.
We combine evolution with a principled regret-based environment designer in an algorithm called Adversarially Compounding Complexity by Editing Levels (ACCEL). As in PLR, our new method uses the difference between how well an agent did in an environment and how well it could have done (a metric known in decision theory as regret) to quantify the learning potential of revisiting this environment in the future. The important difference from PLR is that ACCEL evolves a curriculum by making small edits (e.g. mutations) to previously curated high-regret levels, thus constantly producing new levels at the frontier of the student agent's capabilities. By compounding complexity over time, ACCEL harnesses the potential for efficient, open-ended search inherent to evolution. Our results show that ACCEL’s adaptive curricula lead to RL agents that outperform existing auto-curriculum learning methods.
The rest of this article describes ACCEL in more technical detail. We first provide the necessary background on the UED formalism, followed by a comprehensive description of the ACCEL algorithm. We then present our experimental results in two challenging domains consisting of maze navigation and continuous-control bipedal locomotion, where ACCEL evolves levels with higher complexity and produces policies exhibiting significantly improved generalization compared to other adaptive curriculum methods.
The RL problem is typically formalized as an agent trying to solve a Markov Decision Process (MDP), which is defined as a tuple $$\langle S, A, \mathcal{T}, \mathcal{R}, \gamma \rangle$$, where $$S$$ and $$A$$ stand for the sets of states and actions respectively and $$\mathcal{T}: S \times A \rightarrow \mathbf{\Delta}(S)$$ is a transition function representing the probability that the system transitions from a state $$s_t \in S$$ to $$s_{t+1} \in S$$ given action $$a_t \in A$$. Each transition also induces an associated reward $$r_t$$ generated by a reward function $$\mathcal{R}: S \rightarrow \mathbb{R}$$, and $$\gamma$$ is a discount factor. When provided with an MDP, the goal of RL is to learn a policy $$\pi$$ that maximizes the expected discounted return, i.e. $$\mathbb{E}\left[\sum_{t=0}^{T} \gamma^t r_t\right]$$.
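As a small worked example of the objective above, the following sketch (plain Python, not code from the paper) computes the discounted return of a single episode from its reward sequence:

```python
def discounted_return(rewards, gamma=0.99):
    """Compute sum_t gamma^t * r_t for one episode's sequence of rewards."""
    g = 0.0
    # Accumulate backwards so each reward is discounted once more per remaining step.
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# Example: three rewards of 1.0 with gamma = 0.9 give 1 + 0.9 + 0.81 = 2.71.
assert abs(discounted_return([1.0, 1.0, 1.0], gamma=0.9) - 2.71) < 1e-9
```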
Despite the generality of the MDP framework, it is often an unrealistic model for real-world environments. First, it assumes full observability of the state, which is often impossible in practice. This is addressed in partially-observable MDPs, or POMDPs, which add an observation function $$\mathcal{I}: S \rightarrow O$$ that maps the true state (unknown to the agent) to a potentially noisy observation in $$O$$. Second, the traditional MDP framework assumes a single reward and transition function, which are fixed throughout learning. Instead, in the real world, agents may experience variations not seen during training, which makes it crucial that policies are capable of robust transfer.
To address these issues, we use the recently introduced Underspecified POMDP, or UPOMDP, which augments the POMDP with a set of free parameters $$\Theta$$ representing the aspects of the environment that are allowed to vary. Assigning a concrete value $$\theta \in \Theta$$ to these free parameters instantiates a specific level.
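The sketch below illustrates one plausible interface for such an environment; the class and method names are hypothetical, but they capture the key idea that the environment is only fully specified once a level (a concrete setting of the free parameters) is supplied.

```python
from dataclasses import dataclass
from typing import Tuple
import numpy as np

@dataclass
class Level:
    """A concrete assignment (theta) of the UPOMDP's free parameters."""
    params: np.ndarray  # e.g. wall positions in a maze, or terrain settings for a walker

class UnderspecifiedEnv:
    """Hypothetical gym-style interface for a UPOMDP.

    All levels share the same observation and action spaces, but differ in
    their dynamics -- and therefore in how difficult they are for the agent.
    """

    def reset(self, level: Level) -> np.ndarray:
        """Instantiate the level described by `level.params` and return the first observation."""
        raise NotImplementedError

    def step(self, action: int) -> Tuple[np.ndarray, float, bool, dict]:
        """Partially observable step: returns (observation, reward, done, info)."""
        raise NotImplementedError
```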
Unsupervised Environment Design (UED)
Under this formalism, domain randomization corresponds to a teacher that maximizes a constant utility function, $$U(\pi, \theta) = C$$ for any constant $$C$$. Recent approaches instead propose objectives that maximize the student's regret, defined as the difference between the expected return of the optimal policy and that of the current policy, i.e.: $$\text{Regret}^{\theta}(\pi) = V^{\theta}(\pi^*) - V^{\theta}(\pi),$$ where $$V^{\theta}(\pi) = \mathbb{E}\left[\sum_{t=0}^{T} \gamma^t r_t\right]$$ is the expected discounted return of policy $$\pi$$ on the level $$\theta$$ and $$\pi^*$$ is an optimal policy for that level.
Regret-based objectives are desirable as they have been shown to promote the simplest possible levels that the agent cannot currently solve, across a range of settings.
However, without access to $$\pi^*$$, UED algorithms must approximate the regret. PAIRED estimates regret as the difference between the return attained by a second partner agent and that attained by the main student agent. By maximizing this difference, the teacher maximizes an approximation of the student's regret. Furthermore, such multi-agent learning systems may not always converge in practice, limiting the effectiveness of this approach.
In contrast, PLR estimates regret directly from the student's own experience, using the positive value loss $$\frac{1}{T}\sum_{t=0}^{T} \max\left(\sum_{k=t}^{T} (\gamma\lambda)^{k-t}\delta_k,\; 0\right),$$ where $$\lambda$$ and $$\gamma$$ are the Generalized Advantage Estimation (GAE) and MDP discount factors respectively, and $$\delta_k$$ is the TD error at timestep $$k$$. PLR uses this score to prioritize which previously seen levels to replay for training.
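The following sketch shows how such a score could be computed from the TD errors of a single rollout; it is a minimal illustration of the positive value loss under the assumptions above, not the paper's implementation.

```python
import numpy as np

def positive_value_loss(td_errors, gamma=0.995, gae_lambda=0.95):
    """Average clipped GAE over an episode, used as a learning-potential score.

    td_errors: array of one-step TD errors (delta_t) collected on a level.
    """
    T = len(td_errors)
    advantages = np.zeros(T)
    gae = 0.0
    # Accumulate discounted TD errors backwards to obtain GAE advantages.
    for t in reversed(range(T)):
        gae = td_errors[t] + gamma * gae_lambda * gae
        advantages[t] = gae
    # Clip negative advantages to zero and average over the episode.
    return float(np.maximum(advantages, 0.0).mean())
```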
Empirically, PLR has been shown to produce policies with strong generalization capabilities, but it remains limited to curating randomly sampled levels. PLR's inability to build on previously discovered structures may make it difficult to sample complex structures frequently enough to train agents capable of solving them. Further, random search suffers from the curse of dimensionality in higher-dimensional design spaces, where randomly encountering levels at the frontier of the agent's current capabilities can be highly unlikely.
In this section we introduce a new algorithm for UED, combining an evolutionary environment generator with a principled regret-based curator. Unlike PLR, which relies on random sampling to produce new batches of training levels, we instead propose making edits (e.g. mutations) to previously curated levels. Evolutionary methods have been effective in a variety of challenging optimization problems.
ACCEL does not rely on a specific editing mechanism: the editor could be any mutation process used in other open-ended evolutionary approaches. In our experiments we use simple, domain-general edits, such as flipping a single tile in a maze or making a small change to a terrain parameter, and rely on the regret-based curator to decide which edited levels are worth keeping.
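As a concrete (and deliberately simplified) example, here is what such an editor might look like for a tile-based maze; a real implementation would also avoid overwriting the agent's start and goal positions.

```python
import random
import numpy as np

EMPTY, WALL = 0, 1

def edit_level(grid: np.ndarray, num_edits: int = 1) -> np.ndarray:
    """Minimal ACCEL-style editor: flip a few randomly chosen tiles.

    Each edit toggles one tile between empty and wall, leaving the rest of
    the level -- and any structure accumulated by earlier edits -- intact.
    """
    new_grid = grid.copy()
    height, width = new_grid.shape
    for _ in range(num_edits):
        y, x = random.randrange(height), random.randrange(width)
        new_grid[y, x] = WALL if new_grid[y, x] == EMPTY else EMPTY
    return new_grid

# ACCEL's maze curricula start from an empty room and compound complexity via edits.
empty_room = np.zeros((15, 15), dtype=int)
child = edit_level(empty_room, num_edits=3)
```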
Following Robust PLR, the student is trained only on levels replayed from the buffer: newly generated and newly edited levels are first evaluated without gradient updates, and are added to the buffer only if their estimated regret is high enough. Training exclusively on curated high-regret levels is what yields the theoretical property stated in Remark 1 below.
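Putting the pieces together, the outer loop of ACCEL might look roughly as follows. This is a hedged sketch under the assumptions above: `sample_random_level`, `evaluate_regret`, `train_and_estimate_regret`, and the buffer methods are hypothetical stand-ins, and `edit_level` is the tile-flipping sketch shown earlier.

```python
import random

def accel_outer_loop(agent, env, buffer, num_iters,
                     replay_prob=0.8, num_children=4, num_edits=3):
    """Sketch of ACCEL's generate / replay-and-edit loop (helper names are hypothetical)."""
    for _ in range(num_iters):
        if len(buffer) == 0 or random.random() > replay_prob:
            # Generate: sample a new level and evaluate it without updating the agent.
            level = sample_random_level()
            score = evaluate_regret(agent, env, level)
            buffer.maybe_add(level, score)
        else:
            # Replay: train the agent on a curated high-regret level...
            level = buffer.sample_prioritized()
            score = train_and_estimate_regret(agent, env, level)  # e.g. PPO update + regret score
            buffer.update(level, score)
            # ...then edit it and evaluate the children without training,
            # so that complexity compounds at the frontier of the agent's abilities.
            for _ in range(num_children):
                child = edit_level(level, num_edits)
                child_score = evaluate_regret(agent, env, child)
                buffer.maybe_add(child, child_score)
    return agent
```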
ACCEL can be seen as a UED algorithm taking a step toward open-ended evolution, as its level buffer can accumulate increasingly complex structures as the student improves.
Remark 1. If ACCEL reaches a Nash equilibrium, then the student follows a minimax regret strategy.
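To make Remark 1 concrete, the minimax regret strategy it refers to can be written in the notation introduced above (this is the standard formulation, stated here for convenience rather than quoted from the paper):

$$\pi^{\dagger} \in \arg\min_{\pi \in \Pi}\; \max_{\theta \in \Theta}\; \text{Regret}^{\theta}(\pi) = \arg\min_{\pi \in \Pi}\; \max_{\theta \in \Theta}\; \left[ V^{\theta}(\pi^*) - V^{\theta}(\pi) \right],$$

where $$\Pi$$ is the student's policy space, $$\Theta$$ is the space of levels the teacher can design, and $$\pi^*$$ denotes an optimal policy for the level $$\theta$$.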
In stark contrast, other evolutionary approaches justify their applicability primarily through empirical results on specific domains. As we will show next, a key strength of ACCEL is its generality: it can produce highly capable agents in a diverse range of environments without domain knowledge.
We compare ACCEL to the top-performing UED baselines in a diverse range of environments spanning both discrete and continuous control. First, we show that ACCEL can produce capable navigation agents in two grid-based domains: a fully observable lava grid requiring safe exploration, and a larger, partially observable maze domain. In both environments, ACCEL's curriculum begins with a simple empty room, and the editor makes incremental changes by adding or removing tiles at random. As we see in Figure 4, this approach leads to the rapid emergence of highly complex structures and an agent that can solve them.
Interestingly, the agent not only learns to successfully navigate levels featuring complex structures, but also transfers zero-shot to a series of challenging human-designed tasks in the same domain. In the maze domain, we evaluate agents on the full set of human-designed test levels from previous works.
Given ACCEL's strong performance on previous zero-shot transfer environments, we were curious whether the robust behaviors it induces would generalize to even more complex settings. To test this, we used a $$51 \times 51$$ procedurally-generated maze environment, a significant increase in size from the $$15 \times 15$$ mazes seen during training. This larger maze also subjects the agent to a $$50\times$$ longer episode length. Once again the ACCEL agent outperforms the other methods, achieving a solve rate of over $$50\%$$ across 100 trials. The next best method, PLR, achieves $$25\%$$, while all other baselines fail to solve the maze. A trajectory taken by an ACCEL agent in this large procedurally-generated maze is shown in Figure 6, where the agent appears to approximately follow the left-hand rule, an algorithm that consistently solves single-component "perfect" mazes like this one.
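For readers curious about the wall-following heuristic mentioned above, here is a minimal sketch of one step of the left-hand rule on a grid maze; the grid representation and function names are illustrative rather than taken from the paper's codebase.

```python
# Headings ordered clockwise: up, right, down, left.
DIRS = [(-1, 0), (0, 1), (1, 0), (0, -1)]

def left_hand_step(grid, pos, heading):
    """Take one step of the left-hand (wall-following) rule.

    grid[y][x] is True for walls; `heading` indexes into DIRS. The agent
    prefers turning left, then going straight, then turning right, then
    reversing -- a strategy guaranteed to traverse a simply connected
    ("perfect") maze.
    """
    for turn in (-1, 0, 1, 2):  # left, straight, right, back
        new_heading = (heading + turn) % 4
        dy, dx = DIRS[new_heading]
        y, x = pos[0] + dy, pos[1] + dx
        if 0 <= y < len(grid) and 0 <= x < len(grid[0]) and not grid[y][x]:
            return (y, x), new_heading
    return pos, heading  # fully enclosed: no open neighbour
```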
Next, we move to a continuous-control BipedalWalker environment, popularized by the influential series of papers on POET. In this domain, each level is defined by an 8-dimensional parameter vector controlling the roughness of the terrain and the size of obstacles such as stumps, pits, and stairs.
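To give a feel for what a level in this design space looks like, here is an illustrative parameterization together with an ACCEL-style edit; the field names and increments are assumptions for exposition and do not reproduce the paper's exact 8D encoding.

```python
from dataclasses import dataclass, replace
import random

@dataclass
class WalkerLevel:
    """Illustrative (not the paper's exact) parameterization of a BipedalWalker level."""
    roughness: float = 0.0     # amplitude of terrain noise
    stump_height: float = 0.0  # maximum height of stump obstacles
    pit_gap: float = 0.0       # maximum width of pit obstacles
    stair_height: float = 0.0  # maximum height of individual stair steps
    stair_steps: int = 0       # maximum number of consecutive stair steps

def edit_walker_level(level: WalkerLevel) -> WalkerLevel:
    """ACCEL-style edit: nudge one randomly chosen parameter by a small increment."""
    new = replace(level)
    name = random.choice(["roughness", "stump_height", "pit_gap", "stair_height", "stair_steps"])
    if name == "stair_steps":
        new.stair_steps = max(0, new.stair_steps + random.choice([-1, 1]))
    else:
        setattr(new, name, max(0.0, getattr(new, name) + random.choice([-0.2, 0.2])))
    return new

# Starting from trivially flat terrain, repeated edits gradually introduce obstacles.
level = WalkerLevel()
for _ in range(10):
    level = edit_walker_level(level)
```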
Our primary objective is to produce a generally capable agent that is robust to all challenges within a specific domain. To analyze our agents' strengths and weaknesses along the main kinds of challenges in this domain, we evaluate all agents on separate environments that present each type of terrain obstacle in isolation. We also evaluate each agent on the BipedalWalkerHardcore environment, which presents a difficult configuration of the 8D environment parameters that combines multiple types of terrain obstacles. These results, summarized in Figure 8, show that ACCEL produces agents with more robust behaviors across the full spectrum of environment challenges compared to other methods. The interactive demo at the start of this article allows you to play the role of environment designer by adjusting the environment parameters to produce new levels for the ACCEL and baseline agents to solve. The JavaScript RL environment in this demo is a heavily modified version of the excellent BipedalWalker implementation from TeachMyAgent.
Despite their differing objectives, we compared the performance of agents trained with ACCEL to those produced by POET. We performed this comparison using the lower-dimensional design space used by the original POET experiments, evaluating both methods on challenging levels whose parameters meet or exceed the thresholds shown in the table below.
Max stump height | Max pit gap width | Roughness |
---|---|---|
$$\ge 2.4$$ | $$\ge 6.0$$ | $$\ge 4.5$$ |
Importantly, the goal of our work differs from that of POET. The primary motivation of ACCEL is to produce a single robust agent that can solve a wide range of challenges; its regret-based curriculum seeks to prioritize the simplest levels the agent cannot currently solve. In contrast, POET co-evolves agent-environment pairs in order to find specialized policies, each solving its own highly challenging task. POET's "specialist" agents may well learn to solve challenging environments beyond the capabilities of ACCEL's "generalist" agents, but at the cost of potentially overfitting to their paired levels. Thus, unlike ACCEL, the policies produced by POET should not be expected to be robust across the full distribution of levels. The relative strengths of POET and ACCEL highlight an important trade-off between specialization and generalization. While we have demonstrated that adaptive curricula resulting from regret-based evolutionary search can lead to highly robust generalist agents within specific domains, these results certainly do not cast doubt on the value of agent populations. It is likely that as we move to ever more challenging environments, populations of specialists will become increasingly important, as they may be more effective at discovering diverse and complex behaviors. The interplay between generalists and specialists is an intriguing open question that warrants much further thought and exploration.
Algorithm | Generation strategy | Generator objective | Curator objective | Setting |
---|---|---|---|---|
POET | Evolution | Minimax | MCC | Population-based |
PAIRED | RL | Minimax regret | None | Single agent |
PLR | Random | None | Minimax regret | Single agent |
ACCEL | Random + Evolution | None | Minimax regret | Single agent |
In this section we discuss related work, with a summary of the most closely related methods in Table 2. Our paper focuses on testing agents on distributions of environments, long known to be crucial for evaluating the generality of RL agents.
This work focuses on the unsupervised environment design (UED) paradigm.
Dennis et al., 2020 introduced the UED paradigm together with PAIRED, which trains an environment-designing teacher with RL to maximize an approximation of the student's regret.
UED is inherently related to the field of Automatic Curriculum Learning (ACL).
Environment design has also been considered in the symbolic AI community as a means to shape an agent's decisions.
Our work also relates to the field of procedural content generation (PCG).
This article presented ACCEL, a new method for Unsupervised Environment Design (UED) that evolves a curriculum by editing previously discovered high-regret levels. Editing high-regret levels leads to a wide variety of levels at the frontier of the agent's capabilities, producing curricula that start simple and quickly compound in complexity. Thus, ACCEL provides a principled regret-based curriculum that does not require domain-specific heuristics, alongside an evolutionary process that produces a broad spectrum of complexity matched to the agent's current capabilities. In our experiments, we showed that ACCEL is capable of producing robust agents in several challenging design spaces, where ACCEL agents outperform the best-performing baselines.
We are excited by the many possibilities for extending how ACCEL edits and maintains its population of high-regret levels. The editing mechanism could encompass a wide variety of approaches, such as Neural Cellular Automata.
We thank Kenneth Stanley, Alessandro Lazaric, Ian Fox, and Iryna Korshunova for useful discussions and feedback on this work. We are also grateful to the anonymous reviewers for their recommendations, which helped improve the paper.
For attribution in academic contexts, please cite this work as
Parker-Holder et al., "Evolving Curricula with Regret-Based Environment Design", 2022.
BibTeX citation
@article{parkerholder2022evolving,
  title={Evolving Curricula with Regret-Based Environment Design},
  author={Parker-Holder, Jack and Jiang, Minqi and Dennis, Michael D and Samvelyan, Mikayel and Foerster, Jakob Nicolaus and Grefenstette, Edward and Rockt{\"a}schel, Tim},
  journal={arXiv preprint arXiv:2203.01302},
  year={2022}
}