# Glen Berseth

I am an assistant professor at the University de Montreal and Mila. My research explores how to use deep learning and reinforcement learning to develop generalist robots.

# Publication Articles

#### XT2: Training an X-to-Text Typing Interface with Online Learning from Implicit Feedback

Jensen Gao, Siddharth Reddy, Glen Berseth, Anca Dragan, Sergey Levine

We aim to help users communicate their intent to machines using flexible, adaptive interfaces that translate arbitrary user input into desired actions. In this work, we focus on assistive typing applications in which a user cannot operate a keyboard, but can instead supply other inputs, such as webcam images that capture eye gaze. Standard methods train a model on a fixed dataset of user inputs, then deploy a static interface that does not learn from its mistakes; in part, because extracting an error signal from user behavior can be challenging. We investigate a simple idea that would enable such interfaces to improve over time, with minimal additional effort from the user: online learning from user feedback on the accuracy of the interface's actions. In the typing domain, we leverage backspaces as feedback that the interface did not perform the desired action. We propose an algorithm called x-to-text (X2T) that trains a predictive model of this feedback signal, and uses this model to fine-tune any existing, default interface for translating user input into actions that select words or characters. We evaluate X2T through a small-scale online user study with 12 participants who type sentences by gazing at their desired words, and a large-scale observational study on handwriting samples from 60 users. The results show that X2T learns to outperform a non-adaptive default interface, stimulates user co-adaptation to the interface, personalizes the interface to individual users, and can leverage offline data collected from the default interface to improve its initial performance and accelerate online learning.

#### DisCo RL: Distribution-Conditioned Reinforcement Learning for General-Purpose Policies

Soroush Nasiriany, Vitchyr Pong, Ashvin Nair, Khazatsky Alexander, Glen Berseth, Sergey Levine

Can we use reinforcement learning to learn general-purpose policies that can perform a wide range of different tasks, resulting in flexible and reusable skills? Contextual policies provide this capability in principle, but the representation of the context determines the degree of generalization and expressivity. Categorical contexts preclude generalization to entirely new tasks. Goal-conditioned policies may enable some generalization, but cannot capture all tasks that might be desired. In this paper, we propose goal distributions as a general and broadly applicable task representation suitable for contextual policies. Goal distributions are general in the sense that they can represent any state-based reward function when equipped with an appropriate distribution class, while the particular choice of distribution class allows us to trade off expressivity and learnability. We develop an off-policy algorithm called distribution-conditioned reinforcement learnin (DisCo) to efficiently learn these policies. We evaluate DisCo on a variety of robot manipulation tasks and find that it significantly outperforms prior methods on tasks that require generalization to new goal distributions.

#### Deep Integration of Physical Humanoid Control and Crowd Navigation

Brandon Haworth, Glen Berseth, Seonghyeon Moon, Petros Faloutsos, Mubbasir Kapadia

Many multi-agent navigation approaches make use of simplified representations such as a disk. These simplifications allow for fast simulation of thousands of agents but limit the simulation accuracy and fidelity. In this paper, we propose a fully integrated physical character control and multi-agent navigation method. In place of sample complex online planning methods, we extend the use of recent deep reinforcement learning techniques. This extension improves on multi-agent navigation models and simulated humanoids by combining Multi-Agent and Hierarchical Reinforcement Learning. We train a single short term goal-conditioned low-level policy to provide directed walking behaviour. This task-agnostic controller can be shared by higher-level policies that perform longer-term planning. The proposed approach produces reciprocal collision avoidance, robust navigation, and emergent crowd behaviours. Furthermore, it offers several key affordances not previously possible in multi-agent navigation including tunable character morphology and physically accurate interactions with agents and the environment. Our results show that the proposed method outperforms prior methods across environments and tasks, as well as, performing well in terms of zero-shot generalization over different numbers of agents and computation time.

#### Ecological Reinforcement Learning

John D. Co-Reyes, Suvansh Sanjeev, Glen Berseth, Abhishek Gupta, Sergey Levine

Reinforcement learning is normally studied in the episodic setting where the agent is reset each episode. This makes learning easier but in the real world, we would like our agent to continually learn with minimal human supervision and without having to manually reset the agent each time it makes a mistake. Reset-free or non-episodic learning is difficult, especially with sparse reward where the agent may never experience any rewarding states and not make any progress. Without any algorithmic changes however, certain properties of the environment can make learning without resets and with sparse reward more tractable. We investigate and analyze these properties: environment shaping and environment dynamism.

#### Inter-Level Cooperation in Hierarchical Reinforcement Learning

Abdul Rahman Kreidieh, Glen Berseth, Brandon Trabucco, Samyak Parajuli, Sergey Levine, and Alexandre M. Bayen

Hierarchical models for deep reinforcement learning (RL) have emerged as powerful methods for generating meaningful control strategies in difficult long time horizon tasks. Training of said hierarchical models, however, continue to suffer from instabilities that limit their applicability. In this paper, we address instabilities that arise from the concurrent optimization of goal-assignment and goal-achievement policies. Drawing connections between this concurrent optimization scheme and communication and cooperation in multi-agent RL, we redefine the standard optimization procedure to explicitly promote cooperation between these disparate tasks. Our method is demonstrated to achieve superior results to existing techniques in a set of difficult long time horizon tasks, and serves to expand the scope of solvable tasks by hierarchical reinforcement learning.

#### Morphology-Agnostic Visual Robotic Control

Brian Yang, Dinesh Jayaraman, Glen Berseth, Alexei Efros, Sergey Levine

Existing approaches for visuomotor robotic control typically require characterizing the robot in advance by calibrating the camera or performing system identification. We propose MAVRIC, an approach that works with minimal prior knowledge of the robot's morphology, and requires only a camera view containing the robot and its environment and an unknown control interface. MAVRIC revolves around a mutual information-based method for self-recognition, which discovers visual "control points" on the robot body within a few seconds of exploratory interaction, and these control points in turn are then used for visual servoing. MAVRIC can control robots with imprecise actuation, no proprioceptive feedback, unknown morphologies including novel tools, unknown camera poses, and even unsteady handheld cameras. We demonstrate our method on visually-guided 3D point reaching, trajectory following, and robot-to-robot imitation.

#### SMiRL: Surprise Minimizing RL in Unstable Environments

Glen Berseth, Daniel Geng, Coline Devin, Nicholas Rhinehart, Chelsea Finn, Dinesh Jayaraman, Sergey Levine

All living organisms carve out environmental niches within which they can maintain relative predictability amidst the ever-increasing entropy around them [schneider1994, friston2009]. Humans, for example, go to great lengths to shield themselves from surprise --- we band together in millions to build cities with homes, supplying water, food, gas, and electricity to control the deterioration of our bodies and living spaces amidst heat and cold, wind and storm. The need to discover and maintain such surprise-free equilibria has driven great resourcefulness and skill in organisms across very diverse natural habitats. Motivated by this, we ask: could the motive of preserving order amidst chaos guide the automatic acquisition of useful behaviors in artificial agents?

#### Contextual Imagined Goals for Self-Supervised Robotic Learning

Ashvin Nair*, Shikhar Bahl*, Alexander Khazatsky*, Vitchyr Pong, Glen Berseth, Sergey Levine

While reinforcement learning provides an appealing formalism for learning individual skills, a general-purpose robotic system must be able to master an extensive repertoire of behaviors. Instead of learning a large collection of skills individually, can we instead enable a robot to propose and practice its own behaviors automatically, learning about the affordances and behaviors that it can perform in its environment, such that it can then repurpose this knowledge once a new task is commanded by the user? In this paper, we study this question in the context of self-supervised goal-conditioned reinforcement learning. A central challenge in this learning regime is the problem of goal setting: in order to practice useful skills, the robot must be able to autonomously set goals that are feasible but diverse. When the robot's environment and available objects vary, as they do in most open-world settings, the robot must propose to itself only those goals that it can accomplish in its present setting with the objects that are at hand. Previous work only studies self-supervised goal-conditioned RL in a single-environment setting, where goal proposals come from the robot's past experience or a generative model are sufficient. In more diverse settings, this frequently leads to impossible goals and, as we show experimentally, prevents effective learning. We propose a conditional goal-setting model that aims to propose goals that are feasible from the robot's current state. We demonstrate that this enables self-supervised goal-conditioned off-policy learning with raw image observations in the real world, enabling a robot to manipulate a variety of objects and generalize to new objects that were not seen during training.

#### Interactive Architectural Design with Diverse Solution Exploration

Glen Berseth, Brandon Haworth, Muhammad Usman, Davide Schaumann, Mahyar Khayatkhoei, Mubbasir Turab Kapadia, Petros Faloutsos

In architectural design, architects explore a vast amount of design options to maximize various performance criteria, while adhering to specific constraints. In an effort to assist architects in such a complex endeavour, we propose IDOME, an interactive system for computer-aided design optimization. Our approach balances automation and control by efficiently exploring, analyzing, and filtering space layouts to inform architects' decision-making better. At each design iteration, IDOME provides a set of alternative building layouts which satisfy user-defined constraints and optimality criteria concerning a user-defined space parametrization. When the user selects a design generated by IDOME, the system performs a similar optimization process with the same (or different) parameters and objectives. A user may iterate this exploration process as many times as needed. In this work, we focus on optimizing built environments using architectural metrics by improving the degree of visibility, accessibility, and information gaining for navigating a proposed space. This approach, however, can be extended to support other kinds of analysis as well. We demonstrate the capabilities of IDOME through a series of examples, performance analysis, user studies, and a usability test. The results indicate that IDOME successfully optimizes the proposed designs concerning the chosen metrics and offers a satisfactory experience for users with minimal training.

#### Gamification of Crowd-Driven Environment Design

Michael Brandon Haworth, Muhammad Usman, Davide Schaumann, Nilay Chakraborty, Glen Berseth, Petros Faloutsos, Mubbasir Kapadia

This paper describes using human creativity within a gamified collaborative design framework to address the complexity of predictive environment design. This framework is predicated on gamifying crowd objectives and presenting environment design problems as puzzles. A usability study reveals that the framework is considered usable for the task. Participants were asked to configure an environment puzzle to reduce an important crowd metric, the total egress time. The design task was constructed to be straightforward and uses a simplified environment as a probe for understanding the utility of gamification and the performance of collaboration. Single-player and multiplayer designs outperformed both optimization and expert-sourced designs of the same environment and multiplayer designs further outperformed the single-player designs. Single-player and multiplayer iterations followed linear and exponential decrease trends in total egress time respectively. Our experiments provide strong evidence towards an interesting novel approach of crowdsourcing collaborative environment design.

#### Visual Imitation with Reinforcement Learning using Recurrent Siamese Networks

Glen Berseth, Christopher Pal

Imitation learning, the ability to reproduce some behaviour, is a challenging and vital problem. It is what enables animals with the ability to understand and mimic from observation. Many SoTA methods for imitation accomplish this via additional data that is often not available in the real world. For example, along with an expert's joint positions, the torques used by the expert are available as well. In this work, we describe a learning system that allows an agent to reproduce imitative behaviour of 3D simulated robots from video. This progress will enable us to create robots that can learn behaviour from observing humans, and allow humans to instruct robots in a very natural form of instruction.

#### Feedback Control for Cassie with Deep Reinforcement Learning

Zhaoming Xie, Glen Berseth, Patrick Clary, Jonathan Hurst, Michiel van de Panne

Deep reinforcement learning has achieved great strides in solving challenging motion control tasks. Recently, there has been significant work on methods for exploiting the data gathered during training, but there has been less work on how to best generate the data to learn from. For continuous action domains, the most common method for generating exploratory actions involves sampling from a Gaussian distribution centred around the mean action output by a policy. Although these methods can be quite capable, they do not scale well with the dimensionality of the action space, and can be dangerous to apply on hardware. We consider learning a forward dynamics model to predict the result, $$(x_{t+1})$$, of taking a particular action, $$(u_{t})$$, given a specific observation of the state, $$(x_{t})$$. With this model we perform internal look-ahead predictions of outcomes and seek actions we believe have a reasonable chance of success. This method alters the exploratory action space, thereby increasing learning speed and enables higher quality solutions to difficult problems, such as robotic locomotion and juggling

#### Model-Based Action Exploration for Learning Dynamic Motion Skills

Glen Berseth, Alex Kyriazis, Ivan Zinin, William Choi, Michiel van de Panne

Deep reinforcement learning has achieved great strides in solving challenging motion control tasks. Recently, there has been significant work on methods for exploiting the data gathered during training, but there has been less work on how to best generate the data to learn from. For continuous action domains, the most common method for generating exploratory actions involves sampling from a Gaussian distribution centred around the mean action output by a policy. Although these methods can be quite capable, they do not scale well with the dimensionality of the action space, and can be dangerous to apply on hardware. We consider learning a forward dynamics model to predict the result, $$(x_{t+1})$$, of taking a particular action, $$(u_{t})$$, given a specific observation of the state, $$(x_{t})$$. With this model we perform internal look-ahead predictions of outcomes and seek actions we believe have a reasonable chance of success. This method alters the exploratory action space, thereby increasing learning speed and enables higher quality solutions to difficult problems, such as robotic locomotion and juggling

#### TerrainRL Sim

Glen Berseth, Xue Bin Peng, Michiel van de Panne

We provide 88 challenging simulation environments that range in difficulty. The difficulty in these environments is linked not only to the number of dimensions in the action space but also to the task complexity. Using more complex and accurate simulations will help push the field closer to creating human-level intelligence. Therefore, we are releasing a number of simulation environments that include local egocentric visual perception. These environments include randomly generated terrain which the agent needs to learn to interpret via visual features. The library also provides simple mechanisms to create new environments with different agent morphologies and the option to modify the distribution of generated terrain.

#### Progressive Reinforcement Learning with Distillation for Multi-Skilled Motion Control

Glen Berseth, Cheng Xie, Paul Cernek, Michiel van de Panne

Deep reinforcement learning has demonstrated increasing capabilities for continuous control problems, including agents that can move with skill and agility through their environment. An open problem in this setting is that of developing good strategies for integrating or merging policies for multiple skills, where each individual skill is a specialist in a specific skill and its associated state distribution. We extend policy distillation methods to the continuous action setting and leverage this technique to combine expert policies, as evaluated in the domain of simulated bipedal locomotion across different classes of terrain. We also introduce an input injection method for augmenting an existing policy network to exploit new input features. Lastly, our method uses transfer learning to assist in the efficient acquisition of new skills. The combination of these methods allows a policy to be incrementally augmented with new skills. We compare our progressive learning and integration via distillation (PLAID) method against three alternative baselines.

#### Evaluating and Optimizing Evacuation Plans for Crowd Egress

Vincius J Cassol, Estêvão Smania Testa, Cláudio Rosito Jung, Muhammad Usman, Petros Faloutsos, Glen Berseth, Mubbasir Kapadia, Norman I Badler, Soraia Raupp Musse

Evacuation planning is an important and difficult task in building design. The proposed framework can identify optimal evacuation plans using decision points, which control the ratio of agents that select a particular route at a specific spatial location. The authors optimize these ratios to achieve the best evacuation based on a quantitatively validated metric for evacuation performance. This metric captures many of the important aspects of an evacuation: total evacuation time, average evacuation time, agent speed, and local agent density. The proposed approach was validated using a night club model that incorporates real data from an actual evacuation.

#### DeepLoco: Dynamic Locomotion Skills Using Hierarchical Deep Reinforcement Learning

Xue Bin Peng, Glen Berseth, KangKang Yin, Michiel van de Panne

Learning physics-based locomotion skills is a difficult problem, leading to solutions that typically exploit prior knowledge of various forms. In this paper, we aim to learn a variety of environment-aware locomotion skills with a limited amount of prior knowledge. We adopt a two-level hierarchical control framework. First, low-level controllers are learned that operate at a fine timescale and which achieve robust walking gaits that satisfy stepping-target and style objectives. Second, high-level controllers are then learned which plan at the timescale of steps by invoking desired step targets for the low-level controller. The high-level controller makes decisions directly based on high-dimensional inputs, including terrain maps or other suitable representations of the surroundings. Both levels of the control policy are trained using deep reinforcement learning. Results are demonstrated on a simulated 3D biped. Low-level controllers are learned for a variety of motion styles and demonstrate robustness with respect to force-based disturbances, terrain variations, and style interpolation. High-level controllers are demonstrated that are capable of following trails through terrains, dribbling a soccer ball towards a target location, and navigating through static or dynamic obstacles.

#### Towards Computer Assisted Crowd Aware Architectural Design

Brandon Haworth, Muhammad Usman, Glen Berseth, Mahyar Khayatkhoei, Mubbasir Turab Kapadia, Petros Faloutsos

We present a preliminary exploration of an architectural optimization process towards a computational tool for designing environments (e.g., building floor plans). Using dynamic crowd simulators we derive the fitness of architectural layouts. The results of the simulation are used to provide feedback to a user in terms of crowd animation, aggregate statistics, and heat maps. Our approach automatically optimizes the placement of environment elements to maximize the flow of the crowd, while satisfying constraints that are imposed by the user (e.g., immovable walls or support bearing structures). We take steps towards user-in-the-loop optimization and design of an environment by applying an adaptive refinement approach to reduce the search space of the optimization. We perform a small scale user study to obtain early feedback on the performance and quality of our method in contrast with a manual approach.

#### Using synthetic crowds to inform building pillar placements

Brandon Haworth, Muhammad Usman, Glen Berseth, Mahyar Khayatkhoei, Mubbasir Turab Kapadia, Petros Faloutsos

We present a preliminary exploration of synthetic crowds towards computational tools for informing the design of environments (e.g., building floor plans). Feedback and automatic design processes are developed from exploring crowd behaviours and metrics derived from simulations of environments in density stressed scenarios, such as evacuations. Computational approaches for crowd analysis and environment design benefit from measures characterizing the relationships between environments and crowd flow behaviours. We investigate the optimization of environment elements to maximize crowd flow, under a range of LoS conditions, a standard indicator for characterizing the service afforded by environments to crowds of specific densities widely used in crowd management and urban design. The steering algorithm, the number of optimized environment elements, the scenario configuration and the LoS conditions affect the optimal configuration of environment elements. From the insights gained exploring optimizations under LoS conditions, we take steps towards user-in-the-loop optimization and design of an environment by applying an adaptive refinement approach to reduce the search space of the optimization. We derive the fitness of architectural layouts from background simulations. We perform a ground truth study to gauge the performance and quality of our method.

#### Dynamic terrain traversal skills using reinforcement learning

Xue Bin Peng, Glen Berseth, Michiel van de Panne

The locomotion skills developed for physics-based characters most often target flat terrain. However, much of their potential lies with the creation of dynamic, momentum-based motions across more complex terrains. In this paper, we learn controllers that allow simulated characters to traverse terrains with gaps, steps, and walls using highly dynamic gaits. This is achieved using reinforcement learning, with careful attention given to the action representation, non-parametric approximation of both the value function and the policy; epsilon-greedy exploration; and the learning of a good state distance metric. The methods enable a 21-link planar dog and a 7-link planar biped to navigate challenging sequences of terrain using bounding and running gaits. We evaluate the impact of the key features of our skill learning pipeline on the resulting performance.

#### Modelling Dynamic Brachiation

Glen Berseth, Michiel van de Panne

Significant progress has been made with regard to motions such as walking, running, and other specific motions, such as falling and rolling. However, we still have difficulty simulating agile motions we see in nature, for example, brachiation by gibbons. Gibbons are one of the most agile primates and can leap remarkable distances. In this work we discuss the advantages of skill learning with explicit planning to create motion controllers for more complex and dynamic navigation tasks. Skill learning is complex and cannot be directly solved using only supervised learning because generating good data plays a key role in learning good skills. Here we construct a FSM controller to model the motion and capabilities of a gibbon, one of the most agile primates, shown in Figure 1. We endeavour to give this controller motion skills using reinforcement learning and use this dynamics model to intelligently sample good actions.

#### Terrain Adaptive Locomotion Skills using Deep Reinforcement Learning

Xue Bin Peng, Glen Berseth, Michiel van de Panne

Reinforcement learning offers a promising methodology for developing skills for simulated characters, but typically requires working with sparse hand-crafted features. Building on recent progress in deep reinforcement learning (DeepRL), we introduce a mixture of actor-critic experts (MACE) approach that learns terrain-adaptive dynamic locomotion skills using high-dimensional state and terrain descriptions as input, and parameterized leaps or steps as output actions. MACE learns more quickly than a single actor-critic approach and results in actor-critic experts that exhibit specialization. Additional elements of our solution that contribute towards efficient learning include Boltzmann exploration and the use of initial actor biases to encourage specialization. Results are demonstrated for multiple planar characters and terrain classes.

#### ACCLMesh: Curvature-Based Navigation Mesh Generation

Glen Berseth, Mubbasir Kapadia, Petros Faloutsos

The proposed method robustly and efficiently computes a navigation mesh for arbitrary and dynamic 3D environments based on curvature. This method addresses a number of known limitations in state-of-the-art techniques to produce navigation meshes that are tightly coupled to the original geometry, incorporate geometric details that are crucial for movement decisions and can robustly handle complex surfaces. The method is integrated into a standard navigation and collision-avoidance system to simulate thousands of agents on complex 3D surfaces in real-time.

#### Evaluating and Optimizing Level of Service for Crowd Evacuations

Computational approaches for crowd analysis and environment design need robust measures for characterizing the relation between environments and crowd flow. Level of service (Level of Service) is a standard indicator for characterizing the service afforded by environments to crowds of specific densities, and is widely used in crowd management and urban design. However, current Level of Service indicators are qualitative and rely on expert analysis. In this paper, we perform a systematic analysis of Level of Service for synthetic crowds. The flow-density relationships in crowd evacuation scenarios are explored with respect to three state-of-the-art steering algorithms. Our results reveal that Level of Service is sensitive to the crowd behavior, and the configuration of the environment benchmark. Following this study, we automatically optimize environment elements to maximize crowd flow, under a range of Level of Service conditions. The steering algorithm, the number of optimized environment elements, the scenario configuration and the Level of Service conditions affect the optimal configuration of environment elements. We observe that the critical density of crowd simulators increases due to the optimal placement of pillars, thereby effectively increasing the Level of Service of environments at higher crowd densities. Depending on the simulation technique and environment benchmark, pillars are configured to produce lanes or form wall-like structures, in an effort to maximize crowd flow. These experiments serve as an important precursor to computational crowd optimization and management and motivate the need for further study using additional real and synthetic crowd datasets across a larger representation of environment benchmarks.

#### Characterizing and Optimizing Game Level Difficulty

Glen Berseth

In this work we parameterized the configuration of a game level. With the parameterization we optimized aspects of the game level to change a players expected difficulty. Given specific constrains on the configuration of the game level we can produce game levels with a varying degree of difficulty.

#### Environment Optimization for Crowd Evacuation

Glen Berseth

In this work we parameterize subspaces of the full environment configuration space and optimize parts of these subspaces for an agent flow metric. We focus on building configuration and optimize the possible placement of support pillars or obstacles in the environment.

#### Robust Space-time Footsteps for Agent-based Steering

Glen Berseth
 Randomized, continuous footstep action space. Geometric validation to ensure valid configuration.

With this work we build upon prevous work to inductavely design a robust footsteps-based steering algorithm. The steering algorithm uses an A* algorithm to generate least cost footstep plans between goals while avoiding dynamic collisions.

#### SteerFit

Glen Berseth, Brandon Haworth, Mubbasir Turab Kapadia, Petros Faloutsos

In the context of crowd simulation, there is a diverse set of algorithms that model steering. The performance of steering approaches, both in terms of quality of results and computational efficiency, depends on internal parameters that are manually tuned to satisfy application-specific requirements. This paper investigates the effect that these parameters have on an algorithm's performance. Using three representative steering algorithms and a set of established performance criteria, we perform a number of large scale optimization experiments that optimize an algorithm's parameters for a range of objectives.

#### SteerPlex

Glen Berseth

Levels from Baulder's Gate

SteerPlex works to define a salient set of scenario features that can be used to identify the relative complexity of a steering scenario.

#### Unreal IndiGolog Bot Project

Glen B

This is my project on developing strategic Unreal Tournament bots in the Unreal Development Kit. They were designed to display team work, using well defined team roles. These roles established seniority in a sense where some bot on the team could influence others and with one bot taking command of …