Agent57: Outperforming the Human Atari Benchmark

Short-term memory

Agents need memory in order to take previous observations into account in their decision making. This allows the agent to base its decisions not only on the present observation (which is usually partial, that is, an agent only sees some of its world), but also on past observations, which can reveal more information about the environment as a whole. Imagine, for example, a task where an agent goes from room to room in order to count the number of chairs in a building. Without memory, the agent can only rely on the observation of one room. With memory, the agent can remember the number of chairs in previous rooms and simply add the number of chairs it observes in the present room to solve the task. The role of memory is therefore to aggregate information from past observations to improve the decision-making process. In deep RL and deep learning, recurrent neural networks such as Long Short-Term Memory (LSTM) networks are used as short-term memory.
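As a rough illustration (not the paper's implementation), the sketch below shows how a recurrent Q-network threads an LSTM hidden state from one timestep to the next, so that its action-value estimates depend on the history of partial observations rather than on the current one alone; the network structure and dimensions are hypothetical.

```python
# A minimal sketch of short-term memory in a value network: the LSTM hidden
# state carries information from past observations into the current decision.
import torch
import torch.nn as nn

class RecurrentQNetwork(nn.Module):
    def __init__(self, obs_dim, num_actions, hidden_size=64):
        super().__init__()
        self.encoder = nn.Linear(obs_dim, hidden_size)
        self.lstm = nn.LSTM(hidden_size, hidden_size, batch_first=True)
        self.q_head = nn.Linear(hidden_size, num_actions)

    def forward(self, obs_seq, hidden=None):
        # obs_seq: (batch, time, obs_dim); hidden is the short-term memory
        x = torch.relu(self.encoder(obs_seq))
        x, hidden = self.lstm(x, hidden)
        return self.q_head(x), hidden

# Usage: the hidden state is threaded from step to step, aggregating history.
net = RecurrentQNetwork(obs_dim=8, num_actions=4)
hidden = None
for t in range(5):
    obs = torch.randn(1, 1, 8)            # one partial observation per step
    q_values, hidden = net(obs, hidden)   # decision uses current obs + memory
```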

Interfacing memory with behavior is crucial for building systems that self-learn. In reinforcement learning, an agent can be an on-policy learner, which can only learn the value of its direct actions, or an off-policy learner, which can learn about optimal actions even when not performing those actions – e.g., it might be taking random actions, but can still learn what the best possible action would be. Off-policy learning is therefore a desirable property for agents, helping them learn the best course of action to take while thoroughly exploring their environment. Combining off-policy learning with memory is challenging because you need to know what you might remember when executing a different behavior. For example, what you might choose to remember when looking for an apple (e.g., where the apple is located) is different from what you might choose to remember if looking for an orange. But if you were looking for an orange, you could still learn how to find the apple if you came across it by chance, in case you need to find it in the future. The first deep RL agent combining memory and off-policy learning was the Deep Recurrent Q-Network (DRQN). More recently, a significant speciation in the lineage of Agent57 occurred with Recurrent Replay Distributed DQN (R2D2), which combines a neural network model of short-term memory with off-policy learning and distributed training, achieving a very strong average performance on Atari57. R2D2 modifies the replay mechanism for learning from past experiences to work with short-term memory. All together, this helped R2D2 efficiently learn profitable behaviors and exploit them for reward.
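To make the on-policy/off-policy distinction concrete, here is a minimal tabular sketch (illustrative only, not R2D2 itself): the behavior policy may act randomly, but the Q-learning target bootstraps from the greedy action, so the agent learns about the best action even when it did not take it.

```python
# Minimal off-policy (Q-learning) update: the target uses the best next action
# (max over Q), not the action the behavior policy actually took.
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    target = r + gamma * np.max(Q[s_next])     # learn about the greedy action
    Q[s, a] += alpha * (target - Q[s, a])      # regardless of how we behaved

# Usage with a toy table of 5 states and 3 actions (values are illustrative).
Q = np.zeros((5, 3))
q_learning_update(Q, s=0, a=2, r=1.0, s_next=1)
```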

Episodic memory

We designed Never Give Up (NGU) to augment R2D2 with another form of memory: episodic memory. This enables NGU to detect when new parts of a game are encountered, so the agent can explore these newer parts of the game in case they yield rewards. This makes the agent’s behavior (exploration) deviate significantly from the policy the agent is trying to learn (obtaining a high score in the game); thus, off-policy learning again plays a critical role here. NGU was the first agent to obtain positive rewards, without domain knowledge, on Pitfall – a game on which no agent had scored any points since the introduction of the Atari benchmark – and on other challenging Atari games. Unfortunately, NGU sacrifices performance on what have historically been the “easier” games and so, on average, underperforms relative to R2D2.
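The sketch below illustrates the general idea of an episodic novelty bonus (a simplified stand-in, not NGU’s exact formulation): the current state embedding is compared against the states already stored in a memory that is cleared every episode, so revisiting familiar states within an episode earns a smaller bonus.

```python
# Illustrative episodic novelty bonus: states close to their nearest neighbours
# already stored this episode are familiar and earn a small intrinsic reward.
# Names, the distance kernel and constants here are hypothetical.
import numpy as np

def episodic_novelty_bonus(embedding, episodic_memory, k=10, eps=1e-3):
    if not episodic_memory:
        return 1.0                              # first state of the episode: novel
    dists = np.sort([np.sum((embedding - m) ** 2) for m in episodic_memory])[:k]
    similarity = np.sum(eps / (dists + eps))    # near neighbours -> high similarity
    return 1.0 / np.sqrt(similarity + 1.0)      # high similarity -> small bonus

# The memory is reset at the start of each episode, so the bonus encourages
# visiting many distinct states within a single episode.
episodic_memory = []
for embedding in np.random.randn(20, 8):        # stand-in for learned embeddings
    bonus = episodic_novelty_bonus(embedding, episodic_memory)
    episodic_memory.append(embedding)
```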

Intrinsic motivation methods to encourage directed exploration

In order to discover the most successful strategies, agents must explore their environment – but some exploration strategies are more efficient than others. With DQN, researchers attempted to address the exploration problem by using an undirected exploration strategy known as epsilon-greedy: with a fixed probability (epsilon), take a random action, otherwise pick the current best action. However, this family of techniques does not scale well to hard exploration problems: in the absence of rewards, they require a prohibitive amount of time to explore large state-action spaces, as they rely on undirected random action choices to discover unseen states. To overcome this limitation, many directed exploration strategies have been proposed. Among these, one strand has focused on developing intrinsic motivation rewards that encourage an agent to explore and visit as many states as possible by providing more dense “internal” rewards for novelty-seeking behaviors. Within that strand, we distinguish two types of rewards: firstly, long-term novelty rewards encourage visiting many states throughout training, across many episodes; secondly, short-term novelty rewards encourage visiting many states over a short span of time (e.g., within a single episode of a game).
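For reference, a minimal sketch of the epsilon-greedy rule described above (parameter values are illustrative):

```python
# Undirected exploration: with probability epsilon take a random action,
# otherwise take the action with the highest current value estimate.
import numpy as np

def epsilon_greedy_action(q_values, epsilon, rng):
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))   # explore: uniformly random action
    return int(np.argmax(q_values))               # exploit: current best estimate

rng = np.random.default_rng(0)
action = epsilon_greedy_action(np.array([0.1, 0.5, 0.2]), epsilon=0.1, rng=rng)
```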

Seeking novelty over long time scales

Long-term novelty rewards signal when a previously unseen state is encountered in the agent’s lifetime, and are a function of the density of states seen so far in training: that is, they are adjusted by how often the agent has seen a state similar to the current one relative to states seen overall. When the density is high (indicating that the state is familiar), the long-term novelty reward is low, and vice versa. When all the states are familiar, the agent resorts to an undirected exploration strategy. However, learning density models of high-dimensional spaces is fraught with problems due to the curse of dimensionality. In practice, when agents use deep learning models to learn a density model, they suffer from catastrophic forgetting (forgetting information seen previously as they encounter new experiences), as well as an inability to produce precise outputs for all inputs. For example, in Montezuma’s Revenge, unlike undirected exploration strategies, long-term novelty rewards allow the agent to surpass the human baseline. However, even the best performing methods on Montezuma’s Revenge need to carefully train a density model at the right speed: when the density model indicates that the states in the first room are familiar, the agent should be able to consistently get to unfamiliar territory.
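As a toy illustration of the density idea (real agents learn a density or prediction-error model over high-dimensional observations rather than counting discrete states), the sketch below uses visit counts as a crude stand-in: the more often a state has been seen over the agent’s whole lifetime of training, the smaller its long-term novelty bonus.

```python
# Count-based stand-in for a learned density model: frequently visited states
# get a small long-term novelty bonus, rarely visited states get a large one.
# Unlike the episodic memory above, these counts persist across episodes.
from collections import defaultdict
import numpy as np

class LongTermNoveltyBonus:
    def __init__(self):
        self.counts = defaultdict(int)      # lifetime visit counts per state key

    def bonus(self, state_key):
        self.counts[state_key] += 1
        return 1.0 / np.sqrt(self.counts[state_key])

# state_key is assumed to be some hashable discretization of the observation.
novelty = LongTermNoveltyBonus()
print(novelty.bonus("room_1"), novelty.bonus("room_1"), novelty.bonus("room_2"))
```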