## Installing gym

In this course, we will mostly address RL environments available in the OpenAI Gym framework:

https://gym.openai.com

It provides a multitude of RL problems, from simple text-based problems with a few dozen to a few hundred states (Gridworld, Taxi) to continuous control problems (Cartpole, Pendulum), Atari games (Breakout, Space Invaders) and complex robotics simulators (Mujoco):

https://gym.openai.com/envs

However, gym has not been maintained by OpenAI since September 2022. We will instead use the gymnasium library, a fork maintained by the Farama Foundation, which continues to maintain and improve it.

https://gymnasium.farama.org/

You can install gymnasium and its dependencies using:

pip install -U gymnasium pygame moviepy
pip install gymnasium[classic_control]
pip install gymnasium[box2d]

For this exercise and the following ones, we will focus on simple environments whose installation is straightforward: toy text, classic control and box2d. More complex environments based on Atari games or the Mujoco physics simulator are described in the last (optional) section of this notebook, as they require additional dependencies.

On Colab, gym cannot open graphical windows to visualize the environments, as this is not possible in the browser. We will see a workaround that produces videos instead. Running the following cell in Colab should allow you to run the simplest environments:

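try:
    # The google.colab module can only be imported when running on Colab
    import google.colab
    IN_COLAB = True
except ImportError:
    IN_COLAB = False

if IN_COLAB:
    !pip install -U gymnasium pygame moviepy
    !pip install gymnasium[box2d]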
import numpy as np
import matplotlib.pyplot as plt
import os

import gymnasium as gym
print("gym version:", gym.__version__)

from moviepy.editor import ImageSequenceClip, ipython_display

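class GymRecorder(object):
    """
    Simple wrapper over moviepy to generate a .gif with the frames of a gym environment.

    The environment must have the render_mode rgb_array_list.
    """
    def __init__(self, env):
        self.env = env
        self._frames = []

    def record(self, frames):
        "To be called at the end of an episode."
        for frame in frames:
            self._frames.append(np.array(frame))

    def make_video(self, filename):
        "Generates the gif video."
        # Create the target directory if it does not exist yet
        directory = os.path.dirname(os.path.abspath(filename))
        os.makedirs(directory, exist_ok=True)
        # Assemble the recorded frames into a clip and write it as a gif.
        # The frame rate is read from the environment's metadata
        # (assumes the environment defines "render_fps").
        fps = self.env.metadata["render_fps"]
        clip = ImageSequenceClip(list(self._frames), fps=fps)
        clip.write_gif(filename, fps=fps, loop=0)
        # Empty the frame buffer for the next recording
        self._frames = []

gym version: 0.26.3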

## Interacting with an environment

A gym environment is created using:

env = gym.make('CartPole-v1', render_mode="human")

where ‘CartPole-v1’ should be replaced by the environment you want to interact with. The following cell lists the environments available to you (including the different versions).

for env_id in gym.envs.registry.keys():
    print(env_id)
CartPole-v0
CartPole-v1
MountainCar-v0
MountainCarContinuous-v0
Pendulum-v1
Acrobot-v1
LunarLander-v2
LunarLanderContinuous-v2
BipedalWalker-v3
BipedalWalkerHardcore-v3
CarRacing-v2
Blackjack-v1
FrozenLake-v1
FrozenLake8x8-v1
CliffWalking-v0
Taxi-v3
Reacher-v2
Reacher-v4
Pusher-v2
Pusher-v4
InvertedPendulum-v2
InvertedPendulum-v4
InvertedDoublePendulum-v2
InvertedDoublePendulum-v4
HalfCheetah-v2
HalfCheetah-v3
HalfCheetah-v4
Hopper-v2
Hopper-v3
Hopper-v4
Swimmer-v2
Swimmer-v3
Swimmer-v4
Walker2d-v2
Walker2d-v3
Walker2d-v4
Ant-v2
Ant-v3
Ant-v4
Humanoid-v2
Humanoid-v3
Humanoid-v4
HumanoidStandup-v2
HumanoidStandup-v4
GymV26Environment-v0

The render_mode argument defines how you will see the environment:

• None (default): nothing is rendered, so a DRL algorithm can be trained without wasting computational resources on rendering.
• rgb_array_list: returns numpy arrays corresponding to each frame. This will be useful when generating videos.
• ansi: string representation of each state. Only available for the “Toy text” environments.
• human: graphical window displaying the environment live.

The main interest of gym(nasium) is that all problems have a common interface defined by the class gym.Env. There are only three methods that have to be used when interacting with an environment:

• state, info = env.reset() restarts the environment and returns an initial state s_0.

• state, reward, terminal, truncated, info = env.step(action) takes an action a_t and returns:

    • the new state s_{t+1},
    • the reward r_{t+1},
    • two boolean flags indicating whether the current state is terminal (won/lost) or truncated (timeout),
    • a dictionary containing additional info for debugging (you can ignore it most of the time).

• env.render() displays the current state of the MDP. When the render mode is set to rgb_array_list or human, it does not even have to be called explicitly (since gym 0.25).
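
To make the reset() / step() contract concrete, here is a minimal sketch of a hand-written environment that follows the same signatures. The class CorridorEnv and its parameters length and max_steps are hypothetical: they are not part of gymnasium and only illustrate the difference between the terminal and truncated flags.

```python
class CorridorEnv:
    """Hypothetical toy environment (not part of gymnasium).

    The agent starts at the left end of a corridor and must reach the right end.
    """

    def __init__(self, length=5, max_steps=20):
        self.length = length
        self.max_steps = max_steps
        self.state = 0
        self.t = 0

    def reset(self):
        # Restart the episode and return the initial state and an info dict
        self.state = 0
        self.t = 0
        return self.state, {}

    def step(self, action):
        # Action 1 moves right, any other action moves left (clipped to the corridor)
        move = 1 if action == 1 else -1
        self.state = min(max(self.state + move, 0), self.length - 1)
        self.t += 1
        # terminal: the goal state has been reached
        terminal = self.state == self.length - 1
        # truncated: the time limit was hit before reaching the goal
        truncated = (not terminal) and self.t >= self.max_steps
        reward = 1.0 if terminal else 0.0
        return self.state, reward, terminal, truncated, {}


env = CorridorEnv(length=5, max_steps=20)
state, info = env.reset()
done = False
total_reward = 0.0
while not done:
    # Always go right
    state, reward, terminal, truncated, info = env.step(1)
    total_reward += reward
    done = terminal or truncated
print(state, total_reward, terminal, truncated)
```

An agent that always moves left would never reach the goal: the episode would then end with truncated set to True once max_steps is reached.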

With this interface, we can interact with the environment in a standardized way:

• We first create the environment.
• For a fixed number of episodes:
    • We pick an initial state with reset().
    • Until the episode is terminated:
        • We select an action using our RL algorithm or randomly.
        • We take that action (step()), observe the new state and the reward.
        • We go into the new state.

The following cell shows how to interact with the CartPole environment using a random policy. Note that it will only work on your computer, not in colab.

env = gym.make('CartPole-v1', render_mode="human")

for episode in range(10):
    state, info = env.reset()
    done = False
    while not done:
        # Select an action randomly
        action = env.action_space.sample()

        # Sample a single transition
        next_state, reward, terminal, truncated, info = env.step(action)

        # Go in the next state
        state = next_state

        # End of the episode
        done = terminal or truncated

env.close()

On colab (or whenever you want to record videos of the episodes instead of watching them live), you need to create the environment with the rendering mode rgb_array_list.

You then create a GymRecorder object (defined in the first cell of this notebook).

recorder = GymRecorder(env)

At the end of each episode, you tell the recorder to record all frames generated during the episode. The frames returned by env.render() are (height, width, 3) numpy arrays which are accumulated by the environment during the episode and flushed when env.reset() is called.

recorder.record(env.render())

You can then generate a gif at the end of the simulation with:

recorder.make_video('videos/CartPole-v1.gif')

Finally, you can render the gif in the notebook by calling at the very last line of the cell:

ipython_display('videos/CartPole-v1.gif')

env = gym.make('CartPole-v1', render_mode="rgb_array_list")
recorder = GymRecorder(env)

for episode in range(10):
    state, info = env.reset()

    done = False
    while not done:
        # Select an action randomly
        action = env.action_space.sample()

        # Sample a single transition
        next_state, reward, terminal, truncated, info = env.step(action)

        # Go in the next state
        state = next_state

        # End of the episode
        done = terminal or truncated

    # Record at the end of the episode
    recorder.record(env.render())

recorder.make_video('videos/CartPole-v1.gif')
ipython_display('videos/CartPole-v1.gif', autoplay=1, loop=0)
MoviePy - Building file videos/CartPole-v1.gif with imageio.

Each environment defines its state space (env.observation_space) and action space (env.action_space).

State and action spaces can be:

• discrete (gym.spaces.Discrete(nb_states)), with each state being an integer between 0 and nb_states - 1.

• feature-based (gym.spaces.Box(low=0, high=255, shape=(SCREEN_HEIGHT, SCREEN_WIDTH, 3))) for pixel frames.

• continuous. Example for two joints of a robotic arm limited between -180 and 180 degrees:

gym.spaces.Box(-180.0, 180.0, (2, ))

You can sample a state or action randomly from these spaces:

action_space = gym.spaces.Box(-180.0, 180.0, (2, ))
action = action_space.sample()
print(action)
[-113.33049   83.84796]

Sampling the action space is particularly useful for exploration. We use it here to perform random (but valid) actions:

action = env.action_space.sample()

Q: Create a method random_interaction(env, number_episodes, recorder=None) that takes as arguments:

• The environment.
• The number of episodes to be performed.
• An optional GymRecorder object that records the frames of the environment if it is not None (if recorder is not None:). Otherwise, do nothing.

The method should return the list of undiscounted returns (\gamma=1, i.e. just the sum of rewards obtained during each episode) for all episodes.
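
As a reminder of what the return is, the sketch below computes it from a list of rewards. The helper episode_return is a hypothetical name introduced only for illustration. With \gamma=1 it reduces to the plain sum of rewards that the exercise asks for.

```python
def episode_return(rewards, gamma=1.0):
    """Return sum_k gamma^k * r_{k+1} for the rewards of one episode."""
    g = 0.0
    # Accumulate backwards: G_t = r_{t+1} + gamma * G_{t+1}
    for r in reversed(rewards):
        g = r + gamma * g
    return g


rewards = [1.0, 1.0, 1.0]
print(episode_return(rewards))             # undiscounted: plain sum of rewards
print(episode_return(rewards, gamma=0.5))  # discounted: 1 + 0.5 + 0.25
```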

def random_interaction(env, number_episodes, recorder=None):

    returns = []

    # Sample episodes
    for episode in range(number_episodes):

        # Sample the initial state
        state, info = env.reset()

        return_episode = 0.0
        done = False
        while not done:

            # Select an action randomly
            action = env.action_space.sample()

            # Sample a single transition
            next_state, reward, terminal, truncated, info = env.step(action)

            # Update the return
            return_episode += reward

            # Go in the next state
            state = next_state

            # End of the episode
            done = terminal or truncated

        # Record at the end of the episode
        if recorder is not None:
            recorder.record(env.render())

        returns.append(return_episode)

    return returns

Q: Use that method to visualize all the available simple environments for a few episodes:

• CartPole-v1
• MountainCar-v0
• Pendulum-v1
• Acrobot-v1
• LunarLander-v2
• BipedalWalker-v3
• CarRacing-v2
• Blackjack-v1
• FrozenLake-v1
• CliffWalking-v0
• Taxi-v3

If you do many episodes (CarRacing or Taxi have very long episodes with a random policy), plot the obtained returns to see how they vary.

If you managed to install the mujoco and atari dependencies, feel free to visualize them too.

envname = 'CartPole-v1'
env = gym.make(envname, render_mode="rgb_array_list")
recorder = GymRecorder(env)

returns = random_interaction(env, 10, recorder)

plt.figure(figsize=(10, 6))
plt.plot(returns)
plt.xlabel("Episodes")
plt.ylabel("Return")
plt.show()

video = "videos/" + envname + ".gif"
recorder.make_video(video)
ipython_display(video)
MoviePy - Building file videos/CartPole-v1.gif with imageio.