Environment Setup

Install VSCode

Download and install this fantastic editor from https://code.visualstudio.com/

Install Extensions:

  • Python
  • Jupyter

Install Python Environment with uv

We’ll use uv, a modern and fast Python package manager, together with the provided pyproject.toml file to set up the environment.

Install uv

Install uv by running the following command in your terminal:

curl -LsSf https://astral.sh/uv/install.sh | sh

After installation, restart your terminal or run:

source $HOME/.local/bin/env

Create Virtual Environment and Install Dependencies

Download the pyproject.toml file, create a virtual environment, and activate it:

# Download the toml file
wget https://rhit-csse.github.io/CSSE490_DRL/pyproject.toml

# Create the virtual environment (this creates a .venv folder in the current directory)
uv venv

# Activate the virtual environment
source .venv/bin/activate

Install the project dependencies from pyproject.toml:

# Install base dependencies
uv pip install -e .

# Install optional Jupyter dependencies for lab work
uv pip install -e ".[jupyter]"

The pyproject.toml file includes all necessary packages:

  • PyTorch for deep learning
  • Gymnasium for RL environments
  • NumPy, Pandas for data processing
  • Matplotlib, Seaborn for visualization
  • Jupyter for interactive notebooks

Verify your setup by running a quick import check, then the demo code in the next section.
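
A minimal check you can paste into a notebook cell or a Python file (this sketch only assumes the packages listed above were installed):

# Quick sanity check: import the core packages and print their versions
import torch
import gymnasium as gym
import numpy as np
import pandas as pd
import matplotlib
import seaborn as sns

for name, module in [("torch", torch), ("gymnasium", gym), ("numpy", np),
                     ("pandas", pd), ("matplotlib", matplotlib), ("seaborn", sns)]:
    print(f"{name}: {module.__version__}")

# Confirm that a CartPole environment can be created
env = gym.make("CartPole-v1")
obs, info = env.reset()
print("CartPole-v1 reset OK, observation shape:", obs.shape)
env.close()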

Demo: CEM Implementation
# %% [markdown]
# # Lab 1: Cross-Entropy Method (CEM) for CartPole
# 
# ## Introduction
# In this lab, we will implement the Cross-Entropy Method (CEM) algorithm to solve the CartPole-v1 environment. CEM is a simple yet effective policy search method that iteratively improves a policy by sampling from a distribution and selecting elite samples.
# 
# ## Student Information
# **Name:** [Your Name Here]  
# **Date:** [Today's Date]
#
# ---

# %% [markdown]
# ## Part 1: Setup and Environment Testing
# Let's start by importing the necessary libraries and creating the CartPole environment.

# %%
import numpy as np
import gymnasium as gym
import pandas as pd
import matplotlib.pyplot as plt
from IPython import display

# Create environment
env = gym.make("CartPole-v1", render_mode="rgb_array")

total_reward = 0.0
total_steps = 0
obs, info = env.reset()

print(f"Observation space: {env.observation_space}")
print(f"Action space: {env.action_space}")
print(f"Initial observation: {obs}")

# %% [markdown]
# ### Random Agent Test
# Let's test the environment with a random agent to see how it performs.

# %%
while True:
    action = env.action_space.sample()
    obs, reward, done, truncated, info = env.step(action)
    total_reward += reward
    total_steps += 1

    # Render the current frame inline in the notebook
    plt.imshow(env.render())
    display.display(plt.gcf())
    display.clear_output(wait=True)
    plt.close()

    if done or truncated:
        break

print(f"Random agent - Total reward: {total_reward}, Total steps: {total_steps}")


# %% [markdown]
# ---
# ## Part 2: Linear Policy Implementation
# 
# We'll implement a simple linear policy that maps observations to actions using a weight matrix W and bias vector b.

# %%
class LinearPolicy(object):
    """Deterministic linear policy: pick the discrete action with the largest linear score."""
    def __init__(self, theta, ob_space, ac_space):
        assert len(theta) == (ob_space + 1) * ac_space
        # Unpack the flat parameter vector theta into a weight matrix and a bias row
        self.W = theta[0:ob_space*ac_space].reshape(ob_space, ac_space)
        self.b = theta[ob_space*ac_space:].reshape(1, ac_space)

    def act(self, obs):
        # Score each action and act greedily
        y = obs.dot(self.W) + self.b
        a = y.argmax()
        return a


# Test the policy
ob_space = env.observation_space.shape[0]
ac_space = env.action_space.n
n_theta = (ob_space + 1) * ac_space
print(f"Policy parameter size: {n_theta}")

def run_episode(policy, env, num_steps, render=False):
    total_rew = 0
    ob, info = env.reset()
    for t in range(num_steps):
        a = policy.act(ob)
        ob, reward, done, truncated, info = env.step(a)
        total_rew += reward
        if done or truncated:
            break
    return total_rew
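
# %% [markdown]
# ---
# ## Part 4: CEM Training Loop
# 
# Initialize a Gaussian distribution over the policy parameters, then repeatedly sample policies, evaluate them, and refit the distribution to the elite samples.

# %%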


theta_mean = np.zeros(n_theta)
theta_std = np.ones(n_theta)

reward_list = []

# CEM hyperparameters
n_iterations = 50  # Number of CEM iterations
n_samples = 25     # Number of policy samples per iteration
n_elite = 5        # Number of elite samples to keep

print("Starting CEM training...")
print(f"Iterations: {n_iterations}, Samples: {n_samples}, Elite: {n_elite}")
print("-" * 60)

for itr in range(n_iterations):
    # Sample policies from current distribution
    thetas = np.random.multivariate_normal(
        mean=theta_mean,
        cov=np.diag(np.array(theta_std)**2),
        size=n_samples
    )

    rewards = []
    for theta in thetas:
        policy = LinearPolicy(theta, ob_space, ac_space)
        r = run_episode(policy, env, 500)
        rewards.append(r)

    rewards = np.array(rewards)
    
    # Get elite parameters
    elite_inds = rewards.argsort()[-n_elite:]
    elite_thetas = thetas[elite_inds]

    # Update theta_mean, theta_std
    theta_mean = elite_thetas.mean(axis=0)
    theta_std = elite_thetas.std(axis=0)
    
    # Log progress
    print(f"[Iteration {itr:2d}] mean: {np.mean(rewards):5.1f} | max: {np.max(rewards):5.1f} | min: {np.min(rewards):5.1f}")
    reward_list.append(np.mean(rewards))

print("-" * 60)
print("Training complete!")

env.close()

# %% [markdown]
# ---
# ## Part 5: Results Analysis
# 
# Let's visualize the learning curve and analyze the results.

# %%
# Plot learning curve
plt.figure(figsize=(10, 6))
plt.plot(reward_list, linewidth=2)
plt.xlabel('Iteration', fontsize=12)
plt.ylabel('Average Reward', fontsize=12)
plt.title('CEM Learning Curve on CartPole-v1', fontsize=14, fontweight='bold')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

# Print statistics
print(f"Initial average reward: {reward_list[0]:.2f}")
print(f"Final average reward: {reward_list[-1]:.2f}")
print(f"Best average reward: {max(reward_list):.2f}")
print(f"Improvement: {reward_list[-1] - reward_list[0]:.2f}")




# %% [markdown]
# ---
# ## Save Results
# 
# Save the results to a CSV file for later analysis.

# %%
df = pd.DataFrame({"reward": reward_list})
df.to_csv("Linear_CEM.csv", index=False, header=True)


Copy the demo code above into VSCode and try to run it on your computer.

Write Report

For each lab, you are required to write a report that:

  • demonstrates your code results
  • answers a few questions to show your understanding of the content

Please go through this Submission Guideline to get familiar with the overall process.

For this lab, you should include the following items in the report:

  • An answer to the question: What are the pros and cons of CEM for RL?
  • A picture of the learning curve of CEM on CartPole-v1

Push Further for Full Credit

So far, your effort has mostly gone into installing the environment. Ideally, I want you to learn more about RL and CEM through this lab.

CEM for Continuous Control

The demo shows how to use CEM in a discrete case, meaning the action is categorical (either 0 or 1 in CartPole-v1). Now, I want you to modify the code to run a continuous task, Pendulum-v1, in which the action is a number ranging from -2 to 2.

There are several places you need to change:

  • To get the dimension of the action space, use env.action_space.shape[0].
  • Change the parameters self.W and self.b so that the output is a single scalar in the range -2 to 2.
    • To bound (or squash) the range, feel free to put a 2*np.tanh() around the output.

Debug: you will encounter a bug caused by an incorrect data dimension. Use the opportunity to practice your debugging skills.

Hint: squeeze() is a very common function for removing redundant dimensions. One possible version of the modified policy is sketched below; try your own fix before looking at it.
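
To check your own modification against something concrete, here is a minimal sketch of one possible continuous variant of the linear policy for Pendulum-v1 (the class name is illustrative; your solution may be organized differently):

import numpy as np
import gymnasium as gym

env = gym.make("Pendulum-v1")
ob_space = env.observation_space.shape[0]   # 3 observation dimensions
ac_space = env.action_space.shape[0]        # 1 continuous action in [-2, 2]

class ContinuousLinearPolicy(object):
    """Illustrative continuous variant of LinearPolicy for Pendulum-v1."""
    def __init__(self, theta, ob_space, ac_space):
        assert len(theta) == (ob_space + 1) * ac_space
        self.W = theta[0:ob_space * ac_space].reshape(ob_space, ac_space)
        self.b = theta[ob_space * ac_space:].reshape(1, ac_space)

    def act(self, obs):
        y = obs.dot(self.W) + self.b      # shape (1, ac_space)
        a = 2 * np.tanh(y)                # squash the output into [-2, 2]
        return a.squeeze(axis=0)          # shape (ac_space,), which env.step expects

The rest of the demo (run_episode and the CEM loop) should be able to stay largely as-is once n_theta is computed from these dimensions.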

When you have finished this part, include the learning curve for Pendulum-v1 in your report and briefly share your thoughts on it: what does the performance look like, and why is it this way?

Improve the performance

The fundamental deficiency of the current setup is that the policy is too simple/weak to handle the complexity of this continuous task. Try to improve the code by increasing the capacity of the policy model, e.g., make it a 3-layer neural network with non-linear activation functions (see the sketch below for one possible starting point). Finish the implementation and include the resulting learning curve in your report.
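
As a starting point, here is a minimal sketch of a 3-layer policy (two tanh hidden layers) whose parameters are still packed into one flat theta vector, so it drops into the existing CEM loop. The hidden size of 16 and the class/function names are illustrative assumptions, not part of the demo:

import numpy as np

class MLPPolicy(object):
    """Illustrative 3-layer policy with tanh activations for continuous actions."""
    def __init__(self, theta, ob_dim, ac_dim, hidden=16):
        # Unpack the flat parameter vector into weight matrices and bias vectors
        shapes = [(ob_dim, hidden), (hidden,), (hidden, hidden), (hidden,), (hidden, ac_dim), (ac_dim,)]
        self.params, i = [], 0
        for shape in shapes:
            n = int(np.prod(shape))
            self.params.append(theta[i:i + n].reshape(shape))
            i += n
        assert i == len(theta)

    def act(self, obs):
        W1, b1, W2, b2, W3, b3 = self.params
        h1 = np.tanh(obs.dot(W1) + b1)
        h2 = np.tanh(h1.dot(W2) + b2)
        return 2 * np.tanh(h2.dot(W3) + b3)   # shape (ac_dim,), bounded in [-2, 2]

def mlp_param_count(ob_dim, ac_dim, hidden=16):
    # Total number of parameters CEM has to search over
    return (ob_dim + 1) * hidden + (hidden + 1) * hidden + (hidden + 1) * ac_dim

With a few hundred parameters, building a full covariance matrix for np.random.multivariate_normal becomes unnecessary; since the covariance is diagonal, sampling each dimension independently with theta_mean + theta_std * np.random.randn(n_samples, len(theta_mean)) is equivalent and faster.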

Deliverables and Rubrics

Overall, you need to complete the environment installation and be able to run the demo code. You need to submit:

  • (75 pts) A PDF generated by running the demo code in a Jupyter notebook, with the learning curve picture embedded.
  • (10 pts) If you finish the continuous control part, submit an additional report generated from the modified code.
  • (15 pts) If you improve the continuous control policy, submit another report with the improved learning curve picture.