From fc8b75f2ce5e5b42f23ee439e459ace6acb94eab Mon Sep 17 00:00:00 2001 From: Jan Kaiser Date: Mon, 4 Mar 2024 17:04:59 +0900 Subject: [PATCH] Minor fix gymnasium done and typo --- tutorial.ipynb | 276 ++++++++++++++++++++++++++----------------------- 1 file changed, 145 insertions(+), 131 deletions(-) diff --git a/tutorial.ipynb b/tutorial.ipynb index 70eb2a5..612b8aa 100644 --- a/tutorial.ipynb +++ b/tutorial.ipynb @@ -11,7 +11,7 @@ "

Applying Reinforcement Learning to Particle Accelerators: An Introduction

\n", "

Use case: Transverse beam steering at the ARES linear accelerator at DESY

\n", "\n", - "Tutorial at 4th ICFA beam dynamics mini-workshop on machine learning applications for particle accelerators" + "Tutorial at 4th ICFA beam dynamics mini-workshop on machine learning applications for particle accelerators\n" ] }, { @@ -24,12 +24,12 @@ "source": [ "

Today!

\n", "\n", - "In this tutorial notebook we will implement all the basic components of a **Reinforcement Learning algorithm** to solve a problem in particle accelerators, focused on __reward definition__\n", + "In this tutorial notebook we will implement all the basic components of a **Reinforcement Learning algorithm** to solve a problem in particle accelerators, focused on **reward definition**\n", "\n", "- Part I: Introduction\n", "- Part II: Algorithm implementation in Python\n", "- Part III: Reward definition!\n", - "- Part IV: Training an RL agent" + "- Part IV: Training an RL agent\n" ] }, { @@ -42,7 +42,7 @@ "source": [ "
\n", "

Part I: Introduction

\n", - "
" + "\n" ] }, { @@ -62,7 +62,7 @@ "- **Final energy**: 100-155 MeV\n", "- **Bunch charge**: 0.5-30 pC\n", "- **Bunch length**: 0.2-10 fs\n", - "- **Pulse repetition rate**: 10-50 Hz" + "- **Pulse repetition rate**: 10-50 Hz\n" ] }, { @@ -77,7 +77,7 @@ "\n", "We would like to focus and center the electron beam on a diagnostic screen using corrector and quadrupole magnets\n", "\n", - "" + "\n" ] }, { @@ -92,12 +92,13 @@ "

Refresher from the lecture

\n", "\n", "We need to define:\n", + "\n", "- Actions\n", "- Observations\n", "- Reward\n", "- Environment\n", "- Agent\n", - "" + " \n" ] }, { @@ -114,7 +115,7 @@ "\n", "

Discussion

\n", "

$\\implies$ Is the action space continuous or discrete?

\n", - "

$\\implies$ Is the problem deterministic or stochastic?

" + "

$\\implies$ Is the problem deterministic or stochastic?

\n" ] }, { @@ -145,7 +146,7 @@ " \n", " \n", " \n", - "" + "\n" ] }, { @@ -177,7 +178,7 @@ " \n", "

The camera films the screen

\n", " \n", - "" + "\n" ] }, { @@ -192,17 +193,18 @@ "

The environment's state

\n", "\n", "The `state` can be fully described by with four components:\n", - "- The __target beam__: the beam we want to achieve, our goal\n", + "\n", + "- The **target beam**: the beam we want to achieve, our goal\n", " - as a 4-dimensional array $b^\\mathrm{(t)}=[\\mu_x^{(\\mathrm{t})},\\sigma_x^{(\\mathrm{t})},\\mu_y^{(\\mathrm{t})},\\sigma_y^{(\\mathrm{t})}]$, where $\\mu$ denotes the position on the screen, $\\sigma$ denotes the beam size, and $t$ stands for \"target\".\n", - "- The __incoming beam__: the beam that enters the EA upstream\n", - " - $I = [\\mu_x^{(\\mathrm{i})},\\sigma_x^{(\\mathrm{i})},\\mu_y^{(\\mathrm{i})},\\sigma_y^{(\\mathrm{i})},\\mu_{xp}^{(\\mathrm{i})},\\sigma_{xp}^{(\\mathrm{i})},\\mu_{yp}^{(\\mathrm{i})},\\sigma_{yp}^{(\\mathrm{i})},\\mu_s^{(\\mathrm{i})},\\sigma_s^{(\\mathrm{i})}]$, where $i$ stands for \"incoming\"\n", - "- The __magnet strengths__ and __deflection angles__\n", - " - $[k_{\\mathrm{Q1}},k_{\\mathrm{Q2}},\\theta_\\mathrm{CV},k_{\\mathrm{Q3}},\\theta_\\mathrm{CH}]$\n", - "- The __transverse misalignments__ of __quadrupoles__ and the __diagnostic screen__\n", - " - $[m_{\\mathrm{Q1}}^{(\\mathrm{x})},m_{\\mathrm{Q1}}^{(\\mathrm{y})},m_{\\mathrm{Q2}}^{(\\mathrm{x})},m_{\\mathrm{Q2}}^{(\\mathrm{y})},m_{\\mathrm{Q3}}^{(\\mathrm{x})},m_{\\mathrm{Q3}}^{(\\mathrm{y})},m_{\\mathrm{S}}^{(\\mathrm{x})},m_{\\mathrm{S}}^{(\\mathrm{y})}]$\n", + "- The **incoming beam**: the beam that enters the EA upstream\n", + " - $I = [\\mu_x^{(\\mathrm{i})},\\sigma_x^{(\\mathrm{i})},\\mu_y^{(\\mathrm{i})},\\sigma_y^{(\\mathrm{i})},\\mu_{xp}^{(\\mathrm{i})},\\sigma_{xp}^{(\\mathrm{i})},\\mu_{yp}^{(\\mathrm{i})},\\sigma_{yp}^{(\\mathrm{i})},\\mu_s^{(\\mathrm{i})},\\sigma_s^{(\\mathrm{i})}]$, where $i$ stands for \"incoming\"\n", + "- The **magnet strengths** and **deflection angles**\n", + " - $[k_{\\mathrm{Q1}},k_{\\mathrm{Q2}},\\theta_\\mathrm{CV},k_{\\mathrm{Q3}},\\theta_\\mathrm{CH}]$\n", + "- The **transverse misalignments** of **quadrupoles** and the **diagnostic screen**\n", + " - $[m_{\\mathrm{Q1}}^{(\\mathrm{x})},m_{\\mathrm{Q1}}^{(\\mathrm{y})},m_{\\mathrm{Q2}}^{(\\mathrm{x})},m_{\\mathrm{Q2}}^{(\\mathrm{y})},m_{\\mathrm{Q3}}^{(\\mathrm{x})},m_{\\mathrm{Q3}}^{(\\mathrm{y})},m_{\\mathrm{S}}^{(\\mathrm{x})},m_{\\mathrm{S}}^{(\\mathrm{y})}]$\n", "\n", "

Discussion

\n", - "

$\\implies$ Do we (fully) know or can we observe the state of the environment?

" + "

$\\implies$ Do we (fully) know or can we observe the state of the environment?

\n" ] }, { @@ -217,15 +219,16 @@ "

Our definition of observation

\n", "\n", "The `observation` for this task contains three parts:\n", - "- The __target beam__: the beam we want to achieve, our goal\n", + "\n", + "- The **target beam**: the beam we want to achieve, our goal\n", " - as a 4-dimensional array $b^\\mathrm{(t)}=[\\mu_x^{(\\mathrm{t})},\\sigma_x^{(\\mathrm{t})},\\mu_y^{(\\mathrm{t})},\\sigma_y^{(\\mathrm{t})}]$, where $\\mu$ denotes the position on the screen, $\\sigma$ denotes the beam size, and $t$ stands for \"target\".\n", - "- The __current beam__: the beam we currently have\n", - " - $b^\\mathrm{(c)}=[\\mu_x^{(\\mathrm{c})},\\sigma_x^{(\\mathrm{c})},\\mu_y^{(\\mathrm{c})},\\sigma_y^{(\\mathrm{c})}]$, where $c$ stands for \"current\"\n", - "- The __magnet strengths__ and __deflection angles__\n", - " - $[k_{\\mathrm{Q1}},k_{\\mathrm{Q2}},\\theta_\\mathrm{CV},k_{\\mathrm{Q3}},\\theta_\\mathrm{CH}]$\n", + "- The **current beam**: the beam we currently have\n", + " - $b^\\mathrm{(c)}=[\\mu_x^{(\\mathrm{c})},\\sigma_x^{(\\mathrm{c})},\\mu_y^{(\\mathrm{c})},\\sigma_y^{(\\mathrm{c})}]$, where $c$ stands for \"current\"\n", + "- The **magnet strengths** and **deflection angles**\n", + " - $[k_{\\mathrm{Q1}},k_{\\mathrm{Q2}},\\theta_\\mathrm{CV},k_{\\mathrm{Q3}},\\theta_\\mathrm{CH}]$\n", "\n", "

Discussion

\n", - "

$\implies$ Does this state definition fulfil the Markov property? (does the probability distribution for the next beam depend only on the present state, or is it affected by information about the past?)

" + "

$\implies$ Does this state definition fulfil the Markov property? (does the probability distribution for the next beam depend only on the present state, or is it affected by information about the past?)

\n" ] }, { @@ -240,12 +243,13 @@ "

Goal and reward

\n", "\n", "Our goal is divided in two tasks:\n", - "- to __steer__ the beam to the desired positions\n", - "- to __focus__ the beam to the desired beam size\n", + "\n", + "- to **steer** the beam to the desired positions\n", + "- to **focus** the beam to the desired beam size\n", "\n", "

Discussion

\n", "

$\\implies$ How should we define our reward function? Give it a go!

\n", - "

$\\implies$ We have a whole section dedicated to reward formulation later on

" + "

$\\implies$ We have a whole section dedicated to reward formulation later on

\n" ] }, { @@ -259,12 +263,11 @@ "

Formulating the RL problem

\n", "

Agent / algorithm

\n", "\n", - "\n", "\n", "

image from RL Tips and Tricks - A. Raffin

\n", "\n", "

Discussion

\n", - "

$\\implies$ What would you choose and why?

" + "

$\\implies$ What would you choose and why?

\n" ] }, { @@ -277,7 +280,7 @@ "source": [ "
\n", "

Part II: Algorithm implementation in Python

\n", - "
" + "\n" ] }, { @@ -291,14 +294,16 @@ "

About libraries for RL

\n", "\n", "There are many libraries with already implemented RL algorithms, and frameworks to implement an environment to interact with. In this notebook we use:\n", + "\n", "- [Stable-Baselines3](https://stable-baselines3.readthedocs.io/) for the RL algorithms\n", "- [Gymnasium](https://gymnasium.farama.org/) for the environment\n", "\n", "

More info here

\n", "\n", - "Note: \n", + "Note:\n", + "\n", "- Gymnasium is the successor of the [OpenAI Gym](https://www.gymlibrary.dev/).\n", - "- Stable-baselines3 now has an early-stage JAX implementation [sbx](https://github.com/araffin/sbx)." + "- Stable-baselines3 now has an early-stage JAX implementation [sbx](https://github.com/araffin/sbx).\n" ] }, { @@ -312,7 +317,7 @@ "

Agent / algorithm

\n", "\n", "- As mentioned, we use the [Stable-Baselines3](https://stable-baselines3.readthedocs.io/) (SB3) package to implement the reinforcement learning algorithms.\n", - "- In this tutorial we focus on two examples: [PPO](https://stable-baselines3.readthedocs.io/en/master/modules/ppo.html) (proximal policy optimization) and [TD3](https://stable-baselines3.readthedocs.io/en/master/modules/td3.html) (twin delayed DDPG)" + "- In this tutorial we focus on two examples: [PPO](https://stable-baselines3.readthedocs.io/en/master/modules/ppo.html) (proximal policy optimization) and [TD3](https://stable-baselines3.readthedocs.io/en/master/modules/td3.html) (twin delayed DDPG)\n" ] }, { @@ -329,11 +334,12 @@ "\n", "A custom `gym.Env` would contain the following parts:\n", "\n", - "- __Initialization__: setup the environment, declares the allowed `observation_space` and `action_space`\n", - "- `reset` __method__: resets the environment for a new episode, returns 2-tuple `(observation, info)`\n", - "- `step` __method__: main logic of the environment. It takes an `action`, changes the environment to a new `state`, get new `observation`, compute the `reward`, and finally returns the 4-tuple `(observation, reward, done, info)` \n", - " - `done` checks if the current episode should be terminated (reached goal reached, or exceeded some thresholds)\n", - "- `render` __method__: to visualize the environment (a video, or just some plots)" + "- **Initialization**: setup the environment, declares the allowed `observation_space` and `action_space`\n", + "- `reset` **method**: resets the environment for a new episode, returns 2-tuple `(observation, info)`\n", + "- `step` **method**: main logic of the environment. It takes an `action`, changes the environment to a new `state`, get new `observation`, compute the `reward`, and finally returns the 5-tuple `(observation, reward, terminated, truncated, info)`\n", + " - `terminated` checks if the current episode should be terminated according to the underlying MDP (reached goal reached, or exceeded some thresholds)\n", + " - `truncated` checks if the current episode should be truncated outside of the underlying MD (e.g. time limit)\n", + "- `render` **method**: to visualize the environment (a video, or just some plots)\n" ] }, { @@ -345,7 +351,7 @@ }, "source": [ "

An overview of this RL project

\n", - "" + "\n" ] }, { @@ -361,10 +367,10 @@ "We list the most relevant parts of the project structure below:\n", "\n", "- `utils/train.py` contains the gym environments and the training script\n", - " - `ARESEA` implements the ARES Experimental Area transverse tuning task as a `gym.Env`. It contains the basic logic, such as definition of observation space, action space, and reward. How an action is taken is implemented in child classes with specific backends. \n", - " - `ARESEACheetah` is derived from the base class `ARESEA`, where it uses `cheetah` simulation as a backend.\n", - " - `make_env` Initializes a `ARESEA` envrionment, and wraps it with required [gym.wrappers](https://www.gymlibrary.dev/api/wrappers/) with convenient features (e.g. monitoring the progress, end episode when time_limit is reached, rescales the action, normalize the observation, ...)\n", - " - `train` convenient function for training the RL agent. It calls `make_env`, setup the RL algorithm, starts training, and saves the results in `utils/recordings`, `utils/monitors` and `utils/models`." + " - `ARESEA` implements the ARES Experimental Area transverse tuning task as a `gym.Env`. It contains the basic logic, such as definition of observation space, action space, and reward. How an action is taken is implemented in child classes with specific backends.\n", + " - `ARESEACheetah` is derived from the base class `ARESEA`, where it uses `cheetah` simulation as a backend.\n", + " - `make_env` Initializes a `ARESEA` envrionment, and wraps it with required [gym.wrappers](https://www.gymlibrary.dev/api/wrappers/) with convenient features (e.g. monitoring the progress, end episode when time_limit is reached, rescales the action, normalize the observation, ...)\n", + " - `train` convenient function for training the RL agent. It calls `make_env`, sets up the RL algorithm, starts training, and saves the results in `utils/recordings`, `utils/monitors` and `utils/models`.\n" ] }, { @@ -380,8 +386,8 @@ "We list the most relevant parts of the project structure below:\n", "\n", "- `utils/helpers.py` contains some utility functions\n", - " - `evaluate_ares_ea_agent` Takes a trained agent and evaluates its performance using different metrics.\n", - " - `plot_ares_ea_training_history` shows the progress during training" + " - `evaluate_ares_ea_agent` Takes a trained agent and evaluates its performance using different metrics.\n", + " - `plot_ares_ea_training_history` shows the progress during training\n" ] }, { @@ -394,13 +400,13 @@ "source": [ "

What is Cheetah?

\n", "\n", - "- RL algorithms require a large number of samples to learn ($10^5-10^9$), and getting those samples in the real accelerator is often too costly. \n", - " - This is why a common approach is to train the agent in simulation, and then deploy it in the real machine\n", + "- RL algorithms require a large number of samples to learn ($10^5-10^9$), and getting those samples in the real accelerator is often too costly.\n", + " - This is why a common approach is to train the agent in simulation, and then deploy it in the real machine\n", "- In our case we would train with optics simulation codes for accelerators, such as OCELOT \n", - " - These codes were developed to help the design phases of accelerators, but not to generate training data, making their computing time too high for RL.\n", - "- __Cheetah__ is a tensorized approach for transfer matrix tracking, which saves computation time and overhead compared to OCELOT\n", + " - These codes were developed to help the design phases of accelerators, but not to generate training data, making their computing time too high for RL.\n", + "- **Cheetah** is a tensorized approach for transfer matrix tracking, which saves computation time and overhead compared to OCELOT\n", "\n", - "You can find more information in the [paper](https://arxiv.org/abs/2401.05815) and the [code repository](https://github.com/desy-ml/cheetah)." + "You can find more information in the [paper](https://arxiv.org/abs/2401.05815) and the [code repository](https://github.com/desy-ml/cheetah).\n" ] }, { @@ -447,10 +453,11 @@ "- In this part, you will get familiar with the environment for the beam focusing and positioning at ARES accelerator.\n", "\n", "Some methods:\n", + "\n", "- `reset`: in both real and simulation cases: resets the magnets to initial values. In simulation, regenerate incoming beam, (optionally) resets the magnet misalignments.\n", "- `step`: set magnets to new settings. Observe the beam (run a simulation or observe screen image in real-world).\n", "\n", - "Now let's create the environment:" + "Now let's create the environment:\n" ] }, { @@ -476,7 +483,7 @@ "

$\\implies$ Let's define the position $(\\mu_x, \\mu_y)$ and size $(\\sigma_x, \\sigma_y)$ of the beam on the screen

\n", "

$\\implies$ Modify the target_beam list below, where the order of the arguments is $[\\mu_x,\\sigma_x,\\mu_y,\\sigma_y]$

\n", "

$\\implies$ Take into account the dimensions of the screen ($\\pm$ 2e-3 m)

\n", - "

$\\implies$ The target beam will be represented by a blue circle on the screen

" + "

$\\implies$ The target beam will be represented by a blue circle on the screen

\n" ] }, { @@ -517,7 +524,7 @@ "source": [ "env.target_beam_values = target_beam\n", "env.reset() ##\n", - "plt.figure(figsize = (7, 4))\n", + "plt.figure(figsize=(7, 4))\n", "plt.imshow(env.render()) # Plot the screen image" ] }, @@ -532,7 +539,7 @@ "

Get familiar with the Gym environment

\n", "

$\\implies$ Change the magnet values, i.e. the actions

\n", "

$\\implies$ The actions are normalized to 1, so valid values are in the [-1, 1] interval

\n", - "

$\implies$ The values of the action list in the cell below follow this magnet order: [Q1, Q2, CV, Q3, CH]

" + "

$\implies$ The values of the action list in the cell below follow this magnet order: [Q1, Q2, CV, Q3, CH]

\n" ] }, { @@ -552,7 +559,7 @@ } }, "source": [ - "Perform one step: update the env, observe new beam!" + "Perform one step: update the env, observe new beam!\n" ] }, { @@ -585,7 +592,7 @@ "env = RescaleAction(env, -1, 1) # rescales the action to the interval [-1, 1]\n", "env.reset()\n", "env.step(action)\n", - "plt.figure(figsize = (7, 4))\n", + "plt.figure(figsize=(7, 4))\n", "plt.imshow(env.render())" ] }, @@ -609,7 +616,7 @@ }, "source": [ "- Let's now use the environment in a loop, and perform 10 steps\n", - "- The function below will linearly vary the value of the vertical corrector" + "- The function below will linearly vary the value of the vertical corrector\n" ] }, { @@ -632,16 +639,17 @@ "env.reset()\n", "steps = 10\n", "\n", + "\n", "def change_vertical_corrector(q1, q2, cv, q3, ch, steps, i):\n", " action = np.array([q1, q2, cv + 1 / steps * i, q3, ch])\n", " return action\n", "\n", "\n", - "fig, ax = plt.subplots(1, figsize = (7, 4))\n", + "fig, ax = plt.subplots(1, figsize=(7, 4))\n", "for i in range(steps):\n", " action = change_vertical_corrector(0.2, -0.2, -0.5, 0.3, 0, steps, i)\n", " env.step(action)\n", - " \n", + "\n", " img = env.render()\n", " ax.imshow(img)\n", " display(fig)\n", @@ -659,7 +667,7 @@ "source": [ "
\n", "

Part III: Reward definition!

\n", - "
" + "\n" ] }, { @@ -670,10 +678,10 @@ } }, "source": [ - "- In the following, we reduce our problem to only __focusing of the beam__, and actuators to only __3 quadrupole magnets__ \n", - " - In this way, we can train a RL agent with fewer steps\n", + "- In the following, we reduce our problem to only **focusing of the beam**, and actuators to only **3 quadrupole magnets**\n", + " - In this way, we can train a RL agent with fewer steps\n", "\n", - "Training a good agent revolves primarily around finding the right setup for the environment and the correct reward function. In order to iterate over and compare many different options, our training function takes a dictionary called `config`. The dictionary keys or \"configurations\" are explained below" + "Training a good agent revolves primarily around finding the right setup for the environment and the correct reward function. In order to iterate over and compare many different options, our training function takes a dictionary called `config`. The dictionary keys or \"configurations\" are explained below\n" ] }, { @@ -686,7 +694,7 @@ "source": [ "

Configurations

\n", "\n", - "In the following, we use a `config` dictionary to set up the training. This allows us to easily switch between different training conditions. Below we show some selected configurations that have the most influence on training results, the parameters can mostly be divided into two parts." + "In the following, we use a `config` dictionary to set up the training. This allows us to easily switch between different training conditions. Below we show some selected configurations that have the most influence on training results, the parameters can mostly be divided into two parts.\n" ] }, { @@ -704,7 +712,7 @@ "- `action_mode` Set directly the magnet strength or set a delta action. You may set this to `\"direct\"` or `\"delta\"`. You should find that \"delta\" trains faster. Setting \"delta\" is also crucial in running the agent on the real accelerator.\n", "- `reward_mode`: How the reward is calculated. Can be set to `negative_objective`, `objective_improvement`, or `sum_of_pixels`.\n", "- `time_reward`: Whether the agent will be penalized for making another step, this is intended to make the tuning faster.\n", - "- `rescale_action`: Takes the limits of the magnet settings and scale them into the following range." + "- `rescale_action`: Takes the limits of the magnet settings and scale them into the following range.\n" ] }, { @@ -720,6 +728,7 @@ "

Environment configurations

\n", "\n", "Termination conditions:\n", + "\n", "- `abort_if_off_screen` If this property is set to True, episodes are aborted when the beam is no longer on the screen.\n", "- `time_limit`: Number of interactions the agent gets to tune the magnets within one episode.\n", "- `target_sigma_x_threshold`, `target_sigma_y_threshold`: Thresholds for beam parameters. If all beam parameters are within the threshold from their target, episodes will end and the agent will stop optimising.\n" @@ -734,7 +743,7 @@ }, "source": [ "

Question

\n", - "

$\implies$ What does the existence of termination conditions say about the nature of the problem? Is it episodic or continuous?

" + "

$\implies$ What does the existence of termination conditions say about the nature of the problem? Is it episodic or continuous?

\n" ] }, { @@ -747,7 +756,7 @@ "source": [ "

What could go wrong?

\n", "\n", - "Let's load some pre-trained models using different combinations of the `config` dictionary and using different reward definitions" + "Let's load some pre-trained models using different combinations of the `config` dictionary and using different reward definitions\n" ] }, { @@ -772,11 +781,11 @@ "

Reward = objective_improvement

\n", "Difference of the objective:\n", "\n", - "$$ r_\\mathrm{obj-improvement} = ( \\mathrm{obj}_{j-1} - \\mathrm{obj}_{j} ) / \\mathrm{obj}_0 $$\n", + "$$ r*\\mathrm{obj-improvement} = ( \\mathrm{obj}*{j-1} - \\mathrm{obj}\\_{j} ) / \\mathrm{obj}\\_0 $$\n", "\n", - "$$ obj = \\sum_{i}|b_i^\\mathrm{(c)} - b_i^\\mathrm{(t)}|$$\n", + "$$ obj = \\sum\\_{i}|b_i^\\mathrm{(c)} - b_i^\\mathrm{(t)}|$$\n", "\n", - "where $j$ is the index of the current time step." + "where $j$ is the index of the current time step.\n" ] }, { @@ -788,7 +797,7 @@ }, "source": [ "

Question

\n", - "

$\implies$ What do you expect to happen, and why?

" + "

$\implies$ What do you expect to happen, and why?

\n" ] }, { @@ -826,7 +835,7 @@ "while not (terminated or truncated):\n", " action, _ = loaded_model.predict(observation)\n", " observation, reward, terminated, truncated, info = env.step(action)\n", - " \n", + "\n", " img = env.render()\n", " ax.imshow(img)\n", " display(fig)\n", @@ -855,7 +864,7 @@ "\n", "

Reward = sum_of_pixels (focusing-only)

\n", " \n", - "$$r_\\mathrm{sum-pixel} = - \\sum_\\text{all pixels} \\text{pixel-value}$$" + "$$r_\\mathrm{sum-pixel} = - \\sum_\\text{all pixels} \\text{pixel-value}$$\n" ] }, { @@ -867,7 +876,7 @@ }, "source": [ "

Question

\n", - "

$\implies$ What do you expect to happen, and why?

" + "

$\implies$ What do you expect to happen, and why?

\n" ] }, { @@ -905,7 +914,7 @@ "while not (terminated or truncated):\n", " action, _ = loaded_model.predict(observation)\n", " observation, reward, terminated, truncated, info = env.step(action)\n", - " \n", + "\n", " img = env.render()\n", " ax.imshow(img)\n", " display(fig)\n", @@ -935,10 +944,10 @@ "

Reward = objective_improvement

\n", "Difference of the objective:\n", "\n", - "$$ r_\\mathrm{obj-improvement} = ( \\mathrm{obj}_{j-1} - \\mathrm{obj}_{j} ) / \\mathrm{obj}_0 $$\n", + "$$ r*\\mathrm{obj-improvement} = ( \\mathrm{obj}*{j-1} - \\mathrm{obj}_{j} ) / \\mathrm{obj}\\_0 $$\n", "$$ obj = \\sum_{i}|b_i^\\mathrm{(c)} - b_i^\\mathrm{(t)}|$$\n", "\n", - "where $j$ is the index of the current time step." + "where $j$ is the index of the current time step.\n" ] }, { @@ -951,7 +960,7 @@ "source": [ "

Question

\n", "

$\\implies$ What do you expect to happen?

\n", - "

$\\implies$ What is the difference between Agent 1: \"Gary Buchwald\" and this agent?

" + "

$\\implies$ What is the difference between Agent 1: \"Gary Buchwald\" and this agent?

\n" ] }, { @@ -989,7 +998,7 @@ "while not (terminated or truncated):\n", " action, _ = loaded_model.predict(observation)\n", " observation, reward, terminated, truncated, info = env.step(action)\n", - " \n", + "\n", " img = env.render()\n", " ax.imshow(img)\n", " display(fig)\n", @@ -1019,10 +1028,10 @@ "

Reward = objective_improvement

\n", "Difference of the objective:\n", "\n", - "$$ r_\\mathrm{obj-improvement} = ( \\mathrm{obj}_{j-1} - \\mathrm{obj}_{j} ) / \\mathrm{obj}_0 $$\n", + "$$ r*\\mathrm{obj-improvement} = ( \\mathrm{obj}*{j-1} - \\mathrm{obj}_{j} ) / \\mathrm{obj}\\_0 $$\n", "$$ obj = \\sum_{i}|b_i^\\mathrm{(c)} - b_i^\\mathrm{(t)}|$$\n", "\n", - "where $j$ is the index of the current time step." + "where $j$ is the index of the current time step.\n" ] }, { @@ -1035,7 +1044,7 @@ "source": [ "

Question

\n", "

$\\implies$ What do you expect to happen?

\n", - "

$\\implies$ What is the difference between Agent 1: \"Gary Buchwald\", Agent 3: \"Bertha Sparkman\", and this agent?

" + "

$\\implies$ What is the difference between Agent 1: \"Gary Buchwald\", Agent 3: \"Bertha Sparkman\", and this agent?

\n" ] }, { @@ -1073,7 +1082,7 @@ "while not (terminated or truncated):\n", " action, _ = loaded_model.predict(observation)\n", " observation, reward, terminated, truncated, info = env.step(action)\n", - " \n", + "\n", " img = env.render()\n", " ax.imshow(img)\n", " display(fig)\n", @@ -1103,9 +1112,9 @@ "

Reward = negative_objective

\n", "$$ \\mathrm{obj} = \\sum_{i}|b_i^\\mathrm{(c)} - b_i^\\mathrm{(t)}|$$\n", "\n", - "$$ r_\\mathrm{neg-obj} = -1 * \\mathrm{obj} / \\mathrm{obj}_0 $$\n", + "$$ r\\_\\mathrm{neg-obj} = -1 \\* \\mathrm{obj} / \\mathrm{obj}\\_0 $$\n", "\n", - "where $b = [\\mu_x,\\sigma_x,\\mu_y,\\sigma_y]$, $b^\\mathrm{(c)}$ is the current beam, and $b^\\mathrm{(t)}$ is the target beam. $\\mathrm{obj}_0$ is the initial objective after `reset`." + "where $b = [\\mu_x,\\sigma_x,\\mu_y,\\sigma_y]$, $b^\\mathrm{(c)}$ is the current beam, and $b^\\mathrm{(t)}$ is the target beam. $\\mathrm{obj}_0$ is the initial objective after `reset`.\n" ] }, { @@ -1117,7 +1126,7 @@ }, "source": [ "

Question

\n", - "

$\implies$ What do you expect to happen, and why?

" + "

$\implies$ What do you expect to happen, and why?

\n" ] }, { @@ -1155,7 +1164,7 @@ "while not (terminated or truncated):\n", " action, _ = loaded_model.predict(observation)\n", " observation, reward, terminated, truncated, info = env.step(action)\n", - " \n", + "\n", " img = env.render()\n", " ax.imshow(img)\n", " display(fig)\n", @@ -1173,7 +1182,7 @@ "source": [ "
\n", "

Part IV: Training an RL agent

\n", - "
" + "\n" ] }, { @@ -1187,8 +1196,8 @@ "

What is inside an actor-critic agent like PPO?

\n", "\n", "- An `actor model`, often a neural network, takes the `observation` of the current `state` and predicts an `action` to be taken (forward pass)\n", - " - In the ARES case, it observes the accelerator and predicts the magnet settings\n", - "- A `critic model`, also a neural network, takes the `observation` of the current `state` and predicts the value function of the state (and evaluates how good is the action taken by the `actor model`)" + " - In the ARES case, it observes the accelerator and predicts the magnet settings\n", + "- A `critic model`, also a neural network, takes the `observation` of the current `state` and predicts the value function of the state (and evaluates how good is the action taken by the `actor model`)\n" ] }, { @@ -1203,15 +1212,16 @@ "

Step 1: collect samples

\n", "\n", "- `n_samples = n_steps * n_envs` is the total number of samples, or interactions with the environment in one `epoch` (more on what that means later)\n", - " - One sample is collected at each step\n", - " - We can initialize `n_envs` parallel environments, in which the agent will take `n_steps`\n", - " - The total number of samples then has to account for the samples gathered in all environments\n", + " - One sample is collected at each step\n", + " - We can initialize `n_envs` parallel environments, in which the agent will take `n_steps`\n", + " - The total number of samples then has to account for the samples gathered in all environments\n", "\n", "At each step:\n", + "\n", "- The agent will take actions according to the current `actor model` prediction (forward pass of the model NN)\n", "- The `critic model` will predict the value functions of the states during the episode (forward pass of the model NN)\n", "\n", - "The samples (actions, rewards,...) from all environments are stored in a `buffer`, where `buffer_size = n_samples`" + "The samples (actions, rewards,...) from all environments are stored in a `buffer`, where `buffer_size = n_samples`\n" ] }, { @@ -1229,11 +1239,11 @@ "After performing `n_steps` in a particular environment (and therefore gathering `n_steps` number of samples per environment), it's time to update the actor and critic models (backpropagation of the NNs). Let's consider only 1 environment now for simplicity.\n", "\n", "- One can split the `n_samples` in mini-batches of a certain `batch_size`\n", - " - This means that the model will be completely updated (i.e. has seen all the samples) after `n_samples_tot`/`batch_size` number of backpropagations\n", - " - Once the model is updated, it can be trained again on the same samples a certain number of `n_epochs` (number of iterations on the training set)\n", - " - This process can be repeated a certain number of `epochs` (yes...)\n", - " - The total number of samples across the epochs is `total_timesteps`, where\n", - " - `total_timesteps = n_steps * n_envs * n_epochs = n_samples * n_epoch`" + " - This means that the model will be completely updated (i.e. has seen all the samples) after `n_samples_tot`/`batch_size` number of backpropagations\n", + " - Once the model is updated, it can be trained again on the same samples a certain number of `n_epochs` (number of iterations on the training set)\n", + " - This process can be repeated a certain number of `epochs` (yes...)\n", + " - The total number of samples across the epochs is `total_timesteps`, where\n", + " - `total_timesteps = n_steps * n_envs * n_epochs = n_samples * n_epoch`\n" ] }, { @@ -1245,7 +1255,7 @@ }, "source": [ "

What actually happens when you train a PPO agent?

\n", - "" + "\n" ] }, { @@ -1257,7 +1267,7 @@ }, "source": [ "

Question

\n", - "

$\implies$ What is the advantage of having a buffer?

" + "

$\implies$ What is the advantage of having a buffer?

\n" ] }, { @@ -1273,11 +1283,12 @@ "

Example

\n", "\n", "Let's consider the following training parameters:\n", + "\n", "- `n_steps` = 100\n", "- `n_envs` = 2\n", "- `batch_size` = 50\n", "- `n_epochs` = 3\n", - "- `epochs` = 2" + "- `epochs` = 2\n" ] }, { @@ -1291,7 +1302,7 @@ "

Question

\n", "

$\\implies$ What is total_timesteps?

\n", "

$\\implies$ What is the total number of batches n_batch in 1 epoch?

\n", - "

$\\implies$ What is the total number of model updates?

" + "

$\\implies$ What is the total number of model updates?

\n" ] }, { @@ -1311,7 +1322,7 @@ "- `net_arch`: architecture of the policy network (# of neurons in each layer)\n", "- `gamma`: Discount factor of the RL problem. Set lower to make rewards now more important than rewards later (usually above 0.9)\n", "- `normalize_observation`: Normalize observations throughout training by fitting a running mean and standard deviation of them\n", - "- `normalize_reward`: Normalize rewards throughout training by fitting a running mean and standard deviation of them" + "- `normalize_reward`: Normalize rewards throughout training by fitting a running mean and standard deviation of them\n" ] }, { @@ -1360,7 +1371,7 @@ "

Questions

\n", "

Looking at the config dictionary in the cell above:

\n", "

$\\implies$ How many epochs does it correspond to?

\n", - "

$\\implies$ How many model updates (backpropagation) would you be doing in total?

" + "

$\\implies$ How many model updates (backpropagation) would you be doing in total?

\n" ] }, { @@ -1372,7 +1383,7 @@ }, "source": [ "You will train the agent by executing the cell below:\n", - "_Note_: This could take about 10 min on a laptop." + "_Note_: This could take about 10 min on a laptop.\n" ] }, { @@ -1401,7 +1412,7 @@ "\n", "Let's look at the training metrics to see how the agent did.\n", "\n", - "Comment out the following line and set `agent_under_investigation` to the name of your agent, to check its training history." + "Comment out the following line and set `agent_under_investigation` to the name of your agent, to check its training history.\n" ] }, { @@ -1447,9 +1458,10 @@ "

Check the videos

\n", "\n", "To look at videos of the agent during training:\n", - "1. find the first output line of the training cell. Your agent should have a name (e.g. *Fred Rogers*). \n", - "2. Find the subdirectory `utils/recordings/`. \n", - "3. There should be a directory for the name of your agent with video files in it. The `ml_workshop` directory contains videos from an example training." + "\n", + "1. find the first output line of the training cell. Your agent should have a name (e.g. _Fred Rogers_).\n", + "2. Find the subdirectory `utils/recordings/`.\n", + "3. There should be a directory for the name of your agent with video files in it. The `ml_workshop` directory contains videos from an example training.\n" ] }, { @@ -1463,7 +1475,7 @@ "

Agent evaluation

\n", "Run the following cell to evaluate your agent. This is the mean deviation of the beam parameters from the target. Lower results are better.\n", "\n", - "If you are training agents that include the dipoles, set the functions argument `include_position=True`." + "If you are training agents that include the dipoles, set the functions argument `include_position=True`.\n" ] }, { @@ -1494,7 +1506,7 @@ } ], "source": [ - "plt.figure(figsize = (7,4))\n", + "plt.figure(figsize=(7, 4))\n", "evaluate_ares_ea_agent(agent_under_investigation, include_position=False, n=2000)" ] }, @@ -1508,7 +1520,7 @@ "source": [ "We can also test the trained agent on a simulation.\n", "\n", - "If you want to see an example agent instead of the one you just trained, set `agent_name=\"ml_workshop\"`." + "If you want to see an example agent instead of the one you just trained, set `agent_name=\"ml_workshop\"`.\n" ] }, { @@ -1542,7 +1554,7 @@ "while not done:\n", " action, _ = loaded_model.predict(observation)\n", " observation, reward, done, info = env.step(action)\n", - " \n", + "\n", " img = env.render(mode=\"rgb_array\")\n", " ax.imshow(img)\n", " display(fig)\n", @@ -1566,7 +1578,7 @@ "\n", "Note that this does not happen by itself and is the result of various careful decisions when designing the traiing setup.\n", "\n", - "Once trained, the agent is, however, trivial to use and requires no futher tuning or knowledge of RL." + "Once trained, the agent is, however, trivial to use and requires no futher tuning or knowledge of RL.\n" ] }, { @@ -1610,11 +1622,12 @@ "

Further Resources

\n", "\n", "### Getting started in RL\n", - " - [OpenAI Spinning Up](https://spinningup.openai.com/en/latest/index.html) - Very understandable explainations on RL and the most popular algorithms acompanied by easy-to-read Python implementations.\n", - " - [Reinforcement Learning with Stable Baselines 3](https://youtube.com/playlist?list=PLQVvvaa0QuDf0O2DWwLZBfJeYY-JOeZB1) - YouTube playlist giving a good introduction on RL using Stable Baselines3.\n", - " - [Build a Doom AI Model with Python](https://youtu.be/eBCU-tqLGfQ) - Detailed 3h tutorial of applying RL using *DOOM* as an example.\n", - " - [An introduction to Reinforcement Learning](https://youtu.be/JgvyzIkgxF0) - Brief introdution to RL.\n", - " - [An introduction to Policy Gradient methods - Deep Reinforcement Learning](https://www.youtube.com/watch?v=5P7I-xPq8u8) - Brief introduction to PPO." + "\n", + "- [OpenAI Spinning Up](https://spinningup.openai.com/en/latest/index.html) - Very understandable explainations on RL and the most popular algorithms acompanied by easy-to-read Python implementations.\n", + "- [Reinforcement Learning with Stable Baselines 3](https://youtube.com/playlist?list=PLQVvvaa0QuDf0O2DWwLZBfJeYY-JOeZB1) - YouTube playlist giving a good introduction on RL using Stable Baselines3.\n", + "- [Build a Doom AI Model with Python](https://youtu.be/eBCU-tqLGfQ) - Detailed 3h tutorial of applying RL using _DOOM_ as an example.\n", + "- [An introduction to Reinforcement Learning](https://youtu.be/JgvyzIkgxF0) - Brief introdution to RL.\n", + "- [An introduction to Policy Gradient methods - Deep Reinforcement Learning](https://www.youtube.com/watch?v=5P7I-xPq8u8) - Brief introduction to PPO.\n" ] }, { @@ -1629,12 +1642,12 @@ "\n", "### Papers\n", "\n", - " - [Learning-based optimisation of particle accelerators under partial observability without real-world training](https://proceedings.mlr.press/v162/kaiser22a.html) - Tuning of electron beam properties on a diagnostic screen using RL.\n", - " - [Sample-efficient reinforcement learning for CERN accelerator control](https://journals.aps.org/prab/abstract/10.1103/PhysRevAccelBeams.23.124801) - Beam trajectory steering using RL with a focus on sample-efficient training.\n", - " - [Autonomous control of a particle accelerator using deep reinforcement learning](https://arxiv.org/abs/2010.08141) - Beam transport through a drift tube linac using RL.\n", - " - [Basic reinforcement learning techniques to control the intensity of a seeded free-electron laser](https://www.mdpi.com/2079-9292/9/5/781/htm) - RL-based laser alignment and drift recovery.\n", - " - [Real-time artificial intelligence for accelerator control: A study at the Fermilab Booster](https://journals.aps.org/prab/abstract/10.1103/PhysRevAccelBeams.24.104601) - Regulation of a gradient magnet power supply using RL and real-time implementation of the trained agent using field-programmable gate arrays (FPGAs).\n", - " - [Magnetic control of tokamak plasmas through deep reinforcement learning](https://www.nature.com/articles/s41586-021-04301-9) - Landmark paper on RL for controling a real-world physical system (plasma in a tokamak fusion reactor)." 
+ "- [Learning-based optimisation of particle accelerators under partial observability without real-world training](https://proceedings.mlr.press/v162/kaiser22a.html) - Tuning of electron beam properties on a diagnostic screen using RL.\n", + "- [Sample-efficient reinforcement learning for CERN accelerator control](https://journals.aps.org/prab/abstract/10.1103/PhysRevAccelBeams.23.124801) - Beam trajectory steering using RL with a focus on sample-efficient training.\n", + "- [Autonomous control of a particle accelerator using deep reinforcement learning](https://arxiv.org/abs/2010.08141) - Beam transport through a drift tube linac using RL.\n", + "- [Basic reinforcement learning techniques to control the intensity of a seeded free-electron laser](https://www.mdpi.com/2079-9292/9/5/781/htm) - RL-based laser alignment and drift recovery.\n", + "- [Real-time artificial intelligence for accelerator control: A study at the Fermilab Booster](https://journals.aps.org/prab/abstract/10.1103/PhysRevAccelBeams.24.104601) - Regulation of a gradient magnet power supply using RL and real-time implementation of the trained agent using field-programmable gate arrays (FPGAs).\n", + "- [Magnetic control of tokamak plasmas through deep reinforcement learning](https://www.nature.com/articles/s41586-021-04301-9) - Landmark paper on RL for controling a real-world physical system (plasma in a tokamak fusion reactor).\n" ] }, { @@ -1648,13 +1661,14 @@ "

Further Resources

\n", "\n", "### Literature\n", - " \n", - " - [Reinforcement Learning: An Introduction](http://incompleteideas.net/book/the-book.html) - Standard text book on RL.\n", + "\n", + "- [Reinforcement Learning: An Introduction](http://incompleteideas.net/book/the-book.html) - Standard text book on RL.\n", "\n", "### Packages\n", - " - [Gymnasium](https://gymnasium.farama.org/), (successor of [OpenAI Gym](https://www.gymlibrary.ml)) - De facto standard for implementing custom environments. Also provides a library of RL tasks widely used for benchmarking.\n", - " - [Stable Baselines3](https://github.com/DLR-RM/stable-baselines3) - Provides reliable, benchmarked and easy-to-use implementations of the most important RL algorithms.\n", - " - [Ray RLlib](https://docs.ray.io/en/latest/rllib/index.html) - Part of the *Ray* Python package providing implementations of various RL algorithms with a focus on distributed training." + "\n", + "- [Gymnasium](https://gymnasium.farama.org/), (successor of [OpenAI Gym](https://www.gymlibrary.ml)) - De facto standard for implementing custom environments. Also provides a library of RL tasks widely used for benchmarking.\n", + "- [Stable Baselines3](https://github.com/DLR-RM/stable-baselines3) - Provides reliable, benchmarked and easy-to-use implementations of the most important RL algorithms.\n", + "- [Ray RLlib](https://docs.ray.io/en/latest/rllib/index.html) - Part of the _Ray_ Python package providing implementations of various RL algorithms with a focus on distributed training.\n" ] } ],