{ "cells": [ { "cell_type": "markdown", "metadata": { "id": "nRrn4j-O0JMM" }, "source": [ "# Practical Session: Reinforcement Learning" ] }, { "cell_type": "markdown", "metadata": { "id": "WFZsyyDk0JMP" }, "source": [ "In" ] }, { "cell_type": "markdown", "metadata": { "id": "SWKBTGvH0JMQ" }, "source": [ "# Gymnasium" ] }, { "cell_type": "markdown", "metadata": { "id": "KvhmAjE30JMQ" }, "source": [ "Gymnasium is a simulation framework and a library of tasks (model environments) that share a single standard interface. It was originally developed at [OpenAI](https://openai.com/) as Gym and is now maintained by the Farama Foundation.\n", "This standard interface makes it possible to write general reinforcement learning (RL) algorithms and test them on several environments with little or no adaptation.\n", "Its central object is the **environment**, usually created with the instruction `gym.make(\"ENV_NAME\")`.\n", "Gymnasium has three key methods:\n", "* `reset()`: resets the environment and returns the observation of the (possibly random) initial state, together with an `info` dictionary\n", "* `step(a)`: performs the action `a` and returns five values:\n", "  * `observation`: the observation of the next state\n", "  * `reward`: the reward received for moving from the previous state to the new one by performing action `a`\n", "  * `terminated`: a boolean indicating whether the episode ended in a terminal state\n", "  * `truncated`: a boolean indicating whether the episode was cut short, for example by a step limit\n", "  * `info`: a dictionary used to pass any other kind of information\n", "* `render()`: displays/draws the current state of the environment" ] }, { "cell_type": "markdown", "metadata": { "id": "b_AWOwwu0JMR" }, "source": [ "## Frozen Lake\n", "\n", "[Frozen Lake](https://gym.openai.com/envs/FrozenLake-v0/)\n", "is a simple \"grid world\" environment provided by Gymnasium.\n", "Starting from a fixed initial position, you control an agent whose aim is to reach the goal located on the opposite side of the map.\n", "Some tiles can be walked on, while others are holes in the ice, and stepping into one ends the 
episode. Because the frozen lake is slippery, transitions can be stochastic: the direction in which the agent actually moves is uncertain and depends only partly on the chosen direction.\n", "\n", "Here is the official description of Frozen Lake provided by OpenAI:\n", "*Winter is here. You and your friends were tossing around a frisbee at the park when you made a wild throw that left the frisbee out in the middle of the lake. The water is mostly frozen, but there are a few holes where the ice has melted. If you step into one of those holes, you'll fall into the freezing water. At this time, there's an international frisbee shortage, so it's absolutely imperative that you navigate across the lake and retrieve the disc. However, the ice is slippery, so you won't always move in the direction you intend.*\n", "\n", "*The surface is described using a grid, for example:*\n", "```\n", "SFFF (S: starting point, safe)\n", "FHFH (F: frozen surface, safe)\n", "FFFH (H: hole, fall to your doom)\n", "HFFG (G: goal, where the frisbee is located)\n", "```\n", "*The episode ends when you reach the goal or fall into a hole. 
You receive a reward of 1 if you reach the goal, and 0 otherwise.*\n", "\n", "Let's create the Frozen Lake environment:" ] }, { "cell_type": "code", "source": [ "!pip install gymnasium" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "ixBQ-gjmvr-J", "outputId": "0ca0cbc7-5e35-4f8f-8667-2746765bb3d2" }, "execution_count": 1, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/\n", "Requirement already satisfied: gymnasium in /usr/local/lib/python3.9/dist-packages (0.28.1)\n", "Requirement already satisfied: numpy>=1.21.0 in /usr/local/lib/python3.9/dist-packages (from gymnasium) (1.22.4)\n", "Requirement already satisfied: cloudpickle>=1.2.0 in /usr/local/lib/python3.9/dist-packages (from gymnasium) (2.2.1)\n", "Requirement already satisfied: importlib-metadata>=4.8.0 in /usr/local/lib/python3.9/dist-packages (from gymnasium) (6.1.0)\n", "Requirement already satisfied: jax-jumpy>=1.0.0 in /usr/local/lib/python3.9/dist-packages (from gymnasium) (1.0.0)\n", "Requirement already satisfied: typing-extensions>=4.3.0 in /usr/local/lib/python3.9/dist-packages (from gymnasium) (4.5.0)\n", "Requirement already satisfied: farama-notifications>=0.0.1 in /usr/local/lib/python3.9/dist-packages (from gymnasium) (0.0.4)\n", "Requirement already satisfied: zipp>=0.5 in /usr/local/lib/python3.9/dist-packages (from importlib-metadata>=4.8.0->gymnasium) (3.15.0)\n" ] } ] }, { "cell_type": "code", "execution_count": 99, "metadata": { "id": "tCxJZ9Kn0JMR", "colab": { "base_uri": "https://localhost:8080/" }, "outputId": "06e0ca22-6376-46dd-d4a2-b5dee8009a48" }, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "EnvSpec(id='FrozenLake8x8-v1', entry_point='gymnasium.envs.toy_text.frozen_lake:FrozenLakeEnv', reward_threshold=0.85, nondeterministic=False, max_episode_steps=None, order_enforce=True, 
autoreset=False, disable_env_checker=False, apply_api_compatibility=False, kwargs={'map_name': '8x8', 'render_mode': 'ansi', 'is_slippery': False}, namespace=None, name='FrozenLake8x8', version=1, additional_wrappers=(), vector_entry_point=None)\n" ] } ], "source": [ "import gymnasium as gym\n", "\n", "# Deterministic 8x8 lake rendered as text; other render modes: 'human', 'rgb_array', 'ansi_list'\n", "env = gym.make(\"FrozenLake8x8-v1\", render_mode=\"ansi\", is_slippery=False)\n", "env.reset()\n", "env.render()\n", "print(env.spec)" ] }, { "cell_type": "markdown", "metadata": { "id": "JVJS3IfP0JMS" }, "source": [ "The environment exposes information about its actions and states." ] }, { "cell_type": "code", "execution_count": 100, "metadata": { "id": "d6Rn2_X10JMT", "colab": { "base_uri": "https://localhost:8080/" }, "outputId": "4be654c5-c5c1-4251-c835-47f73793550c" }, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "Action Space Discrete(4)\n", "State Space Discrete(64)\n", "Rewards (0, 1)\n" ] } ], "source": [ "print(f\"Action Space {env.action_space}\")\n", "print(f\"State Space {env.observation_space}\")\n", "print(f\"Rewards {env.reward_range}\")" ] }, { "cell_type": "markdown", "metadata": { "id": "7bVcnm0v0JMT" }, "source": [ "The environment has 64 discrete states, corresponding to the agent's position on the grid. 
\n", "The four possible actions are:" ] }, { "cell_type": "code", "execution_count": 101, "metadata": { "id": "oL4IVAZs0JMT", "colab": { "base_uri": "https://localhost:8080/" }, "outputId": "dd8dd012-8020-4782-c810-30231b9c5cfe" }, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "action 0: ←\n", "action 1: ↓\n", "action 2: →\n", "action 3: ↑\n" ] } ], "source": [ "# Map each action index to an arrow symbol for display\n", "action_map = {0:u'\\u2190', 1:u'\\u2193', 2:u'\\u2192', 3:u'\\u2191'}\n", "for k, v in action_map.items():\n", "    print(f\"action {k}: {v}\")" ] }, { "cell_type": "markdown", "metadata": { "id": "xJZVf4VU0JMU" }, "source": [ "Use the `step` and `render` methods to observe the effect of actions on the environment." ] }, { "cell_type": "code", "execution_count": 102, "metadata": { "id": "o_-rOkxU0JMU", "colab": { "base_uri": "https://localhost:8080/", "height": 35 }, "outputId": "19b68b63-60a7-4e1f-cfc1-a57d2b07fdb2" }, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "' (Down)\\nSFFFFFFF\\n\\x1b[41mF\\x1b[0mFFFFFFF\\nFFFHFFFF\\nFFFFFHFF\\nFFFHFFFF\\nFHHFFFHF\\nFHFFHFHF\\nFFFHFFFG\\n'" ], "application/vnd.google.colaboratory.intrinsic+json": { "type": "string" } }, "metadata": {}, "execution_count": 102 } ], "source": [ "env.step(1)\n", "env.render()" ] }, { "cell_type": "markdown", "metadata": { "id": "ck2vZNpj0JMU" }, "source": [ "This function will help visualize our agent's trajectories:" ] }, { "cell_type": "code", "execution_count": 103, "metadata": { "id": "_9aqdJ3Y0JMU" }, "outputs": [], "source": [ "from IPython.display import clear_output\n", "from time import sleep\n", "\n", "def display_trajectory(frames):\n", "    # Replay the recorded frames one at a time\n", "    for i, frame in enumerate(frames):\n", "        clear_output(wait=True)\n", "        print(frame['frame'])\n", "        print(f\"Timestep: {i + 1}\")\n", "        print(f\"State: {frame['state']}\")\n", "        print(f\"Reward: {frame['reward']}\")\n", "        sleep(.2)" ] }, { "cell_type": "markdown", "metadata": { "id": "TGWFYqGp0JMU" }, "source": [ 
"# Random agent" ] }, { "cell_type": "markdown", "metadata": { "id": "VBSii8fD0JMU" }, "source": [ "The following code shows how to run an episode with an agent that takes random actions." ] }, { "cell_type": "code", "execution_count": 136, "metadata": { "id": "QK5gHNNR0JMV", "colab": { "base_uri": "https://localhost:8080/" }, "outputId": "c987ea6d-d093-4440-e208-eb6eaff1edc0" }, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ ">>>>>\n", "Timestep: 50\n", "State: 29\n", "Reward: 0.0\n" ] } ], "source": [ "frames = [] # for animation\n", "env.reset()\n", "while True:\n", "    # draw a random action from the action space\n", "    action = env.action_space.sample()\n", "    # step takes an action and returns (observation, reward, terminated, truncated, info)\n", "    state, reward, terminated, truncated, info = env.step(action)\n", "    frames.append({\n", "        'frame': env.render(),\n", "        'state': state,\n", "        'reward': reward\n", "    })\n", "    # the episode is over once it terminates or is truncated\n", "    if terminated or truncated:\n", "        break\n", "\n", "display_trajectory(frames)" ] }, { "cell_type": "markdown", "metadata": { "id": "6Xy-TqHk0JMV" }, "source": [ "## Human policy (player actions)" ] }, { "cell_type": "code", "execution_count": 105, "metadata": { "id": "gB5c8UmY0JMV", "colab": { "base_uri": "https://localhost:8080/", "height": 35 }, "outputId": "7d4c8b09-4ecd-40f2-9455-e33397631bda" }, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "'\\n\\x1b[41mS\\x1b[0mFFFFFFF\\nFFFFFFFF\\nFFFHFFFF\\nFFFFFHFF\\nFFFHFFFF\\nFHHFFFHF\\nFHFFHFHF\\nFFFHFFFG\\n'" ], "application/vnd.google.colaboratory.intrinsic+json": { "type": "string" } }, "metadata": {}, "execution_count": 105 } ], "source": [ "env.reset()\n", "env.render()" ] }, { "cell_type": "markdown", "metadata": { "id": "qCcN5dHi0JMV" }, "source": [ "We have just looked at a random policy. 
Let's now test a policy of our own.\n", "Using the available actions, try to reach the goal by running the following code cell several times.\n", "Reminder:\n", "* action 0: ←\n", "* action 1: ↓\n", "* action 2: →\n", "* action 3: ↑" ] }, { "cell_type": "code", "execution_count": 106, "metadata": { "id": "bsiJTO480JMV", "colab": { "base_uri": "https://localhost:8080/", "height": 35 }, "outputId": "d01daaef-0c43-4323-9e32-48a09b2ab046" }, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "' (Down)\\nSFFFFFFF\\n\\x1b[41mF\\x1b[0mFFFFFFF\\nFFFHFFFF\\nFFFFFHFF\\nFFFHFFFF\\nFHHFFFHF\\nFHFFHFHF\\nFFFHFFFG\\n'" ], "application/vnd.google.colaboratory.intrinsic+json": { "type": "string" } }, "metadata": {}, "execution_count": 106 } ], "source": [ "env.step(1)\n", "env.render()" ] }, { "cell_type": "code", "source": [ "print('---- winning sequence ------ ')\n", "actions = { 'Left': 0, 'Down': 1, 'Right': 2, 'Up': 3 }\n", "winning_sequence = (7 * ['Right']) + (7 * ['Down'])\n", "print(winning_sequence)\n", "# env = gym.make(\"FrozenLake-v0\", is_slippery=False)\n", "env.reset()\n", "env.render()\n", "\n", "for a in winning_sequence:\n", "    new_state, reward, terminated, truncated, info = env.step(actions[a])\n", "    env.render()\n", "    print(f\"Reward: {reward:.2f}\")\n", "    print(new_state)\n", "    # stop as soon as the episode ends\n", "    if terminated or truncated:\n", "        break" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "9qGsOYfc9-Fh", "outputId": "1e2ba81d-b1f4-4993-fec3-10e9ddcb16de" }, "execution_count": 135, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "---- winning sequence ------ \n", "['Right', 'Right', 'Right', 'Right', 'Right', 'Right', 'Right', 'Down', 'Down', 'Down', 'Down', 'Down', 'Down', 'Down']\n", "Reward: 0.00\n", "1\n", "Reward: 0.00\n", "2\n", "Reward: 0.00\n", "2\n", "Reward: 0.00\n", "10\n", "Reward: 0.00\n", "2\n", "Reward: 0.00\n", "3\n", "Reward: 0.00\n", "3\n", "Reward: 0.00\n", "11\n", "Reward: 0.00\n", "10\n", "Reward: 0.00\n", "11\n", "Reward: 0.00\n", 
"19\n" ] } ] }, { "cell_type": "markdown", "metadata": { "id": "Jw5w7yKG0JMW" }, "source": [ "When the environment is stochastic (slippery), an action reaches its intended target only with some probability.\n", "The environment provides a full description of its transition model:" ] }, { "cell_type": "code", "execution_count": 125, "metadata": { "id": "7Od24Qgu0JMW", "colab": { "base_uri": "https://localhost:8080/" }, "outputId": "162eb569-8934-4c62-9503-9fecf5668478" }, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "{0: [(0.3333333333333333, 1, 0.0, False),\n", " (0.3333333333333333, 0, 0.0, False),\n", " (0.3333333333333333, 9, 0.0, False)],\n", " 1: [(0.3333333333333333, 0, 0.0, False),\n", " (0.3333333333333333, 9, 0.0, False),\n", " (0.3333333333333333, 2, 0.0, False)],\n", " 2: [(0.3333333333333333, 9, 0.0, False),\n", " (0.3333333333333333, 2, 0.0, False),\n", " (0.3333333333333333, 1, 0.0, False)],\n", " 3: [(0.3333333333333333, 2, 0.0, False),\n", " (0.3333333333333333, 1, 0.0, False),\n", " (0.3333333333333333, 0, 0.0, False)]}" ] }, "metadata": {}, "execution_count": 125 } ], "source": [ "state = 1\n", "env.unwrapped.P[state]" ] }, { "cell_type": "markdown", "metadata": { "id": "8oNz3tP-0JMX" }, "source": [ "This shows the transition probabilities for each possible action.\n", "For example, taking action 2 in state 1 has:\n", "* a 33% chance of leading to state 9\n", "* a 33% chance of leading to state 2\n", "* a 33% chance of leading to state 1\n", "\n", "Each tuple has the form (probability, next state, reward, end of episode)." ] }, { "cell_type": "code", "source": [ "state = 62\n", "env.unwrapped.P[state]" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "i81eMF1-EahC", "outputId": "b1f68517-0448-49be-fd60-e4f783e41688" }, "execution_count": 129, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "{0: [(0.3333333333333333, 54, 0.0, True),\n", " (0.3333333333333333, 61, 0.0, 
False),\n", " (0.3333333333333333, 62, 0.0, False)],\n", " 1: [(0.3333333333333333, 61, 0.0, False),\n", " (0.3333333333333333, 62, 0.0, False),\n", " (0.3333333333333333, 63, 1.0, True)],\n", " 2: [(0.3333333333333333, 62, 0.0, False),\n", " (0.3333333333333333, 63, 1.0, True),\n", " (0.3333333333333333, 54, 0.0, True)],\n", " 3: [(0.3333333333333333, 63, 1.0, True),\n", " (0.3333333333333333, 54, 0.0, True),\n", " (0.3333333333333333, 61, 0.0, False)]}" ] }, "metadata": {}, "execution_count": 129 } ] }, { "cell_type": "markdown", "metadata": { "id": "qc1cA1rn0JMX" }, "source": [ "# Policy iteration" ] }, { "cell_type": "markdown", "metadata": { "id": "YE6pGM_c0JMX" }, "source": [ "Let's now try to solve the FrozenLake problem using the policy iteration algorithm.\n", "![](https://github.com/DavidBert/N7-techno-IA/blob/master/code/reinforcement_learning/images/policy_iter.png?raw=1)" ] }, { "cell_type": "markdown", "metadata": { "id": "kJEgMDCN0JMX" }, "source": [ "Let's implement the policy iteration algorithm.\n", "Note that `env.unwrapped.P[s][a]` returns a list of transitions\n", "$$[(p_1, s'_1, r_1, done_1), \\dots, (p_n, s'_n, r_n, done_n)]$$\n", "We iterate over this list to compute $\\sum_{s',r}p(s',r|s,a)[r+\\gamma V(s')]$." ] }, { "cell_type": "code", "execution_count": 130, "metadata": { "id": "t-7tCF8v0JMX" }, "outputs": [], "source": [ "def compute_sum(env, V, s, a, gamma):\n", "    # V is an array containing the estimated value of every state\n", "    # len(V) = nb_states\n", "    total = 0  # expected return of taking action a in state s\n", "    for p, s_prime, r, _ in env.unwrapped.P[s][a]:\n", "        total += p * (r + gamma * V[s_prime])\n", "    return total" ] }, { "cell_type": "code", "execution_count": 131, "metadata": { "id": "pXsNtJGO0JMY" }, "outputs": [], "source": [ "import numpy as np\n", "\n", "def policy_iteration(env, gamma, theta):\n", "    nb_states = env.observation_space.n\n", "    nb_actions = env.action_space.n\n", "    # 1. 
Initialization\n", "    V = np.zeros(nb_states)\n", "    pi = np.zeros(nb_states, dtype=int)  # integer action indices\n", "\n", "    while True:\n", "\n", "        # 2. Policy Evaluation\n", "        while True:\n", "            delta = 0\n", "            for s in range(nb_states):\n", "                v = V[s]\n", "                V[s] = compute_sum(env, V=V, s=s, a=pi[s], gamma=gamma)\n", "                delta = max(delta, abs(v - V[s]))\n", "            if delta < theta:\n", "                break\n", "\n", "        # 3. Policy Improvement\n", "        policy_stable = True\n", "        for s in range(nb_states):\n", "            old_action = pi[s]\n", "            pi[s] = np.argmax([compute_sum(env, V=V, s=s, a=a, gamma=gamma) for a in range(nb_actions)])\n", "            if old_action != pi[s]:\n", "                policy_stable = False\n", "        if policy_stable:\n", "            break\n", "    return V, pi" ] }, { "cell_type": "markdown", "metadata": { "id": "ae9B6bYE0JMY" }, "source": [ "Let's run it on the frozen lake with the parameters $\\gamma=1$ and $\\theta=10^{-6}$." ] }, { "cell_type": "code", "execution_count": 132, "metadata": { "id": "zP1rZGJC0JMY" }, "outputs": [], "source": [ "V, pi = policy_iteration(env, gamma=1.0, theta=1e-6)" ] }, { "cell_type": "markdown", "metadata": { "id": "uYIDHOVo0JMY" }, "source": [ "Let's display the resulting values and policy:" ] }, { "cell_type": "code", "execution_count": 133, "metadata": { "id": "VN7HmF3W0JMZ", "colab": { "base_uri": "https://localhost:8080/", "height": 408 }, "outputId": "d8a8477f-2ff3-4ec5-ca41-bfa0c7e7760b" }, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "[['↓' '→' '↓' '→' '→' '→' '→' '→']\n", " ['↑' '↑' '↑' '↑' '↑' '↑' '↑' '→']\n", " ['←' '←' '←' '←' '→' '↑' '↑' '→']\n", " ['←' '←' '←' '↓' '←' '←' '→' '→']\n", " ['←' '↑' '←' '←' '→' '↓' '↑' '→']\n", " ['←' '←' '←' '↓' '↑' '←' '←' '→']\n", " ['←' '←' '↓' '←' '←' '←' '←' '→']\n", " ['←' '↓' '←' '←' '↓' '→' '↓' '←']]\n" ] }, { "output_type": "display_data", "data": { "text/plain": [ "
" ], "image/png": "\n" }, "metadata": { "needs_background": "light" } } ], "source": [ "import seaborn as sns\n", "# Heatmap of the state values, one cell per tile of the 8x8 grid\n", "sns.heatmap(V.reshape([8, -1]), cmap=\"coolwarm\", annot=True)\n", "policy = np.array([action_map[x] for x in pi]).reshape([-1, 8])\n", "print(policy)" ] }, { "cell_type": "markdown", "metadata": { "id": "A3_DC3Sr0JMZ" }, "source": [ "Notice that the policy for states 55 (tile (6,7)) and 62 (tile (7,6)) differs from what we expected.\n", "Using `env.unwrapped.P[s]` to inspect the environment's transitions, can you explain this behaviour and why state 62 has a low value?" ] }, { "cell_type": "code", "execution_count": 134, "metadata": { "id": "0_AsY3-M0JMZ", "colab": { "base_uri": "https://localhost:8080/" }, "outputId": "26c73a27-2eba-4c2d-fafc-18c0c71c0ddf" }, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "array([[[180, 200, 230],\n", " [180, 200, 230],\n", " [180, 200, 230],\n", " ...,\n", " [180, 200, 230],\n", " [180, 200, 230],\n", " [180, 200, 230]],\n", "\n", " [[180, 200, 230],\n", " [204, 230, 255],\n", " [204, 230, 255],\n", " ...,\n", " [204, 230, 255],\n", " [204, 230, 255],\n", " [180, 200, 230]],\n", "\n", " [[180, 200, 230],\n", " [235, 245, 249],\n", " [204, 230, 255],\n", " ...,\n", " [204, 230, 255],\n", " [204, 230, 255],\n", " [180, 200, 230]],\n", "\n", " ...,\n", "\n", " [[180, 200, 230],\n", " [235, 245, 249],\n", " [235, 245, 249],\n", " ...,\n", " [204, 230, 255],\n", " [235, 245, 249],\n", " [180, 200, 230]],\n", "\n", " [[180, 200, 230],\n", " [235, 245, 249],\n", " [235, 245, 249],\n", " ...,\n", " [204, 230, 255],\n", " [204, 230, 255],\n", " [180, 200, 230]],\n", "\n", " [[180, 200, 230],\n", " [180, 200, 230],\n", " [180, 200, 230],\n", " ...,\n", " [180, 200, 230],\n", " [180, 200, 230],\n", " [180, 200, 230]]], dtype=uint8)" ] }, "metadata": {}, "execution_count": 134 } ], "source": [ "env.unwrapped.s = 62\n", "env.render()" ] 
}, { "cell_type": "code", "execution_count": 118, "metadata": { "id": "DB9HlvpG0JMZ", "colab": { "base_uri": "https://localhost:8080/" }, "outputId": "d2a52c28-0e7c-4c8f-fc22-4ac7a9788f4b" }, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "{0: [(1.0, 61, 0.0, False)],\n", " 1: [(1.0, 62, 0.0, False)],\n", " 2: [(1.0, 63, 1.0, True)],\n", " 3: [(1.0, 54, 0.0, True)]}" ] }, "metadata": {}, "execution_count": 118 } ], "source": [ "env.unwrapped.P[62]" ] }, { "cell_type": "markdown", "metadata": { "id": "S_PxqHvk0JMZ" }, "source": [ "# Value iteration" ] }, { "cell_type": "markdown", "metadata": { "id": "GWVPdW150JMZ" }, "source": [ "Let's solve the Frozen Lake problem with the value iteration algorithm.\n", "![](https://github.com/DavidBert/N7-techno-IA/blob/master/code/reinforcement_learning/images/value_iteration.png?raw=1)" ] }, { "cell_type": "markdown", "metadata": { "id": "lxQZhVJB0JMa" }, "source": [ "The code in the next cell implements the value iteration algorithm:" ] }, { "cell_type": "code", "execution_count": 119, "metadata": { "id": "aFXjvo670JMa" }, "outputs": [], "source": [ "def value_iteration(env, gamma, theta):\n", "    nb_states = env.observation_space.n\n", "    nb_actions = env.action_space.n\n", "    V = np.zeros(nb_states)\n", "\n", "    # Sweep over all states until the largest update falls below theta\n", "    while True:\n", "        delta = 0\n", "        for s in range(nb_states):\n", "            v = V[s]\n", "            V[s] = max([compute_sum(env, V=V, s=s, a=a, gamma=gamma) for a in range(nb_actions)])\n", "            delta = max(delta, abs(v - V[s]))\n", "\n", "        if delta < theta:\n", "            break\n", "\n", "    # Output a deterministic policy\n", "    pi = np.zeros(nb_states, dtype=int)\n", "    for s in range(nb_states):\n", "        pi[s] = np.argmax([compute_sum(env, V=V, s=s, a=a, gamma=gamma) for a in range(nb_actions)])\n", "\n", "    return V, pi" ] }, { "cell_type": "markdown", "metadata": { "id": "wy-pDmZ30JMa" }, "source": [ "Let's run it (with $\\gamma=1$ and $\\theta=10^{-6}$)." ] }, { "cell_type": "code", "execution_count": 120, "metadata": { "id": "oEn5DenG0JMa" }, "outputs": [], 
"source": [ "V, pi = value_iteration(env, gamma=1.0, theta=1e-6)" ] }, { "cell_type": "markdown", "metadata": { "id": "qx9l4EZ60JMa" }, "source": [ "Let's print the resulting values and actions:" ] }, { "cell_type": "code", "execution_count": 121, "metadata": { "id": "lXgW0Vov0JMa", "colab": { "base_uri": "https://localhost:8080/", "height": 408 }, "outputId": "433913c1-eccd-481d-e6ea-5835676dbd87" }, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "[['←' '←' '←' '←' '←' '←' '←' '←']\n", " ['←' '←' '←' '←' '←' '←' '←' '←']\n", " ['←' '←' '←' '←' '↓' '←' '←' '←']\n", " ['←' '←' '←' '←' '←' '←' '↓' '←']\n", " ['←' '←' '←' '←' '↓' '←' '←' '←']\n", " ['←' '←' '←' '↓' '←' '←' '←' '↓']\n", " ['←' '←' '↓' '←' '←' '↓' '←' '↓']\n", " ['←' '←' '←' '←' '↓' '←' '←' '←']]\n" ] }, { "output_type": "display_data", "data": { "text/plain": [ "
" ], "image/png": "\n" }, "metadata": { "needs_background": "light" } } ], "source": [ "# Heatmap of the state values, one cell per tile of the 8x8 grid\n", "sns.heatmap(V.reshape([8, -1]), cmap=\"coolwarm\", annot=True)\n", "policy = np.array([action_map[x] for x in pi]).reshape([-1, 8])\n", "print(policy)" ] }, { "cell_type": "markdown", "metadata": { "id": "pXcs1Yg40JMa" }, "source": [ "In this practical session we trained an agent to solve the Frozen Lake problem without ever interacting with the environment.\n", "Policy iteration and value iteration are model-based algorithms: they can only be applied when we know the transition probabilities used by the environment.\n", "This is not always the case.\n", "In the next example we train an RL agent by interacting with the environment, using a model-free algorithm." ] }, { "cell_type": "code", "execution_count": 122, "metadata": { "id": "oWCqAJqi0JMb", "colab": { "base_uri": "https://localhost:8080/", "height": 300 }, "outputId": "ba5cf1c2-6404-4eb2-d295-2dd24a86904a" }, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "State: 0\n", "Action: 1\n", "Reward: 0.0\n" ] }, { "output_type": "display_data", "data": { "text/plain": [ "
" ], "image/png": "\n" }, "metadata": { "needs_background": "light" } } ], "source": [ "import matplotlib.pyplot as plt\n", "env = gym.make(\"FrozenLake8x8-v1\", render_mode='rgb_array')\n", "env.reset()\n", "# Sample random action\n", "action = env.action_space.sample()\n", "next_state, reward, terminated, truncated, info = env.step(action)\n", "# Print output\n", "print(f\"State: {next_state}\")\n", "print(f\"Action: {action}\")\n", "print(f\"Reward: {reward}\")\n", "# Render and plot an environment frame\n", "frame = env.render()\n", "plt.imshow(frame)\n", "plt.axis(\"off\")\n", "plt.show()" ] }, { "cell_type": "markdown", "source": [ "# Random-choice agent" ], "metadata": { "id": "BY9B5_0ugtfm" } }, { "cell_type": "code", "source": [ "epoch = 0\n", "num_failed = 0\n", "experience_buffer = []\n", "acum_reward = 0\n", "done = False\n", "scene = 1  # index of the current episode\n", "step = 0\n", "env.reset()\n", "\n", "while not done:\n", "    # Sample random action\n", "    action = env.action_space.sample()\n", "    state, reward, terminated, truncated, info = env.step(action)\n", "    acum_reward += reward\n", "    # Store experience in dictionary\n", "    experience_buffer.append({\n", "        \"frame\": env.render(),\n", "        \"episode\": scene,\n", "        \"step\": step,\n", "        \"state\": state,\n", "        \"action\": action,\n", "        \"reward\": acum_reward,\n", "    })\n", "    # Failed episode (hole or step limit): count it and start a new one\n", "    if (terminated or truncated) and reward < 1:\n", "        num_failed += 1\n", "        scene += 1\n", "        env.reset()\n", "    step += 1\n", "    # Stop after the first success or after 5000 steps\n", "    if reward == 1 or step > 5000:\n", "        done = True\n", "\n", "# Run animation and print console output\n", "# run_animation(experience_buffer)\n", "print(f\"# steps: {step}\")\n", "print(f\"# failed: {num_failed}\")\n", "print(f\"# reward: {acum_reward}\")\n", "print(f\"# scene: {scene}\")\n", "# Render and plot an environment frame\n", "frame = env.render()\n", "plt.imshow(frame)\n", "plt.axis(\"off\")\n", "plt.show()" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 335 }, "id": "JCkXT_F1gxC3", "outputId": "73c4b29c-955f-4274-fa41-16752d2dd5a3" 
}, "execution_count": 172, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "0\n", "# steps: 2622\n", "# failed: 89\n", "# reward: 1.0\n", "# scene: 90\n" ] }, { "output_type": "display_data", "data": { "text/plain": [ "
" ], "image/png": "\n" }, "metadata": { "needs_background": "light" } } ] }, { "cell_type": "code", "source": [ "# Collect the actions taken during the last (successful) episode\n", "s = 0\n", "pim = np.zeros(111)  # action buffer; 111 is an arbitrary upper bound on the episode length\n", "if reward == 1:\n", "    for sce in experience_buffer:\n", "        if sce[\"episode\"] == scene:\n", "            pim[s] = sce[\"action\"]\n", "            s += 1\n", "    print(pim[:s])\n", "print(s)" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "bwaBlDFhliOe", "outputId": "a399e365-4bd0-4774-912e-9f3aa356071b" }, "execution_count": 173, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "[0. 1. 3. 1. 0. 1. 3. 0. 3. 1. 2. 2. 2. 3. 1. 1. 0. 2. 0. 0. 1. 2. 2. 0.\n", " 3. 2. 2. 2. 1. 1. 1. 0. 1. 2. 1. 2.]\n", "36\n" ] } ] }, { "cell_type": "markdown", "source": [ "A candidate policy (the action sequence of the successful episode):\n", "\n", "[0. 1. 3. 1. 0. 1. 3. 0. 3. 1. 2. 2. 2. 3. 1. 1. 0. 2. 0. 0. 1. 2. 2. 0. 3. 2. 2. 2. 1. 1. 1. 0. 1. 2. 1. 2.]" ], "metadata": { "id": "yqvQflosnH4b" } }, { "cell_type": "code", "source": [ "from scipy.special import softmax  # import the softmax function\n", "import math\n", "\"\"\"Training the agent\"\"\"\n", "nb_states = env.observation_space.n\n", "nb_actions = env.action_space.n\n", "q_table = np.zeros([nb_states, nb_actions])  # initial values 0 (overwritten below)\n", "\n", "# Hyperparameters\n", "lm = 0.1  # learning rate\n", "gamma = 1.0  # discount rate\n", "episodes = 7000  # number of episodes\n", "eps1, eps2, decays = 1, 0.001, 5000  # epsilon schedule: start value, end value, decay episodes\n", "epsilon = eps1  # exploration rate\n", "\n", "q_table = np.random.uniform(size=(nb_states, nb_actions))  # random initial Q-values\n", "aQ = q_table.mean(axis=1, keepdims=True)  # mean values, shape (nb_states, 1)\n", "pimc = softmax(q_table - aQ, axis=1)  # conditional probabilities pi(a|s)\n", "\n", "s = 1  # state index\n", "a = np.random.choice(np.arange(nb_actions), p=pimc[s])\n", "decay = math.exp(math.log(eps2/eps1)/decays)\n", "\n", "def policy(s):  # epsilon-greedy policy\n", "    if np.random.random() < epsilon:  # with probability epsilon, any action at random\n", "        return 
np.random.randint(nb_actions)\n", "    return np.argmax(q_table[s])  # otherwise the best action\n", "\n", "def run_episode(ticks=5000):\n", "    s0, _ = env.reset()\n", "    a0 = policy(s0)\n", "    for t in range(ticks):\n", "        s1, r1, terminated, truncated, _ = env.step(a0)\n", "        a1 = policy(s1)\n", "        # on-policy TD update; do not bootstrap from a terminal state\n", "        target = r1 if terminated else r1 + gamma * q_table[s1, a1]\n", "        q_table[s0, a0] += lm * (target - q_table[s0, a0])\n", "        if terminated or truncated:\n", "            return\n", "        s0, a0 = s1, a1\n", "\n", "for episode in range(episodes):  # loop over episodes\n", "    run_episode(1111)\n", "    epsilon *= decay  # become greedier\n", "    if epsilon < eps2:\n", "        epsilon = 0\n", "\n", "# Greedy policy extracted from the learned Q-table\n", "pim = np.zeros(nb_states)\n", "for s in range(nb_states):\n", "    pim[s] = np.argmax(q_table[s,0:4])" ], "metadata": { "id": "zNBcWtF43laa" }, "execution_count": 161, "outputs": [] }, { "cell_type": "code", "source": [ "print(pim)" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "Mfx1JxdiN7Ca", "outputId": "02320901-e4aa-424e-91e6-8bd47b768089" }, "execution_count": 162, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "[0. 0. 0. 3. 2. 2. 2. 0. 0. 0. 0. 3. 3. 3. 2. 2. 2. 0. 0. 1. 1. 1. 0. 3.\n", " 1. 2. 0. 1. 1. 3. 2. 2. 3. 3. 0. 0. 0. 3. 3. 3. 2. 2. 0. 1. 0. 3. 1. 0.\n", " 3. 3. 2. 3. 0. 3. 2. 0. 3. 0. 0. 2. 3. 1. 0. 3.]\n" ] } ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.12" }, "colab": { "provenance": [] } }, "nbformat": 4, "nbformat_minor": 0 }