PyBrain - Reinforcement Learning Module

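Reinforcement learning (RL) is an important part of machine learning. It enables an agent to learn its behavior from the input it receives from its environment.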

The components that interact during reinforcement learning are:

Environment
Agent
Task
Experiment

These components interact as follows:

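In RL, the agent interacts with the environment iteratively. At each iteration, the agent receives an observation that includes a reward. It then chooses an action and sends it to the environment. With each iteration the environment moves to a new state, and the reward received at each step is stored.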
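To make this loop concrete, here is a minimal plain-Python sketch. The CorridorEnv class, the run_loop function, and the always-move-right policy are hypothetical illustrations, not PyBrain classes: the environment is a five-cell corridor whose last cell pays a reward of 1.

```python
class CorridorEnv:
    """Hypothetical toy environment: positions 0..4 on a line, goal at 4."""

    def __init__(self, length=5):
        self.length = length
        self.state = 0

    def step(self, action):
        # action is -1 (left) or +1 (right); the ends of the corridor clamp the position
        self.state = min(max(self.state + action, 0), self.length - 1)
        reward = 1.0 if self.state == self.length - 1 else 0.0
        return self.state, reward


def run_loop(env, choose_action, steps):
    rewards = []                                     # every reward received is stored
    observation, reward = env.state, 0.0
    for _ in range(steps):
        action = choose_action(observation, reward)  # the agent picks an action
        observation, reward = env.step(action)       # the environment moves to a new state
        rewards.append(reward)
    return rewards


always_right = lambda obs, rew: +1                   # hypothetical fixed policy
history = run_loop(CorridorEnv(), always_right, steps=6)
print(history)  # [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]
```

A learning agent would replace always_right with a policy that improves from the stored rewards; the rest of this tutorial does that with PyBrain.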

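The goal of an RL agent is to collect as much reward as possible. During the iterations, the agent's performance is compared with that of an agent that behaves optimally, and the difference between the two determines whether the agent is rewarded or fails. RL is mainly used for problems such as robot control, elevator scheduling, telecommunications, and games.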

Let us see how to use RL in PyBrain.

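We will use a maze environment represented as a two-dimensional NumPy array, in which 1 represents a wall and 0 represents a free field. The agent's job is to move over the free fields and find the goal point.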
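As a small illustration of this representation, the sketch below uses plain Python lists (no NumPy or PyBrain) to count the free fields and to flatten a (row, column) cell into a single state number. The row-by-row numbering row * 9 + col and the helper name state_index are assumptions chosen for illustration; they fit the 81 states used later in this tutorial.

```python
# Same layout as this tutorial's maze: 1 = wall, 0 = free field
mazearray = [
    [1, 1, 1, 1, 1, 1, 1, 1, 1],
    [1, 0, 0, 1, 0, 0, 0, 0, 1],
    [1, 0, 0, 1, 0, 0, 1, 0, 1],
    [1, 0, 0, 1, 0, 0, 1, 0, 1],
    [1, 0, 0, 1, 0, 1, 1, 0, 1],
    [1, 0, 0, 0, 0, 0, 1, 0, 1],
    [1, 1, 1, 1, 1, 1, 1, 0, 1],
    [1, 0, 0, 0, 0, 0, 0, 0, 1],
    [1, 1, 1, 1, 1, 1, 1, 1, 1],
]

ROWS, COLS = len(mazearray), len(mazearray[0])


def state_index(row, col):
    # Hypothetical flattening: number the cells row by row
    return row * COLS + col


free_cells = [(r, c) for r in range(ROWS) for c in range(COLS) if mazearray[r][c] == 0]
goal = (7, 7)

print(ROWS * COLS)          # 81 cells, so 81 possible states
print(len(free_cells))      # 34 free fields the agent can stand on
print(state_index(*goal))   # 70
print(mazearray[7][7])      # 0: the goal is a free field
```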

Here is the step-by-step process for using the maze environment.

Step 1

Import the required packages with the following code:

from numpy import array # recent SciPy versions no longer re-export NumPy's array
import sys, time
import matplotlib.pyplot as pylab # we use matplotlib for visualization

from pybrain.rl.environments.mazes import Maze, MDPMazeTask
from pybrain.rl.learners.valuebased import ActionValueTable
from pybrain.rl.agents import LearningAgent
from pybrain.rl.learners import Q, QLambda, SARSA #@UnusedImport
from pybrain.rl.explorers import BoltzmannExplorer #@UnusedImport
from pybrain.rl.experiments import Experiment
from pybrain.rl.environments import Task

Step 2

Create the maze environment with the following code:

# create the maze: walls are 1 and free fields are 0
mazearray = array(
   [[1, 1, 1, 1, 1, 1, 1, 1, 1],
   [1, 0, 0, 1, 0, 0, 0, 0, 1],
   [1, 0, 0, 1, 0, 0, 1, 0, 1],
   [1, 0, 0, 1, 0, 0, 1, 0, 1],
   [1, 0, 0, 1, 0, 1, 1, 0, 1],
   [1, 0, 0, 0, 0, 0, 1, 0, 1],
   [1, 1, 1, 1, 1, 1, 1, 0, 1],
   [1, 0, 0, 0, 0, 0, 0, 0, 1],
   [1, 1, 1, 1, 1, 1, 1, 1, 1]]
)
# create the environment: the first parameter is the maze array,
# the second one is the goal field tuple
env = Maze(mazearray, (7, 7))

Step 3

The next step is to create the agent.

The agent plays a central role in RL. It interacts with the maze environment through its getAction() and integrateObservation() methods.
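The sketch below shows the order in which these calls happen during one interaction. StubTask, StubAgent, and do_interactions are hypothetical stand-ins written for illustration; the call order (observation to the agent, action to the task, reward back to the agent) mirrors how PyBrain's Experiment drives an agent as we understand it, so check the PyBrain source for the exact details.

```python
class StubTask:
    """Hypothetical stand-in for a task: a line of 3 cells with the goal at cell 2."""

    def __init__(self, log):
        self.log, self.pos = log, 0

    def getObservation(self):
        self.log.append("task.getObservation")
        return self.pos

    def performAction(self, action):
        self.log.append("task.performAction")
        self.pos = min(self.pos + action, 2)

    def getReward(self):
        self.log.append("task.getReward")
        return 1.0 if self.pos == 2 else 0.0


class StubAgent:
    """Hypothetical stand-in for an agent that always moves right."""

    def __init__(self, log):
        self.log, self.last_obs, self.total = log, None, 0.0

    def integrateObservation(self, obs):
        self.log.append("agent.integrateObservation")
        self.last_obs = obs

    def getAction(self):
        self.log.append("agent.getAction")
        return 1

    def giveReward(self, reward):
        self.log.append("agent.giveReward")
        self.total += reward


def do_interactions(task, agent, number):
    # One interaction = observe -> act -> reward, repeated `number` times
    for _ in range(number):
        agent.integrateObservation(task.getObservation())
        task.performAction(agent.getAction())
        agent.giveReward(task.getReward())


log = []
task, agent = StubTask(log), StubAgent(log)
do_interactions(task, agent, 3)
print(log[:6])       # the six calls that make up the first interaction
print(agent.total)   # 2.0
```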

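The agent has a controller (which maps states to actions) and a learner.

In PyBrain, the controller works like a module: its input is the state, and it converts states into actions.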

controller = ActionValueTable(81, 4)
controller.initialize(1.)

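ActionValueTable takes 2 inputs: the number of states and the number of actions. Here there are 81 states, one for each cell of the 9×9 maze array, and initialize(1.) sets every entry of the table to 1.0. The standard maze environment has 4 actions: north, south, east, and west.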
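Conceptually, the table holds one value per (state, action) pair, and the controller prefers the action with the highest value. The plain-Python sketch below illustrates that idea; q_table, greedy_action, and the action numbering are hypothetical, exploration is ignored, and this is not PyBrain's implementation.

```python
N_STATES, N_ACTIONS = 81, 4   # 9x9 cells, 4 moves

# One row per state, one column per action, every value starting at 1.0 (like initialize(1.))
q_table = [[1.0] * N_ACTIONS for _ in range(N_STATES)]


def greedy_action(state):
    # Index of the highest value in this state's row; ties go to the lowest index
    row = q_table[state]
    return max(range(N_ACTIONS), key=lambda a: row[a])


q_table[70][2] = 5.0          # pretend action 2 looks best in state 70
print(greedy_action(70))      # 2
print(greedy_action(0))       # 0, because all four values are tied
```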

Now we create a learner. We use the SARSA() learning algorithm as the agent's learner.
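SARSA moves the value of the state-action pair it just left toward the reward received plus the discounted value of the next state-action pair it actually chooses: Q(s, a) ← Q(s, a) + α [r + γ Q(s', a') - Q(s, a)]. The function below is a minimal sketch of that single update; the name sarsa_update and the example values alpha = 0.5 and gamma = 0.5 are illustrative choices, not PyBrain's defaults.

```python
def sarsa_update(q, s, a, r, s_next, a_next, alpha, gamma):
    """Apply one SARSA update to q[s][a] in place and return the new value."""
    td_target = r + gamma * q[s_next][a_next]   # uses the action actually chosen next
    q[s][a] += alpha * (td_target - q[s][a])
    return q[s][a]


# Two states with two actions each, all values starting at 1.0
q = [[1.0, 1.0], [1.0, 1.0]]
# In state 0 the agent took action 1, got reward 1.0, landed in state 1, and chose action 0 next
new_value = sarsa_update(q, s=0, a=1, r=1.0, s_next=1, a_next=0, alpha=0.5, gamma=0.5)
print(new_value)   # 1.25
```

Q-learning (imported above as Q) differs only in the target: it uses the maximum value over the next state's actions instead of the value of the action actually taken.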

learner = SARSA()
agent = LearningAgent(controller, learner)

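Step 4

This step connects the agent to the environment.

To connect the agent to the environment, we need a special component called a task. The task defines the goal in the environment and how the agent is rewarded for its actions.

Each environment has its own task. The maze environment we are using comes with the MDPMazeTask task. MDP stands for "Markov decision process", which here means that the agent knows its exact position in the maze. The environment is passed to the task as its argument.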
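A hypothetical sketch of what such a task does, written in plain Python rather than with PyBrain: the observation is the agent's exact cell (this is what makes it Markov), moves into walls are blocked, and the reward is 1 at the goal and 0 elsewhere. The reward scheme, the action numbering in MOVES, and the class name SimpleMazeTask are assumptions for illustration; MDPMazeTask's actual rewards and encoding may differ.

```python
# Same layout as this tutorial's maze: 1 = wall, 0 = free field
mazearray = [
    [1, 1, 1, 1, 1, 1, 1, 1, 1],
    [1, 0, 0, 1, 0, 0, 0, 0, 1],
    [1, 0, 0, 1, 0, 0, 1, 0, 1],
    [1, 0, 0, 1, 0, 0, 1, 0, 1],
    [1, 0, 0, 1, 0, 1, 1, 0, 1],
    [1, 0, 0, 0, 0, 0, 1, 0, 1],
    [1, 1, 1, 1, 1, 1, 1, 0, 1],
    [1, 0, 0, 0, 0, 0, 0, 0, 1],
    [1, 1, 1, 1, 1, 1, 1, 1, 1],
]
GOAL = (7, 7)
# Hypothetical action numbering: 0 = north, 1 = south, 2 = east, 3 = west
MOVES = {0: (-1, 0), 1: (1, 0), 2: (0, 1), 3: (0, -1)}


class SimpleMazeTask:
    """Hypothetical MDP-style maze task (not PyBrain's MDPMazeTask)."""

    def __init__(self, maze, start, goal):
        self.maze, self.pos, self.goal = maze, start, goal

    def getObservation(self):
        # Fully observable: the observation encodes the agent's exact cell
        row, col = self.pos
        return row * len(self.maze[0]) + col

    def performAction(self, action):
        drow, dcol = MOVES[action]
        row, col = self.pos[0] + drow, self.pos[1] + dcol
        if self.maze[row][col] == 0:   # moves into walls are ignored
            self.pos = (row, col)

    def getReward(self):
        # Assumed reward scheme: 1 at the goal, 0 everywhere else
        return 1.0 if self.pos == self.goal else 0.0


blocked = SimpleMazeTask(mazearray, (1, 1), GOAL)
blocked.performAction(0)                                # north of (1, 1) is a wall
print(blocked.getObservation(), blocked.getReward())   # 10 0.0

at_goal = SimpleMazeTask(mazearray, (7, 6), GOAL)
at_goal.performAction(2)                                # east of (7, 6) is the goal
print(at_goal.getObservation(), at_goal.getReward())   # 70 1.0
```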

task = MDPMazeTask(env)

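Step 5

With the agent connected to the environment, the next step is to create the experiment.

The experiment ties the task and the agent together so that they can work with each other.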

experiment = Experiment(task, agent)

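Now we run the experiment 1000 times, as shown below. Each round performs 100 interactions, then lets the agent learn and resets it: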

for i in range(1000):
   experiment.doInteractions(100)
   agent.learn()
   agent.reset()
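Here agent.learn() trains on the interactions collected since the last reset, and agent.reset() clears that record so the next round starts fresh (this is our reading of PyBrain's LearningAgent). The HistoryAgent class below is a simplified, hypothetical sketch of that batch pattern on a 3-state chain, not PyBrain's code: it stores (state, action, reward) samples, applies one SARSA update per consecutive pair in learn(), and empties its history in reset().

```python
class HistoryAgent:
    """Hypothetical agent: records samples, learns from them in a batch, then forgets them."""

    def __init__(self, n_states, n_actions, alpha, gamma):
        self.q = [[0.0] * n_actions for _ in range(n_states)]
        self.alpha, self.gamma = alpha, gamma
        self.history = []

    def record(self, state, action, reward):
        # reward is the one received after taking `action` in `state`
        self.history.append((state, action, reward))

    def learn(self):
        # Each consecutive pair of samples gives one SARSA update
        for (s, a, r), (s2, a2, _) in zip(self.history, self.history[1:]):
            target = r + self.gamma * self.q[s2][a2]
            self.q[s][a] += self.alpha * (target - self.q[s][a])

    def reset(self):
        self.history = []


agent = HistoryAgent(n_states=3, n_actions=1, alpha=0.5, gamma=0.5)
for epoch in range(2):
    for sample in [(0, 0, 0.0), (1, 0, 1.0), (2, 0, 0.0)]:
        agent.record(*sample)   # stands in for experiment.doInteractions(...)
    agent.learn()
    agent.reset()
print(agent.q)  # [[0.125], [0.75], [0.0]]
```

The two rounds show the reward propagating backward: state 1, which leads directly to the reward, gains value in the first round, while state 0 only picks up value in the second.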

When the following line runs, the agent and the environment interact 100 times through the task:

experiment.doInteractions(100)

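After each interaction, the environment's new state is returned to the task, which decides what information and reward to pass to the agent. Inside the for loop, after the agent has learned and been reset, we plot the updated value table: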

for i in range(1000):
   experiment.doInteractions(100)
   agent.learn()
   agent.reset()
   # plot the best action value of each of the 81 states as a 9x9 grid
   pylab.pcolor(controller.params.reshape(81,4).max(1).reshape(9,9))
   pylab.savefig("test.png")
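The plotting expression can be read in three steps: reshape the flat parameter vector into 81 rows of 4 action values, take the maximum over the actions of each state, and reshape the 81 maxima into the 9×9 maze layout. The plain-Python function below (value_grid, a hypothetical helper) mimics those NumPy calls so each step is visible. Note that pcolor draws row 0 at the bottom of the image by default, so the plot appears vertically flipped relative to how mazearray is written.

```python
N_ROWS, N_COLS, N_ACTIONS = 9, 9, 4


def value_grid(flat_params):
    """Mimic params.reshape(81, 4).max(1).reshape(9, 9) with plain lists."""
    n_states = N_ROWS * N_COLS
    assert len(flat_params) == n_states * N_ACTIONS
    # reshape(81, 4): each consecutive group of 4 values belongs to one state
    per_state = [flat_params[i * N_ACTIONS:(i + 1) * N_ACTIONS] for i in range(n_states)]
    # max(1): keep the best action value of each state
    best = [max(row) for row in per_state]
    # reshape(9, 9): lay the states back out as maze rows
    return [best[r * N_COLS:(r + 1) * N_COLS] for r in range(N_ROWS)]


params = [0.0] * (81 * 4)
params[70 * 4 + 2] = 3.0   # action 2 of state 70, i.e. cell (7, 7)
grid = value_grid(params)
print(grid[7][7])          # 3.0
```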

Here is the complete code:

Example

maze.py

from numpy import array
import sys, time
import matplotlib.pyplot as pylab

from pybrain.rl.environments.mazes import Maze, MDPMazeTask
from pybrain.rl.learners.valuebased import ActionValueTable
from pybrain.rl.agents import LearningAgent
from pybrain.rl.learners import Q, QLambda, SARSA #@UnusedImport
from pybrain.rl.explorers import BoltzmannExplorer #@UnusedImport
from pybrain.rl.experiments import Experiment
from pybrain.rl.environments import Task

# create maze array
mazearray = array(
   [[1, 1, 1, 1, 1, 1, 1, 1, 1],
   [1, 0, 0, 1, 0, 0, 0, 0, 1],
   [1, 0, 0, 1, 0, 0, 1, 0, 1],
   [1, 0, 0, 1, 0, 0, 1, 0, 1],
   [1, 0, 0, 1, 0, 1, 1, 0, 1],
   [1, 0, 0, 0, 0, 0, 1, 0, 1],
   [1, 1, 1, 1, 1, 1, 1, 0, 1],
   [1, 0, 0, 0, 0, 0, 0, 0, 1],
   [1, 1, 1, 1, 1, 1, 1, 1, 1]]
)
env = Maze(mazearray, (7, 7))

# create task
task = MDPMazeTask(env)

# the controller in PyBrain is like a module: its input is the state,
# and it converts states into actions
controller = ActionValueTable(81, 4)
controller.initialize(1.)

# create the learner - using SARSA()
learner = SARSA()

# create agent
agent = LearningAgent(controller, learner)

# create experiment
experiment = Experiment(task, agent)

# prepare plotting
pylab.gray()
pylab.ion()

for i in range(1000):
   experiment.doInteractions(100)

   agent.learn()
   agent.reset()

   pylab.pcolor(controller.params.reshape(81,4).max(1).reshape(9,9))
   pylab.savefig("test.png")

Output

python maze.py

The colors of the free fields in test.png change with each iteration as the agent learns.
