Trial-based Heuristic Tree Search


[TOC]

Concepts

Given an MDP, an agent may have access to:

  • an a-priori model of the environment

    which is revealed to the agent before its interaction with the environment starts.

  • a generative model

    which takes a state and an action as input and outputs a reward and a random next state drawn according to the transition function.

  • a declarative model

    which provides a description of the MDP, i.e. the probabilistic transition and reward functions for all states, successor states, and actions.

Here we distinguish learning agents (which are given no a-priori model) from planning agents (which have access to a generative or a declarative model); a small sketch of the two model types follows.
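The distinction can be made concrete with a minimal Python sketch (the type aliases and function names below are purely illustrative, not from the original text): a declarative model lists all outcomes with their probabilities explicitly, while a generative model only returns samples.

```python
import random
from typing import Dict, List, Tuple

State = str
Action = str

# Declarative model: for every (state, action) pair, the full list of
# (probability, reward, successor state) triples is given explicitly.
DeclarativeModel = Dict[Tuple[State, Action], List[Tuple[float, float, State]]]

def generative_model(model: DeclarativeModel, s: State, a: Action) -> Tuple[float, State]:
    """Generative-model view: given a state and an action, return a sampled
    reward and a random successor state drawn according to the transition
    function; the agent never sees the underlying probabilities."""
    outcomes = model[(s, a)]
    probs = [p for p, _, _ in outcomes]
    _, reward, s_next = random.choices(outcomes, weights=probs, k=1)[0]
    return reward, s_next
```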

The multi-armed bandit problem (MAB)

Figuratively, the agent faces a slot machine (a bandit) with multiple arms, and it has to decide which arm to pull based only on the payouts it received so far. Each arm provides a random reward according to a probability distribution that is specific to the arm and constant over time, but the agent has no initial information on the distribution.

The exploration-exploitation dilemma

Put simply: the agent has to try out different actions, giving up some reward in order to find the best action, but it cannot afford to give up too much. It must balance the reward sacrificed while exploring against the reward gained by exploiting the best action it has found.

On the one hand, since the agent has no a-priori knowledge of the environment, it has to explore its possibilities to learn from the feedback the environment provides. On the other hand, since it aims to maximize the accumulated reward over all runs, it has an incentive to exploit the knowledge it has gathered by executing the action it believes to be best. In other words, the agent learns from its trial-and-error interaction with the environment, and it has to make sure that the balance between exploration and exploitation is such that it learns the best action without sacrificing too much reward in early decisions.

[Figure: an example MAB problem with three arms; edges are labeled p:r, where p is the transition probability and r the reward.]

The example MAB problem has three arms al, am, and ar (the left, middle, and right arm, respectively). Pulling the left arm yields a reward of 20 or 0, each with probability 0.5. Pulling the right arm yields a reward of 70 or 30, each with probability 0.5. Pulling the middle arm yields a reward of 100 with probability 0.1, 60 with probability 0.5, and 30 with probability 0.4. With access to the declarative model we can compute the expected rewards of the three arms as Q*(al) = 10, Q*(am) = 52, and Q*(ar) = 50, so the best choice is to pull the middle arm.
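A quick sanity check of these expected values from the declarative description (a minimal sketch; the dictionary of (probability, reward) pairs is just one way to encode the example):

```python
# Per-arm outcome distributions of the example MAB: (probability, reward) pairs.
ARMS = {
    "a_l": [(0.5, 20), (0.5, 0)],
    "a_m": [(0.1, 100), (0.5, 60), (0.4, 30)],
    "a_r": [(0.5, 70), (0.5, 30)],
}

# Expected reward Q*(a) = sum of p * r over all outcomes of arm a.
q_star = {arm: sum(p * r for p, r in dist) for arm, dist in ARMS.items()}
print(q_star)  # {'a_l': 10.0, 'a_m': 52.0, 'a_r': 50.0}
```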

However, even though computing the expected values is trivial when the model is accessible, the MAB is a challenging problem because the agent has no prior knowledge of the distributions that define the rewards. It therefore has to explore the machine's arms to get an idea of what the distributions (or expected rewards) look like, and to exploit its knowledge once its belief is sufficiently certain. If it gives up exploring too early, it may never find the best arm; consider, for example, the unlikely case that the middle arm does not yield the reward of 100 in the first few runs. In that case its action-value estimate Qk(am) can be expected to lie below 50, and ar may look more promising. Hence, if the agent commits to its belief too early, it will keep choosing the right arm even though am is better. On the other hand, if it keeps exploring for too long, it will pull the left or right arm more often than necessary, sacrificing reward.
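To make the unlucky scenario concrete, here is the conditional expected reward of am when the 100-reward outcome is simply never observed (an illustrative calculation, not part of the original text):

```python
# Outcomes of a_m without the 100 reward, renormalized: 0.5 -> 60, 0.4 -> 30.
p_rest = 0.5 + 0.4
e_am_without_100 = (0.5 * 60 + 0.4 * 30) / p_rest
print(round(e_am_without_100, 2))  # 46.67, i.e. below Q*(ar) = 50
```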

ETE strategy (Explore-then-Exploit); a code sketch follows below.

  • Exploration phase of length l: aims to identify the best arm with high probability. Action selection is uniform, i.e. an arm is picked uniformly at random in each of the l rounds.

  • Exploitation phase: greedy action selection, i.e. always pull the arm with the highest estimated Q value.
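A minimal sketch of the ETE strategy, assuming the agent only has sampling access through some `pull(arm)` function (function and variable names are illustrative):

```python
import random
from collections import defaultdict

def explore_then_exploit(arms, pull, exploration_length, total_rounds):
    """ETE: pull arms uniformly at random for `exploration_length` rounds,
    then commit to the arm with the highest estimated Q value for the rest."""
    counts = defaultdict(int)
    sums = defaultdict(float)
    total_reward = 0.0

    # Exploration phase: uniform action selection.
    for _ in range(exploration_length):
        arm = random.choice(arms)
        reward = pull(arm)
        counts[arm] += 1
        sums[arm] += reward
        total_reward += reward

    # Exploitation phase: greedy with respect to the sample-average estimates.
    q_hat = {arm: sums[arm] / counts[arm] for arm in arms if counts[arm] > 0}
    best_arm = max(q_hat, key=q_hat.get)
    for _ in range(total_rounds - exploration_length):
        total_reward += pull(best_arm)
    return best_arm, total_reward


# Hypothetical usage with the three-armed example above:
# pull = lambda a: random.choices([r for _, r in ARMS[a]],
#                                 weights=[p for p, _ in ARMS[a]], k=1)[0]
# explore_then_exploit(list(ARMS), pull, exploration_length=30, total_rounds=1000)
```

The choice of l directly trades off the two phases: a small l risks committing to the wrong arm, while a large l wastes pulls on arms already known to be inferior.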