Paper Title
Reinforcement Learning Architectures: SAC, TAC, and ESAC
Paper Authors
Paper Abstract
The trend is to implement intelligent agents capable of analyzing available information and utilizing it efficiently. This work presents a number of reinforcement learning (RL) architectures, one of which is designed for intelligent agents. The proposed architectures are called selector-actor-critic (SAC), tuner-actor-critic (TAC), and estimator-selector-actor-critic (ESAC). These architectures are improved models of a well-known RL architecture called actor-critic (AC). In AC, an actor optimizes the used policy, while a critic estimates a value function and evaluates the policy optimized by the actor. SAC is an architecture equipped with an actor, a critic, and a selector. The selector determines the most promising action at the current state based on the critic's latest estimates. TAC consists of a tuner, a model learner, an actor, and a critic. After receiving the approximated value of the current state-action pair from the critic and the learned model from the model learner, the tuner uses the Bellman equation to tune the value of the current state-action pair. ESAC is proposed to implement intelligent agents based on two ideas: lookahead and intuition. Lookahead appears in estimating the values of the available actions at the next state, while intuition appears in maximizing the probability of selecting the most promising action. The newly added elements are an underlying-model learner, an estimator, and a selector. The model learner is used to approximate the underlying model. The estimator uses the approximated value function, the learned underlying model, and the Bellman equation to estimate the values of all actions at the next state. The selector is used to determine the most promising action at the next state, which the actor then uses to optimize the policy. Finally, the results show the superiority of ESAC compared with the other architectures.
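To make the mechanics concrete, the following is a minimal tabular sketch of the ESAC idea as described in the abstract: a model learner, an estimator that applies the Bellman equation for one-step lookahead, a selector that picks the most promising next-state action, and an actor and critic. It is not the authors' implementation; the class name EsacAgent, the learning rates alpha and beta, the discount gamma, and the softmax preference update for the actor are illustrative assumptions.

import numpy as np

class EsacAgent:
    """Hedged sketch of an estimator-selector-actor-critic agent (tabular case)."""

    def __init__(self, n_states, n_actions, alpha=0.1, beta=0.1, gamma=0.99):
        self.n_states, self.n_actions = n_states, n_actions
        self.alpha, self.beta, self.gamma = alpha, beta, gamma
        self.Q = np.zeros((n_states, n_actions))        # critic: action-value estimates
        self.H = np.zeros((n_states, n_actions))        # actor: policy preferences
        # model learner: empirical transition counts and reward sums
        self.counts = np.zeros((n_states, n_actions, n_states))
        self.reward_sum = np.zeros((n_states, n_actions))

    def policy(self, s):
        # softmax over the actor's preferences at state s
        p = np.exp(self.H[s] - self.H[s].max())
        return p / p.sum()

    def act(self, s, rng):
        # sample an action from the current policy
        return int(rng.choice(self.n_actions, p=self.policy(s)))

    def update(self, s, a, r, s_next):
        # model learner: update the approximate transition model and mean rewards
        self.counts[s, a, s_next] += 1
        self.reward_sum[s, a] += r

        # critic: standard TD update of the action-value estimate
        td_target = r + self.gamma * self.Q[s_next].max()
        self.Q[s, a] += self.alpha * (td_target - self.Q[s, a])

        # estimator (lookahead): one-step Bellman estimates of next-state actions,
        # using the learned model and the current value estimates
        est = np.empty(self.n_actions)
        for a_next in range(self.n_actions):
            n = self.counts[s_next, a_next].sum()
            if n == 0:
                est[a_next] = self.Q[s_next, a_next]    # no model data yet: fall back to the critic
            else:
                p_hat = self.counts[s_next, a_next] / n  # learned transition probabilities
                r_hat = self.reward_sum[s_next, a_next] / n
                est[a_next] = r_hat + self.gamma * (p_hat @ self.Q.max(axis=1))

        # selector (intuition): most promising action at the next state
        a_star = int(est.argmax())

        # actor: raise the probability of selecting the most promising action
        pi = self.policy(s_next)
        for b in range(self.n_actions):
            self.H[s_next, b] += self.beta * ((1.0 if b == a_star else 0.0) - pi[b])

A driver loop would call act(s, rng) to sample an action and update(s, a, r, s_next) after each transition; in this sketch the estimator and selector only shape the actor's preferences, while the critic keeps its standard TD update.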