Paper Title
Implicit Two-Tower Policies
Paper Authors
Paper Abstract
We present a new class of structured reinforcement learning policy architectures, Implicit Two-Tower (ITT) policies, where actions are chosen based on the attention scores of their learnable latent representations with those of the input states. By explicitly disentangling action processing from state processing in the policy stack, we achieve two main goals: substantial computational gains and better performance. Our architectures are compatible with both discrete and continuous action spaces. Through tests on 15 environments from OpenAI Gym and the DeepMind Control Suite, we show that ITT architectures are particularly well suited for blackbox/evolutionary optimization and that the corresponding policy training algorithms outperform their vanilla unstructured implicit counterparts as well as commonly used explicit policies. We complement our analysis by showing how techniques such as hashing and lazy tower updates, which rely critically on the two-tower structure of ITTs, can be applied to obtain additional computational improvements.
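To make the action-selection mechanism concrete, the following is a minimal sketch of how an ITT-style policy could score actions against a state latent for a discrete action space. The tower parameterization, dimensions, and function names (`mlp`, `itt_act`) are illustrative assumptions, not the paper's exact architecture; the key point is that action latents are produced independently of the state tower, so they can be precomputed (enabling the lazy-update and hashing tricks mentioned above).

```python
# Minimal sketch of an Implicit Two-Tower (ITT) policy for a discrete action
# space, assuming simple MLP towers and dot-product attention scores.
# Illustrative only; not the paper's exact architecture.
import numpy as np

def mlp(x, weights):
    """Apply a small MLP given a list of (W, b) layer parameters."""
    for W, b in weights[:-1]:
        x = np.tanh(x @ W + b)
    W, b = weights[-1]
    return x @ W + b  # final linear layer -> latent representation

def itt_act(state, state_tower, action_latents):
    """Pick the action whose learnable latent has the highest
    attention (dot-product) score with the state latent."""
    z_s = mlp(state, state_tower)      # state-tower latent
    scores = action_latents @ z_s      # one score per action
    return int(np.argmax(scores))

# Hypothetical dimensions: 8-dim observations, 16-dim latents, 4 actions.
rng = np.random.default_rng(0)
state_tower = [(rng.normal(size=(8, 32)), np.zeros(32)),
               (rng.normal(size=(32, 16)), np.zeros(16))]
action_latents = rng.normal(size=(4, 16))   # learnable action embeddings

action = itt_act(rng.normal(size=8), state_tower, action_latents)
```

Because the action latents above do not depend on the current state, they only need to be recomputed when the action-tower parameters change, which is the intuition behind the lazy tower updates referenced in the abstract.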