Paper Title

Constrained Policy Optimization for Controlled Self-Learning in Conversational AI Systems

Paper Authors

Mohammad Kachuee, Sungjin Lee

Paper Abstract

Recently, self-learning methods based on user satisfaction metrics and contextual bandits have shown promising results in enabling consistent improvements in conversational AI systems. However, directly targeting such metrics via off-policy bandit learning objectives often increases the risk of making abrupt policy changes that break the current user experience. In this study, we introduce a scalable framework for supporting fine-grained exploration targets for individual domains via user-defined constraints. For example, we may want to ensure fewer policy deviations in business-critical domains such as shopping, while allocating more exploration budget to domains such as music. Furthermore, we present a novel meta-gradient learning approach that is scalable and practical for addressing this problem. The proposed method adaptively adjusts constraint-violation penalty terms through a meta objective that encourages balanced constraint satisfaction across domains. We conduct extensive experiments using data from a real-world conversational AI system on a set of realistic constraint benchmarks. Based on the experimental results, we demonstrate that the proposed approach achieves the best balance between policy value and constraint satisfaction rate.
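The mechanism described in the abstract can be illustrated with a toy sketch. The snippet below is a minimal illustration under stated assumptions, not the authors' implementation: it assumes a hinge-style penalized objective of the form loss = -(policy value) + sum over domains of lambda_d * max(0, deviation_d - budget_d), a one-parameter stand-in for the policy, hand-picked per-domain deviation budgets, and a simple meta update that raises lambda_d for domains violating their constraint more than the cross-domain average and lowers it for domains violating less.

```python
# Toy sketch of constrained policy optimization with meta-adapted
# per-domain penalty coefficients. Everything here (budgets, the
# one-parameter "policy", the update rules) is an illustrative
# assumption, not the paper's actual algorithm or code.
import numpy as np

domains = ["shopping", "music"]
budgets = np.array([0.05, 0.20])   # assumed per-domain deviation budgets
lam = np.ones(2)                   # per-domain penalty coefficients, adapted below
theta, lr, meta_lr = 1.0, 0.05, 0.5

def deviation(theta):
    # Toy stand-in for the measured deviation of the learned policy
    # from the deployed policy in each domain (e.g., a disagreement rate).
    return np.array([0.3, 0.3]) * abs(theta)

def penalized_loss(theta, lam):
    # Negative policy value plus lambda-weighted hinge penalties
    # on per-domain constraint violations.
    violation = np.maximum(0.0, deviation(theta) - budgets)
    return -theta + float(lam @ violation)

for _ in range(300):
    # Inner step: gradient descent on the penalized objective
    # (finite differences stand in for the actual policy gradient).
    eps = 1e-4
    grad = (penalized_loss(theta + eps, lam) - penalized_loss(theta - eps, lam)) / (2 * eps)
    theta -= lr * grad
    # Meta step: grow lambda for domains violating more than the
    # cross-domain average, shrink it for domains violating less,
    # pushing toward balanced constraint satisfaction.
    violation = np.maximum(0.0, deviation(theta) - budgets)
    lam = np.maximum(0.0, lam + meta_lr * (violation - violation.mean()))

print("theta:", round(theta, 3), "lambdas:", np.round(lam, 3))
```

In this toy run, the tightly budgeted shopping domain ends up with a large penalty coefficient while the music domain's coefficient decays toward zero, mirroring the abstract's example of protecting a business-critical domain while leaving more exploration headroom elsewhere.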
