Paper Title

BCRLSP: An Offline Reinforcement Learning Framework for Sequential Targeted Promotion

Authors

Fanglin Chen, Xiao Liu, Bo Tang, Feiyu Xiong, Serim Hwang, Guomian Zhuang

Abstract

We utilize an offline reinforcement learning (RL) model for sequential targeted promotion in the presence of budget constraints in a real-world business environment. In our application, the mobile app aims to boost customer retention by sending cash bonuses to customers while controlling the cost of such cash bonuses during each time period. To achieve this multi-task goal, we propose the Budget Constrained Reinforcement Learning for Sequential Promotion (BCRLSP) framework to determine the value of the cash bonus to be sent to each user. We first find the target policy and the associated Q-values that maximize the user retention rate using an RL model. A linear programming (LP) model is then added to satisfy the promotion cost constraints. We solve the LP problem by maximizing the Q-values of the actions learned from the RL model subject to the budget constraints. During deployment, we combine the offline RL model with the LP model to generate a robust policy under the budget constraints. Using both online and offline experiments, we demonstrate the efficacy of our approach by showing that BCRLSP achieves a higher long-term customer retention rate and a lower cost than various baselines. Taking advantage of the near-real-time cost control method, the proposed framework can easily adapt to data with a noisy behavioral policy and/or meet flexible budget constraints.
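To make the LP stage in the abstract concrete, the sketch below allocates one bonus level per user so that the total RL-estimated Q-value is maximized while the expected spend stays within a per-period budget. This is a minimal sketch assuming a standard per-user assignment relaxation solved with scipy.optimize.linprog; the bonus levels, Q-values, and budget figure are hypothetical placeholders, not the paper's actual data or implementation.

```python
import numpy as np
from scipy.optimize import linprog

# Illustrative sketch of the budget-constrained allocation step: choose a
# cash-bonus action for each user so that the sum of RL-estimated Q-values
# is maximized while the total expected cost stays within the period budget.
# All numbers below are hypothetical stand-ins.

rng = np.random.default_rng(0)
n_users, n_actions = 100, 4                 # e.g. bonus levels {0, 1, 3, 5}
bonus_values = np.array([0.0, 1.0, 3.0, 5.0])
q_values = rng.random((n_users, n_actions))  # stand-in for offline-RL Q(s_u, a)
budget = 120.0                               # per-period spending cap

# Decision variables x[u, a] in [0, 1]: relaxed probability of sending
# bonus a to user u. Objective: maximize sum_{u,a} Q[u,a] * x[u,a].
c = -q_values.ravel()                        # linprog minimizes, so negate

# Budget constraint: sum_{u,a} bonus_values[a] * x[u,a] <= budget
A_ub = np.tile(bonus_values, n_users)[None, :]
b_ub = np.array([budget])

# Each user receives exactly one action: sum_a x[u, a] = 1
A_eq = np.zeros((n_users, n_users * n_actions))
for u in range(n_users):
    A_eq[u, u * n_actions:(u + 1) * n_actions] = 1.0
b_eq = np.ones(n_users)

res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
              bounds=(0.0, 1.0), method="highs")
x = res.x.reshape(n_users, n_actions)
chosen = x.argmax(axis=1)                    # round the relaxation to one action per user
print("total expected cost:", bonus_values[chosen].sum())
```

In a deployment like the one described, the random Q-values would be replaced by the offline RL model's estimates for the current user states, and the relaxed solution would be rounded (or solved as an integer program) to assign exactly one bonus per user in near real time.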
