Paper Title

Sales Channel Optimization via Simulations Based on Observational Data with Delayed Rewards: A Case Study at LinkedIn

Paper Authors

Diana M. Negoescu, Pasha Khosravi, Shadow Zhao, Nanyu Chen, Parvez Ahammad, Humberto Gonzalez

Paper Abstract

Training models on data obtained from randomized experiments is ideal for making good decisions. However, randomized experiments are often time-consuming, expensive, risky, infeasible, or unethical to perform, leaving decision makers little choice but to rely on observational data collected under historical policies when training models. This opens questions regarding not only which decision-making policies would perform best in practice, but also regarding the impact of different data collection protocols on the performance of various policies trained on the data, or the robustness of policy performance with respect to changes in problem characteristics such as action- or reward-specific delays in observing outcomes. We aim to answer such questions for the problem of optimizing sales channel allocations at LinkedIn, where sales accounts (leads) need to be allocated to one of three channels, with the goal of maximizing the number of successful conversions over a period of time. A key problem feature is the presence of stochastic delays in observing allocation outcomes, whose distribution is both channel- and outcome-dependent. We built a discrete-time simulation that can handle our problem features and used it to evaluate: a) a historical rule-based policy; b) a supervised machine learning policy (XGBoost); and c) multi-armed bandit (MAB) policies, under different scenarios involving: i) data collection used for training (observational vs. randomized); ii) lead conversion scenarios; and iii) delay distributions. Our simulation results indicate that LinUCB, a simple MAB policy, consistently outperforms the other policies, achieving an 18-47% lift relative to a rule-based policy.
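
The sketch below is not the authors' implementation; it is a minimal illustration of the setup the abstract describes: a discrete-time simulation in which leads arrive each period, a standard LinUCB policy assigns each lead to one of three channels, and the conversion outcome is revealed only after a stochastic delay whose distribution depends on both the channel and the outcome. The feature dimension, delay means, conversion model, and all constants are illustrative assumptions, since the abstract does not specify them.

```python
# Minimal sketch: LinUCB allocation of leads to three channels inside a
# discrete-time simulation with channel- and outcome-dependent reward delays.
# All numerical parameters below are assumptions for illustration only.

import numpy as np

N_CHANNELS = 3        # three sales channels, as in the abstract
DIM = 5               # assumed lead-feature dimension
ALPHA = 1.0           # LinUCB exploration parameter
HORIZON = 200         # number of simulated periods
LEADS_PER_PERIOD = 20

rng = np.random.default_rng(0)

# LinUCB state: one ridge-regression model per channel (arm).
A = [np.eye(DIM) for _ in range(N_CHANNELS)]    # X^T X + I
b = [np.zeros(DIM) for _ in range(N_CHANNELS)]  # X^T y

def choose_channel(x):
    """Pick the channel with the highest upper confidence bound for lead x."""
    scores = []
    for a in range(N_CHANNELS):
        A_inv = np.linalg.inv(A[a])
        theta = A_inv @ b[a]
        scores.append(theta @ x + ALPHA * np.sqrt(x @ A_inv @ x))
    return int(np.argmax(scores))

# Illustrative environment: conversion probability and observation delay
# depend on the channel and (for the delay) on the outcome itself.
true_theta = rng.normal(size=(N_CHANNELS, DIM))
mean_delay = np.array([[2, 6],     # channel 0: [converted, not converted]
                       [4, 10],
                       [8, 20]])   # assumed mean delays, in periods

pending = []       # (reveal_period, channel, features, reward)
conversions = 0

for t in range(HORIZON):
    # Reveal outcomes whose delay has elapsed and update the bandit.
    still_pending = []
    for reveal_t, a, x, r in pending:
        if reveal_t <= t:
            A[a] += np.outer(x, x)
            b[a] += r * x
        else:
            still_pending.append((reveal_t, a, x, r))
    pending = still_pending

    # Allocate this period's incoming leads.
    for _ in range(LEADS_PER_PERIOD):
        x = rng.normal(size=DIM)
        a = choose_channel(x)
        p = 1.0 / (1.0 + np.exp(-true_theta[a] @ x))  # conversion probability
        r = float(rng.random() < p)
        conversions += r
        # Delay distribution depends on both channel and outcome.
        delay = rng.poisson(mean_delay[a, 0 if r else 1])
        pending.append((t + delay, a, x, r))

print(f"Simulated conversions over {HORIZON} periods: {int(conversions)}")
```

In this kind of simulation, the same environment loop can be reused to replay a rule-based policy or a supervised model (e.g., XGBoost scores per channel) in place of `choose_channel`, which is how the paper's policy comparison under different delay and data-collection scenarios can be framed; the specific evaluation protocol used by the authors is not detailed in the abstract.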
