Abstract:
Devising guidance on how to assign individuals to treatment is an important goal in empirical research. In practice, individuals often arrive sequentially, and the planner faces various constraints such as limited budget/capacity, or borrowing constraints, or the need to place people in a queue. For instance, a governmental body may receive a budget outlay at the beginning of a year, and it may need to decide how best to allocate resources within the year to individuals who arrive sequentially. In this and other examples involving inter-temporal trade-offs, previous work on devising optimal policy rules in a static context is either not applicable, or sub-optimal. Here we show how one can use offline observational data to estimate an optimal policy rule that maximizes expected welfare in this dynamic context. We allow the class of policy rules to be restricted for legal, ethical or incentive compatibility reasons. The problem is equivalent to one of optimal control under a constrained policy class, and we exploit recent developments in Reinforcement Learning (RL) to propose an algorithm to solve this. The algorithm is easily implementable with speedups achieved through multiple RL agents learning in parallel processes. We also characterize the statistical regret from using our estimated policy rule by casting the evolution of the value function under each policy in a Partial Differential Equation (PDE) form and using the theory of viscosity solutions to PDEs. We find that the policy regret decays at a $n^{-1/2}$ rate in most examples; this is the same rate as in the static case.
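In stylized notation (the symbols below are assumptions for exposition, not taken from the paper): let $V(\pi)$ denote expected welfare under policy rule $\pi$, let $\Pi$ be the admissible (possibly restricted) policy class, and let $\hat{\pi}_n$ be the rule estimated from $n$ offline observations. The policy regret and the decay rate claimed in the abstract can then be written as

$$ R_n \;=\; \max_{\pi \in \Pi} V(\pi) \;-\; V(\hat{\pi}_n), \qquad \mathbb{E}\bigl[R_n\bigr] \;=\; O\bigl(n^{-1/2}\bigr). $$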
---
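To make the mechanics concrete, here is a minimal, self-contained sketch of the kind of problem the abstract describes, not the paper's algorithm: individuals arrive one at a time, each treatment consumes one unit of a fixed budget, and a logistic policy over (covariate, remaining budget share, remaining time share) is trained with a plain REINFORCE-style policy gradient on simulated episodes. The covariate and treatment-effect model below is a synthetic stand-in for what would, in the paper's setting, be estimated from offline observational data, and all constants and function names are illustrative assumptions. The paper's own method further relies on multiple RL agents learning in parallel processes, which this sketch omits.

```python
# Hypothetical toy example: sequential treatment allocation under a fixed budget,
# with a restricted (logistic) policy class trained by a REINFORCE-style update.
import numpy as np

rng = np.random.default_rng(0)

N_ARRIVALS = 200      # individuals arriving within one planning horizon ("year")
BUDGET = 60           # maximum number of treatments the planner can fund
N_EPISODES = 2000
LEARNING_RATE = 0.05


def sample_individual():
    """Synthetic stand-in for the offline data: a covariate and its treatment effect."""
    x = rng.normal()
    tau = 0.5 * x + rng.normal(scale=0.1)   # effect is larger for larger x
    return x, tau


def policy_features(x, budget_left, t):
    """State features: covariate, remaining budget share, remaining time share, intercept."""
    return np.array([x, budget_left / BUDGET, 1.0 - t / N_ARRIVALS, 1.0])


def treat_probability(theta, feats):
    """Logistic (restricted-class) policy: probability of treating the current arrival."""
    return 1.0 / (1.0 + np.exp(-theta @ feats))


def run_episode(theta):
    """Simulate one horizon; return realized welfare and a REINFORCE score-function gradient."""
    budget_left = BUDGET
    welfare = 0.0
    grad = np.zeros_like(theta)
    for t in range(N_ARRIVALS):
        x, tau = sample_individual()
        feats = policy_features(x, budget_left, t)
        p = treat_probability(theta, feats)
        if budget_left > 0:
            treat = rng.random() < p
            # d/dtheta log pi(action | state) for a Bernoulli(p) policy, p = sigmoid(theta . feats)
            grad += ((1.0 - p) if treat else -p) * feats
        else:
            treat = False            # the budget constraint forces "do not treat"
        if treat:
            welfare += tau           # planner's objective: total effect among the treated
            budget_left -= 1
    return welfare, grad


theta = np.zeros(4)
baseline = 0.0                       # running average of episode welfare (variance reduction)
for episode in range(N_EPISODES):
    welfare, grad = run_episode(theta)
    baseline += 0.05 * (welfare - baseline)
    theta += LEARNING_RATE * (welfare - baseline) * grad / N_ARRIVALS

print("learned policy weights (x, budget share, time share, intercept):", theta)
```

The remaining-budget and remaining-time features are what let the learned rule trade off treating a marginal arrival today against saving capacity for stronger candidates later, which is the inter-temporal trade-off the abstract emphasizes.
---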
Title:
Dynamically optimal treatment allocation using Reinforcement Learning
---
Authors:
Karun Adusumilli, Friedrich Geiecke, Claudio Schilter
---
Year of latest submission:
2020
---
Categories:
Primary category: Economics
Secondary category: Econometrics
Category description: Econometric Theory, Micro-Econometrics, Macro-Econometrics, Empirical Content of Economic Relations discovered via New Methods, Methodological Aspects of the Application of Statistical Inference to Economic Data.
--
Primary category: Computer Science
Secondary category: Machine Learning
Category description: Papers on all aspects of machine learning research (supervised, unsupervised, reinforcement learning, bandit problems, and so on), including robustness, explanation, fairness, and methodology. cs.LG is also an appropriate primary category for applications of machine learning methods.
--
---
PDF link:
https://arxiv.org/pdf/1904.01047