Paper title
gfpop: an R Package for Univariate Graph-Constrained Change-Point Detection
Authors
Abstract
In a world with data that change rapidly and abruptly, it is important to detect those changes accurately. In this paper we describe an R package implementing a generalized version of an algorithm recently proposed by Hocking et al. [2020] for penalized maximum likelihood inference of constrained multiple change-point models. This algorithm can be used to pinpoint the precise locations of abrupt changes in large data sequences. Such models have many application domains, including medicine, neuroscience and genomics. Often, practitioners have prior knowledge about the changes they are looking for. For example, in genomic data, biologists sometimes expect peaks: up changes followed by down changes. Taking advantage of such prior information can substantially improve the accuracy with which we can detect and estimate changes. Hocking et al. [2020] described a graph framework to encode many examples of such prior information and a generic algorithm to infer the optimal model parameters, but implemented the algorithm for just a single scenario. We present the gfpop package, which implements the algorithm in a generic manner in R/C++. gfpop works with a user-defined graph that can encode prior assumptions about the types of change that are possible, and it implements several loss functions (Gaussian, Poisson, binomial, biweight and Huber). We then illustrate the use of gfpop on isotonic simulations and several applications in biology. For a number of graphs, the algorithm runs in seconds or minutes on 10^5 data points.
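To make the idea of penalized change-point inference concrete, the following is a minimal sketch in Python, not the gfpop implementation. It assumes a Gaussian (squared-error) loss and no graph constraints, and it uses a plain quadratic-time dynamic program rather than the functional pruning approach of Hocking et al. [2020]. The function name `optimal_partitioning` and its parameters are hypothetical, chosen only for this illustration.

```python
def optimal_partitioning(data, penalty):
    """Unconstrained penalized least-squares change-point detection.

    Illustrative O(n^2) dynamic program (assumption: Gaussian loss,
    no constraints between segment means). Returns the sorted 0-based
    indices at which a new segment starts.
    """
    n = len(data)
    # Prefix sums of x and x^2 give each segment cost in O(1).
    s1, s2 = [0.0], [0.0]
    for x in data:
        s1.append(s1[-1] + x)
        s2.append(s2[-1] + x * x)

    def cost(i, j):
        # Sum of squared deviations of data[i:j] from its own mean.
        seg = s1[j] - s1[i]
        return (s2[j] - s2[i]) - seg * seg / (j - i)

    # best[j]: minimal penalized cost of segmenting data[:j].
    # Starting at -penalty means the first segment is not penalized.
    best = [0.0] * (n + 1)
    best[0] = -penalty
    last = [0] * (n + 1)  # start index of the final segment of data[:j]
    for j in range(1, n + 1):
        best[j], last[j] = min(
            (best[i] + cost(i, j) + penalty, i) for i in range(j)
        )

    # Backtrack through the stored segment starts to recover the changes.
    changes = []
    j = n
    while j > 0:
        i = last[j]
        if i > 0:
            changes.append(i)
        j = i
    return sorted(changes)


# A noiseless "peak": an up change at index 5, then a down change at index 10.
signal = [0.0] * 5 + [5.0] * 5 + [0.0] * 5
print(optimal_partitioning(signal, penalty=1.0))
```

On this noiseless peak the sketch recovers one up change followed by one down change. A graph-constrained method such as the one in the abstract would additionally restrict which sequences of changes are allowed, for example by forcing every up change to be followed by a down change.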