Paper title
New Metric Formulas that Include Measurement Errors in Machine Learning for Natural Sciences
Paper authors
Paper abstract
Machine learning is widely applied to physics problems in the scientific literature. Both regression and classification problems are addressed with a large array of techniques that involve learning algorithms. Unfortunately, the measurement errors of the data used to train machine learning models are almost always neglected. This leads to estimates of model performance (and thus of generalisation power) that are too optimistic, since the target variables (what one wants to predict) are always assumed to be correct. In physics, this is a serious deficiency, as it can lead to the belief that theories or patterns exist where, in reality, they do not. This paper addresses this deficiency by deriving formulas for commonly used metrics (for both regression and classification problems) that take into account the measurement errors of the target variables. The new formulas give estimates of the metrics that are always more pessimistic than those obtained with the classical formulas, which do not take measurement errors into account. The formulas given here are of general validity, completely model-independent, and can be applied without limitations. Thus, one can analyse with statistical confidence whether relationships exist when dealing with measurements that carry errors of any kind. The formulas have wide applicability outside physics and can be used in all problems where measurement errors are relevant to the conclusions of studies.
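To illustrate why accounting for target measurement errors can only make a metric more pessimistic, here is a minimal sketch for the mean absolute error (MAE). It is not the paper's derivation: the abstract does not state the formulas, so the bound below is an assumption that rests only on the triangle inequality, |y_true - y_pred| <= |y_measured - y_pred| + |y_true - y_measured|. The function names `classical_mae` and `worst_case_mae`, and the per-sample bound `max_errors`, are hypothetical and introduced purely for illustration.

```python
def classical_mae(y_measured, y_pred):
    """MAE computed against the measured targets, ignoring measurement error."""
    n = len(y_measured)
    return sum(abs(m - p) for m, p in zip(y_measured, y_pred)) / n


def worst_case_mae(y_measured, y_pred, max_errors):
    """Upper bound on the MAE against the unknown true targets.

    Assumption (not taken from the paper): each true target lies within
    max_errors[i] of its measured value. The triangle inequality then
    bounds each absolute error by |measured - predicted| + max_errors[i].
    """
    n = len(y_measured)
    return sum(abs(m - p) + e
               for m, p, e in zip(y_measured, y_pred, max_errors)) / n


# Illustrative values only; they are not data from the paper.
y_measured = [1.0, 2.0, 3.0]
y_pred = [1.5, 2.0, 2.0]
max_errors = [0.5, 0.5, 0.5]

print(classical_mae(y_measured, y_pred))               # 0.5
print(worst_case_mae(y_measured, y_pred, max_errors))  # 1.0
```

Because every `max_errors[i]` is non-negative, the bound can never fall below the classical value, which matches the qualitative behaviour the abstract describes. The paper's own formulas may be tighter or take a different form.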