Paper Title
A Comparative Study on Crime in Denver City Based on Machine Learning and Data Mining
Authors
Abstract
To ensure public safety, crime prevention is one of the highest priorities for any government. An accurate crime prediction model can help governments and law enforcement prevent violence, detect criminals in advance, allocate government resources, and recognize the problems that cause crime. To build any future-oriented tool, it is essential to examine and understand crime patterns as early as possible. In this paper, I analyze a real-world crime and accident dataset from Denver County, USA, covering January 2014 to May 2019 and containing 478,578 incidents. The project aims to predict and highlight trends in occurrence that will, in turn, help law enforcement agencies and the government derive preventive measures from the prediction rates. First, I apply several statistical analyses supported by data visualization approaches. Then, I implement various classification algorithms such as Random Forest, Decision Tree, AdaBoost Classifier, Extra Tree Classifier, Linear Discriminant Analysis, K-Neighbors Classifier, and four ensemble models to classify 15 different classes of crime. The outcomes are captured using two popular test methods: train-test split and k-fold cross-validation. Moreover, to evaluate performance thoroughly, I also use precision, recall, F1-score, Mean Squared Error (MSE), the ROC curve, and the paired t-test. Except for the AdaBoost classifier, most of the algorithms exhibit satisfactory accuracy. Random Forest, Decision Tree, and Ensemble Models 1, 3, and 4 each achieve more than 90% accuracy. Among all the approaches, Ensemble Model 4 gives superior results on every evaluation basis. This study could help raise public awareness of where incidents occur and assist security agencies in predicting future outbreaks of violence in a specific area within a particular time.
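The evaluation protocol named in the abstract (k-fold cross-validation, precision, recall, F1-score, and a paired t-test between models) can be illustrated with a minimal standard-library sketch. This is not the author's implementation, which is not shown in this record and presumably relied on a machine learning library. The function names `k_fold_indices`, `macro_precision_recall_f1`, and `paired_t_statistic` are hypothetical, and macro averaging is an assumption, since the abstract does not say how the per-class scores were averaged.

```python
from collections import Counter
from math import sqrt
from statistics import mean, stdev


def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs for k contiguous, unshuffled folds."""
    # Spread the remainder so that fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size


def macro_precision_recall_f1(y_true, y_pred):
    """Return macro-averaged (precision, recall, F1) over all observed classes."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class p, but it was wrong
            fn[t] += 1  # true class t was missed
    precisions, recalls, f1s = [], [], []
    for c in set(y_true) | set(y_pred):
        prec = tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
        rec = tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)
    return mean(precisions), mean(recalls), mean(f1s)


def paired_t_statistic(scores_a, scores_b):
    """t statistic for paired per-fold scores of two models (df = n - 1)."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    sd = stdev(diffs)
    if sd == 0:
        raise ValueError("all paired differences are identical; t is undefined")
    return mean(diffs) / (sd / sqrt(len(diffs)))
```

In use, each fold from `k_fold_indices` would select the training and test rows for one model fit, the per-fold scores of two models would be collected into two lists, and those lists would be passed to `paired_t_statistic` to check whether the difference between the models is consistent across folds.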