Comparisons of machine learning techniques for detecting fraudulent criminal identities

Kazemian, Hassan and Subeksha, Shrestha (2023) Comparisons of machine learning techniques for detecting fraudulent criminal identities. Expert Systems with Applications, 229 A (120591). pp. 1-13. ISSN 0957-4174

ML detecting fraudulent_15th Mar-final final.pdf - Accepted Version
Restricted to Repository staff only until 1 November 2025.
Available under License Creative Commons Attribution Non-commercial No Derivatives 4.0.

Official URL: https://doi.org/10.1016/j.eswa.2023.120591

Abstract / Description

This paper applies various machine learning techniques to an anonymized policing dataset used in the EU SPIRIT Horizon 2020 project, in order to identify fraudulent identities and to help Law Enforcement Agencies (LEAs) find potential criminals and resolve identities in their investigations. A lack of qualitative data and of an appropriate methodology for research on fraudulent criminal identities is a common reason for the limited research in this area. Additionally, the data is very sensitive to work with, and even a minor inaccuracy in prediction can have a massive impact on society, as genuine people could be questioned whereas criminals could be set free. This paper addresses both issues by using 39 million records from the policing dataset and by working towards higher accuracy while building the model. Various machine learning approaches are trained on the dataset to make predictions, and the research focuses on predicting the 5 suspected fraudulent identities among the 39 million records in the policing dataset. One of the applied machine learning techniques is TensorFlow with a Keras model, which has seldom been applied by researchers to the detection of criminal data. To compare the results and test the accuracy of the TensorFlow model, other machine learning techniques such as Support Vector Machine, Naïve Bayes and K-nearest Neighbours are also applied, giving a comparative study of the outcomes obtained from each model. The goal of this research is to find fraudulent IDs amongst all the anonymized IDs in the criminal dataset using TensorFlow and three other machine learning models, and to select the best-performing model among them. Since the model compares two names, string-matching techniques such as Levenshtein edit distance, Hamming distance, Jaro-Winkler and Soundex were first applied to select an effective approach before building the model and analysing the results.
The TensorFlow model demonstrated the highest accuracy with the least execution time among the models compared, and was the only model to successfully predict all 5 suspects from the policing dataset.
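
The abstract names Levenshtein edit distance and Hamming distance among the string-matching techniques considered for comparing two names. As a rough illustration only, the two measures can be sketched in plain Python as below. The function names, the lower-casing step and the normalised `name_similarity` score are assumptions made for this sketch and are not taken from the paper, which does not describe its implementation here.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions or
    substitutions needed to turn string a into string b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(x != y for x, y in zip(a, b))


def name_similarity(a: str, b: str) -> float:
    """Hypothetical helper: edit distance rescaled to a score in [0, 1],
    where 1.0 means the two names are identical after normalisation."""
    a, b = a.strip().lower(), b.strip().lower()
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


if __name__ == "__main__":
    print(levenshtein("kitten", "sitting"))
    print(hamming("karolin", "kathrin"))
    print(name_similarity("Jon Smith", "John Smith"))
```

Hamming distance is only defined for strings of equal length, which is one reason an edit-distance measure may suit names of varying length better. Which measure the authors ultimately selected is not stated in this record.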

Item Type: Article
Uncontrolled Keywords: Identity resolution; Policing dataset; TensorFlow; Support vector machine; K-nearest neighbour; Naive Bayes
Subjects: 000 Computer science, information & general works
000 Computer science, information & general works > 020 Library & information sciences
600 Technology
Department: School of Computing and Digital Media
Depositing User: Hassan Kazemian
Date Deposited: 23 Nov 2023 12:37
Last Modified: 23 Nov 2023 12:37
URI: https://repository.londonmet.ac.uk/id/eprint/8911
