UNIVERSIDAD DE INVESTIGACIÓN DE TECNOLOGÍA
EXPERIMENTAL YACHAY
School of Mathematical and Computational Sciences
TITLE: Credit Card Fraud Detection using
Artificial Intelligence
Curricular integration project submitted as a requirement for
obtaining the degree of
Information Technology Engineer
Author:
Zhinin Vera Luis Fernando
Tutor:
Ph.D Chang Tortolero Oscar Guillermo
GENERAL SECRETARIAT (Academic Vice-Rectorate/Chancellery)
Urcuquí, March 4, 2020
SCHOOL OF MATHEMATICAL AND COMPUTATIONAL SCIENCES, INFORMATION TECHNOLOGY PROGRAM
DEFENSE RECORD No. UITEY-ITE-2020-00006-AD
In the city of San Miguel de Urcuquí, Province of Imbabura, on March 4, 2020, at 10:00, in Room CHA-01 of the Universidad de Investigación de Tecnología Experimental Yachay, and before the Examining Tribunal composed of the following faculty members:
Defense Tribunal President: Dr. IZA PAREDES, CRISTHIAN RENE, Ph.D.
Non-Tutor Member: Dr. ACOSTA ORELLANA, ANTONIO RAMON, Ph.D.
Tutor: Dr. CHANG TORTOLERO, OSCAR GUILLERMO, Ph.D.
The student ZHININ VERA, LUIS FERNANDO, holder of identity card No. 0503745994, of the SCHOOL OF MATHEMATICAL AND COMPUTATIONAL SCIENCES, Information Technology program, approved by the Higher Education Council (CES) through Resolution RPC-S0-43-No.498-2014, appears in order to defend his degree project entitled: CREDIT CARD FRAUD DETECTION USING ARTIFICIAL INTELLIGENCE, prior to obtaining the degree of INFORMATION TECHNOLOGY ENGINEER.
The aforementioned degree project was duly approved by the following faculty member(s):
Tutor: Dr. CHANG TORTOLERO, OSCAR GUILLERMO, Ph.D.
It also received the observations of the other members of the Examining Tribunal, which have been incorporated by the student.
The legal and regulatory requirements having been met, the degree project was defended by the student and examined by the members of the Examining Tribunal. Having heard the defense, which comprised the student's presentation of the content of the work and the questions posed by the members of the Tribunal, the defense of the degree project is graded as follows:
Role | Examiner | Grade
Defense Tribunal President | Dr. IZA PAREDES, CRISTHIAN RENE, Ph.D. | 10.0
Defense Tribunal Member | Dr. ACOSTA ORELLANA, ANTONIO RAMON, Ph.D. | 10.0
Tutor | Dr. CHANG TORTOLERO, OSCAR GUILLERMO, Ph.D. | 10.0
This gives an average of: 10 (ten point zero) out of 10 (ten), equivalent to: APPROVED
In witness of the proceedings, the members of the Examining Tribunal, the student, and the ad hoc secretary sign.
ZHININ VERA, LUIS FERNANDO
Student
Dr. IZA PAREDES, CRISTHIAN RENE, Ph.D.
Defense Tribunal President
Dr. CHANG TORTOLERO, OSCAR GUILLERMO, Ph.D.
Tutor
Dr. ACOSTA ORELLANA, ANTONIO RAMON, Ph.D.
Tribunal Member
EDINA BRITO, DAY Y MARGARITA
Ad hoc Secretary
Hacienda San José s/n y Proyecto Yachay, Urcuquí | Tel: +593 6 2 999 500 | [email protected] ec
AUTHORSHIP
I, LUIS FERNANDO ZHININ VERA, holder of identity card 0503745994, declare that the ideas, judgments, assessments, interpretations, bibliographic consultations, definitions, and conceptualizations presented in this work, as well as the procedures and tools used in the research, are the sole responsibility of the author of this curricular integration project. Likewise, I abide by the internal regulations of the Universidad de Investigación de Tecnología Experimental Yachay.
Urcuquí, March 2020.
___________________________ Luis Fernando Zhinin Vera
PUBLICATION AUTHORIZATION
I, LUIS FERNANDO ZHININ VERA, holder of identity card 0503745994, assign to the Universidad de Investigación de Tecnología Experimental Yachay the publication rights to this work, without any financial compensation being due for this concept. I further declare that the text of this degree project may not be assigned to any publishing company for publication or other purposes without the prior written authorization of the University.
Likewise, I authorize the University to digitize and publish this curricular integration project in its virtual repository, in accordance with the provisions of Art. 144 of the Organic Law of Higher Education (Ley Orgánica de Educación Superior).
Urcuquí, March 2020.
___________________________ Luis Fernando Zhinin Vera
Dedication
To my mom Tita: my motivation and the reason for everything.
Acknowledgments
To Oscar Chang, a great person. He was my teacher, advisor, and mentor at the beginning of my research path, since each of his classes and lessons helped me strengthen my passion for science. To my other teachers, who made me believe that I am capable of achieving much more than I expect.
To my family, especially my uncles and aunt: Abigail, Gonzalo, Pedro, and Cornelio. To my father and my mom Tita, who sometimes, like me, did not understand what I study, yet always believed in me, and for that I am very grateful.
Finally, to my housemates, my friends, and the special people who were a fundamental part of my time at the university and the reason for so many happy moments.
Abstract
Every year, billions of dollars are lost to credit card fraud, causing huge losses for users and the financial industry. This kind of illicit activity is perhaps the most common in the financial world and the one that causes the most concern. In recent years, great attention has been paid to finding techniques that avoid this significant loss of money. In this degree project, we address credit card fraud using an imbalanced dataset that contains transactions made by credit card users. Our Q-Credit Card Fraud Detector system classifies transactions into two classes, genuine and fraudulent, and is built with artificial intelligence techniques comprising Deep Learning, Autoencoders, and Neural Agents, elements that acquire their predictive abilities through a Q-learning algorithm. Our computer simulation experiments show that the assembled model produces quick responses with a remarkable accuracy value (98.1) and high performance in fraud classification, which is necessary for the model to be reliable and relevant to future research.
Key Words:
Contents
List of Figures
List of Tables
1 Introduction
1.1 Objectives
1.1.1 General Objective
1.1.2 Specific Objectives
1.2 Chapter Description
2 Problem Statement
3 Related Work
3.1 Credit Card Fraud Detection
3.2 Imbalanced Classification
3.3 Reinforcement Learning in Classification
4 Technical Background
4.1 Artificial Neural Networks
4.1.1 ANN Learning Paradigms
4.1.2 Data Splitting Methods and Common Problems
4.2 Autoencoder
4.3 Backpropagation Algorithm
4.4 Gradient Descent
4.5 Confusion Matrix
4.6 Principal Component Analysis
4.7 Agents
4.8 Markov Decision Process
4.9 Q-Learning
4.10 Deep Q-network
5 Data Description
5.1 Credit Card Fraud Dataset
5.1.1 Time Feature
5.1.2 Amount Feature
5.1.3 Class Feature
5.1.4 V-features
6 Proposed Solution
6.1 Network Architectures
6.1.1 Autoencoder
6.1.2 Mediator Network
6.1.3 Agent
6.2 Parameter Setting
6.2.1 Autoencoder
6.2.2 Mediator Network
6.2.3 Agent
7 Implementation
7.1 Technology
7.2 Code Structure
8 Results
8.1 Preliminary Results
8.1.1 Autoencoder
8.1.2 Mediator Network
8.1.3 Agent
8.2 Final Results
8.2.1 Comparison with other algorithms
9 Conclusions
9.1 Future Work
List of Figures
2.1 Credit Card
2.2 Evolution of the total value of card fraud using cards issued within SEPA. Left-hand scale: total value (EUR millions); right-hand scale: value of fraud as share of value of transaction (%)
4.1 Natural Neurons
4.2 Artificial Neuron
4.3 Transfer Functions
4.4 Supervised Learning
4.5 Unsupervised Learning
4.6 Reinforcement Learning
4.7 A visualization of the data splits
4.8 Simplified structure of an Autoencoder
4.9 Complex structure of an Autoencoder
4.10 Confusion Matrix
4.11 Example of a Principal Component Analysis
4.12 Agents interact with environment
4.13 Q-learning vs Deep Q-Networks
5.1 Distribution of Time Feature
5.2 Distribution of Monetary Value Feature
5.3 Money per Transaction
5.4 Count of Fraudulent and Non-Fraudulent Transactions
5.5 Frequency of each Feature of the dataset
5.6 Correlation Matrix of the Dataset
5.7 V4-V22
5.8 V5-V26
5.9 V3-V4
5.10 V14-V17
6.1 General Architecture of Q-Credit Card Fraud Detector
6.2 Autoencoder Architecture
6.3 Mediator Network Architecture
6.4 Agent Architecture
8.1 Interface of Autoencoder with 15 neurons in the hidden layer
8.2 Interface of Mediator Network with 11 neurons in the hidden layer
8.3 Interface of Agent with 15 neurons in the hidden layer
8.4 Error of three main components
8.5 Accuracy of three main components
List of Tables
5.1 Credit Card Fraud Dataset
8.1 Accuracy obtained by training Autoencoder
8.2 Accuracy obtained by training and testing Mediator Network
8.3 Accuracy obtained by training and testing Agent
8.4 General specifications of Q-Credit Card Fraud Detector
8.5 Results of Q-Credit Card Fraud Detector
8.6 Comparison with other existing algorithms. *Balanced dataset
Chapter 1
Introduction
Currently, fraud is the number one enemy in the business world. It affects industries and organizations and accounts for the large sums invested in fraud prediction research [1]. The constant growth of this problem has strongly promoted the development of new technologies to counteract fraudsters. The latest advances in credit card fraud detection include leading technologies such as Artificial Neural Networks (ANN), Deep Learning and Intelligent Agents [2] [3] [4]. In particular, the implementation of agents has become important since it produces effective, quick-acting monitoring of credit card transactions, reducing the risk of fraud or other financial traps that could cause losses. A quick response is essential because fraudsters are constantly creating new, elaborate deception mechanisms.
This project aims to develop a fraud detection system using artificial intelligence techniques based on supervised, unsupervised and reinforcement learning, so that the system is ultimately able to distinguish clearly between legal and fraudulent transactions. Designing a credit card fraud detector is a tough challenge, since genuine and fraudulent behaviors vary widely and fraudsters continuously innovate their methods to evade existing prevention measures. Another factor that complicates the development of these systems is the restricted access to data, which is necessary to protect cardholders' privacy. For this reason, our system uses a publicly available credit card dataset from Kaggle, containing 284,807 transactions made over two days by European cardholders in September 2013. This dataset labels fraudulent transactions as class one and genuine ones as class zero. The dataset is highly imbalanced: about 0.172% of the transactions are fraudulent and the rest are genuine.
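To make the degree of imbalance concrete, the short calculation below converts the figures quoted above into approximate class counts and shows why plain accuracy is a weak yardstick on such data. It is only an illustrative sketch: the counts are rounded from the stated percentage rather than read from the dataset, and the function name is our own.

```python
# Illustrative arithmetic only: counts are derived from the figures quoted
# in the text (284,807 transactions, about 0.172% fraudulent), not read
# from the dataset itself.

TOTAL_TRANSACTIONS = 284_807
FRAUD_SHARE = 0.172 / 100  # 0.172% expressed as a fraction


def approximate_class_counts(total, fraud_share):
    """Return the (fraud, genuine) counts implied by a fraud share."""
    fraud = round(total * fraud_share)
    return fraud, total - fraud


fraud_count, genuine_count = approximate_class_counts(TOTAL_TRANSACTIONS, FRAUD_SHARE)

# A trivial classifier that labels every transaction as genuine (class 0)
# is correct on all genuine transactions and wrong on every fraud.
baseline_accuracy = genuine_count / TOTAL_TRANSACTIONS

print(f"approx. fraudulent: {fraud_count}")
print(f"approx. genuine:    {genuine_count}")
print(f"all-genuine baseline accuracy: {baseline_accuracy:.3%}")
```

Under these rounded figures, a detector that never flags fraud would still be right on more than 99.8% of transactions, which is why performance on the fraud class has to be examined separately.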
In terms of a robust functional system, the main goal is to detect the highest possible number of fraudulent transactions using a finite dataset, in our case one that has been transformed with Principal Component Analysis (PCA) to anonymize the users and reduce the correlation among features. Since frauds occur far less frequently than regular transactions, such datasets are always imbalanced. This document develops a fraud detection methodology that addresses the problem of imbalanced classification by combining the processing capacities of neural agents and Q-learning, establishing a promising way to satisfy the requirements of quick action and high precision. Previous work reports that a Reinforcement Learning method slightly outperforms neural networks when a similar representation is used [5].
School of Mathematical and Computational Sciences YACHAY TECH
1.1 Objectives
1.1.1 General Objective
Develop a fraud detection system, using artificial intelligence techniques, that can identify the fraudulent transactions in a credit card transaction dataset.
1.1.2 Specific Objectives
• Find an efficient deep neural architecture, judged by accuracy, for the fraud detection system.
• Find an efficient agent architecture, judged by accuracy, for the fraud detection system.
• Design and implement a system for detecting fraudulent transactions in the available database, where the number of fraud cases is very small compared with the number of legal cases.
1.2 Chapter Description
This thesis is structured in the following chapters:
• Chapter 2 - Problem Statement presents an analysis of the importance of finding solutions to credit card fraud.
• Chapter 3 - Related Work presents a review of recent work related to the detection of credit card fraud.
• Chapter 4 - Technical Background presents the technical background of the artificial intelligence techniques used in this work.
• Chapter 5 - Data Description describes and analyzes the database.
• Chapter 6 - Proposed Solution presents an alternative solution for avoiding fraud: an intelligent system that learns to correctly detect legal transactions.
• Chapter 7 - Implementation details the process followed and the decisions made when the system was developed.
• Chapter 8 - Results presents the analysis performed and the preliminary and final results.
Chapter 2
Problem Statement
Figure 2.1: Credit Card
A credit card is a plastic card with a magnetic stripe and a chip, issued by a bank or financial institution, which is used to purchase products or services (see Figure 2.1). The financial institution imposes the condition that the borrowed money must be paid back with a previously defined amount of interest.
The modern credit card was the successor of a variety of merchant credit schemes. It was first used in the 1920s, in the United States, specifically to sell fuel to a growing number of automobile owners. In 1938 several companies started accepting each other's cards. Western Union had begun issuing charge cards to its frequent customers by 1921. Some charge cards were printed on paper card stock, but these were easily counterfeited [6]. The concept of customers paying different merchants with the same card was developed in 1950 by Ralph Schneider and Frank McNamara, founders of Diners Club, to eliminate the need for multiple cards. Although credit cards reached very high adoption levels in the US, Canada and the UK in the mid-twentieth century, many cultures were more cash-oriented or developed alternative forms of cashless payment, such as Carte Bleue or the Eurocard. The design of the credit card itself has become a major selling point in recent years. The value of the card to the issuer is often related to the customer's use of the card or to the customer's financial worth [6].
Credit cards are a fundamental part of the economic and commercial growth of emerging and developed countries. For example, these cards allow money to be transferred online, which accelerates the expansion of electronic commerce [7]. However, because the credit card is such a beneficial and simple payment method, it also makes it easy for fraudsters to devise methods of fraud and theft [8].
Advances in computing and science have given rise to tools that can detect and predict fraud before it is committed. Companies such as FICO, with its Falcon Fraud Manager platform, offer services to manage and prevent fraud. This platform uses data analysis and artificial intelligence [9].
Fraud can often be avoided through good management of corporate or personal information. Fraudsters currently use many techniques, for example the classic technique called phishing, in which a malicious program, previously delivered through a malicious email, captures the victim's information [8].
Fraud detection is a complicated problem to solve because fraud modalities and techniques evolve rapidly every day. Data mining and machine learning techniques employ efficient probabilistic models, such as generalized regression models, artificial neural networks, decision trees, and Bayesian networks, to determine and predict fraud [10]. These techniques use an autonomous learning system to recognize patterns and trends based on historical facts; the data from transactions made by customers are used to determine the patterns. These patterns make it possible to quickly identify circumstances outside a client's daily behavior that may be indications of fraud [10].
The problem is that financial losses due to fraud affect not only merchants and banks (e.g. through reimbursements), but also individual clients. If the bank loses money, customers eventually pay as well through higher interest rates, higher membership fees, etc. Fraud may also affect the reputation and image of a merchant, causing non-financial losses that, though difficult to quantify in the short term, may become visible in the long run. For example, if a cardholder is the victim of fraud with a certain company, he may no longer trust its business and choose a competitor [11].
Two kinds of action are taken against fraud. The first is fraud prevention, which attempts to block illegal transactions in real time. Fraud detection, on the other hand, identifies fraudulent transactions once they have taken place. The first strategies to prevent fraud are Address Verification Systems (AVS), Card Verification Method (CVM) and Personal Identification Number (PIN) [12]. These rely on personal data that only the cardholder should know.
There are many ways in which fraud can occur, for example:
• Stolen card fraud is the most common type of fraud. In this type of fraud, the fraudster tries to spend as much as possible, as quickly as possible. This fraud is detected because the spending falls outside the cardholder's usual patterns [13].
• Cardholder-not-present fraud occurs when the fraudster only needs the credit card information. This fraud demands prompt detection since the legitimate card owner is not aware that his data have been stolen [14].
• Application fraud corresponds to applying for a credit card with false personal information. This type of fraud is easier to detect since the information given can be verified during the application [13].
Although many forms of fraud are known, it is difficult to establish a general solution for all cases of fraud. This is because fraudsters adapt and change their strategies as technology advances. For this reason, fraud detection methods must be constantly updated. In addition, it is difficult to exchange ideas between detection systems since this could give fraudsters new strategies. This is one of the reasons why databases with fraudulent transactions are not publicly available to researchers.
Credit card fraud has become a big problem today [15]. Banks do not easily reveal how much money they lose due to fraud, so it is difficult to give an exact figure. In addition, the existing data cover only the frauds that were detected, which makes any estimate of the total value of losses even less accurate.
The European Central Bank (ECB) reports [16] that the total value of fraudulent transactions conducted using cards issued within the Single Euro Payments Area (SEPA) and acquired worldwide amounted to €1.8 billion in 2016, a decrease of 0.4% compared with 2015. As a share of the total value of transactions, fraud dropped by 0.001 percentage points to 0.041% in 2016, down from 0.042% in 2015. Compared with the level of fraud in 2012, however, fraud was 0.003 percentage points higher in 2016. Although there was an upward trend in card fraud between 2012 and 2015, the trend appears to be changing, given that fraud went down in 2016. Fraud involving cards issued inside SEPA increased for card-not-present (CNP) transactions and decreased across the other transaction channels. In 2016 CNP fraud accounted for 73% of total fraud losses on cards issued inside SEPA, compared with 71% in 2015.
The total value of card fraud using cards issued in SEPA amounted to €1.8 billion in 2016, while the total value of card transactions using cards issued in SEPA amounted to €4.38 trillion in 2016 (Fig. 2.2, adapted from [16]). In terms of volume, credit card fraud increased by 27.2% compared with 2015, and by 92% compared with 2012.
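As a quick consistency check on these figures, the fraud share can be recomputed from the two totals. The snippet below is a minimal sketch that uses only the values quoted from [16]; the constant and function names are ours.

```python
# Figures as quoted from the ECB report [16]; units are euros.
FRAUD_VALUE_2016 = 1.8e9          # EUR 1.8 billion
TRANSACTION_VALUE_2016 = 4.38e12  # EUR 4.38 trillion


def fraud_share_percent(fraud_value, transaction_value):
    """Fraud value as a percentage of total transaction value."""
    return 100.0 * fraud_value / transaction_value


share_2016 = fraud_share_percent(FRAUD_VALUE_2016, TRANSACTION_VALUE_2016)
print(f"fraud share in 2016: {share_2016:.3f}%")
```

The result rounds to 0.041%, which agrees with the share reported for 2016.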
Figure 2.2: Evolution of the total value of card fraud using cards issued within SEPA. Left-hand scale: total value (EUR millions); right-hand scale: value of fraud as share of value of transaction (%)
Card fraud is a booming business. Nilson reports that U.S. card fraud (credit, debit, etc.) amounted to $30 billion in 2019 and is expected to increase to $32 billion by 2020. For comparison, in 2019 PayPal's and Mastercard's revenues were only $15 billion each, which illustrates why credit card fraud is a major problem that needs to be researched [15] [17].
Few banking entities publish studies on cases of fraud; however, this does not mean that these illegal acts do not exist. Fraud is a big problem that demands constant research and investment. New technologies are part of the strategies that aim to significantly reduce the money lost to credit card fraud.
Chapter 3
Related Work
This chapter presents an overview of the work related to the topic. To facilitate the grouping of works, the chapter is divided into three closely related sections: credit card fraud detection, imbalanced classification, and reinforcement learning in classification.
3.1 Credit Card Fraud Detection
Credit card fraud detection is a growing area of research, supported by Artificial Intelligence techniques. This field is of great importance for banks, which are therefore the institutions most interested in acquiring methods that help them avoid losing money.
One approach uses a Long Short-Term Memory (LSTM) recurrent neural network. The authors implement an ANN for detecting credit card fraud that takes into account sequences of past transactions in order to determine whether a new transaction is legitimate or fraudulent [18].
Checking a user's usage patterns in previous transactions to detect credit card fraud is suggested in [19]. The authors compare the usage pattern with the current transaction to classify it as either fraudulent or legitimate. Among the techniques implemented are KNN, Naïve Bayes, CFLANN, M-Perceptron and DTrees.
In [3] it is stated that credit card frauds have no constant patterns, so the use of unsupervised learning is necessary. The authors take into account that a fraud is typically committed once through an online medium, after which the technique changes. To address this issue, they implement a deep Autoencoder model and a restricted Boltzmann machine, which can reconstruct normal transactions in order to find anomalies in the patterns.
An intelligent agent can detect a high rate of fraudulent transactions with a low false alarm rate, providing a convenient way to detect fraud [2]. The implementation of the intelligent agent focuses on detecting fraud while the transaction is in progress, taking into account the customer's pattern; any deviation from the regular pattern is considered a fraudulent transaction. In turn, [20] presents multi-agent techniques for fraud analysis. This approach uses a mathematical model for credit card fraud detection and compares different intelligent agents. The authors tested the agents against cases of credit card fraud over time, at different rates at which customers received fraud alerts. Finally, the work models a security system intended to promote trust in communication channels by implementing a hybrid technology that combines adaptive data mining and intelligent agents to authenticate credit card transactions. The final model shows that the performance of credit card fraud detection using multi-agents is in agreement with other detection software, but performs 94% better.
Niu et al. [4] perform a comparison study of credit card fraud detection using various supervised and unsupervised approaches. Specifically, they evaluate 6 supervised classification models, i.e., Logistic Regression (LR), K-Nearest Neighbors (KNN), Support Vector Machines (SVM), Decision Tree (DT), Random Forest (RF), and Extreme Gradient Boosting (XGB), as well as 4 unsupervised anomaly detection models, i.e., One-Class SVM (OCSVM), Auto-Encoder (AE), Restricted Boltzmann Machine (RBM), and Generative Adversarial Networks (GAN). The experimental results show that supervised models perform slightly better than unsupervised models. The work also concludes that unsupervised approaches are still promising for credit card fraud detection because of the insufficient annotation and the data imbalance found in real-world applications. The authors carry out this supervised versus unsupervised comparison on a Kaggle credit card transaction dataset.
3.2 Imbalanced Classification
Wang et al. [21] establish an alternative loss function for deep neural networks that captures the classification errors of both the minority class and the majority class. Dong et al. [22] extract hard samples of the minority classes and improve the bootstrapping sampling of the training data in each mini-batch through batch-wise optimization with a Class Rectification Loss function. In [23], a cost-sensitive (CoSen) deep neural network is proposed that can automatically learn robust feature representations for both the majority and minority classes. The approach is applied to both binary and multi-class problems, and it shows better performance than popular data sampling techniques and CoSen classifiers.
According to [24], research on imbalanced data classification concentrates mainly on two levels: the data level and the algorithmic level. The objective of data-level methods is to balance the class distribution by manipulating the training samples, through over-sampling of the minority class, under-sampling of the majority class, or a combination of the two. The authors note that over-sampling can potentially lead to overfitting, while under-sampling may lose valuable information about the majority class. The objective of algorithmic-level methods, on the other hand, is to raise the importance of the minority class by improving existing algorithms, including cost-sensitive learning, ensemble learning, and decision threshold adjustment. The paper introduces a novel model for imbalanced classification using deep reinforcement learning. The model formulates the classification problem as a sequential decision-making process and uses a deep Q-learning algorithm to find the optimal classification policy for an Imbalanced Classification Markov Decision Process (ICMDP).
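The two data-level strategies mentioned above can be sketched in a few lines. The example below is a minimal illustration using only the Python standard library, on hypothetical toy records; the function names are our own and it is not the procedure of any of the cited works.

```python
import random


def random_oversample(minority, majority, rng):
    """Duplicate minority samples (with replacement) until classes match."""
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return minority + extra, list(majority)


def random_undersample(minority, majority, rng):
    """Keep a random subset of the majority class the size of the minority."""
    return list(minority), rng.sample(majority, len(minority))


rng = random.Random(0)  # fixed seed so the sketch is repeatable
# Toy data: 5 "fraud" records and 95 "genuine" records (hypothetical values).
fraud = [("fraud", i) for i in range(5)]
genuine = [("genuine", i) for i in range(95)]

over_fraud, over_genuine = random_oversample(fraud, genuine, rng)
under_fraud, under_genuine = random_undersample(fraud, genuine, rng)

print(len(over_fraud), len(over_genuine))    # both classes have 95 samples
print(len(under_fraud), len(under_genuine))  # both classes have 5 samples
```

The sketch makes the trade-off visible: over-sampling repeats the same five fraud records many times, which is where the risk of overfitting comes from, while under-sampling discards 90 of the 95 genuine records.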
3.3 Reinforcement Learning in Classification
According to [24], deep reinforcement learning has recently achieved excellent results because it can help classifiers learn advantageous features or select high-quality instances from noisy data. The classification task can be understood as a sequential decision-making process that uses multiple agents interacting with the environment to obtain the optimal classification policy. However, the interaction between agents and environment leads to extremely high time complexity.
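The sequential view of classification can be sketched as an episode in which each sample is a state, the predicted label is the action, and the reward depends on whether the prediction is right and on which class the sample belongs to. The code below is a hypothetical illustration of that framing, not the model of [24]: the reward magnitudes, the `majority_weight` parameter, and the two fixed example policies are all assumptions made for the sketch, and no learning takes place.

```python
def reward(true_label, predicted_label, majority_weight=0.1):
    """Hypothetical reward: hits and errors on the minority class (label 1)
    count fully, those on the majority class (label 0) are scaled down."""
    scale = 1.0 if true_label == 1 else majority_weight
    return scale if predicted_label == true_label else -scale


def run_episode(samples, policy):
    """Present samples one at a time and accumulate the policy's reward."""
    total = 0.0
    for features, true_label in samples:
        action = policy(features)  # the action is the predicted class
        total += reward(true_label, action)
    return total


def always_genuine(features):
    """A fixed policy that never flags fraud (assumed for illustration)."""
    return 0


def threshold_policy(features):
    """Flags a transaction when its single toy feature exceeds 0.5."""
    return 1 if features[0] > 0.5 else 0


# Toy stream: (features, label) pairs with one fraud among four genuine.
stream = [([0.1], 0), ([0.2], 0), ([0.9], 1), ([0.3], 0), ([0.4], 0)]

print(run_episode(stream, always_genuine))    # 4 * 0.1 - 1.0
print(run_episode(stream, threshold_policy))  # 4 * 0.1 + 1.0
```

In a reinforcement learning setting the policy would be learned from these rewards rather than fixed in advance; weighting the minority class more heavily is one way such a scheme could discourage the "always genuine" behavior.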
Abdi and Hashemi [25] propose an ensemble pruning approach based on a Reinforcement Learning framework. They use a Markov Decision Process, treat the ensemble pruning problem as a one-player game, and select the best classifiers. However, this method is inefficient at selecting classifiers when there are many sub-classifiers.
Feng et al. [26] establish a deep reinforcement learning model, divided into an instance selector and a relation classifier, with the aim of learning relation classification from noisy text data. In the instance selector, an agent selects high-quality sentences from noisy data, while the relation classifier learns from the selected clean data and gives a reward to the instance selector. As a result, the proposed model obtains a better classifier and a high-quality dataset.
Another important approach is proposed in [27], which establishes a deep reinforcement learning framework for time series data classification that uses a specific reward function and a clearly formulated Markov Process.
Chapter 4
Technical Background
This chapter presents each tool and concept from the field of artificial intelligence that is used in the development of this work. The concepts are explained as clearly as possible, starting with Artificial Neural Networks, the Autoencoder and the Backpropagation Algorithm, and ending with Agents.
4.1 Artificial Neural Networks
Artificial Neural Networks (ANN) were designed as a mathematical generalization of components of the human brain, specifically networks of neurons that receive data, learn features, and take actions according to the objective of the ANN [28]. During the training process, an artificial neural network finds the combination of parameters that best fits a given problem. The study of artificial neural networks is motivated by their similarity to successfully working biological systems, which consist of very simple but numerous nerve cells that work massively in parallel and have the capability to learn. After successful training, an ANN can find reasonable solutions for similar problems of the same class that it was not explicitly trained on, because it has the ability to generalize and associate data. This in turn results in a high degree of fault tolerance against noisy input data [28].
The underlying model is that of natural neurons. Figure 4.1 shows a simple drawing of a biological neuron. Natural neurons receive signals through synapses located on the dendrites or membrane of the neuron. When the signals received are strong enough, the neuron is activated and emits a signal through the axon. This signal might be sent to another synapse and might activate other neurons [29].
Natural neurons are enormously complex compared with artificial neurons. However, we can extract the features that are relevant when developing our neural networks: inputs are multiplied by weights, and the result is then passed through a mathematical activation function. Figure 4.2 shows an example of an artificial neuron.
Artificial Neuron
An artificial neuron is a computational model inspired by natural neurons [29]. Its design and functionality are derived from observation of the biological neuron, which is the basic building block of biological neural networks, which include the brain, spinal cord and
School of Mathematical and Computational Sciences YACHAY TECH
Figure 4.1: Natural Neurons
Figure 4.2: Artificial Neuron
peripheral ganglia [30]. An artificial neuron can be studied using a simple mathematical model, such as the following:
$$y(k) = F\left(\sum_{i=0}^{m} w_i(k) \cdot x_i(k) + b\right) \tag{4.1}$$
where:
• $x_i(k)$ is the input value in discrete time $k$,
• $w_i(k)$ is the weight value in discrete time $k$,
• $b$ is the bias,
• $F$ is a transfer function,
• $y(k)$ is the output value in discrete time $k$.
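As a minimal sketch of Equation 4.1, the snippet below evaluates a single artificial neuron for one time step. The input, weight and bias values are hypothetical and chosen only for illustration, and the function name `neuron_output` is not taken from this work.

```python
import math


def neuron_output(inputs, weights, bias, transfer):
    """Compute y = F(sum_i w_i * x_i + b) for one time step (Eq. 4.1)."""
    weighted_sum = sum(w * x for w, x in zip(weights, inputs)) + bias
    return transfer(weighted_sum)


def sigmoid(z):
    """Sigmoid transfer function, used here as one possible choice of F."""
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical values, chosen so that the weighted sum is exactly zero.
x = [1.0, 0.5, -1.0]
w = [0.2, -0.4, 0.1]
b = 0.1

y_linear = neuron_output(x, w, b, lambda z: z)   # F is the identity
y_sigmoid = neuron_output(x, w, b, sigmoid)      # F is the sigmoid
print(y_linear, y_sigmoid)
```

Passing the transfer function as an argument keeps the weighted sum separate from the choice of F, which mirrors the structure of the equation.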
Loss Function
The standard method for training a deterministic regression or classification model is to minimize the so-called loss on the training set. The loss is defined as a function of the model parameters θ:
$$L(\theta) = \frac{1}{S} \sum_{s=1}^{S} L\left(y_\theta(X_s), T_s\right) \tag{4.2}$$
Figure 4.3: Transfer Functions.
where $X_s$ is a training sample with corresponding target $T_s$ and $y_\theta(x)$ is the prediction of the model given input $x$. The loss measure $L : \mathbb{R}^D \times \mathbb{R}^D \to \mathbb{R}$ assigns a loss value to each sample based on the difference between the model's prediction and the ground truth [31]. The loss function is thus used to evaluate and minimize the difference between the expected and the actual output.
One of the most used loss functions is Mean Squared Error (MSE):
$$\mathrm{MSE}(x, \hat{x}) = \frac{\sum_{i=1}^{N} \|x_i - \hat{x}_i\|^2}{N} \tag{4.3}$$
where $x$ is the target and $\hat{x}$ is the obtained value.
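The following sketch evaluates Equation 4.3 on two hypothetical two-dimensional samples. It assumes that each sample is a plain list of numbers and that the norm is the Euclidean norm; the function name `mse` is illustrative.

```python
def mse(targets, predictions):
    """Mean squared error over N samples, each a list of numbers (Eq. 4.3)."""
    if len(targets) != len(predictions):
        raise ValueError("targets and predictions must have the same length")
    total = 0.0
    for x, x_hat in zip(targets, predictions):
        # Squared Euclidean norm of the difference for one sample.
        total += sum((a - b) ** 2 for a, b in zip(x, x_hat))
    return total / len(targets)


# Hypothetical targets and model outputs.
targets = [[1.0, 0.0], [0.0, 1.0]]
predictions = [[0.5, 0.0], [0.0, 0.0]]
loss = mse(targets, predictions)   # (0.25 + 1.0) / 2
print(loss)
```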
Transfer Function
In Equation 4.1, one of the most important elements is the transfer function. Transfer functions are also called activation functions. An activation function is applied to the weighted sum of the inputs and biases, and its result is used to decide whether a neuron fires or not [32]. These functions establish the properties of the artificial neural network. There are many types of activation functions, which are chosen according to the problem to be solved. Figure 4.3 (retrieved from [33]) shows the graphs of the most general functions, which are defined as:
• Linear Function Most real models have non-linear input/output characteristics. However, some models, when operated within nominal parameters, have behavior that is close enough to linear. This function can be an acceptable representation of the input/output behavior in these kinds of situations [34]. The linear function does not apply thresholds and the output is identical to the input.
f(x) =x (4.4)
• Step Function This is used to model the classic 'all-or-none' behaviour. It resembles a Ramp function, but changes the function value abruptly when a threshold value $\theta$ is reached [35].
$$f(x) = \begin{cases} 0 & \text{if } x \le \theta \\ 1 & \text{if } x > \theta \end{cases} \tag{4.5}$$
• Ramp Function
The Ramp function combines the Step function with a Linear output function. As long as the activation is smaller than the threshold value $\theta_1$, the neuron shows the output $f(x) = 0$; if the activation exceeds the threshold value $\theta_2$, the output is $f(x) = 1$. The neuron's output for activations in the interval between the two threshold values, $\theta_1 < x < \theta_2$, is determined by a linear interpolation of the activation [35].
$$f(x) = \begin{cases} 0 & \text{if } x \le \theta_1 \\ \dfrac{x - \theta_1}{\theta_2 - \theta_1} & \text{if } \theta_1 \le x \le \theta_2 \\ 1 & \text{if } x > \theta_2 \end{cases} \tag{4.6}$$
• Sigmoid Function This transfer function takes the input and compresses the output into the range 0 to 1. This transfer function is commonly used in multilayer networks that are trained using the Backpropagation Algorithm¹, in part because this function is differentiable in its whole range [34]. Mathematically, this function is defined as:
$$f(x) = \frac{1}{1 + e^{-x}} \tag{4.7}$$
• Hyperbolic Tangent Function In the context of neural networks, this function is related to a bipolar sigmoid, with an output in the range −1 to +1. This function is a good trade-off for neural networks where speed is more important than the exact shape of the transfer function [34].
$$f(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}} \tag{4.8}$$
• Gaussian Function The maximal function value of a Gaussian function is found for zero activation. The function is even: $f(-x) = f(x)$. The function value decreases with increasing absolute value of the activation [35].
$$f(x) = e^{-x^2} \tag{4.9}$$
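The transfer functions listed above can be sketched in a few lines of Python. The default thresholds are hypothetical, and the Gaussian is written in the even form exp(−x²), which is an assumption made here so that the stated property f(−x) = f(x) holds.

```python
import math


def linear(x):
    return x


def step(x, theta=0.0):
    """All-or-none output: 0 up to the threshold, 1 above it."""
    return 1.0 if x > theta else 0.0


def ramp(x, theta1=0.0, theta2=1.0):
    """0 below theta1, 1 above theta2, linear interpolation in between."""
    if x <= theta1:
        return 0.0
    if x > theta2:
        return 1.0
    return (x - theta1) / (theta2 - theta1)


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def hyperbolic_tangent(x):
    return (math.exp(x) - math.exp(-x)) / (math.exp(x) + math.exp(-x))


def gaussian(x):
    # Assumed even form exp(-x^2); maximal at zero activation.
    return math.exp(-x * x)


for name, f in [("linear", linear), ("step", step), ("ramp", ramp),
                ("sigmoid", sigmoid), ("tanh", hyperbolic_tangent),
                ("gaussian", gaussian)]:
    print(name, [round(f(v), 3) for v in (-2.0, 0.0, 0.5, 2.0)])
```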
¹ The term 'Backpropagation Algorithm' is explained further in Section 4.3: Backpropagation Algorithm.
Learning Rate
In neural networks, the learning rate is a parameter that determines how much the weights can change in response to an observed error on the training set. The choice of this learning rate can have a dramatic effect on generalization accuracy as well as the speed of training [36]. This value is a constant of proportionality which sets the size of the weight adjustments. The value of this constant commonly lies in the interval [0, 1]. If the learning rate is too large, the average loss may increase, and training may get stuck in a local minimum or even diverge. If the learning rate is too small, convergence may be slow [37].
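A small numerical experiment can make the effect of the learning rate concrete. The sketch below is only an illustration: it minimizes the hypothetical one-dimensional error E(w) = w², whose gradient is 2w, and the three learning rates (including one deliberately outside the usual interval) are assumptions chosen to show slow convergence, fast convergence and divergence.

```python
def run_gradient_descent(learning_rate, steps=20, w0=1.0):
    """Minimize E(w) = w^2 starting from w0 and return the final weight."""
    w = w0
    for _ in range(steps):
        gradient = 2.0 * w              # dE/dw
        w -= learning_rate * gradient   # each step multiplies w by (1 - 2 * rate)
    return w


w_small = run_gradient_descent(0.01)   # too small: still far from the minimum
w_good = run_gradient_descent(0.4)     # suitable: very close to the minimum at 0
w_large = run_gradient_descent(1.1)    # too large: the iterates grow in magnitude

print(w_small, w_good, w_large)
```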
Epochs
When the training set is finite, training proceeds by sweeps through the training set, each called an epoch, and full training usually requires many epochs (iterations through the training set) [37]. For example, the backpropagation learning algorithm builds a different model in each epoch, i.e., a network with a different set of weights. If a neural network is trained for 1000 epochs, the learning algorithm moves through 1000 different models [38].
4.1.1
ANN Learning Paradigms
The number of ANN types keeps growing, and they are applied in many fields. They therefore need to be classified properly to facilitate their use. There are many ways to classify them, such as by type of transfer function, topology, application, type of algorithm, etc. This section gives a brief classification according to learning paradigm.
Learning can refer to either acquiring or enhancing knowledge [39]. The learning process is a method for updating the architecture as well as the connection weights of an ANN to optimize its efficiency in performing a specific task. The three main learning paradigms are supervised, unsupervised (or self-organized), and reinforcement learning. Each category includes numerous algorithms [40].
Supervised Learning
In supervised learning the training set consists of input patterns as well as their correct results in the form of the precise activation of all output neurons [28]. Then, each output produced by the training set is compared with the correct solution (target) and according to this comparison the synaptic weights of the neural network are adjusted. The main objective of this training is to adjust the weights so that the difference between output and target is minimal. Learning through training in a supervised ANN model is generally solved by the Error Backpropagation Algorithm [39].
Supervised learning is a common technique because it gives neural networks the ability to generalize, that is, to give correct results even for new data without prior knowledge of the target. This kind of learning is normally used for classification, for which there are many options for each type of problem. Choosing a suitable classifier (Multilayer Perceptron, Support Vector Machines, K-nearest Neighbour Algorithm,
Figure 4.4: Supervised Learning
Gaussian Mixture Model, Gaussian, Naive Bayes, Decision Tree, Radial Basis Function Classifiers, ...) for a given problem is, however, still more an art than a science [30].
Unsupervised Learning
Unsupervised learning is the biologically most plausible method, but is not suitable for all problems. Only the input patterns are given; the network tries to identify similar patterns and to classify them into similar categories [28]. Neural networks that are trained using unsupervised methods are called self-organizing because they receive no direction on what the desired or correct output should be. When presented with a series of input patterns, the output processing units self-organize by initially competing to recognize the pattern, and then cooperating to adjust their connection weights [41].
Unsupervised learning is mostly used in applications that fall within the domain of estimation problems such as statistical modelling, compression, filtering, blind source separation and clustering. Clustering is a common form of unsupervised learning in which data are grouped into different clusters according to their similarity. Self-organizing maps are the networks that most commonly use unsupervised learning algorithms [30].
Reinforcement Learning
In reinforcement learning, the training set consists of input patterns; after completion of a sequence, a value is returned to the network indicating whether the result was right or wrong and, possibly, how right or wrong it was [28]. The task of reinforcement learning is to use observed rewards to learn an optimal (or nearly optimal) policy for the environment [42]. Reinforcement learning is learning through interaction with an environment by taking different actions and experiencing many failures and successes while trying to maximize the received rewards. The agent² is not told which action to take [43]. Reinforcement learning is particularly suited to problems that include a long-term versus short-term reward trade-off. It has been applied successfully to various problems, including robot control, telecommunications, and games such as chess and other sequential decision-making tasks [30]. Figures 4.4, 4.5 and 4.6 show a general representation of each learning paradigm (adapted from [40]).
² The term 'Agent' is explained further in Section 4.7: Agents.
Figure 4.5: Unsupervised Learning
Figure 4.6: Reinforcement Learning
4.1.2
Data Splitting Methods and Common Problems
In the training of a neural network, two common problems can arise, which can be prevented by choosing a correct data splitting method. These problems are:
• Underfitting happens when the training data given to the neural network are too few and therefore do not allow the learning to generalize. That is, the model has not learned enough, which results in poor generalization and unreliable results.
• Overfitting is a key problem in the supervised machine learning tasks. It is the phenomenon detected when a learning algorithm fits the training data set so well that noise and the peculiarities of the training data are memorized. This problem leads to deterioration of generalization properties of the model, and results in its untrustworthy performance when applied to novel measurements [44].
There are many splitting methods that can be used, however one of the most common is to divide the data into three subsets [45] (Figure 4.7):
• Training: the data used to ‘teach’ (train) the algorithm to perform its task.
• Validation: the data used to tune the hyperparameters of a learning algorithm.
• Testing: the data used to validate machine learning model behaviour.
Determining how much data goes into each subset requires many tests, which then yield the best model for each problem. There are, however, many suggestions, such as dividing the data manually or using empirical tests [46].
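The three-way split described above can be sketched as follows. The 70/15/15 proportions, the fixed random seed and the function name `split_dataset` are assumptions made for illustration only; they are not the proportions used in this work.

```python
import random


def split_dataset(samples, train_fraction=0.7, validation_fraction=0.15, seed=0):
    """Shuffle the samples and split them into training, validation and test sets."""
    if not 0.0 < train_fraction + validation_fraction < 1.0:
        raise ValueError("fractions must leave room for a non-empty test set")
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)   # reproducible shuffle
    n_train = round(len(shuffled) * train_fraction)
    n_validation = round(len(shuffled) * validation_fraction)
    train = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_validation]
    test = shuffled[n_train + n_validation:]   # whatever remains
    return train, validation, test


train, validation, test = split_dataset(range(100))
print(len(train), len(validation), len(test))
```

Shuffling before splitting avoids subsets that simply reflect the original ordering of the data, which matters when the records are sorted by time.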
Figure 4.7: A visualization of the data splits.
Figure 4.8: Simplified structure of an Autoencoder.
The input $X$ is mapped to an output called $X'$. The internal representation is $Y$, $f$ is the encoder and $g$ is the decoder.
4.2
Autoencoder
An autoencoder (AE) is a neural network that specializes in learning from unlabeled data, replicating the input data at its output, minimizing the reconstruction error between both layers [47].
Another useful form of autoencoding uses the position offset of an object to produce highly compressed data, which is in turn released in a two-dimensional matrix to yield new space-time representations [48]. Through the replicating process, the autoencoder learns important features of unlabeled data; eventually these features become an essential part of the learning of other downstream layers.
Internally, this neural network has a hidden layer $Y$ that describes a code used to represent the input. The network has two important parts: an encoder function $Y = f(X)$ and a decoder that produces a reconstruction $X' = g(Y)$ [47]. This simple architecture is shown in Figure 4.8.
Given a dataset $X$ with $n$ samples and $m$ features, the output of the encoder, $Y$, is the reduced representation of $X$, and the decoder is tuned to reconstruct the original dataset $X$ from the encoder's representation $Y$ by minimizing the difference between $X$ and $X'$ [49]. Figure 4.9 shows a more complex structure.
The first process is defined by:
$$Y = f(X) = s_f(WX + b_X) \tag{4.10}$$
where $s_f$ is a nonlinear activation function. The weight matrix $W$ and the bias vector $b_X \in \mathbb{R}^n$ are the parameters of the encoder. The decoder function $g$ maps $Y$ back to a reconstruction $X'$ using the formula:
$$X' = g(Y) = s_g(W'Y + b_Y) \tag{4.11}$$
where $s_g$ is the decoder's activation function. The matrix $W'$ and the bias vector $b_Y$ are the decoder's parameters.
Figure 4.9: Complex structure of an Autoencoder.
The training phase consists of finding the parameters $\theta = (W, b_X, b_Y)$ that minimize the reconstruction loss on the dataset $X$. The objective function is defined as:
$$\Theta = \min_{\theta} L(X, X') = \min_{\theta} L(X, g(f(X))) \tag{4.12}$$
In the case of linear reconstruction, the reconstruction loss ($L_1$) is generally the squared error:
$$L_1(\theta) = \sum_{i=1}^{n} \|x_i - x'_i\|^2 = \sum_{i=1}^{n} \|x_i - g(f(x_i))\|^2 \tag{4.13}$$
On the other hand, for non-linear reconstruction, the reconstruction loss ($L_2$) is generally the cross-entropy:
$$L_2(\theta) = -\sum_{i=1}^{n} \left[x_i \log(y_i) + (1 - x_i)\log(1 - y_i)\right] \tag{4.14}$$
where $x_i \in X$, $x'_i \in X'$ and $y_i \in Y$.
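The encoder, decoder and the two reconstruction losses can be sketched for a single sample as shown below. This is a hedged illustration, not the architecture used in this work: the 3-2-3 layer sizes and all weight values are hypothetical, both activation functions are assumed to be sigmoids, the decoder matrix is simply set to the transpose of the encoder matrix, and the cross-entropy is evaluated between the input and its reconstruction, which is an assumption of this sketch.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def affine(matrix, vector, bias):
    """Return matrix @ vector + bias using plain lists."""
    return [sum(m * v for m, v in zip(row, vector)) + b
            for row, b in zip(matrix, bias)]


def encode(x, W, b_x):
    """Y = s_f(W X + b_X), with s_f assumed to be the sigmoid."""
    return [sigmoid(z) for z in affine(W, x, b_x)]


def decode(y, W_prime, b_y):
    """X' = s_g(W' Y + b_Y), with s_g assumed to be the sigmoid."""
    return [sigmoid(z) for z in affine(W_prime, y, b_y)]


def squared_error(x, x_rec):
    return sum((a - b) ** 2 for a, b in zip(x, x_rec))


def cross_entropy(x, x_rec):
    return -sum(a * math.log(b) + (1.0 - a) * math.log(1.0 - b)
                for a, b in zip(x, x_rec))


# Hypothetical 3-2-3 autoencoder parameters.
W = [[0.5, -0.3, 0.8], [-0.6, 0.1, 0.4]]
b_x = [0.0, 0.1]
W_prime = [[0.5, -0.6], [-0.3, 0.1], [0.8, 0.4]]   # transpose of W
b_y = [0.0, 0.0, 0.0]

x = [1.0, 0.0, 1.0]
y = encode(x, W, b_x)              # compressed code with 2 values
x_rec = decode(y, W_prime, b_y)    # reconstruction with 3 values
print(y, x_rec, squared_error(x, x_rec), cross_entropy(x, x_rec))
```

Training would then adjust the parameters to reduce one of the two losses over the whole dataset.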
4.3
Backpropagation Algorithm
The term backpropagation (BP) is known to define one of the most popular NN algorithms [50]. After choosing the weights of the network randomly, the backpropagation algorithm is used to compute the necessary corrections. The algorithm can be decomposed in the following four steps:
i Feed-forward computation
ii Back propagation to the output layer
iii Back propagation to the hidden layer
iv Weight updates
The algorithm is stopped when the value of the error function has become sufficiently small [51]. Most training algorithms involve an iterative procedure for minimization of an error function, with adjustments to the weights being made in a sequence of steps. At each such step, we can distinguish between two distinct stages. In the first stage, the derivatives of the error function with respect to the weights must be evaluated. The important contribution of the backpropagation technique is in providing a computationally efficient method for evaluating such derivatives. Because it is at this stage that errors are propagated backwards through the network, we shall use the term backpropagation specifically to describe the evaluation of derivatives. The second stage of weight adjustment using the calculated derivatives can be tackled using a variety of optimization schemes, many of which are substantially more powerful than simple gradient descent [52].
It is intended to minimize the cost function which is given by the mean square error function:
$$E(W, b) = \frac{1}{2}\|y - x\|^2 \tag{4.15}$$
The total cost function, given a set of m samples is as follows:
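$$E(W, b) = \left[\frac{1}{m}\sum_{i=1}^{m} \frac{1}{2}\|y - x\|^2\right] + \frac{\lambda}{2}\sum_{l=1}^{n_l - 1}\sum_{i=1}^{s_l}\sum_{j=1}^{s_{l+1}} \left(W_{ij}^{(l)}\right)^2 \tag{4.16}$$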
In this expression, the first term is the mean square error and the second term is the regularization term, known as weight decay, which tends to decrease the magnitude of the weights and helps prevent overfitting [53].
Gradient descent³ updates the parameters $W$, $b$ as follows:
$$W_{ij}^{(l)} = W_{ij}^{(l)} - \alpha \frac{\partial}{\partial W_{ij}^{(l)}} E(W, b)$$
$$b^{(l)} = b^{(l)} - \alpha \frac{\partial}{\partial b^{(l)}} E(W, b)$$
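The learning rate is denoted by $\alpha$. The backpropagation algorithm is used to efficiently calculate the partial derivatives of the total cost function:
$$\frac{\partial}{\partial W_{ij}^{(l)}} E(W, b) = \left[\frac{1}{m}\sum_{i=1}^{m} \frac{\partial}{\partial W_{ij}^{(l)}} E(W, b, x, y)\right] + \lambda W_{ij}^{(l)}$$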
³ This concept is presented in the following section, but it is mentioned here because it is necessary for this algorithm.
$$\frac{\partial}{\partial b^{(l)}} E(W, b) = \frac{1}{m}\sum_{i=1}^{m} \frac{\partial}{\partial b^{(l)}} E(W, b, x, y)$$
The procedure behind BP is that, given a training set, a forward propagation is first performed to calculate the activations throughout the network, including the output value. Then we proceed to calculate the error term $\delta_i^{(l)}$ for each node $i$ in layer $l$ [53]. The $\delta$ variable is defined by:
$$\delta_j \equiv \frac{\partial E_n}{\partial a_j} \tag{4.17}$$
Backpropagation Algorithm is summarized as follows:
1 Perform a first forward pass to calculate network activations.
2 Calculate the error term produced in the output layer.
$$\delta^{out} = -(y - x)$$
3 Obtain the error term produced in the hidden layers.
$$\delta^{(l)} = \left((W^{(l)})^{T} \delta^{(l+1)}\right) \cdot f'(z^{(l)})$$
4 Calculate the derivatives.
$$\frac{\partial}{\partial W^{(l)}} E(W, b, x, y) = \delta^{(l+1)} (a^{(l)})^{T} \tag{4.18}$$
$$\frac{\partial}{\partial b^{(l)}} E(W, b, x, y) = \delta^{(l+1)} \tag{4.19}$$
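The four steps above can be checked numerically on a tiny network. The sketch below is an illustration under stated assumptions: a 2-2-1 network with a sigmoid hidden layer and a linear output unit (so that the output error term has the simple form of step 2), the cost of Equation 4.15 for a single sample, and hypothetical weight values. One analytic derivative is compared against a central finite difference.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(params, x):
    """Step 1: forward pass. Returns hidden activations and the linear output."""
    W1, b1, W2, b2 = params
    z1 = [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(W1, b1)]
    a1 = [sigmoid(z) for z in z1]
    out = [sum(w * a for w, a in zip(row, a1)) + b for row, b in zip(W2, b2)]
    return a1, out


def cost(params, x, target):
    _, out = forward(params, x)
    return 0.5 * sum((o - t) ** 2 for o, t in zip(out, target))


def backprop(params, x, target):
    W1, b1, W2, b2 = params
    a1, out = forward(params, x)
    # Step 2: output error term, -(target - output) for a linear output unit.
    delta_out = [o - t for o, t in zip(out, target)]
    # Step 3: hidden error term, (W2^T delta_out) times the sigmoid derivative.
    delta_hidden = []
    for j in range(len(a1)):
        back = sum(W2[k][j] * delta_out[k] for k in range(len(delta_out)))
        delta_hidden.append(back * a1[j] * (1.0 - a1[j]))
    # Step 4: derivatives are outer products of error terms and activations.
    grad_W2 = [[d * a for a in a1] for d in delta_out]
    grad_W1 = [[d * xi for xi in x] for d in delta_hidden]
    return grad_W1, delta_hidden, grad_W2, delta_out


def numerical_gradient_W1(params, x, target, i, j, eps=1e-6):
    """Central finite difference of the cost with respect to W1[i][j]."""
    W1 = params[0]
    original = W1[i][j]
    W1[i][j] = original + eps
    plus = cost(params, x, target)
    W1[i][j] = original - eps
    minus = cost(params, x, target)
    W1[i][j] = original
    return (plus - minus) / (2.0 * eps)


# Hypothetical parameters: W1, b1, W2, b2.
params = ([[0.1, -0.2], [0.4, 0.3]], [0.0, 0.1], [[0.5, -0.7]], [0.2])
x, target = [1.0, 2.0], [0.5]

grad_W1, grad_b1, grad_W2, grad_b2 = backprop(params, x, target)
check = numerical_gradient_W1(params, x, target, 0, 1)
print(grad_W1[0][1], check)
```

Agreement between the two printed numbers is a common sanity check that the error terms were propagated correctly.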
4.4
Gradient Descent
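The gradient is central to finding a minimum of the error function quickly and efficiently. The gradient information is used to update the weights:
$$W^{(l)} := W^{(l)} - \alpha\left[\left(\frac{1}{m}\Delta W^{(l)}\right) + \lambda W^{(l)}\right] \tag{4.20}$$
$$b^{(l)} := b^{(l)} - \alpha\left[\frac{1}{m}\Delta b^{(l)}\right] \tag{4.21}$$
where $\Delta W^{(l)}$ and $\Delta b^{(l)}$ correspond, respectively, to Equations (4.18) and (4.19) in step 4 of the backpropagation algorithm, accumulated over the $m$ training samples.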
The gradient is re-evaluated at each update. This approach is called “gradient descent” because with each update the weight vector moves in the direction of the greatest rate of decrease of the error function [52].
In order to find a sufficiently good minimum, it may be necessary to run a gradient-based algorithm multiple times, each time using a different randomly chosen starting point, and comparing the resulting performance on an independent validation set. There is, however, an on-line version of gradient descent that has proved useful in practice for training neural networks on large data sets [54].
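The batch update rule with weight decay can be sketched on the simplest possible model, a single linear neuron with one weight and one bias. The data (generated from y = 2x + 1), the learning rate and the decay coefficient are hypothetical values chosen for illustration.

```python
def gradient_descent_step(w, b, samples, alpha, lam):
    """One batch update: accumulate per-sample derivatives, then update w and b."""
    m = len(samples)
    delta_w = 0.0
    delta_b = 0.0
    for x, target in samples:
        error = (w * x + b) - target   # derivative of 0.5 * error^2 w.r.t. the output
        delta_w += error * x
        delta_b += error
    w = w - alpha * (delta_w / m + lam * w)   # weight update with weight decay
    b = b - alpha * (delta_b / m)             # bias update, no decay
    return w, b


def fit(samples, alpha=0.1, lam=0.0, steps=2000):
    w, b = 0.0, 0.0
    for _ in range(steps):
        w, b = gradient_descent_step(w, b, samples, alpha, lam)
    return w, b


samples = [(x, 2.0 * x + 1.0) for x in (0.0, 1.0, 2.0, 3.0)]
w, b = fit(samples)                          # recovers the generating line
w_decay, b_decay = fit(samples, lam=0.1)     # weight decay shrinks the weight
print(w, b, w_decay, b_decay)
```

With the decay term switched on, the fitted weight is smaller than the generating slope, which is the shrinking effect that the regularization term is meant to produce.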
4.5
Confusion Matrix
Measuring the performance of a trained system is one way to determine whether the model is reliable. It is therefore important to choose a tool that allows us to visualize this performance. A confusion matrix summarizes the classification performance of a classifier with respect to some test data [55].
Figure 4.10: Confusion Matrix
The entries in the confusion matrix have the following meaning in the context of our study [56]:
• a is the number of correct predictions that an instance is negative,
• b is the number of incorrect predictions that an instance is positive,
• c is the number of incorrect predictions that an instance is negative, and
• d is the number of correct predictions that an instance is positive.
The following terms are derived from the confusion matrix:
Accuracy (AC) is the proportion of the total number of predictions that were correct. It is defined by the equation:
$$AC = \frac{a + d}{a + b + c + d} \tag{4.22}$$
The recall or true positive rate (TP) is the proportion of positive cases that were correctly identified:
$$TP = \frac{d}{c + d} \tag{4.23}$$
The false positive rate (FP) is the proportion of negative cases that were incorrectly classified as positive. It is defined by:
$$FP = \frac{b}{a + b} \tag{4.24}$$
The true negative rate (TN) is defined as the proportion of negative cases that were classified correctly, as calculated using the equation:
$$TN = \frac{a}{a + b} \tag{4.25}$$
The false negative rate (FN) is the proportion of positive cases that were incorrectly classified as negative, as calculated using the equation:
$$FN = \frac{c}{c + d} \tag{4.26}$$
Precision (P) is the proportion of the predicted positive cases that were correct. This term is defined by the next equation:
$$P = \frac{d}{b + d} \tag{4.27}$$
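The measures defined in Equations 4.22 to 4.27 can be computed directly from the four entries of the confusion matrix. The counts below are hypothetical and serve only as a worked example; the function name `confusion_metrics` is illustrative.

```python
def confusion_metrics(a, b, c, d):
    """Derive the measures of Eqs. 4.22-4.27 from the confusion matrix entries.

    a: correct negative predictions, b: incorrect positive predictions,
    c: incorrect negative predictions, d: correct positive predictions.
    """
    return {
        "accuracy": (a + d) / (a + b + c + d),
        "recall": d / (c + d),
        "false_positive_rate": b / (a + b),
        "true_negative_rate": a / (a + b),
        "false_negative_rate": c / (c + d),
        "precision": d / (b + d),
    }


# Hypothetical counts for 150 test instances.
metrics = confusion_metrics(a=90, b=10, c=5, d=45)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Note that the true negative rate and the false positive rate sum to one, as do the recall and the false negative rate, which is a quick consistency check on any reported set of values.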
4.6
Principal Component Analysis
Principal component analysis (PCA) is a technique for reducing the dimensionality of large datasets, increasing interpretability while minimizing information loss [57]. This technique extracts features by transforming the inputs, which eliminates variables that do not contribute important information. Finding the new variables, the principal components, reduces to solving an eigenvalue/eigenvector problem, and the new variables are defined by the dataset at hand, not a priori, hence making PCA an adaptive data analysis technique [57].
The PCA technique is one of the best-known unsupervised dimensionality reduction techniques. Its goal is to find the PCA space, which represents the direction of maximum variance of the given data [58]. PCA finds a lower-dimensional space, the PCA space ($W$), that is used to transform the data $X = \{x_1, x_2, \ldots, x_N\}$ from a higher-dimensional space ($\mathbb{R}^M$) to a lower-dimensional space ($\mathbb{R}^k$), where $N$ is the total number of samples or observations and $x_i$ is the $i$th sample, pattern, or observation. All samples have the same dimension ($x_i \in \mathbb{R}^M$). In other words, each sample is represented by $M$ variables, i.e., as a point in $M$-dimensional space [59]. Figure 4.11 [60] shows an example of two-dimensional data ($x_1$, $x_2$): on the left, the original data are shown in the original coordinates $x_1$ and $x_2$, the variance of each variable is represented graphically, and the direction of maximum variance, $PC_1$, is shown; on the right, the original data are projected on the first (blue stars) and second (green stars) principal components.
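For two-dimensional data such as the example in Figure 4.11, the eigenvalue problem behind PCA can be solved in closed form. The sketch below is a minimal illustration under that restriction (exactly two variables); the data points are hypothetical and the function name `pca_2d` is not taken from this work.

```python
import math


def pca_2d(points):
    """Return (eigenvalues, first principal direction, projections onto it)."""
    n = len(points)
    mean_x = sum(p[0] for p in points) / n
    mean_y = sum(p[1] for p in points) / n
    centered = [(p[0] - mean_x, p[1] - mean_y) for p in points]
    # Entries of the 2x2 sample covariance matrix.
    sxx = sum(cx * cx for cx, _ in centered) / (n - 1)
    syy = sum(cy * cy for _, cy in centered) / (n - 1)
    sxy = sum(cx * cy for cx, cy in centered) / (n - 1)
    # Closed-form eigenvalues of a symmetric 2x2 matrix.
    trace, det = sxx + syy, sxx * syy - sxy * sxy
    root = math.sqrt(max(trace * trace / 4.0 - det, 0.0))
    lambda1, lambda2 = trace / 2.0 + root, trace / 2.0 - root
    # Eigenvector that belongs to the largest eigenvalue.
    if abs(sxy) > 1e-12:
        vx, vy = sxy, lambda1 - sxx
    elif sxx >= syy:
        vx, vy = 1.0, 0.0
    else:
        vx, vy = 0.0, 1.0
    norm = math.hypot(vx, vy)
    pc1 = (vx / norm, vy / norm)
    scores = [cx * pc1[0] + cy * pc1[1] for cx, cy in centered]
    return (lambda1, lambda2), pc1, scores


# Hypothetical, roughly correlated two-dimensional data.
data = [(2.5, 2.4), (0.5, 0.7), (2.2, 2.9), (1.9, 2.2), (3.1, 3.0)]
eigenvalues, pc1, scores = pca_2d(data)
print(eigenvalues, pc1)
```

The variance of the projected scores equals the largest eigenvalue, which is what it means for the first principal component to be the direction of maximum variance.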
4.7
Agents
An agent is anything that can be viewed as perceiving its environment through sensors and acting upon that environment through actuators [42]. This idea is illustrated in Figure 4.12. A software agent can be considered a system situated within and a part of an environment that senses that environment and acts on it, over time, in pursuit of its own agenda and so as to affect what it senses in the future [61].
An agent consists of a sensing element that can receive events, a recognizer or classifier that determines which event occurred, a set of logic ranging from hard-coded programs to rule-based inferencing, and a mechanism for taking action in the world. Other attributes that are important include mobility and learning [41].
Figure 4.11: Example of a Principal Component Analysis
Two basic properties of software agents are that they are autonomous and that they are situated in an environment. There is a consensus that autonomy, the ability to act without the intervention of humans or other systems, is a key feature of an agent [62]. The condition of being situated does not constrain the notion of an agent very much since virtually all software can be considered to be positioned in an environment. Agents are often situated in dynamic environments that change rapidly. This means that an agent must respond to significant changes in its environment. In other words, agents need to be reactive, responding in a timely manner to changes in their environment. Another key property of agents is that they pursue goals over time, that is, they are proactive. However, if the agent is not sufficiently reactive, then it will waste time trying to follow plans that are no longer relevant or applicable. Since failure of actions is a possibility in challenging environments, agents must be able to recover from such failures, that is, they must be robust. A natural approach to achieving robustness is to be flexible. Finally, agents almost always need to interact with other agents, that is, agents are social [63].
4.8
Markov Decision Process
The Markov Decision Process (MDP) [64] is the most commonly used approach to solve reinforcement learning problems. This approach is defined by a five-tuple (S,A,T,R,E) where:
• $S$ is the set of all possible agent states, where $s_t$ refers to the state at a specific time $t$.
• $A$ is the set of all possible actions that the agent can take, where $A_t$ is the set of all possible actions in a state $s$ at time $t$.
• $T$ is a probabilistic transition function, which gives the probability of transitioning into state $s'$ after taking action $a$ in the current state $s$. This is denoted by $P(s'_{t+1} \mid s_t, a_t)$.
• $R$ is the reward value, which is returned when the agent moves to state $s'$ after taking action $a$ in state $s$.
• $E$ is the set of terminal states, which generally give the most significant rewards.
Figure 4.12: Agents interact with environment.
The goal of planning in an MDP is to find a policy $\pi : S \to A$, a mapping from states to actions, that maximizes the expected future discounted reward when the agent chooses actions according to $\pi$ in the environment. A policy that maximizes the expected future discounted reward is an optimal policy and is denoted by $\pi^*$ [65].
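The five-tuple can be written down directly as Python data. The sketch below defines a hypothetical three-state MDP and then finds a policy that maximizes the expected discounted reward using value iteration. Value iteration is a standard planning method that is not described in this chapter; it is used here only as an assumption to make the notion of an optimal policy concrete, and the discount factor of 0.9 is likewise illustrative.

```python
GAMMA = 0.9                       # assumed discount factor
STATES = [0, 1, 2]                # S
ACTIONS = ["stay", "advance"]     # A
TERMINAL = {2}                    # E

# T and R together: (state, action) -> list of (probability, next_state, reward).
T = {
    (0, "stay"): [(1.0, 0, 0.0)],
    (0, "advance"): [(0.8, 1, 0.0), (0.2, 0, 0.0)],
    (1, "stay"): [(1.0, 1, 0.0)],
    (1, "advance"): [(0.8, 2, 1.0), (0.2, 1, 0.0)],
}


def action_value(values, s, a):
    """Expected discounted reward of taking action a in state s."""
    return sum(p * (r + GAMMA * values[s2]) for p, s2, r in T[(s, a)])


def value_iteration(tolerance=1e-10):
    values = {s: 0.0 for s in STATES}
    while True:
        largest_change = 0.0
        for s in STATES:
            if s in TERMINAL:
                continue   # terminal states keep value 0
            best = max(action_value(values, s, a) for a in ACTIONS)
            largest_change = max(largest_change, abs(best - values[s]))
            values[s] = best
        if largest_change < tolerance:
            return values


values = value_iteration()
policy = {s: max(ACTIONS, key=lambda a: action_value(values, s, a))
          for s in STATES if s not in TERMINAL}
print(values, policy)
```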
4.9
Q-Learning
Q-Learning is a far-reaching reinforcement learning technique that does not require a model of the environment to learn to execute complex tasks. Essentially, Q-Learning makes it possible for an algorithm to learn a sequential task in which rewards are released step by step until a journey called an “episode” is completed. After training, the “educated” agent develops a road-map memory called a “policy”, usually represented by a Q-matrix, which optimizes reward-capturing trajectories in any definable environment. $Q(s_t, a_t)$ gives the value of taking action $a_t$ in state $s_t$. Equation 4.28 is the core of the Q-learning algorithm, derived from the Bellman equation by considering the first and second terms of an infinite series [66]:
$$Q_{obs}(s_t, a_t) = r + \gamma \max_{a} Q(s_{t+1}, a_{t+1}) \tag{4.28}$$
where $\gamma$ is the discount factor, which manages the balance between the importance of immediate and future rewards. In this equation, the value $Q(s_t, a_t)$ of a state and action is given by the sum of the reward $r$ and the discount factor times the maximum expected future reward after moving to the next state $s_{t+1}$. The value of $Q_{obs}(s_t, a_t)$ is computed by the agent, which then uses Equation 4.29 to update its own estimate of $Q^*(s_t, a_t)$ in a Q-table. The equation is defined by:
$$Q^*(s_t, a_t) = Q(s_t, a_t) + \alpha\left[r + \gamma \max_{a} Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t)\right] \tag{4.29}$$
where $\alpha$ is the learning rate and $\gamma$ is the discount factor of Equation 4.28. The term $\max_{a} Q(s_{t+1}, a_{t+1})$ gives the maximum value over all actions in the following state. Q-learning is an off-policy algorithm since it updates the Q-values without making any assumptions about the actual policy being followed [67]. In a low-dimensional finite state space, Q-functions are recorded in a table. However, in a high-dimensional continuous state space, Q-functions could not be resolved until the deep Q-learning algorithm was proposed, which fits the Q-function with a deep neural network [24].
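The tabular update of Equations 4.28 and 4.29 can be sketched on a toy problem. The environment below is a hypothetical five-state corridor in which the agent starts at state 0 and receives a reward of 1 on reaching state 4; the ε-greedy exploration scheme and all hyperparameter values are assumptions made for illustration.

```python
import random

N_STATES = 5                  # hypothetical corridor; state 4 is terminal
ACTIONS = (-1, +1)            # move left or move right
ALPHA, GAMMA, EPSILON = 0.5, 0.9, 0.2


def step(state, action):
    """Return (next_state, reward, done) for the corridor environment."""
    next_state = min(max(state + action, 0), N_STATES - 1)
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done


def train(episodes=200, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]   # Q-table: one row per state
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            if rng.random() < EPSILON:
                a = rng.randrange(2)             # explore
            else:
                best = max(q[state])             # exploit, ties broken at random
                a = rng.choice([i for i in range(2) if q[state][i] == best])
            next_state, reward, done = step(state, ACTIONS[a])
            # Eq. 4.28: observed value (no future term after a terminal state).
            q_obs = reward if done else reward + GAMMA * max(q[next_state])
            # Eq. 4.29: move the current estimate towards the observed value.
            q[state][a] += ALPHA * (q_obs - q[state][a])
            state = next_state
    return q


q_table = train()
greedy = [ACTIONS[max(range(2), key=lambda i: q_table[s][i])]
          for s in range(N_STATES - 1)]
print(greedy)
```

After training, the greedy action in every non-terminal state is to move right, and the learned values decrease with distance from the goal because of the discount factor.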
Figure 4.13: Q-learning vs Deep Q-Networks
4.10
Deep Q-network
A Deep Q-network (DQN) is a deep feed-forward convolutional network that uses reinforcement learning instead of a supervised training approach [67]. This research approach requires knowledge of some important concepts such as: agents, neural networks, Markov Decision Process and Q-Learning.
The Deep Q-Network [68] is a variation of the classic Q-Learning algorithm (see Figure 4.13 (adapted from [69])). In this new approach, three main contributions are used:
• A deep convolutional neural net architecture for Q-function approximation.
• This network uses mini-batches of random training data rather than single-step updates on the last experience.
• It uses older network parameters to estimate the Q-values of the next state.
DQN is one of the first Deep Reinforcement Learning (DRL) algorithms able to succeed in several high-dimensional challenging tasks [70]. Pseudo-code for DQN [68] is shown in Algorithm 1. Learning with neural networks eliminates the use of the Q-table, as the neural network acts as a Q-function solver [71]. DQN stores the experience it acquires in a memory space. This experience is a five-tuple $(s, a, s', r, T)$, which describes the agent taking an action $a$ in state $s$, which gives a reward $r$ for changing to state $s'$, with $T$ indicating whether $s'$ is a terminal state. After some iterations, the agent samples mini-batches from this memory and begins to perform its Q-function updates. This process is known as experience replay [72], and it is what makes DQN a novel and useful method in Deep Reinforcement Learning.
Algorithm 1: Deep Q-learning with experience replay
  Initialize replay memory $D$ to capacity $N$
  Initialize action-value function $Q$ with random weights $\theta$
  Initialize target action-value function $\hat{Q}$ with weights $\theta^- = \theta$
  for episode = 1, M do
    Initialize sequence $s_1 = \{x_1\}$ and preprocessed sequence $\phi_1 = \phi(s_1)$
    for t = 1, T do
      With probability $\varepsilon$ select a random action $a_t$,
        otherwise select $a_t = \arg\max_a Q(\phi(s_t), a; \theta)$
      Execute action $a_t$ in the emulator and observe reward $r_t$ and image $x_{t+1}$
      Set $s_{t+1} = s_t, a_t, x_{t+1}$ and preprocess $\phi_{t+1} = \phi(s_{t+1})$
      Store experience $(\phi_t, a_t, r_t, \phi_{t+1})$ in $D$
      Sample random mini-batch of experiences $(\phi_j, a_j, r_j, \phi_{j+1})$ from $D$
      Set $y_j = r_j$ if episode terminates at step $j+1$,
        otherwise $y_j = r_j + \gamma \max_{a'} \hat{Q}(\phi_{j+1}, a'; \theta^-)$
      Perform a gradient descent step on $(y_j - Q(\phi_j, a_j; \theta))^2$ with respect to the weights $\theta$
      Every $C$ steps reset $\hat{Q} = Q$
    end
  end
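Two ingredients of Algorithm 1, the bounded replay memory and the computation of the targets $y_j$, can be sketched without any deep learning library. In the sketch below the target network is replaced by a hypothetical stand-in function `target_q`, states are plain integers, and the capacity, batch size and discount factor are illustrative assumptions; a real DQN would use a neural network in place of `target_q`.

```python
import random
from collections import deque


class ReplayMemory:
    """Fixed-capacity memory D; the oldest experience is dropped when full."""

    def __init__(self, capacity, seed=0):
        self.buffer = deque(maxlen=capacity)
        self.rng = random.Random(seed)

    def store(self, state, action, reward, next_state, terminal):
        self.buffer.append((state, action, reward, next_state, terminal))

    def sample(self, batch_size):
        return self.rng.sample(list(self.buffer), batch_size)


def dqn_targets(batch, target_q, actions, gamma):
    """y_j = r_j if terminal, else r_j + gamma * max_a' Q_hat(next_state, a')."""
    targets = []
    for _, _, reward, next_state, terminal in batch:
        if terminal:
            targets.append(reward)
        else:
            best_next = max(target_q(next_state, a) for a in actions)
            targets.append(reward + gamma * best_next)
    return targets


def target_q(state, action):
    # Hypothetical stand-in for the target network Q_hat(.; theta^-).
    return 0.1 * state + (0.5 if action == 1 else 0.0)


memory = ReplayMemory(capacity=3)
memory.store(0, 1, 0.0, 1, False)
memory.store(1, 1, 0.0, 2, False)
memory.store(2, 1, 1.0, 3, True)
memory.store(3, 0, 0.0, 4, False)    # capacity reached: the first entry is evicted

batch = memory.sample(2)             # random mini-batch for one update
targets = dqn_targets(list(memory.buffer), target_q, actions=(0, 1), gamma=0.9)
print(batch, targets)
```

Sampling at random from the memory, rather than always using the most recent transition, breaks the correlation between consecutive experiences.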
Chapter 5
Data Description
This chapter presents an analysis of the dataset used in this project. This analysis is important because it exposes the data and information needed to make good strategic decisions.
5.1
Credit Card Fraud Dataset
The dataset contains transactions made with credit cards in September 2013 by European cardholders. It covers transactions that occurred over two days, with 492 frauds out of 284,807 transactions [73]. Note the strong imbalance between fraudulent and legitimate transactions, which is typical of this kind of large-scale business. The dataset contains numerical variables that have been hidden using a PCA (Principal Component Analysis) transformation. These 28 features, named V1, V2, ..., V28, contain confidential user data and are non-reversible, thus protecting the original characteristics of the data. Two special features have not been transformed using PCA: Time and Amount. There is also an important variable called Class, which is a fundamental value in the dataset.
#        Time     V1                    ...   V28                  Amount   Class
1        0        -1.3598071336738      ...   -0.02105305...       149.62   0
2        0        1.19185711131486      ...   0.01472416919...     2.69     0
3        1        -1.35835406159823     ...   -0.0597518405...     378.66   0
4        1        -0.966271711572087    ...   0.061457628...       123.5    0
5        2        -1.15823309349523     ...   0.2151531474...      69.99    0
...      ...      ...                   ...   ...                  ...      ...
284803   172786   -11.8811178854323     ...   0.823730961...       0.77     0
284804   172787   -0.732788670658956    ...   -0.0535273892...     24.79    0
284805   172788   1.91956500980048      ...   -0.02656082...       67.88    0
284806   172788   -0.240440049680947    ...   0.104532821...       10       0
284807   172792   -0.53341252200504     ...   0.01364891433...     217      0
Table 5.1: Credit Card Fraud Dataset
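The degree of class imbalance follows directly from the two counts reported for this dataset (492 frauds out of 284,807 transactions). The short calculation below uses only those two figures; the last quantity, the accuracy of a trivial rule that labels every transaction as legitimate, is added here as an illustration of why accuracy alone can be misleading on such data.

```python
TOTAL_TRANSACTIONS = 284_807
FRAUD_TRANSACTIONS = 492

legitimate = TOTAL_TRANSACTIONS - FRAUD_TRANSACTIONS
fraud_percent = 100.0 * FRAUD_TRANSACTIONS / TOTAL_TRANSACTIONS
legitimate_percent = 100.0 - fraud_percent

# Accuracy of a rule that predicts "not fraud" for every transaction.
always_legitimate_accuracy = legitimate / TOTAL_TRANSACTIONS

print(f"legitimate transactions: {legitimate}")
print(f"fraud: {fraud_percent:.2f}%  legitimate: {legitimate_percent:.2f}%")
print(f"accuracy of the trivial rule: {always_legitimate_accuracy:.4f}")
```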
School of Mathematical and Computational Sciences YACHAY TECH
Figure 5.1: Distribution of Time Feature
5.1.1
Time Feature
Time gives the seconds elapsed since the first transaction. When the values are plotted (see Figure 5.1), it can be verified that the dataset stores transactions that occurred over a period of two days. The data show bimodal behavior: after approximately 24 hours there is a significant drop in the number of transactions. It is reasonable to conclude that this drop corresponds to night hours. Finally, removing this variable is considered, because it is not relevant for training the model: its values remain very close to one another up to the last transaction.
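To make the two-day span concrete, the following sketch converts Time values (seconds since the first transaction) into a day index and an hour offset. The helper name `split_time` is hypothetical. The hour is counted from the first transaction rather than from midnight, because the clock time of the first transaction is not stated here.

```python
SECONDS_PER_HOUR = 3600
SECONDS_PER_DAY = 24 * SECONDS_PER_HOUR


def split_time(seconds: int) -> tuple:
    """Return (day index, hour within that day), both counted from the first transaction."""
    day, rest = divmod(seconds, SECONDS_PER_DAY)
    return day, rest // SECONDS_PER_HOUR


last = 172792  # Time value of the last row in Table 5.1

print(last / SECONDS_PER_HOUR)  # total span in hours, just under 48
print(split_time(0))            # first transaction
print(split_time(last))         # last transaction
```

Grouping transactions by the hour offset is one way to reproduce the two peaks and the drop between them that Figure 5.1 displays.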
5.1.2
Amount Feature
The Amount feature is the amount of money in each transaction. The largest transaction in the dataset is $25,691.16, while the average transaction is $88.35. Figure 5.2 shows that the data are mostly concentrated at very small values close to zero, and only a few transactions approach the maximum value. In addition, the plot of the amount of money per transaction (see Figure 5.3) shows some values that differ markedly from the rest. These are called outliers, and in this case they are transactions in which a large amount of money is transferred. Such values naturally attract attention as possible frauds, which is exactly what fraudsters want to avoid. Existing information shows that fraudsters frequently transfer small amounts of money in order to keep stealing undetected.
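The notion of an outlier can be illustrated with a short sketch. The interquartile-range (IQR) rule used below is a common rule of thumb and is an assumption of this example, not a method stated in this work. The sample consists only of the ten Amount values visible in Table 5.1 plus the reported maximum, so it is not representative of the full dataset.

```python
import statistics

# Amount values visible in Table 5.1, plus the reported maximum of 25,691.16.
amounts = [149.62, 2.69, 378.66, 123.5, 69.99,
           0.77, 24.79, 67.88, 10.0, 217.0, 25691.16]

# Quartiles with linear interpolation between data points.
q1, _median, q3 = statistics.quantiles(amounts, n=4, method="inclusive")
iqr = q3 - q1

# Assumed rule of thumb: values beyond 1.5 * IQR from the quartiles are outliers.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [a for a in amounts if a < lower or a > upper]
print(outliers)
```

On this small sample only the maximum transaction is flagged, which matches the observation that large transfers stand out while small ones do not.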
5.1.3
Class Feature
Figure 5.4 represents the Class feature, which indicates whether a transaction is fraudulent: the variable takes the value 1 in case of fraud and 0 otherwise. This feature shows that fraudulent cases are a very small fraction, 0.17% of all data, while non-fraudulent cases make up 99.83%. It is concluded that the data is highly imbalanced, which requires choosing appropriate measures to divide the data and make the training of the system effective.
Figure 5.2: Distribution of Monetary Value Feature
Figure 5.4: Count of Fraudulent and Non-Fraudulent Transactions
Figure 5.3: Money per Transaction
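The class imbalance described in Section 5.1.3 can be checked from the counts given in Section 5.1, and one common way to divide such data is a stratified split that keeps the fraud share equal in both parts. The sketch below is an illustration under stated assumptions: the function name `stratified_split`, the 20% test fraction and the seed are hypothetical choices, and this is not necessarily the division used in this work.

```python
import random

N_TOTAL, N_FRAUD = 284_807, 492  # counts reported in Section 5.1
print(f"fraud share: {100 * N_FRAUD / N_TOTAL:.2f}%")


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Split indices so that each class keeps the same share in both parts."""
    rng = random.Random(seed)
    train, test = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        cut = round(len(idx) * test_fraction)
        test.extend(idx[:cut])
        train.extend(idx[cut:])
    return train, test


# Labels only (1 = fraud, 0 = legitimate); feature values are not needed for the split.
labels = [1] * N_FRAUD + [0] * (N_TOTAL - N_FRAUD)
train_idx, test_idx = stratified_split(labels)

# Number of fraud cases that end up in each part.
print(sum(labels[i] for i in train_idx), sum(labels[i] for i in test_idx))
```

A purely random split could leave very few fraud cases in one of the parts, whereas splitting each class separately guarantees that both parts contain them in the same proportion.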
5.1.4
V-features
It is also useful to examine the histograms of the V-features (see Figure 5.5), which give a basic idea of the data distribution. It is also important to check whether there is any significant correlation between the features, especially with the Class feature. Figure 5.6 shows the correlation matrix of all features. It shows that few features are correlated with Class and that, although the data has many features, there are relatively few significant correlations. This is consistent with the features being principal components, a result of the PCA preparation previously applied to the dataset. Finally, the Time¹ and Amount features show no correlation with the Class feature and are therefore not relevant to the system's learning process.
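The link between PCA and the small number of significant correlations can be illustrated on synthetic data. The sketch below does not use the fraud dataset: the three columns are randomly generated stand-ins, and the PCA is computed by hand with NumPy purely for illustration. It shows that columns which are strongly correlated before the transformation yield principal component scores that are mutually uncorrelated.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in data (NOT the fraud dataset): the first two columns are
# strongly correlated, the third is independent noise.
base = rng.normal(size=(1000, 1))
X = np.hstack([base + 0.1 * rng.normal(size=(1000, 1)),
               2 * base + 0.1 * rng.normal(size=(1000, 1)),
               rng.normal(size=(1000, 1))])

# PCA via eigendecomposition of the covariance matrix of the centered data.
Xc = X - X.mean(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
scores = Xc @ eigvecs  # principal component scores

corr_before = np.corrcoef(X, rowvar=False)
corr_after = np.corrcoef(scores, rowvar=False)

# Largest absolute correlation between two different components.
off_diag = corr_after - np.diag(np.diag(corr_after))
print(abs(corr_before[0, 1]), np.abs(off_diag).max())
```

The same property explains why the correlations among V1, ..., V28 are expected to vanish, while their correlations with Class, which was not part of the transformation, can still be non-zero.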
Visualization
Visualizing the dataset reveals interesting characteristics that can contribute significantly to the fraud detection process. One option is to display the clusters formed by the data, which can help identify anomalous behavior and discard features that are not relevant.
Figures 5.7, 5.8, 5.9 and 5.10 show that the features form a single dense cluster. This holds for each of the features represented. However, some features, such as those in Figures 5.9 and 5.10, show certain values whose distribution differs from the main cluster. These values could be anomalies that serve to detect fraud; however, it is
¹ Deleting the Time feature had already been considered previously, because it shows no relevance to the learning of the system.