Stock prediction with machine learning


A Degree Thesis Submitted to the Faculty of the Escola Tècnica d'Enginyeria de Telecomunicació de Barcelona

Universitat Politècnica de Catalunya

by


Abstract

During the last few months, attention to the stock market has increased due to the Covid pandemic. The newfound leisure time has driven many people to buy and sell stocks without any knowledge of the matter at hand. The number of users of investing or trading apps has increased drastically since last year. It is natural to think that the field of stock market prediction has grown accordingly. However, only two main approaches have been taken: one focuses on day trading and uses technical analysis of the markets to predict the immediate value, and the other treats stocks as long-term investments and uses fundamental analysis to predict the future value of the stock in the long run.

Following the outbreak of the Coronavirus, we have seen an increasing gap between the economy and the stock market that comes from the instability of the current times. This may worsen the predictions produced by fundamental analysis until normality is restored, because the macroeconomic and microeconomic factors that are usually key in long-term predictions are not affecting the stock market in the same way. In contrast, technical analysis can predict short-term stock values, although it is difficult to extend the length of time over which this analysis can predict. How long into the future can technical analysis predict? Will it be an accurate prediction? Which algorithm will help us obtain the best prediction with technical analysis? These were the main questions asked at the beginning of the project.

The usual prediction time with technical analysis ranges from hours to a month or two. However, the goal of this project is to compare different algorithms to obtain a predictor that is able to tell whether a stock will go up or down in value in 3 to 5 months using technical analysis.

The project started with an introduction to Deep Learning and Machine Learning. Afterwards, the process of obtaining enough data to create a proper dataset began. With enough data and the variables defined, the dataset was used to experiment with different algorithms and configurations to obtain several predictors. Once the predictors were designed, their results were compared, and more data was added to the dataset to try to improve the scores. Then the second round of prediction started, and the scores of the different algorithms were compared again. After adding more stock values to the dataset, a mistake was found in some of the rows: the decimal comma was misplaced because of a difference in format in the API from which the data was obtained. A third and final round of prediction was done with the problem solved. Among the five algorithms tested, Random Forest offered the best results, with an accuracy of 71% on the last dataset.


Resum (abstract translated from Catalan)

Over the last few months, interest in the stock market has increased because of the Covid pandemic. The newly found free time has led many people to buy and sell stocks without sufficient knowledge. The number of users of investing or trading apps has increased drastically since last year. It is natural to think that the field of stock market prediction has grown in the same way. Even so, only two kinds of approach have been taken. The first focuses on day trading, using technical analysis of the market to predict the immediate value of stocks; the other treats stocks as long-term investments and uses fundamental analysis to predict their future value in the long run.

After the Coronavirus outbreak, the gap between the stock market and the economy has kept widening, driven by the instability of the current times. This may worsen the predictions made with fundamental analysis until normality is restored, because the macroeconomic and microeconomic factors that are usually key in long-term predictions do not affect the stock market in the same way. In contrast, technical analysis can predict stock values in the short term, although it is hard to lengthen the period over which we can predict. How far into the future can we predict with technical analysis? Will the prediction be accurate? Which algorithm will help us obtain the best prediction with technical analysis? These were the questions posed at the beginning of the project.

The usual prediction time with technical analysis ranges from hours to a month or two. The aim of this project is to compare different algorithms to obtain a predictor that tells whether the value of a stock will rise or fall over a time range of about 3 to 5 months using technical analysis.

The project began with an introduction to machine learning and deep learning. Next, enough data was gathered to create a functional dataset. With sufficient data and the variables defined, the dataset was used to experiment with different algorithms and configurations to obtain predictors. Once the predictors were designed, they were compared, and more information was added to the dataset to try to improve the scores. The second round of predictions then began, and the results were compared again to obtain the value of the best predictor. After adding more stock values to the dataset, an error was found in some rows, in which the decimal comma had been shifted because of the difference in formats in the API from which the data is obtained. A third and final round of predictions was then carried out with the problem solved. Among the five algorithms tested, Random Forest offered the best results, with a hit rate of 71% for the last dataset.


Resumen (abstract translated from Spanish)

Over the last few months, interest in the stock market has increased because of the Covid pandemic. The free time people have found has led many of them to buy and sell stocks without sufficient knowledge of the subject. The number of users of investing or trading apps has increased drastically since last year. It is natural to think that the field of stock market prediction has grown accordingly. Despite this, only two kinds of approach have been taken. The first focuses on day trading, using technical analysis of the market to predict the immediate value of stocks; the other treats stocks as long-term investments, using fundamental analysis to predict future stock values.

After the Coronavirus outbreak, a gap between the stock market and the economy has appeared; it has kept widening and stems from the instability of the present moment. This may worsen the predictions made with fundamental analysis until we return to normality, since the macroeconomic and microeconomic factors that are usually key in long-term predictions do not affect the stock market in the same way. In contrast, technical analysis can predict the value of stocks in the short term, although lengthening the period that can be predicted is complex. How far into the future can we predict with technical analysis? Will the prediction be accurate? Which algorithm will allow us to obtain the best prediction with technical analysis? These were the questions posed at the beginning of the project.

The usual prediction time with technical analysis can vary from hours to a month or two. The aim of the project is to compare different algorithms to obtain a predictor that tells whether the value of a stock will rise or fall over a time range of about 3 to 5 months using technical analysis.

The project started with an introduction to machine learning and deep learning. Next, enough data was obtained to build a functional dataset. With the data and the variables defined, the dataset was used to experiment with different algorithms and configurations to obtain predictors. Once the predictors were finished, they were compared, and it was decided to add more data to the dataset to try to improve the scores. The second round of predictions then began, and the results were compared again to obtain the value of the best predictor. The dataset was enlarged once more, and errors were found in some rows, in which the decimal comma had been shifted because of the difference in format in the API from which the data is obtained. Once the problem was solved, a third and final evaluation of the different algorithms was carried out. Among the five algorithms used in the project, Random Forest offered the best results, with a hit rate of 71% for the last dataset.


Acknowledgements

I would first like to thank my thesis supervisor at Ernst & Young, Ana Jimenez Castellanos, who guided me and gave me recommendations throughout the whole project while giving me the space to grow as an engineer, and who allowed me to work on this incredible project, which has pushed me to improve my skills in Machine Learning applied to economics.

I would also like to thank Prof. Climent Nadeu from the Department of Communications and Signal Theory at UPC, Barcelona. He gave me key pointers for acquiring knowledge before the beginning of this project, which have been helpful during the whole execution of the thesis.

I can't forget my friends and college classmates, who have taught me many life lessons and have made me who I am today. In particular I would like to thank Luis Ramón Rodríguez Javier, who has helped me stay motivated during the project and has taught me the power of perseverance.

Finally, I have to express my eternal gratitude to my parents for teaching me so many valuable lessons, for listening to the progress of this project even when they didn’t understand a word I said, and for supporting me in every step I take.


List of Figures

1 Candle graph of the Apple stock (AAPL) with some technical indicators. Image from: Plus 500 trading platform[11]

2 Candle graphic of the Tesla stock (TSLA) showcasing a bullish market with the three EMAs: blue(6), red(70), white(200). Image from: Plus 500 trading platform[11]

3 Candle graphic of the Gilead stock (GILD) showcasing a bear market with the three EMAs: blue(6), red(70), white(200). Image from: Plus 500 trading platform[11]

4 Candle graphic of the Ford stock (F) showcasing the selling point (red dotted line) and the buying point (white dotted line) using the MACD indicator. Image from: Plus 500 trading platform[11]

5 Perceptron schema. Image from: DeepAI webpage[5]

6 K-NN example with k=3 and k=7. Image from: Data Camp webpage[4]

7 3-category SVM examples using Linear kernel, RBF kernel and Polynomial kernel using the Iris dataset. Image from: Scikit-learn webpage[14]

8 Cross Validation example with k = 4

9 Shuffle Split example with 4 iterations. Image from: Scikit-learn webpage[15]

10 Architecture implemented in the project

11 Dataset Creation Process

12 SVM's time comparison between models and datasets

13 Random Forest's time comparison between models and datasets

14 Random Forest's time comparison between models and datasets

15 One configuration of each algorithm with the lowest time and best accuracy. The MLP and KNN values are difficult to see in the graph as they are too similar to the ones obtained with Linear Regression

16 Accuracy results for the KNN algorithm when modifying the number of neighbors

17 Accuracy results for the Random Forest algorithm when modifying the number of trees

18 Cost calculation


List of Tables

1 Number of rows and stocks for each dataset

2 Results for the first dataset

3 Results for the second dataset

4 Results for the third dataset

5 Best results for each algorithm and the iteration of the dataset


1 Introduction

In this fast-paced world we live in, it is impossible to be an expert in everything; that is why we use technology to help us with many day to day activities. Furthermore, we want technology to improve our quality of life in many ways. Technology has given regular people access to the stock market, bringing new opportunities to many. However, increasing people’s quality of life without the proper knowledge is seemingly impossible. Artificial Intelligence has entered the game to level the field. Through the use of technology and AI in the stock market, everyone will be able to buy and sell stocks with a certain probability of success.

Two main approaches have been taken in the field of stock prediction. The first approach is mostly focused on stocks as long-term investments through fundamental analysis¹. With this kind of study, the stock value will ideally tend towards the predicted intrinsic value. This approach does not work appropriately for short- or mid-term predictions because it does not indicate a stock's movement. On the other hand, a growing trend in the trading community has been to predict using technical analysis². Focusing our attention on the stock market's direction may help us predict in more volatile scenarios, although the window of time we will be able to predict will be shorter.

This project will be using technical analysis and has two main goals. The first one is to find the best algorithm to predict whether a stock will go up or down in value. The second one is to try to stretch the prediction window to a range of 3 to 5 months.

The main milestones of this project are obtaining the data, storing and processing it to create a dataset, using different algorithms to predict whether a stock will go up or down in value, and finally checking the results to see which algorithm best predicted the trends of the stocks.

This thesis first states the context of the project, the objectives, and the methodology. Finally, it presents the results, the environmental and economic impact, and the project's conclusions.

During the thesis's execution, some deviations appeared, mainly because of the difference between the formatting of the data from the API where it was collected and that of Excel. A significant constraint found during the first part of the project, data collection, was the API's limit on the number of queries per second and per day.

The whole process followed during the thesis is represented in the Gantt diagram in Next Steps, in Figure 19.

¹ Fundamental analysis is a method of assessing the intrinsic value of a security by analyzing various macroeconomic and microeconomic factors. The ultimate goal of fundamental analysis is to quantify the intrinsic value of a security.[8]

² Technical analysis is a method used to predict the probable future price movement of a security, such as a stock or currency pair, based on market data.[22]


Figure 1: Candle graph of the Apple stock (AAPL) with some technical indicators. Image from: Plus 500 trading platform[11]

2 Objectives

In this undergraduate thesis, there are two main objectives. One of them is to create a predictor that is able to know whether a stock will go up or down in value in a medium time range (3 to 5 months) using technical analysis. The second one is to figure out which algorithm will help us better predict the stock’s movement.

2.1 Achieve a 3 to 5 months prediction

As we stated earlier, technical analysis is mainly used in trading to predict a stock’s movement in a short time range, from hours to a month or two. This makes technical analysis an excellent method to buy and sell repeatedly during a short period. To do so, two things may be required:

The use of a bot.

Constant monitoring of the stock market.

However, not many people have the knowledge or the time to do either of those things. This project aims to stretch this constraint to a more extended time period, reducing the number of movements³ the investor will have to make and removing the need for constant monitoring of the stock market.

2.2 Figure out the best algorithm to make the prediction

Many implemented algorithms can be used to make the prediction for this thesis; hence choosing one using intuition alone would be a mistake. Each algorithm has its strengths and weaknesses, and it is within the scope of this project to find the most suitable one to make this prediction.

To complete the first objective, we have to think of the prediction as a classification problem⁴ because we will have to decide whether the stock value will go up (1) or down (0). There are many classification techniques, but we will be focusing on the following:

³ By a movement we refer to the action of buying or selling a stock.

⁴ Classification is the process of predicting a label from given data points.


Logistic Regression

Multi Layer Perceptron (MLP)

K-Nearest Neighbors (KNN)

Random Forest

Support Vector Machine (SVM)

To figure out which of the previous algorithms is best suited, we will compare their accuracy, the deviation of the accuracy, and the time it takes to complete the training with each algorithm.

3 State of the art of the technology used or applied in this thesis

In this section, there will be an in-depth description of the indicators used to build the prediction model as well as the different algorithms used during the whole process.

3.1 Indicators

The indicators that were calculated and added to the dataset were the following:

Exponential Moving Average (EMA)

Moving Average Convergence Divergence (MACD)

3.1.1 Exponential Moving Average (EMA)

The exponential moving average tracks the stock value and is a type of weighted moving average that gives more importance to the most recent data. The algorithm used to calculate the different EMA in the project is the following:

    import numpy as np

    def calculate_exponential_moving_average(self, values):
        # Exponentially spaced weights, normalised so that they sum to 1
        weights = np.exp(np.linspace(-1., 0., self.mean_values))
        weights /= weights.sum()
        ema = np.convolve(values, weights)[:len(values)]  # keep the input length
        # The first entries come from an incomplete window: overwrite them
        ema[:self.mean_values] = ema[self.mean_values]
        return ema

Where self.mean_values and values are, respectively, the number of days used in the moving average and the array from which the EMA is calculated.

In the dataset three EMAs were used with mean values of 6, 70 and 200. This method is explained in the book Ganar en la bolsa es posible by Josef Ajram[1]. The idea behind this is to have a clear trigger to buy or sell.


If the stock value is above the 6-day EMA, the 6-day EMA is above the 70-day EMA, and this one is above the 200-day EMA, then the market has a higher probability of continuing upwards.

Figure 2: Candle graphic of the Tesla stock (TSLA) showcasing a bullish market with the three EMAs: blue(6), red(70), white(200). Image from: Plus 500 trading platform[11]

If the stock value is below the 6 days EMA, and the 6 days EMA is below the 70 days EMA, and this one is below the 200 days EMA, then we are in a bear market situation and it might be a good time to sell, but never to buy.

Figure 3: Candle graphic of the Gilead stock (GILD) showcasing a bear market with the three EMAs: blue(6), red(70), white(200). Image from: Plus 500 trading platform[11]

The 200-day EMA represents the mean if we look back for a year (long term), the 70-day EMA represents the mean if we look back a medium amount of time, and the 6-day EMA represents the mean if we look back a short amount of time. With these three indicators, we can see the different trends in the long, medium, and short term. Furthermore, with the logic explained previously, we can use these indicators as a trigger to know whether we have to buy, sell, or maintain the position.
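The three-EMA rule described above can be sketched as a small helper. This is only an illustration, not the project's code: the function name `ema_trigger` and the labels it returns are assumptions made here.

```python
def ema_trigger(price, ema_6, ema_70, ema_200):
    """Classify one day with the three-EMA rule (hypothetical helper).

    "bullish": price > EMA(6) > EMA(70) > EMA(200)
    "bearish": price < EMA(6) < EMA(70) < EMA(200)
    "neutral": any other ordering (no clear trigger)
    """
    if price > ema_6 > ema_70 > ema_200:
        return "bullish"
    if price < ema_6 < ema_70 < ema_200:
        return "bearish"
    return "neutral"


print(ema_trigger(110.0, 105.0, 100.0, 90.0))  # fully stacked upwards
print(ema_trigger(80.0, 85.0, 95.0, 100.0))    # fully stacked downwards
print(ema_trigger(100.0, 105.0, 95.0, 90.0))   # mixed ordering
```

A "neutral" result corresponds to maintaining the current position, since neither of the two stacked orderings holds.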

3.1.2 Moving Average Convergence Divergence (MACD)

The MACD indicator is composed of three functions:

MACD

Signal

Histogram

The MACD function is the difference between a 12-period EMA of the closing values and a 26-period EMA of the closing values. These two indicators are called the fast EMA and the slow EMA, respectively.

$$\mathrm{MACD} = \mathrm{EMA}_{12} - \mathrm{EMA}_{26}$$

The Signal function is a nine-day EMA of the MACD, and it is used as a trigger signal. When the Signal function crosses the MACD in an upwards direction, it indicates a change to a bear market trend: it is a selling call. When the Signal function crosses the MACD in a downwards direction, it suggests a shift to a bullish market trend: it is a buying call.

To help visualize the previously mentioned outcomes, the Histogram function is usually plotted. It is the difference between the MACD and Signal functions. If the histogram is zero, it indicates a change in the market's trend, and we have to check the value of the histogram before it reaches zero. If the histogram value is positive before reaching zero, it indicates a change to a bear market trend. If the histogram value is negative before reaching zero, it indicates a change to a bullish market.

Figure 4: Candle graphic of the Ford stock (F) showcasing the selling point (red dotted line) and the buying point (white dotted line) using the MACD indicator. Image from: Plus 500 trading platform[11]


To calculate the previous functions the following algorithm was used:

    def calculate_macd(self, data, name):
        # Extract the chosen price field (e.g. "close") from every candle
        info = [d[name] for d in data]
        slow_m_pred = mean_predictor(26)
        fast_m_pred = mean_predictor(12)
        nine_m_pred = mean_predictor(9)
        slow = slow_m_pred.calculate_exponential_moving_average(info)
        fast = fast_m_pred.calculate_exponential_moving_average(info)

        macd = fast - slow  # MACD line: fast EMA minus slow EMA
        signal = nine_m_pred.calculate_exponential_moving_average(macd)
        hist = macd - signal  # histogram: MACD minus Signal
        time = [d["datetime"] for d in data]
        return time, macd, hist, signal

The data parameter is an array of candle-like objects⁵. The name variable represents the value we want to calculate the MACD with (open, close, high, low). In this project the MACD function was calculated with the closing values of each stock.
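As a complement, the buying and selling calls described for the histogram can be located by looking for its sign changes. The following is a minimal sketch under assumptions: the helper name `macd_crossings` and the toy values are invented for illustration, and exact zeros in the histogram are simply ignored.

```python
import numpy as np


def macd_crossings(macd, signal):
    """Return (sell_idx, buy_idx): positions where the histogram changes sign.

    Hypothetical helper. A positive-to-negative change is treated as a
    selling call, a negative-to-positive change as a buying call.
    """
    hist = np.asarray(macd, dtype=float) - np.asarray(signal, dtype=float)
    sign = np.sign(hist)
    # Histogram was positive and becomes negative: shift to a bear trend
    sell_idx = np.where((sign[:-1] > 0) & (sign[1:] < 0))[0] + 1
    # Histogram was negative and becomes positive: shift to a bullish trend
    buy_idx = np.where((sign[:-1] < 0) & (sign[1:] > 0))[0] + 1
    return sell_idx, buy_idx


# Toy series, not real market data
macd = [0.5, 0.2, -0.1, -0.3, 0.1]
signal = [0.1, 0.1, 0.0, -0.1, -0.1]
sell_idx, buy_idx = macd_crossings(macd, signal)
print(sell_idx, buy_idx)
```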

3.2 Classification algorithms

As mentioned in the objectives, section 2.2 Figure out the best algorithm to make the prediction, the following algorithms have been used during the thesis:

Logistic Regression

Multi Layer Perceptron (MLP)

K-Nearest Neighbors (KNN)

Random Forest

Support Vector Machine (SVM)

3.2.1 Logistic Regression

When we think of a classification between two outcomes the first algorithm that comes to our mind is Binary Logistic Regression. Naturally this is the first algorithm tried in this project because it can be interpreted as knowing the probability of a stock going up.

Generally, to train a Binary Logistic Regression predictor we will have to follow the next steps[10]:

1. Create a weight matrix (W) and multiply it by the input variables (X), X being a matrix with m rows and n features:

$$a = w_0 + w_1 x_1 + w_2 x_2 + \dots + w_n x_n$$

2. Use the sigmoid⁶ function to map all real numbers to the interval from 0 to 1:

$$y_{pred} = \frac{1}{1 + e^{-a}}$$

3. Calculate the cost function for that iteration:

$$cost_w = -\frac{1}{m} \sum_{i=1}^{m} \left[ y_i \log(y_{pred,i}) + (1 - y_i) \log(1 - y_{pred,i}) \right]$$

4. Obtain the gradient of the cost function:

$$dw_j = \frac{1}{m} \sum_{i=1}^{m} (y_{pred,i} - y_i)\, x_{ij}$$

5. Update the weight matrix W:

$$w_j = w_j - \alpha \, dw_j$$

Where α is the learning rate.
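The five steps above can be sketched with NumPy as follows. This is an illustrative implementation, not the one used in the project (which relies on scikit-learn): the function names, the learning rate, the number of iterations and the synthetic data are all assumptions, and the gradient is averaged over the m samples.

```python
import numpy as np


def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))


def train_logistic_regression(X, y, alpha=0.1, n_iter=1000):
    """Batch gradient descent for binary logistic regression (sketch)."""
    m, n = X.shape
    Xb = np.hstack([np.ones((m, 1)), X])  # column of ones so that w[0] acts as w0
    w = np.zeros(n + 1)
    eps = 1e-12  # avoids log(0)
    costs = []
    for _ in range(n_iter):
        a = Xb @ w                       # step 1: weighted sum
        y_pred = sigmoid(a)              # step 2: squash to (0, 1)
        cost = -np.mean(y * np.log(y_pred + eps)
                        + (1 - y) * np.log(1 - y_pred + eps))  # step 3
        dw = Xb.T @ (y_pred - y) / m     # step 4: gradient
        w -= alpha * dw                  # step 5: update
        costs.append(cost)
    return w, costs


# Synthetic, linearly separable toy data (not stock data)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)

w, costs = train_logistic_regression(X, y)
probs = sigmoid(np.hstack([np.ones((len(X), 1)), X]) @ w)
accuracy = np.mean((probs >= 0.5) == (y == 1))
print(costs[0], costs[-1], accuracy)
```

The cost should decrease from its initial value as the weights are updated, which is a quick sanity check for the gradient.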

This whole process is implemented in the scikit-learn module sklearn.linear_model.LogisticRegression[17], which was used with the following parameters:

Solver: SAGA

Penalty: L2

3.2.2 Multi Layer Perceptron (MLP)

To explain the process that a multi layer perceptron follows, we first need to understand what a single perceptron does. The perceptron computes a weighted sum of all the inputs it is given to create an output. At the end of the perceptron an activation function is added.


Figure 5: Perceptron schema. Image from: DeepAI webpage[5]

An MLP is a set of perceptrons combined to create a fully-connected neural network. At the last layer a step function is added to create a binary classifier. There are different activation functions; the most common are:

Linear

Sigmoid

Tanh

ReLU

Leaky ReLU

Softmax
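A single perceptron, as described above, is a weighted sum followed by an activation function. The sketch below illustrates this with NumPy; the function names and the example weights are assumptions chosen for illustration only.

```python
import numpy as np


def relu(z):
    return np.maximum(0.0, z)


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def perceptron(x, w, b, activation=relu):
    """Weighted sum of the inputs plus a bias, passed through an activation."""
    return activation(np.dot(w, x) + b)


x = np.array([1.0, 2.0, 3.0])
w = np.array([0.5, -1.0, 0.25])  # arbitrary example weights

print(perceptron(x, w, 0.1))                      # negative sum, ReLU clips it
print(perceptron(x, w, 2.0))                      # positive sum passes through
print(perceptron(x, w, 0.1, activation=sigmoid))  # same sum, sigmoid output
```

An MLP stacks layers of such units, feeding the outputs of one layer as the inputs of the next.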

The MLP has been implemented using the scikit-learn module sklearn.neural_network.MLPClassifier[18] and the following parameters:

Hidden Layers: (7, 6, 5, 4, 3, 2, 1)

Activation function: ReLU

Learning Rate: invscaling


3.2.3 K-Nearest Neighbors (KNN)

The K-NN algorithm is a type of classification algorithm that uses the distance between the data point we want to predict and the data points we already have to predict the category of the new point.

The prediction can be done with the Euclidean distance or any other distance metric we choose. Another important variable to choose is k, the number of data points used to predict the class of the new data entry. k cannot be too high when we have a small number of training data points, but it cannot be too low either: a low k may have a lower bias, but it may introduce a higher variance.

Figure 6: K-NN example with k=3 and k=7. Image from: Data Camp webpage[4]

In this project the K-NN has been implemented with the scikit-learn module sklearn.neighbors.KNeighborsClassifier[16] and the following parameters:

Number of neighbors: 200

Weights: distance

Number of jobs: 6

Leaf size: 30


3.2.4 Random Forest

The Random Forest algorithm consists of a large ensemble of decision trees. The algorithm works on the basis that a large number of relatively uncorrelated models (trees) operating as a committee will outperform any of the individual constituent models. Thus, the low correlation between the decision trees is crucial.

A decision tree will be able to categorize a data entry between a set of given classes, in our case, between 1 and 0. An ensemble of decision trees will obtain the result of each decision tree and determine the category comparing the amount of predicted 1 with the amount of predicted 0.

To ensure the low correlation between models in a random forest the algorithm uses two methods:

Bagging: Each tree will use the same amount of data from the dataset (N), although it will be randomly sampled from the dataset with replacement.

Feature randomness: Each tree in a random forest will have a random subset of features that will be selected amongst the original ones.
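The two sources of randomness can be illustrated with NumPy index sampling. This is a toy sketch following the description above, not scikit-learn's internal code: the sizes and the square-root rule for the number of features are assumptions, and scikit-learn itself draws the feature subset at every split rather than once per tree.

```python
import numpy as np

rng = np.random.default_rng(42)
n_rows, n_features = 10, 6

# Bagging: draw N row indices with replacement, so some rows repeat
# and others are left out of this tree's training sample
bootstrap_rows = rng.integers(0, n_rows, size=n_rows)

# Feature randomness: pick a random subset of distinct feature indices
k = int(np.sqrt(n_features))  # assumed subset size
feature_subset = rng.choice(n_features, size=k, replace=False)

print(sorted(bootstrap_rows.tolist()))
print(sorted(feature_subset.tolist()))
```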

In this project the Random Forest algorithm has been implemented using the scikit-learn module sklearn.ensemble.RandomForestClassifier[19] and the following parameters:

Number of estimators: 300

Criterion: entropy

3.2.5 Support Vector Machine (SVM)

The Support Vector Machine algorithm tries to find a hyperplane in an N-dimensional space that separates the data points into different categories. In this project, the hyperplane will divide the N-dimensional space into the two classes, 0 and 1. Depending on the problem we face, a Linear SVM, an RBF (Gaussian) SVM, or a Polynomial SVM may be used. These algorithms differ in the kernel they use:


Figure 7: 3-category SVM examples using Linear kernel, RBF kernel and Polynomial kernel using the Iris dataset. Image from: Scikit-learn webpage[14]

The Kernel we use in the algorithm will determine the shape of the function that classifies the data points.

In this project the Support Vector Machine algorithm has been implemented using the scikit-learn module sklearn.svm.SVC[20], the rbf kernel (Gaussian) and the default parameters.
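For reference, the five classifiers could be instantiated in scikit-learn with the parameters listed in this section roughly as follows. This is a sketch, not the project's code: every parameter not mentioned in the text is left at its library default, and the dictionary name `classifiers` is an assumption. Note that in scikit-learn the `learning_rate` option of `MLPClassifier` only takes effect with the `sgd` solver, which the text does not specify.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

classifiers = {
    # The L2 penalty is scikit-learn's default, so only the solver is set
    "Logistic Regression": LogisticRegression(solver="saga"),
    "MLP": MLPClassifier(
        hidden_layer_sizes=(7, 6, 5, 4, 3, 2, 1),
        activation="relu",
        learning_rate="invscaling",
    ),
    "KNN": KNeighborsClassifier(
        n_neighbors=200, weights="distance", n_jobs=6, leaf_size=30
    ),
    "Random Forest": RandomForestClassifier(n_estimators=300, criterion="entropy"),
    "SVM": SVC(kernel="rbf"),
}

for name, model in classifiers.items():
    print(name, type(model).__name__)
```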

3.3 Cross Validation

Cross Validation is a technique used to obtain a more accurate evaluation of the predictor being used. With Cross Validation, the dataset is divided into different subsets and, for each iteration, one subset is used as the validation set and the remaining ones are used as the training set. This way we can use Cross Validation to try to find the best parameters to train our model. Once the model for an iteration is trained, its accuracy is saved. At the end of all the iterations we will have K accuracy scores.⁷

In this project we will be using cross validation with the complete dataset, because once all the models are completed we will use the different scores to obtain the mean and variance of the model’s accuracy. This decision has been made because the amount of data on the first and second dataset was thought to be too low to divide in three subsets (testing, validation and training).
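A minimal sketch of this evaluation with scikit-learn is shown below. It is an assumption-laden illustration: the data is synthetic (generated with `make_classification`), the number of folds and the classifier are arbitrary choices, and the project's actual evaluation code is not shown in the text.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_validate

# Synthetic stand-in for the stock dataset
X, y = make_classification(n_samples=400, n_features=8, random_state=0)

cv = KFold(n_splits=4, shuffle=True, random_state=0)
results = cross_validate(
    LogisticRegression(max_iter=1000), X, y, cv=cv, scoring="accuracy"
)

scores = results["test_score"]  # one accuracy score per fold
print("mean accuracy:", scores.mean())
print("variance:", scores.var())
print("total training time (s):", results["fit_time"].sum())
```

`cross_validate` also reports the fit time of each fold, which covers the training-time comparison mentioned in the objectives.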


Figure 8: Cross Validation example with k = 4

3.4 Shuffle Split

Shuffle Split is a random permutation cross-validator. In this case, instead of splitting the dataset into k disjoint sets, it is split into k randomly selected sets. With Shuffle Split, two sets may turn out identical, although this is highly improbable.
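The behaviour can be seen with scikit-learn's `ShuffleSplit` on a tiny array. The sizes used here (10 rows, 4 iterations, 20% held out) are assumptions for illustration and are not the project's settings.

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

X = np.arange(20).reshape(10, 2)  # 10 toy rows

splitter = ShuffleSplit(n_splits=4, test_size=0.2, random_state=0)
splits = list(splitter.split(X))

for train_idx, test_idx in splits:
    # Each iteration draws a fresh random partition of the row indices,
    # so the same row may appear in the test set of several iterations
    print("train:", train_idx, "test:", test_idx)
```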

Figure 9: Shuffle Split example with 4 iterations. Image from: Scikit-learn webpage[15]

3.5 Principal Component Analysis (PCA)

Principal Component Analysis is used to reduce the dimensionality of large data sets by projecting a large set of variables onto a lower-dimensional space that still contains most of the information in the large set. It is important to use PCA with some sort of scaling, because it is quite sensitive to the variances of the variables. If there are large differences between the ranges of the initial values, the variables with larger ranges will dominate over those with small ranges.
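One common way to combine scaling and PCA in scikit-learn is a pipeline, sketched below. The synthetic data, the choice of `StandardScaler` and the number of components are assumptions for illustration; the text does not state which scaler or how many components the project used.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
base = rng.normal(size=200)
# Two correlated features on very different scales plus one noise feature
X = np.column_stack([
    base * 1000.0,
    base + 0.1 * rng.normal(size=200),
    rng.normal(size=200),
])

# Scaling first prevents the large-range feature from dominating the components
pipeline = make_pipeline(StandardScaler(), PCA(n_components=2))
X_reduced = pipeline.fit_transform(X)

explained = pipeline.named_steps["pca"].explained_variance_ratio_
print(X_reduced.shape, explained)
```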


4 Methodology

The process followed in this project has been: researching the important topics (Machine Learning algorithms and finance), implementing the data retrieval process to create the dataset, and writing the code to test each algorithm. For the last two parts of the project there have been three trials to test the algorithms with different datasets. Finally, all results were stored in an Excel spreadsheet to facilitate the comparison among all algorithms.

4.1 Machine Learning and Finance Research

To learn about Machine Learning and classification algorithms, the course Machine Learning A-Z: Hands-On Python & R In Data Science[7] was used; it focuses on how to use the different algorithms, with a practical approach to Machine Learning. Another course used to research and obtain knowledge was the Deep Learning Specialization Course[10], which focuses mainly on a theoretical approach to Neural Networks. The master's thesis Predicting Stock Prices Using Technical Analysis and Machine Learning[9] has an introduction that discusses using the crossing of Moving Averages as a signal to know whether to buy or sell a stock, which inspired me to research this topic. After finding this thesis, the book Ganar en la bolsa es posible[1] was consulted to obtain more information about the signs that indicate a buy or sell call with Moving Averages.

4.2 Creation of the dataset

To begin the process of retrieving data, different APIs were explored. Among them were the Yahoo Finance API[23] and the Alpha Vantage API[2], the latter being the one chosen for the project. The choice was based on the structure of the data obtained from each API and the queries needed to receive said information.

The next important decision was to choose between two options: creating the dataset by obtaining the data directly from the API, or storing the information first in a database to process it later and build the dataset. The decision was guided by an important constraint: the Alpha Vantage API limits the number of queries that can be made per second and per day. Hence, a database was needed to store as much data as possible and then create the dataset directly from the database. MongoDB was used in AWS⁸ to keep the information, since the lack of a fixed structure allows more freedom in creating the objects of the database.

After all the previous choices were made, a Python module[13] was created to upload and download the information from the database using pymongo[12]. A script was written to collect all of the daily data from several randomly selected stocks and upload it to the database. At each of the three trials, more stocks were added to the list until the last execution of the code. In the end, the stocks were selected from a file containing all NASDAQ⁹ stocks until the database ran out of space. This whole process is better described in figure 10, where we can see all of the network architecture involved in the process.

⁸ Amazon Web Services


Figure 10: Architecture implemented in the project
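The collection step described above (query the API without exceeding its rate limit, then store each response in the database) can be sketched as follows. This is only an illustration: the function `collect_daily_data`, its parameters and the 12-second default interval are hypothetical and are not taken from the project's code. In a real run, `fetch` would wrap the HTTP request to the data API and `store` would wrap the pymongo write; both are passed in as plain callables here so that the sketch stays self-contained.

```python
import time


def collect_daily_data(symbols, fetch, store, min_interval_s=12.0,
                       sleep=time.sleep, clock=time.monotonic):
    """Fetch one document per symbol, waiting between calls to respect a rate limit.

    All names and the default interval are illustrative assumptions.
    """
    last_call = None
    stored = 0
    for symbol in symbols:
        if last_call is not None:
            # Wait until at least min_interval_s has passed since the last query.
            wait = min_interval_s - (clock() - last_call)
            if wait > 0:
                sleep(wait)
        last_call = clock()
        document = fetch(symbol)   # e.g. the daily time series of the stock
        store(symbol, document)    # e.g. an insert into the database
        stored += 1
    return stored


# Usage with in-memory stand-ins for the API and the database.
fake_db = {}
count = collect_daily_data(
    ["AAA", "BBB"],
    fetch=lambda s: {"symbol": s, "daily": []},
    store=lambda s, d: fake_db.__setitem__(s, d),
    min_interval_s=0.0,
)
print(count, sorted(fake_db))
```

Injecting `sleep` and `clock` keeps the throttling logic easy to check without real waiting.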

Another script was written to download all the information from the database for each stock's value, from the IPO¹⁰ to the present date. The stock value was compared to the values from 3 to 5 months later. If the maximum value from the future was bigger than the current value, the row was cataloged as a buying opportunity (1), and if the minimum value from the future was smaller than the current value, the row was cataloged as a selling opportunity (0). Finally, one more module was programmed to calculate the different indicators used in the predictions, which were added to each row. All of the rows were written to .txt, .xlsx and .csv files to be used as a dataset.
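The labelling rule can be sketched with pandas as shown below. This is a hedged illustration and not the project's script: the function name `label_rows` is hypothetical, the buy condition is checked first when both conditions hold, and rows without a complete 3 to 5 month future window are skipped. These three choices are assumptions, since the text does not specify them.

```python
import pandas as pd


def label_rows(close: pd.Series, min_months: int = 3, max_months: int = 5) -> pd.Series:
    """Label each date 1 (buy) or 0 (sell) from the values 3 to 5 months later.

    `close` must be indexed by a sorted DatetimeIndex.
    """
    labels = {}
    last_date = close.index[-1]
    for date, value in close.items():
        start = date + pd.DateOffset(months=min_months)
        end = date + pd.DateOffset(months=max_months)
        if end > last_date:
            continue  # assumption: skip rows whose future window is incomplete
        future = close.loc[start:end]
        if future.empty:
            continue
        if future.max() > value:      # assumption: the buy rule takes precedence
            labels[date] = 1
        elif future.min() < value:
            labels[date] = 0
    return pd.Series(labels, dtype="int64")


# Usage on synthetic prices (illustrative values only).
dates = pd.date_range("2020-01-01", periods=300, freq="D")
rising = pd.Series(range(300), index=dates, dtype=float)
labels_up = label_rows(rising)
print(labels_up.value_counts())
```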

Every version of the dataset was composed¹¹ of the following columns:

- Stock name
- Initial Date
- Initial Value
- 6 day EMA
- 70 day EMA
- 200 day EMA
- Histogram function
- MACD function
- Signal function
- Result

¹⁰ Initial Public Offering

¹¹ Columns with the final value, EMAs and MACD were added to the dataset in case they were needed in some stages of the project, although they were never used
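The indicator columns listed above can be computed with pandas exponentially weighted means, as in the sketch below. The function name `add_indicators` is hypothetical, and the MACD parameters (12 and 26 day EMAs with a 9 day signal line) are the conventional defaults, assumed here because this section does not state the values used in the project.

```python
import pandas as pd


def add_indicators(close: pd.Series) -> pd.DataFrame:
    """Build the EMA and MACD columns from a series of daily values."""
    def ema(series: pd.Series, span: int) -> pd.Series:
        return series.ewm(span=span, adjust=False).mean()

    macd = ema(close, 12) - ema(close, 26)   # assumed conventional MACD spans
    signal = ema(macd, 9)                    # assumed conventional signal span
    return pd.DataFrame({
        "Initial Value": close,
        "6 day EMA": ema(close, 6),
        "70 day EMA": ema(close, 70),
        "200 day EMA": ema(close, 200),
        "Histogram function": macd - signal,
        "MACD function": macd,
        "Signal function": signal,
    })


# Usage on a short synthetic series (illustrative values only).
prices = pd.Series([10.0, 10.5, 10.2, 10.8, 11.0, 10.9])
print(add_indicators(prices).round(3))
```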

Three versions of the dataset were created, each with an increasing number of rows. Thanks to the automation of the data collection, the diversity of the data increased with each version:

Version of the dataset   Number of rows   Number of different stocks added
1                        92120            170
2                        101509           190
3                        720668           1867

Table 1: Number of rows and stocks for each dataset

All the code used to collect the data and create the dataset is in a GitHub repository that can be accessed using the link in the reference section [3].

As we can see, the number of different stocks in the last dataset increases drastically, thus raising the diversity.

The schema of the creation of the dataset as described in this section is as follows:


4.3 Testing the algorithms

Once the first version of the dataset was obtained, the creation of the algorithms began. Using Jupyter Notebook, three different templates were created. Each template has different parameters and was used with every algorithm. The process of calculating the scores for each algorithm was done three times, once for each dataset.

4.3.1 Testing with Cross Validation (CV)

The template used in the testing with cross validation, as shown in Annex 1, enables us to observe the different results with Standard Scaling for each algorithm using Cross Validation. It is important to note that a timer is set at the beginning and end of the cross validation to know the time that it takes to run the whole algorithm. For all algorithms the dataset is split into 10 subsets to do the cross validation.
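A compact sketch of this template is shown below. It assumes synthetic data from `make_classification` and a Logistic Regression classifier as stand-ins for the project's dataset and for whichever algorithm is being tested; only the 10-fold split, the Standard Scaling and the timer come from the description above.

```python
import time

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the dataset (assumption).
X, y = make_classification(n_samples=1000, n_features=7, n_informative=4,
                           random_state=0)
X = StandardScaler().fit_transform(X)

classifier = LogisticRegression(max_iter=1000)  # stand-in classifier

start = time.perf_counter()
scores = cross_val_score(classifier, X, y, cv=10)  # 10 subsets, as in the text
elapsed = time.perf_counter() - start

print(f"time: {elapsed:.2f}s")
print(f"accuracy: {scores.mean():.2f} (+/- {scores.std():.2f})")
```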

4.3.2 Testing with Cross Validation and Shuffle Split

In this template the Shuffle Split function is added before the Cross Validation to generate random splits of the dataset; in this case the dataset is not evenly separated. The same timer is set before doing the Cross Validation to know how long it takes to obtain the results. The Shuffle Split is run for 10 iterations with a test size of 0.2. We can see the code used in Jupyter Notebook in Annex 1.
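The behaviour of the splitter itself can be inspected in isolation. The sketch below uses 50 placeholder samples and a fixed `random_state`; both are assumptions made for reproducibility and are not settings of the project.

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

X = np.arange(50).reshape(-1, 1)  # 50 placeholder samples (assumption)
cv = ShuffleSplit(n_splits=10, test_size=0.2, random_state=0)

# Each iteration draws a fresh random 80/20 partition of the samples.
split_sizes = [(len(train), len(test)) for train, test in cv.split(X)]
test_sets = [set(test.tolist()) for _, test in cv.split(X)]

print(split_sizes[0])
print("distinct samples ever tested:", len(set.union(*test_sets)))
```

Unlike a K-fold split, the test sets of different iterations are drawn independently, so a sample can be tested several times or not at all.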

4.3.3 Testing with Principal Component Analysis

The last template uses Principal Component Analysis combined with Standard Scaling before using Cross Validation in combination with Shuffle Split to obtain the accuracy scores. A timer was added to measure the time needed to obtain said scores. We can see the template in Annex 1.
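One possible way to chain these steps is a scikit-learn pipeline, sketched below on synthetic data with a Logistic Regression stand-in (both assumptions). Note that this is a variation and not the template itself: a pipeline refits the scaler and the PCA on the training part of every split, whereas the template in the annex transforms the whole dataset once before the Cross Validation.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import ShuffleSplit, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the dataset (assumption).
X, y = make_classification(n_samples=1000, n_features=7, n_informative=4,
                           random_state=0)

# PCA() with default arguments keeps min(n_samples, n_features) components.
model = make_pipeline(StandardScaler(), PCA(), LogisticRegression(max_iter=1000))
cv = ShuffleSplit(n_splits=10, test_size=0.2, random_state=0)

pca_scores = cross_val_score(model, X, y, cv=cv)
print(f"accuracy: {pca_scores.mean():.2f} (+/- {pca_scores.std():.2f})")
```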

5 Results

The goal of this section is to show the results for each iteration done with the different datasets, as well as to represent the results in a "user-friendly" manner.

It is important to define two main terms that will be repeated during the whole section. Mean accuracy is the mean of all the accuracies obtained from the Cross Validation for each algorithm. The accuracy is the proportion of correctly predicted cases over the total number of cases. The deviation of the accuracy is the standard deviation of the accuracies obtained from the Cross Validation.
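The two terms can be written compactly as follows. The symbols are notation introduced here for illustration: K is the number of Cross Validation iterations (10 in this project) and a_k is the accuracy of iteration k. The population form of the standard deviation (division by K) is an assumption, chosen because it is the default of NumPy's `std`.

```latex
\[
a_k = \frac{\text{correctly predicted cases in iteration } k}{\text{total cases in iteration } k},
\qquad
\bar{a} = \frac{1}{K}\sum_{k=1}^{K} a_k,
\qquad
\sigma_a = \sqrt{\frac{1}{K}\sum_{k=1}^{K}\left(a_k - \bar{a}\right)^2}
\]
```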


5.1 First dataset

The first dataset was composed of 92120 rows and included 170 different well-known stocks.

Algorithm name                                    Time spent training   Mean accuracy   Deviation of the accuracy
Logistic Regression with CV                       0:00:03               65%             0%
Logistic Regression with CV & ShuffleSplit        0:00:03               65%             0%
Logistic Regression with CV, ShuffleSplit & PCA   0:00:05               52%             2%
Random Forest with CV                             0:01:39               66%             1%
Random Forest with CV & ShuffleSplit              0:09:18               68%             1%
Random Forest with CV, ShuffleSplit & PCA         0:12:50               61%             0%
MLP with CV                                       0:00:34               65%             0%
MLP with CV & ShuffleSplit                        0:00:35               65%             0%
MLP with CV, ShuffleSplit & PCA                   0:01:04               64%             0%
K-NN with CV                                      0:00:26               65%             0%
K-NN with CV & ShuffleSplit                       0:00:23               65%             0%
K-NN with CV, ShuffleSplit & PCA                  0:00:22               65%             0%
Gaussian SVM with CV                              1:59:43               65%             0%
Gaussian SVM with CV & ShuffleSplit               3:55:05               65%             0%
Gaussian SVM with CV, ShuffleSplit & PCA          0:40:06               65%             0%


5.2 Second dataset

The second dataset was composed of 101509 rows and included 190 different well-known stocks.

Time spent training   Mean accuracy   Deviation of the accuracy
0:00:05               64%             1%
0:00:04               64%             1%
0:00:07               53%             1%
0:06:07               66%             1%
0:10:09               68%             0%
0:00:29               64%             0%
2:20:49               65%             0%
4:10:38               65%             0%
0:48:56               64%             1%


5.3 Third dataset

The third dataset was composed of 720668 rows and included 1867 different stocks. The results of the SVM algorithm could not be obtained due to the large number of samples and the processing power needed.

Time spent training   Mean accuracy   Deviation of the accuracy
0:00:57               63%             0%
0:00:50               63%             0%
0:00:38               50%             0%
1:11:40               70%             2%
2:52:58               71%             0%
0:07:31               63%             0%
N/A                   N/A             N/A


5.4 Analysis of the results

As we can see in the tables above, in most cases the use of Principal Component Analysis has reduced the training time, especially when used before the Support Vector Machine algorithm, as seen in figure 12.

Figure 12: SVM’s time comparison between models and datasets

PCA works best on a dataset with a large number of features and samples; when we increase the number of samples, the time spent doing Cross Validation decreases, as seen in figure 13.

Figure 13: Random Forest’s time comparison between models and datasets

When we take the accuracy into account in these graphs we can see that, in the case of the Random Forest algorithm, the slope is higher when Principal Component Analysis is used than when only Shuffle Split is used. This means that the more we increase our dataset, the more we will need PCA to avoid higher training times. However, for the amount of data in the first three versions of the dataset, the algorithm without PCA obtains a much more accurate prediction.

Figure 14: Random Forest’s time comparison between models and datasets

We can see in table 5 the best configurations for each algorithm. Most of the results come from the first version of the dataset.

Algorithm name                               Version of the dataset   Time spent training   Mean accuracy   Deviation of the accuracy
Random Forest                                3                        2:52:58               71%             0%
MLP with CV                                  1                        0:00:34               65%             0%
KNN with CV, ShuffleSplit & PCA              1                        0:00:22               65%             0%
SVM (Gaussian) with CV, ShuffleSplit & PCA   1                        0:40:06               65%             0%
Logistic Regression with CV                  1                        0:00:03               65%             0%

Table 5: Best results for each algorithm and the iteration of the dataset

Observing figure 15, we can see the evolution of the previous best configurations throughout the different datasets. For all algorithms except Random Forest, the mean accuracy decreases when the amount of data is increased. This may be due to overfitting with the first dataset. The low amount of data and the low diversification of stocks may have caused overfitting in the models, thus resulting in a better accuracy (65%) with the first dataset. As the diversification of the dataset (number of different stocks) grew, the mean accuracy decreased.


Figure 15: One configuration of each algorithm with the lowest time and best accuracy. The MLP and KNN values are difficult to see in the graph as they are too similar to the ones obtained with Logistic Regression

Another possible explanation for the evolution in figure 15 is that the drop in accuracy may be caused by keeping the algorithms' parameters constant. When the amount and diversity of the data increased, the parameters of all the algorithms remained the same, and therefore the ability of each algorithm to predict decreased.

To determine which explanation applies in this project, a small experiment was carried out. Some of the algorithms' parameters were modified to try to better fit the last dataset.

Figure 16: Accuracy results for the KNN algorithm when modifying the number of neighbors

As we can see in figure 16, increasing the number of neighbors does not increase the accuracy.
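An experiment of this kind can be sketched as a loop over the number of neighbors. The data below is synthetic and the candidate values of k are arbitrary; neither corresponds to the dataset or to the values explored in the project.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the dataset (assumption).
X, y = make_classification(n_samples=600, n_features=7, n_informative=4,
                           random_state=0)
X = StandardScaler().fit_transform(X)

knn_results = {}
for k in (1, 5, 15, 45):  # illustrative values of n_neighbors
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=10)
    knn_results[k] = (scores.mean(), scores.std())

for k, (mean, std) in knn_results.items():
    print(f"k={k:>2}: accuracy {mean:.2f} (+/- {std:.2f})")
```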


amount of perceptrons in the layers, as we see in the table below:

Layers                 Number of layers   Mean accuracy   Deviation of the accuracy
7,6,5,4,3,2,1          7                  63%             0%
7,7,6,5,4,3,2,1        8                  63%             2%
7,8,9,5,10,3,5,1       8                  63%             3%
7,7,6,5,4,5,2,1        8                  63%             0%
7,6,7,8,5,4,5,2,1      9                  63%             0%
7,6,6,7,8,5,4,5,2,1    10                 63%             0%

Table 6: MLP with CV results when increasing the number of layers

When increasing the number of trees in the Random Forest algorithm, the maximum accuracy still remains the one obtained with the first parameters:

Figure 17: Accuracy results for the Random Forest algorithm when modifying the number of trees
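A sweep over the number of trees can be sketched with scikit-learn's `validation_curve`, which cross-validates the estimator once per parameter value. The synthetic data, the tree counts and the 5-fold setting below are assumptions chosen to keep the example fast; they are not the values used in the project.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import validation_curve

# Synthetic stand-in for the dataset (assumption).
X, y = make_classification(n_samples=400, n_features=7, n_informative=4,
                           random_state=0)

tree_counts = [10, 50, 100]  # illustrative values of n_estimators
train_scores, test_scores = validation_curve(
    RandomForestClassifier(random_state=0), X, y,
    param_name="n_estimators", param_range=tree_counts, cv=5,
)

# One row per tree count, one column per fold.
mean_accuracy = dict(zip(tree_counts, test_scores.mean(axis=1)))
for n_trees, accuracy in mean_accuracy.items():
    print(f"{n_trees:>3} trees: accuracy {accuracy:.2f}")
```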

6 Economic and Environmental Impact

6.1 Economical analysis

The economical analysis of this project is not complex, since all of the code has either been programmed during the thesis or is Open Source (scikit-learn, pymongo, LaTeX).

All of the thesis has been done on a computer, and the data has been stored in a database using an AWS server with a cost of 0.65 euros/h. The total cost of the server has been 468 euros; to simplify the calculations, the database server was assumed to be up and running during the whole six months.


Adding up the usual materials such as paper, pens, chair and desk, together with the average salary of a junior engineer from UPC and the utilities, the result is a total cost of 9,936.00 euros.

Figure 18: Cost calculation

6.2 Environmental Impact

The ability to bring the stock market to everyone has an incredible potential, and may impact the environment in unexpected ways. All companies have a strong effect on the environment, some more than others, and increasing the number of people investing in those companies might empower them to be even more ruthless in some cases.

Thankfully, more and more people are concerned with global warming and climate change, and even though a company might be more profitable while polluting, the number of people that are starting to invest only in environmentally friendly companies or funds has increased.

An additional aspect to mention in this section is the impact that a computer or database server has on the environment, especially if it is running all of the time, even when it is not being used. To simplify the calculations in the Economical analysis (section 6.1) we said that the server was up and running during the six months, but this was not really the case: the server was only running while it was being used.

7 Conclusions and next steps

The goal of this section is to review the results and compare them with similar projects to create some context for the thesis, as well as to define the next steps that can be taken to further develop a predictor that may help the average citizen.

7.1 Conclusions

After analyzing the results in section 5 we can observe that Random Forest has outperformed the other algorithms, especially when increasing the diversity of data in the dataset. The more data we had, and the more we diversified it, the better the Random Forest algorithm performed¹², while the accuracy of the remaining algorithms diminished when the dataset grew.


During the Analysis of the results (section 5.4) we observed that the testing accuracy did not improve when using more training data for most algorithms. Two explanations were discussed: either the decrease of the accuracy was caused by the fact that the algorithms' parameters were not increased when more training data was used, or it was caused by overfitting in the first dataset due to the poor diversity of stocks.

Regarding the first explanation, we have seen that the accuracy did not improve when increasing the number of parameters. This means that the models did not lose performance because of the choice of the number of parameters.

In figure 15 we can see that the only algorithm whose accuracy has not decreased is the Random Forest. This is due to the nature of the algorithm: as explained in section 3.2.4 Random Forest, the algorithm is an ensemble, which means that it is composed of many decision trees. In this project the number of trees in the algorithm is 300. An ensemble works under the assumption that many uncorrelated errors average out to zero. Since each tree learns from a different subset of our data, the trees are fairly uncorrelated with one another, thus making the Random Forest algorithm more robust to overfitting than the other algorithms. All of this would explain why the accuracy of all the algorithms decreased except for the Random Forest.
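The assumption that uncorrelated errors average out can be illustrated with a small simulation of a majority vote. The sketch below idealises the situation: it assumes 300 fully independent voters that are each right 60% of the time, which is not a measurement from this project. Real trees are only partially uncorrelated, so the gain in an actual Random Forest is smaller.

```python
import numpy as np

rng = np.random.default_rng(0)
n_cases, n_voters, p_correct = 10_000, 300, 0.6  # illustrative assumptions

# True where a voter predicts a case correctly; voters err independently.
votes_correct = rng.random((n_cases, n_voters)) < p_correct

single_accuracy = votes_correct[:, 0].mean()
# The ensemble is right when more than half of its voters are right.
ensemble_accuracy = (votes_correct.mean(axis=1) > 0.5).mean()

print(f"single voter:  {single_accuracy:.3f}")
print(f"majority vote: {ensemble_accuracy:.3f}")
```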

To create some context for the results, the article "Predicting the daily return direction of the stock market using hybrid machine learning algorithms"[6] discusses the results of Machine Learning projects that aim to predict the movement of a stock for the next day. In the article the authors mention that, for direction forecast¹³, they have a lower accuracy (around 60%).

Furthermore, the aim of this project was to help regular people enter the world of stock markets, and the regular users of the trained algorithm would have a 50/50 chance of being right if they had no prior knowledge.

In conclusion, Random Forest would be the chosen algorithm as the most suited to predict whether to buy or sell a stock in a medium amount of time¹⁴, because even though the training time is far larger than that of the others, it increases the accuracy (71%) and will be more robust to overfitting.

7.2 Next Steps

To further develop this thesis there are two main follow-up projects that can be implemented:

- Create a bot to test the Machine Learning algorithms in real time.
- Use other algorithms to predict the stock's future value within the same range of time.

¹³ Direction forecast refers to the prediction of the trend, up or down

¹⁴ Medium means 3 to 5 months, as explained in the Abstract


References

[1] Josef Ajram. Ganar en la bolsa es posible. Plataforma Editorial, 2011.
[2] Alpha Vantage API. https://www.alphavantage.co/.
[3] Code in GitHub. https://gitfront.io/r/user-6644703/1b98564a45c8f096a86892ec04283ce4ac2b0660/FinancialDataCollection/.
[4] DataCamp. https://www.datacamp.com/community/tutorials/k-nearest-neighbor-classification-scikit-learn.
[5] DeepAI. https://deepai.org/machine-learning-glossary-and-terms/perceptron.
[6] Xiao Zhong & David Enke. Predicting the daily return direction of the stock market using hybrid machine learning algorithms. 2019.
[7] Kirill Eremenko. Machine Learning A-Z: Hands-On Python and R In Data Science. https://www.udemy.com/course/machinelearning/learn/lecture/19678456#overview.
[8] Fundamental Analysis. https://corporatefinanceinstitute.com/resources/knowledge/trading-investing/fundamental-analysis/.
[9] Jan Ivar Larsen. Predicting Stock Prices Using Technical Analysis and Machine Learning. https://core.ac.uk/download/pdf/52104888.pdf, 2010.
[10] Andrew Ng. Deep Learning Specialization Course. https://www.coursera.org/specializations/deep-learning?utm_source=gg&utm_medium=sem&utm_content=07-StanfordML-ROW&campaignid=2070742271&adgroupid=80109820241&device=c&keyword=machine%20learning%20mooc&matchtype=b&network=g&devicemodel=&adpostion=&creativeid=369041663186&hide_mobile_promo&gclid=Cj0KCQjwk8b7BRCaARIsAARRTL5I3M5ATdzhXM2-7o5zXJB2SMWK3RgRB7f1v9ulpKjh8k8kDUf6W_QaAmYFEALw_wcB#courses.
[11] Plus500. https://app.plus500.com/.
[12] PyMongo. https://pymongo.readthedocs.io/en/stable/.
[13] Python. https://www.python.org/.
[14] Scikit-Learn. https://scikit-learn.org/stable/modules/svm.html.
[15] Shuffle Split in Scikit-Learn. https://scikit-learn.org/stable/modules/cross_validation.html.
[16] Sklearn K-Nearest Neighbors module. https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html.
[17] Sklearn Logistic Regression module. https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html.
[18] Sklearn Multi Layer Perceptron module. https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html.
[19] Sklearn Random Forest module. https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html.
[20] Sklearn Support Vector Machine module. https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html.
[21] Richard O. Duda & Peter E. Hart & David G. Stork. Pattern Classification. John Wiley & Sons Inc, 1973.
[22] Technical Analysis. https://corporatefinanceinstitute.com/resources/knowledge/trading-investing/technical-analysis/.


Annex 1: Jupyter templates

[1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from datetime import datetime
from dateutil.relativedelta import relativedelta
from sklearn.preprocessing import StandardScaler
#Import classifier from sklearn
from sklearn.model_selection import cross_val_score

[ ]:
#Import dataset
dataset = pd.read_csv('path/to/dataset')
X = dataset.iloc[:, 2:-9].values
y = dataset.iloc[:, -1].values

[ ]:
#Scaling Data
sc = StandardScaler()
X = sc.fit_transform(X)

[4]:
#Create Model
classifier = ...  # placeholder: instantiate the sklearn classifier under test here

[ ]:
#KFold split:
start = datetime.now()
scores = cross_val_score(classifier, X, y, cv=10)
finish = datetime.now()
t_diff = relativedelta(finish, start)
print('{h}h {m}m {s}s'.format(h=t_diff.hours, m=t_diff.minutes, s=t_diff.seconds))

[ ]:
#Print Accuracy with validation
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std()))


[ ]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from datetime import datetime
from dateutil.relativedelta import relativedelta
from sklearn.preprocessing import StandardScaler
#Import classifier from sklearn
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import ShuffleSplit

[ ]:
#Import dataset
dataset = pd.read_csv('path/to/dataset')
X = dataset.iloc[:, 2:-9].values
y = dataset.iloc[:, -1].values

[ ]:
#Scaling Data
sc = StandardScaler()
X = sc.fit_transform(X)

[ ]:
#Creating Model

[ ]:
#Calculating the cv parameter
cv = ShuffleSplit(n_splits=10, test_size=0.2)

[ ]:
#Cross Validation with Shuffle Split:
start = datetime.now()
scores = cross_val_score(classifier, X, y, cv=cv)
finish = datetime.now()
t_diff = relativedelta(finish, start)
print('{h}h {m}m {s}s'.format(h=t_diff.hours, m=t_diff.minutes, s=t_diff.seconds))

[ ]:
#Print Accuracy


[ ]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from datetime import datetime
from dateutil.relativedelta import relativedelta
from sklearn.preprocessing import StandardScaler
#Import classifier model from sklearn
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import ShuffleSplit
from sklearn.decomposition import PCA

[ ]:
#Import dataset
dataset = pd.read_csv('path/to/dataset')
X = dataset.iloc[:, 2:-9].values
y = dataset.iloc[:, -1].values

[ ]:
#Scaling data
sc = StandardScaler()
X = sc.fit_transform(X)

[ ]:
#Principal Component Analysis
n_samples = X[:, 0].size
n_features = X[0].size
pca = PCA(n_components=min(n_samples, n_features))
X = pca.fit_transform(X)

[ ]:
#Create Model

[ ]:
cv = ShuffleSplit(n_splits=10, test_size=0.2)

[ ]:
#Cross Validation:
start = datetime.now()
scores = cross_val_score(classifier, X, y, cv=cv)
finish = datetime.now()
t_diff = relativedelta(finish, start)
print('{h}h {m}m {s}s'.format(h=t_diff.hours, m=t_diff.minutes, s=t_diff.seconds))

[ ]:
#Print Accuracy


Annex 2: Gantt Diagram

To help with the planning of the whole project, a Gantt diagram was made during the first two weeks to divide the thesis into smaller work packages, each containing some tasks.

Figure 19: Gantt diagram

During the last work packages some difficulties arose as the last dataset was formed. A change in the configuration of the computer modified the way decimal numbers were interpreted, from a dot to a comma. The database information remained the same, so some errors would have been introduced into the models when the third training began. After analyzing the dataset, the error was spotted in some of the rows. The number of rows affected by this problem was small compared to the size of the dataset, and the decision was taken to remove these rows.
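One way to spot and remove such rows is sketched below with pandas. The small inline table, its column names and the `;` separator are illustrative assumptions and do not reproduce the project's files. The idea is to read the numeric column as text, convert it with `pd.to_numeric`, and drop the rows whose value cannot be parsed because it uses a decimal comma.

```python
import io

import pandas as pd

# Illustrative data: the second row uses a decimal comma (assumption).
raw = io.StringIO(
    "Stock name;Initial Value;Result\n"
    "AAA;10.5;1\n"
    "BBB;11,2;0\n"
    "CCC;9.8;1\n"
)
df = pd.read_csv(raw, sep=";", dtype=str)

# Values that are not valid dot-decimal numbers become NaN.
values = pd.to_numeric(df["Initial Value"], errors="coerce")
affected = df[values.isna()]
clean = df[values.notna()].assign(**{"Initial Value": values[values.notna()]})

print(f"removed {len(affected)} of {len(df)} rows")
print(clean)
```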
