Machine Learning is a field of research that focuses on extracting information from datasets. If the dataset is very large, the field is also often referred to as Big Data or Data Mining. There are countless algorithms in Machine Learning, with inputs ranging from numeric and categorical to text-based. The applications today seem endless: We have the first self-driving cars, which have learned to drive via Neural Networks; we have smartphone keyboards that predict the next word based on the user's individual writing style; researchers are working on algorithms that can predict illness from a set of measured attributes or even a person's genome; and many more. However, many of these application scenarios involve sensitive data: people do not feel safe sending, e.g., their medical data to a service provider, because they either do not trust the provider or are worried about a data breach even if they do trust the provider. This has led to Machine Learning being a popular topic in the context of privacy-preserving computations in general, and Fully Homomorphic Encryption in particular.

Generally, Machine Learning can be divided into two categories: supervised and unsupervised learning.

5.1.1.1 Supervised Learning

In supervised learning, there is a dataset consisting of inputs and the correct outputs (which can be numerical values or classes into which the data is split) for these inputs. The goal is to build a model that correctly assigns these inputs to the outputs. This model can then be used to compute the outputs for new inputs, for which we do not know the correct answers.

Thus, algorithms from supervised learning consist of two parts: In the training phase, we build the model from the set of data with known outputs. The details of the method for deriving the model from the data are of course specific to the concrete Machine Learning algorithm we are using. We will implement this phase on encrypted data in Section 5.3 for the Perceptron, which is the earliest Neural Network.
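As a point of reference for the training phase, the following is a minimal plaintext sketch of the classic Perceptron update rule. It is not the encrypted variant developed in Section 5.3; the function names, the label encoding (+1 and -1), and the parameters `epochs` and `learning_rate` are assumptions made for illustration.

```python
def train_perceptron(samples, labels, epochs=10, learning_rate=1.0):
    """Classic Perceptron rule on plaintext data; labels are +1 or -1."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            activation = sum(w * xi for w, xi in zip(weights, x)) + bias
            predicted = 1 if activation >= 0 else -1
            if predicted != y:
                # Shift the separating hyperplane toward the misclassified point.
                weights = [w + learning_rate * y * xi for w, xi in zip(weights, x)]
                bias += learning_rate * y
    return weights, bias


def perceptron_predict(weights, bias, x):
    """Apply the trained model to a new input."""
    activation = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if activation >= 0 else -1
```

On linearly separable data the updates stop after finitely many mistakes, so a fixed number of passes over the training set is enough in this toy setting.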

Once we have a model, we move on to the deployment phase. Here, we feed new data points into the algorithm and obtain predictions for the outcome. We show how to perform this phase on encrypted data for the Linear Means Classifier, which classifies data via weighted sums, in Section 5.2.
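To make the idea of classifying via weighted sums concrete, here is a hedged plaintext sketch of a classifier built from class means: the weight vector is the difference of the two class means, and the threshold is placed at the midpoint between them. This is an illustrative assumption about the general shape of such a classifier, not the encrypted construction of Section 5.2; all names are hypothetical.

```python
def class_mean(points):
    """Component-wise mean of a list of equally long vectors."""
    n = len(points)
    return [sum(p[i] for p in points) / n for i in range(len(points[0]))]


def fit_mean_difference_classifier(class_a, class_b):
    """Derive weights and threshold from the two class means."""
    mean_a = class_mean(class_a)
    mean_b = class_mean(class_b)
    weights = [a - b for a, b in zip(mean_a, mean_b)]
    midpoint = [(a + b) / 2 for a, b in zip(mean_a, mean_b)]
    threshold = sum(w * m for w, m in zip(weights, midpoint))
    return weights, threshold


def classify(weights, threshold, x):
    """Deployment phase: one weighted sum and one comparison per data point."""
    score = sum(w * xi for w, xi in zip(weights, x))
    return "a" if score > threshold else "b"
```

The deployment step consists only of multiplications and additions followed by a single comparison, and the comparison can be carried out after decryption if only the score is computed on encrypted data.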

Note that we have not mentioned the testing phase here: The concept is that a few data entries from the training set are put aside before training and are then used to measure the performance of the model once it has been built, by comparing the predicted outcomes to the actual known outcomes. This phase is technically located between the training and deployment phases, but for our high-level view we will simply consider it a part of the training phase.
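The testing phase described above can be sketched as follows. The helper names, the default `test_fraction` of 0.2, and the fixed `seed` are assumptions chosen for illustration only.

```python
import random


def split_train_test(samples, labels, test_fraction=0.2, seed=0):
    """Set aside a random fraction of the labeled data before training."""
    indices = list(range(len(samples)))
    random.Random(seed).shuffle(indices)  # seeded so the split is reproducible
    n_test = max(1, int(len(indices) * test_fraction))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    train = ([samples[i] for i in train_idx], [labels[i] for i in train_idx])
    test = ([samples[i] for i in test_idx], [labels[i] for i in test_idx])
    return train, test


def accuracy(predicted, actual):
    """Fraction of held-out entries whose prediction matches the known outcome."""
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)
```

The model is built from the `train` part only, and `accuracy` is then evaluated on the predictions for the `test` part.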

5.1.1.2 Unsupervised Learning

In unsupervised learning, the situation is slightly different: There are no “correct” labels provided, so there is no training set. Instead, the algorithm attempts to find some structure in the data on its own. An example of this is the clustering problem, which we will solve on encrypted data in Section 5.4, where the algorithm assigns the data entries to different clusters. Of course, the algorithms are not entirely automatic: In the clustering example, for many algorithms we need to specify how many clusters we expect, and the algorithm must also have some kind of cost function, i.e., a way to compare two solutions to determine which is better. For many applications, this cost function is distance-based (e.g., the average distance between two points in a cluster), but there exist many different application-specific cost functions.
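As an illustration of clustering with a distance-based cost function, the following is a minimal plaintext sketch of the standard K-Means iteration. It is not the encrypted algorithm of Section 5.4. The number of clusters is fixed by the caller through `initial_centroids`, and the cost used here (sum of squared distances to the assigned centroid) is one common choice assumed for this example; all names are hypothetical.

```python
def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def assign(points, centroids):
    """Assign each point to the index of its nearest centroid."""
    return [
        min(range(len(centroids)), key=lambda k: squared_distance(p, centroids[k]))
        for p in points
    ]


def update(points, assignment, centroids):
    """Move each centroid to the mean of its members; keep it if it has none."""
    new_centroids = []
    for k, old in enumerate(centroids):
        members = [p for p, a in zip(points, assignment) if a == k]
        if members:
            new_centroids.append(tuple(sum(c) / len(members) for c in zip(*members)))
        else:
            new_centroids.append(old)
    return new_centroids


def cost(points, assignment, centroids):
    """Distance-based cost: lower values indicate a better clustering."""
    return sum(squared_distance(p, centroids[a]) for p, a in zip(points, assignment))


def k_means(points, initial_centroids, rounds=10):
    centroids = list(initial_centroids)
    for _ in range(rounds):
        assignment = assign(points, centroids)
        centroids = update(points, assignment, centroids)
    return assign(points, centroids), centroids
```

Two candidate solutions for the same data can be compared by evaluating `cost` on each and keeping the one with the smaller value.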

Note that when developing new unsupervised algorithms, there may very well be a labeled testing set. These are usually artificially generated datasets for benchmarking purposes that allow the comparison of different algorithms on the same dataset, e.g., by comparing the percentage of correctly labeled data points. However, this takes place in the development phase of the algorithm and is not part of the Machine Learning process once the algorithm has been established.

5.1.1.3 Related Work

While some of this related work has already been covered in Section 1.1, we still choose to give a comprehensive overview at this point, repeating the former as necessary.

Machine Learning as an application for Fully Homomorphic Encryption was first proposed in [GLN12], and since then it has been a popular area of research. There are many areas of Machine Learning that have been studied in the context of FHE, and we give a brief overview of the most popular ones.

The first of these areas that many works have focused on is (Deep) Neural Networks, where input nodes are connected to output nodes through (sometimes numerous) intermediate layers. Our publication [JA16] implements the Perceptron [Ros57], which is a Neural Network without any intermediate layers and is thus a building block for the more complicated versions. In [BMMP17], the bootstrapping procedure of the underlying encryption scheme is modified to accommodate a discretized Neural Network, whereas [GDL+16] and [CdWM+17] adapt the different layers of a Deep Neural Network through polynomial approximations of the functions in question. Works like [PAH+17] and [JVC18] rely on interactive solutions from the realm of Multiparty Computation, often in combination with FHE building blocks.

Logistic regression on encrypted data was first suggested in [BLN14], and there has been a recent surge of papers dealing with this task. Logistic regression is a widely used algorithm in Machine Learning, but it is non-trivial to implement on encrypted data because of the non-polynomial Sigmoid function $s(z) := \frac{1}{1+e^{-z}}$ involved in the computation. In [KSW+18], this problem is tackled by using a least-squares approximation for the Sigmoid function, whereas [KSK+18] and [BV18] use a local polynomial approximation, and [BCG+17] uses multiparty computation and a Fourier approximation of the Sigmoid function. In [CGH+18], the problem is solved in a manner very specific to the underlying FHE library HElib [LIBd], and the user must solve a linear system of equations to obtain the result of the computation after decrypting.

Another popular area of research is (Linear) Regression or Hyperplane Decision, where a hyperplane is fitted and data points are classified according to which side of the hyperplane they lie on. Publications concerned with this task are [GLN12], [BPTG15], [LKS16] and [EAH17].
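To illustrate why the Sigmoid function in logistic regression calls for an approximation, the sketch below compares it with a low-degree polynomial, which needs only additions and multiplications. The polynomial is the plain degree-3 Taylor expansion around zero, chosen here for simplicity; it is not the specific approximation used in any of the cited works, and the function names are hypothetical.

```python
import math


def sigmoid(z):
    """Exact Sigmoid function; the exponential is not a polynomial operation."""
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_taylor3(z):
    """Degree-3 Taylor expansion of the Sigmoid function around z = 0."""
    return 0.5 + z / 4.0 - z ** 3 / 48.0


if __name__ == "__main__":
    for z in (0.0, 0.5, 1.0, 4.0):
        print(z, sigmoid(z), sigmoid_taylor3(z))
```

The approximation is close near zero and degrades quickly for larger inputs, so the interval on which inputs are expected to lie has to be taken into account when choosing the polynomial.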

Other algorithm classes that have been considered include decision trees and random forests in [WFNL16], [BPTG15] and [AEH15], Support Vector Machines in [BSS+17], and Naive Bayes Classification in [AEH15] and [BPTG15], though many of these solutions rely on Multiparty Computation and thus interaction between the data owner and the computing party during the computation.

For the area of unsupervised learning, our publication [JA18], which implements the K-Means-Algorithm [M+67], is to our knowledge the only work concerned with unsupervised Machine Learning on encrypted data via FHE. The K-Means-Algorithm has been a subject of interest in the context of privacy-preserving computations for some time, but to our knowledge all previous works like [BO07], [JW05], [JPWU10], [LJY+15] and [XHY+17] require interaction between several parties, e.g., via Multiparty Computation (MPC). For a more comprehensive overview of the K-Means-Algorithm in the context of MPC, we refer the reader to [MB12]. While this interactivity may certainly be a feasible requirement in many situations, and indeed MPC is likely to be faster than FHE in these cases, we feel that there is nonetheless a need for a non-interactive solution such as the one we present: FHE reduces the computational load of the user to zero, and it also allows the computing party to keep the function secret (if the FHE scheme has circuit privacy, which all current schemes do; see Definition 1.7). Also, many of these interactive solutions rely on a vertical (in [VC03]) or horizontal (in [JKM05]) partitioning of the data between several users for security. In contrast, FHE allows a non-interactive setting with a single database owner who wishes to outsource the computation.