Smart Counterexample Search for Language Inference (Búsqueda inteligente de contraejemplos para la inferencia de lenguajes)


Universidad ORT Uruguay

Facultad de Ingeniería

Submitted as a requirement for obtaining the degree of Licenciatura en Ingeniería de Software

Kevin Mathias Chacón Levin - 190421

Diego Ignacio Zuluaga González - 173642

Tutor: Sergio Yovine

Co-Tutor: Franz Mayr


Declaration of Authorship

We, Kevin Mathias Chacón Levin and Diego Ignacio Zuluaga González, declare that, to the best of our knowledge, the work presented here is our own. We can assert that:

- The work was produced in its entirety while we were carrying out the Final Project of the Licenciatura en Ingeniería de Software;

- Where we have consulted the published work of others, we have attributed it clearly;

- Where we have quoted the works of others, we have indicated the sources. With the exception of these quotations, the work is entirely our own;

- In the work, we have acknowledged the help received;

- Where the work is based on work done jointly with others, we have explained clearly what was contributed by others and what was contributed by us;

- No part of this work has been published prior to its submission, except where the corresponding clarifications have been made.

Kevin Mathias Chacón Levin, Diego Ignacio Zuluaga González

Dedication

Abstract (Spanish)

Abstract

Keywords

1 Introduction

Thanks to advances in computing power, artificial intelligence, and machine learning in particular, has gained a lot of traction over the last 10 years, becoming a driving force in our day-to-day life. There have been many discoveries and there is still a lot of active research in this area, which intertwines the field more and more with our lives and has a significant impact even on life-supporting fields such as medicine. One particular field within machine learning that has made all of this possible is deep learning.

Despite all the benefits that this field brings into our lives, it has added a layer of obscurity for industry experts trying to understand what is going on behind the scenes when Artificial Neural Networks are involved. Being able to shed some light on the deeper layers of these models could prove useful in understanding the rationale behind the network, which can lead to significant improvements and well-informed decisions in real-life scenarios. That is why people like Mayr and Yovine started working in the field of Explainable Artificial Intelligence with their work titled "Regular Inference on Artificial Neural Networks".

Both of us are interested in contributing to said field, so we build on the foundation set by their paper by expanding on some of the questions they left open.


1.1 Motivation

We are both students at Universidad ORT Uruguay, where each of us has been working for about two years as an undergraduate teacher in different subjects.

Thanks to this opportunity we have been constantly in touch with many researchers, which sparked an interest in delving deeper into what research involves. Furthermore, we were already very interested in Artificial Intelligence and Machine Learning, in particular Deep Learning. Because of this, we approached Yovine and asked him if he would be interested in tutoring a thesis on the subject. Not only was he willing to take us up on our offer, but he was already working on something related to the subject with Mayr, so it made complete sense for us to focus on trying to improve one area of their work.

The goal of Yovine's and Mayr's work is to increase knowledge in the area of explainable artificial intelligence (explained later), a field which we find interesting, and hence we accepted their offer to work together. In particular, we focus our research on the area of language inference, and we wanted to see whether a smart selection of counterexamples helps with inferring the given language.

Moreover, this document is written in English because we may continue our research until we are able to publish a paper about it, and having everything in English from the start simplifies the whole process.

1.2 Context

1.2.1 Artificial Intelligence (AI)

Artificial intelligence (AI) is a thriving and dynamic field with a huge number of practical applications as well as active research topics. AI is being applied today in the automation of routine labor, in speech and image recognition, and even in making diagnoses in medicine and other fields.

1.2.2 Deep Learning

The solution to these more intuitive problems is twofold. First, it relies on machine learning, that is, on allowing computers to learn from experience. This spares humans from formally specifying everything the computer needs to know in order to solve the problem. Second, it models the world as a structured and compositional hierarchy of concepts. Because of this compositional hierarchy, it is possible to stack simpler concepts in order to build more complex ones. These concepts can be represented by a graph and, if the graph has many layers, it can be seen as a deep graph. This is why we call this approach to AI deep learning. [1]

1.2.3 Artificial Neural Networks

First of all, it is important to note that Artificial Neural Networks (ANN) are the state-of-the-art method in plenty of fields within what we currently know as artificial intelligence.

Despite their popularity, ANNs have a major drawback: understanding the underlying specifics they take into consideration in order to make a decision is a non-trivial task. This is a major concern, since human understanding of the model is of the utmost importance in fields such as medicine, risk assessment or intrusion detection. [1]

Because of this limitation, there is a large amount of research that focuses on improving how explainable ANNs are. One approach to this problem, which is of particular interest to our work, is black-box modeling. This approach consists simply of providing a human-understandable model that behaves as similarly as possible (ideally identically) to the ANN.

1.2.4 Recurrent Neural Networks (RNN)

Traditional ANNs, however, have the problem that they "forget" information: all their decisions are made from an immutable model. Recurrent Neural Networks (RNN) are a way of addressing this issue. They are simply networks with loops inside them, which allow them to remember and persist information. [1]

RNNs can be thought of as several copies of a neural network connected to each other, each one feeding the next with the previous information. This makes it clear why they are so useful for processing sequences.

A simple diagram of an RNN can be found in figure 1.1:

Figure 1.1: Simple RNN Diagram

This special family of neural networks is really good when it comes to processing sequential data. RNNs have two main benefits. First, they can scale to much longer sequences than networks that do not specialize in sequences. Second, most RNNs can process sequences of variable length.

1.2.4.1 Long Short-Term Memory (LSTM)

An RNN composed of long short-term memory (LSTM) units is also known as an LSTM network. This is a very special kind of RNN that works much better than a traditional one in the majority of cases.

These so-called LSTM units are composed of a cell and three different gates: the input gate, the output gate and the forget gate. The "gates" are simply artificial neurons, in the sense that they compute an activation function of a weighted sum. Because of this, they can be thought of as regulators of all the values that go through the LSTM, and it is from this idea that the name gate is derived. In terms of architecture, this translates into a total of four neural network layers per LSTM unit.

The cell can be thought of as the core of the LSTM unit. It has a state, and the gates manage the addition and removal of information from that state. The activation function usually used inside each gate is the Sigmoid activation function [3], and it interacts with the cell through pointwise operations (vector operations). Basically, the output of the Sigmoid activation function, which lies between 0 and 1, decides how much information goes through, 0 being nothing and 1 being everything. Almost all the activation functions in an LSTM unit produce an output between 0 and 1; the exception is the output, which is usually between -1 and 1, along with some other cases where needed. [2]

The first step in every LSTM is to decide what information the network is going to forget. This is done by the forget gate layer (which is simply a Sigmoid layer). For example, when trying to predict the next word in an almost finished sentence, we want to forget the gender of the subject we had been using once a new subject is found.

The second step is to decide what information we want to save. This takes two substeps. The first substep uses what is known as an input layer (again, simply a Sigmoid layer), which decides which values the network is going to update. The second substep is a new layer whose activation function is the Hyperbolic Tangent activation function (or tanh for short) [4]. We multiply pointwise what came out of the input layer with the result of the tanh activation function and output that as the result of the input gate. This is done in order to create new values that could potentially be added to the cell's state. So basically, this whole step is where the network receives some input and updates the cell state. An example of information that would be fruitful for an LSTM to save is the gender of the last used subject in speech generation.


already been made, the only thing that is needed now is a pointwise multiplication of the old cell state with what comes out of the forget gate, followed by a vector addition with what came out of the input gate. By doing this, we manage to forget information that is no longer useful and to add the new relevant information to the cell state.

The last step is to decide what the network is going to output through the output gate. The output is simply the cell state, but filtered. To obtain it, the input first goes through another layer with a Sigmoid activation function, which decides which parts of the cell's state are going to be passed on. The cell state is then passed through a tanh activation function in order to get values between -1 and 1, and the result is multiplied by the output of the Sigmoid layer.
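The four steps described above can be sketched as a single update function. This is only an illustrative sketch, not the network used in this work: it assumes a single LSTM unit with scalar input and state (so the pointwise operations reduce to ordinary multiplications), and the names `lstm_step` and `params` are hypothetical.

```python
import math


def sigmoid(x):
    """Sigmoid activation: squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x, h_prev, c_prev, params):
    """One step of a single-unit LSTM (scalar sketch).

    params maps each layer name to a tuple (w_x, w_h, b), so that the
    layer's weighted sum is w_x * x + w_h * h_prev + b.
    Returns the new output h and the new cell state c.
    """
    def weighted_sum(layer):
        w_x, w_h, b = params[layer]
        return w_x * x + w_h * h_prev + b

    forget = sigmoid(weighted_sum("forget"))           # step 1: what to forget
    write = sigmoid(weighted_sum("input"))             # step 2a: what to update
    candidate = math.tanh(weighted_sum("candidate"))   # step 2b: new values
    c = forget * c_prev + write * candidate            # step 3: new cell state
    keep = sigmoid(weighted_sum("output"))             # step 4: output filter
    h = keep * math.tanh(c)
    return h, c
```

With all weights set to zero, every gate outputs 0.5 and the candidate is 0, so the cell state is simply halved at each step; this makes the role of the forget gate easy to observe in isolation.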

The whole architecture of an LSTM cell can be seen in the following figure 1.2:

Figure 1.2: LSTM cell architecture

LSTMs are especially well suited to classifying, processing and making predictions based on time series, given that there can be time lags of unknown duration between important events. They were designed to deal with the vanishing gradient problem that appears when training traditional RNNs.

1.2.5 RNN Used

Even though we said that we do not really care about the architecture of the neural network used as long as it is an RNN, we are going to explain in more detail the one we are using for our research, so that the reader has a clear understanding of all the aspects involved.

Since our work is supported by previous work done by Franz Mayr and Dr. Sergio Yovine titled "Regular Inference on Artificial Neural Networks", we thought it best to use the same RNN that they used in their own research. [5]

Figure 1.3: RNN architecture with Softmax and Argmax

1.2.5.1 Explainable Artificial Intelligence

The purpose of explainable artificial intelligence is to come up with artifacts capable of producing intelligent outcomes together with appropriate rationalizations of them. This means that, besides delivering the best possible model performance metrics (e.g. accuracy) and computational performance metrics (e.g. algorithmic complexity), they must also provide adequate and convincing reasons that effectively justify the reasoning behind any judgment in a human-understandable way.

1.2.6 Deterministic Finite Automata (DFA)

state, and that means that the string belongs to the language. Furthermore, they are deterministic, that is, they produce a unique computation for every input string. [9]

Graphically speaking, an accepting state is represented by a double circle. Non-accepting states are represented by a single circle, and the start state is the one that has an arrow pointing to it. In figure 1.4, S0 is the start state as well as an accepting state, hence the double circle. All other states are represented by single circles because they are non-accepting states.

The transitions are labeled with a symbol of the alphabet on top of a one-directional arrow between two states.

Figure 1.4: Example Automaton
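To make the notions of start state, accepting states and labeled transitions concrete, the following minimal sketch represents a DFA as a transition table. The automaton `even_a` is a hypothetical example chosen for illustration (it accepts words with an even number of a's) and is not necessarily the one drawn in figure 1.4.

```python
class DFA:
    """Deterministic finite automaton over single-character symbols."""

    def __init__(self, start, accepting, transitions):
        self.start = start
        self.accepting = set(accepting)
        # Maps (state, symbol) to the unique next state.
        self.transitions = dict(transitions)

    def accepts(self, word):
        """Run the unique computation for `word` and test the final state."""
        state = self.start
        for symbol in word:
            state = self.transitions[(state, symbol)]
        return state in self.accepting


# Hypothetical example: S0 is both the start state and the accepting state.
even_a = DFA(
    start="S0",
    accepting={"S0"},
    transitions={
        ("S0", "a"): "S1", ("S1", "a"): "S0",
        ("S0", "b"): "S0", ("S1", "b"): "S1",
    },
)
```

Because every (state, symbol) pair has exactly one successor, each input word produces exactly one computation, which is what makes the automaton deterministic.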

1.2.7 Probably Approximately Correct Learning (PAC Learning)

One way to analyze machine learning methods and techniques is mathematical analysis. One framework for this is the Probably Approximately Correct Learning framework (or PAC for short). In this framework there is a learner, who receives samples and must find from them a generalization function called the hypothesis. The goal is to minimize the generalization error with high probability. To do this, the learner needs to be able to learn for any approximation ratio, probability of success and distribution of the samples. [10][11]


Because we are working exclusively in the context of language learning, we will only briefly describe the PAC-learning setting for languages. More information can be found in [11].

The goal of the PAC learning algorithm is, if it terminates, to output a language Lo that is ε-approximately correct with respect to the target language Lt with a probability of at least 1−δ. ε-approximately correct means that the probability of a sequence belonging to the symmetric difference of the two languages is less than a given ε [5]. Here ε is an approximation parameter between 0 and 1, and δ is a confidence parameter between 0 and 1. Both parameters are given as inputs to the PAC learning algorithm.
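Written out in symbols, and assuming that D denotes the distribution over sequences and ⊕ the symmetric difference (notation introduced here only for illustration), the definition above reads:

```latex
\[
L_o \oplus L_t = (L_o \setminus L_t) \cup (L_t \setminus L_o),
\qquad
\Pr_{x \sim D}\bigl[\, x \in L_o \oplus L_t \,\bigr] < \varepsilon .
\]
```

The PAC guarantee is then that this inequality holds with probability at least $1-\delta$ over the samples drawn by the algorithm.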

1.2.7.1 Oracle EX_D

We will make use of an oracle, called EX_D, whose sole job is to draw an example sequence following a given distribution D and to return this sequence tagged as positive or negative depending on whether or not it belongs to the language. Note that these calls are independent of each other.

1.2.7.2 Oracle EQ

The oracle EQ is an approximate equivalence test. It draws a sufficiently large sample S of tagged sequences, which the PAC learning algorithm then uses to check the output language Lo against the target language Lt. If every s in S belongs to Lt exactly when it belongs to Lo, the algorithm stops successfully and returns the output language Lo. If not, it picks any sequence of S that lies in the symmetric difference between Lo and Lt.
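The interplay between the two oracles can be sketched as follows. This is a minimal illustration rather than the implementation used in this work: it assumes that languages are given as membership predicates, that D draws a uniform length up to `max_length` and then uniform symbols, and the names `make_ex_oracle` and `approximate_eq` are hypothetical.

```python
import random


def make_ex_oracle(target, alphabet, max_length, rng):
    """Build EX_D: each call draws one word and tags it with the target.

    Assumed distribution D: uniform length in 0..max_length, then
    independent uniform symbols from `alphabet`.
    """
    def ex():
        length = rng.randint(0, max_length)
        word = "".join(rng.choice(alphabet) for _ in range(length))
        return word, target(word)
    return ex


def approximate_eq(hypothesis, ex, sample_size):
    """Approximate equivalence test based on a sample drawn through EX_D.

    Returns (True, None) if the hypothesis agrees with every tag in the
    sample; otherwise (False, word) for one word of the sample that lies
    in the symmetric difference of the two languages.
    """
    sample = [ex() for _ in range(sample_size)]
    disagreements = [word for word, tag in sample if hypothesis(word) != tag]
    if not disagreements:
        return True, None
    return False, disagreements[0]
```

Returning the first disagreement is an arbitrary choice made for brevity; which element of the sample to return is exactly the degree of freedom that a smarter counterexample selection can exploit.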

1.2.7.3 Oracle MQ

1.2.7.4 Distribution Free (DF)

In the distribution free instance of the PAC Framework, no assumptions about the data distribution are made. We will then work with a uniform distribution.

1.2.8 L* Algorithm

A distinction can be drawn between passive and active learning. Passive learning consists in learning the underlying language from a given set of positive and negative examples chosen by the teacher. This is, however, an NP-complete problem. On the other hand, active learning, our area of focus, has a learner that is able to generate examples and ask membership queries to the teacher.

There is a well-known algorithm for doing this, proposed by Angluin and known as L* [9] [12]. L* has the characteristic that its running time is polynomial in the number of states of the minimal DFA and in the maximum length of any counterexample exhibited by the teacher.

First we will introduce some terminology in order to better describe L*.

If we have a DFA A, we use L(A) to denote the language recognized by A (the set of sequences accepted by A).

If we call At and Ao the target and output automata respectively, we can say that Ao ε-approximates At if Lo ε-approximates Lt.

The goal of L* is to learn regular languages or, equivalently, deterministic finite automata (DFA) [12]. The way the algorithm does this is simple: it iteratively builds a table, called the observation table, following a set of rules. This table keeps track of which words are and are not accepted by the hidden machine. To fill it, the algorithm asks the teacher, through the Membership Oracle (MQ), whether each word in the table is accepted by the hidden machine.

In L*, there is a need for a teacher that is able to answer membership queries with a boolean response, as well as to compare the DFA given by the learner with the original one and, if they differ, provide a counterexample that distinguishes them.


sufficiently large so as to ensure that the algorithm's total confidence is at least 1−δ. Whenever this statistical test is passed, we can conclude that the output is ε-approximately correct with a confidence of at least 1−δ.
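One standard way of choosing such sample sizes, given here as an assumption because the exact expression is not shown in this excerpt, is to let the i-th equivalence test (i = 1, 2, ...) draw ⌈(1/ε)(i·ln 2 − ln δ)⌉ samples. A hypothesis whose error is at least ε then passes the i-th test with probability at most (1−ε) raised to that sample size, which is at most δ/2^i, and summing over all tests keeps the total probability of a wrong answer below δ. The function name `eq_sample_size` is hypothetical.

```python
import math


def eq_sample_size(epsilon, delta, i):
    """Number of samples for the i-th approximate equivalence test.

    Chosen so that (1 - epsilon) ** size <= delta / 2 ** i, which bounds
    the chance that a hypothesis with error >= epsilon passes test i.
    """
    return math.ceil((i * math.log(2) - math.log(delta)) / epsilon)


# Sample sizes grow linearly with the index of the equivalence test.
sizes = [eq_sample_size(0.05, 0.05, i) for i in (1, 2)]
```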

The information in the observation table has three components: a nonempty finite prefix-closed set of strings (every prefix of every member is also a member of the set), a nonempty finite suffix-closed set of strings (every suffix of every member is also a member of the set), and a finite function that maps a string to 1 if it is a member of the target language and to 0 otherwise. [12]

The observation table then has two parts: the "upper" rows (or top part), which represent the elements of the prefix-closed set of strings mentioned earlier, and the "lower" rows (or bottom part), which represent those same elements extended with each letter of the language alphabet. Meanwhile, the columns represent the suffix-closed set of strings, and each cell holds the value of the mapping function, both also mentioned earlier. An example is shown in figure 1.5:

Figure 1.5: Observation table example extracted from [9]


with once it is ready. First of all, it needs to be closed. The table is closed if, for every row in the bottom part of the table, there is an equal row in the top part. It also needs to be consistent, that is, whenever two elements in the top part of the table have the same row (the same sequence of 0s and 1s), their extensions with the same letter of the alphabet must also have the same row.

If the table is not closed, the algorithm moves the row in the bottom part that does not have an equal row in the top part up to the top part, and adds to the bottom part all the corresponding extensions of that row.

To make it consistent, the algorithm extends the set of suffixes with a new suffix that differentiates the two elements that had the same row.
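The closedness and consistency checks can be sketched as two small functions. This is an illustrative sketch under simplifying assumptions: the table is not stored but recomputed from a membership function `member` (standing in for MQ), words are Python strings, and the function names are hypothetical.

```python
from itertools import combinations


def row(prefix, suffixes, member):
    """Row of the observation table for `prefix`: one cell per suffix."""
    return tuple(member(prefix + suffix) for suffix in suffixes)


def find_unclosed(prefixes, suffixes, alphabet, member):
    """Return a bottom-part word whose row is missing from the top part.

    Returns None when the table is closed.
    """
    top_rows = {row(p, suffixes, member) for p in prefixes}
    for p in prefixes:
        for letter in alphabet:
            if row(p + letter, suffixes, member) not in top_rows:
                return p + letter
    return None


def find_inconsistency(prefixes, suffixes, alphabet, member):
    """Return a new suffix that separates two top rows that look equal.

    Returns None when the table is consistent.
    """
    for p1, p2 in combinations(prefixes, 2):
        if row(p1, suffixes, member) != row(p2, suffixes, member):
            continue
        for letter in alphabet:
            for suffix in suffixes:
                if member(p1 + letter + suffix) != member(p2 + letter + suffix):
                    return letter + suffix
    return None
```

A word returned by `find_unclosed` would be moved to the top part, and a suffix returned by `find_inconsistency` would be added to the set of suffixes, matching the two repair operations described above.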

Once the table is closed and consistent, the algorithm proceeds to construct the conjectured DFA and then asks the oracle about its equivalence. If the answer is yes, it terminates and returns the new DFA. If the answer is no, it also receives a counterexample that proves the DFA is wrong, and it proceeds to extend the observation table with this new counterexample.

Pseudocode for Angluin's algorithm is given below so as to better understand it. As explained before, the algorithm uses membership queries and, at certain points, asks the teacher whether the learned DFA is correct.

Algorithm 1 L* DFA Learning [5]

function Learn
    equivalent ← false
    while not equivalent do
        while observationTable is not closed or not consistent do
            if observationTable is not closed then
                Close(observationTable)
            end if
            if observationTable is not consistent then
                MakeConsistent(observationTable)
            end if
        end while
        currentDFA ← GetDFA(observationTable)
        equivalent, counterexample ← teacher.EquivalenceQuery(currentDFA)
        if not equivalent then
            UpdateWithCounterexample(observationTable, counterexample)
        end if
    end while
    return currentDFA
end function

1.2.9 Bounded L* Algorithm

As long as the language being learnt is a regular language, it has been proven that L* will terminate. However, it cannot be guaranteed that L* will ever terminate when learning from a hidden machine that is strictly more expressive than DFAs. This is due to the fact that there are languages that cannot be represented exactly by any DFA. In simpler terms, there is no finite number of iterations after which we can ensure that L* will have terminated.

2 Problem Statement

2.1 Introduction

Our work is supported by previous work done by Mayr and Yovine titled "Regular Inference on Artificial Neural Networks". [5]

The project we extended in order to carry out our research has the structure illustrated in figure 2.1:

Figure 2.1: Project Structure

In this thesis, we carry out two lines of work related to learning automata from neural networks.

However, we are not so much concerned with the learning mechanisms as with the choice of counterexamples. In simple terms, we compare the output of two different models to find out where they differ.

It is important to point out that PAC can learn from models that are not neural networks. We decided to focus our work on this particular inference case because we felt the biggest gains could be found when working with the confidence a neural network has in a given classification.

2.2 Counterexample Search

Our goal is to improve the generation of counterexamples in a context of model inference where the objective is to obtain similar models. For this we tried out different methods of generating counterexamples to see whether we could do better than plain random sampling.

We are motivated to investigate the following problem:

2.2.1 Research question.

What happens when we use a smart counterexample selection algorithm instead of a random one?

2.2.2 Distribution Learning (DL)

We started wondering whether we could improve our PAC algorithm by inferring or learning distributions of the words that belong to the target language we are trying to learn. We decided to modify our oracle EX_D so that, instead of generating totally random examples, it takes a given learned distribution and generates the examples based on it. A distribution could have been learned previously from a set of examples of the target language, or could be inferred while PAC is running. Distributions could be based on the length or on the composition of accepted words.

In this case, we tackle the following problem:

Research question.


3 First Steps

Our first steps into explainable artificial intelligence, in conjunction with Yovine and Mayr, consisted of researching whether techniques used for crafting adversarial examples could be used to improve the efficiency of the learning process of our inference algorithm.

3.1 Adversarial Examples

Adversarial examples are simply inputs to machine learning models that were specifically crafted so as to trick the model into making a mistake.

Figure 3.1: From left to right, photo of a cat classified correctly, perturbed photo of a cat classified incorrectly (adversarial example), and perturbation applied to the second photo. Photos obtained from Foolbox's GitHub repository [13]

3.2 Papernot

Papernot proposes an algorithm for mutating examples correctly classified by an RNN in order to turn them into adversarial examples.

In broad terms, Papernot's algorithm attempts to decrease the gradient, and therefore the confidence, of an RNN over an example sequence by choosing a position in the sequence at random and mutating it, so that the example varies as little as possible from the original one [14].

3.3 Cleverhans

Cleverhans is a tool that lets the user benchmark the vulnerability of machine learning systems to adversarial examples [15].

One of the steps the tool takes in this benchmarking is to generate counterexamples. However, it does so internally and there is no simple way of extracting them; to do so, the source code would need to be modified.

3.4 Foolbox

Foolbox is a tool that generates adversarial examples for artificial neural networks in image classification [13].

Even though it should theoretically be possible to use other types of input apart from images, we decided against this option as well, because we did not find any simple way to do so and because it is not clear whether it can be applied to RNNs.

3.5 Why Did It Not Work?

We ended up discarding these approaches because using adversarial examples during the training of our RNN would negatively impact its accuracy and results, so that was of no use to us. Furthermore, we did not use them after training because we ran into many implementation problems: we are working with RNNs, and the tools we researched are not trivially applicable to RNNs.

4 Counterexample Search Algorithm

4.1 Score Functions

Score functions are used by the algorithms to determine which of two different counterexamples is better. All results returned by the score functions are between 0 and 1.

Mayr advised us to keep the length of counterexamples as short as possible in order to save execution time. We wondered whether it really made any difference, so we decided to try out multiple ways of assigning a score to a particular example. When we faced the problem of how to run multiple tests at the same time in order to try out these score assignment methods, the idea of extracting them into score functions came about naturally.

The different score functions that we propose are:

Confidence Only

Length Only

Confidence - Length Percentage

4.1.1 Confidence Only


This score function then prioritizes examples for which the network had a higher confidence in the classification.

Algorithm 2 Confidence Only Score

function GetScore(confidence, length)
    if confidence is less than CLASSIFICATION_MARGIN then
        confidence ← 1 − confidence
    end if
    percentage ← (confidence − MIN_CONFIDENCE) / (MAX_CONFIDENCE − MIN_CONFIDENCE)
    return percentage
end function
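As a runnable illustration of Algorithm 2, the sketch below assumes that confidences are probabilities in [0, 1] and that the three constants take the values shown. These values are assumptions made for the example, not values stated in this document.

```python
# Assumed values, for illustration only.
CLASSIFICATION_MARGIN = 0.5  # below this, the network predicts "rejected"
MIN_CONFIDENCE = 0.5         # least confident classification possible
MAX_CONFIDENCE = 1.0         # most confident classification possible


def confidence_only_score(confidence, length=None):
    """Score in [0, 1] that grows with the network's confidence.

    `length` is accepted only to keep the same signature as the other
    score functions; it is ignored here.
    """
    if confidence < CLASSIFICATION_MARGIN:
        # Confidence in the opposite class, so both classes are comparable.
        confidence = 1.0 - confidence
    return (confidence - MIN_CONFIDENCE) / (MAX_CONFIDENCE - MIN_CONFIDENCE)
```

Under these assumptions a confidence of 0.25 and a confidence of 0.75 receive the same score, since the network is equally sure of its answer in both cases.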

4.1.2 Length Only

Between two different counterexamples, the length score function prioritizes the shorter ones.

Algorithm 3 Length Only Score

function GetScore(confidence, length)
    if length is greater than maxLength then
        throw Exception
    end if
    percentage ← (maxLength − length) / maxLength
    return percentage
end function

maxLength is a constant that represents the maximum possible length of a word in the given language.

4.1.3 Confidence - Length Percentage

Algorithm 4 Confidence and Length Score

function GetScore(confidence, length)
    if length is greater than maxLength then
        throw Exception
    end if
    if confidence is less than CLASSIFICATION_MARGIN then
        confidence ← 1 − confidence
    end if
    confidencePercentage ← (confidence − MIN_CONFIDENCE) / (MAX_CONFIDENCE − MIN_CONFIDENCE)
    lengthPercentage ← (maxLength − length) / maxLength
    return confidencePercentage ∗ CONFIDENCE_MULTIPLIER + lengthPercentage ∗ LENGTH_MULTIPLIER
end function
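The length score and the combined score can be sketched together as follows. All constant values are assumptions chosen for illustration; in particular, the two multipliers are assumed to add up to 1 so that the combined score stays between 0 and 1.

```python
# Assumed values, for illustration only.
CLASSIFICATION_MARGIN = 0.5
MIN_CONFIDENCE = 0.5
MAX_CONFIDENCE = 1.0
CONFIDENCE_MULTIPLIER = 0.5
LENGTH_MULTIPLIER = 0.5


def length_score(length, max_length):
    """Score in [0, 1] that is higher for shorter words."""
    if length > max_length:
        raise ValueError("length exceeds max_length")
    return (max_length - length) / max_length


def confidence_length_score(confidence, length, max_length):
    """Weighted sum of the confidence percentage and the length percentage."""
    if confidence < CLASSIFICATION_MARGIN:
        confidence = 1.0 - confidence
    confidence_percentage = (confidence - MIN_CONFIDENCE) / (
        MAX_CONFIDENCE - MIN_CONFIDENCE
    )
    return (
        confidence_percentage * CONFIDENCE_MULTIPLIER
        + length_score(length, max_length) * LENGTH_MULTIPLIER
    )
```

Changing the two multipliers shifts the preference between short counterexamples and confidently classified ones without touching the rest of the search.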

4.2 Teachers

4.2.1 LSTM Teacher CZ - Baseline

This is our baseline. It is a copy of the teacher coded by Mayr for use when learning from an LSTM model. It randomly generates a list of examples, from which we keep the ones that are counterexamples. We group those together in a counterexample pool and then choose the item with the shortest length. If two or more counterexamples have the same length, it chooses the first one.
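The baseline selection can be sketched as below. This is not the teacher's actual code: it assumes that both the network and the conjectured DFA are exposed as boolean predicates over words and that examples are drawn uniformly, and the name `baseline_counterexample` is hypothetical.

```python
import random


def baseline_counterexample(network_accepts, dfa_accepts, alphabet,
                            max_length, pool_size, rng):
    """Pick the shortest counterexample among randomly generated words.

    A word is a counterexample when the network and the DFA disagree on it.
    Returns None if no generated word is a counterexample.
    """
    pool = []
    for _ in range(pool_size):
        length = rng.randint(0, max_length)
        word = "".join(rng.choice(alphabet) for _ in range(length))
        if network_accepts(word) != dfa_accepts(word):
            pool.append(word)
    if not pool:
        return None
    # min() keeps the first of several equally short words.
    return min(pool, key=len)
```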

4.2.2 No Perturbations

Our first algorithm takes a counterexample pool (in our case a randomly generated one) and chooses the counterexample with the highest score.

Algorithm 5 Select Counterexample No Perturbations

function SelectCounterexample(words, dfa)
    originalConfidence ← ConfidenceQueryWithCache(Normalize(words))
    return ChooseExampleBestScore(originalConfidence, scoreFunction)
end function

For the following teachers, the method that selects counterexamples stays mostly the same; what changes is the part that applies the perturbation.

Algorithm 6 Select Counterexample

function SelectCounterexample(words, dfa)
    normalizedWords ← Normalize(words)
    foundConfidence ← ConfidenceQueryWithCache(normalizedWords)
    while patience not exhausted do
        oldScore ← TotalScore(foundConfidence, scoreFunction)
        normalizedWords ← ApplyPerturbations(foundConfidence)
        newConfidence ← ConfidenceQueryWithCache(normalizedWords)
        for each newExample, index in newConfidence do
            previousExample ← foundConfidence[index]
            if ScoredImproved(newExample, previousExample, scoreFunction) and IsCounterExample(dfa, newExample) then
                foundConfidence[index] ← newConfidence[index]
            end if
        end for
        newScore ← TotalScore(foundConfidence, scoreFunction)
        if newScore is equal to oldScore then
            patience ← patience − 1
        end if
    end while
    return ChooseExampleBestScore(foundConfidence, scoreFunction)
end function
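The patience-based loop of Algorithm 6 can be sketched in a few lines. The sketch leaves out normalization and caching, and it assumes that the network confidence, the counterexample test, the perturbation and the score are passed in as plain functions; all parameter names are hypothetical.

```python
def select_counterexample(words, confidence_of, is_counterexample,
                          perturb, score, patience):
    """Improve a pool of counterexamples until the total score stalls.

    confidence_of(word) returns the network confidence for the word,
    score(word, confidence) returns a number where higher is better,
    perturb(word) returns a mutated word, and
    is_counterexample(word) tells whether the word is still a counterexample.
    """
    def scored(word):
        return score(word, confidence_of(word))

    found = list(words)
    while patience > 0:
        old_total = sum(scored(w) for w in found)
        for index, candidate in enumerate([perturb(w) for w in found]):
            # Keep a perturbation only if it scores better and still
            # separates the DFA from the network.
            if scored(candidate) > scored(found[index]) and is_counterexample(candidate):
                found[index] = candidate
        new_total = sum(scored(w) for w in found)
        if new_total == old_total:
            patience -= 1
    return max(found, key=scored)
```

Patience only decreases on rounds in which nothing improved, so the loop ends once the pool has gone `patience` rounds in total without any gain.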

4.2.3 Random Perturbations

The decision to apply random perturbations is drawn from Papernot's paper [14]. Even though Papernot created his algorithm to craft adversarial examples, we believed it could also be applied in our case to mutate counterexamples into ones with a higher score.

Algorithm 7 Random Perturbation

function ApplyPerturbations(examples)
    perturbatedExamples ← EmptyList()
    for each example in examples do
        pos ← SelectPositionToPerturbateRandomly()
        perturbatedExample ← PerturbateRandomlyAtPosition(example, pos)
        perturbatedExamples.Add(perturbatedExample)
    end for
    return perturbatedExamples
end function
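A single random perturbation can be sketched as follows. The sketch assumes that words are strings and that the replacement letter is always different from the original one; the latter is an assumption of this example, since the pseudocode does not say whether the same letter may be drawn again.

```python
import random


def perturb_randomly(example, alphabet, rng):
    """Replace the letter at one random position with a different letter."""
    if not example:
        return example
    pos = rng.randrange(len(example))
    choices = [letter for letter in alphabet if letter != example[pos]]
    if not choices:
        # Single-letter alphabet: nothing to change.
        return example
    return example[:pos] + rng.choice(choices) + example[pos + 1:]


def apply_perturbations(examples, alphabet, rng):
    """Perturb every example of the pool once, as in Algorithm 7."""
    return [perturb_randomly(example, alphabet, rng) for example in examples]
```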

4.2.4 Best Perturbations

One of the approaches we tried in the search for good ways of obtaining counterexamples works on a randomly generated word: it perturbs each position in turn, changing it to every valid letter of the alphabet, keeps track of the best perturbation found for each position, and concatenates them at the end. If the concatenation of the perturbations is still a counterexample and has a better score, we replace the original counterexample with the newly crafted one.

For example, let us say that the randomly generated word was aaa and that the alphabet had the characters a, b and c. The algorithm tries all the options for the first position, that is, aaa, baa and caa, and keeps track of the one with the best score. For the sake of the example, let us say the best of the three generated words is baa. Then it does the same for the second position (aaa, aba, aca), and assume the best one is aca. This is then followed by, say, choosing aaa for the last position's perturbation. Then we concatenate b, c and a to form the example bca. After generating the new example, we compare its score with the score of the original. If the new score is better than the original one, and if the new example is still a counterexample, we overwrite the original counterexample with the new one.

Algorithm 8 Best Perturbation

perturbatedExamples ← EmptyList()
for each example in examples do
    perturbatedExample ← ApplyPerturbation(example)
    perturbatedExamples.Add(perturbatedExample)
end for
return perturbatedExamples
end function

function ApplyPerturbation(example)
    perturbatedExample ← EmptyString()
    for each index in example do
        perturbation ← BestPerturbationOnIndex(example, index)
        perturbatedExample.Append(perturbation)
    end for
    return perturbatedExample
end function

function BestPerturbationOnIndex(example, index)
    possiblePerturbations ← EmptyList()
    for each character in alphabet do
        possiblePerturbations.Add(Perturbate(example, index, character))
    end for
    perturbationsAndConfidence ← Confidence(possiblePerturbations)
    bestPerturbation ← ChooseBestScore(perturbationsAndConfidence)
    return bestPerturbation[index]
end function
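The Best Perturbation procedure can be sketched in Python as follows. This is only an illustrative sketch, not the implementation used in this work: the names `score`, `is_counterexample` and `toy_score` are hypothetical stand-ins for the score function and for the check that the RNN and the hypothesis still disagree on a word.

```python
from typing import Callable, Sequence


def best_perturbation_on_index(word: str, index: int, alphabet: Sequence[str],
                               score: Callable[[str], float]) -> str:
    """Return the character that yields the best-scoring word at `index`."""
    candidates = [word[:index] + ch + word[index + 1:] for ch in alphabet]
    best_word = max(candidates, key=score)  # on ties, the first candidate wins
    return best_word[index]


def apply_best_perturbation(word: str, alphabet: Sequence[str],
                            score: Callable[[str], float],
                            is_counterexample: Callable[[str], bool]) -> str:
    """Concatenate the best character per position; keep it only if it improves."""
    crafted = "".join(
        best_perturbation_on_index(word, i, alphabet, score)
        for i in range(len(word))
    )
    # Accept the crafted word only if it is still a counterexample
    # and its score is strictly better than the original one.
    if is_counterexample(crafted) and score(crafted) > score(word):
        return crafted
    return word


# Hypothetical toy score: number of positions that match the word "bca".
toy_score = lambda w: sum(a == b for a, b in zip(w, "bca"))
print(apply_best_perturbation("aaa", "abc", toy_score, lambda w: True))  # bca
```

With this toy score the sketch reproduces the aaa to bca example described in the text: b is the best character for the first position, c for the second and a for the third.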

4.2.5 All Best Perturbation

Another approach we tried was to fix a size for the counterexample pool and then, at each step, apply perturbations at every position of every counterexample and keep all the resulting perturbations. Afterwards, we remove as many of the lowest-scoring perturbations as necessary so that the pool always keeps the same size.


Algorithm 9 All Best Perturbations

perturbatedExamples ← EmptySet()
for each example in examples do
    allPossiblePerturbations ← ApplyPerturbations(example)
    perturbationsConfidence ← Confidence(allPossiblePerturbations)
    removeNonCounterexamples(perturbationsConfidence)
    perturbatedExamples.AddAll(perturbationsConfidence)
end for
removeLowestScore(perturbatedExamples, examples.Size())
end function

function ApplyPerturbations(example)
    perturbatedExample ← EmptyString()
    allPerturbations ← EmptyList()
    for each index in example do
        perturbationsOnIndex ← PerturbationsOnIndex(example, index)
        allPerturbations.AddAll(perturbationsOnIndex)
        perturbation ← BestPerturbationOnIndex(example, index)
        perturbatedExample.Append(perturbation)
    end for
    allPerturbations.Add(perturbatedExample)
    return allPerturbations
end function

function PerturbationsOnIndex(example, index)
    perturbations ← EmptyList()
    for each character in alphabet do
        if character not equals example[index] then
            perturbations.Add(Perturbate(example, index, character))
        end if
    end for
    return perturbations
end function
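The pool-based variant can be sketched in Python as follows. This is a simplified sketch under stated assumptions: it only generates single-position perturbations (it omits the concatenated best perturbation), and `score` and `is_counterexample` are hypothetical callables standing in for the score function and the counterexample check. Ties are broken alphabetically only to make the result deterministic.

```python
from typing import Callable, Iterable, List, Sequence


def perturbations_on_index(word: str, index: int,
                           alphabet: Sequence[str]) -> List[str]:
    """All words that differ from `word` only at position `index`."""
    return [word[:index] + ch + word[index + 1:]
            for ch in alphabet if ch != word[index]]


def all_best_perturbations(pool: Iterable[str], alphabet: Sequence[str],
                           score: Callable[[str], float],
                           is_counterexample: Callable[[str], bool]) -> List[str]:
    """Perturb every position of every word, then prune to the pool size."""
    pool = list(pool)
    candidates = set()
    for word in pool:
        for index in range(len(word)):
            for perturbed in perturbations_on_index(word, index, alphabet):
                if is_counterexample(perturbed):  # drop non-counterexamples
                    candidates.add(perturbed)
    # Keep the highest-scoring candidates so the pool size stays constant.
    ranked = sorted(candidates, key=lambda w: (-score(w), w))
    return ranked[:len(pool)]


# Hypothetical toy score: number of 'b' characters in the word.
print(all_best_perturbations(["aa", "ab"], "ab",
                             lambda w: w.count("b"), lambda w: True))
# ['bb', 'ab']
```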

4.3 Distributions


Algorithm 10 Get Word

function GetWord(seededRandom)
    length ← GetLength(maxLength, seededRandom)
    word ← EmptyString()
    while word.length less than length do
        word.Append(alphabet[seededRandom.Next() mod alphabet.Size()])
    end while
    return word
end function

4.3.1 Distribution Free

The counterexample pool is generated using a uniform distribution, without taking into account any information that could be learnt about the language over time. No assumptions are made about the distribution, and no knowledge about it is gathered.

This is the distribution used by our baseline algorithm.

Algorithm 11 Get Length Distribution Free

function GetLength(maxLength, randomGen)
    return randomGen.Next() mod maxLength
end function
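A minimal Python sketch of the distribution-free generator is shown here. It assumes that Next() returns a non-negative integer, so that taking it modulo maxLength yields lengths between 0 and maxLength - 1; the function names are illustrative and not taken from the original code.

```python
import random
from typing import Sequence


def get_length_uniform(max_length: int, rng: random.Random) -> int:
    """Uniform length in [0, max_length - 1], like Next() mod maxLength."""
    return rng.randrange(max_length)


def get_word(alphabet: Sequence[str], max_length: int, rng: random.Random) -> str:
    """Draw a length uniformly, then draw each character uniformly."""
    length = get_length_uniform(max_length, rng)
    return "".join(rng.choice(alphabet) for _ in range(length))


rng = random.Random(42)  # seeded so that runs are reproducible
words = [get_word("ab", 12, rng) for _ in range(5)]
print(words)
```

Seeding the generator, as GetWord does with seededRandom, makes a run reproducible, which helps when comparing teachers on the same pool of random words.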

4.3.2 Length Distribution


Algorithm 12 Get Length LD

function GetLength(randomGen, acceptedWordsPerLengthPercentage)
    softmax ← softmax(acceptedWordsPerLengthPercentage ∗ (1 − generatedWordsPerLength) / totalWordsPerLength)
    random ← randomGen.NextDouble()
    cumulative ← 0
    for each index in softmax do
        cumulative ← cumulative + softmax[index]
        if cumulative is greater than random then
            generatedWordsPerLength[index]++
            return index
        end if
    end for
    generatedWordsPerLength[softmax.Length − 1]++
    return softmax.Length − 1
end function
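The sampling step of Algorithm 12 can be sketched in Python as follows. The sketch takes the per-length weights as an input and does not reproduce how they are computed; `softmax` and `sample_index` are illustrative names, and the max-subtraction inside `softmax` is a standard numerical-stability trick, not something stated in the text.

```python
import math
from typing import List, Sequence


def softmax(values: Sequence[float]) -> List[float]:
    """Turn arbitrary weights into probabilities that sum to 1."""
    highest = max(values)  # subtracted for numerical stability
    exps = [math.exp(v - highest) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def sample_index(probabilities: Sequence[float], u: float) -> int:
    """Return the first index whose cumulative probability exceeds u."""
    cumulative = 0.0
    for index, p in enumerate(probabilities):
        cumulative += p
        if cumulative > u:
            return index
    # Fallback for rounding errors, as in the last lines of Algorithm 12.
    return len(probabilities) - 1


probs = softmax([0.0, 0.0])
print(probs)                      # [0.5, 0.5]
print(sample_index(probs, 0.25))  # 0
print(sample_index(probs, 0.75))  # 1
```

Here `u` plays the role of randomGen.NextDouble(): a value drawn uniformly from [0, 1) selects length `index` with probability `probabilities[index]`.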


5 Experimental Evaluation

5.1 Evaluation

5.1.1 Experiment Methodology

Each of our tests consisted of three parts:

1. Generate a dataset of examples from a base automaton. This automaton represents the language we want the RNN to learn.

2. Train our RNN on the generated dataset.

3. Produce all possible combinations of teacher and score function algorithms and, for each of them, run 20 learners.

Each learner generates an automaton that approximates the trained RNN by finding counterexamples between the two models.

Due to how the PAC learning framework works, statistical symmetry between the generated automaton and the RNN is achieved when no more counterexamples are found within a randomly generated example pool. At that moment, the learner's results are saved and a new learner is run.
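The stopping condition described above can be sketched in Python as follows. The pool size uses the classical bound from Angluin [12] for the i-th equivalence query; whether the code used in this work computes the pool size in exactly this way is an assumption. `rnn_accepts` and `dfa_accepts` are hypothetical callables that classify a word.

```python
import math
from typing import Callable, Iterable, Optional


def sample_size(epsilon: float, delta: float, i: int) -> int:
    """Angluin's bound: ceil((ln(1/delta) + i * ln 2) / epsilon)."""
    return math.ceil((math.log(1 / delta) + i * math.log(2)) / epsilon)


def find_counterexample(pool: Iterable[str],
                        rnn_accepts: Callable[[str], bool],
                        dfa_accepts: Callable[[str], bool]) -> Optional[str]:
    """Return the first word on which both models disagree, or None."""
    for word in pool:
        if rnn_accepts(word) != dfa_accepts(word):
            return word
    return None  # no disagreement in the pool: the learner stops


print(sample_size(0.05, 0.05, 1))  # 74
```

With ε = δ = 0.05, the values used in most of the tests in section 8.1, the first pool would contain 74 random words under this bound, and the pool grows with each new hypothesis.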

5.1.2 Case Studies

The base automata we are working with were coded by Mayr, and they are shown in Figures 5.1 to 5.6. For a more detailed explanation of each language's base automaton, please see the citation next to the title of each one.

Figure 5.1: A Ending Automaton [5]


Figure 5.3: Alternating Bit Protocol (ABP) [17]


Figure 5.6: Reduced E Commerce Automaton [5]

5.1.3 Performed Experiments

With these automata, we ran eight different tests (repurposed from Mayr's code), whose inputs can be found in Annex 1 (section 8.1).

The information recorded for each of the tests was the following:

◦ All averages and variances were plotted into corresponding .pdf files. Moreover, the total time for each algorithm, the total number of hypotheses found by each algorithm, and the total number of automata found were also plotted. We used bar series for all our plots except for time per learner per algorithm (which we plotted as a boxplot series, to observe the deviation of the maximum and minimum values with respect to the median) and time per learner (for which we used a line series, to observe the effect of our cache on execution time).

All information that was originally generated or collected for each Teacher in Mayr's code was also kept, under a folder named after the Teacher/Score function combination that generated it (e.g. NoPerturbations_UniDistrGen_ConfidenceLength).

We also produced two more .csv files (ResultsSummary and SymResultsSummary), which summarize the information output by Mayr's code (i.e. they contain the max, min, and avg of each of the output columns) in each of the respective files.
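The kind of per-column summary stored in those files can be sketched in Python as follows. The column names `timeMs` and `hypotheses` are hypothetical, and the sketch does not reproduce the actual layout of ResultsSummary or SymResultsSummary.

```python
from statistics import mean
from typing import Dict, List


def summarize(rows: List[Dict[str, str]]) -> Dict[str, Dict[str, float]]:
    """Compute max, min and avg for every column of a list of rows."""
    summary = {}
    for column in rows[0]:
        values = [float(row[column]) for row in rows]
        summary[column] = {
            "max": max(values),
            "min": min(values),
            "avg": mean(values),
        }
    return summary


# Hypothetical per-learner rows, e.g. as read with csv.DictReader.
rows = [
    {"timeMs": "120", "hypotheses": "2"},
    {"timeMs": "180", "hypotheses": "4"},
]
print(summarize(rows)["timeMs"])  # {'max': 180.0, 'min': 120.0, 'avg': 150.0}
```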

5.2 Results

In order to quantify the impact of both a smart selection and a distribution learning approach, we explored three questions that provide enough information to draw conclusions.

1. Does a smart selection of counterexamples reduce the number of hypotheses generated during training?

2. Does a smart selection of counterexamples reduce the variance of the obtained automata?

3. Does a smart selection of counterexamples reduce the execution time of the automata inference algorithm?

We will now try to answer each of these questions. Note: table entries marked with an approximate comparison symbol were almost equal (==) to the baseline and varied only minimally. These results were extracted by hand by analyzing the plotted data for each of our tests. Comparisons between results were also made by hand. Our baseline implementation in the plotted data is the one called “LSTMTeacherCZ_UniDistrGen”.

We would also like to point out that more plotted results can be found in Annex 2 (section 8.2).

5.2.1 Tests with Uniform Distribution

5.2.1.1 Number of Hypothesis

Score \ Teacher                | All Best Perturbations | Best Perturbations | No Perturbations | Random Perturbations
Confidence - Length Percentage | >                      | ≅                  | ≅                | ≅
Confidence Only                | N/A                    | ≤                  | ≤                | N/A
Length Only                    | N/A                    | ==                 | ==               | >

Table 5.1: Resulting number of hypothesis from uniform distribution tests

Before we tested this, we believed that the counterexamples selected with the Confidence Only score function would better represent the language learned by the RNN, due to the high confidence that the RNN has in them. We thought that adding such a counterexample to an automaton would teach it more about said language than other kinds of examples would, and so the learner would not need as many examples to reach symmetry as one taught by a teacher with a different score function.

But why would All Best Perturbations behave differently from Best Perturbations, given that All Best Perturbations generates not only the same mutation as Best Perturbations but also many more? We still do not know, and we would need to continue testing to find an answer.

With all of that said, we feel that the only thing we can safely conclude is that:

Only Best Perturbations and No Perturbations with Length Only managed to reduce the number of generated hypotheses.

5.2.1.2 Automata Variance

Score \ Teacher                | All Best Perturbations | Best Perturbations | No Perturbations | Random Perturbations
Confidence - Length Percentage | N/A                    | ≳                  | ≳                | ≳
Confidence Only                | N/A                    | >                  | >                | >
Length Only                    | ≤                      | ==                 | ==               | >

Table 5.2: Resulting automata variance from uniform distribution tests


Nevertheless, we do not have any evidence right now to confirm this hypothesis. If we were interested in confirming this, we could probably analyze the variance of the generated examples, and see how many of them are repeated between learners.

With all of that said, we feel that the only thing that we can safely conclude is that:

Only All Best Perturbations with Length Only managed to reduce the generated automaton variance.

5.2.1.3 Execution Time

Score \ Teacher                | All Best Perturbations | Best Perturbations | No Perturbations | Random Perturbations
Confidence - Length Percentage | >                      | >                  | >                | >
Confidence Only                | >                      | >                  | >                | >
Length Only                    | >                      | >                  | >                | >

Table 5.3: Resulting execution time from uniform distribution tests

[Figure: Time per learner per Algorithm. Boxplot of the time in ms for the baseline LSTMTeacherCZ_UniDistrGen and for each Teacher/Score function combination with UniDistrGen.]

[Figure: Time in ms for the baseline LSTMTeacherCZ_UniDistrGen and for each Teacher/Score function combination with UniDistrGen.]

With this information, we can safely conclude that:

Our smart selection did not reduce the execution time of the automata inference algorithm.

5.2.2 Tests with Learned Distribution

Running our tests with a learned distribution rather than a uniform one proved to be a difficult task. Even though calculating the time complexity of Algorithm 12 is out of the scope of this work, it is clearly greater than that of Algorithm 11, and hence what happened is not a surprise. Execution times grew dramatically, to the point that we could not run the more demanding tests because of the sheer amount of time they would take. Even when we did try running them, our computers would cancel all of the tasks after the 20-hour mark, and by that time some of the Teacher/Score function combinations still had not finished running. We discuss the reasons for this increase in execution time in section 5.2.2.3. We were advised to look for performance optimizations with the Visual Studio Performance Profiler [20], but even after running it a couple of times we did not obtain any significant performance boost.

Because of this, we sadly had to give up on running the more demanding tests and focused on the faster, less demanding ones. This is why we ran the ECommerce and Reduced ECommerce tests with only the baseline and NoPerturbations Teachers.


5.2.2.1 Number of Hypothesis

Score \ Teacher                     | All Best Perturbations | Best Perturbations | No Perturbations | Random Perturbations
Confidence - Length Percentage      | >                      | >                  | >                | >
Confidence Only                     | >                      | >                  | >                | >
Length Only                         | >                      | >                  | >                | >
LSTM Teacher CZ with LengthDistrGen | >

Table 5.4: Resulting number of hypothesis from length distribution tests

It seems that with our length distribution, more hypotheses are usually needed to reach symmetry between the models. We discussed what could be causing this, and Yovine mentioned the possibility that this distribution was causing PAC to generate models that are overfitted to the RNN. That means that instead of being able to generalize the language from what the RNN learnt, the learners adapt really well to the domain inferred by the RNN [21]. This brings up a new question: could we modify the training process of an RNN so that the overfitting of the learners is not a problem but a virtue?

We would also need to analyze whether our own score functions contribute to the overfitting problem. Confidence Only and Length - Score Percentage could be focusing too much on examples that (although the RNN has a high confidence in their classification) represent really well the language domain inferred by the RNN but not the original language.

Nevertheless, right now the only thing we can safely conclude is that:


5.2.2.2 Automata Variance

Score \ Teacher                     | All Best Perturbations | Best Perturbations | No Perturbations | Random Perturbations
Confidence - Length Percentage      | >                      | N/A                | ≳                | ≳
Confidence Only                     | N/A                    | >                  | >                | >
Length Only                         | N/A                    | N/A                | >                | >
LSTM Teacher CZ with LengthDistrGen | >

Table 5.5: Resulting automata variance from length distribution tests

After seeing the number of hypotheses obtained with our length distribution, we thought that automata variance would also increase because of the increase in hypothesis count. And mostly it did, but some teachers gave inconclusive results. It seems that the mutation algorithms of All Best Perturbations and Best Perturbations manage to find accurate counterexamples that lead to roughly the same automata.

We also observed that the automata found when using this distribution tend to have more states than the ones found with the uniform distribution. Again, this could be connected to the overfitting, which causes the learner to represent the RNN model more accurately instead of the underlying language.

Nevertheless, the only thing we can safely conclude is that:


5.2.2.3 Execution time

Score \ Teacher                     | All Best Perturbations | Best Perturbations | No Perturbations | Random Perturbations
Confidence - Length Percentage      | >                      | >                  | >                | >
Confidence Only                     | >                      | >                  | >                | >
Length Only                         | >                      | >                  | >                | >
LSTM Teacher CZ with LengthDistrGen | >

Table 5.6: Resulting execution time from length distribution tests

As we mentioned before, execution time with the length distribution grew so much that the more demanding tests (ECommerce and Reduced ECommerce) could not be completed due to the sheer amount of time they took. Nevertheless, we believe that this behaviour is understandable due to two things:

More generated hypotheses: As stated before, our length distribution also increased the number of hypotheses found by each Teacher/Score function combination. This means that more calls to the Equivalence Query had to be made before symmetry was found, which translates to more iterations of a teacher per learner. In other words, more iterations of generating new examples with our length distribution (which was already slower than our uniform one due to the number of operations) and of applying counterexample mutation and selection.

[Figure: Time in ms for every Teacher/Score function combination, with both UniDistrGen and LengthDistrGen (axis range 200 to 500 ms).]

[Figure: Time in ms for every Teacher/Score function combination, with both UniDistrGen and LengthDistrGen (axis range 2000 to 10000 ms).]

With this information, we safely concluded that:

Our smart selection with learned length distribution did not reduce the execution time of the automata inference algorithm.

5.2.2.4 Other results regarding our learned distribution:

Average automaton error found with respect to the RNN increased

When analyzing results, we came across the fact that the average error between the generated model and the RNN increased for every teacher / score function combination that used length distribution.

Figure 5.11: Comparing errors in Omlin & Giles A with 10 states max

highly probable examples, and even though our algorithm should always ask for more and more examples on each iteration, it is not capable of generating examples that are different enough from each other to help the learner completely understand the RNN.

Unanswered question: distribution of hypothesis generated by learner

One of the things we also expected to see when using the length distribution generator was that the number of hypotheses generated per learner would concentrate around a given value, as in a normal distribution. We thought that, for the same teacher / score function combination, the uniform generator would yield much “flatter” results, with similar numbers of learners for each hypothesis count. This idea came from the assumption that our length distribution would, on average, find approximately the same examples for each learner, thanks to the knowledge it has of the underlying language distribution. The uniform distribution, on the other hand, would just probe around randomly, so we thought that finding x or y hypotheses would be equally probable, which would yield much flatter results.

[Figure: Counter ex count per learner in AllBestPerturbations_UniDistrGen_ConfidenceLength. Categories: 0 and 1 counterexamples found; axis: learner count, 0 to 20.]

[Figure: Counter ex count per learner in AllBestPerturbations_LengthDistrGen_ConfidenceLength. Categories: 0, 1 and 2 counterexamples found; axis: learner count, 0 to 12.]

Figure 5.13: Plot created with Length Distribution

6 Conclusions

6.1 What Teacher and score function combination would we choose?

Sadly, even after further analysis of our teachers and several performance optimizations, we could never reach as good a balance of results as the baseline teacher (with uniform distribution) already achieves. It is the fastest one, and on plenty of occasions it even beats the heavier teachers (All Best Perturbations and Best Perturbations) on hypothesis count and automata variance. One question still stands, though: what would happen in the case of a really complicated language? Or when learning from an RNN that did not have many examples to train on?

Nevertheless, with our results so far we cannot confidently claim that any of our teachers and score functions would work much better under those circumstances. As the reader can notice, within our range of testing, not one of our algorithms proved to work better on the more complex ECommerce automaton. We even had to give up on running this test for some of them with the length distribution generator!

With that said, we concluded that a random selection of counterexamples is not a bad option to go with when doing this kind of work. However, it is worth exploring further the options that ended in slight improvements in order to optimize them.

6.2 Next steps

◦ This could let us generate more meaningful examples to mutate, or it could have the same complications and limitations that the current length distribution generation has.

Completely change the example generator algorithm: make use of the fact that we know the model we are learning from is an RNN, and develop a G.A.N. [22]

◦ Generative Adversarial Networks (G.A.N.) are a type of ANN that can learn to generate data from a given distribution by trying to fool another ANN.

◦ The process consists of training a generative ANN (G) to receive random noise samples and transform those samples into data that can be fed into another ANN (C), which is in charge of learning whether a given data sample belongs to a given data distribution or was generated by G itself. G gets better and better at “fooling” C, until C can no longer distinguish between the two. At this point, G will have learned how to correctly transform random noise into a data sample that belongs to the given data distribution.


7 Bibliographical References

[1] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016. [Online]. Available: http://www.deeplearningbook.org

[2] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Comput., vol. 9, no. 8, pp. 1735–1780, Nov. 1997. [Online]. Available: http://dx.doi.org/10.1162/neco.1997.9.8.1735

[3] M. Humphrys, “Continuous output - the sigmoid function.” [Online]. Available: http://www.computing.dcu.ie/~humphrys/Notes/Neural/sigmoid.html

[4] “Hyperbolic functions.” [Online]. Available: http://www.encyclopediaofmath.org/index.php?title=Hyperbolic_functions&oldid=29142

[5] F. Mayr and S. Yovine, “Regular inference on artificial neural networks,” in Machine Learning and Knowledge Extraction, A. Holzinger et al., Eds. Cham: Springer International Publishing, 2018, pp. 350–369.

[6] W. S. Sarle, “Section - what is a softmax activation function?” [Online]. Available: http://www.faqs.org/faqs/ai-faq/neural-nets/part1/preamble.html

[7] I. Cooper, “The unnormalized sinc function.” [Online]. Available: http://physics.usyd.edu.au/teach_res/mp/doc/math_sinc_function.pdf

[8] J. E. Hopcroft, R. Motwani, and J. D. Ullman, Introduction to Automata Theory, Languages, and Computation (3rd Edition). Boston, MA, USA: Addison-Wesley Longman Publishing Co., Inc., 2006.

[9] J. De Boer and M. M. Bonsangue, “Extension and evaluations of learning algorithms.” [Online]. Available: http://liacs.leidenuniv.nl/~kosterswa/bach/posters/boer.pdf

[11] L. G. Valiant, “A theory of the learnable,” Commun. ACM, vol. 27, no. 11, pp. 1134–1142, Nov. 1984. [Online]. Available: http://doi.acm.org/10.1145/1968.1972

[12] D. Angluin, “Learning regular sets from queries and counterexamples,” Inf. Comput., vol. 75, no. 2, pp. 87–106, Nov. 1987. [Online]. Available: http://dx.doi.org/10.1016/0890-5401(87)90052-6

[13] J. Rauber, W. Brendel, and M. Bethge, “Foolbox v0.8.0: A python toolbox to benchmark the robustness of machine learning models,” CoRR, vol. abs/1707.04131, 2017. [Online]. Available: http://arxiv.org/abs/1707.04131

[14] “Crafting adversarial input sequences for recurrent neural networks,” CoRR, vol. abs/1604.08275, 2016. [Online]. Available: http://arxiv.org/abs/1604.08275

[15] N. Papernot et al., “cleverhans v1.0.0: an adversarial machine learning library,” arXiv preprint arXiv:1610.00768, 2016.

[16] R. S. Sutton and A. G. Barto, Introduction to Reinforcement Learning, 1st ed. Cambridge, MA, USA: MIT Press, 1998.

[17] G. Tel, Introduction to Distributed Algorithms, 2nd ed. New York, NY, USA: Cambridge University Press, 2001.

[18] C. W. Omlin and C. L. Giles, “Constructing deterministic finite-state automata in recurrent neural networks,” J. ACM, vol. 43, no. 6, pp. 937–972, Nov. 1996. [Online]. Available: http://doi.acm.org/10.1145/235809.235811

[19] M. Merten, “Active automata learning for real-life applications,” 2013. [Online]. Available: https://pdfs.semanticscholar.org/9cb9/74b6ece3e3fc2eab4f9cf0843bfc570df4a9.pdf

[20] Microsoft, “Beginners guide to performance profiling.” [Online]. Available: https://msdn.microsoft.com/en-us/library/ms182372.aspx

[21] I. V. Tetko, D. J. Livingstone, and A. I. Luik, “Neural network studies, 1. comparison of overfitting and overtraining,” Journal of Chemical Information and Computer Sciences, vol. 35, pp. 826–833, 1995.

8 Annexes

8.1 Annex 1: Tests Inputs

A Ending

◦ Language automaton = A Ending
◦ stepsCount = 12
◦ ε = 0.005
◦ δ = 0.005
◦ maxQueryLength = 12
◦ maxStates = Integer.MAX_VALUE
◦ languageDatasetTrainingSize = 30000

A or B

◦ Language automaton = A or B
◦ stepsCount = 12
◦ ε = 0.05
◦ δ = 0.05
◦ maxStates = Integer.MAX_VALUE
◦ languageDatasetTrainingSize = 30000

Omlin & Giles A with unlimited max states

◦ Language automaton = Omlin & Giles A
◦ stepsCount = 12
◦ δ = 0.05

Omlin & Giles A with 10 states max

◦ Language automaton = Omlin & Giles A
◦ stepsCount = 12
◦ ε = 0.05
◦ δ = 0.05
◦ maxStates = 10

Alternating Bit Protocol

◦ Language automaton = Alternating Bit Protocol
◦ stepsCount = 16
◦ ε = 0.05
◦ δ = 0.05

E Commerce Automaton

◦ Language automaton = E Commerce Automaton
◦ stepsCount = 16
◦ ε = 0.05
◦ δ = 0.05
◦ maxStates = 10

◦ Language automaton = Reduced E Commerce
◦ stepsCount = 16
◦ ε = 0.05
◦ δ = 0.05
◦ maxStates = 10

Reduced E Commerce Automaton 2

◦ Language automaton = Reduced E Commerce
◦ stepsCount = 16
◦ ε = 0.05
◦ δ = 0.05
◦ maxStates = 10

8.2 Annex 2: More plotted results

Alternating Bit Protocol

Number of Hypothesis

[Figure: Count of Counter Examples Found Per Algorithm. Axis: number of counterexamples found, for every Teacher/Score function combination with UniDistrGen and LengthDistrGen.]

Automata Variance

[Figure: DFA Found Per Algorithm. Axis: number of DFA, for every Teacher/Score function combination with UniDistrGen and LengthDistrGen.]

Execution Time

[Figure: Time in Milliseconds in Total Per Algorithm. Axis: milliseconds, for every Teacher/Score function combination with UniDistrGen and LengthDistrGen.]


Table Of Contents

1 Introduction
1.1 Motivation
1.2 Context
1.2.1 Artificial Intelligence (AI)
1.2.2 Deep Learning
1.2.3 Artificial Neural Networks
1.2.4 Recurrent Neural Networks (RNN)
1.2.4.1 Long-Short Term Memory (LSTM)
1.2.5 RNN Used
1.2.5.1 Explainable Artificial Intelligence
1.2.6 Deterministic Finite Automata (DFA)
1.2.7 Probably Approximately Correct Learning (PAC Learning)
1.2.7.1 Oracle EX
1.2.7.2 Oracle EQ
1.2.7.3 Oracle MQ
1.2.7.4 Distribution Free (DF)
1.2.8 L* Algorithm
1.2.9 Bounded L* Algorithm
2 Problem Statement
2.1 Introduction
2.2 Counterexample Search
2.2.1 Research question.
2.2.2 Distribution Learning (DL)
3 First Steps
3.1 Adversarial Examples
3.2 Papernot
3.3 Cleverhans