
VRIJE UNIVERSITEIT AMSTERDAM
COMPUTER SCIENCE

UNIVERSIDAD POLITÉCNICA DE MADRID
FACULTAD DE INFORMÁTICA

APP REVIEWS ANALYSIS WITH MACHINE LEARNING

BACHELOR FINAL PROJECT

Author: María Sánchez Piñeiro
Supervisor: Ivano Malavolta

Academic course 2018/2019

Contents

1 Resumen
2 Introduction
3 Background
4 Design
  4.1 General operation
5 Classification Pipeline
  5.1 Machine Learning Algorithms
  5.2 Scores
6 Input Arguments
  6.1 Self Contained
  6.2 Restrictions on parameters
7 Plugins
  7.1 Plugin Structure
  7.2 Example Classifier
8 Output of Results
  8.1 Terminal output
  8.2 Classifiers Predictions
9 Testing
  9.1 Expected cases
  9.2 Corner cases
10 Usage information
11 Related Work
  11.1 A Study on Run-Time Permissions
  11.2 Execution
  11.3 Findings and Conclusions
12 Further Research

Abstract

The importance of portable devices and apps is increasing exponentially. Users are a key part of this commerce, expressing their opinions in app reviews, which are important for developers. We believe in the potential of this large amount of data and its utility for further studies in any field. For this purpose, we created an automatic review classifier. This paper covers the construction of the classification tool, focusing on its flexibility and extensibility. The result is a tool for comparing Machine Learning algorithms and labeling reviews that combines different natural language preprocessing techniques. It also allows researchers to test their own parameters and to add different algorithms.

1 Resumen

Mobile devices and user opinions play an increasingly important role in the world of technology. Mobile application platforms collect large amounts of user opinion data without analyzing it automatically. We believe in the great potential of these data and have created a tool for the automatic classification of user reviews of mobile applications, with the aim of offering future researchers a way to extract information from them. This report covers the project of building a text classifier that uses machine learning and natural language processing techniques for supervised classification. It details how the classifier works and all the options available to the researcher. The classifier performs text preprocessing and feature extraction and compares several machine learning methods in the classification, showing their scores and results. Part of the data fed into the tool must be labeled in order to train the algorithms and thus be able to analyze the rest of the reviews. The researcher then chooses among the text preprocessing and feature extraction options offered, including tokenization, lemmatization and stemming of the text and the inclusion of the review ratings in the classification. Finally, the tool automatically classifies the unlabeled data and shows a comparison of the available algorithms. It also offers the possibility of adding other classification algorithms or 'plugins', which are used in the same way as those available by default. The overall project of building the tool was carried out by three Computer Science students at the Vrije Universiteit Amsterdam. This part focuses on the code of the tool (back-end). The other two parts consisted of building a graphical interface for it (front-end) and a practical case study using it in a real investigation, respectively. Another objective of this sub-project is to make the tool as extensible and flexible as possible, to facilitate its future use by people who are not experts in the field, and to coordinate goals with the graphical interface sub-project.

2 Introduction

As mobile devices offer increasing capabilities, developers create more apps for different operating systems and the market of apps keeps expanding. In 2017 Google announced more than 2 billion monthly active users on Android (https://www.theverge.com/2017/5/17/15654454/android-reaches-2-billion-monthly-active-users), and in 2019 the Google Play Store offered more than 2.6 million apps available to download on the biggest operating system in the world. These apps are generally delivered through app stores that let users post reviews. On these platforms users express their concerns about the operating system or the applications. Developers intend to stay up to date with what users think or expect from their apps, but the large quantities of data stored in the review systems make it hard to analyze their content. With Machine Learning it is now possible to categorize large amounts of data from a small manually labeled input.

This study presents the construction of a tool for the automatic classification of app reviews. The classifier uses natural language preprocessing techniques on the review text and compares the performance of different Machine Learning algorithms on the labeling, as well as automatically classifying the unlabeled reviews. The output is intended to help categorize and analyze the content of the reviews. This project also concerns the construction of a solid architecture for the reviews analysis tool, to make it extensible and self-contained. The main concern is enabling developers to add their reviews and classify them choosing labels, heuristics and algorithms. Another aim of this development is making it easy to test and compare different classification algorithms, as well as obtaining their results in a suitable format. An argument parser was also built, adding the possibility of inputting all the classification parameters. We included a plugin structure that allows researchers to test their plugins and output their labeling predictions. An additional goal of this project was making the classifier compatible with the interface, as well as making it self-contained and callable in isolation from the terminal.

The project is divided into three sub-projects, including the development of a web interface and a particular use case of the tool related to React Native apps. This part of the project focuses on the construction of the classifier and on compatibility with the web interface. The project can be found in the replication package (App Reviews Analysis with Machine Learning, June 2019: https://github.com/S2-group/appReviewsAnalysis). The classifier can be run from the terminal and also using the available web interface. All the source code developed in the context of this project is in Python 3 and follows general Python code conventions (https://development.robinwinslow.uk/2014/01/05/summary-of-python-code-style-conventions/). All the output files are in csv format.

The target audience of this paper are app developers and researchers. We provide the Software Engineering scientific community with a classifier that is easy to use and able to work on any topic, highlighting the meaningful information hidden in the users' reviews. The paper is structured as follows: in the first place, some background is described; then the general functioning of the tool is presented. More details will be

deeply explained in further sections, with the ultimate goal of providing any user with enough information to understand and use the tool after reading this paper. The related work is explained at the end of the report.

3 Background

In the past years more and more businesses have joined the trend of including review platforms to allow users to voice their opinions, complaints and recommendations. With the increasing number of review systems in different products and services, the volume of data stored is getting larger. Multiple studies have been carried out investigating the importance of reviews and their analysis in different fields. Most of these studies conclude that the confidence and consideration of users in a product or service is significantly related to the reviews of other customers [1, 2, 3]. Analyzing the reviews is not only important for future customers, but also to see what needs to be improved.

For this analysis we are using Machine Learning [4]. This technique is used in computer systems to perform tasks without specific instructions. Numerous studies work with Machine Learning to extract conclusions from data [5, 6]. In our case we are carrying out a supervised classification to label the inputted reviews according to some specified labels that the researcher provides. Machine Learning algorithms are able to learn from this data through patterns and inference. An important part of text classification is preprocessing. Natural language is broad and can be hard to categorize. Preprocessing techniques transform the data before using Machine Learning algorithms [7].

4 Design

The main guidelines and contributions of this project are:

- A classification pipeline for the processing of the review data with seven default algorithms available, including preprocessing and feature extraction on the data.
- The possibility of modifying every parameter of the classification, exploring all combinations of preprocessing, feature extraction and the available classifiers.
- A terminal call with an argument parser for the input and checking of all parameters, also showing information in the terminal output while running, including the input parameters and classification results. The help mode of the terminal call explains its usage.
- Development of a plugin architecture giving researchers the possibility of using their own algorithms.
- Automatic extraction of the labels in the labeled data, to avoid easy mistakes when introducing them.
- Output of results in files: every classifier outputs files for all label predictions and one summary file with all of them.
- Integration with the graphic interface, including an information mode showing all the classification possibilities and the possibility of reading data from stdin.
- Measures for extensibility and ease of use in the code, separating it into different functions, including a main function, and adding comments.

4.1 General operation

The analysis tool is meant to classify app reviews using machine learning algorithms, with the aim of generalizing and investigating users' concerns or points of improvement in apps. All parameters are introduced through the terminal call or, if not required, have set default values. The classification is supervised, which entails that some of the inputted data must be labeled. Both labeled and unlabeled data are inputted as two '.tsv' files containing the reviews, and must have specific columns like the raw review text and score, including the labels in the training data. The labeled data is split into train and test. The train partition is used to train the different classifiers and the test data to measure their performance. Once the classifiers are trained, the unlabeled data is classified and labeled.

For the classification, a pipeline is used that computes the preprocessing of the text and the feature extraction from the data. The tool provides the option of applying stop words removal, stemming and lemmatization on the raw text of the reviews. In addition, bigrams and trigrams can be added as features for the classifiers. The labels of the training data can be given as an input or automatically calculated by the tool. The different methods that the classifier will run are also specified as inputs. Other variable parameters in the tool are the percentage of the train/test partition in the labeled data and the number of splits in the cross validation.

The tool outputs the results of the classification in two ways. On the one hand, while the methods are training, some performance measures are shown, such as the precision, recall, F1-score and area under the curve calculated with the validation data. For this, cross validation is used to make the results more reliable. Moreover, the classification predictions of the trained algorithms on the unlabeled data are saved as files. Researchers can decide which algorithms performed best using their own criteria and examining the measures, in order to use their predictions. Another possibility that the tool offers is including one's own heuristics and algorithms, relying on the plugin structure. An optional argument is the path to classifiers that can be added to the pipeline and perform like every other included algorithm. One specific example of an external classifier was developed and can be found in the replication package.
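As an illustration of this operation, the sketch below shows one way the two '.tsv' inputs could be read and the labeled data split into train and validation partitions. The file names, the 0.8 train size and the use of pandas and train_test_split here are illustrative assumptions, not the exact code of the tool.

```python
# A minimal sketch of reading the two input files and splitting the labeled data.
# File names and the 0.8 train size are illustrative; the real tool takes them
# as terminal arguments (see Section 6).
import pandas as pd
from sklearn.model_selection import train_test_split

labeled = pd.read_csv("labeled_reviews.tsv", sep="\t")      # reviews with labels (training)
unlabeled = pd.read_csv("unlabeled_reviews.tsv", sep="\t")   # reviews to be classified

# The train partition trains the classifiers, the rest measures their performance.
train, validation = train_test_split(labeled, train_size=0.8, random_state=0)
```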

5 Classification Pipeline

Pipelines are structures that make it easier to repeat commonly used steps in the modeling process, sequentially applying a list of transformations. For this reason, pipelines are commonly used in machine learning tasks. For the analysis of the reviews a classification pipeline was developed using the modules Scikit-learn (scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html) and Nltk (nltk.org). The input of the pipeline is the text of the reviews, and the review ratings are an optional input. For the training phase of the classifiers the pipeline also uses the labels of the training data. The pipeline consists of three phases:

• Preprocessing: The preprocessing of text is an important step before the classification and can significantly change the performance of the algorithms. There are various preprocessing techniques applicable to text, from which we chose to implement the most commonly used (https://towardsdatascience.com/text-preprocessing-steps-and-universal-pipeline-94233cb6725a): tokenization, stop words removal, stemming and lemmatization. Tokenization separates text into tokens or words. Stop words are commonly used words in the English language, like "as" or "so", that do not usually contain important information for the classification and can be removed. Stemming reduces words to their root form, which would reduce 'fishing', 'fished' and 'fisher' to the root 'fish'. Lemmatization reduces words to their canonical form or lemma, like 'better' to 'good'. Stemming and lemmatization are used to make similar words look the same for the classifier, seeking a better performance.

• Feature extraction: The reviews are represented as bags of words, where every review text is equal to a list of words, not taking into account their order but keeping their frequency. This allows the algorithms to learn from the existence and frequency of words. Tf-idf (term frequency-inverse document frequency) normalization is used: a measure of the relevance of a word in the document that gets lower the more frequent the word is in the whole collection. This index gives more importance to the words that appear few times in the whole collection and avoids giving excessive importance to words commonly repeated in natural language. N-grams (series of n words that appear together in the text) were also experimented with, to take sequences into account.

• Classification: The set of reviews was analyzed by multiple binary classifiers. It was taken into account that a review can belong to more than one category. The classification process is explained in more detail in the following section.

Figure 1: Pipeline Structure
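To make the three phases concrete, the following is a minimal sketch of such a pipeline assembled with Scikit-learn. The toy reviews, the parameter values and the choice of Naive Bayes as the final step are illustrative assumptions, not the exact configuration used by the tool.

```python
# A minimal sketch of the three-phase pipeline using Scikit-learn.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy labeled reviews for one binary label (1 = belongs to the label, 0 = does not).
train_texts = ["love this app", "crashes all the time", "great and useful", "useless update"]
train_labels = [1, 0, 1, 0]
unlabeled_texts = ["really useful app", "keeps crashing on start"]

pipeline = Pipeline([
    # Preprocessing + feature extraction: lowercasing, stop words removal and
    # tf-idf weighting; ngram_range=(1, 2) also adds bigrams as features.
    # (Stemming or lemmatization with Nltk could be plugged in as a custom tokenizer.)
    ("tfidf", TfidfVectorizer(lowercase=True, stop_words="english", ngram_range=(1, 2))),
    # Classification: one of the binary classifiers, here Naive Bayes.
    ("clf", MultinomialNB()),
])

pipeline.fit(train_texts, train_labels)      # supervised training on the labeled data
print(pipeline.predict(unlabeled_texts))     # e.g. [1 0] for the two unlabeled reviews
```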

5.1 Machine Learning Algorithms

Different Machine Learning algorithms were included in the tool. The chosen algorithms are commonly used in natural language processing and, in addition, have been shown to perform well in binary classification. The algorithms available in the tool (see scikit-learn.org/stable/supervised_learning.html) are:

• Naive Bayes: Applies Bayes' theorem with independence assumptions between features, assuming that the presence of one feature does not affect the presence of others. The Naive Bayes algorithm is easily scalable to large datasets.

• Decision Tree: Uses a tree-like graph. Each node in the tree represents an attribute, the links between nodes represent decisions made on the features, and leaf nodes represent labels. The algorithm starts at the initial node and makes decisions until it reaches a leaf node.

• Random Forest: Builds and trains multiple decision trees and uses the mode of their predictions.

• Maximum Entropy: Also known as multinomial logistic regression, based on the principle of maximum entropy. It tries to find the model that best fits the relationship between the outcome dependent variable and the independent ones.

• Support Vector Machine: Plots data as points in a high dimensional space and finds hyperplanes that maximize the distances between different classes. It focuses on finding the optimal separating hyperplane between the two classes.

• K-Nearest Neighbours: The KNN classifier analyses the k nearest points from the training data and assigns the class most prevalent among those points. For this algorithm it is also possible to choose the number k.

• Multilayer Perceptron: Multiple neural network units that use linear classification in a neuron structure (units organized in layers that take inputs and convert them to outputs by applying a function).

5.2 Scores

The tool compares the performance of the classification methods with the measures precision, recall, F1-Score and area under the curve (see https://scikit-learn.org/stable/modules/model_evaluation.html). Precision(C) is the portion of instances classified as belonging to class C that were classified correctly. Recall(C) is the fraction of reviews belonging to class C that were classified correctly. The F1-Score was also calculated, as a harmonic mean of the two. The area under the curve of the ROC plot indicates how well the classifier separates the group being tested. For this, we use the probabilities outputted by the classifier indicating how sure it is of a result. Due to the small amount of training data available, k-fold cross validation was used to test the models. The training data was split into k equal portions and one of the k groups is the validation data for each repetition. All metrics were calculated for each validation group.
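The sketch below shows how these scores can be obtained with repeated k-fold cross validation in Scikit-learn, reusing the pipeline and toy data from the previous sketch. The number of splits and repeats is kept small only so the toy data can be split; the values, like the rest of the snippet, are illustrative.

```python
# A minimal sketch of scoring a classifier with repeated k-fold cross validation.
from sklearn.model_selection import RepeatedStratifiedKFold, cross_validate

# Small values so the toy data above can be split; the tool's defaults are
# 5 splits and 10 repeats (see the input parameters in Section 6).
cv = RepeatedStratifiedKFold(n_splits=2, n_repeats=3, random_state=0)

scores = cross_validate(pipeline, train_texts, train_labels, cv=cv,
                        scoring=["precision", "recall", "f1", "roc_auc"])

# Mean of each measure over all validation folds.
for measure in ("precision", "recall", "f1", "roc_auc"):
    print(measure, scores["test_" + measure].mean())
```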

6 Input Arguments

In this section all the inputs of the tool are described. The flexibility of the global variables in the code saves significant time for the researchers using our tool, since most of the parameters are not intuitive to change without understanding its functioning deeply. We also try to make the program effortless to understand and change in the future. The input arguments are:

∗ Two file paths to the labeled and unlabeled data containing all the information to train the algorithms and to be classified (labeled file and unlabeled file). The labeled data is required, which means the tool will not run without it. The unlabeled data can be provided in case we want to label raw data. The researcher provides two paths, and both files are checked to be openable and in '.tsv' format. The columns of the labeled file must be (ID, Date, Review Score, Review text, Sentinent, Label); the columns of the unlabeled file must be (ID, Date, Review Score, Review text). An example of both input files is available in the replication package. The 'ID' column must not contain the character " since it causes problems in the csv format of the output.

∗ A list of labels taken into consideration for the classification. The labels can be manually given as an input for the algorithms or automatically calculated by an auxiliary function. In case of manual input, all the given labels must be represented in at least one example of the labeled data. The format is a list of strings. This format gives the researcher the opportunity to easily test different labels instead of computing all of them at the same time, as well as the convenient option of automatically finding the labels.

∗ Seven binary integers indicating whether or not to use each of the machine learning classifier algorithms: NAIVE, MAXENT, FOREST, SVM, KNN, TREE and MLP. This gives the researcher the opportunity to experiment with the desired algorithms separately, also saving time. The number of neighbours k can also be introduced.

∗ Three binary integers indicating whether the data will be preprocessed with stop words removal, stemming or lemmatization.

∗ One binary indicating whether the ratings of the reviews are used for the classification or not. If they are not, the input of the classifier is only the preprocessed text of the reviews.

∗ One integer indicating the n-grams included in the feature extraction. These can be none, bigrams or trigrams. When the given parameter is 0, no n-grams are used; if it is 1, bi-grams are used; if it is 2, tri-grams are used.

∗ A double for the train size and two integers for the number of splits and repeats. The number of splits corresponds to the number of partitions in the cross validation. The number of repeats is the number of iterations of the cross validation method. The train size is the percentage that divides the labeled data into train and validation data.

∗ The path to the plugins directory can also be introduced and, in case we do not need all of them, a list of names of the specific plugins inside the directory that we want to use.

Parameter | Format | Required | Default
Labeled data | File path | Yes | –
Unlabeled data | File path | No | –
Labels | List of strings | No | Automatic
Classification methods | Seven binaries | No | All 0
Preprocessing (stop words, stem, lemma) | Three binaries | No | All 1
No ratings | One binary | No | 1 (ratings used)
N-grams | One integer | No | 1 (bi-grams)
Train size, N splits, N repeats | One double, two integers | No | 0.8, 5, 10
Plugins path | Plugins directory path | If plugins | –
Plugins | Plugins names | No | –

Table 1: A summary of the input variables.

6.1 Self Contained

The classification tool is self-contained and can be used in isolation from the command line. Another part of the project, not included in this paper, consisted of the creation of a web interface, which makes the tool much easier to use and more accessible. All the parameter input was built with argparse (docs.python.org/2/library/argparse.html), a module that supports command line options. Every input argument was defined and a help option was automatically created by the parser. Figure 2 shows part of the script output when it is called with the parameter '-h' or '--help': a description of the command line usage. As we can see in the help output, all the previously explained parameters can be introduced in the command line. Most of them are the classification settings. One possible call to the script would be:

python classification.py --labeled_data ./Reviews_manually_classified --unlabeled_data ./Reviews_for_classification --methods 1,1,0,0,0,0,0 --pre 1,1,1 --param 0.8,5,25 --extraction 1

This outputs the results of the first two methods (Naive Bayes and Maximum Entropy) with the selected parameters, using the data in the files.

Figure 2: Summary of the terminal call usage

Note that it is also possible to call the script with only the argument '-i' for information, with '-h' or '--help' for instructions on how to use it, and with the argument '-p' for a search of plugins in a directory, with no need to use the rest of the arguments. The argument '-s' tells the script to use data from stdin instead of the provided files. The arguments '-i', '-s' and '-p' were built mainly for the interface adaptation.
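The following is a minimal sketch of how such a parser can be declared with argparse. The option names mirror the example call and the flags mentioned above ('--methods', '--pre', '-i', '-s', '-p'), but they are illustrative and may not match the real parser in classification.py exactly.

```python
# A minimal sketch of the terminal argument parser; option names are illustrative.
import argparse

parser = argparse.ArgumentParser(description="App reviews classification tool")
parser.add_argument("--labeled_data", help="path to the labeled '.tsv' file")
parser.add_argument("--unlabeled_data", help="path to the unlabeled '.tsv' file")
parser.add_argument("--methods", default="0,0,0,0,0,0,0",
                    help="seven binaries: NAIVE,MAXENT,FOREST,SVM,KNN,TREE,MLP")
parser.add_argument("--pre", default="1,1,1",
                    help="three binaries: stop words removal, stemming, lemmatization")
parser.add_argument("--extraction", type=int, default=1,
                    help="n-grams: 0 none, 1 bigrams, 2 trigrams")
parser.add_argument("--param", default="0.8,5,10",
                    help="train size, number of splits, number of repeats")
parser.add_argument("-i", action="store_true", help="information mode")
parser.add_argument("-s", action="store_true", help="read the data from stdin")
parser.add_argument("-p", help="path to the plugins directory")

args = parser.parse_args()                    # '-h' / '--help' is generated automatically
methods = [int(b) for b in args.methods.split(",")]
```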

6.2 Restrictions on parameters

In software projects the management of corner cases and errors must be properly set up in order to always have full control over what is happening at each step. To enforce the coherence of the parameters and the correct operation of the classifiers, some restrictions are checked on the input before the execution. The errors stop the execution of the script and include complete reports to give the researcher enough information to understand why the particular exception is raised. In the first place, all the necessary parameters need to be provided to the classifier to avoid raising exceptions. Another important check is the quantity and format of the parameters: all parameters should follow the established pattern. For example, the methods parameter should have 7 binary numbers (0s or 1s); other combinations like 4 numbers or any non-binary character are not accepted. Another input verification is the selection of at least one classification method when not using the information mode, including the possible plugins. It is also enforced that some parameters are within a range, like TRAIN_SIZE, which is checked to be between 0.01 and 0.99, N_REPEATS between 0 and 50 and N_SPLITS > 0.
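A minimal sketch of this kind of checking is shown below. The error messages and the limits follow the description above, while the function itself and its signature are illustrative, not the tool's actual code.

```python
# A minimal sketch of the parameter checks described above.
def check_parameters(methods, train_size, n_repeats, n_splits):
    if len(methods) != 7 or any(m not in (0, 1) for m in methods):
        raise ValueError("METHODS parameter should have SEVEN binary values (0s or 1s)")
    if not 0.01 <= train_size <= 0.99:
        raise ValueError("Train size should be between 0.01 and 0.99")
    if not 0 < n_repeats <= 50:
        raise ValueError("Number of repeats should be between 0 and 50")
    if n_splits <= 0:
        raise ValueError("Number of splits should be greater than 0")

# Valid parameters pass silently; invalid ones stop the execution with a clear message.
check_parameters([1, 1, 0, 0, 0, 0, 0], train_size=0.8, n_repeats=10, n_splits=5)
```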

7 Plugins

7.1 Plugin Structure

An additional feature of the tool is the option of including other algorithms. The path of the plugins directory and the names of the plugins that we want to use as additional classifiers can be introduced as input arguments. For this purpose PluginBase (http://pluginbase.pocoo.org) was used: a Python library for the construction of plugin-based architectures. The library makes it easier to build the plugin-based structure without having problems with the inputs. To use the plugin addition correctly it is important to understand how the information is transmitted to the classifiers in the pipeline. The pipeline output is a sparse matrix (towardsdatascience.com/handling-sparse-matrix-concept-behind-compressed-sparse-row-csr-matrix-4fe6abe58a7a): a matrix in which most of the elements are zero. With this, the representation of the matrix only contains the positions in it where there is a value.

Figure 3: Sparse Matrix structure - pipeline output

Figure 3 shows an example of a sparse matrix. The first column in the structure is the position in the matrix and the second is the value. The index [0,0] contains the review score (5 in this case). The rest of the indexes correspond to the words in the raw text after preprocessing, and the values they contain are the tf-idf values. If the review ratings are not used, the score is not stored in [0,0].

7.2 Example Classifier

To test the correct functioning of the plugin-based architecture and to clarify its usage for future users, a concrete example plugin was developed (see danielhnyk.cz/creating-your-own-estimator-scikit-learn). The name of this file is my_plugin.py and it can be found in the replication package. In addition, a similar plugin named my_plugin_2.py was added to test the use of multiple plugins. Plugins must have a specific structure to match the structure and the attributes used in the classification tool. The required functions in our plugins are fit and predict. The fit function should be used to train the classifier. The predict function forecasts the classification of concrete examples. Another included function is meaning, which assists the predict function in labeling concrete values. It is also possible to override the score function, calculating the accuracy of our classifier (which is normally measured automatically), and also the predict_proba function, to give probabilities to the predictions. All these attributes should be inside the class MyPlugin, as in the example plugins. The simple my_plugin developed was programmed to classify as 'Positive' all reviews rated with three or more points (up to five). Likewise, the classifier labels as 'Negative' all reviews with a score of zero to two stars, as we can see in the meaning function. The alternative my_plugin_2 classifies as 'Positive' the reviews rated with two or fewer points. Note that this will not work correctly if the review ratings are not used for the classification.
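In the spirit of my_plugin.py, the sketch below shows a plugin with the required structure. The class body is an illustrative reconstruction based on the description above (ratings of three or more stars labeled as 'Positive'), not a copy of the file in the replication package.

```python
# A minimal sketch of a plugin with the structure described above.
import numpy as np

class MyPlugin:
    def fit(self, X, y=None):
        # Required. This rule-based example has nothing to learn, so it only returns itself.
        return self

    def meaning(self, score):
        # Helper used by predict: 1 ('Positive') for ratings of three or more, else 0 ('Negative').
        return 1 if score >= 3 else 0

    def predict(self, X):
        # Required. X is the sparse matrix produced by the pipeline;
        # column 0 holds the review rating when ratings are used.
        ratings = np.asarray(X[:, 0].todense()).ravel()
        return np.array([self.meaning(r) for r in ratings])

    def predict_proba(self, X):
        # Optional. Probabilities per class, used for the AuC and the 'Probability' column.
        positive = self.predict(X).astype(float)
        return np.column_stack([1 - positive, positive])
```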

8 Output of Results

There are two ways of outputting information: terminal output and output files. The terminal output shows the chosen parameters of the classification and the performance of every method on the validation data. The output files contain the labeling of the unlabeled data computed by the methods.

8.1 Terminal output

One of the main functions of the program is comparing the performance of the different algorithms in the classification of every label. In this section the information shown in the terminal output is explained in detail. It is distributed in the sections "preprocessing and features extraction", "classifiers", "plugins", "data" and "scores".

- Value of parameters: The chosen values for every parameter in the classification are shown, including preprocessing and feature extraction: we can see if the tool is using the review ratings, bigrams or trigrams, stop words removal, stemming or lemmatization.
- Classifiers and Plugins: The classifiers that the tool will compare and the path to the plugins that we chose to use are outputted.
- Data: The number of reviews found in the labeled and unlabeled file is shown. The labels and the extraction method (manual or automatic) are also specified.
- Performance measures: The output also contains a table comparing the classification results of every method. For every label and classifier, the computed measures are: count of occurrences in the training data, Precision, Recall, F1-Score, Test Accuracy and Area under the Curve.
- Number of predicted reviews per label: The number of reviews in the unlabeled data that each classifier predicted as "positive" or "negative" with respect to the specific label.

The next figure shows the output of the tool for two labels and two classification algorithms using the parameters shown.

Figure 4: Example of output

8.2 Classifiers Predictions

In case unlabeled reviews are provided, the program also outputs the predictions of every classifier in file format inside the directory './results'. The tool outputs one file for the combination of every algorithm with every label. The name of the file for the label 'PP' and the method 'naive_bayes' would be "result_PP_naive_bayes.txt". The files are in 'csv' format (see https://www.pythonforbeginners.com/systems-programming/using-the-csv-module-in-python/) and have the headings 'ID', 'Label' and 'Probability'. The column 'ID' displays all the identifiers of the unlabeled reviews. The column 'Label' shows '0' or '1', referring to the "Negative" or "Positive" classification that the algorithm gave to this particular review for the label. Some files may also include a column 'Probability'. This column assigns a number between 0 and 1 to every review, meaning the probability of it corresponding to the label (or class 'Positive'). It is a way to know how sure the classifier is about its predictions. Note that some algorithms like decision tree always output probabilities of 0 or 1, while others like knn output multiples of 0.2 (0.4, 0.6, ...) and maximum entropy outputs a real number with many decimals. The output of 'Probability' depends on the existence of the attribute predict_proba in the classifier. For simplicity with the plugins, if the algorithm does not have this attribute, the probability predictions are not outputted. This gives more flexibility to the developer since the probability part is optional to compute.

As additional information, the output includes one file per classifier merging the predictions for all the labels, also in 'csv' format. These files have one label per column and show the predictions, where the lines correspond to the unlabeled reviews in the same order. The name of the global classification file for naive bayes would be "total_naive_bayes.txt". The predictions are the probability of the result, or '0' or '1' when there are no probability outputs. An important detail to take into account is that these files contain partial or incorrect data if the execution is stopped while the classification is being computed.
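As an illustration of the per-label files, the snippet below assembles one of them with pandas. The identifiers and values are invented, while the file name and the 'ID'/'Label'/'Probability' headings follow the description above.

```python
# A minimal sketch of writing one per-label result file; the data is invented.
import os
import pandas as pd

os.makedirs("./results", exist_ok=True)

predictions = pd.DataFrame({
    "ID": ["review_001", "review_002"],   # identifiers of the unlabeled reviews
    "Label": [1, 0],                      # 1 = 'Positive', 0 = 'Negative' for this label
    "Probability": [0.91, 0.12],          # only present if the classifier has predict_proba
})

# csv content with a '.txt' name, as in "result_PP_naive_bayes.txt".
predictions.to_csv("./results/result_PP_naive_bayes.txt", index=False)
```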

9 Testing

In software projects testing is crucial for the final result. After building the tool, some tests were performed to ensure its correct functioning. On one hand, the correct use of the classifier was tested, with all the possibilities that do not raise exceptions and should work as planned. On the other hand, the corner cases were tested: the cases outside the normal operation of the parameters, where the classifier should not work, checking that the correct exceptions are raised. When an exception is raised, a message is shown with the description of the error and enough information for the user to understand why it is happening.

All the methods were tested using different combinations of inputs. First, the performance of the methods with the raw text of the reviews in bag-of-words representation without any preprocessing; bigrams and trigrams were also tested; then the different preprocessing methods (stemming, lemmatization and stop words removal); and, in addition, the ratings of the users.

9.1 Expected cases

The expected cases, in which the program should work as planned, that were tested are:

- With the labeled data, the classifier works correctly if the command-line call provides a labeled data file in '.tsv' format with the required columns. It is only not required in info mode (argument -i) or with stdin input (-s).
- With and without unlabeled data, the classifier works in both situations in a different way. When no unlabeled data is provided, it only trains and compares the different algorithms without predicting unlabeled values. Unlabeled data in '.tsv' format with the required columns is predicted correctly.
- The classifier can work with and without labels. When the labels are not provided, they are correctly searched for automatically in the data file. The classifier works with any subset of the labels in the labeled data.
- The classifier works properly when the values for the preprocessing, feature extraction and classification parameters are in domain and have the correct format. Also, the script uses the inputted values in the classification.
- The program uses all the default algorithms correctly and changes the number of neighbours in KNN.
- The handling of plugins is correct, both when they are located in the default directory and in another given directory. The script finds all the plugins located in the directory precisely and includes them in the classification, even together with other default classification algorithms. The plugins containing the expected structure are included in the classification pipeline and work as programmed.
- The handling of input arguments is correct: each of them leads to the expected action or change planned. When the -i parameter is inputted, the script shows information about the different options the classifier provides and finishes the execution instantly. When the -s parameter is inputted, the classifier uses stdin data input for the labeled and unlabeled files. The data is correctly processed and stored.

- The program correctly creates a ./result directory and stores the results files in case unlabeled data is provided. The output files are in '.csv' format and contain the correct results of every classification. The general classifier files contain a merge of all the label predictions without errors.

9.2 Corner cases

The corner cases tested and their corresponding solutions are:

- The terminal call can be incorrect because required parameters are missing or wrong parameters are given. In this case, a help message is automatically generated by the argument parser and displays the correct usage of the command.
- When the parameters are not well inputted in quantity or format, the script shows a description of the error including the wrong parameter and the right format to input it: "EXTRACTION parameter should have ONE numeric value (0 - none, 1 - bigrams, 2 - trigrams)".
- When any of the given paths does not exist or does not lead to the expected files, an exception is raised. If the labeled data, unlabeled data or plugins paths do not exist, an error is shown printing the given path: "The path to the labeled data does not exist {path}".
- If the data files (labeled and unlabeled) do not have the required columns or there is an error in the input of data from stdin, an exception is raised: "Error in the format of the data file LABELED", "Error in the input of data from stdin".
- When any of the plugins does not have one of the required attributes (MyPlugin() class, fit and predict), an exception is raised: "The plugin {name} does not contain the required attributes: fit and predict". If the missing attribute is predict_proba when trying to calculate the AuC, the error is caught and the behaviour of the script is different.
- In case no classification algorithms are selected, considering the default ones in the script and also the possible plugins, an error message is shown. If there are no plugins in the given directory, the rest of the classifiers will be used; if there are no more classifiers, the error will be shown: "No algorithms selected".
- The input parameters could be out of the desired domain. This concerns the parameters train size, number of repeats, number of neighbours in KNN and n-grams. An exception is raised and the error description provides the name of the wrong parameter and the correct domain: "Train size should be between 0.01 and 0.99".
- In case the labels of the given data do not fit the model, the exception "Error in fitting the models: The labels do not match the data" is raised.

For any other errors, even if the script does not handle them in a specific way, the exception raised will be shown with the error description.

10 Usage information

In this section it is explained in detail how to use the classification tool. In the first place, we should download and install Anaconda Python: a free and open-source distribution of the Python and R programming languages. It is also advisable to

download an IDE (development environment) like PyCharm or Spyder. Both Anaconda and the IDEs can be easily downloaded from the official web pages. The text file 'requirements.txt' is included in the replication package. If we write 'pip install -r requirements.txt' in the terminal inside this folder, the required packages will be installed automatically. Otherwise, when opening the classification.py script with the IDE, we should be prompted to install the necessary modules. These include the two data and computational analysis libraries pandas and numpy, the pluginbase module, the machine learning library sklearn and nltk for natural language processing, among others.

The last step is running the classification.py script from the terminal in our operating system or from the IDE. The calls should have the structure 'python classification.py -arg1 value1 -arg2 value2' with the desired arguments. The option -h shows all possible arguments with their corresponding descriptions of use. For more information, consult the previous sections of this paper or the README.md in the replication package. A graphic interface is also available for the tool and makes its usage easier for non-experts, automating the terminal calls and making the argument input intuitive. It has some additional requirements like node.js and can be found in the replication package. For more information consult the README.md in the frontend folder.

11 Related Work

This contribution follows a previous study on the Android run-time permissions system: an app reviews classifier focused on a concrete case. All the related work belongs to the paper "An Investigation into Android Run-time Permissions from the End Users' Perspective" [8]. The data initially used to test the classification tool this paper concerns were the labeled and unlabeled reviews extracted in that related study about the run-time permissions system. With this, the new tool was verified to output approximately the same results as the related one with the same configuration of preprocessing, feature extraction and classifiers.

11.1 A Study on Run-Time Permissions

For Android developers the privacy of users is crucial. For this reason, the Android operating system introduced run-time permissions to ensure the security of its users against malicious software. This model protects the sensitive and privacy-relevant content in the device by making the apps ask the user for permission before accessing this data. Permissions are classified as normal or dangerous depending on the degree of risk they involve for the user. As with any other software change, the opinions of users are very important for the developers. Some researchers decided to conduct a study evaluating different reviews in the Google Play Store. The goal was to study the way the Android run-time permissions system was perceived and its different benefits and issues. This objective was divided into three research questions:

• RQ1: How accurate is an automatic classifier of user reviews using different combinations of Machine Learning?

• RQ2: To what extent do app reviews express concerns about the permissions system?

• RQ3: What are the main concerns about the permissions system?

The main contribution of that paper is a study on the concerns expressed by users related to the new permissions model. Another important contribution is a semi-automatic pipeline to classify the reviews and research on the efficiency of different machine learning techniques in the classification of reviews. The issues with the new security model and possible methods for solving them were also identified and discussed.

11.2 Execution

The study was made with more than 4.3 million user reviews in the Google Play Store, with the ultimate goal of finding possible points of improvement in the permission model. The database for the study was built with 5,572 apps: the most popular free apps in the Google Store at that moment, after elimination of duplicates and filtering by API version and date. The first step involved keyword-based filtering in order to identify potential permission-related reviews with words like permission, privacy and consent. Then, using Natural Language Processing techniques and Machine Learning, a subset of 3,574 reviews was classified into different permission-related categories. With this it was possible to determine the concrete concerns of the users about this specific change by analysing their reviews of apps. A quantity of 1,000 randomly extracted reviews was manually classified into 10 fixed labels related to the permission system. The manually classified reviews were used to train the Machine Learning algorithms for the labeling of the rest of the reviews.

The classification pipeline has a similar structure to the one in our classifier and included the algorithms Naive Bayes, Maximum Entropy, Random Forest and Support Vector Machine. The preprocessing used was stemming, lemmatization and stop words removal, and the feature extraction included bi-grams and the user ratings. It also uses cross validation. The terminal output compared the performance of the classifiers in a similar way, showing the measures F1-Score, Precision, Recall and Test Accuracy for every algorithm and label. The classifier was also used to label the rest of the reviews automatically.

11.3 Findings and Conclusions

After many executions of the classification methods, comparing the preprocessing, inputs and accuracy of all of them, some conclusions were reached:

• The best combination of preprocessing was lemmatization only, which improves the accuracy of all the algorithms.

• The best results were outputted when using bi-grams.

• The algorithm with the highest accuracy throughout all the experiments was SVM.

More conclusions of interest were extracted in this research about the portion of reviews that involve run-time permissions and the categories found to a greater extent. Also, with the final classification of all the reviews, some other conclusions were made taking into account the number of reviews of every permission-related label. In addition, the correlation between the existence of dangerous permissions in apps and the abundance of some labels in their reviews was studied. In the replication package, in addition to the classification script, we can find an extractor of app reviews, a script that downloads the binary files of the reviews, and some replication packages of previous studies.

12 Further Research

With this study we encourage researchers in any field related to apps to study the user reviews with the analysis tool in search of generalization in the data. We also believe in the potential of Machine Learning for text processing and classification in other areas of interest. As one of the most important characteristics of the tool is its flexibility for the researcher, it can be modified to improve this further, for example by adding more options in the specific parameters of every classifier or accepting other input formats. The tool could also output more information about the classifiers' performance in addition to the scores available now. Preprocessing of text is broad and can be executed in many ways; the tool can also be improved by offering more ways of preprocessing the text before the classification. Seven classifiers are offered in the tool, but there is always the possibility of adding more options for the researcher.

References

[1] Filieri, R., & McLeay, F. (2014). E-WOM and accommodation: An analysis of the factors that influence travelers' adoption of information from online reviews. Journal of Travel Research, 53(1), 44-57.

[2] Vermeulen, I. E., & Seegers, D. (2009). Tried and tested: The impact of online hotel reviews on consumer consideration. Tourism Management, 30(1), 123-127.

[3] Hu, N., Liu, L., & Zhang, J. J. (2008). Do online reviews affect product sales? The role of reviewer characteristics and temporal effects. Information Technology and Management, 9(3), 201-214.

[4] Shalev-Shwartz, S., & Ben-David, S. (2014). Understanding Machine Learning. Cambridge University Press.

[5] Nguyen, T. T., & Armitage, G. J. (2008). A survey of techniques for internet traffic classification using machine learning. IEEE Communications Surveys and Tutorials, 10(1-4), 56-76.

[6] Pang, B., Lee, L., & Vaithyanathan, S. (2002). Thumbs up?: Sentiment classification using machine learning techniques. In Proceedings of the ACL-02 Conference on Empirical Methods in Natural Language Processing, Volume 10 (pp. 79-86). Association for Computational Linguistics.

[7] Vijayarani, S., Ilamathi, M. J., & Nithya, M. (2015). Preprocessing techniques for text mining: An overview. International Journal of Computer Science & Communication Networks, 5(1), 7-16.

[8] Scoccia, G. L., Ruberto, S., Malavolta, I., Autili, M., & Inverardi, P. (2018). An investigation into Android run-time permissions from the end users' perspective. In Proceedings of the 5th International Conference on Mobile Software Engineering and Systems (pp. 45-55). ACM. Replication package: github.com/S2-group/appReviewsAnalysis

