4 IMPLEMENTATION

4.1. Research Procedure

4.2.4. Feature Extraction

Several NLP techniques have been used to extract features:

1. BoW

o At first, the most elemental NLP technique was used to create a baseline, so that the results could later be compared when adding more features.

2. TF-IDF
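As a rough illustration of the two techniques listed here, the following sketch builds BoW counts and TF-IDF weights for a toy corpus using only the Python standard library. The tokenizer, the example sentences, the function names and the smoothed IDF variant, idf = ln((1 + N) / (1 + df)) + 1, are assumptions made for this example; the thesis does not state which implementation or weighting variant was used.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def bag_of_words(docs):
    """Return the sorted vocabulary and one raw term-count vector per document."""
    tokenized = [tokenize(d) for d in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        vectors.append([counts[term] for term in vocab])
    return vocab, vectors


def tf_idf(docs):
    """Return the vocabulary and TF-IDF vectors (assumed smoothed IDF variant)."""
    vocab, counts = bag_of_words(docs)
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = [sum(1 for vec in counts if vec[j] > 0) for j in range(len(vocab))]
    idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]
    return vocab, [[tf * w for tf, w in zip(vec, idf)] for vec in counts]


# Hypothetical toy corpus, not taken from the thesis data set.
corpus = ["you are so dumb", "you are kind", "so kind of you"]
vocab, bow = bag_of_words(corpus)
_, weighted = tf_idf(corpus)
```

In this toy corpus the word "you" occurs in every document, so its TF-IDF weight stays at its raw count, while the rarer word "dumb" receives a higher weight. BoW, by contrast, treats both words identically.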

Nearest Neighbors (kNN), and Random Forests (RF). Despite the unavailability of labeled data, Unsupervised or Weakly Supervised Models were seldom used.

In this present literature review, it was discovered that the best performing model was the Random Forest (Balakrishnan et al., 2020; Chatzakou et al., 2017b, 2019; Tahmasbi & Rastegari, 2018).

Therefore, in this present study five ML models were used:

o Naïve Bayes
o SVM
o Decision Tree
o Adaboost
o Random Forest

4.2.6. Performance Assessment

Concerning performance metrics, Elsafoury et al. (2021) state that the most widespread metrics are accuracy, F1-score, precision, and recall, as well as the Receiver Operating Characteristic-Area Under the Curve (ROC-AUC) score.

Hence, these metrics will be used to measure the performance of the models.
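For reference, these metrics can be computed from the true labels, the predicted labels and the classifier scores as in the minimal sketch below. The function names and the toy labels are assumptions made for illustration and are not taken from the thesis. ROC-AUC is computed here through its ranking interpretation: the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Return accuracy, precision, recall and F1 for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


def roc_auc(y_true, scores, positive=1):
    """ROC-AUC as the share of positive/negative pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    # A tie between a positive and a negative score counts as half a win.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical labels and scores (1 = cyberbullying, 0 = not cyberbullying).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
scores = [0.9, 0.2, 0.4, 0.8, 0.1, 0.6, 0.7, 0.3]
metrics = classification_metrics(y_true, y_pred)
auc = roc_auc(y_true, scores)
```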

4.3. Detecting Bystander Contagion

4.3.1. Bystanders reply feature

This feature has two possible values,

be stored.

As explained in previous sections of this thesis, bystanders are characterized by their indirect interactions with the victim. They communicate with users other than the victim through reactions such as likes, retweets, and replies.

To know whether there was a relatively representative number of bystanders in the corpus ed. The result obtained was approximately 29.74%. Therefore, this feature remained in the set of features used for classification, as it is deemed representative.
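A share of this kind can be obtained by dividing the number of tweets whose binary bystander flag is set by the total number of tweets. The sketch below assumes a plain list of boolean flags; the function name and the example values are hypothetical and do not reproduce the thesis corpus.

```python
def bystander_share(flags):
    """Return the percentage of tweets whose bystander flag is set."""
    if not flags:
        raise ValueError("flags must not be empty")
    return 100.0 * sum(1 for flag in flags if flag) / len(flags)


# Hypothetical flags: 3 of 10 tweets are marked as bystander replies.
example_flags = [True, False, False, True, False,
                 False, False, True, False, False]
share = bystander_share(example_flags)
```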

4.3.2. Like and Retweet Count features

To measure bystander contagion, likes and retweets were used. In studies such as Tahmasbi & Rastegari (2018), the counts of likes and retweets were used to measure the power of an individual in a network, in terms of how often he/she can interrupt the flow of information or how often the person acts as a mediator of communication between any other two individuals in the network. These counts also indicate how viral a tweet can potentially be.
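One simple way to make such counts available to a classifier is to append them, together with the binary reply feature, to the text feature vector of each tweet. The sketch below illustrates this under stated assumptions: the function name, the argument names and the example values are hypothetical, and the thesis does not specify whether the counts were scaled before use.

```python
def add_bystander_features(text_vector, bystander_reply, likes, retweets):
    """Return a new vector: text features followed by the bystander features."""
    if likes < 0 or retweets < 0:
        raise ValueError("counts must be non-negative")
    # The reply feature is binary, so it is encoded as 0 or 1.
    return list(text_vector) + [1 if bystander_reply else 0, likes, retweets]


# Hypothetical BoW vector for one tweet, plus its bystander features.
text_vector = [1, 0, 2]
combined = add_bystander_features(text_vector, True, 12, 3)
```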

5 RESULTS & EVALUATION

Using the performance metrics mentioned in Section 4.2.6, the ML models were assessed, and the results obtained with different sets of features are compared to determine whether the research question has been answered and which set of features produces the best outcome.

As there are several charts, they have been added in the appendices section.

5.1. Baseline results

As mentioned previously, the baseline consists of using exclusively either BoW or TF-IDF. This is equivalent to using 1-Grams combined with either BoW or TF-IDF.

As illustrated in the bar chart in Figure 0-1, the best-performing model is the Random Forest. It is also worth mentioning that, although the Naïve Bayes and the SVM stand out in recall, their other performance metrics are not as high; therefore, the Random Forest is the better choice.

In Figure 0-2, it can be seen that most models remain at similar values, except the SVM, whose accuracy increases by approximately 6% compared to the BoW baseline. Furthermore, the recall improves by around 12% in the case of the Random Forest, and the rest of the metrics remain in a similar range. Therefore, the Random Forest is still the best-performing ML algorithm.

5.2. N-grams results

The following subsections present the results of combining BoW/TF-IDF with 2-Grams or 3-Grams.
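To make the terminology concrete, an n-gram is a sequence of n consecutive tokens, so 1-Grams are single words, 2-Grams are word pairs and 3-Grams are word triples. The sketch below extracts and counts them for a hypothetical tokenized tweet. The function names are assumptions, and the thesis does not state whether lower-order n-grams were kept alongside the higher-order ones.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the contiguous n-grams of a token list as space-joined strings."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_counts(tokens, n):
    """Count how often each n-gram occurs (the BoW view of the n-grams)."""
    return Counter(ngrams(tokens, n))


# Hypothetical tokenized tweet.
tokens = "you are so so dumb".split()
unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)
```

The resulting counts can then be weighted with BoW or TF-IDF in the same way as single words.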

5.2.1. 2-Gram results

As depicted in Figure 0-3, when 2-Grams are added to the BoW baseline, the best-performing models are the Random Forest and the Adaboost, which reach the same accuracy. Furthermore, the SVM increases its recall value by 8%.

After applying TF-IDF + 2-Grams as feature extraction methods (Figure 0-4), the SVM increases the value of all of its metrics and therefore performs better. Hence, it becomes the best-performing model along with the Random Forest.

5.2.2. 3-Gram results

With 3-Grams (Figure 0-5), the BoW method has a worse outcome in all models, except for the Adaboost. This ML algorithm maintains the same values and remains the top-performing model.

In Figure 0-6, it can be observed that the best-performing model is the Random Forest. Its accuracy surpasses that of all the previous TF-IDF combinations with n-grams. The rest of the models perform poorly when compared to TF-IDF + 2-Gram.

5.3. Bystander Features results

In this section, the results of combining NLP techniques, n-grams, and bystander features are depicted.

5.3.1.

In this case, the results depicted in Figure 0-7 show that the performance is superior to the baseline performance; however, it is lower than when using N-grams and BoW. The best-performing models are the Random Forest and the Adaboost.

In Figure 0-8, it can be seen that the best-performing classifier is the Random Forest.

When adding -gram with BoW (Figure 0-9), the best performing , surpassing the Random Forest.

In Figure 0-10, the best-performing model is the Random Forest. The second-best ML algorithm is the SVM.

In Figure 0-11, it can be observed that the best-performing model so far is the Random Forest.

In this case (Figure 0-12), the best-performing model is the SVM. However, its performance is inferior to the one obtained by the Random Forest with TF-IDF + reply + 2-gram.

5.3.2. likes and retweets

The performance of most models is worse than with the previous feature extraction methods that used BoW. In this case (Figure 0-13), the Random Forest is the best model.

In Figure 0-14 (BoW + 2-Gram + reply + likes/retweets), the best-performing model is the Random Forest.

In Figure 0-15 (BoW + 3-Gram + reply + likes/retweets), the best-performing model is still the Random Forest.

In this case (TF-IDF + reply + likes/retweets, Figure 0-16), the best-performing model is the Random Forest.

In Figure 0-17 (TF-IDF + 2-Gram + reply + likes/retweets), the best performing model is the Random Forest, with the same accuracy as the previous combination.

In Figure 0-18 (TF-IDF + 3-Gram + reply + likes/retweets), the best-performing model is the Random Forest; however, its accuracy is lower than with 2-Grams.

5.4. Discussion of Results

After analyzing all the possible combinations of feature extraction methods:

BoW + N-Gram                          | TF-IDF + N-Gram
BoW + N-Gram + reply                  | TF-IDF + N-Gram + reply
BoW + N-Gram + reply + likes/retweets | TF-IDF + N-Gram + reply + likes/retweets

Table 5.4-1: Possible combinations of feature extraction methods

It can be inferred that the best-performing model for most combinations of feature extraction methods is the Random Forest. Furthermore, the best method for extracting features is TF-IDF + 2-Gram + reply. With that combination, the best ML algorithm is the Random Forest. Therefore, the bystander contagion-related features have enhanced the performance of the model and therefore its detection ability.

The second-best method for extracting features is -Gram. In this case, the best-performing model is the Decision Tree.

After that, the combination of BoW + 3-Gram + reply + likes/retweets had a fairly positive outcome, enhancing the performance of the models. The Random Forest is the best-performing model for this combination.

Furthermore, using any combination of feature extraction methods in which bystander features were included enhanced the performance of the models.

The next table compares the accuracy obtained using only language modeling and/or BoW/TF-IDF with the accuracy obtained after adding bystander features.

Classifier    | Accuracy without bystander features | Accuracy with bystander features | Comparison
Naïve Bayes   | 62.79%                              | 66.28%                           | +3.49%
SVM           | 60.46%                              | 70.93%                           | +10.47%
Decision Tree | 62.79%                              | 72.09%                           | +9.30%
Adaboost      | 58.14%                              | 69.77%                           | +11.63%
Random Forest | 68.60%                              | 73.25%                           | +4.65%

Table 5.4-2: Comparison between accuracy with/without bystander features

As can be seen above, accuracy improved in all models thanks to the addition of bystander features.
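The comparison column of Table 5.4-2 can be reproduced by subtracting the two accuracy values of each classifier. The short sketch below does so with exact decimal arithmetic; the variable names are chosen for this example, and the values are those reported in the table, written with decimal points.

```python
from decimal import Decimal

# Accuracy (%) without and with bystander features, as reported in Table 5.4-2.
accuracy = {
    "Naïve Bayes": ("62.79", "66.28"),
    "SVM": ("60.46", "70.93"),
    "Decision Tree": ("62.79", "72.09"),
    "Adaboost": ("58.14", "69.77"),
    "Random Forest": ("68.60", "73.25"),
}

# Gain in percentage points for each classifier.
gains = {
    name: Decimal(with_bf) - Decimal(without_bf)
    for name, (without_bf, with_bf) in accuracy.items()
}

largest_gain = max(gains, key=gains.get)
mean_gain = sum(gains.values()) / len(gains)
```

Computed this way, the Adaboost shows the largest gain (11.63 percentage points), and the mean gain across the five classifiers is 7.908 percentage points.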

6 CONCLUSION & FUTURE WORK

In conclusion, using bystander contagion-related features such as the reply feature, likes, and retweets has proved to be a valuable method to enhance the performance of ML models used to tackle the cyberbullying problem.

6.1. Summary of Results

The best-performing model is the Random Forest, using a combination of the TF-IDF + 2-Gram + reply feature extraction methods.

The second-best method for extracting features is -Gram. In this case, the best-performing model is the Decision Tree.

After that, the combination of BoW + 3-Gram + reply + likes/retweets had a fairly positive outcome. The Random Forest is the best-performing model for this combination.

Therefore, the bystander contagion-related features, such as the reply feature, likes, and retweets, have enhanced the performance of the model and therefore its detection ability. With these results, the research question can be answered: considering bystander contagion in cyberbullying detection is a valuable idea. It helps in cyberbullying detection and can therefore be used to further refine ML algorithms, so that bullies can be flagged or blocked and cyberbullying statements can be identified promptly.

6.2. Research Contribution

The contribution of this thesis is the investigation of bystander contagion within the scope of cyberbullying. The aim of this investigation was to use bystander contagion as a tool to further enhance cyberbullying detection. As stated in the Results section, this goal has been achieved: bystander contagion is a useful concept for improving the performance of ML algorithms that detect cyberbullying.

6.3. Future Work

and tweets that have media (videos, pictures, and gifs) links are being removed from the corpus. However, further research should be done in that direction.

Furthermore, this project focused on tweets in the English language; however, cyberbullying is a global problem, so the work could be expanded to other languages.

A more complex labeling strategy could be undertaken to have subclasses for each type of cyberbullying actor. However, the number of tweets involved in this research is not very large, so having more than two classes might be a challenge, since the number of samples in each class would be too small.

In document Trabajo Fin de Grado Final-Year Project - Archivo Digital UPM (pages 35-44)