Pipeline Workflow - 4 IMPLEMENTATION 4.1. Research Procedure

4.2. Pipeline Workflow

4.2.1. Data Crawling

In this step, requests for tweet data were sent to the Twitter API through tweepy, a Python wrapper for that API.

The tweet fields listed in Table 4.2-1 were extracted:

Tweet field        Description
id                 Unique id assigned to the Tweet
text               Text of the Tweet
conversation_id    Groups related Tweets together
created_at         Creation time of the Tweet
public_metrics     Public engagement metrics for the Tweet at the time of the request (e.g., retweet_count, like_count)

Table 4.2-1: Tweet fields extracted using tweepy

One of the most important fields is conversation_id, which groups together all tweets that belong to the same conversation. In this case, these are direct replies to the victim's tweets and indirect replies (a user replying to another user who has replied to the victim's tweet).

The public_metrics field is relevant because the retweet and like counts will be used as bystander contagion features.
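As an illustration of how the two counts could be read from public_metrics, the following minimal sketch assumes each tweet is available as a plain dictionary mirroring the fields in Table 4.2-1. The function name contagion_features and the dictionary shape are assumptions made for this example, not part of the original pipeline.

```python
def contagion_features(tweet: dict) -> dict:
    """Read the retweet and like counts from a tweet's public_metrics.

    `tweet` is assumed to be a plain dict shaped like the fields in
    Table 4.2-1; missing metrics default to 0.
    """
    metrics = tweet.get("public_metrics") or {}
    return {
        "retweet_count": metrics.get("retweet_count", 0),
        "like_count": metrics.get("like_count", 0),
    }
```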

Aside from these fields, several expansion fields were retrieved to obtain information related to users, to be used

Expansion field        Description
author_id              Author of the Tweet
in_reply_to_user_id    If the represented Tweet is a reply, this field will contain the original Tweet's author ID. This will not necessarily always be the user directly mentioned in the Tweet.

Table 4.2-2: Expansion fields extracted using tweepy
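A request for the fields and expansions in the two tables above might look like the sketch below. The text does not state which endpoint was called, so the use of tweepy's Client.search_recent_tweets with the conversation_id: search operator is an assumption, and the helper names build_reply_query and fetch_replies are hypothetical. Pagination and rate limiting are left out.

```python
# Fields and expansions taken from Tables 4.2-1 and 4.2-2.
TWEET_FIELDS = ["id", "text", "conversation_id", "created_at", "public_metrics"]
EXPANSIONS = ["author_id", "in_reply_to_user_id"]


def build_reply_query(conversation_id) -> str:
    """Build a search query matching the tweets of one conversation."""
    return f"conversation_id:{conversation_id}"


def fetch_replies(client, conversation_id, max_results: int = 100) -> list:
    """Fetch one page of replies for a conversation.

    `client` is assumed to be a tweepy.Client (or any object exposing the
    same search_recent_tweets method). Returns an empty list when the
    response carries no data.
    """
    response = client.search_recent_tweets(
        query=build_reply_query(conversation_id),
        tweet_fields=TWEET_FIELDS,
        expansions=EXPANSIONS,
        max_results=max_results,
    )
    return response.data or []
```

With tweepy installed, the function would be called with a client created from the user's own bearer token, for example fetch_replies(tweepy.Client(bearer_token=token), conversation_id).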

The crawled tweets are the replies to a selection of tweets posted by an online influencer on his Twitter feed. Every tweet he publishes attracts several aggressive comments targeting his appearance, nationality, and career. His feed is suitable for this research because it is rich in bullying signals and has a considerable number of bystanders. Furthermore, several bystanders actively participate in his threads, mostly by instigating the bullying.

Figure 4-1: Example of a Twitter Thread used in this project

As can be observed in Figure 4-1, a Twitter thread is composed of a main tweet and all the replies to it. The replies are the samples to be used in later steps of the pipeline.
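The separation of a thread into its main tweet and its replies can be sketched as follows. The sketch relies on the Twitter API v2 convention that conversation_id equals the id of the tweet that started the conversation. The function name split_thread and the plain-dictionary representation of tweets are assumptions made for this example.

```python
def split_thread(tweets: list) -> tuple:
    """Split the tweets of one conversation into (main_tweet, replies).

    Each tweet is assumed to be a dict with "id" and "conversation_id".
    The main tweet is the one whose id equals its conversation_id; it is
    None if only replies were crawled.
    """
    main_tweet = None
    replies = []
    for tweet in tweets:
        if tweet["id"] == tweet["conversation_id"]:
            main_tweet = tweet
        else:
            replies.append(tweet)
    return main_tweet, replies
```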

4.2.2. Labeling strategy

Based on the criteria used in (Van Hee et al., 2018), the following labeling strategy is adopted:

Threat/blackmail: expressions containing physical or psychological threats or indications of blackmail.

Insult: expressions meant to hurt or offend the victim.

o General insult: general expressions containing abusive, degrading, or offensive language that is meant to insult the addressee.

o Attacking relatives: insulting expressions towards relatives or friends of the victim.

o Discrimination: expressions of unjust or prejudicial treatment of the victim. Two types of discrimination are distinguished (i.e., sexism and racism). Other forms of discrimination should be categorized as general insults.

Curse/exclusion: expressions of a wish that some form of adversity or misfortune will befall the victim and expressions that exclude the victim from a conversation or a social group.

Defamation: expressions that reveal confidential or defamatory information about the victim to a large audience.

Sexual Talk: expressions with a sexual meaning or connotation. A distinction is made between innocent sexual talk and sexual harassment.

Defense: expressions in support of the victim, expressed by the victim himself or by a bystander.

o Bystander defense: expressions by which a bystander shows support for the victim or discourages the harasser from continuing his actions.

o Victim defense: assertive or powerless reactions from the victim.

Encouragement to the harasser: expressions in support of the harasser.

Other: expressions that contain any other form of cyberbullying-related behavior than the ones described here.

In this case study, we will add some other relevant categories:

Body shaming: expressions

Ironic statements: expressions that may seem like they have a positive meaning but are, in fact, negative.

Table 4.2-3 illustrates the values for the two possible classes. Any sample that

Label                        Value
Cyberbullying-related        1
Non-cyberbullying-related    0

Table 4.2-3: Possible labels given to samples
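One way to derive the binary value from the fine-grained categories is sketched below. The sentence that defines the mapping is cut off in the text, so the rule used here (a sample annotated with at least one of the listed categories receives 1, any other sample receives 0) is an assumption. The name binary_label and the lowercase category identifiers are hypothetical.

```python
# Category identifiers are hypothetical spellings of the labels listed above.
CYBERBULLYING_CATEGORIES = {
    "threat/blackmail", "general insult", "attacking relatives",
    "discrimination", "curse/exclusion", "defamation", "sexual talk",
    "bystander defense", "victim defense", "encouragement to the harasser",
    "other", "body shaming", "ironic statement",
}


def binary_label(categories) -> int:
    """Return 1 if any annotated category is cyberbullying-related, else 0.

    Assumption: one matching category is enough for the positive class.
    """
    return 1 if set(categories) & CYBERBULLYING_CATEGORIES else 0
```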

The dataset built is well-balanced, since -related

and -cyberbullying-related.

4.2.3. Data Preprocessing

The data preprocessing steps used in the study are as follows:

1. Lowercasing the tweets

o This step is necessary so that words that have the same meaning but differ only in capitalization will not be regarded as distinct words. E.g., Small, small, and SMALL are all regarded as the same word.

2. Removing repeated characters

o On the internet, it is common to repeat characters to emphasize a word. For instance, helloooo will be interpreted as friendlier than hello. Removing the repeated characters maps such variants to the same word.

3. Expanding contractions and slang

o On social media platforms, users tend to use slang and contractions, so these words are expanded to their full form to make them easier for the ML model to use.

4. Translating emojis into words

o Emojis matter online because facial expressions are an important cue to the tone of a comment, and in digital contexts emojis are used to display emotion. Each emoji is therefore translated into words.

5. Removing links and mentions

6. Removing unnecessary spaces
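The six steps above can be sketched as a single function using only the standard library. The lookup tables CONTRACTIONS and EMOJI_WORDS are tiny hypothetical stand-ins for the resources actually used, which the text does not name. The rule of collapsing runs of three or more identical letters into one, and the patterns used for links and mentions, are likewise assumptions.

```python
import re

# Hypothetical, deliberately tiny lookup tables for illustration only.
CONTRACTIONS = {"i'm": "i am", "can't": "cannot", "idk": "i do not know"}
EMOJI_WORDS = {"\U0001F602": "face with tears of joy", "\U0001F621": "angry face"}


def preprocess(tweet: str) -> str:
    # 1. Lowercase the tweet.
    text = tweet.lower()
    # 2. Collapse runs of three or more identical ASCII letters into one.
    text = re.sub(r"([a-z])\1{2,}", r"\1", text)
    # 3. Expand contractions and slang, token by token.
    text = " ".join(CONTRACTIONS.get(tok, tok) for tok in text.split())
    # 4. Translate emojis into words.
    for symbol, words in EMOJI_WORDS.items():
        text = text.replace(symbol, f" {words} ")
    # 5. Remove http(s) links and @mentions.
    text = re.sub(r"https?://\S+|@\w+", " ", text)
    # 6. Remove unnecessary spaces.
    return re.sub(r"\s+", " ", text).strip()
```

The steps run in the order listed in the text. Because contractions are looked up per whitespace-separated token, a contraction followed directly by punctuation would not be expanded in this sketch.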

4.2.4. Feature Extraction

Several NLP techniques have been used to extract features:

1. BoW

o First, the most elementary NLP technique, bag-of-words (BoW), was used to create a baseline against which the results can be compared as more features are added.
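A bag-of-words representation can be illustrated with the standard library alone, as in the sketch below. The text does not say which implementation was used, so this is not the project's code. The function names fit_vocabulary and bow_vector are hypothetical, and tokens are obtained by simple whitespace splitting.

```python
from collections import Counter


def fit_vocabulary(docs: list) -> dict:
    """Map every distinct token in the corpus to a column index (sorted order)."""
    tokens = sorted({tok for doc in docs for tok in doc.split()})
    return {tok: idx for idx, tok in enumerate(tokens)}


def bow_vector(doc: str, vocabulary: dict) -> list:
    """Count how often each vocabulary token occurs in the document."""
    counts = Counter(doc.split())
    vector = [0] * len(vocabulary)
    for tok, n in counts.items():
        if tok in vocabulary:  # tokens unseen during fitting are ignored
            vector[vocabulary[tok]] = n
    return vector
```

Each preprocessed tweet becomes one such count vector, which can then be passed to a classifier as the baseline feature set.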

In document Trabajo Fin de Grado Final-Year Project - Archivo Digital UPM (pages 31-35)