CHAPTER II: FRAME OF REFERENCE
2.2 LEGAL FRAMEWORK
2.2.3 COMMERCIAL CODE
The data were provided as individual CSV files, one per disaster event, which were joined to perform the initial exploratory analyses. Each CSV file contained roughly 2,000 tweets pertaining to a single disaster event. The initial data exploration was done using Tableau. The combined file generated by joining the individual CSV files contained 19,112 English-language tweets. The original dataset also included Middle East Respiratory Syndrome (MERS) and Ebola outbreaks as disaster events. However, they were excluded from the current exercise because they were annotated with different humanitarian categories (not among the nine categories defined above); including them would have affected the classifier model and its accuracy. This work therefore considers tweets from five types of natural hazard events: earthquake, flood, hurricane, cyclone and typhoon.
Figure 3.5 shows the distribution of tweets by disaster type. About 9,000 tweets were collected for earthquakes because the dataset contains four earthquake events: US Earthquake 2014, Pakistan Earthquake 2013, Chile Earthquake 2014 and Nepal Earthquake 2015. Similarly, about 4,000 tweets belong to flood events, which comprise the India and Pakistan floods of 2014. The remaining tweets are for Cyclone Pam in Vanuatu (2015), Typhoon Hagupit in the Philippines (2014) and Hurricane Odile in Mexico (2014). The number of tweets per disaster event is roughly equal, which is helpful when training classifier models to automatically categorize tweets from different disaster events.
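The joining and filtering steps described above can be sketched as follows. This is a minimal illustration using only the Python standard library, not the procedure actually used in this work (the exploration itself was done in Tableau). The column names (`tweet_id`, `text`), the `disease` hazard tag and the helper names `load_events` and `tweets_per_hazard` are assumptions introduced for the example.

```python
import csv
import io
from collections import Counter

# Assumption: events tagged with this hazard type are left out, mirroring the
# exclusion of the MERS and Ebola events described in the text.
EXCLUDED_HAZARDS = {"disease"}


def load_events(sources):
    """Merge per-event CSV data into one list of rows.

    sources: iterable of (event_name, hazard_type, file_like) triples.
    """
    rows = []
    for event_name, hazard_type, handle in sources:
        if hazard_type in EXCLUDED_HAZARDS:
            continue  # skip events that are excluded from the analysis
        for record in csv.DictReader(handle):
            rows.append({**record, "event": event_name, "hazard": hazard_type})
    return rows


def tweets_per_hazard(rows):
    """Count merged tweets by hazard type (the quantity plotted per disaster type)."""
    return Counter(row["hazard"] for row in rows)


# Toy in-memory "files"; real use would pass open() handles instead.
sample_sources = [
    ("Nepal Earthquake 2015", "earthquake", io.StringIO("tweet_id,text\n1,a\n2,b\n")),
    ("India Floods 2014", "flood", io.StringIO("tweet_id,text\n3,c\n")),
    ("Ebola", "disease", io.StringIO("tweet_id,text\n4,d\n")),
]
merged_rows = load_events(sample_sources)
hazard_counts = tweets_per_hazard(merged_rows)
print(hazard_counts)
```

Tagging each row with its event and hazard type at load time keeps the merged file self-describing, so per-event and per-hazard counts can both be derived from the same list.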
CHAPTER 3. EXPERIMENT DESIGN AND METHODOLOGY
Figure 3.5: Number of tweets in each disaster type.
Figure 3.7: Geographic distribution of tweets by humanitarian categories.
[...] humanitarian categories. From the figure, it can be seen that the largest share of tweets (nearly 35 percent) is classified as Other Useful Information. Approximately 2,500 tweets each are classified as Donation Needs, Offers & Volunteering Services; Injured or Dead People; and Not Related or Irrelevant. Roughly 2,000 tweets each are classified as Infrastructure & Utilities Damage and as Sympathy & Emotional Support. About 1,000 tweets are classified as Caution & Advice, and around half that number (between 400 and 600 each) are classified as Displaced People & Evacuations and as Missing, Trapped, or Found People.
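The counts and percentage shares quoted above are the kind of summary that can be computed directly from the label column. The following is a minimal sketch under stated assumptions: the function name `category_shares` is hypothetical, and the toy labels are illustrative only and do not reproduce the dataset's actual counts.

```python
from collections import Counter


def category_shares(labels):
    """Return {category: (count, percent_of_total)}, largest category first."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {
        category: (count, 100.0 * count / total)
        for category, count in counts.most_common()
    }


# Toy labels for illustration only; they are not the dataset's real counts.
toy_labels = [
    "Other Useful Information",
    "Other Useful Information",
    "Injured or Dead People",
    "Caution & Advice",
]
shares = category_shares(toy_labels)
for category, (count, percent) in shares.items():
    print(f"{category}: {count} tweets ({percent:.1f}%)")
```

Reporting the count and the percentage together makes it easy to check statements such as "nearly 35 percent" against the absolute numbers.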
Figure 3.7 depicts the geographic distribution of tweets by their information content. Note that the size of the concentric circles corresponds to the number of tweets pertaining to a specific humanitarian category for a specific disaster type. The map shows that the Pakistan and Nepal earthquake events contain more tweets with information on Missing, Trapped or Found People than any other disaster event. This in turn leads to a high number of tweets related to Donation Needs, Offers or Volunteering Services, given the substantial information on people in need of help. Most of the tweets classified as Injured or Dead People come from the India Floods and Pakistan Earthquake events. The Mexico (Hurricane Odile) and US Earthquake events show a high number of tweets on infrastructure and utilities damage, which suggests extensive damage in those events. Lastly, a large number of Other Useful Information tweets is observed for the US Earthquake and the Philippines typhoon, while a major share of the tweets on Cyclone Pam in Vanuatu is classified as Not Related or Irrelevant.
Figure 3.8: Distribution of tweets in each humanitarian category across the five disaster events.
From Figure 3.8, it is observed that the four earthquake events have the highest number of tweets containing Other Useful Information, while the cyclone in Vanuatu has the least helpful tweets, as most of them are either off-topic or irrelevant. The flood events in India and Pakistan have a majority of tweets classified as Injured or Dead People, and a high number of tweets requesting donations and volunteering services. This is immediately actionable information with the potential to save the lives of missing or trapped people if humanitarian services reach them in time. Apart from the earthquake events, Hurricane Odile shows the most infrastructure and utilities damage. Tweets offering emotional support and sympathy are found mostly for the typhoon and earthquake events.
Figure 3.9: Trend of tweet publication among different countries in the dataset.
Figure 3.9 illustrates the publication trend of tweets across countries. The blue line represents original tweets and the orange line represents re-tweets. India and Pakistan have the highest number of original tweets at the time of the disaster, while the Philippines and the United States have the lowest. Nepal and the Philippines have more re-tweets than any other country. The United States, Vanuatu and the Philippines have nearly equal numbers of tweets and re-tweets, whereas India, Nepal and Pakistan show a significant difference between the two. This could indicate how tweet publication behavior differs between countries at the time of a disaster event. However, the dataset is not all-inclusive (it does not contain every tweet from the start to the end date of a disaster), so such remarks may or may not hold true. Even so, such an insight is worthwhile for differentiating relevant, original tweets from irrelevant, duplicated ones.
Figure 3.10: Textual content of tweets based on their tweet/re-tweet frequency.
Figure 3.10, in turn, depicts the most frequent tweets and re-tweets together with their textual content. The text of each tweet is displayed on the x-axis, while the y-axis shows the number of times the tweet was published. The number of times a tweet was re-tweeted ranges from two to nearly sixty in the dataset, whereas original tweet texts are duplicated far less often (no text occurs more than three times). This figure was plotted from the raw dataset without cleaning the tweet text, so it includes URLs, hashtags, Twitter handles, symbols and so on, as is evident from the tweet content in the figure.
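The repetition counts discussed above can be reproduced with a simple frequency count over the raw, uncleaned tweet text. The sketch below is illustrative only. It assumes that re-tweets can be recognized by a leading "RT @" prefix, which is a common convention but is not stated in this work (the dataset may carry an explicit re-tweet flag instead). The function names and toy texts are likewise hypothetical.

```python
from collections import Counter


def is_retweet(text):
    """Heuristic (an assumption): re-tweets start with the 'RT @' prefix."""
    return text.lstrip().startswith("RT @")


def repetition_counts(texts):
    """Count how often each exact text occurs, split into originals and re-tweets."""
    originals, retweets = Counter(), Counter()
    for text in texts:
        target = retweets if is_retweet(text) else originals
        target[text] += 1
    return originals, retweets


# Toy texts for illustration only; raw text keeps URLs, hashtags and handles.
toy_texts = [
    "RT @user: bridge collapsed http://t.co/x",
    "RT @user: bridge collapsed http://t.co/x",
    "RT @user: bridge collapsed http://t.co/x",
    "Need water in sector 4 #help",
    "Need water in sector 4 #help",
    "Stay safe everyone",
]
originals, retweets = repetition_counts(toy_texts)
print(retweets.most_common(1))
print(originals.most_common(1))
```

Because the texts are compared exactly, two tweets that differ only in a shortened URL or a trailing hashtag count as distinct, which is consistent with plotting the data before any cleaning.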