Methods and Techniques for Segmentation of Consumers in Social Media

Texto completo

(1)Methods and Techniques for Segmentation of Consumers in Social Media PhD Thesis. Óscar Muñoz Garcı́a (MSc Artificial Intelligence) Departamento de Inteligencia Artificial ETS de Ingenieros Informáticos. Supervisors Asunción Gómez Pérez (PhD Computer Science, MBA) Raúl Garcı́a Castro (PhD Computer Science and Artificial Intelligence). 2015.

(2)

(3) Tribunal nombrado por el Sr. Rector Magfco. de la Universidad Politécnica de Madrid, el dı́a de de .. Presidente: Vocal: Vocal: Vocal: Secretario: Suplente: Suplente:. Realizado el acto de defensa y lectura de la Tesis el dı́a de de en la Escuela Técnica Superior de Ingenieros Informáticos. Calificación:. EL PRESIDENTE. LOS VOCALES. EL SECRETARIO. i.

(4) ii.

(5) A Mari. Gracias por tu comprensión durante todo el tiempo que he dedicado a la tesis. A mis padres. Gracias por todo vuestro apoyo y motivación sin los cuales no habrı́a llegado hasta aquı́. A mi hija Lucı́a..

(6) iv.

(7) Acknowledgements. This thesis represents the final stage of a long period of my life I had never completed without the help of many people whom I thank for their inestimable support that worth its weight in gold. First of all, I want to acknowledge all the co-authors of the research works that have contributed to the contents included in these thesis: Silvia, Inés, Nuria, March, Beatriz, Gloria, Javier, Daniel, Jesús, David, Guadalupe, Auxi, Socorro, Elena, Vı́ctor, and Carlos. This thesis would not have been possible without their hard work. Havas Media Group deserves a special recognition. I want to acknowledge my colleagues there for all their lessons about marketing and advertising. I could not imagine my professional career from now on without their support and training. Specially, I have no words to express my gratitude to Gloria. I also want to acknowledge the Spanish Centre for the Development of Industrial Technology that has partially supported this research under the CENIT program in the context of the Social Media Project (CEN-20101037). Thanks a lot to all the partners in this project. Finally, I want to acknowledge my supervisors, Asun and Raúl, for their guidance, reviews and patience, during and before the writing of this thesis. I hope I have lived up to their expectations..

(8) vi.

(9) Abstract. Social media has revolutionised the way in which consumers relate to each other and with brands. The opinions published in social media have a power of influencing purchase decisions as important as advertising campaigns. Consequently, marketers are increasing efforts and investments for obtaining indicators to measure brand health from the digital content generated by consumers. Given the unstructured nature of social media contents, the technology used for processing such contents often implements Artificial Intelligence techniques, such as natural language processing, machine learning and semantic analysis algorithms. This thesis contributes to the State of the Art, with a model for structuring and integrating the information posted on social media, and a number of techniques whose objectives are the identification of consumers, as well as their socio-demographic and psychographic segmentation. The consumer identification technique is based on the fingerprint of the devices they use to surf the Web and is tolerant to the changes that occur frequently in such fingerprint. The psychographic profiling techniques described infer the position of consumer in the purchase funnel, and allow to classify the opinions based on a series of marketing attributes. Finally, the socio-demographic profiling techniques allow to obtain the residence and gender of consumers..

(10) viii.

(11) Resumen Los medios sociales han revolucionado la manera en la que los consumidores se relacionan entre sı́ y con las marcas. Las opiniones publicadas en dichos medios tienen un poder de influencia en las decisiones de compra tan importante como las campañas de publicidad. En consecuencia, los profesionales del marketing cada vez dedican mayores esfuerzos e inversión a la obtención de indicadores que permitan medir el estado de salud de las marcas a partir de los contenidos digitales generados por sus consumidores. Dada la naturaleza no estructurada de los contenidos publicados en los medios sociales, la tecnologı́a usada para procesar dichos contenidos ha menudo implementa técnicas de Inteligencia Artificial, tales como algoritmos de procesamiento de lenguaje natural, aprendizaje automático y análisis semántico. Esta tesis, contribuye al estado de la cuestión, con un modelo que permite estructurar e integrar la información publicada en medios sociales, y una serie de técnicas cuyos objetivos son la identificación de consumidores, ası́ como la segmentación psicográfica y sociodemográfica de los mismos. La técnica de identificación de consumidores se basa en la huella digital de los dispositivos que utilizan para navegar por la Web y es tolerante a los cambios que se producen con frecuencia en dicha huella digital. Las técnicas de segmentación psicográfica descritas obtienen la posición en el embudo de compra de los consumidores y permiten clasificar las opiniones en función de una serie de atributos de marketing. Finalmente, las técnicas de segmentación sociodemográfica permiten obtener el lugar de residencia y el género de los consumidores..

(12) x.

(13) Contents 1 INTRODUCTION. 1. 1.1. Thesis Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . .. 5. 1.2. Dissemination of Results . . . . . . . . . . . . . . . . . . . . . . .. 7. 2 STATE OF THE ART 2.1 2.2. Semantic Vocabularies for Representing Social Media Information. 10. 2.1.1. Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . .. 12. Techniques for Tracking Users in the Web . . . . . . . . . . . . .. 13. 2.2.1. Techniques for Capturing Web Activity . . . . . . . . . . .. 14. 2.2.1.1. Technique Based on Web Logs . . . . . . . . . .. 15. 2.2.1.2. Technique Based on Web Beacons . . . . . . . . .. 18. 2.2.1.3. Technique Based on JavaScript Tags . . . . . . .. 19. 2.2.1.4. Technique Based on Packet Sniffing . . . . . . . .. 21. Techniques for Identifying Unique Users . . . . . . . . . .. 22. 2.2.2.1. Technique Based on Cookies . . . . . . . . . . . .. 22. 2.2.2.2. Technique Based on Fingerprint . . . . . . . . . .. 23. Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . .. 26. Technique for Detecting the Evolution of Temporary Records . . .. 27. 2.3.1. Early Binding Algorithm [Li et al., 2011] . . . . . . . . . .. 28. 2.3.2. Late Binding Algorithm [Li et al., 2011] . . . . . . . . . .. 28. 2.3.3. Adjusted Binding Algorithm [Li et al., 2011] . . . . . . . .. 29. 2.3.4. Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . .. 30. Social Media Analysis Applied to Market Research . . . . . . . .. 30. 2.4.1. 32. 2.2.2. 2.2.3 2.3. 2.4. 9. KPIs Based on Social Media Analysis . . . . . . . . . . . .. xi.

(14) 2.4.2 2.5. 2.6. 2.7. Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . .. 35. Marketing Background . . . . . . . . . . . . . . . . . . . . . . . .. 35. 2.5.1. The Consumer Decision Journey . . . . . . . . . . . . . . .. 37. 2.5.2. The Marketing Mix . . . . . . . . . . . . . . . . . . . . . .. 39. 2.5.3. Research on Human Emotions . . . . . . . . . . . . . . . .. 40. 2.5.4. Owned, Paid and Earned Media . . . . . . . . . . . . . . .. 43. 2.5.5. Marketing Technology . . . . . . . . . . . . . . . . . . . .. 44. 2.5.6. Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . .. 45. Analysis of Social Media Content . . . . . . . . . . . . . . . . . .. 47. 2.6.1. Lemmatisation and Part-Of-Speech Tagging . . . . . . . .. 47. 2.6.2. Normalisation of Microposts . . . . . . . . . . . . . . . . .. 48. 2.6.3. Sentiment Analysis . . . . . . . . . . . . . . . . . . . . . .. 49. 2.6.4. Identification of Wishes . . . . . . . . . . . . . . . . . . . .. 51. 2.6.5. Detection of Place of Residence . . . . . . . . . . . . . . .. 52. 2.6.6. Detection of Gender . . . . . . . . . . . . . . . . . . . . .. 53. 2.6.7. Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . .. 53. Open Research Problems . . . . . . . . . . . . . . . . . . . . . . .. 55. 3 APPROACH. 57. 3.1. Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. 57. 3.2. Contributions to the State of the Art . . . . . . . . . . . . . . . .. 61. 3.3. Assumptions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. 64. 3.4. Hypotheses . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. 65. 3.5. Restrictions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. 67. 4 RESEARCH METHODOLOGY. 71. 4.1. Terminology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. 71. 4.2. Research Methodology . . . . . . . . . . . . . . . . . . . . . . . .. 73. 4.3. Method Followed for Obtaining the Artefacts Provided by this Thesis 74 4.3.1. Method Followed for Ontology Engineering . . . . . . . . .. 76. 4.3.2. Method Followed for the Data Mining Techniques . . . . .. 79. 4.3.2.1. Business Understanding . . . . . . . . . . . . . .. 80. 4.3.2.2. Data Understanding . . . . . . . . . . . . . . . .. 80. xii.

(15) 4.3.2.3. Data Preparation . . . . . . . . . . . . . . . . . .. 81. 4.3.2.4. Modelling . . . . . . . . . . . . . . . . . . . . . .. 81. 4.3.2.5. Evaluation . . . . . . . . . . . . . . . . . . . . .. 82. 4.3.2.6. Deployment . . . . . . . . . . . . . . . . . . . . .. 82. 5 SOCIAL MEDIA ONTOLOGY FOR CONSUMER ANALYTICS 83 5.1. Ontology Modules . . . . . . . . . . . . . . . . . . . . . . . . . . .. 85. 5.2. Notation Used . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. 86. 5.3. Core Ontology Module . . . . . . . . . . . . . . . . . . . . . . . .. 90. 5.4. Publication Channels Module . . . . . . . . . . . . . . . . . . . .. 96. 5.5. Contents Module . . . . . . . . . . . . . . . . . . . . . . . . . . .. 97. 5.6. Users Module . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. 98. 5.7. Opinions Module . . . . . . . . . . . . . . . . . . . . . . . . . . . 103. 5.8. Topics and Keywords Module . . . . . . . . . . . . . . . . . . . . 105. 5.9. Geographical Locations Module . . . . . . . . . . . . . . . . . . . 106. 6 MORPHOSYNTACTIC CHARACTERISATION OF SOCIAL MEDIA CONTENTS 109 6.1. Types of Social Media Analysed . . . . . . . . . . . . . . . . . . . 110. 6.2. Distribution of Part-of-Speech Categories . . . . . . . . . . . . . . 111. 6.3. 6.2.1. Distribution of Nouns . . . . . . . . . . . . . . . . . . . . . 113. 6.2.2. Distribution of Adjectives . . . . . . . . . . . . . . . . . . 113. 6.2.3. Distribution of Adverbs . . . . . . . . . . . . . . . . . . . 114. 6.2.4. Distribution of Determiners . . . . . . . . . . . . . . . . . 114. 6.2.5. Distribution of Conjunctions . . . . . . . . . . . . . . . . . 114. 6.2.6. Distribution of Pronouns . . . . . . . . . . . . . . . . . . . 115. 6.2.7. Distribution of Prepositions . . . . . . . . . . . . . . . . . 115. 6.2.8. Distribution of Punctuation Marks . . . . . . . . . . . . . 115. 6.2.9. Distribution of Verbs . . . . . . . . . . . . . . . . . . . . . 116. Hypothesis Validation . . . . . . . . . . . . . . . . . . . . . . . . 116. xiii.

(16) 7 TECHNIQUE FOR UNIQUE USER IDENTIFICATION BASED ON EVOLVING DEVICE FINGERPRINT DETECTION 117 7.1. 7.2. 7.3. Data Understanding Activity . . . . . . . . . . . . . . . . . . . . 118 7.1.1. Collect Initial Data Task . . . . . . . . . . . . . . . . . . . 119. 7.1.2. Describe Data Task . . . . . . . . . . . . . . . . . . . . . . 123. 7.1.3. Explore Data Task . . . . . . . . . . . . . . . . . . . . . . 124. 7.1.4. Verify Data Quality Task . . . . . . . . . . . . . . . . . . . 130. Data Preparation Activity . . . . . . . . . . . . . . . . . . . . . . 131 7.2.1. Select Data Task . . . . . . . . . . . . . . . . . . . . . . . 131. 7.2.2. Clean Data Task . . . . . . . . . . . . . . . . . . . . . . . 131. 7.2.3. Construct Data Task . . . . . . . . . . . . . . . . . . . . . 132. Modelling Activity . . . . . . . . . . . . . . . . . . . . . . . . . . 133 7.3.1. Select Modelling Technique Task . . . . . . . . . . . . . . 133 7.3.1.1. Cluster Signature . . . . . . . . . . . . . . . . . . 135. 7.3.1.2. Similarity Computation . . . . . . . . . . . . . . 137. 7.3.1.3. Attribute Weight Computation . . . . . . . . . . 138. 7.3.2. Generate Test Design Task . . . . . . . . . . . . . . . . . . 140. 7.3.3. Build Model Task . . . . . . . . . . . . . . . . . . . . . . . 140 7.3.3.1. X-Real-IP Header . . . . . . . . . . . . . . . . . 143. 7.3.3.2. X-Forwarded-For Header . . . . . . . . . . . . . . 145. 7.3.3.3. User-Agent Header . . . . . . . . . . . . . . . . . 146. 7.3.3.4. Accept Header . . . . . . . . . . . . . . . . . . . 147. 7.3.3.5. Accept-Language Header . . . . . . . . . . . . . . 148. 7.3.3.6. Accept-Charset Header . . . . . . . . . . . . . . . 149. 7.3.3.7. Accept-Encoding Header . . . . . . . . . . . . . . 151. 7.3.3.8. Cache-Control Header . . . . . . . . . . . . . . . 152. 7.3.3.9. Plugins . . . . . . . . . . . . . . . . . . . . . . . 153. 7.3.3.10 Fonts . . . . . . . . . . . . . . . . . . . . . . . . 154 7.3.3.11 Video . . . . . . . . . . . . . . . . . . . . . . . . 156 7.3.3.12 Time zone . . . . . . . . . . . . . . . . . . . . . . 157 7.3.3.13 Session Storage . . . . . . . . . . . . . . . . . . . 158 7.3.3.14 Local Storage . . . . . . . . . . . . . . . . . . . . 160 7.3.3.15 Internet Explorer Persistence . . . . . . . . . . . 161. xiv.

(17) 7.4. Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162 7.4.1. 7.4.2. 7.5. Evaluation Metrics . . . . . . . . . . . . . . . . . . . . . . 163 7.4.1.1. Rand Index . . . . . . . . . . . . . . . . . . . . . 164. 7.4.1.2. Error Rate . . . . . . . . . . . . . . . . . . . . . 164. 7.4.1.3. Recall . . . . . . . . . . . . . . . . . . . . . . . . 164. 7.4.1.4. Specificity . . . . . . . . . . . . . . . . . . . . . . 164. 7.4.1.5. False Positive Rate . . . . . . . . . . . . . . . . . 165. 7.4.1.6. False Negative Rate . . . . . . . . . . . . . . . . 165. 7.4.1.7. Precision . . . . . . . . . . . . . . . . . . . . . . 165. 7.4.1.8. F-measure . . . . . . . . . . . . . . . . . . . . . . 165. 7.4.1.9. Purity . . . . . . . . . . . . . . . . . . . . . . . . 166. Evaluation Results . . . . . . . . . . . . . . . . . . . . . . 166 7.4.2.1. Variant Based on Uniform Weights . . . . . . . . 166. 7.4.2.2. Variant Based on Attribute Entropy . . . . . . . 167. 7.4.2.3. Variant Based on Time Decay . . . . . . . . . . . 169. 7.4.2.4. Variant Based on Attribute Entropy and Time Decay . . . . . . . . . . . . . . . . . . . . . . . . 170. 7.4.2.5. Comparison of the Variants . . . . . . . . . . . . 171. Hypothesis Validation . . . . . . . . . . . . . . . . . . . . . . . . 173. 8 TECHNIQUES FOR SEGMENTATION OF CONSUMERS FROM SOCIAL MEDIA CONTENT 175 8.1. Common Elements Used by the Techniques . . . . . . . . . . . . . 176 8.1.1. Collect Initial Data Task . . . . . . . . . . . . . . . . . . . 177. 8.1.2. Data Preparation Activity . . . . . . . . . . . . . . . . . . 178. 8.1.3 8.2. 8.1.2.1. Select Data Task . . . . . . . . . . . . . . . . . . 179. 8.1.2.2. Clean Data Task . . . . . . . . . . . . . . . . . . 180. 8.1.2.3. Construct Data Task . . . . . . . . . . . . . . . . 182. Rule-based Modelling Technique . . . . . . . . . . . . . . . 187. Technique for Detecting Consumer Decision Journey Stages . . . . 191 8.2.1. Data Understanding Activity . . . . . . . . . . . . . . . . 191 8.2.1.1. Collect Initial Data Task . . . . . . . . . . . . . . 192. 8.2.1.2. Describe Data Task. xv. . . . . . . . . . . . . . . . . 195.

(18) 8.2.2. 8.3. 8.3.2. 8.2.1.4. Verify Data Quality Task . . . . . . . . . . . . . 197. Modelling Activity . . . . . . . . . . . . . . . . . . . . . . 200 8.2.2.1. Select Modelling Technique Task . . . . . . . . . 200. 8.2.2.2. Build Model Task. . . . . . . . . . . . . . . . . . 201. Data Understanding Activity . . . . . . . . . . . . . . . . 205 8.3.1.1. Collect Initial Data Task . . . . . . . . . . . . . . 205. 8.3.1.2. Describe Data Task. 8.3.1.3. Explore Data Task . . . . . . . . . . . . . . . . . 208. 8.3.1.4. Verify Data Quality Task . . . . . . . . . . . . . 208. . . . . . . . . . . . . . . . . 207. Modelling Activity . . . . . . . . . . . . . . . . . . . . . . 210 8.3.2.1. Select Modelling Technique Task . . . . . . . . . 210. 8.3.2.2. Build Model Task. . . . . . . . . . . . . . . . . . 211. Technique for Detecting Emotions . . . . . . . . . . . . . . . . . . 212 8.4.1. 8.4.2. 8.5. Explore Data Task . . . . . . . . . . . . . . . . . 196. Technique for Detecting Marketing Mix Attributes . . . . . . . . . 205 8.3.1. 8.4. 8.2.1.3. Data Understanding Activity . . . . . . . . . . . . . . . . 213 8.4.1.1. Collect Initial Data Task . . . . . . . . . . . . . . 213. 8.4.1.2. Describe Data Task. 8.4.1.3. Explore Data Task . . . . . . . . . . . . . . . . . 215. 8.4.1.4. Verify Data Quality Task . . . . . . . . . . . . . 216. . . . . . . . . . . . . . . . . 214. Modelling Activity . . . . . . . . . . . . . . . . . . . . . . 218 8.4.2.1. Select Modelling Technique Task . . . . . . . . . 218. 8.4.2.2. Generate Test Design Task . . . . . . . . . . . . 219. 8.4.2.3. Build Model Task. . . . . . . . . . . . . . . . . . 219. Technique for Detecting Place of Residence . . . . . . . . . . . . . 223 8.5.1. Data Understanding Activity . . . . . . . . . . . . . . . . 223 8.5.1.1. Collect Initial Data Task . . . . . . . . . . . . . . 224. 8.5.1.2. Describe Data Task. 8.5.1.3. Explore Data Task . . . . . . . . . . . . . . . . . 225. . . . . . . . . . . . . . . . . 224. 8.5.2. Data Preparation Activity . . . . . . . . . . . . . . . . . . 225. 8.5.3. Modelling Activity . . . . . . . . . . . . . . . . . . . . . . 225 8.5.3.1. Select Modelling Technique Task . . . . . . . . . 226. 8.5.3.2. Generate Test Design Task . . . . . . . . . . . . 235. xvi.

(19) 8.6. Technique for Detecting Gender . . . . . . . . . . . . . . . . . . . 235 8.6.1. 8.7. 8.6.1.1. Collect Initial Data Task . . . . . . . . . . . . . . 235. 8.6.1.2. Describe Data Task. 8.6.1.3. Explore Data Task . . . . . . . . . . . . . . . . . 236. . . . . . . . . . . . . . . . . 236. 8.6.2. Data Preparation Activity . . . . . . . . . . . . . . . . . . 236. 8.6.3. Modelling Activity . . . . . . . . . . . . . . . . . . . . . . 237 8.6.3.1. Select Modelling Technique Task . . . . . . . . . 237. 8.6.3.2. Generate Test Design Task . . . . . . . . . . . . 240. Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 241 8.7.1. 8.7.2. 8.8. Data Understanding Activity . . . . . . . . . . . . . . . . 235. Evaluation Metrics . . . . . . . . . . . . . . . . . . . . . . 241 8.7.1.1. Accuracy . . . . . . . . . . . . . . . . . . . . . . 242. 8.7.1.2. Recall . . . . . . . . . . . . . . . . . . . . . . . . 242. 8.7.1.3. Precision . . . . . . . . . . . . . . . . . . . . . . 242. 8.7.1.4. F-measure . . . . . . . . . . . . . . . . . . . . . . 242. Evaluation Results . . . . . . . . . . . . . . . . . . . . . . 243 8.7.2.1. Technique for Detecting Consumer Decision Journey Stages . . . . . . . . . . . . . . . . . . . . . . 243. 8.7.2.2. Technique for Detecting Marketing Mix Attributes 246. 8.7.2.3. Technique for Detecting Emotions . . . . . . . . 249. 8.7.2.4. Technique for Detecting Place of Residence . . . 252. 8.7.2.5. Technique for Detecting Gender . . . . . . . . . . 252. Validation of Hypotheses . . . . . . . . . . . . . . . . . . . . . . . 255. 9 CONCLUSIONS AND FUTURE WORK. 257. 9.1. Social Media Data Model for Consumer Analytics . . . . . . . . . 258. 9.2. Morphosyntactic Characterisation of Social Media Contents. 9.3. Technique for Unique User Identification Based on Evolving Device Fingerprint . . . . . . . . . . . . . . . . 259. 9.4. Techniques for Segmentation of Consumers from Social Media Content . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 261. . . . 258. 9.4.1. Technique for Detecting Consumer Decision Journey Stages 261. 9.4.2. Technique for Detecting Marketing Mix Attributes. xvii. . . . . 262.

(20) 9.4.3 9.4.4 9.4.5 9.4.6 9.4.7. Technique for Detecting Emotions . . . . . . . . . . . . . . Technique for Identifying the Place of Residence of Social Media Users . . . . . . . . . . . . . . . . . . . . . . . . . . Technique for Identifying the gender of Social Media Users Normalisation of User-Generated Content . . . . . . . . . Evaluation of Scalability . . . . . . . . . . . . . . . . . . .. xviii. 262 265 265 265 266.

(21) List of Figures 2.1. Process followed by the technique based on web logs (adapted from Kaushik [2007]) . . . . . . . . . . . . . . . . . . . . . . . . . . . .. 15. Process followed by the technique based on web beacons (adapted from Kaushik [2007]) . . . . . . . . . . . . . . . . . . . . . . . . .. 19. Process followed by the technique based on JavaScript tags (adapted from Kaushik [2007]) . . . . . . . . . . . . . . . . . . . . . . . . .. 20. Process followed by the tags or web beacons techniques for gathering data from multiple sites (adapted from [Kaushik, 2007]) . .. 20. Process followed by the technique based on packet sniffing (adapted from Kaushik [2007]) . . . . . . . . . . . . . . . . . . . . . . . . .. 21. 2.6. Consumer Decision Journey stages adopted in this thesis . . . . .. 39. 3.1. Contributions to the State of the Art . . . . . . . . . . . . . . . .. 64. 3.2. Relationships between the objectives, contributions, assumptions, hypothesis and restrictions . . . . . . . . . . . . . . . . . . . . . .. 69. Relations between methodology, methods, techniques, processes, activities and tasks (adapted from Gómez-Pérez et al. [2004]) . . .. 72. Iterative research methodology using exploratory and experimental approaches . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. 73. 4.3. Web mining framework (adapted from Hu and Cercone [2004]) . .. 75. 4.4. The CRISP-DM reference model (adapted from Shearer [2000]) .. 79. 5.1. Ontology network . . . . . . . . . . . . . . . . . . . . . . . . . . .. 84. 5.2. Social Graph Ontology Modeles . . . . . . . . . . . . . . . . . . .. 86. 2.2 2.3 2.4 2.5. 4.1 4.2. xix.

(22) 5.3. Class Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. 87. 5.4. Object Property Example . . . . . . . . . . . . . . . . . . . . . .. 87. 5.5. Inverse Object Properties Example . . . . . . . . . . . . . . . . .. 88. 5.6. Class Inheritance Example . . . . . . . . . . . . . . . . . . . . . .. 88. 5.7. Property Inheritance Example . . . . . . . . . . . . . . . . . . . .. 89. 5.8. Instances Example . . . . . . . . . . . . . . . . . . . . . . . . . .. 89. 5.9. Core ontology module of the SGO . . . . . . . . . . . . . . . . . .. 91. 5.10 Publication Channels module of the SGO . . . . . . . . . . . . . .. 96. 5.11 Contents module of the SGO . . . . . . . . . . . . . . . . . . . . .. 97. 5.12 Users module of the SGO . . . . . . . . . . . . . . . . . . . . . . 100 5.13 Opinions module of the SGO . . . . . . . . . . . . . . . . . . . . . 104 5.14 Topics and Keywords module of the SGO . . . . . . . . . . . . . . 105 5.15 Locations module of the SGO . . . . . . . . . . . . . . . . . . . . 107 7.1. Format of the data used by the technique for unique user identification based on evolving device fingerprint detection . . . . . . . 124. 7.2. Daily distribution of visitors during the period of study . . . . . . 125. 7.3. Daily distribution of visits during the period of study . . . . . . . 125. 7.4. Daily distribution of page views during the period of study . . . . 125. 7.5. Distribution of the activity records captured by unique user . . . 126. 7.6. Distribution of visits per country . . . . . . . . . . . . . . . . . . 127. 7.7. Disagreement decay for the X-Real-IP header (second interval) . . 144. 7.8. Disagreement decay for the X-Real-IP header (first interval) . . . 144. 7.9. Agreement decay for the X-Real-IP header . . . . . . . . . . . . . 145. 7.10 Agreement decay for the X-Forwarded-For header . . . . . . . . . 145 7.11 Disagreement decay for the User-Agent header . . . . . . . . . . . 146 7.12 Agreement decay for the User-Agent header . . . . . . . . . . . . 147 7.13 Disagreement decay for the Accept header . . . . . . . . . . . . . 148 7.14 Agreement decay for the Accept header . . . . . . . . . . . . . . . 148 7.15 Disagreement decay for the Accept-Language header . . . . . . . . 149 7.16 Agreement decay for the Accept-Language header . . . . . . . . . 149 7.17 Disagreement decay for the Accept-Charset header . . . . . . . . . 150 7.18 Agreement decay for the Accept-Charset header . . . . . . . . . . 150. xx.

(23) 7.19 Disagreement decay for the Accept-Encoding header . . . . . . . . 151 7.20 Agreement decay for the Accept-Encoding header . . . . . . . . . 151 7.21 Disagreement decay for the Cache-Control header . . . . . . . . . 152 7.22 Agreement decay for the Cache-Control header . . . . . . . . . . 153 7.23 Disagreement decay for the Plugins attribute . . . . . . . . . . . . 154 7.24 Agreement decay for the Plugins attribute . . . . . . . . . . . . . 154 7.25 Disagreement decay for the Fonts attribute (second interval) . . . 155 7.26 Disagreement decay for the Fonts attribute (first interval) . . . . 155 7.27 Agreement decay for the Fonts attribute . . . . . . . . . . . . . . 156 7.28 Disagreement decay for the Video attribute . . . . . . . . . . . . . 157 7.29 Agreement decay for the Video attribute . . . . . . . . . . . . . . 157 7.30 Disagreement decay for the Time zone attribute . . . . . . . . . . 158 7.31 Agreement decay for the Time zone attribute . . . . . . . . . . . 158 7.32 Disagreement decay for the Session Storage attribute . . . . . . . 159 7.33 Agreement decay for the Session storage attribute . . . . . . . . . 159 7.34 Disagreement decay for the Local storage attribute . . . . . . . . . 160 7.35 Agreement decay for the Local Storage attribute . . . . . . . . . . 160 7.36 Disagreement decay for the Internet Explorer persistence attribute 161 7.37 Agreement decay for the Internet Explorer persistence attribute . 162 7.38 Performance of the variants evaluated for the technique for unique user identification based on evolving device fingerprint detection . 172 8.1. Initial Data Collection task executed by the content-analysis techniques . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 177. 8.2. Data Preparation Activity implemented by the content-analysis techniques . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 179. 8.3. Clean data task executed by the content-analysis techniques . . . 180. 8.4. Construct data task executed by the content-analysis techniques . 183. 8.5. Format of the data used by the technique for detecting Consumer Decision Journey stages . . . . . . . . . . . . . . . . . . . . . . . 195. 8.6. Distribution of the texts along the media sources and sectors for the Consumer Decision Journey gold standard . . . . . . . . . . . 196. xxi.

(24) 8.7 8.8 8.9 8.10 8.11 8.12 8.13 8.14 8.15 8.16 8.17 8.18 8.19 8.20 8.21 8.22 8.23 8.24 8.25 8.26 8.27 8.28 8.29 8.30. Distribution of the texts along the Consumer Decision Journey categories . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Example annotation of a post according to a Consumer Decision Journey category using Amazon Mechanical Turk . . . . . . . . . Format of the data used by the technique for detecting Marketing Mix attributes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Example annotation of a post according to a Marketing Mix Category using Amazon Mechanical Turk . . . . . . . . . . . . . . . . Format of the data used by the technique for detecting emotions . Example annotation of a post according to a Emotions category using Amazon Mechanical Turk . . . . . . . . . . . . . . . . . . . Data format of the corpus used by the technique for detecting the place of residence of social media users . . . . . . . . . . . . . . . Example of user profile location metadata . . . . . . . . . . . . . Example of an output of the Google Geocoding API . . . . . . . . Example execution of table location filtering process . . . . . . . . Example of user profile description metadata . . . . . . . . . . . . Example of location extraction from content . . . . . . . . . . . . Data format of the corpus used by the technique for detecting the gender of social media users . . . . . . . . . . . . . . . . . . . . . Example of user profile name metadata . . . . . . . . . . . . . . . Dependency tree obtained from a tweet that mentions to a user . Accuracy of the Consumer Decision Journey classifier for English Accuracy of the Consumer Decision Journey classifier for Spanish Accuracy of the Consumer Decision Journey classifier by sector . Accuracy of the Marketing Mix classifier for English . . . . . . . . Accuracy of the Marketing Mix classifier for Spanish . . . . . . . Accuracy of the emotions classifier . . . . . . . . . . . . . . . . . Accuracy of the emotions classifier by sector . . . . . . . . . . . . Accuracy of the emotions classifier by social media type . . . . . . Performance of the gender recognition approaches . . . . . . . . .. xxii. 197 199 208 209 215 217 225 227 228 230 232 234 236 238 240 244 245 246 247 248 250 251 251 253.

(25) List of Tables 2.1. Prefixes that can be declared in a web server log file . . . . . . . .. 16. 2.2. Identifiers that can be declared in a web server log file . . . . . . .. 17. 2.3. Subcategories of the Marketing Mix elements . . . . . . . . . . . .. 40. 2.4. Categories for the sentiment classification, organised according to their polarity . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. 42. Relations between the conceptual framework of emotions used in this thesis and the Wordnet-Affect taxonomy . . . . . . . . . . . .. 43. Example lemmatisation and part-of-speech tagging of an example text . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. 48. 5.1. Vocabularies selected for defining the Social Graph Ontology . . .. 84. 5.2. Properties of the class sioc:UserAccount. . . . . . . . . . . . . . .. 92. 5.3. Properties of the class sioc:Post (1/2) . . . . . . . . . . . . . . . .. 93. 5.4. Properties of the class sioc:Post (2/2) . . . . . . . . . . . . . . . .. 94. 5.5. Properties of the class sioc:Forum . . . . . . . . . . . . . . . . . .. 94. 5.6. Properties of the class marl:Opinion. . . . . . . . . . . . . . . . .. 95. 5.7. Properties of the class skos:Concept . . . . . . . . . . . . . . . . .. 95. 5.8. Properties of the class sioc:Community . . . . . . . . . . . . . . .. 95. 5.9. Properties of the class rdfg:Graph . . . . . . . . . . . . . . . . . .. 95. 5.10 Properties of the class sioc:Site . . . . . . . . . . . . . . . . . . .. 96. 5.11 Properties of the class foaf:Document . . . . . . . . . . . . . . . .. 98. 5.12 Properties of the class schema:Review . . . . . . . . . . . . . . . .. 98. 2.5 2.6. 5.13 Property of the class sioc:Role . . . . . . . . . . . . . . . . . . . . 101 5.14 Properties of the class foaf:Agent . . . . . . . . . . . . . . . . . . 101 5.15 Properties of the class foaf:Person . . . . . . . . . . . . . . . . . . 101. xxiii.

(26) 5.16 Properties of the class foaf:Activity . . . . . . . . . . . . . . . . . 101 5.17 Properties of the class sgo:Cookie . . . . . . . . . . . . . . . . . . 102 5.18 Properties of the class sgo:Fingerprint. . . . . . . . . . . . . . . . 102. 5.19 Properties of the class tzont:PoliticalRegion . . . . . . . . . . . . . 107 5.20 Properties of the class tzont:Country . . . . . . . . . . . . . . . . 108 5.21 Properties of the class tzont:State . . . . . . . . . . . . . . . . . . 108 5.22 Properties of the class tzont:County . . . . . . . . . . . . . . . . . 108 5.23 Properties of the class tzont:City. . . . . . . . . . . . . . . . . . . 108. 5.24 Properties of the class schema:Continent . . . . . . . . . . . . . . 108 5.25 Properties of the class tzont:TimeZone . . . . . . . . . . . . . . . 108 6.1. Distribution of part-of-speech categories by social media type . . . 112. 7.1. Statistics associated to the number of records gathered per unique user . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126. 7.2. Distribution of visits for the 10 countries that generated more site activity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128. 7.3. Entropy of fingerprint attributes . . . . . . . . . . . . . . . . . . . 128. 7.4. Cross-entropy between pairs of fingerprint attributes. 7.5. Conditional entropy between pairs of fingerprint attributes . . . . 130. 7.6. User-Agent values for Google, Bing, and Yahoo! robots . . . . . . 132. 7.7. Disagreement decay of fingerprint attributes . . . . . . . . . . . . 142. 7.8. Agreement decay of fingerprint attributes . . . . . . . . . . . . . . 143. 7.9. Evaluation results for the variant based on uniform weights . . . . 167. . . . . . . . 129. 7.10 Evaluation results for the variant based on attribute entropy . . . 168 7.11 Evaluation results for the variant based on time decay . . . . . . . 169 7.12 Evaluation results for the variant based on attribute entropy and time decay . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 171 7.13 Comparison of the variants with more performance . . . . . . . . 173 8.1. Examples of the linguistic patterns for identifying Consumer Decision Journey stages . . . . . . . . . . . . . . . . . . . . . . . . . 202. 8.2. Primary and secondary sentiments. 8.3. Distribution of texts for the sentiment corpus by social media type 215. xxiv. . . . . . . . . . . . . . . . . . 214.

(27) 8.4 8.5. Distribution of texts for the sentiment corpus by domain . . . . . Distribution of texts for the sentiment corpus for the training and test sets by domain . . . . . . . . . . . . . . . . . . . . . . . . . . 8.6 Excerpt from sentiments in Badele3000 . . . . . . . . . . . . . . . 8.7 Examples of rules for classifying emotions . . . . . . . . . . . . . 8.8 Collocations of “odio” in Badele3000 . . . . . . . . . . . . . . . . 8.9 Accuracy of the place of residence identification approaches . . . . 8.10 Coverage of the gender recognition approaches . . . . . . . . . . . 8.11 Confusion matrix with the results of the approach based on mentions to users. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9.1. 216 219 221 223 223 252 253 254. Rule reordering example . . . . . . . . . . . . . . . . . . . . . . . 264. xxv.

(28) xxvi.

(29) Chapter 1 INTRODUCTION The rise of Web 2.0 technologies and social media has enabled users to author their own content. This has populated the Web with huge amounts of user-generated content that can be exploited for many different and interesting purposes, such as explaining or predicting real world outcomes through opinion mining, which provides a valuable tool for market research. Data scientists in almost every industry that is exposed to public opinion are under pressure to deal with the explosive growth of social media. Such professionals must be aware of what is said about the issues that affect their business in different social media channels. Social media are media in which information is created by the interaction of users, who express their opinions freely and spontaneously. This has revolutionised the way in which organisations and consumers interact. Users have adopted massively these channels to engage in conversations about content, products, and brands, while organisations are striving to adapt proactively to the threats and opportunities that this new dynamic environment poses. Social media is a knowledge mine about users, communities, preferences and opinions, which has the potential to impact positively marketing and product development activities [Weber, 2007]. In the marketing field, media and society digitalisation has revolutionised the rules of traditional brand communication with an explosion of channels and possibilities for brands to contact consumers. Brands and media agencies are facing a big challenge developing systems to assure the best communication strategy for the brand (in terms of cost, effectiveness and efficiency). Activities such as. 1.

(30) word-of-mouth advertising where products or brands are promoted via oral or written communication have successfully adapted to social media through viral processes. It is becoming essential to know the views of consumers towards brands and products for designing advertisement campaigns, estimating future sales and deciding the strategy to follow when launching a new brand image. According to a Nielsen [2012b] report, 70% of social media users take into account the product experience published by other users; 65% declare to search information about brands, products and services; 53% express positive comments on brands; and 50% express complaints at least once per month. Social media monitoring tools are being used successfully in a range of domains (including market research, online publishing, etc.). However, tools available nowadays to analyse social media do not leverage completely the rich and complex information structure generated by users. Most of these tools elaborate their reports from metrics based on volume of posts, opinion polarity about the subject that is being studied, and users’ reputation. Although such metrics are good indicators of a subject’s popularity and relevance, these metrics are often inadequate for capturing complex multi-modal dimensions of the subjects to be measured that are relevant to business, and must be complemented with ad-hoc studies such as opinion polls. Therefore, existing opinion-mining techniques must be extended for discovering other aspects of discourse, such as consumer intents, mood and emotions. Overcoming some of the limitations of current tools to manage and analyse the information produced in social media is a pending challenge that this thesis addresses. The main goal of this thesis is to provide a data model and a set of techniques based on Web users tracking and natural language processing for extracting semantic information from the contents generated by consumers in social media. In the following paragraphs we introduce the specific contributions of this thesis to the State of the Art. The disparity of formats, mechanisms for accessing the information, content sizes, and metadata hinders the collection, integration and processing of the content published in social media, forcing to use specific methods and techniques for each kind of media. In this thesis, we provide a data model for the marketing domain that can be using for standardising and normalising the. 2.

(31) information that can be extracted from social media about consumers, brands, media and opinions of consumers about brands (C1). The distributed nature of the Web and the disparity of devices, that can be used to access social media (PCs, smartphones, tablets, smart TVs, etc.) make difficult to track the actions performed by users for web analytics purposes. Unique user identification is a key task within the web analytics data collection process, and is useful for measuring the effectiveness of online advertising campaigns, among other applications. The fingerprinting technique consists in tracking user activity on a set of sites by capturing technical information about the browser and the machine that the user employs to navigate the Web. Browser fingerprinting has been demonstrated to be an effective method for unique user identification when the device used to navigate the Web does not support cookies. However as the attributes used for generating browser fingerprint evolve, multiple distinct fingerprint records are created for the same user, leading to incorrect unique user identification. This thesis contributes to the State of the Art with a technique for unique user identification that detects browser fingerprint evolution (C3). In the last decade, the availability of digital user-generated documents from social media has dramatically increased. This massive growth of user-generated content has also affected traditional shopping behaviour. Customers have embraced new communication channels such as microblogs and social networks that enable them not just to talk with friends and acquaintances about their shopping experience, but also to search for opinions expressed by complete strangers as part of their decision making processes. Uncovering how customers feel about specific products or brands and detecting purchase habits and preferences has traditionally been a costly and highly time-consuming task which involved the use of methods such as focus groups and surveys. However, the new scenario calls for a deep assessment of current market research techniques in order to better interpret and profit from this ever-growing stream of attitudinal data. With this purpose, we present a novel analysis and classification of user-generated content in terms of it belonging to one of the four stages of the Consumer Decision Journey [Court et al., 2009] (i.e. the purchase process from the moment when a customer is aware of the existence of the product. 3.

(32) to the moment when he or she buys, experiences and talks about it) (C4.1). Using a corpus of short texts written in English and Spanish and extracted from different social media, this thesis identifies a set of linguistic patterns for each purchase stage that will be then used in a rule-based classifier. Additionally, we use machine-learning algorithms to automatically identify business indicators such as the Marketing Mix elements [McCarthy and Brogowicz, 1981] (C4.2). Sentiment analysis of social media is of commercial interest as user-generated content published in the Web reaches and influences many potential customers. Most work in this field has focused on opinion polarity (positive or negative) and, therefore, does not specify the kind of sentiment related to that opinion. In order to provide this information, this thesis establishes four polarised categories that capture the main sentiments that can be found on social media: satisfaction-dissatisfaction (SD), trust-fear (TF), love-hate (LH), and happiness-sadness (HS). It develops a rule-based system that classifies texts in Spanish from those social media, according to this sentiment classification with respect to a brand, company or product. The rules have been written in a simple grammar after (linguistically) analysing a corpus of different business domains whose texts had been manually classified (C4.3). Characterising users through demographic attributes is a necessary step before conducting opinion surveys from information published by such users in social media. In this thesis, we describe, compare and evaluate different techniques for the identification of the attributes “gender” (C4.4) and “place of residence” (C4.5) by mining the metadata associated to the users, the content published and shared by themselves, and their friendship networks. Natural language processing techniques are a key technology for analysing user-generated content. Despite some efforts have been done to structure social media information, such as Twitlogic [Shinavier, 2010], there is still the need for approaches that are able to cope with the different channels in the Social Web and with the challenges they pose. The content published in social media is characterised by the use of casual language; social media posts contain texts that vary in length from short sentences in microblogs to medium-size articles in web logs. Very often the text published in social media contains misspellings, is com-. 4.

(33) pletely written in uppercase or lowercase letters, or it is composed of set phrases, among other characteristics that challenge existing content analysis techniques, leading to problems regarding the accuracy of natural language processing tools like part-of-speech taggers. As an example, for the Spanish language, the absence of an accent in a word may give such word a completely different meaning. As a minor contribution, this thesis studies the differences of the language used in heterogeneous social media sources, by analysing the distribution of the part-of-speech categories extracted from the analysis of the morphology of a sample of texts published in such sources, showing that the task of normalising user-generated content is a necessary step before analysing social media posts, particularly on Twitter1 (C2). Therefore the content analysis techniques proposed by this thesis implement a stage that performs a morphological normalisation of user-generated content that makes use of on-line and collectively developed resources, including Wikipedia2 and a SMS lexicon. The results obtained demonstrate that the normalisation of user-generated content improves slightly the accuracy of the content analysis techniques presented in this thesis.. 1.1. Thesis Structure. This thesis is structured as follows: • Chapter 2 reviews the State of the Art and identifies the open research problems addressed in this thesis. • Chapter 3 presents the objectives of this thesis, which were defined according to the open research problems identified in Chapter 2. In addition, we present the contributions to the State of the Art, as well as the assumptions and hypotheses on which our contributions rely. Finally we describe the restrictions, which define the scope of the different contributions. • Chapter 4 presents the research methodology, and the method followed for obtaining the artefacts provided by this thesis, which is inspired in an 1 2. http://twitter.com http://www.wikipedia.org. 5.

(34) existing framework for web mining. For defining the model of the data warehouse we have followed an existing methodology for building ontology networks. For addressing the rest of the phases defined by the framework, we have followed an existing data mining process model. • Chapter 5 describes the data model that we have designed for representing the information extracted from social media for the marketing domain. • Chapter 6 characterises the different kinds of social media according to the morphosyntactic characteristics of the textual content published in such media. • Chapter 7 provides a technique for uniquely identifying users in social media based on the fingerprint of their devices, regardless the evolution of such fingerprints. The chapter also presents the evaluation results and describes the data set used for evaluating the technique. • Chapter 8 presents a collection of techniques for extracting sociodemographic and psychographic profiles from social media users applied to the marketing domain, through the analysis of the opinions they express about brands, as well as from the profiles published by them in social networks. The chapter also presents the evaluation results and describes the data sets used for evaluating the techniques. • Finally, Chapter 9 presents research conclusions and possible future lines of research and innovation.. 6.

(35) 1.2. Dissemination of Results. Some of the contributions produced within the framework of this thesis have been published in international peer-reviewed journals, conferences and workshops. In the following we list the contributions along with the publications that support them. The technique proposed for uniquely identifying users in social media based on the fingerprint of their devices has been published in an international journal: Óscar Muñoz-Garcı́a, Javier Monterrubio-Martı́n, Daniel Garcı́aAubert. Detecting browser fingerprint evolution for identifying unique users. International Journal of Electronic Business, 10(2):120–141, 2012, ISSN 1470-6067, DOI 10.1504/IJEB.2012.051116. The techniques proposed for classifying user-generated content into Consumer Decision Journey stages and Marketing Mix elements have been published in an international journal indexed by JCR: Silvia Vázquez, Óscar Muñoz-Garcı́a, Inés Campanella, Marc Poch, Beatriz Fisas, Nuria Bel, Gloria Andreu. A classification of usergenerated content into Consumer Decision Journey stages. Neural Networks, 58:68–81, October 2014, ISSN 0893-6080, DOI 10.1016/J.NEUNET.2014.05.026. The technique proposed for detecting emotions has been published in the proceedings of a Spanish conference: Guadalupe Aguado-de-Cea, Marı́a Auxiliadora Barrios, Marı́a Socorro Bernardos, Inés Campanella, Elena Montiel-Ponsoda, Óscar Muñoz-Garcı́a, Vı́ctor Rodrı́guez. Análisis de sentimientos en un corpus de redes sociales. In Proceedings of the 31st AESLA (Asociación Española de Lingüı́stica Aplicada) International Conference, San Cristóbal de la Laguna, Tenerife, Spain, April 2014.. 7.

(36) The techniques proposed for identifying the place of residence and gender of social media users have been published in a Spanish journal: Óscar Muñoz-Garcı́a, Jesús Lanchas Sampablo, David Prieto Ruı́z. Characterising social media users by gender and place of residence. Procesamiento del Lenguaje Natural, 51:57–64 , September 2013, ISSN 1135-5948. The characterisation of the different kinds of social media according to the morphosyntactic characteristics of the textual content published in such media has been published in the proceedings of an international workshop: Óscar Muñoz-Garcı́a, Carlos Navarro. Comparing user generated content published in different social media sources. In Proceedings of the NLP can u tag #user generated content ?! via lrecconf.org Workshop co-located with Eighth International Conference on Language Resources and Evaluation (LREC 2012), pp. 1–8, Istanbul, Turkey, 26 May 2012. Finally, the approach that we follow for performing morphological normalisation of social media posts has been published in the proceedings of a Spanish workshop: Óscar Muñoz-Garcı́a, Silvia Vázquez Suárez, Nuria Bel. Exploiting Web-based collective knowledge for micropost normalisation. In Proceedings of the Tweet Normalization Workshop co-located with 29th Conference of the Spanish Society for Natural Language Processing (SEPLN 2013), pp. 10–14, Madrid, Spain, 20 September 2013, ISSN 1613–0073.. 8.

(37) Chapter 2 STATE OF THE ART This chapter reviews the State of the Art regarding the objectives of this thesis. The information published in social media consists of connected data by nature, due to the interlinked nature of social networks. Therefore, graph-based data models are an appropriate way of representing the relationships between the users and contents included in social media. Section 2.1 describes existing semantic vocabularies that can be used for representing social media information. Such vocabularies will be reused in this thesis to provide a normalised schema for structuring the information published in social media. This thesis provides a technique for unique user identification which is an essential step for tracking the activity of users in the Web. Section 2.2 describes the existing techniques for tracking users in the Web, while Section 2.3 describes a technique for detecting the evolution of temporary records, upon which our user technique for identifying unique users is based. Additionally, this thesis has a strong business context, and its objectives are devoted to solve specific problems related with the marketing field. Section 2.4 describes the State of the Art on social media analysis applied to market research, while Section 2.5 introduces the marketing background upon which the contributions of our thesis are based. Finally, many of the contributions of this thesis rely on natural language processing techniques applied to the analysis of textual content published in social media, whose State of the Art is described in Section 2.6. In the following we detail the State of the Art and existing research problems related with it.. 9.

(38) 2.1. Semantic Vocabularies for Representing Social Media Information. Social media and the online communities built around them are silos whose users, contents, topics, etc. are rarely connected among them (e.g. Twitter data is not connected with Facebook3 data), except for minor service integrations (e.g. publishing a tweet whenever an status update is made in a LinkedIn4 account). In addition, there is no unified data format according to which to express the information posted to every social media. For example, the data published using the Facebook Graph API5 does not match the one used by the Twitter API6 , neither do match the content syndication formats RSS7 and Atom [Nottingham and Sayre, 2005], commonly used by weblogs and news publication sites. Format heterogeneity and cross-social network integration issues difficult data gathering and the integrated analysis of the data published in social media. SIOC [Breslin et al., 2006] is a Semantic Web ontology designed to cope with these issues. It uses RDF8 for representing data published in social media, allowing linking posts, authors, topics, and other concepts, regardless specific social networks, therefore providing a mechanism for integrating information related to online communities. The SIOC vocabulary is linked with FOAF [Graves et al., 2007] for representing information about users and user-accounts. FOAF defines a data model of persons and relationships between persons, including mappings with other Semantic Web vocabularies, like Schema.org9 . Schema.org is a vocabulary designed for marking up HTML10 pages to improve indexing and metadata visualisation by search providers like Google11 , Yahoo!12 3. http://www.facebook.com http://www.linkedin.com 5 http://developers.facebook.com/docs/graph-api 6 https://dev.twitter.com 7 http://www.rssboard.org/rss-specification 8 http://www.w3.org/TR/rdf11-concepts 9 http://schema.org 10 http://www.w3.org/TR/html5 11 http://www.google.com 12 http://www.yahoo.com 4. 10.

(39) and Bing13 . This vocabulary includes a rich set of classes and properties that can be used for complementing the ones provided by SIOC and FOAF for annotating users and contents. Additionally, SIOC reuses the Dublin Core vocabulary14 for aggregating metadata to posts (e.g. title, summary, publication date) using properties standardised by DCMI (Dublin Core Metadata Initiative)15 . The SIOC specification16 suggests using SKOS [Miles et al., 2005] for representing topics according to which contents can be categorised. SKOS is a RDF vocabulary that provides a model for representing conceptual schemes such as, thesauri, classification schemes, subject heading lists, taxonomies, and other king of controlled vocabularies within the framework of the Semantic Web. Regarding geo-localisation of contents and users, FOAF is linked with the WGS8417 vocabulary that allows annotating resources with geographical coordinates. In addition, for representing time zones and political regions (e.g. countries and states) the ontology Time Zone18 can be used. Schema.org also provides ontology elements for describing spatial features of web resources. SIOC does not provide ontology elements, neither a recommendation for annotating the content with the results of natural language analysis processes. Nevertheless there exists multiple vocabularies that can be used for performing this task. As an example, the categorisation model ISOcat [Kemps-Snijders et al., 2008] can be used for annotating contents with linguistic information based on a standardised set of categories. With respect to Opinion Mining, Marl [Westerski et al., 2011] is an ontology used for annotating and describing opinions according to the polarity expressed in them with respect to specific entities (e.g. brands, persons) mentioned in social media. Therefore it provides ontology elements for classifying opinions into three possible categories of polarity (i.e. positive, negative, neutral) and for quantifying such polarity according to a numeric scale. Additionally, the Onyx 13. http://www.bing.com http://dublincore.org/documents/dcmi-terms 15 http://dublincore.org 16 http://rdfs.org/sioc/spec 17 http://www.w3.org/2003/01/geo 18 http://www.w3.org/2006/timezone 14. 11.

(40) ontology [Sánchez-Rada and Iglesias, 2013] allows categorising opinions into a broader set of emotions, like the ones described by the Wordnet-Affect taxonomy [Valitutti et al., 2004]. Multiple instances of social graphs can be used to perform analyses with different data sets (e.g. for analysing different domains or markets). These instances can be treated separately with RDF named graphs, and each named graph can be described by using the graph description metadata, like the one provided by the RDFG vocabulary [Carroll et al., 2005]. Finally, the PROV-O (PROV Ontology)19 provides a set of ontology elements that can be used for representing and exchanging information of the provenance of data generated by different systems. Therefore, it can be used within the social media field for indicating the content authoring entities and referencing publication sources. PROV-O has been mapped with the Dublin Core vocabulary20 , which in turn is mapped with FOAF. Thus, expressing social media facts using the FOAF and Dublin Core vocabularies automatically adds provenance information through the existing mappings. All these vocabularies are richer enough for describing general-purpose social graphs. However, we have not found during our survey vocabularies that allow describing some of the concepts related with the Marketing domain this thesis deals with and that will be explained in the following sections. Neither there exists a unified model that integrates the different vocabularies.. 2.1.1. Conclusions. Open Research Problem 1. While there exist data models for representing information captured from social media, either generic or social-network-specific, there are not schemas that integrate such information with marketing-specific classifications and KPIs (Key Performance Indicators) obtained from the analysis of the content generated by the consumers and the activity produced by them in social media. Therefore, the existing vocabularies may be extended with ontology elements that model marketing-related knowledge. 19 20. http://www.w3.org/TR/prov-o http://www.w3.org/TR/prov-dc. 12.

(41) Open Research Problem 2. The existing data models for representing social media information characterise the metadata that accompany the content published in the different kind of media. However, there is not a characterisation of such media according to the linguistic features of the textual contents published on them.. 2.2. Techniques for Tracking Users in the Web. User tracking consists in registering the activity of users as they interact with one or more websites so that such activity can be related with specific, uniquely identified users. The tracking of users is an essential activity in order to perform Web Analytics. Web Analytics is the professional discipline designed to draw conclusions, define strategies, or establish business rules on the basis of data collected in all web environments on which a company has control [Maldonado, 2009]. Web Analytics is a professional discipline because there is an industry related to Business Intelligence, Market Research and Marketing, which demands professionals with Web Analytics skills, which provide insights to their customers. Web Analytics allows studying the behaviour of users in websites, drawing conclusions, such as why they came to the site and from where, why they leave and where they went to, why they do not perform the actions we were expecting them to perform, or what search terms were used to get to the website. The strategies and business rules that Web Analytics enable are oriented to drive a continous improvement of the online experience that customers and potential customers have, leading them to website desired outcomes [Kaushik, 2009]. Web Analytics are used for measuring the performance of websites in a commercial context, providing a measurement model to Digital Marketing, allowing to quantify the effectiveness and impact of advertisement campaigns in digital media. Data gathered by applying Web Analytics (e.g. number of persons that have visualised a banner) are typically compared against KPI (e.g. outreach of a campaign) and used to improve the audience response to marketing campaigns (e.g. move the banner to a site with more audience). The most significant KPIs depend on counting unique visitors.. 13.

(42) Within a Web Analytics context, the data collection process consists in recording the activity generated by users while they interact with a set of websites. Such recorded activity may contain records about advertisement impressions, clicks on web page hyperlinks, and other navigational information. Collected data is useful for a number of marketing activities, such as, analysing advertisement campaign outreach or performing behavioural targeting, which involves tracking the on-line activities of users in order to deliver tailored ads to them. Specifically, ad targeting techniques, such as the one described by Deane et al. [2011], rely on data with users uniquely identified. For collecting such data, firstly the activity itself must be captured. After that, such activity must be associated with unique visitors. Visits and unique visitors are the basic web metrics required for nearly every web metric calculation [Kaushik, 2009]. As defined by the Digital Analytics Association [Burby and Brown, 2007]: Definition 1. A visit is an interaction, by an individual, with a website consisting of one or more requests for an analyst-definable unit of content (i.e. page view). Definition 2. The KPI unique visitors refers to the number of inferred individual people (filtered for spiders and robots), within a designated reporting timeframe, with activity consisting of one or more visits to a site. Each individual is counted only once in the unique visitor measure for the reporting period. At least six of the eight critical web metrics defined by Kaushik [2009] depend on uniquely identifying users (i.e. unique visitors, time on page, time on site, bounce rate, exit rate, and engagement). The other two are visits and conversion rate. Conversion rate can be calculated by taking into account either unique visitors or visits, depending on business objectives.. 2.2.1. Techniques for Capturing Web Activity. There are four main ways of capturing the activity (a.k.a. clickstream data) of website users [Kaushik, 2007]: web logs, web beacons, JavaScript tags, and packet sniffing. This section describes these approaches and analyses their advantages and disadvantages.. 14.

(43) 4. 3. 2. 1. Web Servers. Log files. Figure 2.1: Process followed by the technique based on web logs (adapted from Kaushik [2007]) 2.2.1.1. Technique Based on Web Logs. Web logs are a classic system for capturing clickstream data. This technique is implemented by web servers and consists in registering one log entry each time there is a request to a web server by a web client. In such log-based systems, the web server triggers the log action when it receives a request from the client. Figure 2.1 shows the process followed by this technique. The steps of this process are the following: 1. A user requests a resource (e.g. a web page) through its URL [Berners-Lee, 1994]. 2. The request is sent to a web server. 3. The server receives the request and creates a record in its log describing the request. 4. Finally, the server sends the resource to the user. The format of web server logs has been standardised by W3C21 . The standard proposes to describe log files as a sequence of log entries preceded by a header with one or more of the metadata described next: Version. Specifies the version of the log file format used. 21. http://www.w3.org/TR/WD-logfile.html. 15.

(44) Prefix c s r cs sc sr rs x. Description Client. Server. Remote. Client to Server. Server to Client. Server to Remote Server. This prefix is used by proxies. Remote Server to Server. This prefix is used by proxies. Application specific identifier.. Table 2.1: Prefixes that can be declared in a web server log file Fields. Specifies the fields recorded in the log. Such fields are defined by using a prefix and a field identifier. The prefix refers to the information transfer mode, while the identifier refers to an entry data type. For example, the identifier cs-method refers to the HTTP method [Fielding and Reschke, 2014b] used for data transfer from client to server. Table 2.1 shows the list of available prefixes, while Table 2.2 shows the possible fields that can be registered, indicating if the field requires or does not require to declare a prefix. Software. Identifies the software that generated the log. Start-Date. The date and time at which the log was started. End-Date. The date and time at which the log was finished. Date. The date and time at which the entry was added. Remark. Comment information. Analysis tools should ignore data recorded in this field. Listing 2.1 shows an example file log that includes a header in which the version used (line 1), the recording date (line 2), and fields registered (line 3) are specified. Registered fields correspond to the timestamp of particular requests, the HTTP method used, and the URI of the resource requested. The technique based on logs is the most accessible from all the techniques for recording web activity, since most web servers implement it. Also, there are. 16.

(45) Identifier date time time-taken bytes cached ip dns status comment method uri uri-stem uri-query. Description Date at which transaction completed. Time at which transaction completed. Time taken for transaction to complete in seconds. Number bytes transferred. Records whether a cache hit occurred. IP [Postel, 1981] address and port. DNS name [Mockapetris, 1987]. Status code [Fielding and Reschke, 2014b]. Comment returned with status code. HTTP method. URI [Berners-Lee et al., 2005]. Stem portion alone of URI (omitting query). Query portion alone of URI.. Prefix No No No. Type Date Time Fixed. No No Yes Yes Yes Yes Yes Yes Yes Yes. Integer Integer Address Name Integer Text Name URI URI URI. Table 2.2: Identifiers that can be declared in a web server log file 1 2 3 4 5 6 7. #Version: 1.0 #Date: 12−Jan−1996 00:00:00 #Fields: time cs−method cs−uri 00:34:23 GET /foo/bar.html 12:21:16 GET /foo/bar.html 12:45:52 GET /foo/bar.html 12:57:34 GET /foo/bar.html Listing 2.1: Example log file. numerous tools that allow analysis of logs such as AWStats22 , Webalizer23 and Analog24 . The main criticism to this technique is that the information captured in log files is often too technical (HTTP errors [Fielding and Reschke, 2014b], browser types, etc.) to be used directly for business purposes (e.g. marketing intelligence). Similarly, the information recorded in the logs is too large, since it records the download of any resource provided by the web server (style sheets, images, etc.) 22. http://awstats.sourceforge.net http://www.webalizer.org 24 http://www.analog.cx 23. 17.

(46) regardless it worths been measured or not. Therefore the log files must be conveniently filtered prior to their analysis. The technique based in logs is able to register any activity that implies an HTTP request [Fielding and Reschke, 2014a] from the client to the server. However it is not able to register users’ behaviour on web pages that do not require a resource download operation. Such operations are becoming more common due to dynamic web pages. 2.2.1.2. Technique Based on Web Beacons. The web beacons technique consists in placing banners, or 1 × 1 pixel transparent images, in web pages within img src HTML tags. When these tags are processed, a request to a tracking server is performed, what triggers the recording of the activity. Figure 2.2 shows the process followed by this technique. The steps of this process are the following: 1. A user requests a web page through its URL. 2. The request is sent to a web server. 3. The server sends the web page including an image of 1 × 1 pixels whose URL points to a data collection server. 4. When the web page is loaded in the user’s browser, a request of the image is sent to the data collection server. 5. The data collection server sends the image to the user, taking advantage of the HTTP protocol for managing cookies in the user’s device, and capturing user data, such as the web page that the user is viewing, the IP address of the user’s device, the timestamp of the activity, etc. Web beacons are used not only to capture information relating to the navigation of web pages; they can also can be inserted into email messages, so KPIs about an email sent can be recorded (e.g. number of email views). However, users often disable the download of images within their email applications.. 18.

(47) 4. Data Collector. 5. 3. 2 1. Website Servers. Figure 2.2: Process followed by the technique based on web beacons (adapted from Kaushik [2007]) 2.2.1.3. Technique Based on JavaScript Tags. The JavaScript tags technique is the most used nowadays, existing multiple commercial tools that implement it (e.g. Adobe Marketing Cloud25 , IBM EMM26 , webtrends27 , and Google Analytics28 ). It consists in placing JavaScript [ECMA, 2011] code within HTML pages, so that, when an event to be measured is produced, the scripting code is evaluated. Such code includes a request to a tracking server. Thus, when the script is evaluated, the request is performed and the activity is recorded. Figure 2.3 shows the process followed by this technique. The steps of this process are the following: 1. A user requests a web page through its URL. 2. The request is sent to a web server. 3. The server sends the web page including a script of JavaScript code assigned to different events (e.g. web page load, click on an active item). 4. When an event is triggered, its assigned JavaScript code is executed. Such code includes sending an HTTP request to a data collection server. 25. http://www.adobe.com/en/solutions/digital-marketing.html http://www.ibm.com/software/products/category/enterprise-marketing-management 27 http://webtrends.com 28 http://www.google.com/analytics 26. 19.

(48) 5. The data collection server processes the request, taking advantage of the HTTP protocol for managing cookies in the user’s device, and capturing user data, such as the web page that the user is viewing, the IP address of the user’s device, the timestamp of the activity, etc. Both, the technique based in web beacons and the technique based in JavasScript tags, allow collecting the web activity produced in multiple websites into a single data collection system. Figure 2.4 illustrates this scenario.. 4. Site Analytics Services. 3. 2. 5. 1. Website Servers. Figure 2.3: Process followed by the technique based on JavaScript tags (adapted from Kaushik [2007]). 3 4 2. Site Analytics Services. 5. Website 1 servers. 3 1. Data Collector. 2 Website 2 servers. Figure 2.4: Process followed by the tags or web beacons techniques for gathering data from multiple sites (adapted from [Kaushik, 2007]). 20.

(49) 2.2.1.4. Technique Based on Packet Sniffing. The packet sniffing technique consists in inspecting IP packages exchanged between web browsers and web servers. Packet sniffers can be implemented as a software layer over the web server, or as an independent module that intercepts and analyses the packages sent by web browsers before re-routing them to web servers. Figure 2.5 shows the process followed by this technique. The steps of this process are the following: 1. A user requests a web page through its URL. 2. The request is intercepted in its route to the web server by a packet sniffer that extracts the request data from the HTTP header of the request. 3. The packet sniffer re-routes the request to the web server. 4. The web server sends its response to the user’s browser. The response is intercepted by the packet sniffer, which extracts the information about the web page being served. Additionally, some sniffers add JavaScript tags to the web page, with the aim of obtaining additional information, once the browser processes the scripts. 5. The packet sniffer re-routes the response to the web browser.. 4. 5. 1. 2. Packet Sniffer. 3 Website Servers. Figure 2.5: Process followed by the technique based on packet sniffing (adapted from Kaushik [2007]). 21.

(50) 2.2.2. Techniques for Identifying Unique Users. This section describes the existing techniques for identifying unique users. Section 2.2.2.1 describes the widely used technique based on cookies, while Section 2.2.2.2 describes a novel technique based on the fingerprint of the device used for browsing the Web.. 2.2.2.1. Technique Based on Cookies. With respect to the technique for uniquely identifying users, the one based on cookies is the most extended. A cookie is a message sent to a web browser from a web server. The browser stores the message and forwards it to the server each time the web browser requests a page from the server. The web server can send two different kinds of cookies: 1. Session cookies, which have a lifetime limited to the user interaction with the website. 2. Persistent cookies, which remain on the machine of the user until a date of cookie expiration. The second type of cookies is the one used for user identification. Each time a request comes from a web browser to a web server, the server checks if a specific cookie exists on the client. If the cookie exists, the server obtains it and reads a unique user identifier stored on it. If the cookie does not exist, the server generates a new one, with a new unique user identifier, and sends it to the client. Typically, cookies used to identify users contain a user identifier, unique and anonymous, which identifies the browser. Therefore, this type of cookies identifies browsers used by users to access the Web. If a user uses multiple devices, the same user will be identified multiple times as a unique user (once per device). Cookies may be disabled in web browsers, or not supported by certain devices, such as smart TVs, so the user identification technique based on cookies cannot be universally applied. In addition, the browser may be configured to delete cookies periodically, or they can be erased by anti-spyware applications.. 22.

(51) 2.2.2.2. Technique Based on Fingerprint. The technique based on fingerprint is an alternative to the technique based on cookies. This technique consists in identifying users from a number of attributes of the web browser or that can be queried through it. These attributes are sent from the web browser to the web server within the headers of each HTTP request, or are available once a page has been loaded in the browser so that attribute values can be sent to the web server using the JavaScript tags technique explained before. Eckersley [2010] demonstrated the effectiveness of this technique by extracting and collecting the fingerprints of 470,161 browsers. After analysing the data Eckersley [2010] obtained the following conclusions: • 83.6% of browsers have a unique fingerprint. • In addition, 94.2% of the browsers with Adobe Flash Player29 or Java Virtual Machine30 installed have a unique fingerprint. This is because, making use of these technologies, more data are available for differentiating one browser from another (e.g. the fonts installed on the system). • The entropy [Shannon, 1948] associated with the distribution of fingerprints is 18.1 bits, which means that, if a browser is taken at random, at most one in 286,777 browsers share the same fingerprint. • However, the fingerprint of each web browser may change quickly. The number of unstable fingerprints was of 37.4% during the period of study. The approaches for implementing user identification based on browser fingerprint are described next [Eckersley, 2010]. Use the fingerprint as a global identifier. The strength of this technique is that, while cookies can be removed, disabled or not supported by certain web browsers or specific devices (e.g. smartphones and set-top boxes), a fingerprint can be always obtained. The weakness of this technique is that changes on the client (e.g. updating the browser version) imply changes on 29 30. http://www.adobe.com/products/flashplayer.html http://www.java.com. 23.