
Environment for the evaluation and certification of data products quality


Academic year: 2020



I25K: Environment for the Evaluation and Certification of Data Products Quality

Ph.D. Thesis

Author: Jorge Merino
April 2017

Ph.D. Supervisors: Ismael Caballero, Manuel Serrano, Mario Piattini


Jorge Merino
Ciudad Real - Spain
email: jorge.merino@uclm.com
© 2017 Jorge Merino

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.3 or any later version published by the Free Software Foundation; with no Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts. A copy of the license is included in the section entitled "GNU Free Documentation License".

Several names used by companies to differentiate their products and services may be claimed as registered trademarks. Those names are written in uppercase letters or as proper names wherever they appear in this document, where the author has been made aware of them.

This document was created with LaTeX, using a template originally created by Carlos González Morcillo and Sergio García Mondaray, and edited by Jorge Merino.


I would like to express my sincere gratitude to Dr. Ismael Caballero for all those talks full of creation and pure investigation that increased my development not only as a researcher and professional, but also as a person. My sincere thanks also go to Dr. Manuel Serrano and Prof. Mario Piattini, who gave me the opportunity to conduct this research, for their insightful comments and encouragement.

I thank my co-workers, my fellows, and above all, my friends from the DQTeam for their help throughout these years. I am grateful to everybody around me who made this journey more pleasant and complete.

I am indebted to my family for their understanding, patience, and trust. Special thanks to my designer, brother, and friend for such impressive work and support. Thanks to “Mine” for your silences and your never-ending talks, your support, and you.

Finally, thank you, reader, for being interested in my work. I hope my effort is useful to you.


Abstract

Data is the most important asset of any IT organization. The most successful companies in the world are data-driven businesses.

Like any other raw material or asset for creating goods and services, data must be good enough for those companies to obtain benefits. When making choices based on data, it is vital that this raw material has the necessary levels of quality. Otherwise, goods and services created using the data might be useless or inappropriate for their intended purposes, and data-based decisions might be worthless or even harmful for the companies.

Because both industry and public entities have an interest in Data Quality, several solutions for Data Quality Assessment are present in the literature. Unfortunately, none of them focuses on the certification and assurance of the levels of quality of this precious asset. Consequently, this research delves into the evaluation and certification of data in terms of its quality.

The main contribution of this thesis is the creation of an environment called I25K for the evaluation and certification of Data Quality. I25K is composed of a Data Quality Model, an evaluation process, and a certification process. This environment has been defined to be easily implementable and deployable, including details on the necessary resources, the roles that participate in the evaluation and certification processes, and their responsibilities. I25K was validated through several case studies, and it has been implemented and deployed alongside two important Spanish companies.

Future lines of research have been drawn as well, including the improvement of Data Quality and the extension of the results.


Contents

General Index
List of Figures
List of Tables

I. Introduction
  I.1. Data Quality in real life
  I.2. Data Quality theoretical and practical knowledge
  I.3. Hypothesis and Research goals
    I.3.1. Research goals
  I.4. Context of the Ph.D.
  I.5. Document structure

II. Research method
  II.1. Action-Research
    II.1.1. Application of Action-Research
      II.1.1.1. Action-Research Cycle 1. Data Quality Model
      II.1.1.2. Action-Research Cycle 2. Quality Evaluation Process
      II.1.1.3. Action-Research Cycle 3. Business Rules representation
      II.1.1.4. Action-Research Cycle 4. Certification Environment and Integration
    II.1.2. Approach
  II.2. Systematic Literature Review
    II.2.1. Planning of the Systematic Literature Review
      II.2.1.1. Identification of the need for a review
      II.2.1.2. Specifying the research question(s)
      II.2.1.3. Development of the Review Protocol
      II.2.1.4. Data Extraction
      II.2.1.5. Data Synthesis
      II.2.1.6. Evaluation of the Review Protocol

III. Related Work
  III.1. Data Quality
    III.1.1. Defining Quality
    III.1.2. Defining Data Quality
    III.1.3. Data Quality Models and Measurement
      III.1.3.1. Redman’s Data Quality Model
      III.1.3.2. English’ Data Quality Model
      III.1.3.3. MIT traditional Data Quality Model
      III.1.3.4. MIT recent Data Quality Model
      III.1.3.5. A classification of the Data Quality measurement methods
    III.1.4. Data Quality Assessment and Improvement
  III.2. Data Quality Standards
    III.2.1. ISO/IEC 25000. SQuaRE
      III.2.1.1. ISO/IEC 25012
      III.2.1.2. ISO/IEC 25024
      III.2.1.3. ISO/IEC 25040
    III.2.2. ISO 8000
      III.2.2.1. ISO 8000-8
    III.2.3. SBVR
  III.3. Data Quality Certification
    III.3.1. Conducting the Review
      III.3.1.1. Study selection
      III.3.1.2. Data Extraction
      III.3.1.3. Data Synthesis
    III.3.2. Conclusions of the literature review
  III.4. Summary

IV. Data Quality Model
  IV.1. Elements of the model
    IV.1.1. Application of the foundations on the Data Quality Model
      IV.1.1.1. ISO/IEC 25012
      IV.1.1.2. ISO/IEC 25024
  IV.2. How to use the Data Quality Model
  IV.3. Data Quality Characteristics measurement
    IV.3.1. Accuracy
    IV.3.2. Completeness
    IV.3.3. Consistency
    IV.3.4. Credibility
    IV.3.5. Currentness
  IV.4. Data Quality Properties measurement
    IV.4.1. Syntactic Accuracy
    IV.4.2. Semantic Accuracy
    IV.4.3. Accuracy Range
    IV.4.4. Record Completeness
    IV.4.5. File Completeness
    IV.4.6. Data Values Completeness
    IV.4.7. False Completeness of a File
    IV.4.8. Referential Integrity
    IV.4.9. Format Consistency
    IV.4.10. Risk of Inconsistency
    IV.4.11. Semantic Consistency
    IV.4.12. Data Values Credibility
    IV.4.13. Source Credibility
    IV.4.14. Update Frequency
    IV.4.15. Timeliness of Update

V. Data Quality Evaluation Process
  V.1. Establish the evaluation requirements
    V.1.1. Establish the purpose of the evaluation
    V.1.2. Obtain the Data Product Quality Requirements
    V.1.3. Identify target data
    V.1.4. Define the stringency of the evaluation
  V.2. Specify the evaluation
    V.2.1. Select the Data Quality measures
    V.2.2. Define decision criteria for Data Quality measures
    V.2.3. Establish the decision criteria for the evaluation
  V.3. Define the evaluation
    V.3.1. Plan the evaluation activities
  V.4. Execute the evaluation
    V.4.1. Make the measurements
    V.4.2. Apply decision criteria for the Data Quality measures
    V.4.3. Apply decision criteria for evaluation
  V.5. Conclude the evaluation
    V.5.1. Create the Evaluation Report
    V.5.2. Review the evaluation results
    V.5.3. Perform disposition of evaluation data
    V.5.4. Review quality of the evaluation

VI. I25K. Certification environment
  VI.1. Roles and relationships
  VI.2. Certification Process
    VI.2.1. Certification cycle

VII. Validation
  VII.1. Case Study 1: Validation of the Data Quality Model
    VII.1.1. Case study design
    VII.1.2. Findings
      VII.1.2.1. Pilot Study 1: Coverage
      VII.1.2.2. Pilot Study 2: Validity of thresholds
    VII.1.3. Conclusions
  VII.2. Case Study 2: Integration of the Evaluation Process
    VII.2.1. Case study design
    VII.2.2. Findings
    VII.2.3. Conclusions
  VII.3. Case Study 3: I25K, the Data Quality Certification Environment
    VII.3.1. Case study design
    VII.3.2. Findings
    VII.3.3. Conclusions

VIII. Conclusions
  VIII.1. Goals achievement
  VIII.2. Main contributions
    VIII.2.1. Data Quality Model
    VIII.2.2. Evaluation Process
    VIII.2.3. Certification
    VIII.2.4. Validation
  VIII.3. Future Work
    VIII.3.1. Extension of the Data Quality Model
    VIII.3.2. Data Quality Improvement
    VIII.3.3. Big Data
    VIII.3.4. Master Data Management
    VIII.3.5. Business opportunities
  VIII.4. Research results dissemination

A. Concepts and Acronyms
  A.1. Concepts
    A.1.1. Attribute
    A.1.2. Big Data
    A.1.3. Certification
    A.1.4. Cloud computing
    A.1.5. Data
    A.1.6. Data Dictionary
    A.1.7. Data Item
    A.1.8. Data Model
    A.1.9. Data Quality
      A.1.9.1. Inherent Data Quality
      A.1.9.2. System-Dependent Data Quality
      A.1.9.3. Quality in use
    A.1.10. Data Quality Characteristic
    A.1.11. Data Quality Model
    A.1.12. Data Quality Measure
    A.1.13. Data Quality Management
    A.1.14. Data Quality Problem
    A.1.15. Data Quality Project
    A.1.16. Data Product
    A.1.17. Data type
    A.1.18. Data Value
    A.1.19. Decision Criteria
    A.1.20. Entity
    A.1.21. Evaluate
    A.1.22. Instance
    A.1.23. Information
    A.1.24. Internet of things
    A.1.25. Personal Identifiable Information
    A.1.26. Master Data
    A.1.27. Measure (noun)
      A.1.27.1. Base measure
      A.1.27.2. Derived measure
    A.1.28. Measure (verb)
    A.1.29. Measurement
    A.1.30. Measurement Function
    A.1.31. Measurement Method
    A.1.32. Metadata
    A.1.33. Quality Measure
    A.1.34. Quality Measure Element
    A.1.35. Record
    A.1.36. Requirements
    A.1.37. Scale
    A.1.38. Stakeholder
    A.1.39. Target data
    A.1.40. Target Entity
    A.1.41. Vocabulary
  A.2. Acronyms

B. Guide for the elicitation of Business Rules
  B.1. Checklist
    B.1.1. Accuracy Questionnaire
      B.1.1.1. Syntactic Data Accuracy
      B.1.1.2. Semantic Data Accuracy
      B.1.1.3. Data Accuracy Range
    B.1.2. Completeness Questionnaire
      B.1.2.1. Record Completeness
      B.1.2.2. Data File Completeness
      B.1.2.3. Data Values Completeness
      B.1.2.4. False Completeness of a Data File
    B.1.3. Consistency Questionnaire
      B.1.3.1. Referential Integrity
      B.1.3.2. Data Format Consistency
      B.1.3.3. Risk of Data Inconsistency
      B.1.3.4. Semantic Consistency
    B.1.4. Credibility Questionnaire
      B.1.4.1. Source Credibility
      B.1.4.2. Data Values Credibility
    B.1.5. Currentness Questionnaire
      B.1.5.1. Update Frequency
      B.1.5.2. Timeliness of Update

C. Sampling
  C.1. Basic Considerations
  C.2. Sampling Procedure

D. Functions per Profiles
  D.1. Description
  D.2. Examples of usage
    D.2.1. Example 1
    D.2.2. Example 2
    D.2.3. Example 3

References

List of Figures

II.1. Action-Research Cycle
II.2. Participants and roles in the application of the Action-Research method
II.3. Systematic Literature Review Protocol overview [1]
II.4. Selection procedure and analysis of the findings
III.1. Comparison of the Data Quality assessment and improvement methodologies
III.2. Relationship among Quality Models, QM, QME, Property to Quantify, Target Entity from ISO/IEC 25024 [2] - Fig. 3
III.3. Example of DLC [2] - Fig. 4
III.4. Software product quality evaluation process [3] - Fig. 3
III.5. Overview of the quality evaluation activities [3] - Fig. 2
III.6. Overview of the activity model [4] - Fig. D.2.1
III.7. Inputs, outcomes, and resources to measure data and information quality [4] - Fig. D.2.1
III.8. Kinds of Rules [5] - Fig. 17.1
III.9. Kinds of Behavioral Rules [5] - Fig. 18.1
III.10. Kinds of Definitional Rules [5] - Fig. 17.1
III.11. Relevant and Primary studies sorted by year
III.12. Summarization of the common steps included in the evaluation methodologies from [6]
III.13. Types of Target Data addressed by the primary studies
III.14. Summarization of involved entities in the primary studies
III.15. Common requested inputs of the primary studies
III.16. Validation of the primary studies
IV.1. Metamodel that defines the Data Quality Model elements and structure
IV.2. Exclusively inherent Data Quality Characteristics from [7]
IV.3. How to use the Data Quality Model
IV.4. Graphical representation of the quality values for Credibility
IV.5. Graphical representation of the quality values for Currentness
V.1. Activities of the Data Quality Evaluation Process
V.2. Activity 1. Establish the evaluation requirements
V.3. Activity 2. Specify the evaluation
V.4. Activity 3. Define the evaluation
V.5. Activity 4. Execute the evaluation
V.6. Activity 5. Conclude the evaluation
VI.1. Roles of the certification process
VI.2. Model for Governance and Management of IT of AENOR
VI.3. Certification cycle
VII.1. Data Quality levels of target data
VII.2. Data Quality levels of target data after the improvement
VII.3. Data Quality levels of target data
VII.4. Data Quality levels of target data after the improvement
VII.5. ISO/IEC 25012 compliant Data Quality Certificate of BMS

List of Tables

II.1. Research Questions and Motivations
II.2. Search terms and synonyms
II.3. Search Chain
II.4. Information Sources
II.5. Data extraction form
III.1. Data Quality Model for Thomas Redman in [8]
III.2. Data Quality Model for Larry English in [9]
III.3. MIT traditional Data Quality Model [10, 11]
III.4. MIT recent Data Quality model [12]
III.5. Data Quality measurement methods classification [13]
III.6. Methodologies for Data Quality assessment and improvement
III.7. Data Quality Characteristics defined in [7]
III.8. Target Entities and related Instances to quantify in each stage of the DLC
III.9. Search results sorted by information source
III.10. Data extraction from the primary study [14]
III.11. Data extraction from the primary study [15]
III.12. Data extraction from the primary study [16]
III.13. Data extraction from the primary study [17]
III.14. Data extraction from the primary study [18]
III.15. Data Quality Characteristics classification by [6]
III.16. Data Quality Characteristics classification
III.17. Data Quality Characteristics classification
IV.1. Summary of the Data Quality Model from [7]
IV.2. Quality levels for the classification of the Data Quality Properties
IV.3. Example of ranges for the measurement for the Data Quality Characteristics
IV.4. Data Quality Properties for the measurement of Accuracy
IV.5. Data Quality Properties for the measurement of Completeness
IV.6. Data Quality Properties for the measurement of Consistency
IV.7. Data Quality Properties for the measurement of Credibility
IV.8. Example of the function for the measurement of Credibility
IV.9. Thresholds of the intervals for the measurement of Credibility
IV.10. Data Quality Properties for the measurement of Currentness
IV.11. Example of function for the measurement of Currentness
IV.12. Thresholds of the intervals for the measurement of Currentness
IV.13. Example of Quality levels for the classification of the data files into the Data Quality Properties
IV.14. Example of the ranges for the measurement of the Data Quality Properties
IV.15. Equation for the measurement of Syntactic Accuracy
IV.16. Equation for the measurement of Semantic Accuracy
IV.17. Equation for the measurement of Accuracy Range
IV.18. Equation for the measurement of Record Completeness
IV.19. Equation for the measurement of File Completeness
IV.20. Equation to calculate the percentage of expected values for each file (PEV)
IV.21. Equation for the measurement of Data Values Completeness
IV.22. Equation for the measurement of False Completeness of a File
IV.23. Equation for the measurement of Referential Integrity
IV.24. Equation for the measurement of Format Consistency
IV.25. Equation for the measurement of Risk of Inconsistency
IV.26. Equation for the measurement of Semantic Consistency
IV.27. Equation for the measurement of Data Values Credibility
IV.28. Equation for the measurement of Source Credibility
IV.29. Equation for the measurement of Update Frequency
IV.30. Equation for the measurement of Timeliness of Update
VII.1. Characteristics of the data source selected for both pilot studies
VII.2. Spearman’s Correlation test
VII.3. Characteristics of the data source selected for the case study
VII.4. Characteristics of the data source selected for the case study
VIII.1. Comparison between the contribution of the literature and I25K
C.1. Binomial Distribution of n experiments with p success probability
C.2. Probability of r success results in n trials
D.1. Example of a Function per Profiles
D.2. Equation for calculating the quality values associated with each range
D.3. Record Completeness levels
D.4. Definition of the ranges for the assessment of the Record Completeness


Chapter I. Introduction

Media and organizations have long been telling us that we live in the Information Era. Only in the last few years, however, has this statement become real and even obvious, to the point that it is safe to say that we are immersed in a Data World where almost everything around us is sending, receiving, or processing data. Our lives are influenced by multifarious devices that generate data about ourselves, our habits, what we consume, and what we like or dislike. This scenario highlights the importance of data. Companies have realized this and have made data one of their most important assets [19, 20], using it to create value and obtain benefits.

In fact, if we examine the most successful companies in the industry, data is the main asset for nearly all of them (Google Inc., Facebook, Amazon, etc.). For instance, Google, alongside its brand-new product, the Google Pixel mobile phone, provides software that uses the user's data to create a personal service that makes decisions based on the owner's data (e.g., booking the cheapest tickets for a trip based on the user's wish to travel to a certain place) [21]. Facebook continues to emphasize “topic data”, seeking ways to learn from its users so that other companies can provide the most suitable products and services based on an analysis of those users' preferences [22]. In general, the whole industry is moving toward a market scenario in which every service is guided by the data generated by users and clients. All the data generated by those users is ingested by the systems of the organizations, which analyze it in order to make decisions, provide suggestions, create new services for the users, etc. In this sense, data has become a decisive raw material to strategically improve the capabilities of organizations in gaining the best advantage in their business domains [23].

These factors are the starting point of this Ph.D. thesis. Like any other raw material for creating goods and services, data must be good enough for those companies to obtain benefits. When making choices based on data, it is of prime importance that this raw material has adequate levels of quality. Otherwise, goods and services created using the data might be useless or inappropriate for their intended purposes, and data-based decisions might be worthless or even harmful for the companies [23].

This chapter introduces the main aspects of data and its quality: Section I.1 describes how industry currently applies Data Quality principles; Section I.2 provides a global view of the theoretical knowledge about Data Quality; Section I.3 defines the starting hypothesis and outlines the goals of the thesis; Section I.4 places the thesis in the context of the research projects that funded the doctorate; Section I.5 explains the structure of the rest of the document.

I.1. Data Quality in real life

According to Gartner [20], the majority of enterprises and organizations of all kinds share part of their data as a way to create and receive interest, as they do with any other product or service. Even at a smaller scale, users share their data to obtain information about their surroundings (e.g., sharing your location to obtain information about nearby restaurants). Douglas Laney, Vice-President at Gartner, goes a step further and talks about achieving direct monetization (i.e., value) of data [24].
“Monetization of data implies creating a team with a specific job of defining, developing and productizing the market for the information, similar to the product development life cycle established for managing and marketing traditional products”, Gartner stated. “The underlying message is that information is an asset in its own right. It has value”, Douglas Laney asserted. Gartner calls this brand-new discipline of valuating information “Infonomics”, and “it is not something of the far future, in fact, this is happening today in various industries, in commerce and public sector, in large and small enterprises” [20]. The emergence of a new data-related role in many organizations, the Chief Data Officer (CDO), indicates a growing recognition of information as a strategic business asset. Gartner predicts that “by 2020, 10% of organizations will have a highly profitable business unit specifically for productizing and commercializing their information assets” [24, 25].

Like any other asset, data brings some concerns alongside the business opportunities it creates. First of all, users' privacy must be addressed by any company willing to use this type of data to create new goods or services. This concern affects not only the privacy of user data, but also the privacy of the data managed by companies [26]. In fact, along with privacy, many other data-related concerns arise in most companies: cost-effectiveness, maximizing value, creating business opportunities, managing the quality of the data, etc. [27, 28]. These are not easy issues to address, and therefore companies (usually their CDOs) must create a data strategy that mitigates the risks related to those concerns and maximizes business revenue [24, 29]. The most difficult challenge is creating relevant metrics that quantify the activities of Data Governance and Data Management and tie them to key business drivers [30].

The main enablers for CDOs to approach these issues are the Data Governance foundations, specifically including Data Management and Data Quality activities [30, 31, 32]. Data Governance activities define a common way to work with data, defining data strategies and managing all the data-related resources and tasks. These activities must be agreed on by top management and institutionalized within the company for its data strategies to succeed. Among these activities, those related to Data Quality are usually identified as the most important ones [29, 30]. In [33], Friedman found several Data Quality-related issues common to most present-day companies:

• Poor Data Quality is a primary reason for 40% of all business initiatives failing to achieve their targeted benefits.
• Data Quality affects overall labor productivity by as much as 20%.
• As more business processes become automated, Data Quality becomes the rate-limiting factor for overall process quality.

As the outcome of his analysis, Friedman asserts that, in order to tackle these issues, business leaders and CDOs must focus on the assessment and improvement of Data Quality as the main part of their Data Governance plan. “Better data will lead to better use of predictive analytics for sales organizations”, as reported by Gartner [29]. For this purpose, both academics and practitioners have long proposed many solutions, which make up the theoretical and practical foundations of Data Quality.

I.2. Data Quality theoretical and practical knowledge

Data Quality has been a major issue in any data-related activity for decades. In fact, the first definition dates back to the late seventies, when Crosby defined Data Quality as “meeting requirements” [34]. Other definitions have appeared since, but the most common one is “fitness for use”, by Richard Wang in 1997, which means that “data has quality if the data is useful for the purpose that it is meant to be used” [10]. Since then, several solutions have been provided by both academics and practitioners. Those that have had the greatest impact are gathered and analyzed in [6].

Data Quality Management is a commonly used term that includes all those solutions. Generally speaking, Data Quality Management is focused

on the assessment of datasets and the application of corrective actions (i.e., improvement) to data to ensure that the datasets are fit for the purposes for which they were originally intended [35]. Unfortunately, none of those solutions is actually meant to assess Data Quality. The solutions quantify each Data Quality characteristic based on a different set of metrics (i.e., measuring), but they provide neither a way to aggregate those measurement results nor a way to interpret them. In other words, they just provide indicators, and therefore they cannot solidly affirm how sound data is for the intended purposes (i.e., assessing). In this scenario, organizations may need to analyze and interpret those indicators themselves, using different criteria [6].

An additional problem is that each one of these solutions is based on a particular Data Quality Model (a set of characteristics or dimensions), which may make it impossible to compare results with those of competitors [36]. Hence the need for a common framework for every solution [37, 38].

The most referenced Data Quality models in the literature are [9, 10, 8, 12]. These Data Quality models have been a reference for decades; however, the industry has established two international standards that define Data Quality models: ISO/IEC 25012 [7] and ISO/TS 8000-8 [4]. These two new Data Quality models gather the expertise of the well-known solutions, and aim to meet some needs that organizations usually have to cope with when dealing with data (e.g., acquisition, re-usability, provenance, consistency, security, integration, etc.) [39, 35]. Taking into account the aforementioned need for a common framework, these standards can be a good starting point to create sound Data Quality Management for any kind of organization.

Additionally, several companies in the industry have created tools to support different Data Quality-related activities. Among them, Data Profiling and Data Cleansing tools stand out from the rest. The main goal of Data Profiling tools is to discover and observe constraints and characteristics present in data, whereas the main goal of Data Cleansing tools is to correct known defects in data. The problem with these tools is that organizations

acquire and indiscriminately use them as if they covered all the Data Quality Management activities. Furthermore, these tools usually tackle common Data Quality-related issues, but not business-specific problems. It is of the utmost importance to understand that the management of Data Quality includes several processes and activities, and these tools can only be used to support some of them. To the best of our knowledge, there is no tool capable of covering all the Data Quality Management activities. Apart from the Data Profiling and Data Cleansing tools, there exist tools to support data requirements and business rules definition, Data Quality activities and tasks planning, resources management, etc. [40].

Recently, new ways of manipulating data have appeared. Internet of Things, Cloud Computing, and especially Big Data have made their entrance in the industry with a lot of expectation. New Data Quality challenges have emerged with the growth of these new paradigms [19]. Consequently, the industry is reacting to provide solutions to face those brand-new Data Quality challenges from two knowledge areas: enhanced Data Quality Management tools and Data Curation/Preparation. The first approach covers the same Data Quality-related activities as the aforementioned tools, but improves the efficiency of the tools and adapts them to the new paradigms; the second approach consists in cleaning and pre-processing data to obtain what is usually called “quality-data” [41, 42]. The first approach comes with the same kind of problems as the original tools to support Data Quality Management activities, whereas the “quality-data” approach comes with a bigger issue. Authors addressing the “quality-data” term define it as data having “zero defects”. For these authors, a defect is some value that is not expected or desired [43, 44], which is also the meaning of an outlier. In this sense, authors can simply erase or clean the “noise”. The cleansing activity usually includes tasks such as removing inconsistencies, data normalization, data transformation, and missing values imputation, among others [45, 46]. Data preparation also includes data integration and annotation, tasks that may actually introduce new inconsistencies [45, 42] if they are not performed within the

context of the business. The authors of these solutions mostly come from the data mining field, where these assumptions are perfectly correct. Nevertheless, it is necessary to include not only the technical point of view, but also the business perspective. Thus, the quality of the data that is used in these new paradigms must be understood as an indicator of how sound the data is for its purposes; in other words, of how valuable the data is. According to Loshin [23], classical Data Quality solutions created for regular data are not fully sound for environments where Big Data assumptions can be applied. Soares [19] identifies the technological and managerial challenges that these new paradigms bring when facing Data Quality problems, compares them with the classical way to deal with them, and claims that any Data Quality Management solution must address those new challenges.

I.3. Hypothesis and Research goals

After thorough research, a conclusion was reached: it is essential to create a common and widely used Data Quality Management solution that really includes the assessment of Data Quality levels and that allows companies to compare themselves with the best of the industry. From our perspective, it is vital to keep the assessment of Data Quality independent of the organizations (or at least independent of the parts of the organizations within the scope of the assessment) in order to prevent possible biases.

In this sense, we deem certification a good approach to reach that independence. Literally, certification can be understood as the confirmation of certain characteristics of an object, person, or organization, which is often provided by some type of external review, assessment, or audit. An independent and unbiased certification of the Data Quality levels would make it possible to conclude objectively whether the data used by organizations is adequate for their purposes, and consequently, certification would help to determine the actual value of the data. Hence, the capability of setting a basis for determining the value of data is the most crucial matter.

In summary, the starting hypothesis is stated below:

It is possible to have a Data Quality Certification Environment that allows organizations to ascertain the quality levels of their data in order to be aware of its actual value.

In accordance with this hypothesis, the main objective of this research is to develop a Data Quality Certification Environment to certify the quality levels of data products. The certification of the Data Quality levels must be based on an independent and impartial assessment (remember: not only measurement, but also assessment), ideally conducted by an external evaluator or an auditing authority. The assessment must be based on a common model in order to obtain comparable and fixed results.

Some Data Quality Management solutions have been mentioned before in this thesis, but none of them fulfills these requirements, mainly because those solutions are solely capable of measuring and they perform this task in accordance with their particular Data Quality Model and set of metrics. We believe that international standards are a good starting point since they are developed by experts aggregating the knowledge of the field of Data Quality. Additionally, and fundamentally, international standards are usually defined in a holistic manner to cover every possible business domain. Unfortunately, ISO/TS 8000-8 only provides theoretical metrics and ISO/IEC 25012 is just a Data Quality Model. ISO/IEC 25024 [2] is not sufficient either, since it just provides basic metrics for the Data Quality Model of ISO/IEC 25012. As a consequence, the proposal of this Ph.D. thesis, called I25K, is a Data Quality Certification Environment based on international standards and conceived to fulfill these requirements. The development of I25K was planned to make it adaptable to the challenges of the new paradigms [19]. The foundations are listed below:

• ISO/IEC 25040 [3], as a basis for creating a Data Quality Assessment Process.

• ISO/IEC 25012 [7], as a basis for selecting the characteristics that are part of the Data Quality Model.
• ISO/IEC 25024 [2], as a basis for creating the measures to quantify the Data Quality characteristics.
• [23, 19], as a basis for facing the challenges of the new paradigms, including IoT, Cloud Computing, and Big Data.

It is important to highlight that I25K was implemented as a service of AQCLab [47], a quality evaluation laboratory. The entire proposal has been accredited by ENAC [48] (Entidad Nacional de Acreditación, in English, the Spanish National Entity for Accreditation) as the first environment worldwide capable of providing evaluations of the quality of Data Products based on ISO/IEC international standards. Furthermore, the I25K environment has been used by AQCLab to provide several Data Quality evaluations, which has led to the first certification of the quality of a data product provided by AENOR [49] (Asociación Española de Normalización y Certificación, in English, the Spanish Association for Standardization and Certification).

I.3.1. Research goals

In order to accomplish the main objective of this research, the following partial goals were considered:

1. Analyze the existing work on assessment and certification of Data Quality, identifying quality models, evaluation and certification processes, and supporting tools.
2. Define a Data Product Quality Model with the characteristics, indicators, metrics, and aggregating functions that can be used to measure and evaluate the quality of data products.
3. Delineate an Evaluation Process that incorporates all the activities, inputs and outcomes, roles and responsibilities, and resources involved in the evaluation of Data Quality.

4. Incorporate the Data Quality Model, the Evaluation Process, and technology to support evaluations of the quality of data products into a quality evaluation laboratory.
5. Outline a representation model and an elicitation and inference procedure for the business rules that affect data and its quality (ad-hoc goal).
6. Develop a Data Certification Process including the roles, responsibilities, and tasks to be performed for the certification of data in terms of its quality levels.
7. Validate the environment for the evaluation and certification of the quality of data products through several case studies and real applications in the industry.
8. Establish a relationship with certifying authorities interested in the evaluation of Data Quality that are willing to take part in the process to produce Data Quality certificates.

I.4. Context of the Ph.D.

This work has been funded by the PEGASO-MAGO project (Ministerio de Ciencia y Tecnología de España and Fondo Europeo de Desarrollo Regional FEDER, TIN2009-13718-C02-01), the GEODAS-BC project (Ministerio de Economía y Competitividad and Fondo Europeo de Desarrollo Regional FEDER, TIN2012-37493-C03-01), the SEQUOIA project (Ministerio de Economía y Competitividad and Fondo Europeo de Desarrollo Regional FEDER, TIN2015-63502-C3-1-R), and the SERENIDAD project (Consejería de Educación, Ciencia y Cultura de la Junta de Comunidades de Castilla-La Mancha and Fondo Europeo de Desarrollo Regional FEDER, PEII-2014-045-P).

I.5. Document structure

The rest of the document is organized as follows:

• II. Research method: This chapter explains the research method, the planning of the research, and the technologies used during this research.
• III. Related Work: This chapter describes the theoretical knowledge from the bibliography used as the foundations of this research.
• VI. I25K. Certification environment: This chapter presents I25K, the environment for the evaluation and certification of Data Products Quality.
• VII. Validation: This chapter reports the case studies conducted to validate I25K.
• VIII. Conclusions: This chapter outlines a set of conclusions derived from this research and traces future research lines.
• Appendix A. Concepts and Acronyms: This appendix defines the main concepts and acronyms used in this thesis.
• Appendix D. Functions per Profiles: This appendix explains the Functions per Profiles and provides examples of the use of these functions.
• Appendix E. References: This appendix shows the list of references reviewed during this research.


Chapter II. Research method

This chapter presents a summary of the research methods used for this thesis: Action-Research and the protocol for Systematic Literature Reviews. Both methods were selected because of their success, not only in academia, but also among practitioners from the industry.

Additionally, this chapter presents the special characteristics that must be considered when using Action-Research for R&D in information systems and explains how Action-Research has been applied in this thesis, including the participants who took part in the activities and the research cycles. The application of the protocol for Systematic Literature Reviews is shown in chapter III.

II.1. Action-Research

Action-Research [50] is not a single specific research method, but a set of methods of the same nature that share the following properties:

1. An “organic” process model which involves systematic and iterative phases,
2. Action and change oriented,
3. Focused on a problem, and
4. Involves participants for good collaboration.

Since it is not a specific method, there are many definitions of Action-Research. Some of the most important ones are provided below:

• According to [51]: “Action-Research is the manner in which the required conditions are to be met, to learn from our own experiences and make them accessible to others.”
• French et al. [52] define it as: “the process of collecting research data by means of systematic mechanisms. The data collected refers to a current system related to an objective or system requirement; feeding the system with that data; undertaking actions by means of alternative variables selected from the system, based on the data and the hypotheses; and evaluating the results of the actions by collecting additional data.”
• From the point of view of [53]: “Action-Research consists in the participation of all research members in studying the current problematic scenario, in an effort to improve or change it.”

Two main aims of the Action-Research method can be deduced from these definitions:

• To generate value in the form of benefit for the research “client”, and
• To generate or increase the “research knowledge” [54].

Consequently, it is possible to assert that Action-Research is a collaborative form of research that establishes a link between research and practice by means of a cyclical process. Action-Research focuses on creating new useful knowledge. This new knowledge is gained by introducing changes and by researching candidate solutions to different real scenarios that are relevant to a group in practice [55]. This is achieved thanks to the intervention of a researcher in the real circumstances surrounding the research “client”. The results of these experiences must be beneficial to both the researcher and the participants.

Regarding information systems, the “client” of an investigation is usually an organization to which the researcher provides services such as consulting,

help with change, or development of solutions. In exchange, the researcher has access to data of interest for research and, in many cases, receives funding [55].

Nevertheless, this research method raises an important concern, especially for the researcher, that must be addressed: the researcher using Action-Research must meet the requirements of both the research “client” and the scientific community, whose needs are usually very different and sometimes even opposed and conflicting. Attempting to meet every requirement is the main challenge for the researcher. The results of the investigation are more desirable and useful if the research complies with the aforementioned requirements. In a formal analysis of the participants in the Action-Research method, [53] identifies the following four types of roles:

• Researcher: individual or group of people in charge of the investigation and actively participating in it.
• Object of investigation: in other words, the problem to be solved.
• Judgement Reference Group: the “client” of the investigation, in the sense of having a problem to be solved. This group also participates in the research process, although less actively than the researcher. It may include people participating in the research as well as people who are unknowingly involved, such as patients undergoing a placebo treatment.
• Stakeholders: the “client” of the investigation, but in this case, in the sense that this group benefits from the result of the investigation. Stakeholders do not participate directly in the process, although they can be the recipients of documents, reports, etc. This group may be composed of companies that benefit from a new environment for evaluating and certifying the quality of data and information, or technicians in charge of applying the new evaluation and certification environment.

An Action-Research process is composed of a set of activities organized to form a cycle. [56] identifies the following steps, which must be followed within investigations using this method:

• Planning: Identification of the relevant issues which will guide the investigation. The identified issues must be directly related to the investigated object, and an answer must be found for every single one of them. This activity seeks alternative paths, research lines to follow, or reinforcement of existing solutions. The result should clearly define problems or situations. Some authors [57] distinguish between diagnosis (identifying initial problems) and planning (specifying actions to solve such problems).
• Action: Careful, deliberate, and controlled variation of practice. The solution is simulated or tested. In this activity, the researcher intervenes more actively.
• Observation: Collection of information, data retrieval, and documentation of the events. The information may come from any source (e.g., bibliography, measurements, test results, observations, interviews, documents, etc.). It is also known as evaluation of the knowledge.
• Reflection: Sharing and analyzing the results with stakeholders. This activity may raise new relevant issues and it may also help to delve into the subject under investigation to provide new knowledge that can improve practices. Modifying practices in response to these new issues must be part of the research process itself, and the practices must be re-investigated once modified [53]. It is also known as learning specification and, in some variants of the Action-Research method, it is not only a stage, but a continuous process.

With these characteristics, an Action-Research process is iterative. The progress is reflected by increasingly refined solutions through the completion of cycles. Within each cycle, new ideas are put into practice and checked in the next cycle (see Figure II.1). This cycle characterizes the Action-Research method as a reflexive process of learning and searching for solutions. The cyclical character means re-evaluating or rethinking the actions or paths followed.

Figure II.1: Action-Research Cycle.

Action-Research is recognized as one of the most powerful qualitative research methods in the field of information systems [50]. However, the community of specialists has detected several problems in the application of this research method, arising from three fundamental causes:

• The lack of methodology in the information systems and Engineering field.
• The lack of a defined research process model which indicates the steps to follow for Action-Research in the information systems and Engineering field.
• The consulting framework imposes an over-restrictive perspective, since it implies contractual liabilities and organizational interests that could be detrimental to the research.

These causes may result in a lack of rigor in the research process. In addition, [58] shows that researchers in this area should be more rigorous when defining, applying, and reporting Action-Research studies in their field.

In the context of qualitative research in information systems, two realities are considered: scientific/academic and practical. Both realities interact but also move in different planes. Action-Research operates on this dual reality through two types of cycles for two types of projects:

• Cycles oriented to solving problems within information systems projects. These projects consist in the development of a computer solution (e.g., computer projects, software development, implementation, maintenance of computer systems, etc.). In this case, the researcher is in charge of solving a problem and Action-Research appears as an additional tool for the development of information systems.
• Research-oriented cycles within research projects. These projects are intentional efforts seeking a result. In this case, Action-Research offers a method of work and a justification to approach a certain reality for the purpose of testing a theory or a hypothesis.

An outline of the use of Action-Research in information systems is provided in [59], together with several examples published by different authors regarding the analysis, design, and development of information systems, and particularly software implementation and related processes. An introduction to the use of Action-Research in information systems is provided by [50], indicating ten Action-Research forms and four characteristics which determine the way in which Action-Research is used. These characteristics are as follows:

• Process Model (iterative, reflective, linear)
• Structure (rigorous, fluid)
• Typical involvement (collaborative, facilitative, expert)
• Primary goals (organizational development, system design, scientific knowledge, training)

Seven basic strategies for achieving Action-Research in information systems are listed in [57]:

1. Using the “change paradigm”

2. Establishing an agreement or formal research contract
3. Providing a theoretical framework
4. Planning data-collecting methods
5. Maintaining collaboration and mutual learning between the researcher and the judgement reference group
6. Providing incentives for the performance of the typical cycle interactions
7. Looking for the generalization of solutions

The following subsection explains the way these concepts have been applied to this research.

II.1.1. Application of Action-Research

This thesis has been developed within various R&D projects, in collaboration with the University of Castilla-La Mancha and other organizations interested in the research. Given this conjunction of relationships and the goals of the thesis (see section I.3), the Action-Research method was applied.

The first step in applying the Action-Research method was to define the roles that participated in the research. Figure II.2 identifies the participants of this research and the relationships between them.

Figure II.2: Participants and roles in the application of the Action-Research method.

• Researcher: The Ph.D. candidate and author of this thesis. The supervisors also participated in the revision of the research development.
• Object of investigation: Aligned with the goals of this thesis (see section I.3), in particular, the certification of the quality of data products.
• Judgement Reference Group (JRG): This role can also represent the “client” of the research, and its members are the laboratory Alarcos Quality Center (from this point forward, AQCLab) and the Spanish Association for Standardization and Certification (from this point on, AENOR). AQCLab is a laboratory accredited by ENAC (the Spanish National Entity for Accreditation) to provide evaluations of the quality of software products, and it was interested in creating a new service for its customers consisting of the evaluation of the quality of data products. AENOR had created a certificate for the evaluation of the quality of software products, and it was interested in creating a new product for its customers consisting of a certification of the quality of data products. Moreover, this research was conducted within the scope of two national research projects (SEQUOIA and PEGASO-MAGO) and two regional research projects (SERENIDAD and GLOBALIA) (see section I.4).
• Stakeholders: This group did not participate directly in the process, but it will benefit from the results of this research. Any organization that

manages data as its main asset, as a product, or as a raw material to be used in its business processes, and that is interested in the evaluation and/or the certification of the quality of its data. Up to the date of publication of this thesis, AENOR has already provided the first worldwide certification of the quality of data to one of the most important business and marketing schools of Spain. Furthermore, several public and private organizations, not only from Spain but also from Italy, are interested in the evaluation and certification of their data products and have started the process.

The suggested research questions were derived from the research goals (see section I.3):

• R.Q.1. What are the characteristics, measurement methods, and functions that can be included in the Data Quality Model, how are they related, and how can they be compared?
• R.Q.2. What are the activities to be conducted during the Data Quality evaluation, and what are the roles and responsibilities involved in the process?
• R.Q.3. What are the elements of a certification environment and how are they related, covering the roles and responsibilities involved in the certification?

For the application of the Action-Research method, three main research cycles were proposed. As a consequence of the analysis of the results of the second research cycle in the reflection phase, an additional cycle was necessary to refine the results of the research.

II.1.1.1. Action-Research Cycle 1. Data Quality Model

The first cycle consisted of an initial contact with the Data Quality problem domain. The goal was to define a set of characteristics in order to compose a Data Quality Model. Those characteristics must be quantified, and for that purpose, measurement methods and functions were developed as well. An

important aspect of these measurement methods is that the results of the quantification can be compared between different data products.

During the planning phase, it was important to identify the common Data Quality issues that the Stakeholders face when managing their data products in a production environment. Therefore, part of the investigation consisted in analyzing those issues and classifying them using existing Data Quality models from the literature. These models, which were obtained from the literature review described in section II.2, contain common characteristics referring to Data Quality, and have been applied to different use cases. Unfortunately, many of them do not define specific ways to quantify the characteristics included, leaving the decision to each use case. For that reason, the measurement results obtained in different use cases are not comparable. Another issue detected in the existing Data Quality models is that they usually address the quality of data models, metadata, data dictionaries, etc. From the points of view of the researcher and the JRG, most organizations do not keep their data models or data dictionaries up to date, and neither metadata nor its use is correctly defined. Thus, the challenge was to define a Data Quality model without these flaws.

In the action phase, the Data Quality model was established, incorporating the characteristics and the measurement methods and functions to quantify those characteristics. The literature on Data Quality was reviewed and analyzed in order to develop a sound Data Quality model. Only the characteristics and the measurement methods and functions that address data itself (not data models, metadata, data dictionaries, etc.) were included. A circumstance that must be underlined is the adoption of international standards as the main foundations of the research, following the feedback and advice of the JRG.
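The idea of a measurement function whose results can be compared between data products can be sketched as follows. This is a minimal illustration and not one of the measurement functions defined in this thesis: the function name `completeness`, the choice to treat `None` and empty strings as missing, and the sample records are all assumptions made for the example. The point is only that a ratio normalized to the range [0, 1] operates on the data itself and remains comparable across data products of different sizes.

```python
from typing import Iterable, Optional


def completeness(values: Iterable[Optional[object]]) -> float:
    """Hypothetical ratio-style measure: non-missing values / total values.

    The result always lies in [0, 1], so it can be compared between
    data products regardless of how many records each one holds.
    """
    items = list(values)
    if not items:
        raise ValueError("cannot measure an empty collection")
    # Assumption for this sketch: None and "" both count as missing.
    present = sum(1 for v in items if v is not None and v != "")
    return present / len(items)


# Two hypothetical data products of different sizes.
customers_a = ["ana@example.org", None, "luis@example.org", ""]
customers_b = ["a@example.org"] * 9 + [None]

print(completeness(customers_a))  # 0.5
print(completeness(customers_b))  # 0.9
```

A measure of this shape only quantifies a characteristic; interpreting the value against business requirements is a separate step.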
Throughout the observation phase, two pilot case studies on small relational databases were conducted. For further detail on these pilot case studies, see section VII.1. These pilot case studies were helpful to identify deficiencies of the model, which were analyzed and overcome. One of the main problems was implementing the measurement methods in the specific technology of the case study data product: no tool was available to quickly implement the proposed measurement methods of the Data Quality model. Furthermore, the JRG advised that in any certification environment, the tools must be validated by the industry or by a reference entity. Consequently, the decision was to use the SQL standard (applying the specific SQL dialect of the target database management system) to implement the measurement methods for relational data products. For NoSQL models, or even for Big Data scenarios, it will be necessary to implement the measurement methods again and deal with these issues. Fortunately, the definition of the measurement methods is generic for any type of system and, hence, easily applicable.

The reflection phase consisted in correcting the flaws detected in the course of the previous phase. Several measurement methods were included and/or modified, and the measurement functions were adjusted. The final model is presented in chapter IV of this thesis.

II.1.1.2. Action-Research Cycle 2. Quality Evaluation Process

During the second cycle, the interaction with the JRG was important. The JRG provided the expertise on product quality evaluation and certification, including the common requirements and needs of these processes. The goal of this cycle was to define a set of activities that systematically evaluate the quality of data products in order to build a thorough, unbiased, and repeatable Data Quality evaluation process.

During the planning phase, the findings obtained from the literature review planned in section II.2 and detailed in chapter III were analyzed to define the evaluation process. In particular, only the findings related to evaluation
In particular, only the findings related to evaluation methodologies were analyzed to fix the activities that would be incorporated into the Data Quality evaluation process.

In the action phase, the main results consisted of fixing the activities and related tasks, the resources needed for each activity, the inputs and the intended outputs of each activity, and the roles that participate in the evaluation process and their responsibilities. Again, the decision to adopt international standards as the main foundation of the research must be stressed: in this case, the product quality evaluation model from ISO/IEC 25040 [3] was the main basis of the Data Quality evaluation process.

Throughout the observation phase, a case study was conducted in a public entity related to the university (the stakeholder). In this case study, both the Data Quality model and the Data Quality evaluation process were applied to evaluate the quality of a data product in a production environment. After the case study was conducted, the results were provided to the stakeholder. In this particular case, the results of the evaluation helped to find several problems in some of the stakeholder's business processes that were caused by data deficiencies. The stakeholder applied corrective actions to its data product and the evaluation was repeated. This time, the evaluation results improved, and the stakeholder reported the problems as solved and under control. Once more, this case study helped to identify some issues to be improved in the evaluation process. The main issue was that the stakeholder had no documentation of its business rules, especially of the business rules that affect the target data product and its quality. This lack of documentation was solved through meetings and surveys with the data owners and data experts from the public entity. For further detail on this case study, see section VII.2.
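As an illustration of how a measurement method expressed in SQL can be applied, repeated after a corrective action, and compared, the following is a minimal sketch. It is not one of the measurement methods of the thesis: the `student` table, its data, the function `completeness_ratio`, and the ratio-based measurement function are all assumptions made for the example, and SQLite stands in for the target database management system.

```python
import sqlite3


def completeness_ratio(conn, table, column):
    """Return the share of rows whose `column` is neither NULL nor empty.

    Hypothetical measurement function: complete rows / total rows.
    Identifiers cannot be bound as SQL parameters, so they are quoted
    inline; only trusted table and column names should be passed.
    """
    query = f"""
        SELECT COUNT(*),
               SUM(CASE WHEN "{column}" IS NOT NULL AND TRIM("{column}") <> ''
                        THEN 1 ELSE 0 END)
        FROM "{table}"
    """
    total, complete = conn.execute(query).fetchone()
    # An empty table yields total == 0; report 0.0 instead of dividing by zero.
    return 0.0 if not total else complete / total


# Invented data product: four records, two of them with a missing email.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE student (id INTEGER PRIMARY KEY, email TEXT)")
conn.executemany(
    "INSERT INTO student (id, email) VALUES (?, ?)",
    [(1, "a@example.org"), (2, None), (3, ""), (4, "d@example.org")],
)

# First evaluation.
before = completeness_ratio(conn, "student", "email")

# Corrective action applied by the (hypothetical) data owner.
conn.execute("UPDATE student SET email = 'b@example.org' WHERE id = 2")

# Second evaluation: the same measurement, repeated on the corrected data.
after = completeness_ratio(conn, "student", "email")

print(before, after)
```

Because the measurement is a plain query over the data product, running it again after the correction gives a directly comparable value, which is the property the repeated evaluation relies on.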
The reflection phase consisted of institutionalizing the Data Quality evaluation process and the Data Quality model in one of the organizations of the JRG, AQCLab, integrating both of them as a new business line and service. A new research cycle was identified as necessary owing to the flaws detected in the representation and recovery of the unknown business rules, since correcting this issue was too sensitive to be handled in a single reflection phase.

II.1.1.3. Action-Research Cycle 3. Business Rules representation

As mentioned above, the third cycle was created to respond to the needs detected during the observation phase of the previous cycle. The goal was to fix a model and a procedure for the representation and inference of the business rules of an organization that define constraints over the target data and its quality. Additionally, a tool was developed to semi-automate the inference of the business rules.

During the planning phase, the literature was reviewed and analyzed to find ways to represent the business rules comprehensively. The literature review planned in section II.2 and detailed in chapter III revealed some patterns and guidelines, but it was necessary to search specifically for different solutions: models, methodologies, standards, etc.

In the action phase, the international standard SBVR [5] was selected as the representation model for the business rules, in keeping with the use of international standards as the main foundations. Beyond that, SBVR was selected because it provides a semi-formal representation of the business rules that is understandable by, and mainly directed to, business people, and that can be automated. Additionally, a procedure for the inference of the business rules was defined. To support the inference and create a repository of business rules, a tool was developed as well.

Throughout the observation phase, the business rules from the previous case studies were represented with the model. The representation was presented to the stakeholders in order to confirm the business rules and the understandability of the representation.
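The idea of pairing a business-readable rule statement with an automatable check can be sketched as follows. This is only an illustration under stated assumptions: it is not the tool developed in this research, the rule statements merely imitate the style of structured-English rule statements and are not validated SBVR, and every name (`BusinessRule`, `violations`, the rule identifiers, and the sample records) is hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Record = Dict[str, object]


@dataclass(frozen=True)
class BusinessRule:
    """A repository entry: a readable statement plus a checkable predicate."""
    rule_id: str
    statement: str                    # shown to business people for confirmation
    check: Callable[[Record], bool]   # returns True when the record complies


def violations(rules: List[BusinessRule], records: List[Record]) -> Dict[str, List[int]]:
    """Map each rule id to the indexes of the records that break the rule."""
    return {
        rule.rule_id: [i for i, rec in enumerate(records) if not rule.check(rec)]
        for rule in rules
    }


# Hypothetical rules, as they might be elicited from data owners.
rules = [
    BusinessRule(
        "BR-1",
        "It is obligatory that each student has an email address.",
        lambda rec: bool(rec.get("email")),
    ),
    BusinessRule(
        "BR-2",
        "It is necessary that the credits of an enrolment are between 1 and 60.",
        lambda rec: isinstance(rec.get("credits"), int) and 1 <= rec["credits"] <= 60,
    ),
]

# Invented records of the target data product.
records = [
    {"email": "a@example.org", "credits": 30},
    {"email": "", "credits": 72},
    {"email": "c@example.org", "credits": None},
]

result = violations(rules, records)
print(result)
```

Keeping the statement and the predicate in the same entry lets the stakeholders confirm the wording while the evaluation reuses the check, although in practice the predicate would have to be derived from the confirmed statement rather than written by hand.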

A major unexpected issue appeared in the reflection phase: the integration of the model and the inference procedure into the certification environment. The problem originates from the regulation on the responsibilities of the roles involved in the environment of this research, because none of the entities of the JRG is allowed to take such responsibility. AQCLab is in charge of the evaluation part, and AENOR is responsible for the certification process. An organization that conducts either the evaluation or the certification is forbidden from providing an input for the evaluation process (e.g., the business rules). This situation can therefore be understood as a problem in the current relationship with the JRG, but it can also be seen as an opportunity to create a consultancy firm that takes responsibility for inferring the business rules of organizations interested in the evaluation and certification of the quality of their data products.

II.1.1.4. Action-Research Cycle 4. Certification Environment and Integration

Again, throughout the fourth and last cycle, the expertise of the JRG on product quality evaluation and certification was vital as feedback for the researcher. The main goal was to set the boundaries of a certification environment that integrates all the elements and results from the previous cycles (the Data Quality Model, the Evaluation Process, the roles, and the responsibilities) and to implement a certification process. On top of that, the first certification of the quality of data products under ISO/IEC 25012 was obtained by one of the most important business and marketing schools of Spain, and other public and private organizations from Spain and Italy started the certification process.

In the planning phase, two main paths were followed.
On the one hand, the main standards and regulations needed to transform AQCLab into an accredited and authorized laboratory for the evaluation of the quality of data products were studied. On the other hand, other product quality certification processes were analyzed, especially data certification processes in terms of quality, to find similarities and differences. The literature review planned in section II.2 and detailed in chapter III identified a need for a common certification environment for any type of data that allows the comparison of evaluation results across different datasets.

In the action phase, the policies, procedures, and technical instructions required to accredit an evaluation laboratory were implemented according to the international standard ISO/IEC 17025 [60]. In this phase, the expertise of the JRG was essential: AQCLab was already accredited for the evaluation of the quality of software products. On top of that, the relationship with AENOR was established, as well as an agreement for the development of an auditing and certification process for the quality of data products.

The observation phase consisted of a business case in which the environment was applied. During this business case, the first certification of the quality of data products was obtained by one of the most important business and marketing schools of Spain. This happened after a first evaluation phase, in which the results identified defects, and a second evaluation phase, in which those defects had been corrected and the Data Quality levels improved, reaching the minimum thresholds to obtain the certification. For further detail on this case study, see section VII.3.

The reflection phase identified a new need: the improvement of the Data Quality levels in the evaluated data products. This new need was outside the scope of the initial research questions, but the development of a solution was considered interesting. Therefore, it will be handled as a future research line of this thesis (see section VIII.3).

II.1.2. Approach

Considering the scope of this thesis explained in section I.3, the research method described throughout this section, and the special aspects and relationships that characterize this research, it can be asserted that the approach has a mainly industrial focus: it is not only theoretical research, but also an implementation and application of the results in the industry.

II.2. Systematic Literature Review

In this research, a Systematic Literature Review was conducted in order to gather the existing and available expertise on the subject. A Systematic Literature Review, or just SLR, is a procedure to identify, analyze, evaluate, and interpret the available research on a given field, with the ultimate goal of answering a set of research questions [1].

An important aspect of the SLR is that the procedure searches the literature to find primary studies (i.e., research work in the intended knowledge area). An SLR itself is called a secondary study, since it consists of analyzing and evaluating the existing research (the primary studies) and does not provide any new solution to a problem.

Nevertheless, the main motivation for conducting an SLR should always be to collect new and old findings on a specific field, so as to discover research gaps and propose innovative ideas for future research lines that fill those gaps [61]. There are other ways of finding research gaps, but the strongest point of SLRs is that they are planned and conducted according to a specific and systematic protocol. Being systematic not only makes the procedure repeatable, but also makes it possible to find all the expertise related to the set of research questions to be answered. Being able to find all the expertise around an intended field provides more accurate answers to the research questions, and stronger and more complete foundations on the subject.
