
Process metadata provides information about the processes that were used to generate and deliver the data (or a data set) used by the decision maker. Process abstraction, such as models or documentation, is recognized as a special form of metadata (Marco, 2000; Shankaranarayanan et al., 2003). A graphical representation that communicates process metadata is the information product map (IPMAP) (Shankaranarayanan et al., 2003). A set of constructs represents the different stages of the data manufacturing process: data source, processing, storage, quality check, organizational or information system boundary and data consumer (sink). Each construct is supplemented with metadata about the corresponding stage, such as a unique identifier, the data entity composition, ownership, processing requirements and the physical location where the step is performed. These help the decision maker understand what the output from this step is, how it was created (including business rules and applicable constraints), where it was created (both the physical location and the system used), who is responsible for this stage in the manufacture and when (at what stage) an operation was performed.

Figure 2a. IPMAP for staging data in a data warehouse (blocks: DS₁, DS₂, EP₁, EP₂, CP₁, CP₂, I₁, S₁, INP₁, I₂, S₂)

Figure 2b. IPMAP for staging of dimensions and facts in a warehouse (blocks: DS₃ to DS₈, EP₃ to EP₈, CP₃ to CP₇, I₃, I₄, INP₂, INP₃, S₃, S₄, S₅)

Figure 2 shows a generic example of an IPMAP in a data warehouse. The IPMAP modeling scheme offers six constructs to represent the manufacturing stages of an IP:

1. a data source (DS) block that is used to represent each data source/provider of raw data used in the creation of the IP;

2. a processing (P) block that is used to represent any manipulations and/or combinations of data items or the creation of new data items required to produce the IP;

3. a storage (S) block that represents a stage where data may wait prior to being processed;

4. an information system boundary (SB) block to represent the transition of data from one information system to another (e.g., transaction data in legacy file systems transferred to a relational database after some processing);

5. a data consumer (DC) block to represent the consumer; and

6. an inspection (I) block that serves to represent predetermined inspections (validity checks, checks for missing values, authorizations, approvals, etc.).

The arrows between the constructs represent the raw/component data units that flow between the corresponding stages. An input obtained from a source is referred to as a raw data unit. Once a raw data unit is processed or inspected, it is referred to as a component data unit.
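The six constructs and their per-stage metadata can be sketched as a small data structure. This is a minimal illustration only: the class names, the field names (`owner`, `location`, `composition`, `rules`) and the `unit_type` helper are hypothetical and are not part of the IPMAP specification.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class BlockKind(Enum):
    """The six IPMAP constructs, keyed by the abbreviations used in the text."""
    DATA_SOURCE = "DS"
    PROCESSING = "P"
    STORAGE = "S"
    SYSTEM_BOUNDARY = "SB"
    DATA_CONSUMER = "DC"
    INSPECTION = "I"


@dataclass
class Block:
    """One manufacturing stage plus its metadata (field names are assumptions)."""
    block_id: str                                          # unique identifier, e.g. "DS1"
    kind: BlockKind
    owner: str = ""                                        # who is responsible for the stage
    location: str = ""                                     # where the step is performed
    composition: List[str] = field(default_factory=list)   # data entities handled
    rules: List[str] = field(default_factory=list)         # business rules, constraints


def unit_type(producer: Block, incoming: str = "raw") -> str:
    """Classify the data unit leaving `producer`.

    Output of a data source is a raw data unit; once a unit has been
    processed or inspected it is a component data unit. Other blocks
    (storage, boundary, consumer) pass the incoming type through unchanged.
    """
    if producer.kind is BlockKind.DATA_SOURCE:
        return "raw"
    if producer.kind in (BlockKind.PROCESSING, BlockKind.INSPECTION):
        return "component"
    return incoming
```

For example, `unit_type(Block("DS1", BlockKind.DATA_SOURCE))` yields `"raw"`, while the unit leaving a cleansing process block is classified as `"component"`.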

Consider a warehouse having (say) three dimensions and a set of facts (or measures). A generic, high-level sequence of steps that results in the warehouse is represented by the IPMAP in Figure 2, parts a, b, and c. The data from data sources (DS₁ and DS₂) are extracted by extraction processes (EP₁ and EP₂) and cleansed (CP₁ and CP₂). The cleansed data from DS₁ is inspected (manual or automated process I₁) and stored (S₁). This is combined with the cleansed data from DS₂ by an integration process (INP₁), inspected for errors (I₂) and staged in storage S₂. The staging of the other data (dimensions and facts) may similarly be represented as shown in Figure 2b. The staged fact data (in S₅) may then be combined with the staged dimension data (in S₂, S₃ and S₄) by a transformation process (TP) and loaded into the DW by the loading process (LP₁). Though the transformation may be a single process, it is shown as multiple stages in Figure 2c.

Figure 2c. IPMAP representing the transformation and loading of a warehouse (blocks: S₂, S₃, S₄, S₅, S₆, TP₁, TP₂, TP₃, LP₁)
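The staging flow described for Figure 2a can be sketched as a directed graph, which makes it possible to trace which stages and sources contribute to a given block. The edge list below is read from the prose description of the figure; the function names `upstream` and `sources_of` are hypothetical helpers introduced only for illustration.

```python
from collections import defaultdict

# Edges of the Figure 2a flow as described in the text:
# DS1 -> EP1 -> CP1 -> I1 -> S1, DS2 -> EP2 -> CP2,
# S1 and CP2 are integrated by INP1, then inspected (I2) and staged (S2).
EDGES = [
    ("DS1", "EP1"), ("EP1", "CP1"), ("CP1", "I1"), ("I1", "S1"),
    ("DS2", "EP2"), ("EP2", "CP2"),
    ("S1", "INP1"), ("CP2", "INP1"),
    ("INP1", "I2"), ("I2", "S2"),
]


def upstream(target, edges):
    """Return every block that feeds `target`, directly or indirectly."""
    parents = defaultdict(set)
    for src, dst in edges:
        parents[dst].add(src)
    seen = set()
    stack = [target]
    while stack:
        node = stack.pop()
        for parent in parents[node]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def sources_of(target, edges):
    """Return the data source (DS) blocks that contribute to `target`."""
    return sorted(b for b in upstream(target, edges) if b.startswith("DS"))
```

Under these assumptions, `sources_of("S2", EDGES)` reports both DS1 and DS2, whereas `sources_of("S1", EDGES)` reports only DS1. This is the kind of lineage question that process metadata is meant to let a decision maker answer.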

Iverson (2001) states that in order to improve data quality, the metadata attributes must include process documentation such as data capture, storage, transformation rules, quality metrics and tips on usage and feedback. IT practitioners acknowledge the importance of process metadata for information system professionals, who commonly use it for the design and ongoing maintenance of data processing and delivery systems (Redman, 1996; Marco, 2000). However, the usability of process metadata for business users has not been explored. Information processes may include a large number of stages and the resulting visual representation might be too complex for non-technical users to comprehend. It is suggested (though not explicitly shown) that providing such metadata to business users will improve their perception of quality and hence their decision-making process (Shankaranarayanan & Watts, 2003). How do we understand the impact of process and quality metadata on decision making? The key is to investigate the effects of process metadata in a visual form (using the IPMAP representation): whether decision makers are likely to benefit from the availability of process metadata and, if so, to what extent its use affects the efficiency and outcome of the decision-making process. An important aspect to explore is whether process metadata benefits the context-dependent evaluation of the data and its quality. While impartial assessment is derived from the granular details of the dataset contents, process metadata adds an extrinsic layer to the assessment: it puts the data in the context of the information processing that produced it. It is thus expected to affect users’ perceptions and assessment of data quality in different ways that we seek to explore and understand.

To conclude this section, we briefly describe an empirical framework to examine how data quality assessment and the provision of metadata affect analytical, data-driven decision making. By better understanding the importance of metadata within the context of decision making, organizations can recognize the value of implementing and managing metadata. We focus specifically on process metadata and the benefits of providing it to the decision maker during the decision process. We further argue that providing process metadata will have a larger impact on decision making (both the decision process and the decision outcome).

This research defines a new perspective for data quality management in decision support environments. It emphasizes the importance of contextual assessment of data quality, the provision of metadata to help this assessment and attempts to understand the impact of these two on the decision outcome. Data quality may be gauged independent of the decision-task(s) and such context-independent (impartial) assessments are captured as quality metadata (Shankaranarayanan & Even, 2004). However, impartial assessment does not always have the desired impact on the decision outcome(s), as decision makers may find impartial assessments insufficient and seek quality affirmation from additional, external sources (Shankaranarayanan & Watts, 2003). This does not overlook the importance of context-independent data quality assessment. It highlights the importance of another type of metadata, process metadata, by showing that the provision of process metadata does impact decision-making efficiency and ultimately the decision outcome. Though the link between data quality and decision outcome is understood, this research is a first step towards offering a deeper understanding of this link — one that identifies what else the organization must do to achieve the desired impact of improved data quality on decision outcomes.

The model presented in Figure 3 incorporates the impartial data quality assessments by the decision maker, the perceived usefulness of process metadata and the efficiency of the decision-making process into a framework for understanding how these factors influence decision outcomes. This model makes several interesting contributions. First, it posits that process metadata (relevant to or associated with the data used in the decision-task), in addition to the impartial assessments (of the data used in the decision-task by the decision-maker), has a significant effect on decision outcome. Second, the model suggests that the effect of those two factors is mediated by the efficiency of the decision-making process. Earlier research (Chengalur-Smith et al., 1999; Fisher et al., 2003) suggests that quality assessment supported by metadata influences the decision outcome directly, moderated by several other factors. In this model we examine the theory that complementing the quality metadata with process metadata adds value to the decision maker. Further, the model examines the theory that both factors affect decision-making outcome, directly and indirectly. Preliminary results from testing the model show that process metadata per se does not impact decision outcomes. In combination with quality metadata, process metadata improves the efficiency of the decision process and consequently the decision outcome. The impact of the combination of the two metadata components is much greater than the impact of either metadata component alone.

Figure 3. The effects of quality and process metadata on the decision-making process (constructs: Intrinsic Data Quality (Quality Metadata), Process Metadata Usefulness, Decision-Process Efficiency, Decision Outcome)
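One conventional way to write the mediated structure of Figure 3 is as a pair of linear path equations. This is an illustrative sketch only: the symbols Q (intrinsic data quality), M (process metadata usefulness), E (decision-process efficiency) and O (decision outcome), the coefficients and the linear form are all assumptions introduced here, not the specification tested in the study.

```latex
\begin{align}
E &= a_0 + a_1 Q + a_2 M + \varepsilon_E \\
O &= b_0 + b_1 E + c_1 Q + c_2 M + \varepsilon_O
\end{align}
```

Under this reading, the indirect effect of Q on O through efficiency is the product a_1 b_1 and its direct effect is c_1 (likewise a_2 b_1 and c_2 for M). Mediation by decision-process efficiency corresponds to the indirect products being non-zero.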

Conclusion

This chapter presents a comprehensive examination of the state of metadata in decision environments such as the data warehouse. The purpose of this examination is to emphasize the need for further research on metadata by highlighting four issues:

1. the diverse and complex functionality that metadata support;

2. the challenges in successfully implementing integrated metadata repositories;

3. the roles that process and quality metadata play in improving decision outcomes; and

4. the potential contribution of metadata to management and use of organizational knowledge.

The metadata taxonomy that was introduced in this chapter highlights the different functionality that metadata serves in complex decision support environments, such as the data warehouse. It illustrates the multiple classes of metadata that support a wide spectrum of functionality in a data warehouse. This taxonomy supplements the existing metadata classification schemes and the chapter illustrates how the existing and proposed classification schemes fit together. The software industry acknowledges the importance of metadata and the difficulties with implementing metadata solutions. Software vendors now offer metadata management capabilities within data warehousing products or as separate software packages. Unfortunately, there is still a mismatch between the capabilities needed for managing metadata and capabilities that these products offer. The state of industry offerings highlights the need for a metadata standard to achieve integration and exchange of metadata across repositories. As an alternative to managing metadata using commercial products, the chapter discusses the implementation of a “homegrown” integrated metadata repository and the associated challenges.

A causal model that may be used to validate the positive role of process metadata in improving decision outcomes is also described. This research model has important implications for future research in the data quality management field and for the design of complex decision-support environments. Findings support the commonsense notion that when participants perceive the quality of the input data to be good, their sense of decision-making efficiency, as well as their decision outcome, is improved. This finding, assuming further corroboration, has clear implications for user interface design: it is important not only to improve the storage and back-end processing of data, but also to communicate data quality to the end users. Having a sense of good quality will improve users’ confidence in the data and in the back-end system that provides it. As a result, users ought to be able to use this data more efficiently and effectively. Findings support our assertion that the metadata layer can serve as a tool for communicating data quality to business users. The provision of process metadata in the form of the IPMAP proved to have a significant effect on both perceptions of decision process efficiency and final outcomes. Future research will investigate this link more deeply, as well as the possible effect of providing other forms of metadata to the decision maker.

An important observation is the significant difference in how technical and business metadata are perceived. Administrators, IT managers and other technical users of metadata clearly recognize the merits of metadata. Business users, however, perceive metadata as a technical necessity and do not recognize its value. Business metadata includes metadata such as the source of a data element, business rules applied to manipulate it, assumptions and models used in the manipulation and other information that helps evaluate the usefulness of that data element to the decision maker. Few business users are trained in using metadata and hence ignore or overlook it even if made available to them. This observation raises the question: “To what extent is metadata useful to business users?” Two perspectives for addressing this question are as follows:

1. to assess the operational value of metadata; and

2. to understand its implications for decision making.

The benefits of metadata remain largely intangible and there is a lack of models or methodologies to evaluate and quantify its operational value (Stephens, 2004). Suggestions for looking at metadata ROI have been offered by practitioners (Marco, 2000). The following issues ought to be explored to better understand the operational context of metadata in decision environments as a step towards developing such evaluation models:

• What functional types of metadata have the most significant operational importance? The consolidated taxonomy of metadata that was offered (see Metadata Functionality) can serve as a baseline for exploring this question: are certain (functional) types of metadata more important? Other factors to look at are the data modeling method used, the level of integration supported and the extent to which the metadata is exchangeable. This knowledge would assist organizations in identifying the key set of metadata modules (potentially a mix of both technical and business metadata) for the core of the metadata repository when implementing homegrown solutions.

• Can metadata improve the performance of a data warehouse? Dimensions for measuring data quality such as accuracy, timeliness, completeness and consistency, with specific reference to data warehouses, have been proposed (Hufford, 1996). It is unclear, though, to what extent metadata contributes to good data quality and performance of a data warehouse. While investing in metadata is prescribed as a key factor for the success of data warehouses by several studies (Marco, 1998; Inmon, 2000; Sachdeva, 1998), none of these studies offers theoretical quantification or empirical support for measuring such impacts. Measuring this impact is a challenging task for several reasons. First, there are many alternative approaches for measuring the performance or success of a data warehouse. Some are based on the ease of managing a data warehouse, focusing on technical administration (Hufford, 1996), and others are based on evaluating the end-user experiences (Wixom & Watson, 2001). Second, there is no straightforward method for attributing costs directly to the metadata. As pointed out in the COTS product review, metadata management components are embedded within other offerings and in many cases are not priced separately. It is also practically impossible to precisely assess the software development time allocated to metadata, since it is typically part of application programming efforts. Developing measurement methods for such a study is likely to require fairly sophisticated models and methods for cost and benefit attribution (Stephens, 2004).

References

Ballou, D. P., Wang, R. Y., Pazer, H., & Tayi, G. K. (1998, April). Modeling information manufacturing systems to determine information product quality. Management Science, 44(4), 462-484.

Berners-Lee, T. (1997). Metadata architecture. Retrievable from the World Wide Web Consortium, http://www.w3.org/DesignIssues/Metadata.html

Blumstein, G. (2003, August 1). Metadata management architecture. Data Management (DM) Direct Newsletter.

Broekstra, J., Klein, M., Decker, S., Fensel, D., van Harmelen, F., & Horrocks, I. (2001). Enabling knowledge representation on the web by extending RDF schema. In Proceedings of the 10th International World Wide Web Conference (WWW10), Hong Kong (pp. 467-478).

Cabibbo, L., & Torlone, R. (2001). An architecture for data warehousing supporting data independence and interoperability. International Journal of Cooperative Information Systems, 10(3), 377-397.

Candan, K. S., Liu, H., & Suvarna, R. (2001, July). Resource description format: Metadata and its applications. ACM SIGKDD Explorations Newsletter, 3(1).

Chengalur-Smith, I., Ballou, D. P., & Pazer, H. L. (1999, November/December). The impact of data quality information on decision making: An exploratory study. IEEE Transactions on Knowledge and Data Engineering, 11(6), 853-864.

Davies, J., Weeks, R., & Krohn, U. (2002). QuizRDF: Search technology for the semantic web. In Proceedings of the 11th International World Wide Web Conference Workshop on RDF and Semantic Web Applications (WWW11).

Decker, S., Erdmann, M., Fensel, D., & Studer, R. (1999, January). Ontobroker: Ontology-based access to distributed and semi-structured information. In R. Meersman, Z. Tari & S. M. Stevens (Eds.), Proceedings of the 8th Working Conference on Database Semantics, Rotorua, NZ (pp. 351-369).

Ding, L., Finin, T., Joshi, A., Pan, R., Cost, R. S. et al. (2004). Swoogle: A search and metadata engine for the semantic web. In Proceedings of the ACM Conference on Information and Knowledge Management (CIKM), Washington DC (pp. 652-659).

Eckerson, W. W. (2003). Achieving business success through a commitment to high quality data. TDWI Report Series, Data Warehousing Institute, Seattle, WA. Retrievable from http://www.dw-institute.com

Eisenhardt, K. M., & Zbaracki, M. J. (1992). Strategic decision-making. Strategic Management Journal, 13, 13-37.

Elmasri, R., & Navathe, S. (2003). Fundamentals of database systems (4th ed.). Redwood City, CA: Benjamin Cummings Publishing Company Inc.

English, L. P. (1999). Improving data warehouse and business information quality: Methods for reducing costs and increasing profits. New York: John Wiley & Sons.

Fisher, C. W., Chengalur-Smith, I., & Ballou, D. P. (2003). The impact of experience and time on the use of data quality information in decision making. Information Systems Research, 14(2), 170-188.

Ford, C. M., & Gioia, D. A. (2000). Factors influencing creativity in the domain of managerial decision making. Journal of Management, 26(4), 705-732.

Handschuh, S., & Staab, S. (2003). Cream: Creating metadata for the semantic web. Computer Networks, 42(5), 579-598.

Heery, R. (1998, March). What is RDF. The Ariadne Magazine. Retrievable from http://www.ariadne.ac.uk/issue14/what-is/

Hernandez, M. A., & Stolfo, S. J. (1998). Real-world data is dirty: Data cleansing and the merge/purge problem. Journal of Data Mining and Knowledge Discovery, 1(2).

Hufford, D. (1996, January). Data warehouse quality. DM Review, Special Feature. Retrievable from http://www.dmreview.com/article_sub.cfm?articleId=1311

Imhoff, C. (2003). Mastering data warehouse design: Relational and dimensional techniques. Indianapolis, IN: Wiley Publications.

Inmon, B. (2000, July 7). Enterprise meta data. Data Management Direct Newsletter.

Iverson, D. S. (2001). Meta-information quality. Keynote address at the International Conference on Information Quality by the Senior VP, Enterprise Information Solutions, Ingenix, Boston.

Jarke, M., Lenzerini, M., Vassiliou, Y., & Vassiliadis, P. (2000). Fundamentals of data warehouses. Heidelberg, Germany: Springer-Verlag.

Kahn, B. K., Strong, D. M., & Wang, R. Y. (2002). Information quality benchmarks: Product and service performance. Communications of the ACM, 45(4).

Kimball, R. (1998, March). Meta meta data data. DBMS Magazine.

Kimball, R., Reeves, L., Ross, M., & Thornthwaite, W. (1998). The data warehouse lifecycle toolkit. New York: Wiley Computer Publishing.

Klein, B. D., Goodhue, D. L., & Davis, G. B. (1997, June). Can humans detect errors in data? Impact of base rates, incentives and goals. MIS Quarterly, 21(2).

Luke, S., Spector, L., Rager, D., & Hendler, J. (1997). Ontology-based web agents. In Proceedings of the First International Conference on Autonomous Agents (Agents97) (pp. 59-66).

Marco, D. (1998, March). Managing meta data. Data Management Review.

Marco, D. (2000). Building and managing the meta data repository: A full lifecycle guide. New York: Wiley Computer Publishing, John Wiley & Sons.

Martin, P., & Eklund, P. (1999). Embedding knowledge in web documents. In
