CHAPTER III. TEPEHUAN ORAL AND COMMUNAL JUSTICE
3.3 MONITORING, REGULATING, AND SANCTIONING IN CERRO DE LAS PAPAS: SCENARIOS OF
3.3.2 COMMUNAL RESOLUTION: FESTIVAL, CONTROL, AND SANCTION
A dimension table is highly denormalized, which means it contains repeated data and full text descriptions, rather than key values which join to other tables. A dimension table contains columns which fully describe the item, and each row represents a unique item. For example, an individual product exists in a product hierarchy. The hierarchy consists of the levels Product Category, Product Subcategory, Product Group, and SKU. A specific computer mouse might be in the Product Category Hardware, the Product Subcategory Peripherals, the Product Group Mice, with the individual product existing at the SKU level. There are also other columns that more fully describe the product and might include information such as color, price, weight, height, width, product code, and more.
[Figure: example dimension hierarchies. Time Dimension: Year, Quarter, Month, Day. Location Dimension: Country, State/Province, City, Plant, Assembly Line.]
Business Intelligence with Microsoft Office PerformancePoint Server 2007
The values for such items as the Product Category and Product Subcategory are stored as text in the dimension table. This is the part of a dimension table that is denormalized; the data is repeated many times instead of minimizing storage using relational techniques. This may seem wasteful, but as will be shown later, the storage used by dimension tables is minor in the overall scheme of the warehouse.
Due to restrictions on page size, the columns of a single record are broken into multiple pieces. Table 3-1 shows several records as they would appear in the dimension table.
Note that in some cases, architects do normalize dimension tables somewhat. In this case, one separate table might be made for Product Category and another for Product Subcategory. This design is called a snowflake schema and is sometimes seen in large warehouses. There is little space savings overall because, as previously mentioned, dimension tables take up only a very small portion of the storage of a warehouse. In addition, the building of the cubes, to be described later, is slightly slower due to the need to perform joins. However, once the cube is built, there is no performance penalty from creating a snowflake schema.
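The trade-off between the denormalized layout and a snowflake schema can be sketched in a few lines of Python. This is only an illustration: the rows reuse the hierarchy columns of the three records in Table 3-1, while the helper names `snowflake` and `rejoin` and the shortened column names are assumptions made for this example, not part of any product.

```python
# Denormalized product dimension rows (hierarchy columns only).
# Column names are shortened here for illustration.
star_rows = [
    {"ProductKey": 1, "Category": "Hardware", "Subcategory": "Peripherals", "Group": "Mice", "SKU": "759U"},
    {"ProductKey": 2, "Category": "Hardware", "Subcategory": "Peripherals", "Group": "Mice", "SKU": "A12Z"},
    {"ProductKey": 3, "Category": "Hardware", "Subcategory": "Printers", "Group": "Inkjet", "SKU": "CC84"},
]


def snowflake(rows, column):
    """Move one text column into its own lookup table (hypothetical helper).

    Returns the slimmed rows, which now carry a key instead of the text,
    and the lookup table that maps each text value to its key.
    """
    lookup = {}  # text value -> surrogate key
    slim_rows = []
    for row in rows:
        # Assign the next key the first time a value is seen.
        key = lookup.setdefault(row[column], len(lookup) + 1)
        slim = {k: v for k, v in row.items() if k != column}
        slim[column + "Key"] = key
        slim_rows.append(slim)
    return slim_rows, lookup


def rejoin(slim_rows, lookup, column):
    """Perform the join needed to recover the full text description."""
    by_key = {key: text for text, key in lookup.items()}
    full_rows = []
    for row in slim_rows:
        full = {k: v for k, v in row.items() if k != column + "Key"}
        full[column] = by_key[row[column + "Key"]]
        full_rows.append(full)
    return full_rows


slim_rows, category_table = snowflake(star_rows, "Category")
restored = rejoin(slim_rows, category_table, "Category")
```

The repeated text "Hardware" collapses to a single lookup entry, but every read of the full description now needs the extra join in `rejoin`, which mirrors the slightly slower cube build described above.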
ProductKey  Product Category  Product Subcategory  Product Group  SKU
1           Hardware          Peripherals          Mice           759U
2           Hardware          Peripherals          Mice           A12Z
3           Hardware          Printers             Inkjet         CC84

Product Name          Weight  Color  Reorder Level  Dealer Price
FragBoy Gaming Mouse  6       Black  25             22.95
Zed Laser Mouse       8       Grey   50             11.25
Onega Color Inkjet    180     Grey   12             43.50

Table 3-1 Three Records in a Product Dimension Table Show the Denormalization Common in a Warehouse.
Chapter 3: Data Warehousing and Business Intelligence
Conformed Dimensions
A special consideration is that of conformed dimensions. Since most companies start with data marts, they end up with a number of different structures for those different marts. One key element in bringing various marts together into a warehouse is to have the same dimension structure across those marts. The structure of the employee dimension in an HR data mart should match the structure of the employee dimension in the Sales data mart, for example. Therefore, it is important to design dimensions up front not just for the current data mart, but with an eye toward handling an entire enterprise data warehouse.
This is true of any dimension that can be used across multiple data marts. Time, product, employee, and customer are just a few examples of dimensions that are commonly used across multiple data marts. Whether these tables are stored only once and linked to different fact tables, or are physically stored multiple times, their structure should be the same. Using conformed dimensions prevents what are sometimes called “stovepipe data marts,” marts that stand alone in their own silos and cannot be integrated with other data marts.
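One simple way to check whether two marts share a conformed dimension is to compare their column definitions. The sketch below assumes each dimension is described as a mapping from column name to data type. The function `conformance_issues` and the sample HR and Sales employee columns are hypothetical and serve only to illustrate the idea.

```python
def conformance_issues(dim_a, dim_b):
    """List structural differences between two dimension definitions.

    Each definition is assumed to be a dict of column name -> type name.
    An empty result means the two structures match.
    """
    issues = []
    for column in sorted(set(dim_a) | set(dim_b)):
        if column not in dim_a:
            issues.append(f"{column}: missing from first dimension")
        elif column not in dim_b:
            issues.append(f"{column}: missing from second dimension")
        elif dim_a[column] != dim_b[column]:
            issues.append(f"{column}: type differs ({dim_a[column]} vs {dim_b[column]})")
    return issues


# Hypothetical employee dimensions from two separately built data marts.
hr_employee = {"EmployeeKey": "int", "Name": "text", "HireDate": "date"}
sales_employee = {"EmployeeKey": "int", "Name": "text", "HireDate": "text", "Region": "text"}

issues = conformance_issues(hr_employee, sales_employee)
```

Any entry in `issues` points to a difference that would have to be resolved before the two marts could share the dimension in a warehouse.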
Slowly Changing Dimensions
One of the biggest issues you’ll encounter when dealing with dimensions is how to handle changes. Change is inevitable and here are two examples of this:
▶ A company tracks salespeople and the manager to which they report. Each salesperson is rewarded based on his or her sales, and each manager is rewarded based on the performance of the salespeople he or she manages. After a district realignment, some salespeople move from one manager to another. The salespeople need their history to go with them, but sales made under their previous manager should still be in that previous manager’s numbers.
▶ A company wants to increase the profit on an item without increasing the price, so they decide to drop the item’s size from 16 ounces to 14 ounces while maintaining the same price. Simply updating the field in the database from 16 to 14 makes it look like the product has always been 14 ounces, and thus history is lost as to when the change occurred. So, is this a new item that replaces the old one, or has the size column simply changed?
As can be seen from these two simple examples, there’s not necessarily an easy answer. These issues represent what are called slowly changing dimensions. They change slowly because people don’t move from one manager to another with each transaction. Item sizes don’t change with each transaction (if they do, there are other strategies, such as setting ranges of values and dropping each record into one of those buckets).
There are different strategies for dealing with slowly changing dimensions. By far the easiest way is what is called a Type I slowly changing dimension. With Type I, history is simply overwritten. In the case of the salesperson, they’d be tied to their new manager and it would look like they had always worked for this manager; all their history would now roll up to this new manager. This is great if the salesperson is a stellar performer and the manager is the one getting this person, but it’s horrible for the manager losing this salesperson and having them replaced by a subpar performer.
In the case of a product, simply changing the size column in the database from 16 to 14 ounces makes it look like the size has always been 14 ounces. If sales decline because of the change, it’s entirely up to an analyst looking at the data to remember when the change occurred and identify the change in size as an issue.
The advantage of Type I is clear: it’s easy. No extra work is required. Data is changed and the primary key remains the same, so all history now reflects the current values as if they have never been different. The great disadvantage of Type I is also clear: history is lost. With Type I, it’s impossible to credit a salesperson’s sales to a previous manager, to determine when a product change was introduced, and so forth.
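The effect of a Type I change can be shown with a small Python sketch. The dimension and fact structures, the placeholder names, and the amounts below are all assumptions chosen for illustration. The point is only that the overwrite keeps the primary key and silently moves all history to the new manager.

```python
from collections import defaultdict

# Hypothetical salesperson dimension, keyed by its primary key.
salesperson_dim = {1: {"name": "Salesperson A", "manager": "Manager X"}}

# Hypothetical sales facts that reference the dimension by key.
sales_facts = [
    {"salesperson_key": 1, "year": 2007, "amount": 100.0},
    {"salesperson_key": 1, "year": 2008, "amount": 150.0},
]


def sales_by_manager(dim, facts):
    """Roll sales up to whichever manager the dimension row currently names."""
    totals = defaultdict(float)
    for fact in facts:
        manager = dim[fact["salesperson_key"]]["manager"]
        totals[manager] += fact["amount"]
    return dict(totals)


before = sales_by_manager(salesperson_dim, sales_facts)

# Type I change: overwrite the attribute in place; the key stays the same.
salesperson_dim[1]["manager"] = "Manager Y"

after = sales_by_manager(salesperson_dim, sales_facts)
```

After the overwrite, `after` credits every sale, including those from 2007, to Manager Y, and nothing in the data records that Manager X was ever involved.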
Another approach is the Type II slowly changing dimension. This type of dimension structure does maintain history, usually by versioning the record and then setting a start date and end date for each version. For example, a salesperson named Raju starts with a company on January 1, 2007 and works for Manager Bob. A year later, that salesperson is reassigned to Manager Maria. On the first record for salesperson Raju, the Start Date would be January 1, 2007 and the End Date would be January 1, 2008. A new record would be added for Raju that still maintained his employee ID (or some other key) but was now version 2, and had a Start Date of January 1, 2008 and no end date.
Sales would be tracked by the employee ID so that all of Raju’s sales always belong to him. Raju’s sales records also have the date on which they occurred, so that sales in 2007 roll up to Bob while sales in 2008 roll up to Maria. While simple on the surface, this can certainly complicate working with the data, because all queries must now look at the start and end dates for all of Raju’s entries in the Employee dimension table.
The advantage of a Type II dimension is clear: history is preserved. Companies will always be able to track when the change occurred so that prior sales will still roll up to the proper manager. Product changes will be evident because a change will start a new record with a new start date.
The disadvantage of a Type II dimension is also clear: it’s complicated. It makes storing, retrieving, and summing data much harder. Changes require an update to the existing record (to set the end date) and the insertion of a new record with a new version number and the start date. Taking multiple records that represent a single product or employee and making them appear as one to the end user can be a challenge.
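The Raju example can be sketched in Python to show both the extra bookkeeping and the date-aware lookup that a Type II dimension requires. This is a minimal sketch under stated assumptions: the employee ID of 7, the sale dates and amounts, and the function names are invented for illustration, and the end date is treated as exclusive so that January 1, 2008 belongs to the new version.

```python
from datetime import date

# Version 1 of Raju's record; end_date of None marks the current version.
employee_dim = [
    {"employee_id": 7, "version": 1, "name": "Raju", "manager": "Bob",
     "start_date": date(2007, 1, 1), "end_date": None},
]


def change_manager(dim, employee_id, new_manager, effective):
    """Type II change: close the current version and insert a new one."""
    current = next(r for r in dim
                   if r["employee_id"] == employee_id and r["end_date"] is None)
    current["end_date"] = effective  # update to the existing record
    dim.append({**current, "version": current["version"] + 1,
                "manager": new_manager,
                "start_date": effective, "end_date": None})


def manager_on(dim, employee_id, when):
    """Find the manager in effect on a date (end date assumed exclusive)."""
    for row in dim:
        if (row["employee_id"] == employee_id
                and row["start_date"] <= when
                and (row["end_date"] is None or when < row["end_date"])):
            return row["manager"]
    raise LookupError(f"no version of employee {employee_id} covers {when}")


def sales_by_manager(dim, sales):
    """Roll each sale up to the manager in effect on the sale date."""
    totals = {}
    for sale in sales:
        manager = manager_on(dim, sale["employee_id"], sale["date"])
        totals[manager] = totals.get(manager, 0.0) + sale["amount"]
    return totals


change_manager(employee_dim, 7, "Maria", date(2008, 1, 1))

# Hypothetical sales, tracked by employee ID and sale date.
sales = [
    {"employee_id": 7, "date": date(2007, 6, 15), "amount": 100.0},
    {"employee_id": 7, "date": date(2008, 3, 2), "amount": 150.0},
]
totals = sales_by_manager(employee_dim, sales)
```

Every rollup has to go through a date comparison like the one in `manager_on`, which is the added complexity the text describes. In return, the 2007 sale stays with Bob and the 2008 sale goes to Maria.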
There are other ways of handling slowly changing dimensions. There are Type III, and modifications of Type II and Type III. The actual mechanisms are beyond the scope of this book, as the goal is to show how to consume the data once the warehouse is built. Still, slowly changing dimensions are introduced here because many readers will be involved in the decision of how to store dimension data and track history, and an understanding of the tradeoffs is important.
Parent-Child Dimensions
A special type of dimension that is encountered frequently is the parent-child dimension, or p-c dimension for short. Product is an example of a normal, or non p-c, dimension. The Product Category level might contain members such as Hardware, Software, and so forth. The Product Subcategory level might contain members such as Peripherals, Motherboard, Video Cards, Games, Business Applications, and so on. The members at each level are unique; in other words, a Product Subcategory is not also a Category. An individual product is not also a Product Group. There is a clearly defined hierarchy: all individual products are found at the lowest level of that hierarchy, and no products are found at higher levels.
Contrast that with a standard organizational chart. At the top is the President or Chief Executive Officer. Below that is a group of Vice Presidents. Next come Directors, Managers, and employees. However, one Vice President might have two Directors, another might have five, and a third might not have any. In addition, some parts of the business might have Managers and then Team Leaders, while other departments don’t use Team Leaders. In other words, there’s no well-defined hierarchy, so a table can’t have a set number of columns to represent the levels in an organization.
In addition, everyone is an employee. The CEO is an employee and thus needs to be in the employee table. Each Vice President, Director, and Manager is also an employee. This means that there will be individual employees at each level of the hierarchy, and that the hierarchical structure is not well defined.
The classic way to handle this in a relational sense is to have an Employee ID field act as the primary key on the table. Then, in the same table, is a Manager ID field, which ties back to the Employee ID of that person’s manager. The employee with either a blank Manager ID or a Manager ID that is the same as the Employee ID is the top of the hierarchy. Everyone else falls below that.
As an example, take a look at Figure 3-3. This shows a simple organization chart for a very small company. Note that some Vice Presidents don’t have any Directors and that different chains contain a different number of levels.
Table 3-2 shows the relational structure that supports the organization chart from Figure 3-3. Note that a hierarchy of any level can be represented in a self-referencing table, which is a table in which one field is tied to another (usually the primary key).
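A self-referencing table of this kind can be sketched in Python as a list of rows in which a Manager ID points back to another row's Employee ID. The rows below are hypothetical and are not the contents of Table 3-2. The helper names are likewise assumptions made for this example.

```python
# Hypothetical self-referencing employee table.
employees = [
    {"employee_id": 1, "name": "CEO", "manager_id": None},
    {"employee_id": 2, "name": "VP A", "manager_id": 1},
    {"employee_id": 3, "name": "VP B", "manager_id": 1},
    {"employee_id": 4, "name": "Director", "manager_id": 2},
    {"employee_id": 5, "name": "Manager", "manager_id": 4},
]


def is_top(row):
    """The top of the hierarchy has a blank Manager ID or points to itself."""
    return row["manager_id"] is None or row["manager_id"] == row["employee_id"]


def reporting_chain(rows, employee_id):
    """Walk from an employee up to the top, however many levels that takes."""
    by_id = {r["employee_id"]: r for r in rows}
    chain = []
    seen = set()
    row = by_id[employee_id]
    while True:
        if row["employee_id"] in seen:
            # Guard against bad data in which managers form a loop.
            raise ValueError("cycle detected in manager references")
        seen.add(row["employee_id"])
        chain.append(row["name"])
        if is_top(row):
            return chain
        row = by_id[row["manager_id"]]


def direct_reports(rows, manager_id):
    """Return the names of employees who report directly to a manager."""
    return [r["name"] for r in rows
            if r["manager_id"] == manager_id and not is_top(r)]


long_chain = reporting_chain(employees, 5)
short_chain = reporting_chain(employees, 3)
```

The two chains have different lengths, and `direct_reports(employees, 3)` is empty because VP B has no Directors, yet no extra columns are needed to represent either case.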