Multimillion-row customer dimensions present two unique challenges that warrant special treatment. Even if a clean, flat dimension table has been imple-mented, it generally takes too long to constrain or browse among the relation-ships in such a big table. In addition, it is difficult to use our tried-and-true techniques from Chapter 4 for tracking changes in these large dimensions. We probably don’t want to use the type 2 slowly changing dimension technique and add more rows to a customer dimension that already has millions of rows in it. Unfortunately, huge customer dimensions are even more likely to change than moderately sized dimensions. We sometimes call this situation a rapidly changing monster dimension!
Business users often want to track the myriad of customer attribute changes.
In some businesses, tracking change is not merely a nice-to-have analytic capa-bility. Insurance companies, for example, must update information about their customers and their specific insured automobiles or homes because it is criti-cal to have an accurate picture of these dimensions when a policy is approved or claim is made.
Fortunately, a single technique comes to the rescue to address both the brows-ing-performance and change-tracking challenges. The solution is to break off frequently analyzed or frequently changing attributes into a separate dimen-sion, referred to as a minidimension. For example, we could create a separate minidimension for a package of demographic attributes, such as age, gender, number of children, and income level, presuming that these columns get used extensively. There would be one row in this minidimension for each unique combination of age, gender, number of children, and income level encoun-tered in the data, not one row per customer. These columns are the ones that are analyzed to select an interesting subset of the customer base. In addition, users want to track changes to these attributes. We leave behind more constant or less frequently queried attributes in the original huge customer table.
Sample rows for a demographic minidimension are illustrated in Table 6.3.
When creating the minidimension, continuously variable attributes, such as income and total purchases, should be converted to banded ranges. In other words, we force the attributes in the minidimension to take on a relatively small number of discrete values. Although this restricts use to a set of predefined bands, it drastically reduces the number of combinations in the minidimension.
If we stored income at a specific dollar and cents value in the minidimension, when combined with the other demographic attributes, we could end up with as many rows in the minidimension as in the main customer dimension itself. The use of band ranges is probably the most significant compromise associated
C H A P T E R 6 154
Table 6.3 Sample Rows from a Demographic Minidimension
DEMOGRAPHIC KEY AGE GENDER INCOME LEVEL
1 20-24 Male <$20,000
2 20-24 Male $20,000-$24,999
3 20-24 Male $25,000-$29,999
18 25-29 Male $20,000-$24,999
19 25-29 Male $25,000-$29,999
with the minidimension technique because once we decide on the value bands, it is quite impractical to change to a different set of bands at a later time. If users insist on access to a specific raw data value, such as a credit bureau score that is updated monthly, it also should be included in the fact table, in addition to being represented as a value band in the demographic minidimension. In Chapter 9 we’ll see how to construct on-the-fly value-banding queries against the facts in the fact table, although such queries are much less efficient than directly constraining the value band in our minidimension table.
Every time we build a fact table row, we include two foreign keys related to the customer: the regular customer dimension key and the minidimension demo-graphics key. As shown in Figure 6.4, the demodemo-graphics key should be part of the fact table’s set of foreign keys in order to provide efficient access to the fact table through the demographics attributes. This design delivers browsing and constraining performance benefits by providing a smaller point of entry to the facts. Queries can avoid the huge customer dimension table altogether unless attributes from that table are constrained.
When the demographics key participates as a foreign key in the fact table, another benefit is that the fact table serves to capture the demographic profile changes. Let’s presume that we are loading data into a periodic snapshot fact table on a monthly basis. Referring back to our sample demographic minidi-mension sample rows in Table 6.3, if one of our customers, John Smith, was 24 years old with an income of $24,000, we’d begin by assigning demographics key 2 when loading the fact table. If John has a birthday several weeks later, we’d assign demographics key 18 when the fact table was next loaded. The demographics key on the earlier fact table rows for John would not be changed. In this manner, the fact table tracks the age change. We’d continue to assign demographics key 18 when the fact table is loaded until there’s another change in John’s demographic profile. If John receives a raise to $26,000 sev-eral months later, a new demographics key would be reflected in the next fact table load. Again, the earlier rows would be unchanged. Historical demo-graphic profiles for each customer can be constructed at any time by referring
Customer Relationship Management 155
to the fact table and picking up the simultaneous customer key and its con-temporary demographics key, which in general will be different from the most recent demographics key.
Customer dimensions are unique in that customer attributes frequently are queried independently of the fact table. For example, users may want to know how many female customers live in Dade County by age bracket. Counts such as these are extremely common with customer segmentation and profiling.
Rather than forcing any analysis that combines solely customer and graphic data to link through the fact table, the most recent value of the demo-graphics key also can exist as a foreign key on the customer dimension table.
In this case, we refer to the demographics table as a customer dimension outrig-ger, as we discussed earlier in this chapter.
The minidimension terminology refers to when the demographics key is part of the fact table composite key; if the demographics key is a foreign key in the customer di-mension, we refer to it as an outrigger.
If you embed the most recent demographics key in the customer dimension, you must treat it as a type 1 attribute. If you tracked all the demographics changes over time as a type 2 slowly changing dimension, you would have reintroduced the rapidly changing monster dimension problem that we have been working to avoid! With a type 1 change, as we discussed in Chapter 4, we overwrite the demographics key in the customer row whenever it changes instead of creating a new customer row. We also recommend that these outrig-ger demographic attributes be labeled as most recent or current values to min-imize confusion. Even with unique labeling, be aware that presenting users with two avenues for accessing demographic data, through either the minidi-mension or the outrigger, can deliver more functionality and complexity than some users can handle.
Figure 6.4 Demographic minidimension with a customer dimension.
Becomes ... Customer Date of 1st Order
… Customer Date of 1st Order
…
Customer Number of Children Band Customer Marital Status
Figure 6.5 Separate demographic and behavioral minidimensions.
The demographic dimension itself cannot be allowed to grow too large. If we have 5 demographic attributes, each with 10 possible values, then the demo-graphics dimension could have 100,000 (105) rows. This is a reasonable upper limit for the number of rows in a minidimension. However, there are certainly cases where we need to support more than 5 demographic attributes with 10 values each. In this case, we would build a second demographics dimension, as shown in Figure 6.5. For example, we may have one set of attributes con-cerning traditional demographic income and lifestyle attributes and another set that focuses on purchase and credit behavioral scores. Multiple minidi-mensions address the issue of minidimension growth while also clustering like attributes together for a more intuitive user presentation. Another motiva-tion for creating these two minidimensions is that they are potentially sourced from two different data providers with different update frequencies. However, remember to bear in mind our advice from Chapter 2 concerning too many dimensions. We certainly don’t want to create a separate minidimension with a foreign key in the fact table for each demographic attribute, such as an age dimension, gender dimension, and income dimension. Likewise, we shouldn’t jump immediately on the minidimension technique unless we’re dealing with a large or rapidly changing dimension; we can’t forget the advantages of main-taining a simple, flat, denormalized dimension table.
The best approach for efficiently browsing and tracking changes of key attributes in really huge dimensions is to break off one or more minidimensions from the dimen-sion table, each consisting of small clumps of attributes that have been administered to have a limited number of values.
Customer Key (PK)
Relatively constant attributes … Customer Dimension
Customer Demographics Key (PK) Demographics attributes …
Customer Demographics Dimension
Customer Purchase-Credit Key (PK) Credit and payment behavioral attributes …
Customer Purchase-Credit Dimension
Customer Key (FK)
Customer Demographics Key (FK) Customer Purchase-Credit Key (FK) More Foreign Keys …
Facts …
Fact Table
Customer Relationship Management 157
Variable-Width Attribute Set
Finally, a minidimension can be created to handle a variable number of customer attributes. Obviously, the longer we have a relationship with a customer, the more descriptive information we know about him or her. If we think about the sales cycle, we have many more prospects than we do cus-tomers; however, we know much less about the prospects than we do about our customers. We may have 10 million initial prospects, described by a handful of characteristics, who are worked through the sales pipeline even-tually to result in 1 million official customers with a much broader set of known characteristics.
When using external prospect lists, we often are permitted only a one-time use of the list and don’t have the legal right to store the prospect information inter-nally. However, if we’ve generated our own prospect information, it certainly can be stored in the data warehouse. Let’s assume that we’re capturing metrics, perhaps associated with solicitation or quote-generation events that apply to both prospects and customers. We could store the prospects and customers together in a single contact dimension; however, there is a significant disparity between the numbers of attributes for prospective versus customer contacts. As illustrated in Figure 6.6, we may know only a handful of identification and loca-tion attributes about our prospects. On the other hand, we may know 50 addi-tional attributes for a customer, covering purchase, payment, credit and service behaviors, directly elicited profile attributes, and third-party purchased demo-graphic attributes. In the world of electronic retailing, we can equate prospects to be the anonymous Web site visitors as opposed to our registered customers.
Figure 6.6 Variable-width customer attributes handled as a base dimension and minidimension.
If we assume that many of the final 50 customer attributes are textual, we eas-ily could have a total row width of 1,000 bytes. Suppose that we have 10 mil-lion contacts (9 milmil-lion prospects and 1 milmil-lion official customers). Obviously, we are concerned that the trailing 50 columns in 90 percent of our contacts have no data. This gets our attention when we’re dealing with a 10-GB dimen-sion table. In this case, we may wish to introduce a minidimendimen-sion.
If we’re dealing with a database platform that supports variable-width rows, such as Oracle, we may be able to build a single dimension with the full com-plement of attributes if the total attributes list is not too long; in some of these cases, we don’t need to worry about all the prospects’ null columns because they take up virtually zero disk space. However, if we have a fixed-width data-base, or if the attributes list is very long, we are uncomfortable with all the empty columns for the prospects. In this case, as shown in Figure 6.6, we break the dimension into a 10-million-row base dimension table consisting of attrib-utes that are common to both prospects and customers, along with a 1-million-row customer minidimension that contains the additional attributes we know about our customers. Again, we include two foreign keys in the fact table.
Nine of ten fact table rows would join to an empty customer row in the extended customer minidimension.