The Dynamic Data Vault is an operational Data Vault that adapts its structure dynamically. In other words, the tables, columns, indexes, and keys are all subject to change, automatically. Of course, achieving this state requires a constant, vigilant watch on the metadata, including but not limited to incoming structures. The incoming structures may include XSD, XML, staging tables, or other metadata (including queue-based or process metadata) that describe the structure of the incoming data set.
The dynamic nature of the Data Vault means that new attributes may be added to Satellites, and new Links and new Hubs may be formed on the fly. ETL/ELT loading code is adjusted automatically, and BI query views also inherit certain changes. Once all the automatic model changes are complete, emails describing the changes are sent to the IT staff for review in the morning.
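As a rough illustration of the idea, the sketch below compares the columns described by an incoming structure with the columns a Satellite already has, and plans the DDL for any new attributes. The table name, column names, types, and the `plan_satellite_changes` helper are all hypothetical; the text does not prescribe how the comparison is implemented.

```python
def plan_satellite_changes(satellite_name, satellite_columns, incoming_columns):
    """Return ALTER TABLE statements for attributes that appear in the
    incoming structure but are missing from the Satellite.

    Hypothetical helper: a real implementation would read both sides
    from metadata (XSD, staging table catalog, and so on).
    """
    existing = {name.lower() for name in satellite_columns}
    statements = []
    for name, sql_type in incoming_columns.items():
        if name.lower() not in existing:
            statements.append(
                f"ALTER TABLE {satellite_name} ADD COLUMN {name} {sql_type}"
            )
    return statements


# Assumed metadata: the Satellite as it exists today ...
sat_customer = ["customer_sqn", "load_dts", "record_source", "name"]
# ... and the structure described by the incoming staging table.
staging_customer = {"name": "VARCHAR(100)", "email": "VARCHAR(255)"}

changes = plan_satellite_changes("sat_customer", sat_customer, staging_customer)
for ddl in changes:
    print(ddl)
```

The same list of planned statements could also feed the review email sent to the IT staff, so that every automatic change is visible the next morning.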
Common Attributes

Data Vault structures (tables) contain standard attributes that assist with t[…]king, and querying. The common attributes in the Data Vault are defined h[…] throughout the Hubs, Links and Satellites. The common attributes include: sequence numbers (line item numbers), load dates, load end-dates, last seen […] record creation dates, and record sources.

Most of these fields are EDW (enterprise data warehouse) system defined, and EDW system generated/maintained; as a result, the data in these columns are “reference data” and are not auditable as they do not exist in the source system. However, record creation dates and line item numbers are two cases that are auditable, particularly when they exist in the source system.

[…] Data Vault works on the principles similar to geological layering, where data arriving in the warehouse (in a single batch) is stamped with a “geological time based layer” (a load date stamp). The load dates enforce audit trails and record history based on the only controllable system date time available to the EDW loading routines. […] principle does not apply is during real-time feed processing.

Figure 3-1: Time Series Batch Loaded […]

The […] assists with auditability an[…]ility by stamping all participating rows in a single batch with the same load date time […] the loading process fails mechanically ([…] reason) it is necessary to examine all rows loaded during that process; resulting in removal, replacement, or augmentation to […] the only mechanism available […]eate the audit trail of the data for that […] side note, these mechanical problems are not often discovered for weeks or months […] have occurred.

Real-time data loads are treated differently. Real-time […] are stamped based on m[…]val time. Real-time latency is typically defined as m[…]val in a data loading queue […]ncy of arrival being less than one minute. Real-Time Loading is commonly defined in terms of transactions per second. An example image of real-time data stamps is shown in Figure 3-[…]

Figure 3-2 Real-Time Arrival, Data Geology

Real-time data arrival time-stamping appears similar to layers of pebbles on the beach. Da[…]gruent with time intervals or time-spans. It can be grouped together for analytic purposes […] single time constant does not represent any fixed layer of information in the enterprise data warehouse. As data loads shift to incorporate real-time data feeds (also known as trickle feeds) […]s between constant time (batch loads) and continuous time (real-time loads) blurs.
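The difference between batch and real-time stamping described above might be sketched as follows. This is only a minimal illustration under assumptions: the function names, the `load_dts` column name, and the sample rows are hypothetical, and the rows are plain dictionaries rather than database records.

```python
from datetime import datetime, timezone


def stamp_batch(rows, load_dts=None):
    """Stamp every row of one batch with a single shared load date."""
    load_dts = load_dts or datetime.now(timezone.utc)
    return [dict(row, load_dts=load_dts) for row in rows]


def stamp_on_arrival(row, arrival_dts=None):
    """Stamp one real-time row with its own arrival time."""
    return dict(row, load_dts=arrival_dts or datetime.now(timezone.utc))


def rows_loaded_at(rows, load_dts):
    """Find every row that belongs to the batch carrying this load date."""
    return [row for row in rows if row["load_dts"] == load_dts]


# A batch load: both rows receive the same "geological layer".
batch_dts = datetime(2011, 1, 1, 2, 0, tzinfo=timezone.utc)
warehouse = stamp_batch(
    [{"customer_no": "C-100"}, {"customer_no": "C-200"}], batch_dts
)

# A real-time (trickle feed) row: stamped individually as it arrives.
warehouse.append(stamp_on_arrival({"customer_no": "C-300"}))

# If the batch load later turns out to have failed, its rows can be
# located by their shared load date.
suspect_rows = rows_loaded_at(warehouse, batch_dts)
print(len(suspect_rows))
```

Because the batch rows share one load date, a failed load can be examined as a unit, whereas real-time rows each carry their own stamp and form no fixed layer.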
Super Charge Your Data Warehouse
© Dan Linstedt 2010-2011, all rights reserved http://LearnDataVault.com

3.1 Sequence Numbers
Sequence numbers are required by relational database management systems (RDBMS) in order to process joins quickly and efficiently. Without sequence numbers, joins across huge amounts of information would have to rely on character-based keys, which operate comparatively slowly. The use of sequence numbers as primary keys for Hubs and Links also eliminates any possible issues with maintaining multi-part cascading keys in Satellites or nested Link tables.
Staging area sequences are stored within the staging area. These sequences should be restarted and set to cycle for each load to a specific table. Staging sequence numbers are used only to identify loaded duplicates. Staging area sequences should never leave the staging area, and should not be moved forward into the Data Vault.
Duplicates are rows that contain exactly the same data, from the keys, to the nulls, to the descriptive fields. When rows are exact duplicates, there needs to be a way to delete the extra rows from the staging table in order to proceed with loading only one unique copy to the target Data Vault.
Without a sequence number, there is no unique identifier on each row. With a sequence number it is easy to “pick” the first or last row as the candidate to leave in place and delete the rest.
Before the duplicates are deleted, the Metrics Vault should record a history of how many duplicates there are per staging table per business key. By counting the duplicates, auditability can be maintained if the IT staff is ever asked to reproduce the source load. The number of duplicates multiplied by one row gives the recreation an accurate picture. In other words, a Cartesian join product is applied in order to reproduce the original duplicate row set.
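The duplicate handling described above can be sketched in a few lines. This is an illustrative sketch only: the column names (`stg_seq`, `customer_no`), the in-memory list standing in for a staging table, and the shape of the metrics records are assumptions, not the book's implementation.

```python
def deduplicate_staging(rows, key_column):
    """Keep the first copy (lowest staging sequence) of each fully
    identical row and record how many copies each duplicate set had."""
    groups = {}
    for row in sorted(rows, key=lambda r: r["stg_seq"]):
        # Everything except the staging sequence must match, nulls included.
        payload = tuple(sorted((c, v) for c, v in row.items() if c != "stg_seq"))
        groups.setdefault(payload, []).append(row)

    survivors, metrics = [], []
    for copies in groups.values():
        first = copies[0]
        survivors.append(first)
        if len(copies) > 1:
            # Assumed shape of a Metrics Vault record.
            metrics.append({
                "business_key": first[key_column],
                "kept_stg_seq": first["stg_seq"],
                "copies": len(copies),
            })
    return survivors, metrics


def reproduce_source_load(survivors, metrics):
    """Rebuild the original row set: one row multiplied by its copy count."""
    copies_by_seq = {m["kept_stg_seq"]: m["copies"] for m in metrics}
    rebuilt = []
    for row in survivors:
        data = {c: v for c, v in row.items() if c != "stg_seq"}
        rebuilt.extend(dict(data) for _ in range(copies_by_seq.get(row["stg_seq"], 1)))
    return rebuilt


staging = [
    {"stg_seq": 1, "customer_no": "C-100", "name": "Acme", "phone": None},
    {"stg_seq": 2, "customer_no": "C-100", "name": "Acme", "phone": None},
    {"stg_seq": 3, "customer_no": "C-200", "name": "Birch", "phone": "555-0100"},
    {"stg_seq": 4, "customer_no": "C-100", "name": "Acme", "phone": None},
]

survivors, metrics = deduplicate_staging(staging, "customer_no")
rebuilt = reproduce_source_load(survivors, metrics)
print([row["stg_seq"] for row in survivors], metrics, len(rebuilt))
```

Recording the copy count before deletion is what makes the reconstruction possible; without it, the deduplicated staging table alone could not reproduce the source load.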
Hub and Link sequence numbers are created one-for-one with each unique business key and unique association inserted into the respective table. Satellite sequence numbers are generally parent table sequence numbers; in other words, they are inherited from the Hub or Link parent table.
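A minimal sketch of this one-for-one assignment is shown below. The `Hub` class, the column names, and the sample keys are hypothetical, and an in-memory dictionary stands in for the Hub table and its database sequence.

```python
class Hub:
    """Toy Hub: one sequence number per unique business key."""

    def __init__(self):
        self._next_sqn = 1
        self._sqn_by_key = {}

    def get_or_create(self, business_key):
        """Return the existing sequence number for a business key,
        or assign the next one if the key has not been seen before."""
        if business_key not in self._sqn_by_key:
            self._sqn_by_key[business_key] = self._next_sqn
            self._next_sqn += 1
        return self._sqn_by_key[business_key]


hub_customer = Hub()
sat_customer = []

incoming = [("C-100", "Acme"), ("C-200", "Birch"), ("C-100", "Acme Ltd")]
for customer_no, name in incoming:
    # The Satellite row inherits the parent Hub's sequence number.
    customer_sqn = hub_customer.get_or_create(customer_no)
    sat_customer.append({"customer_sqn": customer_sqn, "name": name})

print([row["customer_sqn"] for row in sat_customer])
```

The third incoming row reuses the sequence number already assigned to its business key, so the Hub gains no new row while the Satellite records the changed description.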
It is a recommended practice to set up sequence numbers as number(12). In Oracle there appears to be no byte-storage difference between a number(12) and a number(38). Most sequence numbers will fit within this length, and will not require double or floating point math to resolve at query time.
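A quick arithmetic check shows why twelve digits is comfortable. The figures below are plain integer arithmetic; the rate of one million new keys per day is an assumed number chosen only for illustration.

```python
# Largest value a 12-digit sequence number can hold.
MAX_NUMBER_12 = 10**12 - 1

# It fits in a signed 64-bit integer, so integer math is sufficient.
fits_in_int64 = MAX_NUMBER_12 <= 2**63 - 1

# It is also below 2**53, the point past which a double can no longer
# represent every integer exactly.
exact_in_double = MAX_NUMBER_12 <= 2**53

# Assumed load rate (illustrative only): one million new keys per day.
keys_per_day = 1_000_000
years_until_exhausted = MAX_NUMBER_12 // (keys_per_day * 365)

print(fits_in_int64, exact_in_double, years_until_exhausted)
```

Even at that assumed rate, a number(12) sequence would take thousands of years to exhaust.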
Sequence numbers in the Data Vault should never be shown to business users, and must not leave the Data Vault going forward. First, sequence numbers are meaningless numbers which are there simply to provide uniqueness to the rows they represent. Second, the numbers are there merely for JOIN purposes at high rates of speed. Third, if I ask you, “Please tell me what number 5 means to you,” can you define it? Can you make sense of it? No. It’s a meaningless NUMBER. There is no context.
The sin of this is that once you expose the sequence number to the business, they will forever attach that customer, product, employee, service, or whatever it is to the number you give them. That is, they give it context; they force it to mean something to the business! Now you (as IT) no longer have the right or the ability to change, alter, or destroy and rebuild that number, nor are you allowed to attach different rows to that number.
This will cause problems for future re-loading, re-building, or even fixing the Data Warehouse, regardless of the data modeling technique you choose! DON’T DO THIS. DON’T EXPOSE SEQUENCE NUMBERS TO THE BUSINESS… EVER!
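One possible way to honor this rule is to give business users a view that joins on the sequence number internally but exposes only the business key and descriptive columns. The sketch below uses Python's built-in `sqlite3` module purely for illustration; the table, view, and column names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE hub_customer (
        customer_sqn INTEGER PRIMARY KEY,
        customer_no  TEXT UNIQUE
    );
    CREATE TABLE sat_customer (
        customer_sqn INTEGER REFERENCES hub_customer (customer_sqn),
        load_dts     TEXT,
        name         TEXT
    );
    INSERT INTO hub_customer VALUES (1, 'C-100');
    INSERT INTO sat_customer VALUES (1, '2011-01-01', 'Acme');

    -- The join uses the sequence number, but the view never selects it.
    CREATE VIEW bi_customer AS
    SELECT h.customer_no AS customer_no,
           s.load_dts    AS load_dts,
           s.name        AS name
    FROM hub_customer h
    JOIN sat_customer s ON s.customer_sqn = h.customer_sqn;
""")

cursor = conn.execute("SELECT * FROM bi_customer")
exposed_columns = [column[0] for column in cursor.description]
rows = cursor.fetchall()
print(exposed_columns, rows)
```

Because the business only ever sees the business key, the sequence numbers can be regenerated during a reload or rebuild without breaking anything users depend on.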