CHAPTER 1: THEORETICAL FOUNDATION
1.5 Development Methodologies
The data that drives today’s business systems often comes from a variety of sources and disparate data structures. As organizations grow, they retain old data systems and augment them with new and improved systems. Data becomes difficult to manage and use, and a clear picture of a customer, product, or buying trend can be practically impossible to ascertain.
The price of poor data is illustrated by these examples:
• A data error in a bank causes 300 credit-worthy customers to receive mortgage default notices. The error costs the bank time, effort, and customer goodwill.
• A marketing organization sends duplicate direct mail pieces. A six percent redundancy in each mailing costs hundreds of thousands of dollars a year.
• A managed-care agency cannot relate prescription drug usage to patients and prescribing doctors. The agency’s OLAP application fails to identify areas to improve efficiency and inventory management and new selling opportunities.
The source of quality issues is a lack of common standards for how to store data and an inconsistency in how the data is input. Different business operations are often very creative with the data values that they introduce into your application environments. Inconsistency across sources makes understanding relationships between critical business entities such as customers and products very difficult. In many cases, there is no reliable and persistent key that you can use across the enterprise to get all the information that is associated with a single customer or product.
Without high-quality data, strategic systems cannot match and integrate all related data to provide a complete view of the organization and the interrelationships within it. CIOs can no longer count on a return on the investments made in critical business applications. The solution calls for a product that can automatically re-engineer and match all types of customer, product, and enterprise data, in batch or at the transaction level in real time.
WebSphere QualityStage is a data re-engineering environment that is designed to help programmers, programmer analysts, business analysts, and others cleanse and enrich data to meet business objectives and data quality management standards.
Introduction to WebSphere QualityStage
WebSphere QualityStage comprises a set of stages, a Match Designer, and related capabilities that provide a development environment for building data-cleansing tasks called jobs.
Using the stages and design components, you can quickly and easily process large stores of data, selectively transforming the data as needed.
WebSphere QualityStage provides a set of integrated modules for accomplishing data re-engineering tasks:
• Investigating
• Conditioning (standardizing)
• Designing and running matches
• Determining which data records survive
The probabilistic matching capability and dynamic weighting strategies of WebSphere QualityStage help you create high-quality, accurate data and consistently identify core business information such as customer, location, and product throughout the enterprise. WebSphere QualityStage standardizes and matches any type of information. By ensuring data quality, WebSphere QualityStage reduces the time and cost to implement CRM, business intelligence, ERP, and other strategic customer-related IT initiatives.
Scenarios for data cleansing
Organizations need to understand the complex relationships that they have with their customers, suppliers, and distribution channels. They need to base decisions on accurate counts of parts and products to compete effectively, provide exceptional service, and meet increasing regulatory requirements. Consider the following scenarios:
Banking: One view of households
To facilitate marketing and mail campaigns, a large retail bank needed a single dynamic view of its customers’ households from 60 million records in 50 source systems. The bank uses WebSphere QualityStage to automate the process. Consolidated views are matched for all 50 sources, yielding information for all marketing campaigns. The result is reduced costs and improved return on the bank’s marketing investments. Householding is now a standard process at the bank, which has a better understanding of its customers and more effective customer relationship management.
Pharmaceutical: Operations information
A large pharmaceutical company needed a data warehouse for marketing and sales information. The company had diverse legacy data with different standards and formats, information that was buried in free-form fields, incorrect data values, discrepancies between field metadata and actual data in the field, and duplicates. It was impossible to get a complete, consolidated view of an entity such as total quarterly sales from the prescriptions of one doctor. Reports were difficult and time-consuming to compile, and their accuracy was suspect.
Most vendor tools lack the flexibility to find all the legacy data variants, different formats for business entities, and other data problems. The company chose WebSphere QualityStage because it goes beyond traditional data-cleansing techniques to investigate fragmented legacy data at the level of each data value. Analysts can now access complete and accurate online views of doctors, the prescriptions that they write, and their managed-care affiliations for better decision support, trend analysis, and targeted marketing.
Insurance: One real-time view of the customer
A leading insurance company lacked a unique ID for each subscriber, many of whom participated in multiple health, dental, or benefit plans. Subscribers who visited customer portals could not get complete information on their account status, eligible services, and other details.
Using WebSphere QualityStage, the company implemented a real-time, in-flight data quality check of all portal inquiries. WebSphere QualityStage and WebSphere MQ transactions were combined to retrieve customer data from multiple sources and return integrated customer views. The new process provides more than 25 million subscribers with a real-time, 360-degree view of their insurance services. A unique customer ID for each subscriber is also helping the insurer move toward a single customer database for improved customer service and marketing.
Where WebSphere QualityStage fits in the overall business context
WebSphere QualityStage performs the preparation stage of enterprise data integration (often referred to as data cleansing), as Figure 36 shows. WebSphere QualityStage leverages the source systems analysis that is performed by WebSphere Information Analyzer and supports the transformation functions of WebSphere DataStage.
Working together, these products automate what was previously a manual or neglected activity within a data integration effort: data quality assurance. The combined benefits help companies avoid one of the biggest problems with data-centric IT projects: low return on investment (ROI) caused by working with poor-quality data.
Data preparation is critical to the success of an integration project. These common business initiatives are strengthened by improved data quality:
Consolidating enterprise applications
High-quality data and the ability to identify critical role relationships improve the success of consolidation projects.
Marketing campaigns
Strong understanding of customers and customer relationships cuts costs, improves customer satisfaction, reduces attrition, and increases revenues.
Supply chain management
Better data quality allows better integration between an organization and its suppliers by resolving differences in codes and descriptions for parts or products.
Figure 36. WebSphere QualityStage prepares data for integration
Procurement
Identifying multiple purchases from the same supplier and multiple purchases of the same commodity leads to improved terms and reduced cost.
Fraud detection and regulatory compliance
Better reference data enables reduction in fraud loss through more timely identification of fraudulent activity.
Whether an enterprise is migrating its information systems, upgrading its organization and its processes, or integrating and leveraging information, it must determine the requirements and structure of the data that will address the business goals. As Figure 37 shows, you can use WebSphere QualityStage to meet those data quality requirements through classic data re-engineering.
A process for reengineering data should accomplish the following goals:
• Resolve conflicting and ambiguous meanings for data values
• Identify new or hidden attributes from free-form and loosely controlled source fields
• Standardize data to make it easier to find
• Identify duplication and relationships among such business entities as customers, prospects, vendors, suppliers, parts, locations, and events
• Create one unique view of the business entity
• Facilitate enrichment of reengineered data, such as adding information from vendor sources or applying standard postal certification routines
You can use a data reengineering process in batch or real time for continuous data quality improvement.
Figure 37. Classic data reengineering with WebSphere QualityStage
A closer look at WebSphere QualityStage
WebSphere QualityStage uses out-of-the-box, customizable rules to prepare complex information about your business entities for a variety of transactional, operational, and analytical purposes.
WebSphere QualityStage automates the conversion of data into verified standard formats by using probabilistic matching, in which variables that are common to records (for example, given name, date of birth, or sex) are matched when unique identifiers are not available.
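The idea of probabilistic matching can be illustrated with a minimal sketch in the classic Fellegi-Sunter style, which may or may not resemble how WebSphere QualityStage computes its weights internally. Each compared field contributes a weight derived from two probabilities: m (the chance that the field agrees when two records are a true match) and u (the chance that it agrees by coincidence). The field names, the m and u values, and the function names below are assumptions chosen for illustration only.

```python
import math

# Hypothetical m/u probabilities per field (illustrative values only).
# m: probability the field agrees when two records are a true match.
# u: probability the field agrees when the records are not a match.
FIELD_PROBS = {
    "given_name": (0.95, 0.02),
    "date_of_birth": (0.97, 0.001),
    "sex": (0.99, 0.50),
}


def field_weight(agrees: bool, m: float, u: float) -> float:
    """Return the log2 agreement or disagreement weight for one field."""
    if agrees:
        return math.log2(m / u)
    return math.log2((1 - m) / (1 - u))


def match_score(rec_a: dict, rec_b: dict) -> float:
    """Sum the per-field weights; a higher score means a more likely match."""
    total = 0.0
    for field, (m, u) in FIELD_PROBS.items():
        a, b = rec_a.get(field), rec_b.get(field)
        if a is None or b is None:
            continue  # a missing value contributes no evidence either way
        total += field_weight(a == b, m, u)
    return total
```

In this sketch, agreement on sex adds little to the score because two unrelated records agree on it half of the time, whereas agreement on date of birth adds a large weight because coincidental agreement is rare.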
WebSphere QualityStage components include the Match Designer, for designing and testing match passes, and a set of data-cleansing operations called stages.
Information is extracted from the source system, measured, cleansed, enriched, consolidated, and loaded into the target system.
At run time, data cleansing jobs consist of the following sequence of stages:
Investigate stage
Gives you complete visibility into the actual condition of data.
Standardize stage
Reformats data from multiple systems to ensure that each data type has the correct content and format.
Match stages
Ensure data integrity by linking records from one or more data sources that correspond to the same customer, supplier, or other entity. Matching can be used to identify duplicate entities that are caused by data entry variations or account-oriented business practices. Unduplicate match jobs group records into sets that have similar attributes. The Reference Match stage matches reference data to source data using a variety of match processes.
Survive stage
Ensures that the best available data survives and is correctly prepared for the target.
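As a rough illustration of what survivorship means at the end of this sequence, the following sketch builds one surviving record from a group of records that matching has already linked. The rule it applies (keep the longest non-empty value for each field) and the function name `survive` are assumptions made for this example; the source does not describe the rules that the Survive stage actually uses.

```python
def survive(group: list[dict], fields: list[str]) -> dict:
    """Build one surviving record from a group of matched records.

    Rule used here (an assumption for illustration): for each field,
    keep the longest non-empty value; the earliest record wins ties.
    """
    survivor = {}
    for f in fields:
        # Collect the non-missing values for this field, trimmed of blanks.
        candidates = [str(r[f]).strip() for r in group if r.get(f)]
        candidates = [c for c in candidates if c]
        # max() returns the first maximal item, so earlier records win ties.
        survivor[f] = max(candidates, key=len) if candidates else None
    return survivor
```

Other rules, such as preferring the most recently updated record or the most frequent value, would fit the same structure by changing how the candidate is chosen.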
Business intelligence packages that are available with WebSphere QualityStage provide data enrichment that is based on business rules. These rules can resolve issues with common data quality problems such as invalid address fields across multiple geographies. The following packages are available:
Worldwide Address Verification and Enhancement System (WAVES)
Matches address data against standard postal reference data that helps you verify address information for 233 countries and regions.
Multinational geocoding
Used for spatial information management and location-based services by adding longitude, latitude, and census information to location data.
Postal certification rules
Provide certified address verification and enhancement to address fields to enable mailers to meet the local requirements to qualify for postal discounts.
Where WebSphere QualityStage fits in the IBM Information Server architecture
WebSphere QualityStage is built around a services-oriented vision for structuring data quality tasks that are used by many new enterprise system architectures. As part of the integrated IBM Information Server platform, it is supported by a broad range of shared services and benefits from the reuse of several suite components.
WebSphere QualityStage and DataStage share the same infrastructure for importing and exporting data, designing, deploying, and running jobs, and reporting. The developer uses the same design canvas to specify the flow of data from preparation to transformation and delivery.
Multiple discrete services give WebSphere QualityStage the flexibility to match increasingly varied customer environments and tiered architectures. Figure 38 shows how the WebSphere DataStage and QualityStage Designer (labeled “Development interface”) interacts with other elements of the platform to deliver enterprise data analysis services.
The following suite components are shared:
Common user interface
The WebSphere DataStage and QualityStage Designer provides a development environment. The WebSphere DataStage and QualityStage Administrator provides access to deployment and administrative functions.
WebSphere QualityStage is tightly integrated with WebSphere DataStage and shares the same design canvas, which enables users to design jobs with data transformation stages and data quality stages in the same session.
Common services
WebSphere QualityStage uses the common services in IBM Information Server for logging and security. Because metadata is shared “live” across tools, you can access services such as impact analysis without leaving the design environment. You can also access domain-specific services for enterprise data cleansing such as investigate, standardize, match, and survive from this layer.
Figure 38. IBM Information Server product architecture
Common repository
The repository holds data to be shared by multiple projects. Clients can access metadata and results of data analysis from the respective service layers.
Common parallel processing engine
The parallel processing engine addresses high throughput requirements for analyzing large quantities of source data and handling increasing volumes of work in decreasing time frames.
Common connectors
Any data source that is supported by IBM Information Server can be used as input to a WebSphere QualityStage job by using connectors. The connectors also enable access to the common repository from the processing engine.
WebSphere QualityStage tasks
WebSphere QualityStage helps establish a clear understanding of data and uses best practices to improve data quality.
As shown in Figure 39, providing quality data has four stages:
Data investigation
To fully understand information.
Data standardization
To fully cleanse information.
Data matching
To create semantic keys to identify information relationships.
Data survivorship
To build the best available view of related information.
Figure 39. Steps in the WebSphere QualityStage process
Investigate stage
Understanding your data is a necessary precursor to cleansing. You can use WebSphere Information Analyzer to create a direct input into the cleansing process by using shared metadata, or use the Investigate stage to create this input.
The Investigate stage shows the actual condition of data in legacy sources and identifies and corrects data problems before they corrupt new systems.
Investigation parses and analyzes free-form fields, counts unique values, and classifies or assigns a business meaning to each occurrence of a value within a field.
Investigation achieves these goals:
• Uncovers trends, potential anomalies, metadata discrepancies, and undocumented business practices.
• Identifies invalid or default values.
• Reveals common terminology.
• Verifies the reliability of fields proposed as matching criteria.
The Investigate stage takes a single input, which can be a link from any database connector that is supported by WebSphere DataStage, from a flat file or data set, or from any processing stage. Inputs to the Investigate stage can be fixed length or variable.
As Figure 40 shows, you use the WebSphere DataStage and QualityStage Designer to specify the Investigate stage. The stage can have one or two output links, depending on the type of investigation that you specify.
Figure 40. Designing the Investigate stage
The Word Investigation stage parses free-form data fields into individual tokens and analyzes them to create patterns. This stage also provides frequency counts on the tokens. To create the patterns in address data, for example, the Word Investigation stage uses a set of rules for classifying personal names, business names, and addresses. The stage provides pre-built rule sets for investigating patterns on names and postal addresses for a number of different countries. For example, for the United States the stage parses the following components:
USPREP
Name, address, and area if the data is not previously formatted
USNAME
Individual and organization names
USADDR
Street and mailing addresses
USAREA
City, state, ZIP code, and so on
The test field 123 St. Virginia St. is analyzed in the following way:
1. Field parsing would break the address into the individual tokens of 123, St., Virginia, and St.
2. Lexical analysis determines the business significance of each piece:
   a. 123 = number
   b. St. = street type
   c. Virginia = alpha
   d. St. = street type
3. Context analysis identifies the various data structures and content as 123 St. Virginia, St.
   a. 123 = House number
   b. St. Virginia = Street address
   c. St. = Street type
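The three steps above can be sketched in a few lines of Python. This is a simplified illustration, not the rule sets that ship with the product: the `STREET_TYPES` table, the function names, and the single positional rule used for context analysis are all assumptions made for this example.

```python
import re

# Hypothetical classification table: tokens treated as street types.
STREET_TYPES = {"ST.", "ST", "AVE", "AVE.", "RD", "RD.", "BLVD", "BLVD."}


def tokenize(field: str) -> list[str]:
    """Field parsing: split a free-form field into whitespace-separated tokens."""
    return field.split()


def classify(token: str) -> str:
    """Lexical analysis: assign a class to one token, ignoring its position."""
    if token.isdigit():
        return "number"
    if token.upper() in STREET_TYPES:
        return "street type"
    if re.fullmatch(r"[A-Za-z]+", token):
        return "alpha"
    return "unknown"


def context_parse(tokens: list[str]) -> dict[str, str]:
    """Context analysis: apply one positional rule (number first, type last)."""
    classes = [classify(t) for t in tokens]
    if len(tokens) >= 3 and classes[0] == "number" and classes[-1] == "street type":
        return {
            "house_number": tokens[0],
            "street_address": " ".join(tokens[1:-1]),
            "street_type": tokens[-1],
        }
    return {"unparsed": " ".join(tokens)}
```

Applied to the test field, lexical analysis labels both occurrences of St. as a street type, and only the positional rule in `context_parse` decides that the first one belongs to the street name. That is the distinction between lexical and context analysis that the example is meant to show.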
The Character Investigation stage parses a single-domain field (one that contains one data element or token, such as Social Security number, telephone number, date, or ZIP code) to analyze and classify data. The Character Investigation stage provides a frequency distribution and pattern analysis of the tokens.
A pattern report is prepared for all types of investigations and displays the count, percentage of data that matches this pattern, the generated pattern, and sample data. This output can be presented in a wide range of formats to conform to standard reporting tools.
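A pattern report of this kind can be approximated with a short sketch. The pattern symbols used here (n for a digit, a for a letter, b for a blank) and the function names are assumptions for illustration and are not taken from the product documentation.

```python
from collections import Counter


def char_pattern(value: str) -> str:
    """Map each character to a class symbol: n = digit, a = letter, b = blank."""
    out = []
    for ch in value:
        if ch.isdigit():
            out.append("n")
        elif ch.isalpha():
            out.append("a")
        elif ch == " ":
            out.append("b")
        else:
            out.append(ch)  # keep punctuation so separators stay visible
    return "".join(out)


def pattern_report(values: list[str]) -> list[dict]:
    """Return one row per pattern: count, percentage, pattern, sample value."""
    counts = Counter(char_pattern(v) for v in values)
    samples = {}
    for v in values:
        samples.setdefault(char_pattern(v), v)  # first value seen per pattern
    total = len(values)
    return [
        {"count": n, "percent": 100.0 * n / total, "pattern": p, "sample": samples[p]}
        for p, n in counts.most_common()
    ]
```

Running `pattern_report` over a ZIP code column makes outliers easy to spot: a value such as 1000A produces the pattern nnnna, which stands out against the dominant nnnnn.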
Standardize stage
Based on an understanding of data from the Investigation stage, you can apply out-of-the-box rules with the Standardize stage to reformat data from multiple systems. This stage facilitates effective matching and output formatting.
WebSphere QualityStage can transform any data type into your desired standards. It applies consistent representations, corrects misspellings, and incorporates business or industry standards. It formats data, places each value into a single domain field, and transforms data into a standard format.
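A minimal sketch of rule-based standardization follows, assuming a small hand-written lookup table. The `STANDARD_FORMS` table and the `standardize` function are hypothetical and far simpler than a real rule set, which would also take context into account.

```python
import re

# Hypothetical rule set: variant spellings mapped to one standard form.
STANDARD_FORMS = {
    "STREET": "ST",
    "STR": "ST",
    "ST.": "ST",
    "AVENUE": "AVE",
    "AVE.": "AVE",
    "AVENU": "AVE",  # a misspelling mapped to the standard form
}


def standardize(value: str) -> str:
    """Uppercase, collapse whitespace, and replace known variants."""
    tokens = re.sub(r"\s+", " ", value.strip()).upper().split(" ")
    return " ".join(STANDARD_FORMS.get(t, t) for t in tokens)
```

With this table, "12 main Street" and "12 Main St." both become "12 MAIN ST", so a later matching step compares like with like.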
As Figure 41 shows, you can select from predefined rules to apply the appropriate standardization for the data set.
Match stages overview
Data matching finds records in a single data source or independent data sources that refer to the same entity (such as a person, organization, location, product, or material) even if there is no predetermined key.
To increase its usability and completeness, data can be consolidated or linked along any relationship, such as a common person, business, place, product, part, or event. You can also use matching to find duplicate entities that are caused by data entry variations or account-oriented business practices.
During the data matching stage, WebSphere QualityStage takes these actions:
• Identifies duplicate entities (such as customers, suppliers, products, or parts) within one or more data sources
• Creates a consolidated view of an entity according to business rules
• Provides householding for individuals (such as a family or group of individuals at a location) and householding for commercial entities (multiple businesses in the same location or different locations)
• Enables the creation of match groups across data sources that might or might not have a predetermined key
• Enriches existing data with new attributes from external sources such as credit bureau data or change of address files
Match frequency stage
The Match Frequency stage gives you direct control over the disposition of generated frequency data. This stage provides results that can be used by the Match Designer and match stages, but enables you to generate the frequency data independent of running the matches.
Figure 41. Standardize rule process
You can generate frequency information by using any data that provides the fields that are needed by a match. Then you can let the generated frequency data flow into a match stage, store it for later use, or both. Figure 42 shows how the Standardize stage and Match Frequency stage are added in the Designer client.
In this example, input data is being processed in the Standardize stage with a rule set that creates consistent formats. The data is then split into two data streams. One stream passes data to a standard output and the other passes data to the Match Frequency stage.
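The notion of generating frequency data separately from matching can be sketched as follows. The function names and the record layout are assumptions for this example; the source does not specify the format of the frequency data that the stage produces.

```python
from collections import Counter


def field_frequencies(records: list[dict], fields: list[str]) -> dict[str, Counter]:
    """Count how often each value occurs in each of the given fields."""
    freqs = {f: Counter() for f in fields}
    for rec in records:
        for f in fields:
            value = rec.get(f)
            if value is not None:
                freqs[f][value] += 1
    return freqs


def relative_frequency(freqs: dict[str, Counter], field: str, value: str) -> float:
    """Return the share of non-missing values in `field` equal to `value`."""
    total = sum(freqs[field].values())
    return freqs[field][value] / total if total else 0.0
```

Because the counts are computed in their own pass, they can be stored and reused across several match runs. A match step could then, for instance, treat agreement on a rare surname as stronger evidence than agreement on a common one, although whether and how the product does this is not stated in the source.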
Match stage
Matching is a two-step process: first you block records and then you match them.
Blocking step
Blocking identifies subsets of data in which matches can be more efficiently performed. Blocking limits the number of record pairs that are being examined, which increases the efficiency of the matching.
To understand the concept of blocking, consider a column that contains age data. If there are 100 possible ages, blocking partitions a source into 100 subsets. The first subset is all people with an age of zero, the next is people with an age of 1, and so on. These subsets are called blocks.
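The age example can be expressed as a short sketch that partitions records into blocks and then generates candidate pairs only within each block. The function names and the record layout are assumptions for illustration.

```python
from collections import defaultdict
from itertools import combinations


def block_by(records: list[dict], key: str) -> dict:
    """Partition records into blocks that share the same value of `key`."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[rec[key]].append(rec)
    return dict(blocks)


def candidate_pairs(blocks: dict) -> list[tuple[dict, dict]]:
    """Return the record pairs to compare: only pairs inside the same block."""
    pairs = []
    for members in blocks.values():
        pairs.extend(combinations(members, 2))
    return pairs
```

For five records with three people aged 30 and two aged 41, comparing every record with every other record requires 10 pairs, while blocking on age leaves 4 pairs (3 in the first block and 1 in the second). The saving grows quickly with the size of the source, at the cost of never comparing records whose blocking value differs.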