CHAPTER 1: THEORETICAL FOUNDATION

1.5 Development Methodologies

The data that drives today’s business systems often comes from a variety of sources and disparate data structures. As organizations grow, they retain old data systems and augment them with new and improved systems. Data becomes difficult to manage and use, and a clear picture of a customer, product, or buying trend can be practically impossible to ascertain.

The price of poor data is illustrated by these examples:

• A data error in a bank causes 300 credit-worthy customers to receive mortgage default notices. The error costs the bank time, effort, and customer goodwill.

• A marketing organization sends duplicate direct mail pieces. A six percent redundancy in each mailing costs hundreds of thousands of dollars a year.

• A managed-care agency cannot relate prescription drug usage to patients and prescribing doctors. The agency’s OLAP application fails to identify areas to improve efficiency and inventory management and new selling opportunities.

The source of quality issues is a lack of common standards for how to store data and an inconsistency in how the data is input. Different business operations are often very creative with the data values that they introduce into your application environments. Inconsistency across sources makes understanding relationships between critical business entities such as customers and products very difficult. In many cases, there is no reliable and persistent key that you can use across the enterprise to get all the information that is associated with a single customer or product.

Without high-quality data, strategic systems cannot match and integrate all related data to provide a complete view of the organization and the interrelationships within it. CIOs can no longer count on a return on the investments made in critical business applications. The solution calls for a product that can automatically re-engineer and match all types of customer, product, and enterprise data, in batch or at the transaction level in real time.

WebSphere QualityStage is a data re-engineering environment that is designed to help programmers, programmer analysts, business analysts, and others cleanse and enrich data to meet business objectives and data quality management standards.

Introduction to WebSphere QualityStage

WebSphere QualityStage comprises a set of stages, a Match Designer, and related capabilities that provide a development environment for building data-cleansing tasks called jobs.

Using the stages and design components, you can quickly and easily process large stores of data, selectively transforming the data as needed.

WebSphere QualityStage provides a set of integrated modules for accomplishing data re-engineering tasks:

• Investigating

• Conditioning (standardizing)

• Designing and running matches

• Determining which data records survive

The probabilistic matching capability and dynamic weighting strategies of WebSphere QualityStage help you create high-quality, accurate data and consistently identify core business information such as customer, location, and product throughout the enterprise. WebSphere QualityStage standardizes and matches any type of information. By ensuring data quality, WebSphere QualityStage reduces the time and cost to implement CRM, business intelligence, ERP, and other strategic customer-related IT initiatives.

Scenarios for data cleansing

Organizations need to understand the complex relationships that they have with their customers, suppliers, and distribution channels. They need to base decisions on accurate counts of parts and products to compete effectively, provide exceptional service, and meet increasing regulatory requirements. Consider the following scenarios:

Banking: One view of households

To facilitate marketing and mail campaigns, a large retail bank needed a single dynamic view of its customers’ households from 60 million records in 50 source systems.

The bank uses WebSphere QualityStage to automate the process. Consolidated views are matched for all 50 sources, yielding information for all marketing campaigns. The result is reduced costs and improved return on the bank’s marketing investments. Householding is now a standard process at the bank, which has a better understanding of its customers and more effective customer relationship management.

Pharmaceutical: Operations information

A large pharmaceutical company needed a data warehouse for marketing and sales information. The company had diverse legacy data with different standards and formats, information that was buried in free-form fields, incorrect data values, discrepancies between field metadata and actual data in the field, and duplicates. It was impossible to get a complete, consolidated view of an entity such as total quarterly sales from the prescriptions of one doctor. Reports were difficult and time-consuming to compile, and their accuracy was suspect.

Most vendor tools lack the flexibility to find all the legacy data variants, different formats for business entities, and other data problems. The company chose WebSphere QualityStage because it goes beyond traditional data-cleansing techniques to investigate fragmented legacy data at the level of each data value. Analysts can now access complete and accurate online views of doctors, the prescriptions that they write, and their managed-care affiliations for better decision support, trend analysis, and targeted marketing.

Insurance: One real-time view of the customer

A leading insurance company lacked a unique ID for each subscriber, many of whom participated in multiple health, dental, or benefit plans. Subscribers who visited customer portals could not get complete information on their account status, eligible services, and other details.

Using WebSphere QualityStage, the company implemented a real-time, in-flight data quality check of all portal inquiries. WebSphere QualityStage and WebSphere MQ transactions were combined to retrieve customer data from multiple sources and return integrated customer views. The new process provides more than 25 million subscribers with a real-time, 360-degree view of their insurance services. A unique customer ID for each subscriber is also helping the insurer move toward a single customer database for improved customer service and marketing.

Where WebSphere QualityStage fits in the overall business context

WebSphere QualityStage performs the preparation stage of enterprise data integration (often referred to as data cleansing), as Figure 36 shows. WebSphere QualityStage leverages the source systems analysis that is performed by WebSphere Information Analyzer and supports the transformation functions of WebSphere DataStage.

Working together, these products automate what was previously a manual or neglected activity within a data integration effort: data quality assurance. The combined benefits help companies avoid one of the biggest problems with data-centric IT projects: low return on investment (ROI) caused by working with poor-quality data.

Data preparation is critical to the success of an integration project. These common business initiatives are strengthened by improved data quality:

Consolidating enterprise applications

High-quality data and the ability to identify critical role relationships improve the success of consolidation projects.

Marketing campaigns

Strong understanding of customers and customer relationships cuts costs, improves customer satisfaction and attrition, and increases revenues.

Supply chain management

Better data quality allows better integration between an organization and its suppliers by resolving differences in codes and descriptions for parts or

Figure 36. WebSphere QualityStage prepares data for integration

Procurement

Identifying multiple purchases from the same supplier and multiple purchases of the same commodity leads to improved terms and reduced cost.

Fraud detection and regulatory compliance

Better reference data enables reduction in fraud loss through more timely identification of fraudulent activity.

Whether an enterprise is migrating its information systems, upgrading its organization and its processes, or integrating and leveraging information, it must determine the requirements and structure of the data that will address the business goals. As Figure 37 shows, you can use WebSphere QualityStage to meet those data quality requirements through classic data re-engineering.

A process for reengineering data should accomplish the following goals:

• Resolve conflicting and ambiguous meanings for data values

• Identify new or hidden attributes from free-form and loosely controlled source fields

• Standardize data to make it easier to find

• Identify duplication and relationships among such business entities as customers, prospects, vendors, suppliers, parts, locations, and events

• Create one unique view of the business entity

• Facilitate enrichment of reengineered data, such as adding information from vendor sources or applying standard postal certification routines

You can use a data reengineering process in batch or real time for continuous data quality improvement.

Figure 37. Classic data reengineering with WebSphere QualityStage

A closer look at WebSphere QualityStage

WebSphere QualityStage uses out-of-the-box, customizable rules to prepare complex information about your business entities for a variety of transactional, operational, and analytical purposes.

WebSphere QualityStage automates the conversion of data into verified standard formats by using probabilistic matching, in which variables that are common to records (for example, given name, date of birth, or sex) are matched when unique identifiers are not available.
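To make the idea of probabilistic matching concrete, the following is a minimal Python sketch of a generic agreement-weight score in the style of classical record linkage. It is not taken from the product: the field names, the m and u probabilities, and the function `match_weight` are all hypothetical, and the way WebSphere QualityStage actually computes its weights may differ.

```python
import math

# Hypothetical per-field probabilities (illustrative values only):
# m = P(field agrees | both records describe the same entity)
# u = P(field agrees | records describe different entities)
FIELD_PROBS = {
    "given_name": (0.95, 0.01),
    "birth_date": (0.98, 0.003),
    "sex": (0.99, 0.5),
}

def match_weight(rec_a, rec_b):
    """Sum log2 agreement or disagreement weights over the compared fields."""
    total = 0.0
    for field, (m, u) in FIELD_PROBS.items():
        if rec_a[field] == rec_b[field]:
            total += math.log2(m / u)              # agreement adds weight
        else:
            total += math.log2((1 - m) / (1 - u))  # disagreement subtracts weight
    return total

a = {"given_name": "MARIA", "birth_date": "1970-03-14", "sex": "F"}
b = {"given_name": "MARIA", "birth_date": "1970-03-14", "sex": "F"}
c = {"given_name": "JOHN", "birth_date": "1982-11-02", "sex": "F"}

print(match_weight(a, b) > 0)  # all fields agree: positive composite weight
print(match_weight(a, c) < 0)  # only sex agrees: negative composite weight
```

In this sketch, agreement on a field that rarely agrees by chance (date of birth) contributes far more weight than agreement on a field with few possible values (sex), which is why no single unique identifier is required.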

WebSphere QualityStage components include the Match Designer, for designing and testing match passes, and a set of data-cleansing operations called stages.

Information is extracted from the source system, measured, cleansed, enriched, consolidated, and loaded into the target system.

At run time, data cleansing jobs consist of the following sequence of stages:

Investigate stage

Gives you complete visibility into the actual condition of data.

Standardize stage

Reformats data from multiple systems to ensure that each data type has the correct content and format.

Match stages

Ensure data integrity by linking records from one or more data sources that correspond to the same customer, supplier, or other entity. Matching can be used to identify duplicate entities that are caused by data entry variations or account-oriented business practices. Unduplicate match jobs group records into sets that have similar attributes. The Reference Match stage matches reference data to source data using a variety of match processes.

Survive stage

Ensures that the best available data survives and is correctly prepared for the target.
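As an illustration of what "the best available data survives" can mean, here is a small Python sketch that builds one surviving record from a group of matched records. The rule used (keep the longest non-empty value per field) and the function name `survive` are assumptions made for this example. The source does not specify which survivorship rules the Survive stage applies.

```python
def survive(match_group, fields):
    """Build one record by taking, per field, the longest non-empty value.

    'Longest value wins' is an assumed rule chosen only for illustration.
    """
    survivor = {}
    for field in fields:
        candidates = [(rec.get(field) or "").strip() for rec in match_group]
        survivor[field] = max(candidates, key=len)
    return survivor

# Two records that a match step has already linked to the same person.
group = [
    {"name": "J SMITH", "phone": "", "city": "BOSTON"},
    {"name": "JOHN SMITH", "phone": "555-0100", "city": ""},
]

best = survive(group, ["name", "phone", "city"])
print(best)
```

The surviving record combines the fuller name and the phone number from the second record with the city from the first, so no single source record has to be complete.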

Business intelligence packages that are available with WebSphere QualityStage provide data enrichment that is based on business rules. These rules can resolve issues with common data quality problems such as invalid address fields across multiple geographies. The following packages are available:

Worldwide Address Verification and Enhancement System (WAVES)

Matches address data against standard postal reference data that helps you verify address information for 233 countries and regions.

Multinational geocoding

Used for spatial information management and location-based services by adding longitude, latitude, and census information to location data.

Postal certification rules

Provide certified address verification and enhancement to address fields to enable mailers to meet the local requirements to qualify for postal discounts.

Where WebSphere QualityStage fits in the IBM Information Server architecture

WebSphere QualityStage is built around a services-oriented vision for structuring data quality tasks that are used by many new enterprise system architectures. As part of the integrated IBM Information Server platform, it is supported by a broad range of shared services and benefits from the reuse of several suite components.

WebSphere QualityStage and DataStage share the same infrastructure for importing and exporting data, designing, deploying, and running jobs, and reporting. The developer uses the same design canvas to specify the flow of data from preparation to transformation and delivery.

Multiple discrete services give WebSphere QualityStage the flexibility to match increasingly varied customer environments and tiered architectures. Figure 38 shows how the WebSphere DataStage and QualityStage Designer (labeled "Development interface") interacts with other elements of the platform to deliver enterprise data analysis services.

The following suite components are shared:

Common user interface

The WebSphere DataStage and QualityStage Designer provides a development environment. The WebSphere DataStage and QualityStage Administrator provides access to deployment and administrative functions.

WebSphere QualityStage is tightly integrated with WebSphere DataStage and shares the same design canvas, which enables users to design jobs with data transformation stages and data quality stages in the same session.

Common services

WebSphere QualityStage uses the common services in IBM Information Server for logging and security. Because metadata is shared “live” across tools, you can access services such as impact analysis without leaving the design environment. You can also access domain-specific services for enterprise data cleansing such as investigate, standardize, match, and survive from this layer.

Figure 38. IBM Information Server product architecture

Common repository

The repository holds data to be shared by multiple projects. Clients can access metadata and results of data analysis from the respective service layers.

Common parallel processing engine

The parallel processing engine addresses high throughput requirements for analyzing large quantities of source data and handling increasing volumes of work in decreasing time frames.

Common connectors

Any data source that is supported by IBM Information Server can be used as input to a WebSphere QualityStage job by using connectors. The connectors also enable access to the common repository from the processing engine.

WebSphere QualityStage tasks

WebSphere QualityStage helps establish a clear understanding of data and uses best practices to improve data quality.

As shown in Figure 39, providing quality data has four stages:

Data investigation

To fully understand information.

Data standardization

To fully cleanse information.

Data matching

To create semantic keys to identify information relationships.

Data survivorship

To build the best available view of related information.

Figure 39. Steps in the WebSphere QualityStage process

Investigate stage

Understanding your data is a necessary precursor to cleansing. You can use WebSphere Information Analyzer to create a direct input into the cleansing process by using shared metadata, or use the Investigate stage to create this input.

The Investigate stage shows the actual condition of data in legacy sources and identifies and corrects data problems before they corrupt new systems.

Investigation parses and analyzes free-form fields, counts unique values, and classifies or assigns a business meaning to each occurrence of a value within a field.

Investigation achieves these goals:

• Uncovers trends, potential anomalies, metadata discrepancies, and undocumented business practices.

• Identifies invalid or default values.

• Reveals common terminology.

• Verifies the reliability of fields proposed as matching criteria.

The Investigate stage takes a single input, which can be a link from any database connector that is supported by WebSphere DataStage, from a flat file or data set, or from any processing stage. Inputs to the Investigate stage can be fixed length or variable.

As Figure 40 shows, you use the WebSphere DataStage and QualityStage Designer to specify the Investigate stage. The stage can have one or two output links, depending on the type of investigation that you specify.

Figure 40. Designing the Investigate stage

The Word Investigation stage parses free-form data fields into individual tokens and analyzes them to create patterns. This stage also provides frequency counts on the tokens. To create the patterns in address data, for example, the Word Investigation stage uses a set of rules for classifying personal names, business names, and addresses. The stage provides pre-built rule sets for investigating patterns on names and postal addresses for a number of different countries. For example, for the United States the stage parses the following components:

USPREP

Name, address, and area if the data is not previously formatted

USNAME

Individual and organization names

USADDR

Street and mailing addresses

USAREA

City, state, ZIP code, and so on

The test field 123 St. Virginia St. is analyzed in the following way:

1. Field parsing would break the address into the individual tokens of 123, St., Virginia, and St.

2. Lexical analysis determines the business significance of each piece:

   a. 123 = number
   b. St. = street type
   c. Virginia = alpha
   d. St. = street type

3. Context analysis identifies the various data structures and content as 123 St. Virginia, St.

   a. 123 = House number
   b. St. Virginia = Street address
   c. St. = Street type
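The three steps above can be mimicked with a toy Python sketch. The classification table `STREET_TYPES` and the functions `tokenize`, `classify`, and `context` are hypothetical and far simpler than the pre-built rule sets the text mentions. They exist only to show how parsing, lexical analysis, and context analysis build on one another.

```python
# Hypothetical classification table; real rule sets are far richer.
STREET_TYPES = {"ST", "ST.", "AVE", "AVE.", "RD", "RD."}

def tokenize(field):
    """Step 1, field parsing: split the field on whitespace."""
    return field.split()

def classify(token):
    """Step 2, lexical analysis: assign a coarse class to one token."""
    if token.isdigit():
        return "number"
    if token.upper() in STREET_TYPES:
        return "street type"
    if token.isalpha():
        return "alpha"
    return "unknown"

def context(tokens, classes):
    """Step 3, context analysis (toy rule): a leading number is the house
    number, a trailing street type is the street type, and everything in
    between is the street address."""
    result = {}
    start, end = 0, len(tokens)
    if classes and classes[0] == "number":
        result["house number"] = tokens[0]
        start = 1
    if end > start and classes[-1] == "street type":
        result["street type"] = tokens[-1]
        end -= 1
    result["street address"] = " ".join(tokens[start:end])
    return result

tokens = tokenize("123 St. Virginia St.")
classes = [classify(t) for t in tokens]
print(list(zip(tokens, classes)))
print(context(tokens, classes))
```

Note how the first "St." is classified as a street type in step 2 but ends up as part of the street address in step 3: position within the field, not the token alone, decides its final meaning.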

The Character Investigation stage parses a single-domain field (one that contains one data element or token, such as Social Security number, telephone number, date, or ZIP code) to analyze and classify data. The Character Investigation stage provides a frequency distribution and pattern analysis of the tokens.

A pattern report is prepared for all types of investigations and displays the count, percentage of data that matches this pattern, the generated pattern, and sample data. This output can be presented in a wide range of formats to conform to standard reporting tools.
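A rough Python sketch of a character-level pattern report follows. The pattern symbols (`n` for a digit, `a` for a letter) and the function names are assumptions chosen for this example, not the product's notation. The report columns mirror the ones listed above: count, percentage, pattern, and sample data.

```python
from collections import Counter

def char_pattern(value):
    """Map digits to 'n', letters to 'a', and keep other characters as-is."""
    out = []
    for ch in value:
        if ch.isdigit():
            out.append("n")
        elif ch.isalpha():
            out.append("a")
        else:
            out.append(ch)
    return "".join(out)

def pattern_report(values):
    """Return rows of (count, percentage, pattern, sample), most frequent first."""
    counts = Counter()
    samples = {}
    for value in values:
        pattern = char_pattern(value)
        counts[pattern] += 1
        samples.setdefault(pattern, value)  # keep the first value seen as the sample
    total = len(values)
    return [(n, 100.0 * n / total, p, samples[p]) for p, n in counts.most_common()]

zips = ["02134", "10001", "1000A", "02134-1234"]
report = pattern_report(zips)
for row in report:
    print(row)
```

In a report like this, a low-frequency pattern such as `nnnna` in a ZIP code field is a quick signal of invalid values that deserve a closer look before matching.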

Standardize stage

Based on an understanding of data from the Investigation stage, you can apply out-of-the-box rules with the Standardize stage to reformat data from multiple systems. This stage facilitates effective matching and output formatting.

WebSphere QualityStage can transform any data type into your desired standards. It applies consistent representations, corrects misspellings, and incorporates business or industry standards. It formats data, places each value into a single domain field, and transforms data into a standard format.
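The kind of transformation described here can be sketched in a few lines of Python. The lookup table `STANDARD_FORMS` and both functions are hypothetical and stand in for a real rule set. The sketch only shows the general idea of mapping variants and misspellings to one representation and then placing each value in its own field.

```python
import re

# Hypothetical lookup table mapping variants (including one misspelling)
# to a single standard token.
STANDARD_FORMS = {
    "ST": "STREET", "STR": "STREET", "STRET": "STREET", "STREET": "STREET",
    "AV": "AVENUE", "AVE": "AVENUE", "AVENUE": "AVENUE",
}

def standardize_address(raw):
    """Uppercase, strip punctuation, and replace known variants."""
    tokens = re.sub(r"[^\w\s]", "", raw.upper()).split()
    return " ".join(STANDARD_FORMS.get(t, t) for t in tokens)

def split_domains(standardized):
    """Place each value in its own field: house number and street text."""
    m = re.match(r"(\d+)\s+(.*)", standardized)
    if m:
        return {"house_number": m.group(1), "street": m.group(2)}
    return {"house_number": "", "street": standardized}

for raw in ["123 Main St.", "123 main stret", "45 Ocean Av"]:
    print(split_domains(standardize_address(raw)))
```

After this step, "123 Main St." and "123 main stret" produce identical field values, which is what makes the later matching step effective.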

As Figure 41 shows, you can select from predefined rules to apply the appropriate standardization for the data set.

Match stages overview

Data matching finds records in a single data source or independent data sources that refer to the same entity (such as a person, organization, location, product, or material) even if there is no predetermined key.

To increase its usability and completeness, data can be consolidated or linked along any relationship, such as a common person, business, place, product, part, or event. You can also use matching to find duplicate entities that are caused by data entry variations or account-oriented business practices.

During the data matching stage, WebSphere QualityStage takes these actions:

• Identifies duplicate entities (such as customers, suppliers, products, or parts) within one or more data sources

• Creates a consolidated view of an entity according to business rules

• Provides householding for individuals (such as a family or group of individuals at a location) and householding for commercial entities (multiple businesses in the same location or different locations)

• Enables the creation of match groups across data sources that might or might not have a predetermined key

• Enriches existing data with new attributes from external sources such as credit bureau data or change of address files
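As a simplified picture of householding, the Python sketch below groups individuals who share a surname and an address. It uses an exact, deterministic key (`household_key`, a hypothetical helper) rather than the probabilistic matching the product is described as using, so it should be read as an illustration of the grouping result, not of the matching method.

```python
from collections import defaultdict

def household_key(rec):
    """Hypothetical key: normalized surname plus whitespace-normalized address."""
    surname = rec["surname"].strip().upper()
    address = " ".join(rec["address"].upper().split())
    return (surname, address)

def households(records):
    """Group record ids that share the same household key."""
    groups = defaultdict(list)
    for rec in records:
        groups[household_key(rec)].append(rec["id"])
    return dict(groups)

people = [
    {"id": 1, "surname": "Lee", "address": "12 Oak  Street"},
    {"id": 2, "surname": "LEE ", "address": "12 oak street"},
    {"id": 3, "surname": "Lee", "address": "90 Elm Street"},
]

print(households(people))
```

Records 1 and 2 fall into one household despite differences in case and spacing, while record 3 shares the surname but lives at a different address and stays separate.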

Match frequency stage

The Match Frequency stage gives you direct control over the disposition of generated frequency data. This stage provides results that can be used by the Match Designer and match stages, but enables you to generate the frequency data independent of running the matches.

Figure 41. Standardize rule process

You can generate frequency information by using any data that provides the fields that are needed by a match. Then you can let the generated frequency data flow into a match stage, store it for later use, or both. Figure 42 shows how Standardize stage and Match Frequency stage are added in the Designer client.

In this example, input data is being processed in the Standardize stage with a rule set that creates consistent formats. The data is then split into two data streams. One stream passes data to a standard output and the other passes data to the Match Frequency stage.
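Generating frequency data separately from matching can be pictured with the short Python sketch below. The function `field_frequencies`, the sample fields, and the JSON storage format are assumptions for illustration only. The actual output format of the Match Frequency stage is not described in this text.

```python
from collections import Counter
import json

def field_frequencies(records, fields):
    """Count how often each value occurs in each field needed by a match."""
    freqs = {f: Counter() for f in fields}
    for rec in records:
        for f in fields:
            freqs[f][rec[f]] += 1
    return freqs

records = [
    {"surname": "SMITH", "city": "BOSTON"},
    {"surname": "SMITH", "city": "AUSTIN"},
    {"surname": "ZAPATA", "city": "BOSTON"},
]

freqs = field_frequencies(records, ["surname", "city"])

# The counts can flow straight into a matching step, be stored for later, or both.
stored = json.dumps({f: dict(c) for f, c in freqs.items()}, sort_keys=True)
print(stored)
```

Frequency counts like these are commonly used in probabilistic matching so that agreement on a rare value (ZAPATA) can count for more than agreement on a common one (SMITH). Whether the product weights values exactly this way is not stated here.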

Match stage

Matching is a two-step process: first you block records and then you match them.

Blocking step

Blocking identifies subsets of data in which matches can be more efficiently performed. Blocking limits the number of record pairs that are being examined, which increases the efficiency of the matching.

To understand the concept of blocking, consider a column that contains age data. If there are 100 possible ages, blocking partitions a source into 100 subsets. The first subset is all people with an age of zero, the next is people with an age of 1, and so on. These subsets are called blocks.
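The effect of blocking on the number of comparisons can be shown with a small Python sketch that follows the age example. The helper names `block_by` and `candidate_pairs` and the sample data are invented for illustration.

```python
from collections import defaultdict
from itertools import combinations

def block_by(records, key):
    """Partition records into blocks that share the same value for `key`."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[rec[key]].append(rec)
    return blocks

def candidate_pairs(blocks):
    """Yield record pairs only from within the same block."""
    for members in blocks.values():
        yield from combinations(members, 2)

people = [
    {"id": 1, "age": 30}, {"id": 2, "age": 30},
    {"id": 3, "age": 41}, {"id": 4, "age": 41}, {"id": 5, "age": 41},
    {"id": 6, "age": 7},
]

blocks = block_by(people, "age")
pairs = list(candidate_pairs(blocks))
all_pairs = len(people) * (len(people) - 1) // 2

print(len(pairs), "pairs to compare with blocking,", all_pairs, "without")
```

With six records, comparing every record against every other takes 15 comparisons, while blocking on age leaves only 4. The trade-off is that two records for the same person with different recorded ages would never be compared under this single blocking scheme.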
