

The implementation of the EM algorithm described above does not utilise the information content within the fully observed UKSIC_Category, Easting and Northing variables (for the reasons given in the preceding section). However, preliminary experiments revealed that the accuracy of the imputed values increased when these three variables were utilised by the NN algorithm. The question was: would the NN algorithm produce more accurately imputed financial values than the EM algorithm because it utilised this additional information?

The NN algorithm given in Fig. 3.4 was used to perform the imputation process. This algorithm imputes the missing financial value in a particular Firm (matrix row) $F_m$ by taking a copy of the known value from the closest donor Firm $F_d$, such that:

Imputed financial value $= F_{mc} = F_{dc}$

where $c$ is the matrix column with the missing financial value (to be imputed), $m$ is the matrix row index of the Firm with the missing value, and $d$ is the matrix row index of the Firm containing the donor value.

The closest donor Firm $F_d$ is found by comparing $F_m$ with all of the other Firms in the same UKSIC_Category as $F_m$, and using the Firm that returns the smallest value of the multivariate Euclidean distance function $\mathrm{dist}(F_m, F_d)$ as the donor, i.e. finding the minimum value of $\mathrm{dist}(F_m, F_d)$ for all $F_d \in R$, where:

$$\mathrm{dist}(F_m, F_d) = \sqrt{\sum_{j \in S} \left(F_{mj} - F_{dj}\right)^2} \quad \text{for all } F_d \in R \qquad (6.3)$$

where $R = \{F_1, \ldots, F_k\}$ is the set of all Firms in the same UKSIC_Category as $F_m$; $d = 1$ to $k$ ($d \neq m$) indexes the Firms (matrix rows) in the set $R$; and $j \in S$ indexes the matrix columns that have known values in both $F_m$ and $F_d$, i.e. using the matrix row comparison method defined in Fig. 3.1.
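The donor search of equation (6.3) can be sketched in Python as follows. This is a minimal illustration and not the algorithm of Fig. 3.4 itself: the function name `nn_impute`, the use of NaN to mark missing values, and the rule that a candidate donor must hold a known value in column c are all assumptions made for the example, and the small matrix holds hypothetical values.

```python
import numpy as np

def nn_impute(X, categories, m, c):
    """Return (donor row index, donor value) for the missing entry X[m, c].

    Assumption: missing values are stored as NaN, and a candidate donor
    must have a known value in column c.
    """
    best_d, best_dist = None, np.inf
    for d in range(X.shape[0]):
        # Only other Firms in the same UKSIC category belong to the set R.
        if d == m or categories[d] != categories[m]:
            continue
        if np.isnan(X[d, c]):
            continue
        # S: the columns with known values in both rows.
        shared = ~np.isnan(X[m]) & ~np.isnan(X[d])
        if not shared.any():
            continue
        dist = np.sqrt(np.sum((X[m, shared] - X[d, shared]) ** 2))
        if dist < best_dist:
            best_d, best_dist = d, dist
    if best_d is None:
        return None, np.nan  # no usable donor in this category
    return best_d, X[best_d, c]

# Hypothetical data: row 0 is missing its value in column 2.
X = np.array([
    [1.0, 2.0, np.nan],
    [1.5, 2.0, 10.0],
    [5.0, 9.0, 20.0],
    [1.0, 2.0, 99.0],
])
categories = ["803", "803", "803", "804"]

donor, value = nn_impute(X, categories, m=0, c=2)
```

Row 3 matches row 0 exactly on the shared columns but is skipped because it lies in a different category, so row 1 becomes the donor.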

Searching for donor Firms within the most suitable UKSIC categories

The method used to decide whether a set of Firms were in the same UKSIC category (deciding which $F_d \in R$) requires further explanation, because the Firms within the TCD dataset can be segmented at five different levels of UKSIC granularity. The lowest level of granularity (level 1) creates the smallest number of segments and the highest level (level 5) creates the largest number of segments, as shown in table 6.5, below.

Table 6.5 - Representation of the Education / Health & Social Work UKSIC categories in the TCD

The first five columns give the UKSIC segmentation levels 1 to 5, with the number of segments created at each level in parentheses. The remaining columns give the UKSIC category details as stored in the TCD database.

| 1 (1) | 2 (2) | 3 (7) | 4 (12) | 5 (15) | UKSIC code | UKSIC category description | Number of Firms |
|---|---|---|---|---|---|---|---|
| 8 | 0 | 0 | 0 | 0 | 80000 | Education | 58 |
| 8 | 0 | 2 | 1 | 0 | 80210 | General secondary education | 27,997 |
| 8 | 0 | 2 | 2 | 0 | 80220 | Technical and vocational secondary education | 183 |
| 8 | 0 | 3 | 0 | 1 | 80301 | Sub-degree level higher education | 76 |
| 8 | 0 | 3 | 0 | 2 | 80302 | First-degree level higher education | 4,563 |
| 8 | 0 | 4 | 2 | 1 | 80421 | Activities of private training providers | 11,359 |
| 8 | 0 | 4 | 2 | 9 | 80429 | Other adult and other education not elsewhere classified | 25,691 |
| 8 | 5 | 1 | 1 | 0 | 85110 | Hospital activities | 5,287 |
| 8 | 5 | 1 | 1 | 3 | 85113 | Nursing home activities | 3,003 |
| 8 | 5 | 1 | 2 | 0 | 85120 | Medical practice activities | 17,418 |
| 8 | 5 | 1 | 3 | 0 | 85130 | Dental practice activities | 11,308 |
| 8 | 5 | 1 | 4 | 0 | 85140 | Other human health activities | 30,574 |
| 8 | 5 | 2 | 0 | 0 | 85200 | Veterinary activities | 4,367 |
| 8 | 5 | 3 | 1 | 2 | 85312 | Non-charitable social work activities with accommodation | 19,034 |
| 8 | 5 | 3 | 2 | 2 | 85322 | Non-charitable social work activities without accommodation | 31,467 |

The TCD dataset contains ten level 1 UKSIC segments, numbered 0 to 9, and all of the UKSIC categories in segment number 8 are shown in table 6.5, above. Hence, if the dataset were segmented at level 1, then ten segments would be created and the search for each donor Firm would take place within the largest possible number of UKSIC categories. It follows that segmenting the Firms at level 5 should produce the most accurately imputed values, because each donor Firm would then be drawn from the most specific UKSIC category. However, preliminary experiments revealed that segmenting at level 3 produced the best results, because segmenting at levels 4 and 5 created several segments with 100% missing values, which meant that no donor Firms could be found for many of the missing values.
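The effect of the segmentation level can be illustrated with the fifteen UKSIC codes listed in table 6.5. The sketch below assumes that a level n segment is identified by the first n digits of the five-digit code, which is consistent with the segment counts given in the table header; the function name `segment` and the storage of codes as strings are choices made for the example.

```python
from collections import defaultdict

# The fifteen UKSIC codes listed in table 6.5.
codes = [
    "80000", "80210", "80220", "80301", "80302", "80421", "80429",
    "85110", "85113", "85120", "85130", "85140", "85200", "85312", "85322",
]

def segment(codes, level):
    """Group codes by their first `level` digits (assumed segment key)."""
    groups = defaultdict(list)
    for code in codes:
        groups[code[:level]].append(code)
    return dict(groups)

# Number of segments created at each of the five levels.
counts = [len(segment(codes, level)) for level in range(1, 6)]
print(counts)  # [1, 2, 7, 12, 15]

# At level 3, the donor search for a Firm coded 80301 covers both 8030x codes.
level3 = segment(codes, 3)
print(level3["803"])  # ['80301', '80302']
```

Coarser levels enlarge the pool of candidate donors for each Firm, while finer levels keep donors more similar but risk leaving a segment with no usable donor at all.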

However, the benefits of searching for donor Firms within the best UKSIC segments were somewhat reduced, for the following reasons. Firstly, some UKSIC categories had much larger proportions of missing data than others. Secondly, some of the Firms in the TCD had been placed in the wrong UKSIC categories by mistake. This was partly caused by placing Firms which could not be easily categorised into “catch all” UKSIC categories, such as the “Other adult and other education not elsewhere classified” category, shown in table 6.5.

Scaling the variables used for the Euclidean distance calculations

Equation (6.3) repeatedly measures the distance between two Firms in nine-dimensional Euclidean space, because nine of the ten variables given in table 6.1 are included in the $\mathrm{dist}(F_m, F_d)$ computations (the UKSIC_Category is excluded). To clarify, if only the Easting and Northing variables were included in the computation then equation (6.3) would find the geographically closest donor Firm in the same UKSIC_Category as the Firm with the missing value, i.e. by comparing all of the two-dimensional Euclidean distances, which can be easily visualised.

However, some of the nine variables included in the computation had much larger values than others (such as NetWorth) and these variables were having a disproportionate effect on the $\mathrm{dist}(F_m, F_d)$ values. In particular, the Employees variable was being “swamped” by the (much larger) financial variables, so that the number of Employees was having very little effect on the $\mathrm{dist}(F_m, F_d)$ results. This problem was solved as described below.
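The swamping effect is easy to reproduce with a toy example. The figures below are hypothetical and are not taken from the TCD; they only show how an unscaled Euclidean distance can prefer a donor whose NetWorth is close even though its number of Employees is very different.

```python
import math

def dist(p, q):
    """Unscaled Euclidean distance over the variables of two Firms."""
    return math.sqrt(sum((p[k] - q[k]) ** 2 for k in p))

# Hypothetical Firms: the target is missing some other financial value.
target = {"Employees": 10, "NetWorth": 1_000_000}
firm_a = {"Employees": 500, "NetWorth": 1_000_100}  # very different size
firm_b = {"Employees": 12, "NetWorth": 1_002_000}   # very similar size

d_a = dist(target, firm_a)
d_b = dist(target, firm_b)

# firm_a is "closer" although it employs fifty times as many people,
# because the NetWorth difference dominates the sum of squares.
print(d_a < d_b)  # True
```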

Firstly, the Employees variable and the six financial variables were scaled, so that they all carried the same weight in the distance calculations. This was achieved by transforming the variable values to their Z scores prior to executing the NN algorithm, as suggested by Manly (1986). This simple process noticeably improved the accuracy of the imputed financial values. Secondly, the Easting and Northing variables were divided by 100,000 just after they were loaded into the data matrix. This gave these variables about one tenth of the weight of the other variables, which proved to be very effective. That is, various weighting schemes were tried for the Easting and Northing variables using a trial and error approach and dividing by 100,000 seemed to produce the most accurately imputed financial values.
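The two scaling steps might be sketched as below. This is an illustration under stated assumptions and not the code used in the study: the function name `scale_matrix`, the column layout, the use of NaN for missing values, and the choice of the population standard deviation are all assumptions, and the sample rows are hypothetical.

```python
import numpy as np

def scale_matrix(X, zscore_cols, geo_cols, geo_divisor=100_000.0):
    """Return a scaled copy of X.

    zscore_cols: columns converted to Z scores (e.g. Employees, financials).
    geo_cols:    columns divided by geo_divisor (e.g. Easting, Northing).
    Assumption: missing values are NaN and are ignored when computing the
    mean and standard deviation; a constant column would divide by zero.
    """
    X = np.array(X, dtype=float)  # work on a copy
    mean = np.nanmean(X[:, zscore_cols], axis=0)
    std = np.nanstd(X[:, zscore_cols], axis=0)
    X[:, zscore_cols] = (X[:, zscore_cols] - mean) / std
    X[:, geo_cols] = X[:, geo_cols] / geo_divisor
    return X

# Hypothetical rows: Employees, NetWorth, Easting, Northing.
raw = [
    [10.0, 1000.0, 450000.0, 200000.0],
    [20.0, 3000.0, 460000.0, 210000.0],
    [30.0, np.nan, 470000.0, 220000.0],
]
scaled = scale_matrix(raw, zscore_cols=[0, 1], geo_cols=[2, 3])
```

After scaling, each Z-scored column has mean 0 and standard deviation 1 over its known values, while the grid coordinates shrink to values of a few units, so their differences between Firms contribute far less to the distance.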
