
The missingness pattern analysis described in section 6.2 defined the TCD missing data problem and led to the conclusion that all further efforts should concentrate on imputing the missing financial values within the SME and LARGE Firm categories. The investigations described in section 6.3 then led to the hypothesis that removing all Firms containing one or more financial outlier values might improve imputation accuracy, and hence to the implementation of a method for removing the outlier Firms, which proved to be very effective. Finally, the work described in section 6.4 led to the implementation of the EM and NN algorithm modifications needed for the imputation of the missing TCD financial values.

Thus, the stage has been set for the description of the experiments which were designed to find the most accurate methods for imputing the missing financial values. The following two sections give a detailed description of these experiments.

Description of the 48 financial variable imputation experiments

Twelve sets of four imputation experiments were performed. The same TCD financial variable was imputed for each set of four experiments. For example, the list below describes the set of experiments that were performed for the SME Firm Payroll variable.

1. Imputation using the EM algorithm with outlier Firms retained.

2. Imputation using the EM algorithm with outlier Firms deleted.

3. Imputation using the NN algorithm with outlier Firms retained.

4. Imputation using the NN algorithm with outlier Firms deleted.

The objective was to discover which of these four imputation methods would produce the most accurately imputed SME Payroll values. The same set of four experiments was performed for each of the following twelve variables:

• SME Firm variables: Sales, Payroll, Depreciation, DirectorPay, NetWorth, PBT

• LARGE Firm variables: Sales, Payroll, Depreciation, DirectorPay, NetWorth, PBT

Thus, 48 experiments were performed in total. Fifty consecutive executions of the required imputation algorithm (EM or NN) were performed for each of the 48 experiments. Hence, 2,400 executions of the imputation algorithms (1,200 for EM and 1,200 for NN) were performed. For each execution of the EM and NN algorithms, a small proportion of the known values was randomly deleted and “put back”, using the procedure given in Fig. 4.5. That is, the proposed imputation evaluation method was executed 48 times, using 50 iterations per execution. The following two sections describe the 48 experiments in more detail.
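The experiment counts above can be checked with a short enumeration. The sketch below is illustrative only: the dictionary keys and constant names are assumptions introduced here, not identifiers from the TCD software.

```python
from itertools import product

# Design factors taken from the text; the names themselves are illustrative.
FIRM_CATEGORIES = ["SME", "LARGE"]
VARIABLES = ["Sales", "Payroll", "Depreciation", "DirectorPay", "NetWorth", "PBT"]
ALGORITHMS = ["EM", "NN"]
OUTLIER_POLICIES = ["retained", "deleted"]
EXECUTIONS_PER_EXPERIMENT = 50

# One experiment per (category, variable, algorithm, outlier policy) combination.
experiments = [
    {"category": c, "variable": v, "algorithm": a, "outliers": o}
    for c, v, a, o in product(FIRM_CATEGORIES, VARIABLES, ALGORITHMS, OUTLIER_POLICIES)
]

total_executions = len(experiments) * EXECUTIONS_PER_EXPERIMENT
executions_per_algorithm = {
    a: sum(1 for e in experiments if e["algorithm"] == a) * EXECUTIONS_PER_EXPERIMENT
    for a in ALGORITHMS
}

print(len(experiments), total_executions, executions_per_algorithm)
```

Twelve variables (six per Firm category) times four algorithm and outlier combinations gives the 48 experiments, and 50 executions of each gives the 2,400 algorithm executions quoted above.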

6.5.1 Definition of the EM Imputation Experiments

Tables 6.6 and 6.7 describe the EM algorithm imputation evaluation experiments that were performed for the SME Payroll variable. The same pair of experiments was repeated for all 12 of the SME and LARGE Firm financial variables. That is, the descriptions given in tables 6.6 and 6.7 hold for all of the TCD financial variables, the only difference being the proportion of missing values for each variable, as given in table 6.2, above.

Table 6.6 - Description of TCD imputation evaluation experiment 1 (EM retaining outlier Firms)

Imputation of SME Payroll values using 50 executions of the EM algorithm

EXPERIMENTAL QUESTION

Can the missing SME Payroll figures be accurately imputed using the EM imputation process described below?

Description of the missing value dataset

• All 61389 SME Firms were loaded into the data matrix from the TCD database, using SQL.

• The TCD columns loaded into the matrix were: Sales, Payroll, Depreciation, DirectorPay, NetWorth, PBT and Employees. All columns contained integer values only.

Variable to be imputed and evaluated

• The variable to be imputed and evaluated was Payroll.

• Payroll had 38724 missing values - i.e. 63.08% of the 61389 data matrix rows had missing values.

Imputation method used for the experiment

• Imputation was performed using the EM algorithm.

• The EM algorithm convergence value was 0.0001.

• Box-Cox power transformations were performed for all variables.

• The initial covariance matrix was created using all data matrix rows with a full set of known values.

• All imputed values were rounded to the nearest integer before estimating the predictive accuracy of the imputed values.

Imputation evaluation method

• 50 executions of the EM imputation algorithm were performed (using the options described above).

• No outlier Firms were removed from the matrix.

• 4.16% of the known Payroll values were randomly deleted and “put back” for each EM execution, using the Fig. 4.5 algorithm, with balanced random deletion across all missingness patterns.
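The delete-and-“put back” evaluation loop can be sketched as follows. This is a minimal illustration and not the Fig. 4.5 algorithm itself, which is not reproduced here: the function names are hypothetical, the imputer is a simple median stand-in rather than the EM algorithm, the error measure (mean absolute error) is an assumption, and the balancing of deletions across missingness patterns is omitted.

```python
import random
import statistics

def median_imputer(column):
    """Stand-in imputer (NOT the EM algorithm): fill gaps with the rounded median."""
    known = [v for v in column if v is not None]
    fill = round(statistics.median(known))
    return [fill if v is None else v for v in column]

def evaluate_imputation(column, impute_fn, fraction=0.0416, executions=50, seed=0):
    """Repeatedly hide a fraction of the known values, impute them, and score the result.

    Returns one mean absolute error per execution (the error measure is an assumption).
    """
    rng = random.Random(seed)
    known_idx = [i for i, v in enumerate(column) if v is not None]
    n_holdout = max(1, round(fraction * len(known_idx)))
    errors = []
    for _ in range(executions):
        holdout = rng.sample(known_idx, n_holdout)
        masked = list(column)          # work on a copy, so the true values are "put back"
        for i in holdout:
            masked[i] = None           # temporarily delete known values
        imputed = impute_fn(masked)
        errors.append(statistics.mean(abs(imputed[i] - column[i]) for i in holdout))
    return errors

# Toy column: every third entry is genuinely missing.
toy_column = [10 * i if i % 3 else None for i in range(1, 101)]
errors = evaluate_imputation(toy_column, median_imputer)
print(len(errors), statistics.mean(errors))
```

Because each execution masks a copy of the column, the true values remain available for scoring and are restored automatically before the next execution.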

Table 6.7 - Description of TCD imputation evaluation experiment 2 (EM deleting outlier Firms)

Imputation of SME Payroll values using 50 executions of the EM algorithm

EXPERIMENTAL QUESTION

How would deleting outlier Firms from the data matrix affect EM imputation of the missing SME Payroll figures?

This experiment was identical to the experiment described in table 6.6, except that all Firms (matrix rows) that contained any financial variable with a robust Z score outside the range ±4 were deleted from the data matrix. That is, 8251 of the 61389 Firms were deleted from the matrix prior to the first execution of the EM imputation process.
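The outlier deletion rule can be illustrated with a short sketch. The definition of the robust Z score is not given in this section, so the code assumes the common median/MAD form, 0.6745 × (x − median) / MAD; the function names and the handling of missing values (ignored when scoring) are likewise assumptions, not the TCD implementation.

```python
import statistics

def robust_z_scores(column):
    """Median/MAD-based Z scores (assumed definition); missing values stay None."""
    known = [v for v in column if v is not None]
    med = statistics.median(known)
    mad = statistics.median(abs(v - med) for v in known)
    if mad == 0:
        # Degenerate column: treat every known value as non-outlying.
        return [None if v is None else 0.0 for v in column]
    return [None if v is None else 0.6745 * (v - med) / mad for v in column]

def drop_outlier_rows(rows, threshold=4.0):
    """Keep only rows in which no known value has |robust Z| above the threshold."""
    z_columns = [robust_z_scores(col) for col in zip(*rows)]
    kept = []
    for r, row in enumerate(rows):
        if all(z[r] is None or abs(z[r]) <= threshold for z in z_columns):
            kept.append(row)
    return kept

# Toy matrix of (Sales, Payroll) rows; the 1000 in the fifth row is an obvious outlier.
rows = [(10, 5), (12, None), (11, 6), (13, 7), (1000, 6), (9, 5)]
kept_rows = drop_outlier_rows(rows)
print(kept_rows)
```

A single extreme value in any financial column is enough to remove the whole Firm, which matches the row-wise deletion described above.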

6.5.2 Definition of the Nearest Neighbour Imputation Experiments

Tables 6.8 and 6.9 describe the NN algorithm imputation evaluation experiments that were performed for the SME Payroll variable. The same pair of experiments was repeated for all 12 of the SME and LARGE Firm financial variables. That is, the descriptions given in tables 6.8 and 6.9 hold for all of the TCD financial variables, the only difference being the proportion of missing values for each variable, as given in table 6.2, above.

Table 6.8 - Description of TCD imputation evaluation experiment 3 (NN retaining outlier Firms)

Imputation of SME Payroll values using 50 executions of the NN algorithm

EXPERIMENTAL QUESTION

Can the missing SME Payroll figures be accurately imputed using the NN imputation process described below?

Description of the missing value dataset

• All 61389 SME Firms were loaded into the data matrix from the TCD database, using SQL.

• The TCD columns loaded into the matrix were: Sales, Payroll, Depreciation, DirectorPay, NetWorth, PBT, Employees, Easting, Northing and UKSIC_Category. All columns contained integer values (the UKSIC_Category contained integer representations of alphanumeric codes).

Variable to be imputed and evaluated

• The variable to be imputed and evaluated was Payroll.

• Payroll had 38724 missing values - i.e. 63.08% of the 61389 data matrix rows had missing values.

Imputation method used for the experiment

• Imputation was performed using the nearest neighbour algorithm described in section 3.2.3.

• The Euclidean distance was used to measure the similarity between Firms (data matrix rows).

• All variables except Easting and Northing were transformed to standard Z scores prior to imputation - so that each variable would carry equal weight in the Euclidean distance calculations.

• The Easting and Northing variables were divided by 100,000 just after they were loaded into the data matrix (see the explanation for this given in section 6.4.2).

• The search for each nearest neighbour was carried out within the UKSIC segment to which the recipient Firm (the Firm with a missing value) belonged - i.e. only those Firms in the same UKSIC segment as the recipient Firm were considered as potential donors (as explained in section 6.4.2).

Imputation evaluation method

• 50 executions of the NN imputation algorithm were performed (using the options described above).

• No outlier Firms were removed from the matrix.

• 4.16% of the known Payroll values were randomly deleted and “put back” for each NN execution, using the Fig. 4.5 algorithm, with balanced random deletion across all UKSIC segments.
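The segment-restricted nearest neighbour search used in this experiment can be sketched as below. This is a simplified illustration, not the modified NN algorithm of section 6.4.2: the record layout (`segment`, `features`, `Payroll`) and function names are hypothetical, the feature vectors are assumed to be already scaled (Z scores, with Easting and Northing divided by 100,000) and complete, and ties are broken by list order.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nn_impute(firms, target="Payroll"):
    """Return the target column with gaps filled from the nearest donor in the same segment.

    A recipient with no donor in its segment is left as None.
    """
    imputed = []
    for firm in firms:
        if firm[target] is not None:
            imputed.append(firm[target])
            continue
        # Only Firms in the recipient's UKSIC segment with a known value may donate.
        donors = [d for d in firms
                  if d["segment"] == firm["segment"] and d[target] is not None]
        if not donors:
            imputed.append(None)
            continue
        nearest = min(donors, key=lambda d: euclidean(d["features"], firm["features"]))
        imputed.append(nearest[target])
    return imputed

# Toy data: the third Firm is a recipient in segment A; the identical Firm in
# segment B is ignored because it belongs to a different segment.
firms = [
    {"segment": "A", "features": [0.0, 0.0], "Payroll": 100},
    {"segment": "A", "features": [1.0, 1.0], "Payroll": 200},
    {"segment": "A", "features": [0.9, 1.2], "Payroll": None},
    {"segment": "B", "features": [0.9, 1.2], "Payroll": 999},
    {"segment": "C", "features": [0.0, 0.0], "Payroll": None},
]
imputed_payroll = nn_impute(firms)
print(imputed_payroll)
```

Restricting donors to the recipient's segment means that a closer Firm in another segment is never used, which is the behaviour described in the bullet on UKSIC segments above.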

Table 6.9 - Description of TCD imputation evaluation experiment 4 (NN deleting outlier Firms)

Imputation of SME Payroll values using 50 executions of the NN algorithm

EXPERIMENTAL QUESTION

How would deleting outlier Firms from the data matrix affect NN imputation of the missing SME Payroll figures?

This experiment was identical to the experiment described in table 6.8, except that all Firms (matrix rows) that contained
