4.4.1 Stages of advanced validation
As with the previous two components of validation, advanced validation has three stages.
• The first stage concerns micro data and should be performed by Member States at the end of the collection stage, because they are responsible for conducting the surveys, even when Eurostat receives this type of data. It is crucial that statistical data analysis, and in particular outlier detection and correction, is carried out at this early stage of the production process. It is at this stage, and only at this stage, that problems can still be corrected by contacting the respondents without jeopardizing the dissemination deadlines. Respondents can confirm or change their information, and in many cases this is the best way to solve the problems that arise.
• The second stage concerns country data, i.e., the country aggregates of the micro data. If a Member State leaves problems undetected in these data, in particular outliers, Eurostat has to filter and clean the information received from Member States.
• The third stage concerns aggregate (Eurostat) data before their dissemination. The country data sets used for aggregation should ideally be complete and free from errors, but adjustments may still be needed. A final examination of the aggregates, in particular concerning outlier detection, therefore has to be performed before the information is sent for dissemination.
The statistical methods and tests discussed above can be used in any of the three validation stages, i.e., the methods are valid for any data type. Nevertheless, some methods are better suited than others, depending on the type of data and the level of aggregation.
4.4.2 Micro data
At the data collection stage, the detection and correction of problems should be carried out by Member States, even when Eurostat receives the micro data. Advanced validation should follow the main procedures and tests below, whose results must always lead to a decision on the steps needed to correct the problems found.
4.4.2.1 Advanced detection of problems
Being more elaborate and careful, advanced detection can uncover problems that are left undetected by other methods and procedures. We next discuss the main steps to carefully scrutinize the data.
• Data mining – Examining the main characteristics of the data through graphical displays and numerical measures and coefficients will most likely uncover the majority of the problems the data may have. With any statistical software, a standardized output can easily be set up so that the analyst only has to input the data (by typing or transferring them); the set of displays, numerical measures and any flagging of extreme values is then produced automatically for analysis. This is a very simple but extremely powerful approach and it is highly recommended. It should be used with each variable included in any survey. For example, in foreign trade, it should be used with the data on exports and imports, on their country and product breakdown, or on unit prices; concerning industrial output, it should be applied to the total output and to its breakdown by sector or by product.
• Detection of outliers – Empirical rules such as those mentioned in the above examples and actually used by MS in data checking should be avoided and replaced by more sophisticated and accurate statistical methods, namely:
– Classification of outliers – A simple procedure is the classification of moderate or severe outliers described above, possibly changing the values of 1.5 or 3 that multiply the IQR, according to the specific data at hand.
– Tukey’s algorithm – This is a similar but more elaborate approach, combining robust estimation methods with discordancy testing. Some of the parameters used may be changed according to the data. This is definitely a good procedure and very easy to use.
– Statistical tests – The tests described above should always be performed, even if other procedures are also applied, because they can be more powerful. They are also simple to apply. More than one test may be used, and some tests are likely to be more adequate than others for a given data set. Pre-testing and simulation experiments should be conducted for any given project. Eurostat should play an important role in the harmonization, development and implementation of a common battery of tests in Member States.
– Cluster analysis – As mentioned above, this is a powerful tool for outlier identification in multivariate data. It is also simple to use and is included in most statistical software. Data classification can also show several other features, possibly revealing other problems.
• Time series – A different perspective is adopted here. In fact, the analysis of the behaviour of the data through time may show some problems more clearly (outlier detection is one of them) or even show new problems. The only drawback is that usually time series models and methods are not automatic and not so simple to apply, requiring specific knowledge and the analyst’s intervention. This may cause some difficulties in our context, because of the large number of data sets and variables, i.e., the large number of time series, and the tight deadlines imposed. Thus, it is recommended that this approach is left for later stages of the validation process and is only used at the micro level in some special cases where strictly necessary.
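The classification of moderate and severe outliers mentioned above can be sketched in a few lines. This is only an illustrative sketch in Python, assuming the usual convention that a value is a moderate outlier when it lies more than 1.5 IQR beyond the nearer quartile and a severe outlier when it lies more than 3 IQR beyond it. The function name, the quartile interpolation rule and the unit prices are hypothetical and chosen for illustration; the two multipliers are parameters so that they can be changed according to the data at hand.

```python
from statistics import quantiles


def classify_outliers(values, moderate_k=1.5, severe_k=3.0):
    """Label each value 'ok', 'moderate' or 'severe' using IQR fences."""
    # Quartiles by linear interpolation (one of several common conventions).
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    labels = []
    for v in values:
        # Distance beyond the nearer quartile; zero for values inside the box.
        excess = max(q1 - v, v - q3, 0.0)
        if excess > severe_k * iqr:
            labels.append("severe")
        elif excess > moderate_k * iqr:
            labels.append("moderate")
        else:
            labels.append("ok")
    return labels


# Hypothetical unit prices reported by respondents (illustrative data only).
unit_prices = [10.2, 9.8, 10.5, 10.1, 9.9, 10.3, 12.5, 25.0]
labels = classify_outliers(unit_prices)
for price, label in zip(unit_prices, labels):
    print(price, label)
```

With these illustrative figures, the value 12.5 is labelled a moderate outlier and 25.0 a severe one. The labels only flag values for follow-up; the decision on how to correct them belongs to the error correction step.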
4.4.2.2 Error correction
When errors or other problems are detected in the micro data, they have to be corrected by Member States, even when Eurostat receives such data. It is very important that the correction is made at this early stage, i.e., at micro data level and as soon as possible after data collection. The main steps to be followed for this purpose are the following.
• Contact with the respondents – The best way to solve the detected problems is clearly to contact the respondents to correct or confirm the values provided. Therefore, this is the first step and great effort should be put into it. Alternative approaches are second-best solutions and should only be considered when this fails.
• Imputation of missing values – If the previous step fails, or does not lead to a solution in time, the values requiring correction have to be discarded from the data set, which generates missing values. For example, it is not possible to leave in the data set an observation that has been previously considered an outlier. Consequently, those missing values will have to be imputed with the methods of section 3.
It is then clear why Member States should be in charge of advanced validation of micro data: contacting the respondents is their task, and they can also perform imputation at this level much more efficiently and with better quality. It is also faster, which is very important because of the dissemination deadlines.
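The two correction steps can be illustrated with a short sketch: flagged values that the respondent confirms are kept, the remaining flagged values are discarded, and the resulting gaps are imputed. The function name and the data are hypothetical, and the median of the accepted values is used only as a placeholder imputation rule; in practice one of the methods of section 3 would take its place.

```python
from statistics import median


def discard_and_impute(values, flagged, confirmed):
    """Keep confirmed values, discard other flagged ones and impute them.

    `flagged` and `confirmed` are lists of positions in `values`.
    Returns the cleaned list and the positions that were imputed.
    """
    # Flagged values confirmed by the respondent are kept as reported.
    to_discard = set(flagged) - set(confirmed)
    accepted = [v for i, v in enumerate(values) if i not in to_discard]
    # Placeholder rule (assumption): impute with the median of accepted values.
    fill = median(accepted)
    cleaned = [fill if i in to_discard else v for i, v in enumerate(values)]
    return cleaned, sorted(to_discard)


# Hypothetical reported values; positions 6 and 7 were flagged as outliers,
# and the respondent confirmed only the value at position 6.
reported = [10.2, 9.8, 10.5, 10.1, 9.9, 10.3, 12.5, 25.0]
cleaned, imputed_idx = discard_and_impute(reported, flagged=[6, 7], confirmed=[6])
print(cleaned, imputed_idx)
```

Returning the imputed positions alongside the cleaned data makes it possible to keep a record of which values were replaced.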
4.4.3 Country data
Country aggregates received by Eurostat should already be validated at the micro level by the national statistical organizations. Nevertheless, some errors or problems can only be detected when data from the different countries are combined and therefore Eurostat should filter and solve those problems, consulting the country involved when possible.
The methods of advanced validation are the same as for micro data and therefore will not be repeated here, but it is important to note that they are easier to apply because the number of data points is much smaller. In fact, for each variable, the number of observations is the number of countries while for micro data it is the number of respondents. Therefore, statistical analysis is an easier task which also means that it may be even more careful and pay attention to several aspects that may have been ignored or overlooked in the previous stage because of the size of the data sets. For example, several of the plots mentioned above are easier to analyse or the number of possible outliers is much smaller.
Moreover, time series analysis is now more manageable and consequently it becomes a very powerful tool for data analysis and problem detection. In particular, ARIMA modelling, fitting decomposition models, and outlier testing and accommodation are easier and should in fact be tried.
When errors such as outliers are found in country data sets, Eurostat has to correct them, possibly after discussion with the national statistical organization involved, always keeping in mind the deadlines for dissemination. Discarding country data is not adequate because it would generate non-available values, providing no information on the countries concerned and preventing the computation of Eurostat aggregates. Consequently, those values have to be imputed, either with the methods of section 3 or with time series modelling, which can be extremely powerful and useful for this purpose because it can predict the missing observations with good accuracy. This approach is strongly recommended and should be applied.
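As a rough illustration of imputation from the history of a country series, the sketch below fills a discarded observation with the value of a linear trend fitted by least squares to the remaining observations. This is a deliberately simple stand-in for the time series models mentioned above (it is not an ARIMA or decomposition model), and the function name and the series are hypothetical.

```python
def impute_by_trend(series):
    """Fill None entries with the value of a least-squares linear trend."""
    # Use only the observed points (time index, value) to fit the trend.
    known = [(t, y) for t, y in enumerate(series) if y is not None]
    n = len(known)
    t_mean = sum(t for t, _ in known) / n
    y_mean = sum(y for _, y in known) / n
    sxx = sum((t - t_mean) ** 2 for t, _ in known)
    sxy = sum((t - t_mean) * (y - y_mean) for t, y in known)
    slope = sxy / sxx
    intercept = y_mean - slope * t_mean
    # Observed values are returned unchanged; gaps get the fitted value.
    return [intercept + slope * t if y is None else y
            for t, y in enumerate(series)]


# Hypothetical quarterly country series; None marks a discarded outlier.
country_series = [100.0, 102.0, 104.0, None, 108.0, 110.0]
filled = impute_by_trend(country_series)
print(filled)
```

A real application would choose the model after inspecting the series, for instance to account for seasonality, which a straight line cannot capture.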
4.4.4 Aggregate (Eurostat) data
Some problems may become apparent only when the national data sets are aggregated, although the previous two stages of advanced validation should leave few or no errors uncorrected. For example, an extreme value may be obtained for the aggregate of a given geographical zone as a result of combining the values of several countries that are very high (or low) but not clearly extreme, and that therefore passed unnoticed at the two previous validation stages, particularly at the country level.
The errors found have to be corrected at the country level and consequently we are back to the previous stage. In particular, imputation may be required if correction is not possible in time for dissemination. After this final validation stage is complete, country and Eurostat (aggregated) data are ready for dissemination.
4.4.5 Concluding remarks
The objective of advanced validation is to detect problems and errors undiscovered by other procedures and checks. In fact, the statistical methods included here are more elaborate, have good properties and have shown satisfactory performances in applied analysis. These methods can be used at any of the three validation stages: micro-data level, country level and aggregate (Eurostat) level. However, using time series analysis may not be manageable at the first stage and is thus recommended for the other two where it is a very powerful tool, although it requires moderate or large sample sizes. Moreover, the first stage should be carried out by Member States and the other two by Eurostat. It is very important that error detection and correction (particularly concerning outliers) is performed at the earliest stage possible, otherwise those problems may be amplified at later stages and have a serious negative impact on the quality of the data. The earlier the stage, the more accurate and efficient the correction can be, bringing substantial advantages for the timely dissemination of the data. When this process is complete, the data are hopefully error-free, especially free of outliers, with a significant improvement of the quality of published statistical information.
The performance of advanced validation methods can be assessed by comparing the corrected values with the corresponding revised data that will be obtained later. In particular, the detection and correction of outliers is especially relevant. To this end, accuracy measures such as the mean squared error may be calculated. It is also important to keep a record of the errors and in particular of the outliers detected in order to identify their sources and prevent the problems causing them in the future.
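The accuracy assessment described above can be sketched as follows: the mean squared error is the average of the squared differences between the corrected values and the revised data obtained later. The function name and the figures are hypothetical and serve only as an illustration.

```python
def mean_squared_error(corrected, revised):
    """Average squared difference between corrected and revised values."""
    if len(corrected) != len(revised):
        raise ValueError("both lists must have the same length")
    if not corrected:
        raise ValueError("at least one pair of values is required")
    return sum((c - r) ** 2 for c, r in zip(corrected, revised)) / len(corrected)


# Hypothetical corrected (imputed) values and the later revised figures.
corrected = [10.2, 106.0, 98.5]
revised = [10.0, 105.0, 98.5]
mse = mean_squared_error(corrected, revised)
print(round(mse, 4))
```

A smaller value indicates that the corrections made during validation were closer to the figures eventually reported.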
REFERENCES
Barnett, V. and Lewis, T. (1995). Outliers in Statistical Data. 3rd ed., John Wiley and Sons, Chichester, West Sussex.
Everitt, B.S., Landau, S. and Leese, M. (2001). Cluster Analysis. Arnold, London.
Fellegi, I.P. and Holt, D. (1976). A systematic approach to automatic edit and imputation, Journal of the American Statistical Association, 71, 17-35.
Heiberger, R.M. and Holland, B. (2004). Statistical Analysis and Data Display. Springer, New York.
Lehtonen, R. and Pahkinen, E. (2004). Practical Methods for Design and Analysis of Complex Surveys. 2nd ed., John Wiley and Sons, Chichester, West Sussex.
Little, R.J.A. and Rubin, D. B. (2002). Statistical Analysis with Missing Data. 2nd ed., John Wiley and Sons, Hoboken, NJ.
Makridakis, S.G., Wheelwright, S.C. and Hyndman, R.J. (1998). Forecasting: Methods and Applications. 2nd ed., John Wiley and Sons, New York.
Eurostat Internal Document (2000). Imputation – Overview of Methods with Examples of Procedures Used in Eurostat. Unit A4 (Research and Development, Methodology and Data Analysis), Eurostat, Luxembourg.
Pena, D., Tiao, G.C. and Tsay, R.S. (2001). A Course in Time Series Analysis. John Wiley and Sons, New York.
Rubin, D. B. (2004). Multiple Imputation for Nonresponse in Surveys. 2nd ed., John Wiley and Sons, Hoboken, NJ.
Wei, W.W.S. (2006). Time Series Analysis – Univariate and Multivariate Methods. 2nd ed., Addison-Wesley, New York.