Since the principal analysis of the paper relies on three main variables (sales, labor and unionization), the working sample is restricted to establishments with complete information on all three. Because most large establishments in the sample are either unionized or non-unionized, which can bias the estimates, the sample is restricted to observations with at most 500 permanent workers. Given that the interest is in characterizing establishments in the private sector, establishments owned by the public sector (more than 50%) are excluded from the analysis. To avoid any bias caused by establishments that recently started their economic activities, the sample is restricted to establishments with at least 3 years of operation in the market. Finally, to avoid biases due to errors and inconsistencies within the data set itself, some minor edits of the data set are implemented.46 This reduces the working sample from 3,277 to 2,812 enterprises across the 6 countries.
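The restrictions above amount to a sequence of row filters. The following is a minimal sketch of that logic, not the code used in the study; all field names (`sales`, `labor`, `unionization`, `permanent_workers`, `public_share`, `years_operating`) are hypothetical and are not taken from the survey's codebook.

```python
from typing import Dict, List, Optional

# One establishment as a plain dictionary. Every key is a hypothetical
# name chosen for illustration only.
Establishment = Dict[str, Optional[float]]


def in_working_sample(e: Establishment) -> bool:
    """Apply the four restrictions stated in the text to one establishment."""
    # Complete information on the three main variables.
    complete = all(e[k] is not None for k in ("sales", "labor", "unionization"))
    return (
        complete
        and e["permanent_workers"] <= 500  # at most 500 permanent workers
        and e["public_share"] <= 50        # public ownership of more than 50% is excluded
        and e["years_operating"] >= 3      # at least 3 years of operation
    )


def restrict(sample: List[Establishment]) -> List[Establishment]:
    """Keep only establishments that satisfy every restriction."""
    return [e for e in sample if in_working_sample(e)]
```

The data edits mentioned in footnote 46 are case by case and are not represented in this sketch.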
In order to maintain a minimum level of consistency in the imputations across countries, the specification of the missing information process is kept constant across countries, except for characteristics of region and industry.47 Regarding other characteristics, the imputation model includes variables such as market competition, establishment ownership structure, infrastructure characteristics, production policies, investment in research and development (R&D) and physical capital, labor force characteristics and the level of unionization at the establishment level. All imputation models are estimated using the weights provided in the survey to obtain results that are representative at the national level. Given that the missingness across the variables of interest is assumed to follow an arbitrary pattern, iterative chained equations (ICE) are used to obtain imputed values given the observed data. While some of the literature suggests that 5-10 imputed samples are enough to obtain appropriate inferences (Rubin, 1987), others argue that some applications may need more imputations to obtain stable results (Horton & Lipsitz, 2001). Given the incidence of missing information, 50 imputed samples are used to produce the main results. Results using fewer imputations are also provided to show the stability of the results. Finally, following the literature, an examination of the imputed data suggests that 20 iterations for the burn-in period are sufficient to achieve convergence of the system (van Buuren, 2007).48
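To illustrate why the number of imputed samples matters, the sketch below implements the standard combining rules from Rubin (1987) together with his approximate relative-efficiency formula. The text does not state which pooling rule the study applies, so this is an assumption; the function names and the example value of 0.3 for the fraction of missing information are hypothetical.

```python
from statistics import mean, variance
from typing import Sequence, Tuple


def pool_rubin(estimates: Sequence[float],
               variances: Sequence[float]) -> Tuple[float, float]:
    """Combine m completed-data estimates using Rubin's (1987) rules.

    Returns the pooled point estimate and its total variance.
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need m >= 2 estimates with matching variances")
    q_bar = mean(estimates)        # pooled point estimate
    within = mean(variances)       # average within-imputation variance
    between = variance(estimates)  # between-imputation variance (divisor m - 1)
    total = within + (1 + 1 / m) * between
    return q_bar, total


def relative_efficiency(missing_info_fraction: float, m: int) -> float:
    """Approximate efficiency of m imputations relative to infinitely many."""
    return 1 / (1 + missing_info_fraction / m)
```

With an illustrative fraction of missing information of 0.3, `relative_efficiency(0.3, 5)` is about 0.943 and `relative_efficiency(0.3, 50)` is about 0.994, so the gain from additional imputations is modest but not negligible when missingness is substantial.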
46 In some instances, information such as wages, sales or costs is either too high or too low compared with other information for the same establishment and with other similar establishments, which can be interpreted as transcription errors. Depending on the case, the values were inflated or deflated (reducing the excess of zeroes) or set to missing.
47 The regions with major economic activity are selected for interviews in each country. The industry fixed effects correspond to ISIC codes 15-37 (ISIC Rev. 3.1). A complete list of the variables used in the imputation process can be found in Appendix C.
48 Appendix D provides a plot of the means and standard deviations of the main imputed variables, used to analyze the stability of the processes.
One cannot rule out the possibility that part of the information in the dataset is “missing not at random” (MNAR), that is, that missingness depends in part on unobserved and unmeasured characteristics, potentially introducing non-ignorable response bias. Graham et al. (1997) show that the sensitivity of results to the missing-data process is frequently small in the multiple imputation (MI) framework. Moreover, they indicate that even under such circumstances, the MI approach might provide better inferences than working only with observations that have complete reported data.
Table 10. Multiple Imputation Summary
| Variable | Method | Complete | Imputed | % Imputed | Total |
|---|---|---|---|---|---|
| Nr of workers in t-1 | PMM | 2623 | 189 | 6.7% | 2812 |
| Cost of labor as share of sales | PMM | 2563 | 249 | 8.9% | 2812 |
| Cost of electricity as share of sales | PMM | 2572 | 240 | 8.5% | 2812 |
| Cost of communications as share of sales | PMM | 2570 | 242 | 8.6% | 2812 |
| Cost of materials and inputs as share of sales | PMM | 2479 | 333 | 11.8% | 2812 |
| Cost of fuel as share of sales | PMM | 2441 | 371 | 13.2% | 2812 |
| Cost of transportation as share of sales | PMM | 2460 | 352 | 12.5% | 2812 |
| Cost of water as share of sales | PMM | 2408 | 404 | 14.4% | 2812 |
| Cost of rentals as share of sales | PMM | 2453 | 359 | 12.8% | 2812 |
| Log Nr of workers in t-1 | OLS | 2623 | 189 | 6.7% | 2812 |
| Log sales in t-1 | OLS | 2288 | 524 | 18.6% | 2812 |
| Log wages production workers | OLS | 2721 | 91 | 3.2% | 2812 |
| Log wages non production workers | OLS | 2589 | 223 | 7.9% | 2812 |
| Log capital (book value) | OLS | 1961 | 851 | 30.3% | 2812 |
| Log capital (market value) | OLS | 2346 | 466 | 16.6% | 2812 |
| Log materials and inputs | OLS | 2441 | 371 | 13.2% | 2812 |
| Log salaries | OLS | 2574 | 238 | 8.5% | 2812 |
Note: The complete set of variables and imputations is shown in Appendix C. OLS imputation uses linear predictions to obtain the imputed values. PMM is a predictive mean matching algorithm that imputes a missing value with the observed value of the closest observation, where closeness is measured using predicted means.
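The difference between the two methods named in the note can be sketched as follows. This is a deliberately simplified, deterministic illustration with a single predictor and a single nearest donor. It is not the study's imputation model, which uses many covariates within chained equations, and proper multiple imputation also adds random draws. The function names are hypothetical.

```python
from typing import List, Optional, Sequence, Tuple


def fit_line(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least squares for y = a + b * x with a single predictor."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    return my - b * mx, b


def impute(x: Sequence[float], y: Sequence[Optional[float]],
           method: str) -> List[float]:
    """Fill missing entries of y (given as None) using 'ols' or 'pmm'."""
    if method not in ("ols", "pmm"):
        raise ValueError("method must be 'ols' or 'pmm'")
    # Fit the regression on the observed cases only.
    observed = [(xi, yi) for xi, yi in zip(x, y) if yi is not None]
    a, b = fit_line([xi for xi, _ in observed], [yi for _, yi in observed])
    filled: List[float] = []
    for xi, yi in zip(x, y):
        if yi is not None:
            filled.append(yi)
            continue
        predicted = a + b * xi
        if method == "ols":
            # OLS imputation: use the linear prediction itself.
            filled.append(predicted)
        else:
            # PMM: borrow the observed value of the case whose predicted
            # mean is closest to this case's predicted mean.
            _, donor_y = min(observed,
                             key=lambda d: abs((a + b * d[0]) - predicted))
            filled.append(donor_y)
    return filled
```

Because PMM only ever returns values that were actually observed, it cannot produce out-of-range values such as negative cost shares, whereas a linear prediction can.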
Table 10 presents a summary of the imputations for some of the most important variables in the study. Capital, a fundamental variable in the analysis, has one of the highest incidences of missing information: 30.3% in the case of the book value of capital and 16.6% in the case of the hypothetical or market value. Among production costs, the costs of electricity and communications have the lowest missing rates (8.5% and 8.6%), while the costs of fuel and water have the highest rates of missing information (13.2% and 14.4%).
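The percentages quoted from Table 10 follow directly from the complete-case counts and the working sample of 2,812 establishments. A short check, with the helper name `imputed_share` being hypothetical:

```python
TOTAL = 2812  # size of the working sample reported in the text


def imputed_share(complete: int, total: int = TOTAL) -> float:
    """Percentage of observations imputed, rounded to one decimal place."""
    return round(100 * (total - complete) / total, 1)


# Complete-case counts copied from Table 10.
complete_counts = {
    "Log capital (book value)": 1961,
    "Log capital (market value)": 2346,
    "Cost of electricity as share of sales": 2572,
    "Cost of water as share of sales": 2408,
}

shares = {name: imputed_share(n) for name, n in complete_counts.items()}
```

For example, the book value of capital has 2,812 - 1,961 = 851 imputed observations, and 851 / 2,812 is approximately 30.3%.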