Descripción de los módulos de Perl y las funciones más importantes

2. CAPÍTULO DESCRIPCIÓN Y ANÁLISIS DE LA SOLUCIÓN PROPUESTA

2.7. Descripción de los módulos de Perl y las funciones más importantes

One of the key points of the exploratory analysis is the regression analysis.

As discussed in the Theoretical background chapter, some authors state that high correlation between the voter turnout and the winning party share points to the electoral fraud (Myagkov et al, 2009, Mebane and Kalinin, 2009, Klimek et al, 2012, Sonin, 2012). In most cases, they investigate this correlation working only with the entire country without breaking the data into subsets for different geographic regions.

The first step to do for an overview is to build a correlation coefficient matrix for the entire dataset. It is shown below:

Voter turnout Conservative Liberal New Democratic Voter turnout 1.00000000 0.07187695 -0.02489927 -0.1067010 Conservative 0.07187695 1.00000000 -0.22397221 -0.7158480 Liberal -0.02489927 -0.22397221 1.00000000 -0.3356734 New Democratic -0.10670100 -0.71584805 -0.33567338 1.0000000

Table 1. Matrix of correlation coefficients for the main variables (Canada, 2011).

It is obvious that correlation between different party shares is negative because these are the percentages from the entire amount of voters. The more votes one party gets, the less votes are left for the others. The brightest example is -0.7 between Conservative and New Democratic parties. This is somewhat natural, and in fact, the only relationship that is less natural is the relationship between voter turnout and party shares. It deserves a special investigation.

At the entire country level (without any aggregation) it is better to build a density scatterplot because there are too many observations for a typical scatterplot. We have created such scatterplots for the main parties, and in most cases they demonstrate the smooth bivariate distribution with a single hot spot in the center and the density falling towards its periphery. The only exclusion is the scatterplot for Conservative party share in 2011 (Figure 15a) which looks the same as one in Klimek et al, 2011:

Conservative Liberal New Democratic

Figure 15. Voter turnout against party shares for all polling divisions (Canada, 2011).

Klimek et al state that a smaller area at the bottom stands for French Canada (Quebec province) and a larger area on top is for English Canada (all the rest Canadian provinces and territories). This assumption was checked by looking at province-level results which are published as well: “Looking at their results by province, they [Conservatives] tallied 16.5% of votes cast in Quebec but more than 40% of votes cast in 8 of the remaining 12 other provinces.” (Klimek et al, 2011).

This can be enough but since we have defined the relationship between the provinces and polling divisions, we could visualize this on the same scatterplot. To do this, we are plotting semi-transparent white points above the existing graph for the selected provinces: Ontario (the largest English-speaking province) and Quebec:

Ontario Quebec

Figure 16. Voter turnout against Conservative party share at polling division level (entire country, Canada, 2011), combined with point clouds and convex hulls

for selected Canadian provinces.

The plots above confirm the given statement. We can see that the areas of higher concentration of white points are located on respective hot spots of the density scatterplot. At the same time, they give more information: we can observe that even though the points are highly concentrated, there are the outliers which are very different from the main pattern. Convex hull shows the character of the local bivariate variance and how does it match with the global bivariate variance.

This is a very important outcome because if we are looking for data irregularities to detect fraud, we should take the presence of such outliers into account. In Annex II, the remaining scatterplots can be found.

Using the same method, we could build the graph for any geographic unit, whether it is an urban municipality or an electoral district. Of course, it is hard to give estimation of each graph (i.e. 308 graphs for the electoral districts). Instead, we did the following:

 get correlation coefficient for each geographic unit and draw a histogram with their distribution, along with the plot for correlation coefficient and the number of observations;

 calculate the area of the convex hull for each territory unit and divide it by the area of the convex hull for the entire dataset. If the ratio is closer to one, the variability of voter turnout and selected party share combinations is close to that for the entire country. On the other hand, values closer to zero mean similar behavior within the unit.

Correlation coefficients for Conservative party share and voter turnout in 2011 look like shown below:

a) distribution of local correlation coefficients;

b) scatterplot of correlations coefficient and the number of observations;

c) distribution of convex hull area ratios;

d) scatterplot of convex hull area ratios and the number of observations.

Administrative districts

Urban municipalities

Electoral districts

a) b) c) d)

Figure 17. Distribution of the correlation coefficients for voter turnout and Conservative party share at main aggregation levels (Canada, 2011).

The remaining graphs are available in a digital annex. From all graphs, the outcomes are stated below:

 Though the global correlation coefficient might be very small, there can be large local coefficients. For example, the global coefficient for Conservative party share in 2011 is 0.07, while the local values for the electoral districts can go above 0.5;

 Correlation coefficients distribution changes between years but still each party has its own main range where most of the coefficients fall into:

o Conservative party: -0.1 – 0.5, o Liberal party: -0.3 – 0.3,

o New Democratic party: -0.5 – 0.2;

 Unlike variability indices, correlation coefficients for turnout and party shares do not follow the overall party success or failure in time;

 As opposed to variability indices, there is no relationship between the correlation coefficients and the number of observations at any aggregation level.

Least-squares equation lines on all graphs are mostly horizontal;

 Relationship between the convex hull area ratios and the number of observations is very weak. It means that variability of voter turnout and party shares does not depend of territory population. For example, New Brunswick province demonstrates the most similar results within itself, while Saskatchewan province, having the same population, has much higher variability;

 For the electoral districts, convex hull area ratio higher than 0.4 is extremely rare. This means that each electoral district has its own set of combinations of voter turnout and party shares but this set is always not as full as the entire country’s set;

 We can see that convex hull area ratios follow the global party share. Better is the result of the elections for a party, more dispersed is the behavior within the territory unit, and vice versa.

Everything we did before was done on the aggregated data, i.e. polling divisions data grouped into subsets according to some geographies. The result was a set of indicators, like correlation coefficients, etc. At the same time, it is necessary to summarize data variables within to the same geographies, i.e. have 1 value for each territory unit. For instance, by dividing the total amount of participating voters by the total amount of possible electors for each unit we get the summarized voter turnouts. Doing the same with party shares, we can estimate their regression. An example of such summarization for Conservative party share at the electoral district level in 2011 can be seen below:

a) distribution of summarized voter turnout values;

b) distribution of summarized party share values;

c) scatterplot of A against B.

Figure 18. Correlation between summarized voter turnout and Conservative party share (electoral districts, Canada, 2011).

We can see that summarized data does not demonstrate any correlation patterns in any years for any variables.

As a conclusion, we can confirm that there is no expressed relationship between voter turnout and party shares. Though we observed local correlation in some of the territory units, the strongest pattern is the independence between the specified variables. Another important outcome is that the electoral districts are the best aggregation level for the study because they contain similar number of observations. When the number of observations is very different, i.e. there are very small and very big cities, it leads to a statistical bias in the analysis. Thus, using the administrative boundaries is possible for the exploratory analysis but it is not recommended in spatial analysis that is boundary-sensitive.

In document Adaptacion del Modulo de Circulacion del Sistema Integrado de Gestion Bibliotecaria Koha a la Biblioteca Nacional de Cuba (página 60-70)