
There are a number of functions for extracting information from contingency tables and data frames. To apply a function to the rows or columns of a matrix, use apply(). Its first argument should be your matrix, the second argument should specify rows (1) or columns (2), and the third argument should specify the function to be used. The next examples show how to calculate row and column totals:

> apply(xt, 1, sum)
 hebben    zijn zijnheb
    212      15      58

> apply(xt, 2, sum)
irregular   regular
      142       143
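Because dverbs is not reproduced here, the following sketch rebuilds xt by hand from the counts printed in this section (the matrix() call is an assumption about how to recreate the table, not the original construction). It also checks the apply() totals against rowSums() and colSums(), which are shortcuts for the same computation.

```r
# Rebuild the contingency table by hand; counts are those shown in this section.
xt <- matrix(c(94, 12, 36, 118, 3, 22), nrow = 3,
             dimnames = list(Aux = c("hebben", "zijn", "zijnheb"),
                             Regularity = c("irregular", "regular")))

row_totals <- apply(xt, 1, sum)   # one total per auxiliary (rows)
col_totals <- apply(xt, 2, sum)   # one total per regularity class (columns)

# rowSums() and colSums() compute the same totals more directly
print(all.equal(row_totals, rowSums(xt)))
print(all.equal(col_totals, colSums(xt)))
```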

In order to convert the columns of our contingency table into by-row percentages, we need to divide each row by its row total. This can be accomplished with sweep(), which needs a matrix, a 1 or 2 for row or column manipulations, a number or vector to work with, and the operation to be carried out (the default operation is subtraction).

> sweep(xt, 1, apply(xt, 1, sum), "/")
         Regularity
Aux       irregular   regular
  hebben  0.4433962 0.5566038
  zijn    0.8000000 0.2000000
  zijnheb 0.6206897 0.3793103
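As a cross-check on the sweep() call, the sketch below rebuilds xt by hand from the counts in this section (an assumption, since dverbs is not available here) and compares the result with prop.table(), which computes the same row proportions. Note that the values are proportions; multiplying by 100 gives percentages.

```r
# Counts copied from the tables in this section
xt <- matrix(c(94, 12, 36, 118, 3, 22), nrow = 3,
             dimnames = list(Aux = c("hebben", "zijn", "zijnheb"),
                             Regularity = c("irregular", "regular")))

# Divide every row by its own total
row_prop <- sweep(xt, 1, apply(xt, 1, sum), "/")

print(rowSums(row_prop))                         # each row should sum to 1
print(all.equal(row_prop, prop.table(xt, 1)))    # prop.table() with margin 1 does the same
print(round(100 * row_prop, 1))                  # the same table as percentages
```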

Because there is one row total per row, each element of a row is divided by the total for that row. More generally, sweep() applies whatever vector of numbers you supply; for column operations on this table, that vector must have length 2. For instance, if we use colMeans() to obtain the column means, we can center the counts around zero with

> sweep(xt, 2, colMeans(xt)) # default is "-"
         Regularity
Aux       irregular   regular
  hebben   46.66667  70.33333
  zijn    -35.33333 -44.66667
  zijnheb -11.33333 -25.66667
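A quick way to convince yourself that the centering worked is to check that the column means of the result are zero. The sketch below does this on a hand-built copy of xt (counts taken from this section) and, as an aside, compares the result with scale(), a base R function that can also center columns.

```r
# Counts copied from the tables in this section
xt <- matrix(c(94, 12, 36, 118, 3, 22), nrow = 3,
             dimnames = list(Aux = c("hebben", "zijn", "zijnheb"),
                             Regularity = c("irregular", "regular")))

# Subtract each column mean from its column (subtraction is the default operation)
centered <- sweep(xt, 2, colMeans(xt))

print(colMeans(centered))   # should be zero up to rounding error

# scale() with scale = FALSE only centers; it stores the means as an extra
# attribute, so attributes are ignored in the comparison
print(all.equal(centered, scale(xt, scale = FALSE), check.attributes = FALSE))
```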

Another important function is tapply(), which applies a specified function to all subsets of observations determined by a factor or combinations of factors. Its first argument is the vector of observed counts or measurements. Its second argument is a factor, or a list of two or more factors, that specifies how to split the vector of observations into subsets corresponding to (combinations of) factor levels. Its third argument is the function to be applied. The following call to tapply() produces the variances for each combination of the levels of Aux and Regularity:

> tapply(dverbs$nSynV, list(dverbs$Aux, dverbs$Regularity), var)
        irregular   regular
hebben   7.725578 4.3656381
zijn    22.265152 0.3333333
zijnheb 13.714286 8.0519481
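To see what tapply() does cell by cell, it can help to run it on a data set small enough to check by hand. The data frame toy below is made up for illustration (its values are not taken from dverbs); only the column names mirror those used above.

```r
# Made-up data for illustration only; these are NOT the dverbs values.
toy <- data.frame(
  Aux        = factor(rep(c("hebben", "zijn"), each = 4)),
  Regularity = factor(rep(rep(c("irregular", "regular"), each = 2), times = 2)),
  nSynV      = c(2, 4, 1, 5, 3, 3, 6, 10)
)

# One variance per combination of Aux and Regularity
cell_var <- tapply(toy$nSynV, list(toy$Aux, toy$Regularity), var)
print(cell_var)

# The same value computed by hand for a single cell
by_hand <- var(toy$nSynV[toy$Aux == "hebben" & toy$Regularity == "irregular"])
print(by_hand)
```

Each cell of the result corresponds to one subset of the observations, so any single cell can be verified with ordinary logical subsetting as in the last two lines.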

Tables are not restricted to two dimensions, as shown by the following example in which we break this classification down by median (50% quantile) orthographic length:

> tab = tapply(dverbs$nSynV, list(dverbs$Aux, dverbs$Regularity,
+   dverbs$oLength <= median(dverbs$oLength)), mean)
> tab
, , FALSE

        irregular  regular
hebben   3.261905 2.875000
zijn     2.500000 2.500000
zijnheb  4.541667 5.166667

, , TRUE

        irregular  regular
hebben   4.461538 3.217949
zijn     6.333333 3.000000
zijnheb  6.916667 4.000000
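The result of such a call is a three-dimensional array whose third dimension is labeled "FALSE" and "TRUE". The sketch below illustrates this on a small made-up data frame (toy is hypothetical and its values are not from dverbs) and shows how single subtables and single cells can be extracted by name.

```r
# Made-up data for illustration only; these are NOT the dverbs values.
toy <- data.frame(
  Aux        = factor(rep(c("hebben", "zijn"), each = 4)),
  Regularity = factor(rep(rep(c("irregular", "regular"), each = 2), times = 2)),
  nSynV      = c(2, 4, 1, 5, 3, 3, 6, 10),
  oLength    = c(4, 8, 4, 8, 4, 8, 4, 8)
)

# Third grouping factor: is the word at most as long as the median length?
is_short <- toy$oLength <= median(toy$oLength)

tab <- tapply(toy$nSynV, list(toy$Aux, toy$Regularity, is_short), mean)

print(dim(tab))                       # rows, columns, and two layers
print(tab[, , "TRUE"])                # subtable for the shorter words
print(tab["zijn", "regular", "TRUE"]) # a single cell, addressed by name
```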

Note that you can access, for instance, the lower subtable of tab with tab[,,"TRUE"]. Finally, here is an example of how you can convert a frequency table like

> t3 = table(dverbs[,c("Regularity", "Aux")])
> t3
           Aux
Regularity  hebben zijn zijnheb
  irregular     94   12      36
  regular      118    3      22

into a data frame with the same information. This will be useful later on, as some modeling functions require the data frame format. First note that we can use lapply() to create a list of the factors and their levels:

> lapply(dverbs[,c("Regularity", "Aux")], levels)
$Regularity
[1] "irregular" "regular"

$Aux
[1] "hebben"  "zijn"    "zijnheb"

We feed the output of lapply() into expand.grid(), which creates a data frame with all combinations of the levels of the two factors:

> t3c = expand.grid(lapply(dverbs[,c("Regularity", "Aux")],
+   levels))
> t3c
  Regularity     Aux
1  irregular  hebben
2    regular  hebben
3  irregular    zijn
4    regular    zijn
5  irregular zijnheb
6    regular zijnheb

As a final step, we add the counts to this data frame:

> t3c$Freq = as.vector(t3)
> t3c
  Regularity     Aux Freq
1  irregular  hebben   94
2    regular  hebben  118
3  irregular    zijn   12
4    regular    zijn    3
5  irregular zijnheb   36
6    regular zijnheb   22
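The manual route above makes the logic explicit: expand.grid() varies the first factor fastest, which matches the column-by-column order in which as.vector() unrolls the table. As a sketch, the code below rebuilds t3 by hand from the printed counts (an assumption, since dverbs is not available here) and compares the manual result with as.data.frame(), which performs the same conversion for table objects in one step.

```r
# Rebuild the frequency table by hand; counts are those printed above.
t3 <- as.table(matrix(c(94, 118, 12, 3, 36, 22), nrow = 2,
                      dimnames = list(Regularity = c("irregular", "regular"),
                                      Aux = c("hebben", "zijn", "zijnheb"))))

# Manual conversion, following the steps in the text
t3c <- expand.grid(dimnames(t3))   # all combinations of the factor levels
t3c$Freq <- as.vector(t3)          # counts, unrolled column by column

# One-step conversion of a table object
auto <- as.data.frame(t3)

print(t3c)
print(all(as.character(t3c$Regularity) == as.character(auto$Regularity)))
print(all(as.character(t3c$Aux) == as.character(auto$Aux)))
print(all(t3c$Freq == auto$Freq))
```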

2.3.1 Interim summary

cheating picky rank tests       jitter()
arithmetic functions            abs()
distribution tests              ks.test() shapiro.test()
tests for means                 wilcox.test() t.test()
tests for contingency tables    chisq.test() fisher.test()
functions for correlations      cor() cor.test()
linear models                   lm() lmsreg() anova() predict()
matrix functions                t()
functions for tables            apply() tapply() sweep() lapply() expand.grid()
density estimation              kde2d()
graphics                        persp()
distributions                   mvrnorm() rlnorm()

2.4 Problems

1. Is the frequency of the determiner the in the text of Multatuli (Figure 2.4) Poisson-distributed?

2. Recreate the upper right panel of Figure 2.5.

3. Make a plot of the normal(4,2) density and highlight the area representing the probability of a value between 3 and 4.

4. If the definite article the has a relative frequency of 0.07, what is the probability that it will be observed at least 15 times and at most 20 times in a text of 200 words?

5. Load DATA/data.ont.txt and test whether the acoustic length of the vowel of the prefix (in the column labeled lengteprefixklinker) is normally distributed.

6. If you look carefully at the scatterplot for the correlation of weight and size ratings in Figure 2.11, you can see that there are two denser areas. What kind of visualization technique would be most useful here to investigate this further?

7. Write a function that creates two independent normal variables with different means m1 and m2 and standard deviations s1 and s2, each with n observations. This function should produce a scatterplot, and return a list with as components (1) a data frame with the two vectors, and (2) the summary of a t-test comparing the means. Use this function to familiarize yourself with how the magnitudes of s1 and s2 affect the possibility of detecting a difference in the means, and with the role of n in this respect.

8. Use cor.test() to calculate the 99% confidence interval for the correlation of the weight and size ratings, and store these results in a vector ci99. You are not allowed to create this vector by inspecting the summary of cor.test(); instead, extract it from the output of cor.test(). Hint: begin with names(cor.test(...)).

9. The data for the upper right panel of Figure 2.11 are available in DATA/bivar.neg.txt. Fit a linear regression model to these data with B as the dependent variable, and check the p-value for the slope by means of the function pt(). Also compare the p-value in the summary of the linear model for the F-test with the p-value for the t-test that comes with cor.test().

10. Fit a quadratic regression model to the rating data for the subset of animals. Why is the model more hesitant about a frequency effect?

11. Three out of the 15 verbs that take zijn are regular, while 118 out of the 212 verbs that take hebben are regular. Test whether the proportions 0.2 and 0.557 are significantly different. (Hint: help.search("proportions"))

12. Create a version of Figure 1.6 in which the panels in the upper triangle of the scatterplot matrix present the data points with a smoothed nonparametric regression line, and in which the panels in the lower triangle present the corresponding Pearson and Spearman correlations. The help page for pairs() provides heaps of information on how to proceed.

13. Create a data frame from weight that specifies, for each individual word (in the column labeled weight$English), the columns FEnglish, FamEnglish and SynEnglish, as well as the mean rating Rating averaged over subjects (Subject).

14. Fit a linear regression model with mean rating as dependent variable and English family size as predictor, using the data frame you just created.

Chapter 3
