
The basic linear regression model presented in the previous section is appropriate for the simple case of numeric variables that display a linear relationship. However, as there are different types of variables and relationships, there are also different types of regression. There are three main types of variables: numeric, categorical and ordinal. Numeric variables correspond to quantitative values and are represented by real numbers. They are described as continuous variables, because they can take an infinite number of possible values. Speed and time measurements are examples of such variables. Categorical variables, on the other hand, can only take a fixed number of values, which are assigned on the basis of qualitative criteria. They are considered to be discrete variables. Eye colour is an example of such a variable, as it can only take a set number of values (e.g. brown, blue, green, grey), although taking heterochromia into account would make for a larger number of possibilities. Ordinal variables are somewhat in between numeric and categorical variables in that they correspond to categorical variables whose values can be ordered. For example, a satisfaction variable that can take the values “very good”, “good”, “bad” and “very bad” is ordinal because its values imply a notion of order.

In the typing speed example above, the variables are numeric and display a linear relationship, which means that a basic linear regression model can be used. In this dissertation, however, the dependent variable is a categorical one, namely whether an item is grammatical or lexical. This is why the type of regression that has to be used is a binary logistic one. A binary logistic regression is a type of regression in which the dependent variable has only two possible, mutually exclusive outcomes, such as pass/fail, alive/dead or win/lose. The aim is to predict which of these two outcomes is the more likely on the basis of a set of predictor (independent) variables. Note that these predictor variables can be a mixture of numeric and categorical variables. The outcome of the model is a score between zero and one, whose interpretation depends on the initial encoding of the dependent variable. If pass is encoded as one and fail is encoded as zero, then the score corresponds to the estimated probability of passing. A cut-off point of 0.5 can then be established, as any value above this point means that a pass is more likely than a fail. In terms of its equation, the basic principle is the same as for the linear regression, namely that it involves a dependent variable, (at least) one predictor variable and a corresponding number of parameters that are determined empirically. However, because a binary logistic regression deals with a categorical dependent variable with two outcomes and their probabilities, it uses the following logistic equation:
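In its standard form, and using the same symbols as the description that follows (a0 for the constant, a1 to ai for the coefficients of the predictors x1 to xi), the logistic equation can be written as:

$$
p(1 \mid X) = \frac{1}{1 + e^{-(a_0 + a_1 x_1 + a_2 x_2 + \dots + a_i x_i)}}
$$

An equivalent way of writing it is e^z / (1 + e^z), where z stands for the sum a0 + a1*x1 + ... + ai*xi.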

The element p(1|X) reads as “the probability of 1 given X”, in which 1 corresponds to the outcome encoded as 1 (e.g. pass). This equation always results in a value between zero and one, as this is a property of the logistic function. As one can notice, the elements in the exponent (i.e. a0 + a1*x1 + a2*x2 + ... + ai*xi) correspond to the basic regression equation introduced earlier. Therefore, all else being equal, coefficients of larger magnitude have more weight in the final score.
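The behaviour of the logistic function can be illustrated with a minimal Python sketch. The function name and the coefficient values below are hypothetical and chosen only for illustration; they do not come from the study.

```python
import math

def logistic_score(intercept, coefficients, values):
    """Return p(1|X) for one observation under a binary logistic model."""
    # Linear part: a0 + a1*x1 + ... + ai*xi
    z = intercept + sum(a * x for a, x in zip(coefficients, values))
    # The logistic function maps any real z into the open interval (0, 1)
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical constant (-1.0) and single coefficient (2.0)
print(logistic_score(-1.0, [2.0], [0.5]))   # z = 0, exactly at the 0.5 cut-off
print(logistic_score(-1.0, [2.0], [3.0]))   # z = 5, close to one
print(logistic_score(-1.0, [2.0], [-2.0]))  # z = -5, close to zero
```

A linear part of zero gives a score of exactly 0.5, which is why the 0.5 cut-off separates the two outcomes.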

To illustrate how a logistic regression can be used to investigate a categorical variable, a simplified example will be discussed. Let us say that we are interested in investigating the differences between academic and non-academic texts. In this example, we assume a binary categorical variable, i.e. the type of text, which can only take two mutually exclusive values, namely academic and non-academic. A binary logistic regression can help us predict the type of a text based on a number of predictor variables. In the present case, three predictor variables are included in the model: type-token ratio, average sentence length and average word length. Type-token ratio is a textual measure obtained by dividing the number of different types by the total number of tokens in a text (Baayen 2008). A higher type-token ratio corresponds to a more diverse vocabulary (see also section 3.3). Average sentence length is the average number of words per sentence, and average word length is the average number of letters per word. This is a toy example, and a real study of such a question would obviously require more variables and a more thorough discussion.
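The three measures can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions that are not taken from the study: sentences are assumed to end with ".", "!" or "?", and a token is assumed to be any run of lowercase letters after lowercasing. The function name is hypothetical.

```python
import re

def text_measures(text):
    """Return (type-token ratio, average sentence length, average word length)."""
    # Assumption: a sentence ends with '.', '!' or '?'
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # Assumption: a token is a run of letters, compared case-insensitively
    tokens = re.findall(r"[a-z]+", text.lower())
    ttr = len(set(tokens)) / len(tokens)               # types / tokens
    avg_sentence_length = len(tokens) / len(sentences)  # words per sentence
    avg_word_length = sum(len(t) for t in tokens) / len(tokens)  # letters per word
    return ttr, avg_sentence_length, avg_word_length

sample = "The cat sat on the mat. The dog sat too."
ttr, avg_sentence_length, avg_word_length = text_measures(sample)
print(ttr, avg_sentence_length, avg_word_length)
```

In the sample, 7 distinct types over 10 tokens give a type-token ratio of 0.7, and the two sentences contain 5 words each on average.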

A simulated dataset was constructed in order to illustrate what a binary logistic regression can do. This imaginary dataset contains fifteen academic texts and fifteen non-academic ones. The values for the type-token ratio were made deliberately higher for the academic texts, while the two other variables were randomized. This was done so that the type of text can mostly be predicted on the basis of its type-token ratio, while the other variables are virtually useless. Table 8 shows this simulated dataset in detail.

A binary logistic regression was conducted on the data presented in Table 8, using the type of text as the dependent variable. Note that academic texts were encoded as one, whereas non-academic texts were encoded as zero. Type-token ratio, average sentence length and average word length were used as predictors. The binary logistic regression calculates a regression score for each text, which ranges from zero to one. Texts with a score closer to one (i.e. above 0.5) are more likely to be academic, whereas texts with a score closer to zero are more likely to be non-academic.

Table 9 shows the regression coefficient for each variable (column B), the standard error of these coefficients (column S.E.), their Wald statistic (column Wald) and their p value (column Sig.). The result is that one of the three predictor variables emerges as highly significant, namely type-token ratio, with a p value of 0.003. This means that, if the true coefficient were equal to zero (and the variable therefore irrelevant to the model), a Wald statistic at least this large would be expected only about 0.3% of the time. This is an expected result, as this variable was designed for this purpose. The other predictor variables are not significant, with p values that are quite high (p>0.2). The p values are computed on the basis of the Wald statistic, which is the squared ratio of a coefficient to its standard error; a higher Wald statistic corresponds to a lower p value. The Wald statistic is used in the same way as the t-test for a linear regression (section 4.1.1), in that it uses the standard error of the coefficient to determine whether its actual value is different from zero.
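The link between the B, S.E., Wald and Sig. columns can be checked with a short Python sketch. It assumes that the Wald statistic is the squared ratio B / S.E. and that the p value is the upper tail of a chi-square distribution with one degree of freedom; under these assumptions the Table 9 values are recovered up to rounding. The function name is illustrative.

```python
import math

def wald_test(b, se):
    """Return the Wald statistic and its p value (chi-square, 1 degree of freedom)."""
    wald = (b / se) ** 2
    # With 1 degree of freedom, the chi-square upper tail equals erfc(sqrt(wald / 2))
    p_value = math.erfc(math.sqrt(wald / 2))
    return wald, p_value

# B and S.E. values as printed in Table 9 (already rounded to three decimals)
rows = [("Type-token ratio", 22.855, 7.764),
        ("Average Sentence Length", 1.484, 1.217),
        ("Average Word Length", -1.291, 1.050)]

for name, b, se in rows:
    wald, p = wald_test(b, se)
    print(f"{name}: Wald = {wald:.3f}, p = {p:.3f}")
```

Small differences in the last decimal are to be expected, because the inputs are themselves rounded.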

ID        Type of text    Type-token ratio    Average Sentence Length    Average Word Length    Binary Logistic Regression Score
Text_1    Academic        0.640               9.480                      4.390                  0.996
Text_2    Academic        0.610               9.140                      4.320                  0.988
Text_3    Academic        0.590               8.990                      5.290                  0.923
Text_4    Academic        0.590               8.970                      5.580                  0.888
Text_5    Academic        0.580               9.460                      5.310                  0.949
Text_6    Academic        0.580               8.440                      5.940                  0.644
Text_7    Academic        0.580               8.690                      5.790                  0.761
Text_8    Academic        0.560               8.850                      5.960                  0.672
Text_9    Academic        0.550               8.080                      4.110                  0.850
Text_10   Academic        0.530               9.720                      5.130                  0.917
Text_11   Academic        0.530               9.950                      4.110                  0.983
Text_12   Academic        0.510               8.340                      4.210                  0.746
Text_13   Academic        0.510               9.930                      4.900                  0.927
Text_14   Academic        0.420               9.780                      5.930                  0.257
Text_15   Academic        0.310               8.690                      4.180                  0.051
Text_16   Non-Academic    0.550               8.780                      5.610                  0.698
Text_17   Non-Academic    0.530               9.190                      5.430                  0.773
Text_18   Non-Academic    0.480               8.810                      5.600                  0.331
Text_19   Non-Academic    0.460               8.150                      4.870                  0.232
Text_20   Non-Academic    0.420               8.270                      4.240                  0.246
Text_21   Non-Academic    0.400               8.760                      4.680                  0.195
Text_22   Non-Academic    0.390               9.260                      5.740                  0.093
Text_23   Non-Academic    0.380               8.860                      4.760                  0.138
Text_24   Non-Academic    0.370               9.510                      4.470                  0.327
Text_25   Non-Academic    0.350               8.310                      5.600                  0.012
Text_26   Non-Academic    0.350               9.310                      5.250                  0.077
Text_27   Non-Academic    0.340               9.140                      4.460                  0.125
Text_28   Non-Academic    0.330               9.770                      5.070                  0.117
Text_29   Non-Academic    0.310               8.430                      4.240                  0.032
Text_30   Non-Academic    0.300               9.820                      5.310                  0.050

Table 8. Imaginary dataset of academic and non-academic texts.

Type-token ratio is the only relevant variable in the model, which means that analysing its coefficient will inform us about its relationship with the dependent variable. In the present case, the coefficient is positive (22.855), which means that an increase in the type-token ratio tends to increase the likelihood of a text being categorized as academic by the model. The specific regression score for any given text is computed by using the binary logistic regression equation presented above. For example, the data for Text_4 results in the following score:
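Plugging the Table 9 coefficients and the Table 8 values for Text_4 into the logistic equation gives a linear part of about 2.075 and a score of about 0.888, which matches the score listed in Table 8. A minimal Python sketch of this arithmetic (the function and variable names are illustrative):

```python
import math

# Unstandardized coefficients from Table 9
CONSTANT = -17.517
B_TTR, B_SENT, B_WORD = 22.855, 1.484, -1.291

def regression_score(ttr, sent_len, word_len):
    """Binary logistic regression score for one text."""
    # Linear part: constant plus each coefficient times its predictor value
    z = CONSTANT + B_TTR * ttr + B_SENT * sent_len + B_WORD * word_len
    return 1.0 / (1.0 + math.exp(-z))

# Text_4 in Table 8: type-token ratio 0.590, sentence length 8.970, word length 5.580
score_text_4 = regression_score(0.590, 8.970, 5.580)
print(round(score_text_4, 3))
```

The same function applied to the other rows of Table 8 should give the remaining scores, up to rounding of the published coefficients.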

As can be observed from the formula, knowing which coefficient has the most weight is not straightforward in the present situation. Indeed, while the type-token ratio coefficient is the biggest one (22.855), type-token ratio values only range from 0.300 to 0.640, which means that the product of coefficient and value is much smaller than the coefficient itself. Conversely, average sentence and word lengths have small coefficients (1.484 and -1.291, respectively), but they are multiplied by bigger values ranging from 4.110 to 9.950. A way of getting rid of this problem is to standardize the dataset. Standardization is a process by which variables are re-scaled so that their values are expressed as the number of standard deviations above (or below) the mean. This allows comparison between variables that use different measurement units. Standardizing the dataset and running the binary logistic regression again results in the coefficients shown in Table 10. The first observation is that the Wald scores and the p values of the predictors are not affected by standardization (only those of the constant change). The second observation is that the type-token ratio coefficient is the one with the largest magnitude, as it is approximately three times bigger than the other coefficients.
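Standardization can be sketched in Python for the type-token ratio column of Table 8. The sketch assumes the sample standard deviation (n - 1 in the denominator); under that assumption, multiplying the unstandardized coefficient from Table 9 by the standard deviation gives approximately the standardized coefficient reported in Table 10. Variable names are illustrative.

```python
import statistics

# Type-token ratios of the 30 texts in Table 8, in text order
ttr = [0.640, 0.610, 0.590, 0.590, 0.580, 0.580, 0.580, 0.560, 0.550, 0.530,
       0.530, 0.510, 0.510, 0.420, 0.310, 0.550, 0.530, 0.480, 0.460, 0.420,
       0.400, 0.390, 0.380, 0.370, 0.350, 0.350, 0.340, 0.330, 0.310, 0.300]

mean = statistics.mean(ttr)
sd = statistics.stdev(ttr)  # assumption: sample standard deviation

# Each value expressed as a number of standard deviations from the mean
z_scores = [(x - mean) / sd for x in ttr]

# A one-standard-deviation change in the raw variable equals a change of `sd`,
# so the standardized coefficient is the raw coefficient multiplied by `sd`
b_raw = 22.855          # Table 9
b_std = b_raw * sd
print(round(sd, 4), round(b_std, 3))
```

After this re-scaling, the standardized values have a mean of zero and a standard deviation of one, which is what makes the coefficients of different predictors comparable.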

                           B          S.E.      Wald     Sig.
Type-token ratio           22.855     7.764     8.666    0.003
Average Sentence Length    1.484      1.217     1.486    0.223
Average Word Length        -1.291     1.050     1.512    0.219
Constant                   -17.517    11.190    2.451    0.117

Table 9. Coefficients of the binary logistic regression model conducted on Table 8.

                           B         S.E.     Wald     Sig.
Type-token ratio           2.432     0.826    8.666    0.003
Average Sentence Length    0.831     0.682    1.486    0.223
Average Word Length        -0.811    0.659    1.512    0.219
Constant                   0.109     0.573    0.036    0.849

Table 10. Standardized coefficients of the binary logistic regression presented in Table 9.

The scores calculated on the basis of these coefficients may be used to discuss the degree to which a text is academic or non-academic. For example, Texts 1, 2, 3, 5, 10, 11 and 13 are extreme examples of academic texts, as they get values above 0.9. However, some texts do not get scores that correspond to their actual categories. Indeed, Texts 16 and 17 have higher scores than other non-academic texts, with 0.698 and 0.773 respectively. They both have a type-token ratio value that is higher than the other non-academic texts, which explains why the model gives them higher scores.

There is therefore a continuum of “academicness” on which each text can be placed. This continuum consists of scores between zero and one. These scores are a fine-grained type of information, as they can take an infinite number of possible values in this range. The initial dataset used a coarse-grained type of information, namely the type of text, which could only take two possible values, i.e. academic or non-academic. The binary logistic regression makes it possible to use this coarse-grained information to compute a more fine-grained one, and also to highlight the more extreme members of the initial categories. This is why this method is useful when trying to place an item on a lexical to grammatical continuum.

These fine-grained values can, if one wishes, be classified back into coarse-grained categories. A binary logistic regression calculates, as stated earlier, the probability of one of two outcomes, such as a pass in a pass/fail event. Therefore, a score above 0.5 means that a pass is more likely than a fail, whereas a score under this threshold means that a fail is more likely. In terms of the “academicness” score, it means that a text above 0.5 is classified as an academic text, and a text under this score is classified as a non-academic one.

From a practical point of view, this information can be used to test the correct classification rate of the binary logistic regression. Indeed, it is possible to compare the category attributed given the regression score with the actual type of text. Table 8 shows that there are two academic texts (14 and 15) that have a score under 0.5, which means that the model classifies them as non-academic. There are also two non-academic texts (16 and 17) that have a score above 0.5, which means that the model classifies them as academic. These four texts therefore get a classification by the model that is different from their actual type. It means that, when it comes to predicting the type of text, the model has a correct classification rate of 26/30 (86.67%). These classifications are usually displayed in a classification matrix (also called a confusion matrix), which is shown in Table 11.

There are also general indicators of the quality of the model. The first one is a chi-square test of the coefficients of the model, which in the present case amounts to 19.034 (DF=3, p<0.001), which means that there is indeed a link between the type of text and the variables under investigation. Furthermore, the R² (the proportion of variance explained by the model) can also be computed for a binary logistic regression model, although only as an approximation. The two main ways to calculate an approximation of the R² score for such a model are the Cox & Snell R² (Cox and Snell 1989) and the Nagelkerke R² (Nagelkerke 1991). They amount to 0.470 and 0.626, respectively. Both start at zero (poor); the Nagelkerke R² can reach one (excellent), whereas the Cox & Snell R² has a maximum below one, which the Nagelkerke version rescales. The values observed here are therefore rather average. An important note is that there is a lot of debate regarding the accuracy and relevance of these indices (e.g. Menard 2000, Smith and McKenna 2013). However, these values are not necessarily critical, as one can also look at the classification rates to get some idea of the quality of the model. In the present case, the model does have a high correct classification rate.
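Both approximations can be recovered from the model chi-square and the sample size. The Python sketch below assumes the usual definitions: Cox & Snell R² as 1 - exp(-chi-square / n), and Nagelkerke R² as the Cox & Snell value divided by its maximum, which depends on the log-likelihood of a model with a constant only. Variable names are illustrative.

```python
import math

n = 30               # number of texts in Table 8
n_academic = 15      # texts encoded as one
chi_square = 19.034  # model chi-square reported in the text

# Cox & Snell R2: based on the improvement over the constant-only model
cox_snell = 1 - math.exp(-chi_square / n)

# Log-likelihood of the constant-only model, which predicts the base rate for every text
p0 = n_academic / n
ll_null = n_academic * math.log(p0) + (n - n_academic) * math.log(1 - p0)

# Largest value Cox & Snell R2 can reach for this dataset
max_cox_snell = 1 - math.exp(2 * ll_null / n)

# Nagelkerke R2: Cox & Snell rescaled so that its maximum is one
nagelkerke = cox_snell / max_cox_snell

print(round(cox_snell, 3), round(max_cox_snell, 3), round(nagelkerke, 3))
```

With fifteen texts in each category, the maximum of the Cox & Snell R² is 0.75, which is why the Nagelkerke value is noticeably higher.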

                                 Predicted type
                                 Academic    Non-Academic    Percentage correct
Actual type    Academic          13          2               86.67
               Non-Academic      2           13              86.67
Total percentage of correct classification                   86.67

Table 11. Classification matrix, observed types of texts versus predicted types of texts by the binary logistic regression.
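The counts in Table 11 can be reproduced from the regression scores listed in Table 8 by applying the 0.5 cut-off. A minimal Python sketch (names are illustrative):

```python
# Binary logistic regression scores from Table 8, in text order
academic_scores = [0.996, 0.988, 0.923, 0.888, 0.949, 0.644, 0.761, 0.672,
                   0.850, 0.917, 0.983, 0.746, 0.927, 0.257, 0.051]
non_academic_scores = [0.698, 0.773, 0.331, 0.232, 0.246, 0.195, 0.093, 0.138,
                       0.327, 0.012, 0.077, 0.125, 0.117, 0.032, 0.050]
CUTOFF = 0.5

def count_predictions(scores):
    """Return (number predicted academic, number predicted non-academic)."""
    predicted_academic = sum(1 for s in scores if s > CUTOFF)
    return predicted_academic, len(scores) - predicted_academic

acad_as_acad, acad_as_non = count_predictions(academic_scores)
non_as_acad, non_as_non = count_predictions(non_academic_scores)

# Correct classifications lie on the diagonal of the matrix
total = len(academic_scores) + len(non_academic_scores)
accuracy = (acad_as_acad + non_as_non) / total

print(acad_as_acad, acad_as_non)
print(non_as_acad, non_as_non)
print(round(accuracy * 100, 2))
```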

One of the limitations of the basic binary logistic regression model presented so far is that the two variables that do not have significant scores are useless to the model. There are cases where some variables may even worsen the model. This is due to a situation known as multicollinearity, which happens when predictor variables are strongly correlated with each other. In such a case, a strong correlation means that these predictors are redundant to the model, as they bring the same kind of information to it. This situation can result in a major reduction of the significance scores of some of the variables in the model. This is why it is often useful to conduct a stepwise binary logistic regression.
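One simple, partial check for this situation is to compute the pairwise correlation between predictors. The Python sketch below computes a Pearson correlation for the two length variables of Table 8. It is only an illustration: the function name is hypothetical, and a pairwise correlation does not detect every form of multicollinearity.

```python
import math

# Average sentence length and average word length from Table 8, in text order
sent_len = [9.480, 9.140, 8.990, 8.970, 9.460, 8.440, 8.690, 8.850, 8.080, 9.720,
            9.950, 8.340, 9.930, 9.780, 8.690, 8.780, 9.190, 8.810, 8.150, 8.270,
            8.760, 9.260, 8.860, 9.510, 8.310, 9.310, 9.140, 9.770, 8.430, 9.820]
word_len = [4.390, 4.320, 5.290, 5.580, 5.310, 5.940, 5.790, 5.960, 4.110, 5.130,
            4.110, 4.210, 4.900, 5.930, 4.180, 5.610, 5.430, 5.600, 4.870, 4.240,
            4.680, 5.740, 4.760, 4.470, 5.600, 5.250, 4.460, 5.070, 4.240, 5.310]

def pearson(x, y):
    """Pearson correlation coefficient between two equally long lists."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sum of cross-products and sums of squares of the deviations from the means
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(ss_x * ss_y)

r = pearson(sent_len, word_len)
print(round(r, 3))
```

A value close to 1 or -1 would indicate that the two predictors carry largely the same information.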

In a stepwise regression, the variables are added to the model (or removed from it) in several steps. After each step, a test is made to determine whether the new iteration of the model is better than the previous one, until the best fit is found. There are different stepwise options for a binary logistic regression (conditional, maximum likelihood, Wald) which are all equally reliable (Hosmer et al. 2013: 125). In the present case, the Wald stepwise method is used, as the Wald score determines the significance score of the parameters of the model. This will be useful for the main study of the dissertation, since determining which parameters are the most relevant ones is a central question. The stepwise method can proceed in either ascending (forward) or descending (backward) order, which means that it can either start with one variable and add new ones based on the Wald score, or instead start with all variables and remove the ones that have the lowest scores in each step until the model does not improve anymore. Let us look at the previous example using a stepwise binary logistic regression to illustrate these concepts.

The data from Table 8 was used in a stepwise binary logistic regression using the ascending Wald method. The results are reported in Table 12. First, only the type-token ratio variable was added to the model. None of the other variables were considered as having a Wald score high enough to be added to the model in a useful way. The stepwise regression therefore does not add unnecessary variables to the model. Second, the p value of the type-token ratio variable has slightly changed from 0.003 to 0.002, which means that the evidence for a link between type of text and type-token ratio is slightly stronger in this version of the model. Third, the coefficient has also changed, from 22.855 to 18.373, but this is a minor change, as the coefficient is of similar magnitude and keeps its positive sign. An additional remark is that using the descending Wald method leads to the exact same result as Table 12. This is because, in the present case, there are only three variables that can be added to or removed from the model. An important note, though, is that the descending Wald method gets to this result by starting with all the variables and then determining that the average sentence length and average word length variables should be dropped.

                    B         S.E.     Wald     Sig.
Type-token ratio    18.373    5.962    9.496    0.002
Constant            -8.679    2.901    8.951    0.003

Table 12. Coefficients of the stepwise binary logistic regression, using the ascending Wald method.
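With a single predictor, the model in Table 12 reduces to a curve over the type-token ratio, and the value at which the score crosses the 0.5 cut-off is simply the negated constant divided by the coefficient. A minimal Python sketch using the Table 12 values (the function and variable names are illustrative):

```python
import math

# Coefficients from Table 12 (stepwise model with type-token ratio only)
CONSTANT = -8.679
B_TTR = 18.373

def stepwise_score(ttr):
    """Regression score of the one-predictor model for a given type-token ratio."""
    return 1.0 / (1.0 + math.exp(-(CONSTANT + B_TTR * ttr)))

# The score equals 0.5 when the linear part is zero, i.e. at -constant / coefficient
boundary = -CONSTANT / B_TTR
print(round(boundary, 3))

# Highest and lowest type-token ratios in Table 8
print(round(stepwise_score(0.640), 3), round(stepwise_score(0.300), 3))
```

Under this one-predictor model, the boundary lies at a type-token ratio of roughly 0.47.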

This simple example investigating the link between type of text and three predictor variables relates closely to the idea of using lexical and grammatical items in order to determine a grammaticalization score. Indeed, while there are highly lexical and highly grammatical items, some items are somewhere in between. The score attributed by a binary logistic regression can help us identify these items and place them somewhere along a continuum, as was done with the academic and non-academic texts presented in Table 8. The main difference has to do with the selection of the relevant variables, as the example presented above is an imaginary dataset with a limited number of constructed variables. Since the main study of the dissertation contains a larger number of variables and their relationships are not straightforward, a general method of investigating these relationships is presented in section 4.1.4.
