IMPROVEMENT PLAN - COMPLIANCE AUDIT REPORT

When investigating any research question, one decides on an appropriate sample size largely on the basis of the size of the effect or association expected. The bigger the effect or association, the smaller the sample can be in purely statistical terms. This is because bigger effects are more likely to be statistically significant with small sample sizes. A statistically significant finding is one that is large enough that it is unlikely to be caused by chance fluctuations due to sampling. (It should be stressed that the calculation of statistical significance is normally based on the hypothetical situation defined by the null hypothesis that there is no trend in the data.) The conventionally accepted level of significance is the 5 per cent or .05 level. This means that a finding as big as ours can be expected to occur by chance on 5 or fewer occasions if we tested that finding on 100 occasions (and assuming that the null hypothesis is in fact true). A finding or effect that is likely to occur more than 5 times out of 100 by chance is described as being statistically non-significant or not statistically significant. Note that the correct term is non-significant, not statistically insignificant, although authors sometimes use the latter term. Insignificant is a misleading term since it implies that the finding is not statistically important – but that simply is not what is meant in significance testing. The importance of a finding lies in the strength of the relationship between two variables or the size of the difference between two samples. Statistical significance testing merely refers to the question of whether the trend in the data is sufficiently large that it is unlikely to be the result of chance factors due to the variability inherent in sampling, that is, a trend this large would be unlikely to arise if the null hypothesis of no trend or difference were correct.
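The meaning of the .05 level can be illustrated with a small simulation. The sketch below is not from the text: it assumes two groups drawn from the same normal population with a known standard deviation of 1 (so the null hypothesis is true by construction), and the function name and parameters are hypothetical. It counts how often a difference between the two sample means would be declared significant at the two-tailed .05 level.

```python
import random
from statistics import NormalDist


def null_false_positive_rate(n_studies=2000, n_per_group=30, alpha=0.05, seed=1):
    """Proportion of simulated studies declared significant when the null is true."""
    rng = random.Random(seed)
    # Two-tailed critical value of z (about 1.96 when alpha is .05).
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    # Assumption: population SD is known to be 1, so the standard error of the
    # difference between two means is sqrt(1/n + 1/n).
    se_diff = (2 / n_per_group) ** 0.5
    hits = 0
    for _ in range(n_studies):
        mean_a = sum(rng.gauss(0, 1) for _ in range(n_per_group)) / n_per_group
        mean_b = sum(rng.gauss(0, 1) for _ in range(n_per_group)) / n_per_group
        if abs((mean_a - mean_b) / se_diff) > z_crit:
            hits += 1
    return hits / n_studies


if __name__ == "__main__":
    print(f"Proportion significant under the null: {null_false_positive_rate():.3f}")
```

Under these assumptions the proportion should settle close to .05, which is what "5 or fewer occasions in 100" refers to.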

Too much can be made of statistical significance if the size of the trend in the data is disregarded. For example, it has been argued that with very large samples, virtually any relationship will be statistically significant though the relationship may itself be a very small one. That is, a statistically significant relationship may, in fact, represent only a very small trend in the data. Another way of putting this is that very few null hypotheses are true if one deals with very large samples and one will accept even the most modest of trends in the data. What this means, though, in terms of generalisation is that small trends found in very large samples are likely not to generalise to small samples.

The difference between statistical significance and psychological significance is at the root of the following question. Which is better: a correlation of .06 which is statistically significant with a sample of 1000 participants or a correlation of .8 that is statistically significant with a sample of 6 participants?

This is a surprisingly difficult question for many psychologists to answer.
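One way to approach the question is to put the two correlations side by side on both criteria. The sketch below is an illustration rather than the authors' own procedure: it uses the standard conversion of a Pearson correlation to a t statistic with n − 2 degrees of freedom, and the squared correlation as a rough index of the strength of the relationship. The function names are hypothetical.

```python
import math


def r_to_t(r, n):
    """t statistic for testing a Pearson correlation r from n participants (df = n - 2)."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r * r)


def shared_variance(r):
    """Proportion of variance the two variables have in common (r squared)."""
    return r * r


if __name__ == "__main__":
    for r, n in [(0.06, 1000), (0.8, 6)]:
        print(f"r = {r}, n = {n}: t = {r_to_t(r, n):.2f}, "
              f"shared variance = {shared_variance(r):.2%}")
```

By this formula the two t values come out near 1.90 and 2.67, so whether either clears the .05 threshold depends on whether a one- or two-tailed test is used. The sizes of the two relationships, however, are very different: less than half of one per cent of shared variance against 64 per cent.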

While the critical value of 5 per cent or .05 or less is an arbitrary cut-off point, nevertheless it is one widely accepted. It is not simply the point for rejecting the null hypothesis but is also the point at which a researcher is likely to wish to generalise their findings. However, there are circumstances in which this arbitrary criterion of significance may be replaced with an alternative value:

• The significance level may be set at a value other than 5 per cent or .05. If the finding had important consequences and we wanted to be more certain that our finding was not due to chance, we might set it at a more stringent level. For example, we may have developed a test that we found was significantly related at the 5 per cent level to whether or not someone had been convicted of child abuse. Because people may want to use this test to help determine whether someone had committed, or was likely to commit, child abuse, we may wish to set the critical value at a more stringent or conservative level because we would not want to wrongly suggest that someone would be likely to commit child abuse. Consequently, we may set the critical value at, say, 0.1 per cent or .001, which is 1 out of 1000 times or less. This is a matter of judgement, not merely one of applying rules.

• Where a number of effects or associations are being evaluated at the same time, this critical value may need to be set at less than the 5 per cent or .05 level. For example, if we were comparing differences between three groups, we could make a total of three comparisons altogether. We could compare group 1 with group 2, group 1 with group 3, and group 2 with group 3. If the probability of finding a difference between any two groups is set at 5 per cent or .05, then the probability of finding any of the three comparisons statistically significant at this level is roughly three times as big, in other words about 15 per cent or .15. Because we want to maintain the overall significance level at 5 per cent or .05 for the three comparisons, we could divide the 5 per cent or the .05 by 3, which would give us an adjusted or corrected critical value of 1.67 per cent (5/3 = 1.666) or .017 (.05/3 = .0166). This correction is known as a Bonferroni adjustment. (See our companion statistics text, Introduction to Statistics in Psychology, Howitt and Cramer, 2011a, for further information on this and other related procedures.) That is, the value of, say, the t-test would have to be significant at the 1.67 per cent level according to the calculation in order to be reported as statistically significant at the 5 per cent level.

• For a pilot study using a small sample and less than satisfactory measuring instruments, the 5 per cent or .05 level of significance may be an unnecessarily stringent criterion. The size of the trends in the data (relationship, difference between means, etc.) is possibly more important. For the purposes of such a pilot study, the significance level may be set at 10 per cent or .1 to the advantage of the research process in these circumstances.

There may be other circumstances in which we might wish to be flexible about accepting significance levels of 5 per cent or .05. For example, in medical research, imagine that researchers have found a relationship between taking hormone replacement therapy and the development of breast cancer. Say that we find this relationship to be statistically significant at the 8 per cent or .08 level. Would we willingly conclude that the null hypothesis is preferred, or would we be unwilling to take the risk that the hypothesis linking hormone replacement therapy with cancer is in fact true? We would probably not accept the null hypothesis so readily. The point is not that significance testing is at fault but that a whole range of factors impinge on what we do as a consequence of the test of our hypotheses. Research is an intellectual process requiring considerable careful thought in order to make what appear to be straightforward decisions on the basis of statistical significance testing.

However, students are well advised to stick with the 5 per cent or .05 level as a matter of routine. One would normally be expected to make the case for varying this and this may prove difficult to do in the typical study.
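The arithmetic of the Bonferroni adjustment described above for the three-group comparison can be sketched as follows. The function names are hypothetical, and the second function assumes, purely for simplicity, that the comparisons are independent; multiplying .05 by the number of comparisons gives an upper limit on the overall chance of at least one false positive, and the independence calculation gives a slightly smaller figure.

```python
def bonferroni_alpha(alpha, n_comparisons):
    """Per-comparison significance level after a Bonferroni adjustment."""
    return alpha / n_comparisons


def familywise_error(alpha, n_comparisons):
    """Chance of at least one false positive, assuming independent comparisons."""
    return 1 - (1 - alpha) ** n_comparisons


if __name__ == "__main__":
    alpha, k = 0.05, 3
    adjusted = bonferroni_alpha(alpha, k)
    print(f"Upper limit without adjustment: {alpha * k:.3f}")
    print(f"Under independence, without adjustment: {familywise_error(alpha, k):.4f}")
    print(f"Adjusted per-comparison level: {adjusted:.4f}")
    print(f"Under independence, with adjustment: {familywise_error(adjusted, k):.4f}")
```

With the adjusted level of about .0167 per comparison, the overall chance of a false positive across the three comparisons stays just under .05.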

4.5 Directional and non-directional hypotheses again

The issue of directional and non-directional hypotheses was discussed in Box 2.1, but there is more that should be added at this stage. When hypotheses are being developed, researchers usually have an idea of the direction of the trend, correlation or difference that they expect. For example, who would express the opinion that there is a difference between the driving skills of men and women without expressing an opinion as to what that difference – such as women are definitely worse drivers – is? In everyday life, a person who expresses such a belief about women’s driving skills is likely to be expressing prejudices about women or joking or being deliberately provocative – they are unlikely to be a woman. Researchers, similarly, often have expectations about the likely outcome of their research – that is, the direction of the trend in their data. A researcher would not express such a view on the basis of a whim or prejudice but would make as strong an argument as possible built on evidence suggestive of this point of view. It should also be obvious that in some cases there will be very sound reasons for expecting a particular trend in the data whereas in other circumstances no sound grounds can be put forward for such an expectation. Research works best when the researcher articulates coherent, factually based and convincing grounds for their expectations.

In other words, often research hypotheses will be expressed in a directional form. In statistical testing, a similar distinction is made between directional and non-directional tests but the justifications are required to be exacting and reasoned (see Box 2.1). In a statistical analysis, as we saw in Chapter 2, there are tough requirements before a directional hypothesis can be offered. These requirements are that there are very strong empirical or theoretical reasons for expecting the relationship to go in a particular direction and that researchers are ignorant of their data before making the prediction.

It would be silly to claim to be making a prediction if one is just reporting the trend observed in the data. These criteria are so exacting that they probably mean that little or no student research should employ directional statistical hypotheses. Probably the main exceptions are where a student researcher is replicating the findings of a classic study, which has repeatedly been shown to demonstrate a particular trend.

The reason why directional statistical hypotheses have such exacting requirements is that conventionally the significance level is adjusted for the directional hypothesis. The directional hypothesis is referred to as one-tailed significance testing. The non-directional hypothesis is referred to as two-tailed significance testing. In two-tailed significance testing, the 5 per cent or .05 chance level is split equally between the two possibilities – that the association or difference between two variables is either positive or negative.

So if the hypothesis is that cognitive behaviour therapy has an effect then this would be supported by cognitive behaviour therapy either being better in the highest 2.5 per cent or .025 of samples or worse in the lowest 2.5 per cent or .025 of samples. In one-tailed testing the 5 per cent is piled just at one extreme – the extreme which is in the direction of the one-tailed hypothesis. Put another way, a directional hypothesis is supported by weaker data than would be required by the non-directional hypothesis. The only good justification for accepting a weaker trend is that there is good reason to think that it is correct, that is, either previous research has shown much the same trend or theory powerfully predicts a particular outcome. Given the often weak predictive power of much psychological theory, the strength of the previous research is probably the most useful of the two.
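The effect of splitting or not splitting the 5 per cent can be made concrete with the standard normal distribution. The sketch below is an illustration under a large-sample assumption (a z statistic rather than t), with hypothetical function names, and it takes the predicted direction of a one-tailed test to be positive.

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, standard deviation 1


def critical_z(alpha=0.05, tails=2):
    """Smallest z (in the predicted direction) that reaches significance."""
    return std_normal.inv_cdf(1 - alpha / tails)


def p_value(z, tails=2):
    """p value for z; the one-tailed version assumes a positive predicted direction."""
    if tails == 1:
        return 1 - std_normal.cdf(z)
    return 2 * (1 - std_normal.cdf(abs(z)))


if __name__ == "__main__":
    print(f"Two-tailed critical z: {critical_z(tails=2):.3f}")
    print(f"One-tailed critical z: {critical_z(tails=1):.3f}")
    for z in (1.8, -1.8):
        print(f"z = {z}: one-tailed p = {p_value(z, 1):.3f}, "
              f"two-tailed p = {p_value(z, 2):.3f}")
```

A z of 1.8 in the predicted direction is significant one-tailed but not two-tailed, which is the sense in which the directional hypothesis is supported by weaker data. The same size of trend in the opposite direction is nowhere near significant on the one-tailed test.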

If the hypothesis is directional, then the significance level is confined to just one half of the distribution – that is, the 5 per cent is just at one end of the distribution (not both) which means, in effect, that a smaller trend will be statistically significant with a directional test. There is a proviso to this and that is that the trend is in the predicted direction. Otherwise it is very bad news since even big trends are not significant if they are in the wrong direction. The problem with directional hypotheses is, then, what happens when the researcher gets it wrong, that is the trend in the data is exactly the reverse of what is suggested in the hypothesis. There are two possibilities:

• That the researcher rejects the hypothesis.

• That the researcher rejects the hypothesis but argues that the reverse of the hypothesis has been demonstrated by the data.

The latter is rather like having one’s cake and eating it, statistically speaking. If the original hypothesis had been supported using the less stringent requirements then the researcher would claim credit for that finding. If, on the other hand, the original hypothesis was actually substantially reversed by the data then this finding would now find favour. The reversed hypothesis, however, was deemed virtually untenable once the original directional hypothesis had been decided upon. So how can it suddenly be favoured when it was previously given no credence with good reason? The only conclusion must be that the findings were chance findings. So the hypothesis should be rejected. The temptation, of course, is to forget about the original directional hypothesis and substitute a non-directional or reverse directional hypothesis. Both of these are totally wrong but who can say when even a researcher will succumb to temptation?

Possibly the only circumstance in which a student should employ directional statistical hypotheses is when conducting fairly exact replication studies. In these circumstances the direction of the hypothesis is justified by the findings of the original study. If the research supports the original direction then the conclusion is obvious. If the replication actually finds the reverse of the original findings then the researcher would be unlikely to claim that the reverse of the original findings is true since it would only apply to the replication study. The situation is one in which the original findings are in doubt as are the new findings since they are diametrically opposite.

■ One- versus two-tailed significance level

Splitting the 5 per cent or .05 chance or significance level between the two possible outcomes is usually known as the two-tailed significance level because two outcomes (directions of the trend or effect) both in a positive and a negative direction are being considered. We do this if our hypothesis is non-directional as we have not specified which of the two outcomes we expect to find. Confining the outcome to one of the two possibilities is known as the one-tailed significance level because only one outcome is predicted. This is what we do if our hypothesis is directional, where we expect the results to go in one direction.

To understand what is meant by the term ‘tailed’, we need to plot the probability of obtaining each of the possible outcomes that could be obtained by sampling if the null hypothesis is assumed to be true. This is the working assumption of hypothesis testing and reference to the null hypothesis is inescapable if hypothesis testing is to be understood. The technicalities of working out the distribution of random samples if the null hypothesis is true can be obtained from a good many statistics textbooks. The ‘trick’ to it all is employing the information contained in the actual data. This gives us information about the distribution of scores. One measure of the distribution of scores is the standard deviation. In a nutshell, this is a sort of average of the amount by which scores in a sample differ from the mean of the sample. It is computationally a small step from the standard deviation of scores to the standard error of the means of samples. Standard error is a sort of measure of the variation of sample means drawn from the population defined by the null hypothesis. Since we can calculate the standard error quite simply, this tells us how likely each of the different sample means is. (Standard error is the standard deviation of the distribution of sample means.) Not surprisingly, samples very different from the outcome defined by the null hypothesis are increasingly uncommon the more different they are from what would be expected on the basis of the null hypothesis.

This is saying little more than that if the null hypothesis is true, then samples that are unlike what would be expected on the basis of this null hypothesis are likely to be uncommon.
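The "small step" from the standard deviation of scores to the standard error of the mean can be sketched as below. This is an illustration under stated assumptions, not the text's own working: it uses the usual formula (standard deviation divided by the square root of the sample size) and checks it against a simulation in which many samples are drawn from a normal population. The function names, the population values (mean 100, SD 15) and the sample size are all hypothetical.

```python
import random
from statistics import stdev


def standard_error(scores):
    """Estimated standard error of the mean: sample SD divided by sqrt(n)."""
    return stdev(scores) / len(scores) ** 0.5


def simulated_spread_of_means(pop_mean, pop_sd, n, n_samples=5000, seed=7):
    """Standard deviation of the means of many random samples of size n."""
    rng = random.Random(seed)
    means = []
    for _ in range(n_samples):
        sample = [rng.gauss(pop_mean, pop_sd) for _ in range(n)]
        means.append(sum(sample) / n)
    return stdev(means)


if __name__ == "__main__":
    # Formula for a population SD of 15 and samples of 25: 15 / sqrt(25) = 3.
    print(f"Formula: {15 / 25 ** 0.5:.2f}")
    print(f"Simulation: {simulated_spread_of_means(100, 15, 25):.2f}")
```

The simulated spread of sample means should land close to the value the formula gives, and sample means far from the population mean turn up correspondingly rarely.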

4.6 More on the similarity between measures of effect (difference) and association

Often measures of the effect (or difference) in experimental designs are seen as unlike measures of association. This is somewhat misleading. Simple basic research designs in psychology are often analysed using the t-test (especially in laboratory experiments) and the Pearson correlation coefficient (especially in cross-sectional or correlational studies).

The t-test is based on comparing the means (usually) of two samples and essentially examines the size of the difference between the two means relative to the variability in the data. The Pearson correlation coefficient is a measure of the amount of association or relationship between two variables. Generally speaking, especially in introductory statistics textbooks, they are regarded as two very different approaches to the statistical analysis of data. This can be helpful for learning purposes. However, they are actually very closely related.
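How closely the two are related can be shown with a small numerical check. The sketch below is an illustration, not taken from the text: the depression scores are made up, the function names are hypothetical, and it assumes the pooled-variance form of the t-test for two independent groups. It computes t directly from the two groups, then computes the Pearson correlation between group membership (coded 1 and 0) and the scores and converts that correlation to t.

```python
import math
from statistics import mean


def pooled_t(group_a, group_b):
    """Independent-samples t using the pooled estimate of variance."""
    na, nb = len(group_a), len(group_b)
    ma, mb = mean(group_a), mean(group_b)
    ss_a = sum((x - ma) ** 2 for x in group_a)
    ss_b = sum((x - mb) ** 2 for x in group_b)
    pooled_var = (ss_a + ss_b) / (na + nb - 2)
    se_diff = math.sqrt(pooled_var * (1 / na + 1 / nb))
    return (ma - mb) / se_diff


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two lists of equal length."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def r_to_t(r, n):
    """Convert a correlation based on n cases to a t statistic (df = n - 2)."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r * r)


# Made-up depression scores for illustration only.
therapy = [4, 6, 5, 7, 3]
no_treatment = [8, 9, 6, 10, 7]
group_codes = [1] * len(therapy) + [0] * len(no_treatment)
all_scores = therapy + no_treatment

t_direct = pooled_t(therapy, no_treatment)
t_from_r = r_to_t(pearson_r(group_codes, all_scores), len(all_scores))

if __name__ == "__main__":
    print(f"t from the group means: {t_direct:.4f}")
    print(f"t from the correlation: {t_from_r:.4f}")
```

The two routes give the same value of t, which is one sense in which a measure of difference and a measure of association are two views of the same trend in the data.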

A t-test is usually used to determine whether an effect is significant in terms of whether the mean scores of two groups differ. We could use a t-test to find out whether the mean depression score was higher in the cognitive behaviour therapy group than in the no treatment group. A t-test is the mean of one group subtracted from the mean of the other group and divided by what is known as the standard error of the mean:

$$t = \frac{\text{mean of one group} - \text{mean of other group}}{\text{standard error of the mean}}$$

The standard error of the mean is a measure of the extent to which sample means are likely to differ. It is usually based on the extent to which scores in the data differ so it is also a sort of measure of the variability in the data. There are different versions of the t-test. Some calculate the standard error of the mean and others calculate the standard error of the difference between two means.

The value of t can be thought of as the ratio of the difference between the two means to the degree of the variability of the scores in the data. If the individual scores differ widely, then the t value will be smaller than if they do not differ much. The bigger the t value is, the more likely it is to be statistically significant. To be statistically significant at the two-tailed .05 level, the t value has to be 2.00 or bigger for samples of more than 61 cases. The t value can be slightly less than 2.00 for bigger samples. The minimum value that t has to exceed to be significant at this level is 1.96, which is for an infinite number of cases. These figures can be found in the tables in some statistics texts such as Introduction to Statistics in Psychology (Howitt and Cramer, 2011a).
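The way variability in the scores drives the value of t can be seen by holding the difference between the means constant and changing only the spread. The sketch below uses made-up scores and a hypothetical function name, and it assumes the version of the t-test that divides by the standard error of the difference between two means, computed from each group's own variance.

```python
from statistics import mean, variance


def t_statistic(group_a, group_b):
    """Difference between means divided by the standard error of that difference."""
    se_diff = (variance(group_a) / len(group_a)
               + variance(group_b) / len(group_b)) ** 0.5
    return (mean(group_a) - mean(group_b)) / se_diff


# Made-up scores: both pairs of groups differ by 2 points on average.
tight_a, tight_b = [9, 10, 11, 10, 10], [7, 8, 9, 8, 8]
wide_a, wide_b = [4, 10, 16, 10, 10], [2, 8, 14, 8, 8]

if __name__ == "__main__":
    print(f"Scores that differ little: t = {t_statistic(tight_a, tight_b):.2f}")
    print(f"Scores that differ widely: t = {t_statistic(wide_a, wide_b):.2f}")
```

The same 2-point difference between means gives a much larger t when the individual scores cluster tightly than when they are widely spread.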

Bigger values of t generally indicate a bigger effect (bigger difference between the sample means relative to the variability in the data). However, this is affected by the sample size so this needs to be taken into consideration as well. Bigger values of t also tend to
