Bezjak and Knez (1995) provide data on the length of time it takes garment workers to runstitch a collar on a man’s shirt, using a standard workplace and a more ergonomic workplace. Table 2.1 gives the “auxiliary manual time” per collar in seconds for 30 workers using both systems.

One question of interest is whether the times are the same on average for the two workplaces. Formally, we test the null hypothesis that the average runstitching time for the standard workplace is the same as the average runstitching time for the ergonomic workplace.

2.4 Randomization for Inference 21

Table 2.2: Differences in runstitching times (standard − ergonomic).

  1.03  -.04   .26   .30  -.97   .04  -.57  1.75   .01   .42
   .45  -.80   .39   .25   .18   .95  -.18   .71   .42   .43
  -.48 -1.08  -.57  1.10   .27  -.45   .62   .21  -.21   .82

Paired t-test for paired data

A paired t-test is the standard procedure for testing this null hypothesis. We use a paired t-test because each worker was measured twice, once for each workplace, so the observations on the two workplaces are dependent. Fast workers are probably fast for both workplaces, and slow workers are slow for both. Thus what we do is compute the difference (standard − ergonomic) for each worker, and test the null hypothesis that the average of these differences is zero using a one-sample t-test on the differences.

The paired t-test

Table 2.2 gives the differences between standard and ergonomic times. Recall the setup for a one-sample t-test. Let d1, d2, ..., dn be the n differences in the sample. We assume that these differences are independent samples from a normal distribution with mean µ and variance σ², both unknown. Our null hypothesis is that the mean µ equals the prespecified value µ0 = 0 (H0: µ = µ0 = 0), and our alternative is H1: µ > 0 because we expect the workers to be faster in the ergonomic workplace. The formula for a one-sample t-test is

$$ t = \frac{\bar{d} - \mu_0}{s/\sqrt{n}} , $$

where d̄ is the mean of the data (here the differences d1, d2, ..., dn), n is the sample size, and s is the sample standard deviation (of the differences)

$$ s = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (d_i - \bar{d})^2} . $$

If our null hypothesis is correct and our assumptions are true, then the t-statistic follows a t-distribution with n − 1 degrees of freedom.

The p-value

The p-value for a test is the probability, assuming that the null hypothesis is true, of observing a test statistic as extreme or more extreme than the one we did observe. “Extreme” means away from the null hypothesis towards the alternative hypothesis. Our alternative here is that the true average is larger than the null hypothesis value, so larger values of the test statistic are extreme. Thus the p-value is the area under the t-curve with n − 1 degrees of freedom from the observed t-value to the right. (If the alternative had been µ < µ0, then the p-value is the area under the curve to the left of our test statistic. For a two-sided alternative, the p-value is the area under the curve at a distance from 0 as great or greater than our test statistic.)

22 Randomization and Design

Table 2.3: Paired t-test results for runstitching times (standard − ergonomic) for the last 10 and all 30 workers.

           n   df     d̄      s      t      p
Last 10   10    9   .023   .695    .10   .459
All 30    30   29   .175   .645   1.49   .074

To illustrate the t-test, let’s use the data for the last 10 workers and all 30 workers. Table 2.3 shows the results. Looking at the last ten workers, the p-value is .46, meaning that we would observe a t-statistic this large or larger in 46% of all tests when the null hypothesis is true. Thus there is little evidence against the null here. When all 30 workers are considered, the p-value is .074; this is mild evidence against the null hypothesis. The fact that these two differ probably indicates that the workers are not listed in random order. In fact, Figure 2.1 shows box-plots for the differences by groups of ten workers; the lower numbered differences tend to be greater.
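As a check on Table 2.3, here is a minimal Python sketch of the one-sample t-test applied to the differences. It assumes that Table 2.2 lists the differences in worker order reading across rows, so that the last 10 workers are the final ten values. The helper names `t_density`, `upper_tail`, and `paired_t` are illustrative and do not come from the text, and the tail area is obtained by simple numerical integration of the t density rather than from a statistics library.

```python
import math
import statistics

# Differences (standard - ergonomic) from Table 2.2, read row by row
# (assumed to be worker order).
DIFFS = [
    1.03, -.04, .26, .30, -.97, .04, -.57, 1.75, .01, .42,
    .45, -.80, .39, .25, .18, .95, -.18, .71, .42, .43,
    -.48, -1.08, -.57, 1.10, .27, -.45, .62, .21, -.21, .82,
]


def t_density(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))


def upper_tail(t, df, steps=2000):
    """P(T >= t), using Simpson's rule on the density between 0 and |t|."""
    a = abs(t)
    h = a / steps  # steps must be even for Simpson's rule
    total = t_density(0.0, df) + t_density(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_density(i * h, df)
    central = total * h / 3  # P(0 <= T <= |t|)
    return 0.5 - central if t >= 0 else 0.5 + central


def paired_t(diffs, mu0=0.0):
    """One-sample t-test on paired differences, alternative mu > mu0."""
    n = len(diffs)
    d_bar = statistics.mean(diffs)
    s = statistics.stdev(diffs)  # sample standard deviation, divisor n - 1
    t = (d_bar - mu0) / (s / math.sqrt(n))
    return n, n - 1, d_bar, s, t, upper_tail(t, n - 1)


for label, data in (("Last 10", DIFFS[-10:]), ("All 30", DIFFS)):
    n, df, d_bar, s, t, p = paired_t(data)
    print(f"{label:8s} n={n} df={df} mean={d_bar:.3f} s={s:.3f} t={t:.2f} p={p:.3f}")
```

The printed values should agree with Table 2.3 up to rounding.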

Randomization null hypothesis

Now consider a randomization-based analysis. The randomization null hypothesis is that the two workplaces are completely equivalent and merely act to label the responses that we observed. For example, the first worker had responses of 4.90 and 3.87, which we have labeled as standard and ergonomic. Under the randomization null, the responses would be 4.90 and 3.87 no matter how the random assignment of treatments turned out. The only thing that could change is which of the two is labeled as standard, and which as ergonomic. Thus, under the randomization null hypothesis, we could, with equal probability, have observed 3.87 for standard and 4.90 for ergonomic.

Differences have random signs under randomization null

What does this mean in terms of the differences? We observed a difference of 1.03 for worker 1. Under the randomization null, we could just as easily have observed the difference -1.03, and similarly for all the other differences. Thus in the randomization analogue to a paired t-test, the absolute values of the differences are taken to be fixed, and the signs of the differences are random, with each sign independent of the others and having equal probability of positive and negative.

To construct a randomization test, we choose a descriptive statistic for the data and then get the distribution of that statistic under the randomization null hypothesis. The randomization p-value is the probability (under this randomization distribution) of getting a descriptive statistic as extreme or more extreme than the one we observed.

Figure 2.1: Box-plots of differences in runstitching times by groups of 10 workers, using MacAnova. Stars and diamonds indicate potential outlier points. (Horizontal axis: group of 10; vertical axis: time difference.)

Randomization statistic and distribution

For this problem, we take the sum of the differences as our descriptive statistic. (The average would lead to exactly the same p-values, and we could also form tests using the median or other measures of center.) Start with the last 10 workers. The sum of the last 10 observed differences is .23. To get the randomization distribution, we have to get the sum for all possible combinations of signs for the differences. There are two possibilities for each difference, and 10 differences, so there are 2^10 = 1024 different equally likely values for the sum in the randomization distribution. We must look at all of them to get the randomization p-value.

Randomization p-value

Figure 2.2 shows a histogram of the randomization distribution for the last 10 workers. The observed value of .23 is clearly in the center of this distribution, so we expect a large p-value. In fact, 465 of the 1024 values are .23 or larger, so the randomization p-value is 465/1024 = .454, very close to the t-test p-value.
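The full enumeration is tedious by hand but short in code. The following Python sketch lists every one of the 1024 sign assignments for the last 10 differences and counts the sums at least as large as the observed sum. It assumes the last 10 workers are the final ten values of Table 2.2, and it stores the differences in hundredths of a second so that sums are exact integers and ties with the observed sum are not lost to floating-point rounding. The function name `randomization_p_value` is illustrative.

```python
from itertools import product

# Last 10 differences from Table 2.2, in hundredths of a second.
LAST_10 = [-48, -108, -57, 110, 27, -45, 62, 21, -21, 82]


def randomization_p_value(diffs):
    """Count sign assignments whose sum is at least the observed sum.

    The absolute values |d_i| are held fixed and every assignment of
    signs is tried. Returns (count, total); the one-sided randomization
    p-value is count / total.
    """
    observed = sum(diffs)
    magnitudes = [abs(d) for d in diffs]
    count = total = 0
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        if sum(sgn * m for sgn, m in zip(signs, magnitudes)) >= observed:
            count += 1
    return count, total


count, total = randomization_p_value(LAST_10)
print(f"{count} of {total} sums are at least the observed sum; "
      f"p = {count / total:.3f}")
```

The count and p-value it prints can be compared with the 465 of 1024 reported above.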

Figure 2.2: Histogram of randomization distribution of the sum of the last 10 worker differences for runstitching, with vertical line added at the observed sum. (Horizontal axis: sum of differences; vertical axis: density.)

We only wanted to do a test on a mean of 10 numbers, and we had to compute 1024 different sums of 10 numbers; you can see one reason why randomization tests have not had a major following. For some data sets, you can compute the randomization p-value by hand fairly simply. Consider the following ten differences from Table 2.2:

.62 1.75 .71 .21 .01 .42 -.21 .42 .43 .82

Only one of these values is negative (-.21), and seven of the positive differences have absolute value greater than .21. Changing the sign of any of these seven values can only make the sum smaller, so we don’t have to consider changing their signs, only the signs of .21, .01, and -.21. This is a much smaller problem, and it is fairly easy to work out that four of the 8 possible sign arrangements for these three differences lead to sums as large or larger than the observed sum. Thus the randomization p-value is 4/1024 = .004, similar to the .007 p-value we would get if we used the t-test.
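The shortcut can be checked against brute force. The sketch below, again in Python with the differences in hundredths of a second, first varies only the three small differences while holding the seven large ones at their observed signs, and then repeats the count over all 1024 sign assignments. The variable names are illustrative.

```python
from itertools import product

# The ten differences discussed above, in hundredths of a second.
DIFFS = [62, 175, 71, 21, 1, 42, -21, 42, 43, 82]
observed = sum(DIFFS)

# Shortcut: keep the seven large positive differences at their observed
# signs and vary only the signs of the three small ones (.21, .01, .21).
large = [d for d in DIFFS if abs(d) > 21]
small = [abs(d) for d in DIFFS if abs(d) <= 21]
shortcut = sum(
    1
    for signs in product((1, -1), repeat=len(small))
    if sum(large) + sum(s * m for s, m in zip(signs, small)) >= observed
)

# Brute force over all 2**10 sign assignments, as a check on the shortcut.
brute = sum(
    1
    for signs in product((1, -1), repeat=len(DIFFS))
    if sum(s * abs(d) for s, d in zip(signs, DIFFS)) >= observed
)

print(shortcut, brute, shortcut / 2 ** len(DIFFS))
```

Both counts come out to 4, which gives the p-value 4/1024 quoted in the text.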

Subsample the randomization distribution

Looking at the entire data set, we have 2^30 = 1,073,741,824 different sets of signs. That is too many to do comfortably, even on a computer. What is done instead is to have the computer choose a random sample from this complete distribution by choosing random sets of signs, and then use this sample for computing randomization p-values as if it were the complete distribution. For a reasonably large sample, say 10,000, the approximation is usually good enough. I took a random sample of size 10,000 and got a p-value of .069, reasonably close to the t-test p-value. Two additional samples


Table 2.4: Log whole plant phosphorus (ln µg/plant) 15 and 28 days after first harvest.

15 Days   4.3   4.6   4.8   5.4
28 Days   5.3   5.7   6.0   6.3

that these approximate p-values have a standard deviation of about

$$ \sqrt{p(1-p)/10000} \approx \sqrt{.07 \times .93/10000} = .0026 . $$
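A sampled randomization p-value for all 30 workers, together with its binomial standard deviation, can be sketched in Python as follows. The differences are those of Table 2.2 in hundredths of a second; the function name `sampled_p_value` and the seed are arbitrary choices for illustration, so the result will not match the .069 reported above digit for digit.

```python
import math
import random

# All 30 differences from Table 2.2, in hundredths of a second.
DIFFS = [
    103, -4, 26, 30, -97, 4, -57, 175, 1, 42,
    45, -80, 39, 25, 18, 95, -18, 71, 42, 43,
    -48, -108, -57, 110, 27, -45, 62, 21, -21, 82,
]


def sampled_p_value(diffs, n_samples=10_000, seed=0):
    """Approximate one-sided randomization p-value from random sign sets."""
    rng = random.Random(seed)  # the seed is an arbitrary choice
    observed = sum(diffs)
    magnitudes = [abs(d) for d in diffs]
    hits = 0
    for _ in range(n_samples):
        # Each sign is positive or negative with equal probability.
        total = sum(m if rng.random() < 0.5 else -m for m in magnitudes)
        if total >= observed:
            hits += 1
    return hits / n_samples


p_hat = sampled_p_value(DIFFS)
std_dev = math.sqrt(p_hat * (1 - p_hat) / 10_000)
print(f"p ~ {p_hat:.3f}, standard deviation ~ {std_dev:.4f}")
```

Changing `seed` gives a different sample of sign sets, and the resulting p-values should scatter around .07 by roughly the standard deviation computed in the last step.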
