Chapter 5 A Gentle Introduction to the World of Systematics 43
Reality Is Complex, and Variables Can’t Always Be Dialed
We just finished dealing with the simple case of the experimenter who presents different densities of dots and instructs the respondent to rate the perception or do a task. What happens if we cannot “dial” the stimulus? Let’s take what we have learned and proceed on our journey.
Nature doesn’t always present us with this type of wonderful situation, where we can dial or titrate (i.e., systematically change) the test stimulus. What happens in a case where we have two alternatives, either present or absent? Let’s move out of the world of rectangles, dots, and numbers and into the world of package design, where the variables may be discrete rather than continuous, and where a variable may have several options, not just one or two, not just off or on.
Beyond One at a Time: Looking at Several Variables at Once
Most scientists are educated to look at one variable at a time. In this way, they feel that they better or more clearly understand “nature.” That is, they believe that by looking at how a single variable “drives” a response they
At this point, you’re probably thinking to yourself
“Okay, this is nice to know. I’d probably read it in a book somewhere. It’s a nice factoid that I’ll use at the next party as a conversation opener (or closer). Yet, so what? Why is this information important? What can I possibly do with this piece of information? What can knowledge of these relations do for my practical work?”
Good questions. Perhaps we are working with an e-commerce site and want to put some mechanism into place that prevents a “bot” from reading the information, yet allows a person to read the information and then type what is read. Now let’s imagine that we want to make a set of dots more dense, but not too dense. If we change the physical density, then how dense does it look, and more importantly, how comfortable is it to read a number embedded in those dots? By doing the experiment, we can discover how dense the rectangle should be to ensure that it is still readable but defeats the bot.
We have just been talking about the world of the experimental psychologist, and particularly the psychophysicist. We have looked at a private sensory experience, and asked ourselves how to change that experience. We know we cannot just add or subtract sound pressures in a willy-nilly way. Rather, there is lawfulness in nature that we must appreciate. Changes in what we present to the test subject result in changes of perception.
can provide valuable advice about testing the many different combinations (i.e., allocating a fractional part of the total set to each individual in a systematic but efficient way), the statistician really shines when it comes to designing the combinations in the first place.
Most statisticians working in the field of product and packaging development are familiar with the methods of “experimental design” (see Montgomery, 2005; Ryan, 2007). Experimental design is a branch of statistics that lays out the different combinations. Few rational researchers are so daring, or so oblivious to cost, as to test many combinations when they can get by with fewer, better-varied stimuli and make their research more cost-effective. Experimental design provides just such a solution. Indeed, we might say that experimental design actually “saves the day” and moves beyond finding answers through testing to uncovering rules that make the developer, the designer, and the marketer far smarter. The actual evaluations look like tests, and they should because that’s what they are. It’s the disciplined thinking and disciplined experimentation that create the true base of knowledge.
Beyond Tables to Models
You will see that we have progressed from considering one variable that is “continuous” to considering several variables all at once. Furthermore, if you are like most people, you probably get a bit overwhelmed by a table of numbers. This is to be expected: people are not constructed to look simply at numbers, but rather to look for patterns in the numbers. We tried to find some patterns in the previous paragraphs, such as “Do the more compound pictures with multiple variables perform better than the simpler pictures having only one variable?”
There must be a better way, and there is. We don’t have to stay with columns of numbers in a table, looking for a pattern that nature is trying to reveal. Statistics can help here. Let’s introduce the notion of “regression analysis” (Wikipedia, 2008), also commonly referred to as curve fitting. In the case of an on/off relation the idea of a curve doesn’t really fit, but the approach of curve fitting still does quite well (Arlinghaus and Arlinghaus, 1994).
Regression analysis is a branch of statistics, often called model building. Regression analysis looks for the relation between one or more independent variables and a dependent variable. Those of you who have taken a statistics course probably will remember the relatively
then “understand” how nature works. This heritage is admirable and pervades a lot of the way people think about the world. In fact, it would be fair to say that much of today’s intellectual growth in science comes from this one-at-a-time analysis of variables in the world. The truth of the matter is that most psychophysicists spend their lives understanding the world, one variable at a time.
In the commercial world of design, things are neither as simple nor as orderly. Yes, one-at-a-time variation is satisfying, but it doesn’t necessarily answer business problems about what to put on labels, what factors influence perception, and what drives the occasionally momentary impulse to buy the food when one is shopping in a store. Although the one-at-a-time method eventually uncovers the key drivers of responses to packages, the strategy is inefficient, and the timelines are very long. It takes time to do things one at a time.
Most of you who read this book work in the world of business, where the research efforts have financial consequences. Business questions have to be answered quickly. For the most part, these business questions involve a specific goal, such as increased purchase frequency or better communication of nutrition. One variable at a time simply does not do the job, or if it does, then the problem is unusually simple.
When it comes to several independent variables at one time, matters can become complicated. When we deal with only two options for each variable, we might be able to keep things to a reasonable number. The math is pretty easy to do. If each variable has two options, then one variable requires two combinations, two variables require four combinations, three variables require eight combinations, etc. The numbers don’t really start mounting until we reach five or six variables, at which time we have 32 or 64 combinations, respectively. The rule is simple: with two options for each variable, we will have $2^N$ combinations to test. When $N$ is large (i.e., many different variables to explore), $2^N$ becomes very large. The task becomes even more daunting when instead of two options per variable we have three options. Thus, we might have three colors for a package, three different labels, three different pictures of a food, three different sizes, etc. For $N$ variables, each with three options, we have $3^N$ combinations.
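The counting rule above can be checked with a few lines of Python. This is only a minimal sketch: the helper `count_combinations` and the option names in `package_options` are hypothetical placeholders introduced for illustration, not names taken from the text.

```python
from itertools import product


def count_combinations(n_variables: int, n_options: int) -> int:
    """Size of the full set of combinations: n_options ** n_variables."""
    return n_options ** n_variables


# Two options per variable, for N = 1 through 6 variables.
two_level_counts = [count_combinations(n, 2) for n in range(1, 7)]

# Three options per variable, for N = 1 through 6 variables.
three_level_counts = [count_combinations(n, 3) for n in range(1, 7)]

# Hypothetical package with three variables of three options each.
# The option names are placeholders, not taken from the text.
package_options = {
    "color": ["color_1", "color_2", "color_3"],
    "label": ["label_1", "label_2", "label_3"],
    "picture": ["picture_1", "picture_2", "picture_3"],
}

# Every full combination takes exactly one option from each variable.
all_packages = list(product(*package_options.values()))

print(two_level_counts)
print(three_level_counts)
print(len(all_packages))
```

Enumerating the combinations explicitly, rather than only counting them, makes it easy to see how quickly the list of test stimuli grows as variables are added.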
For the past 70 years or so, statisticians have been quite involved in this issue of multiple stimulus testing, especially when the test stimuli are systematically varied (Box, Hunter, and Hunter, 1978). Although statisticians
simplistic yet instructive example that most introductory statistics courses give to explain the idea of regression. Let’s look at an example of regression, this time from the U.S. Department of Agriculture. We see the data in Figure 5.2. The independent variable is year, starting in 1948. The dependent variable is a measure of relative productivity, with 1987 normalized to 1.0.
Once we plot the data, how then do we make use of it? What type of question should we ask? The figure itself simply retells the table of data. There is a bit more, however. When we plot the data, we can see the nature of the relation. We see that over the passage of time, starting in 1948, there is a systematic rise in the productivity of agriculture. We could look to any year and find its relative productivity simply by keeping our finger at the year (abscissa or x-axis), moving upward until we find the data, and finally moving leftward to the ordinate (y-axis) to discover the relative productivity.
We want to go further, however. We want to create a model or equation that shows us the numerical relation between the year and the agricultural productivity. To do this, let us move out of the world of plotting data and into the world of regression.
First, let us look at the actual data from which the curves in Figure 5.2 are drawn. Happily for us as readers and analysts, the U.S. government, specifically the Department of Agriculture, publishes these numbers. They can be found at the same website as that from which Figure 5.2 is taken. We see some of these data for the first five and the last five years in Table 5.1.
[Figure 5.2 plot: Agricultural Productivity, 1948–1996. Index (1987 = 1) plotted against Year for three series: Output, Input, and Productivity. Source: USDA-ERS]
Figure 5.2 How agricultural productivity changed over a 48-year period, from 1948 to 1996
Table 5.1 Data about agricultural inputs, outputs, and productivity. The data are shown for the first five years and the last five years only.
Trends in U.S. Agriculture, published by United States Department of Agriculture, National Agriculture Statistics Service
Index of Agricultural Productivity: 1948–1996. Source: USDA-ERS

Year    Output    Input    Productivity
1948    0.507     1.035    0.490
1949    0.507     1.097    0.462
1950    0.503     1.094    0.460
1951    0.527     1.108    0.476
1952    0.540     1.107    0.488
...
1992    1.137     0.991    1.147
1993    1.071     0.997    1.074
1994    1.217     1.025    1.187
1995    1.153     1.038    1.111
1996    1.202     1.009    1.191
We can learn a lot by plotting the data, but there is more. Suppose we want to develop a model showing the expected change, say in output, as a function of the number of years since the analysis began. Let us call 1948 year 1, 1949 year 2, etc. Now, looking at the data in Table 5.1, let us relate the number of years to the output by the simple equation $\text{Output} = k_1 (\text{Number of years}) + k_0$. This is a simple linear equation. The results appear in Table 5.2. In words, the equation says:
of years. In other words, output has been steadily increasing. In consumer research, we will see lower values for the squared multiple R.
5. The additive constant is 0.436. This value can be found in Table 5.2 in the column marked “coefficient.” We interpret this constant to mean that at time 0 (i.e., 1947) we expect the agricultural output to be 0.436. Of course, we did not measure the output then, since the data start at 1948. Nevertheless, because we have a linear equation, we can estimate the value of that equation when time is 0 (i.e., when the year is 1947). Notice that the regression analysis comes out with a coefficient value with three significant digits. This is purely mathematical. The regression modeling can report the estimate to 20 or more significant digits. In practice, however, in most cases we would use at most one significant digit.
6. The coefficient for the single independent variable, $k_1$, is 0.014. This means that the output increases by 0.014 units for each year since 1947. Thus, if we look at a four-year period, from 1947 to 1951, we can expect an increase of (0.014 = coefficient) × (4 = number of years since 1947), which is 0.056 units. Notice that once we have this coefficient, we have a sense of the rate at which agricultural output increases for each year. The goodness of fit need not be so high. Even if the multiple $R^2$ were lower, say approximately 0.60 (i.e., 60% of the variability in the output accounted for by the number of years), we would feel comfortable that we somehow have a “handle” on how fast the output grows for each year. It is this sense of learning, of rules, that makes the analysis so gratifying, and leads to an increased satisfaction that we know what is really occurring, rather than just plotting the data.
7. The standard error tells us the variability of this coefficient or additive constant. If we were to run the study again and repeat the analysis, we would expect the coefficients of the equation to vary. About 68% of the time we would expect to see a coefficient within ± 1 standard error of the estimate. The standard error is 0.011 for the additive constant, so that about 68% of the time we would expect the additive constant to lie between a low value of 0.425 and a high value of 0.447 (corresponding to 0.436 ± 0.011). For the coefficient for “years,” the standard error is almost infinitesimal, so the computer output shows it to be 0.000. Of course, if we were to extend the results to, say, 10 decimal places, we would see a non-zero value for the standard error.
1. The output is a linear function of the number of years.
2. Furthermore, when year = 0 (i.e., 1947), we can expect an output value of $k_0$.
3. Finally, for each year, we expect a constant increase in output equal to $k_1$.
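The straight-line model described here can be sketched with ordinary least squares in plain Python. This is an illustrative sketch only: it uses just the ten rows printed in Table 5.1 rather than all 49 observations, so the fitted constants should not be expected to match the published regression exactly. The helper name `fit_line` is an assumption introduced for this example.

```python
# Ten rows printed in Table 5.1 (first five and last five years only).
years = [1948, 1949, 1950, 1951, 1952, 1992, 1993, 1994, 1995, 1996]
output = [0.507, 0.507, 0.503, 0.527, 0.540,
          1.137, 1.071, 1.217, 1.153, 1.202]


def fit_line(x, y):
    """Least-squares fit of y = k0 + k1 * x; returns (k0, k1)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of squared deviations of x, and of cross-products with y.
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    k1 = sxy / sxx                 # constant increase in output per year
    k0 = mean_y - k1 * mean_x      # expected output at year 0 (i.e., 1947)
    return k0, k1


# Code the years as in the text: 1948 is year 1, 1949 is year 2, etc.
year_number = [year - 1947 for year in years]

k0, k1 = fit_line(year_number, output)
print(f"k0 = {k0:.3f}, k1 = {k1:.4f}")
```

Because only the first and last five years are available here, the fit leans entirely on the two ends of the series; with all 49 observations the estimates would shift somewhat.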
We will use the standard statistical packages for regression. Let’s unpack the figure to understand what the statistics mean. Our analysis will be helpful in the future when we look at the effects that different features exert on the perception of packages.
1. The dependent variable is “output.” The economists measured the agricultural output, in relative units, and gave that data in Table 5.1.
2. The number of “cases” or observations is 49 (N = 49). In Table 5.1, we see only 10 of the 49. However, when it comes to analyzing the data and building a model, we use all 49 observations.
3. The goodness of fit is shown by the multiple R. The multiple R shows the degree of linearity. The multiple R ranges from a low of 0 to a high of +1.00. We have a very good fit, indeed, almost a perfect fit. The multiple R is 0.983.
4. The square of the multiple R shows the proportion of the variability in the dependent variable (output) that can be accounted for by knowing the value of the independent variable (number of years). The squared multiple R, 0.967, means that almost 97% of the variability can be accounted for by knowing the number
Table 5.2 “Linear” regression analysis that fits a straight line to the relation between agricultural output (dependent variable) and number of years since 1947 (independent variable)

Dependent Variable: OUTPUT          N: 49          Multiple R: 0.983
Squared multiple R: 0.967          Adjusted squared multiple R: 0.966          Standard error of estimate: 0.038

Effect                        Coefficient    Standard Error    t statistic    P(2 Tail)
Additive Constant (k_0)       0.436          0.011             39.353         0.000
Years since 1947 (k_1)        0.014          0.000             37.168         0.000
Our analysis will be the so-called “dummy variable regression.” Dummy regression refers to the nature of the independent variables, which take on only one of two values. If in a test stimulus (i.e., package design, concept, etc.) the element is present, then the element is represented by the value 1. In contrast, if the element is absent from the test stimulus, then the element is represented by the value 0.
The representation of 1 and 0 is not done simply as a way to show presence/absence. Rather, the representation will allow us to use these two numbers as the values of the independent variable. There is a simple logic operating behind the scenes here. Let’s return for a minute to our example about agricultural output versus year. The equation is written as:
$\text{Output} = k_0 + k_1 (\text{Year})$
Recall that the coefficient $k_1$ shows us the expected change in output for each one-year change. So when $k_1 = 0.15$, we expect output to change by 0.15 units when we go from year 1 to year 2, and the same 0.15-unit change when we go from year 2 to year 3, etc.
Now imagine that we are dealing with package design, rather than with agricultural output. We have a database like we had for Table 5.1, but this time the independent variables are design elements. These are the elements A, B, and C. The three design elements can either be present or absent. We see the coding of the eight different combinations, as well as the percent of respondents who rated each package design as communicating “healthful” (rating of 7–9 on the 9-point healthfulness scale). See Table 5.3.
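To make the 1/0 coding concrete, here is a minimal Python sketch of the eight present/absent combinations for elements A, B, and C. The percentages are INVENTED from an assumed additive rule purely for illustration; they are not the values in Table 5.3. The sketch also relies on a shortcut that holds only for a complete, balanced set of combinations with a purely additive model: the least-squares coefficient of each element equals the mean response when the element is present minus the mean when it is absent.

```python
from itertools import product

elements = ["A", "B", "C"]

# All eight present/absent combinations: 1 = present, 0 = absent.
designs = list(product([0, 1], repeat=len(elements)))

# INVENTED numbers for illustration only (not from Table 5.3):
# an assumed additive constant and one assumed effect per element.
assumed_constant = 30
assumed_effects = {"A": 10, "B": -5, "C": 20}

# Hypothetical "percent healthful" generated from the assumed additive rule.
percent_healthful = [
    assumed_constant + sum(assumed_effects[e] * x for e, x in zip(elements, row))
    for row in designs
]


def mean(values):
    return sum(values) / len(values)


# Shortcut valid for a balanced, complete design with an additive model:
# coefficient = mean(present) - mean(absent).
coefficients = {}
for j, name in enumerate(elements):
    present = [y for row, y in zip(designs, percent_healthful) if row[j] == 1]
    absent = [y for row, y in zip(designs, percent_healthful) if row[j] == 0]
    coefficients[name] = mean(present) - mean(absent)

# Each 0/1 column averages 0.5 across the eight rows, so the additive
# constant is the grand mean minus half the sum of the coefficients.
additive_constant = mean(percent_healthful) - 0.5 * sum(coefficients.values())

for row, y in zip(designs, percent_healthful):
    print(row, y)
print(coefficients, additive_constant)
```

With real ratings the data would not follow an additive rule exactly, and a regression package would be the usual tool; the sketch only shows why coding presence as 1 and absence as 0 lets each coefficient be read as the contribution of that element.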
8. The “t” value is Student’s t statistic. The t value is defined as ((coefficient − 0)/standard error). The t value has a sampling distribution. That is, for any t value, we know the probability of getting a t value at least that large if the coefficient were truly 0 rather than what we observe. Here, the probability that the constant or coefficient is really 0, rather than what we observe, is infinitesimally small. The “t” is very high, so the probability is virtually nil that we are seeing a random fluctuation from a true mean of 0.
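The arithmetic behind items 7 and 8 can be sketched in a few lines of Python, using the rounded values printed in Table 5.2 for the additive constant. The function names are assumptions introduced for this example. Because the inputs are rounded to three decimals, the ratio comes out near 39.6 rather than the 39.353 in the table, presumably because the statistical package worked from unrounded values. The same calculation cannot be reproduced for the “years” coefficient, whose standard error is printed as 0.000.

```python
def t_statistic(coefficient: float, standard_error: float) -> float:
    """t = (coefficient - 0) / standard error, as defined in item 8."""
    return (coefficient - 0.0) / standard_error


def one_standard_error_band(coefficient: float, standard_error: float):
    """Range expected to hold the coefficient about 68% of the time (item 7)."""
    return coefficient - standard_error, coefficient + standard_error


# Rounded values for the additive constant, as printed in Table 5.2.
k0 = 0.436
se_k0 = 0.011

t_k0 = t_statistic(k0, se_k0)
low, high = one_standard_error_band(k0, se_k0)

print(f"t for additive constant (from rounded inputs): {t_k0:.2f}")
print(f"about 68% band: {low:.3f} to {high:.3f}")
```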
Extending Our Approach to the Simpler Case: Present/Absent
Let’s now move forward with our understanding of modeling. We will move out from the world of continuous variables such as year, which take on a stream of values such as 1–49, and move into the world of “on-off” or “yes-no.” This world is more appropriate for package design, where we deal with the presence/absence of features on a package. It’s a rare case when we can systematically vary one variable over a wide range, to look at the equation relating the size of that variable (e.g., size of logo) to the rating (e.g., interest in buying the product based on the package).
The more typical situation is a package that comprises several silos (variables), with each silo comprising, in turn, several options (elements). The number of silos may go from as few as one (i.e., presence/absence of a logo) to a dozen or more (logo, color, burst to show “new/improved,” price, color of background for price, etc.). The types of silos are endless, limited only by the imagination of the designer and, of course, the space on the package itself. The more complex case, though not necessarily more difficult in the long run, will comprise several silos and different numbers of options for each silo. We will look at an approach to solve the problem of “How does each element in each silo drive the response?” in this more complicated situation. And, as a bonus, the straightforward approach that we outline will be used in the rest of this book to help us learn rules about package design.
Arrays of 1s and 0s: A Useful System to Represent the Combinations
When we deal with these complicated problems of many