The pilot data from Experiment 1 indicate that the long-term trend may not be readily interpreted in graphs that also show short-term variability. The purpose of Study 5 was to investigate whether people encode representations of trends and short-term variability when looking at complex time-series graphs. Furthermore, given the finding in Study 4 that trends may not be readily interpreted, Study 5 also asked whether language can support the identification of long-term trends.
Language can provide user-goals, which are thought to activate relevant schema and guide visual-spatial attention (Brunyé & Taylor, 2009; Rothkopf, Ballard, & Hayhoe, 2007; Yarbus, 1967). Attending to spatial language when encoding visual scenes can support spatial reasoning (Loewenstein & Gentner, 2005), influence memory for a scene (Feist & Gentner, 2007), and affect the degree to which static images are mentally animated (Coventry, et al., 2013). Therefore, using language to convey the importance of the long-term trend might direct attention to visual features that support encoding of the trend and influence cognition about the trend. Furthermore, presenting this as a ‘warning’ can make the information salient and increase the likelihood that it is acted upon (Wogalter, et al., 1987).
However, linguistic information, including warnings, can be ignored by individuals (Eiriksdottir & Catrambone, 2011; Wogalter, et al., 1987). Even if read, linguistic information may be shallowly processed (LeFevre & Dixon, 1986). Further, although language might be intrinsically tied to flexible spatial skills (Hermer-Vazquez, Spelke & Katsnelson, 1999), individuals can instead rely on visual cues, weighted by prior experience, to support spatial processing (Learmonth, Newcombe, Sheridan, & Jones, 2008). Hence, a linguistic warning might not support interpretation of trends from time-series graphs.
The aim of Study 5 was therefore to test whether a linguistic warning that provides a strategy for interpreting long-term trends (by ignoring task-irrelevant features) would improve encoding of the long-term trend. Furthermore, if the warning is effective, the study asked how long-lasting its effect is, and whether the effect is driven by changes in visual attention (measured using eye tracking) or whether the warning might merely provide a schema to help organize visual information into long-term memory, without affecting visual attention directly. Informed by previous work (Brunyé & Taylor, 2009; Peebles & Cheng, 2003), it was predicted that the warning would direct visual attention to information consistent with a mentally superimposed line of best fit.
The study also manipulated a perceptual feature of time-series graphs – the number of intermediary x-axis tick marks and labels. In line with evidence of interactions between top-down and bottom-up processes (Hegarty, Canham, & Fabrikant, 2010), it was hypothesized that intermediary x-axis tick marks and labels might provide salient cues that direct attention to short-term changes in the data, resulting in poorer spatial representation of the long-term trend.
Method
Design
To test spatial representations of the long-term trend and short-term variability, a forced choice task was employed in which participants were shown a graph to study and were then asked to make a ‘same’ or ‘different’ judgment on a following test graph. The test graph was either identical to the study graph (same); had the same overall pattern as the study graph but with a different gradient (gradient different); had the same gradient as the study graph but with exaggerated peaks and troughs (amplitude different); or was completely different to the study graph (completely different). The number of x-axis ticks, either 2, 5 or 9, was varied across each type of test graph (see Figure 13 for examples).
Figure 13: Three examples of study graphs (solid line) and associated test graphs (dashed line) shown here together. Study and test graphs both used solid lines for stimuli presentation and were shown sequentially in the experiment.
To test the effect of a linguistic warning on cognition of the graph, participants were randomly allocated to either receive a warning at the start of the study, or to receive no such warning. The warning read:
“WARNING When looking at graphs, people are often misled by extreme data points – short-term fluctuations in the data can obscure the long-term trend. To avoid errors, it is useful to ignore extreme data points to correctly identify the long-term trend.”
The experiment therefore employed a 4 (Test Graph) x 3 (X-ticks) x 2 (Warning) design, with test graph and x-ticks as within-participant variables and warning as a between-participant variable.
Participants
Forty undergraduate students (29 female, 11 male) from the University of East Anglia took part in the study in return for course credit or a nominal payment. Average age was 21 years (range 18-30 years). Sample size was informed via power analysis to detect a medium effect size (ηp2 = .060).
Apparatus
A Tobii TX300 Eye Tracker (Tobii Technology AB, Danderyd, Sweden) with integrated TFT LCD monitor (51 cm x 29 cm) set to 1280 x 720 pixels was used for stimulus presentation and collection of eye gaze data at 300 Hz. E-Prime Version 2.0 (Psychology Software Tools Inc., Sharpsburg, USA) was used to control stimulus presentation and record data. Responses for same-different trials were mapped to the ‘Z’ and ‘M’ keyboard keys, which were reversed and counterbalanced between warning conditions. Verbal responses were recorded via a headset microphone. Eye gaze data were analyzed using OGAMA Version 4.5 (A. Voßkühler, Freie Universität Berlin, Germany), using default parameters for fixation detection.
Linguistic warning
Graph stimuli – ‘same-different’ trials
Study time-series graphs were created (1126 x 510 pixels), each plotting 17 data points. Twelve initial datasets were created for the study graphs for ‘same-different’ trials, four of which showed an underlying positive long-term trend, four a negative long-term trend and four a flat long-term trend (Figure 13). Data points for each graph were created by sampling residuals at random from a normal distribution, which were then applied to a baseline positive (gradient = 1.0, intercept = 30), negative (gradient = -1.0, intercept = 50) or flat (gradient = 0.0, intercept = 40) linear trend graph. The x-axis was labeled ‘Years’ and the y-axis was labeled either as “Medication use (doses)”, “Infections (patients)”, “Temperature (°C)”, “Rainfall (mm)”, “Income (GBP £)”, or “Expenditure (USD $)”. The x-axis covered a range of 16 years, with the starting year always between 1900 and 1994. The y-axis covered a range of 40 units, starting at 20 and finishing at 60 units. A caption was created for each graph that simply read “[variable] over time”. Three study graphs – one with a positive trend, one with a negative trend, and one with a flat trend – were allocated to each of the four test graph conditions (same, gradient different, amplitude different, completely different).
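The dataset construction described above can be illustrated with a minimal sketch. The standard deviation of the residuals (`residual_sd`) and the coding of x as 0 to 16 (so that the intercept falls on the first year) are not reported here and are assumptions made purely for illustration; the function and variable names are hypothetical.

```python
import random

# Baseline trends reported in the text: (gradient, intercept).
BASELINES = {
    "positive": (1.0, 30.0),
    "negative": (-1.0, 50.0),
    "flat": (0.0, 40.0),
}

def make_study_series(gradient, intercept, n_points=17, residual_sd=3.0, seed=None):
    """Return (xs, ys): a linear trend plus normally distributed residuals.

    residual_sd and the 0..16 coding of x are assumptions, not reported values.
    """
    rng = random.Random(seed)
    xs = list(range(n_points))  # 17 points span a range of 16 years
    ys = [intercept + gradient * x + rng.gauss(0.0, residual_sd) for x in xs]
    return xs, ys

xs, ys = make_study_series(*BASELINES["positive"], seed=1)
```

Under this coding, the noise-free positive baseline runs from 30 to 46 and the negative baseline from 50 to 34, both within the 20 to 60 range of the y-axis.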
For each of the twelve study graphs, a corresponding test graph was then created. For the three study graphs allocated to the ‘same’ condition, test graphs were identical to the study graph. Test graphs for the three study graphs allocated to the ‘gradient different’ condition had a subtly different gradient to the study graph (transformation of the y values of the study graph: y' = y ± 0.4x). The direction of the transformation, i.e. a shift upward applying +0.4x or a shift downward applying -0.4x, was matched to the gradient of the line of best fit for the study graph. Flat trend graphs had gradients close to, but not exactly equal to, 0, owing to the random sampling of residuals. Therefore, positive long-term trend study graphs had test graph pairings that became steeper (more positive), and negative long-term trend study graphs had test graph pairings that also became steeper (more negative). Flat long-term trend study graphs with a line of best fit gradient > 0 had test graph pairings that became more positive, and those with a line of best fit gradient < 0 had test graph pairings that became more negative.
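The gradient transformation can be sketched as follows. The sketch assumes that x is measured from the first data point (0 to 16), which is not stated in the text, and uses an ordinary least squares fit to determine the sign of the study graph's gradient; the function names are hypothetical.

```python
def fit_gradient(xs, ys):
    """Gradient of the least squares line of best fit."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx

def gradient_different(xs, ys, delta=0.4):
    """Apply y' = y + 0.4x or y' = y - 0.4x, matching the sign of the fitted gradient."""
    sign = 1.0 if fit_gradient(xs, ys) > 0 else -1.0
    return [y + sign * delta * x for x, y in zip(xs, ys)]
```

Because the added term is linear in x, the fitted gradient of the test graph differs from that of the study graph by exactly 0.4 in the matched direction, while the residuals about the line of best fit are unchanged.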
Test graphs for the three study graphs allocated to the ‘amplitude different’ condition had extended peaks and troughs compared to the study graph (residuals multiplied by 1.4). For the three study graphs allocated to the ‘completely different’ condition, three new graphs were produced to serve as test graphs. For each of the 12 study-test graph pairings, three variants were then created, showing 2, 5 and 9 x-ticks (Figure 13), resulting in a total of 36 study-test graph pairings.
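The amplitude manipulation can be sketched in the same way. The sketch assumes that the residuals are taken about the least squares line of best fit of the study graph; the text does not specify whether the residuals were scaled about the fitted line or about the baseline trend, so this is one possible reading. The function names are hypothetical.

```python
def fit_line(xs, ys):
    """Return (gradient, intercept) of the least squares line of best fit."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    gradient = sxy / sxx
    return gradient, mean_y - gradient * mean_x

def amplitude_different(xs, ys, factor=1.4):
    """Scale each residual about the fitted line by `factor` (assumed reading)."""
    gradient, intercept = fit_line(xs, ys)
    fitted = [intercept + gradient * x for x in xs]
    return [f + factor * (y - f) for f, y in zip(fitted, ys)]
```

Under this reading, the line of best fit of the test graph is identical to that of the study graph, so only the size of the peaks and troughs changes.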
Graph stimuli – ‘describe’ filler trials
A further group of graphs was created (using the same pixel dimensions, plotting the same number of data points, and using the same labelling as for the same-different trials), which acted as filler trials on which participants were tasked to describe the graph. Three initial datasets were created, one with a positive long-term trend, one with a negative long-term trend, and one with a flat long-term trend. For each of these initial datasets, three graph variants were then created, showing 2, 5 and 9 x-ticks, resulting in a total of 9 graphs for the ‘describe’ filler trials.
Graph stimuli – ‘comprehension’ filler trials
A final group of study time-series graphs was created (using the same pixel dimensions, plotting the same number of data points, and using the same labelling as for the same-different trials), which acted as filler trials on which participants were asked to answer a comprehension question about the graph. Nine initial datasets were created, three with a positive long-term trend, three with a negative long-term trend, and three with a flat long-term trend. In this instance, within each set of positive, negative and flat graphs, one graph showed 2 x-ticks, one showed 5 x-ticks and one showed 9 x-ticks. In total there were 9 graphs for ‘true-false comprehension’ filler trials.
Areas of interest (AOI)
AOIs were defined for each study graph by first determining a circle around each data point with a maximum diameter that would avoid overlapping adjacent AOIs (58 pixels), i.e. the largest mutually exclusive area that could be defined for a data point radiating from the centre of each data point. A parallelogram (2.0 x 34.5 degrees of visual angle) was then fitted over the line of best fit of the plotted data, determined by linear least squares regression. The height of the parallelogram was the same as the diameter used for the data points (58 pixels), and the length of the parallelogram was determined by the distance between the outer edges of the first and last data point AOI (1002 pixels). The parallelogram formed the line of best fit AOI (6.3% of screen area). A convex hull was then determined around the outer edges of the defined shapes, which formed the whole data AOI (mean 22.1% of screen area). An extreme data AOI was defined as the area of the whole data AOI that sat outside of the line of best fit AOI (mean 15.8% of screen area) (Figure 14).
Figure 14: Line of best-fit AOI and extreme data AOI for one of the 24 study graphs.
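The reported dimensions of the line of best fit AOI can be checked with simple arithmetic, and membership of a fixation in that AOI can be sketched. The sketch assumes that the parallelogram has vertical sides, that its 58-pixel height is measured vertically and its 1002-pixel length horizontally, and that fixations and the fitted line are expressed in screen pixel coordinates; none of these details is stated in the text, and the names used are hypothetical.

```python
SCREEN_W, SCREEN_H = 1280, 720   # display resolution in pixels
AOI_HEIGHT = 58                  # same as the data point circle diameter
AOI_LENGTH = 1002                # outer edge of first to outer edge of last circle
N_POINTS = 17

# Parallelogram area = base x height; share of the whole screen.
fit_aoi_share = (AOI_HEIGHT * AOI_LENGTH) / (SCREEN_W * SCREEN_H)

# Implied horizontal spacing between adjacent data points
# (assumes equally spaced points and a horizontally measured length).
point_spacing = (AOI_LENGTH - AOI_HEIGHT) / (N_POINTS - 1)

def in_fit_aoi(fx, fy, gradient, intercept, x_start, x_end, height=AOI_HEIGHT):
    """True if fixation (fx, fy) lies inside the assumed parallelogram."""
    if not (x_start <= fx <= x_end):
        return False
    return abs(fy - (gradient * fx + intercept)) <= height / 2
```

With these values the parallelogram covers about 6.3% of the screen, in line with the figure reported above, and the implied spacing of 59 pixels between data points is just larger than the 58-pixel circle diameter, consistent with non-overlapping data point AOIs.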
Procedure
Participants were informed that the study was investigating how people understand line graphs and they then received instructions on screen before a practice block of trials. The eye tracker was then calibrated. Participants were randomly allocated to either the warning or no warning condition, with the requirement of two equal sized groups (20 participants in each group).
Participants in the warning condition then received the warning on screen and were instructed to read it before starting the first of three blocks of trials. Participants in the no warning condition simply started the first block of trials after eye tracker calibration. Each trial consisted of a study phase (Figure 15) during which participants were asked to look at and study the caption and the graph. The caption was presented prior to the graph to help control time spent reading the caption. The test phase began by indicating which task would follow, i.e. same-different, true-false, or describe (true-false and describe tasks were included to encourage participants to study the graphs in a naturalistic way and to ensure depth of encoding). For same-different trials, participants then made a same-different judgment about a test caption and then a same-different judgment about a test graph (i.e. comparing to their memory for the study caption and study graph). Participants were instructed to give a response as quickly as possible when the test caption/graph appeared.
Trials were presented in three blocks. Each block contained 18 trials – 12 same-different trials, three true-false filler trials and three describe filler trials – presented in random order. Within a block, each of the initial 12 same-different study datasets appeared once, with each x-tick variant appearing in separate blocks (i.e. a same-different study graph dataset only appeared once in a block). Study-test graph pairings were allocated to blocks such that each block contained three ‘same’ trials, three ‘amplitude different’ trials, three ‘gradient different’ trials and three ‘completely different’ trials. Furthermore, each block contained four positive trend same-different study graphs, four negative trend same-different study graphs, and four flat trend same-different study graphs. In addition, each block contained four study-test graphs for each of 2, 5 and 9 x-ticks. Hence, for same-different trials, trial type, trend and x-ticks were balanced in each block. Each block also balanced trend and x-ticks among ‘describe’ and ‘comprehension’ filler trials. See Appendix 2 Table A2-1 and Table A2-2 for full allocation of trials to blocks.
The specific trials allocated to a block were identical for all participants, but the order in which trials appeared within a block was randomised for each participant. Further, the order of the blocks was counterbalanced across participants. The eye tracker was re-calibrated at the start of each block. At the end of the third block, participants in the warning condition were asked what they remembered about the warning. The study lasted approximately 1 hour.
Results
Data screening
Due to the importance of encoding the warning, a strict exclusion criterion was used, requiring accurate recall of the warning at the end of the study. Only same-different trials where participants correctly remembered the caption and then went on to make a judgement about the graph were included in the analyses. Six participants were removed from further analyses: four participants in the warning condition who could not recall the warning at the end of the study; one participant who subsequently reported monocular vision impairment; and one participant whose accuracy on completely different trials was 11% (lower than three standard deviations from mean accuracy). Following data screening, 34 participants were included in data analysis, 18 in the no warning condition and 16 in the warning condition.
Figure 15: Presentation of same-different and filler trials.
Task performance
Sensitivity to detect differences between the graphs on same-different trials was measured using d', calculated using the log-linear rule (Hautus, 1995). There was no significant difference between the warning and no warning groups on ability to discriminate between completely different trials, t(32) = -0.341, p = .735, d = 0.117, 95% CI [-0.558, 0.790]. To assess sensitivity to detect subtle changes between study and test graphs, participants’ d' scores for amplitude and gradient sensitivity were analyzed with a 2 (Test Graph [amplitude different, gradient different]) x 3 (X-ticks [2, 5, 9]) x 2 (Warning [no warning, warning]) mixed ANOVA.
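The log-linear rule adds 0.5 to the number of hits and false alarms and 1 to the number of trials of each type before the rates are converted to z-scores, which keeps d' finite when a rate would otherwise be 0 or 1. A minimal sketch is given below. It assumes that a hit is a ‘different’ response on a different trial and a false alarm is a ‘different’ response on a ‘same’ trial, and that d' is computed as the simple difference of z-scores; these details and the function name are illustrative assumptions rather than a description of the exact computation used.

```python
from statistics import NormalDist

def d_prime_loglinear(hits, n_different, false_alarms, n_same):
    """d' with the log-linear correction (Hautus, 1995).

    hits: 'different' responses on different trials (assumed definition)
    false_alarms: 'different' responses on same trials (assumed definition)
    """
    hit_rate = (hits + 0.5) / (n_different + 1)
    fa_rate = (false_alarms + 0.5) / (n_same + 1)
    z = NormalDist().inv_cdf  # inverse of the standard normal CDF
    return z(hit_rate) - z(fa_rate)

# Perfect performance on 9 different and 9 same trials still gives a finite value.
example = d_prime_loglinear(hits=9, n_different=9, false_alarms=0, n_same=9)
```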
Means and standard deviations for each cell of the analysis are provided in Appendix 3, Table A3-1. There was no main effect of test graph, x-ticks, or warning (Table 5). However, there was a significant interaction between test graph and warning, F(1,32) = 4.399, p = .044, ηp2 = .121 (Figure 16). Participants in the no warning condition performed significantly worse on gradient different trials than amplitude different trials, t(17) = -3.381, p = .004, d = -0.823, 95% CI [-1.364, -0.263], whereas those in the warning condition performed about equally on gradient different trials and amplitude different trials, t(15) = 0.112, p = .912, d = 0.030, 95% CI [-0.497, 0.556]. There were no other significant two-way interactions and no three-way interaction (Table 5).
Table 5. Study 5 mixed ANOVA table (test graph x x-ticks x warning); * indicates significance at the .05 level.
Source                   Test               p-value   ηp2
Main effects
test graph               F(1,32) = 3.655    .065      .103
x-ticks                  F(2,64) = 0.365    .696      .011
warning                  F(1,32) = 0.034    .855      .001
Two-way interactions
test graph x warning     F(1,32) = 4.399    .044*     .121
test graph x x-ticks     F(2,64) = 0.060    .942      .002
x-ticks x warning        F(2,64) = 2.512    .089      .073
Three-way interaction
Figure 16: Average sensitivity (d') for amplitude different and gradient different trials in each group, with 95% confidence intervals.
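The partial eta squared values reported alongside each F statistic can be recovered from the F value and its degrees of freedom using the standard identity ηp2 = (F x df_effect) / (F x df_effect + df_error). The short sketch below applies this identity to the reported statistics; the function name is hypothetical.

```python
def partial_eta_squared(f_value, df_effect, df_error):
    """Partial eta squared recovered from an F statistic and its degrees of freedom."""
    return (f_value * df_effect) / (f_value * df_effect + df_error)

# Test graph x warning interaction: F(1,32) = 4.399
interaction = round(partial_eta_squared(4.399, 1, 32), 3)
# X-ticks main effect: F(2,64) = 0.365
x_ticks = round(partial_eta_squared(0.365, 2, 64), 3)
```

Applied to the test graph x warning interaction and the x-ticks main effect, this reproduces the reported values of .121 and .011.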
To investigate if the effect of the warning on gradient performance deteriorated over time, d' values were recalculated by collapsing data across x-ticks (as there was no significant x-ticks main effect or interaction), and then splitting the data by block of trials, i.e. first block, intermediary block, last block. A 2 (Test Graph) x 3 (Block) x 2 (Warning) mixed ANOVA was then performed. Means and standard deviations for each cell of the analysis are provided in Appendix 3, Table A3-2. Results were consistent with the first mixed ANOVA (i.e. a significant Test Graph x Warning interaction), but there was no three-way interaction between test graph, warning and block (Table 6).
Table 6. Study 5 mixed ANOVA table (test graph x block x warning); * indicates significance at the .05 level.
Source                          Test               p-value   ηp2
Main effects
test graph                      F(1,32) = 4.092    .051      .113
block                           F(2,64) = 1.116    .334      .034
warning                         F(1,32) = 0.014    .906      <.001
Two-way interactions
test graph x warning            F(1,32) = 4.319    .046*     .119
test graph x block              F(2,64) = 0.509    .603      .016
block x warning                 F(2,64) = 0.330    .720      .010
Three-way interaction
test graph x block x warning    F(2,64) = 0.026    .974      .001
Visual attention
To investigate if the improved discriminability of the gradient found in the warning condition might be driven by differences in visual attention during encoding, fixation durations for the AOIs of the study graphs were calculated. Four participants were excluded from further analysis as they had poor eye tracking calibrations (two participants from each of the warning conditions, leaving 16 participants in the no warning group and 14 participants in the warning group). Same-different trials in which a correct response was given to the caption and a response was given to the test graph, all trials for the true-false task in which a response was given, and all trials for the describe task were included in the analysis. However, individual trials were excluded if >15% of eye tracking samples were missing, or if there was a continuous period >700ms of data missing (10.7% of trials). As there was no main effect or interaction of x-ticks in the d' data, fixation data were collapsed across x-ticks.
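The trial-level exclusion rule can be sketched as follows. The sketch assumes that each trial is represented as a sequence of per-sample validity flags recorded at 300 Hz, so that a continuous gap of more than 700 ms corresponds to more than 210 consecutive missing samples; the data representation and function name are hypothetical.

```python
def exclude_trial(valid, sample_rate_hz=300, max_missing_share=0.15, max_gap_ms=700):
    """Return True if a trial should be excluded for missing eye tracking data.

    valid: sequence of booleans, one per gaze sample (True = sample recorded).
    """
    n = len(valid)
    if n == 0:
        return True  # no data at all (assumed to be excluded)
    missing = sum(1 for v in valid if not v)
    if missing / n > max_missing_share:
        return True
    # Longest permitted run of consecutive missing samples (210 at 300 Hz).
    max_gap_samples = max_gap_ms * sample_rate_hz / 1000
    run = 0
    for v in valid:
        run = 0 if v else run + 1
        if run > max_gap_samples:
            return True
    return False
```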
The data were checked to see if the warning influenced the total fixation duration for the whole data area compared to the no warning group, finding no significant difference: t(28) = 1.288, p = .208 (two-tailed, equal variances assumed), d = 0.471, 95% CI [-0.261, 1.195]. Fixation durations for the line of best fit AOI and extreme data AOI were then compared. Homogeneity of variances between the warning and no warning groups could not be assumed for total fixation data for the line of best fit AOI or the extreme data AOI; Levene’s tests for equality of variances gave F(1,28) = 9.121, p = .005, and F(1,28) = 5.285, p = .029, respectively. Therefore, separate independent t-tests (equal variances not assumed) were performed on the data in line with a priori predictions.
Participants in the warning condition spent significantly longer fixating on the line of best fit area than participants who did not receive the warning, t(19.802) = 2.119, p = .024 (one-tailed, equal variances not assumed), d = 0.804, 95% CI [0.050, 1.545]. Conversely, there was no significant difference for the extreme data area, t(25.137) = -0.352, p = .728 (two-tailed, equal variances not assumed), d = -0.125, 95% CI [-0.842, 0.594] (see Table 7 for fixation durations).
Table 7: Study 5 mean (M) and standard deviation (SD) of fixation duration in ms during study for each AOI.
Area of interest     No warning (n = 16)    Warning (n = 14)
                     M (SD)                 M (SD)
Line of best fit     1426 (432)             1919 (772)
Extreme data         1587 (586)             1525 (356)
Whole data           3013 (884)             3444 (952)
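The unequal-variance (Welch) t statistics and their fractional degrees of freedom can be approximately reproduced from the summary statistics in Table 7. The sketch below uses the standard Welch formula and the Welch-Satterthwaite approximation for the degrees of freedom; because it works from rounded means and standard deviations rather than the raw data, small discrepancies from the reported values are to be expected. The function name is hypothetical.

```python
import math

def welch_t(mean1, sd1, n1, mean2, sd2, n2):
    """Welch's t and Welch-Satterthwaite degrees of freedom from summary statistics."""
    v1 = sd1 ** 2 / n1
    v2 = sd2 ** 2 / n2
    t = (mean1 - mean2) / math.sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df

# Warning (n = 14) versus no warning (n = 16), values taken from Table 7.
t_fit, df_fit = welch_t(1919, 772, 14, 1426, 432, 16)   # line of best fit AOI
t_ext, df_ext = welch_t(1525, 356, 14, 1587, 586, 16)   # extreme data AOI
```

The resulting values lie close to the reported t(19.802) = 2.119 for the line of best fit AOI and t(25.137) = -0.352 for the extreme data AOI.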