6.1 Assigning Fixations to Text
We set out to measure the accuracy with which fixations can be assigned to lines of text. The results reported in Section 5.2 are rather disappointing: in most cases, the discrimination between adjacent lines was quite poor. This was not entirely unexpected, as quantified in Table 3. We chose to study font sizes and line spacings that are not too far removed from those used in conventional work conditions. But with these settings, the pitch of the lines on the screen was in the range of 0.5 to 1.1 cm. When viewed at a distance of 60 cm, the difference in viewing angle of adjacent lines ranged from 0.47\(^\circ\) to 1.05\(^\circ\). Such angles are at the resolution limit of the eye tracker and fall within the region covered by the fovea. Therefore, it is not surprising that adjacent lines are hard or impossible to distinguish.
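The relation between line pitch and viewing angle quoted above can be checked with a few lines of Python. This is only an illustrative sketch: the function name `visual_angle_deg` is ours, and it assumes the standard geometric formula for the angle subtended by a small object viewed head-on, which the text does not state explicitly.

```python
import math


def visual_angle_deg(pitch_cm: float, distance_cm: float) -> float:
    """Angle in degrees subtended by a separation of pitch_cm seen from distance_cm.

    Assumes the separation is centered on the line of sight (hypothetical helper).
    """
    return math.degrees(2 * math.atan(pitch_cm / (2 * distance_cm)))


# Pitches at the two ends of the range studied, viewed from 60 cm.
for pitch in (0.5, 1.1):
    angle = visual_angle_deg(pitch, 60)
    print(f"pitch {pitch} cm at 60 cm -> {angle:.2f} degrees")
```

Both values come out at or below roughly one degree, in line with the range reported in the text, which is why adjacent lines fall inside a single foveal view.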
Given that the resolution of eye trackers is around 1\(^\circ\) [50] and the coverage of the fovea is about 2\(^\circ\) [39, 51], the limiting factor is the fovea. To be completely sure which line is being looked at, the separation between the lines should be at least 2\(^\circ\). At a distance of 60 cm, this means a pitch of 2.1 cm on the screen, which can be achieved with a 20-pt font at a spacing of 2.6 (more than double space). At a distance of 75 cm, a pitch of 2.6 cm is required, implying a triple-spaced 20-pt font. Placing the experiment’s participants closer to the screen reduces the requirements, but they are still quite far from what developers normally use in their work.
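The pitch requirements above can be reproduced with a short calculation. The sketch below is hedged: the helper names are hypothetical, and the conversion from pitch to a line-spacing multiple assumes that a single-spaced line is about 1.15 times the font size. That factor is our assumption, chosen because it is a common word-processor default; the text does not say which factor was used.

```python
import math

CM_PER_POINT = 2.54 / 72  # one typographic point is 1/72 inch


def required_pitch_cm(angle_deg: float, distance_cm: float) -> float:
    """On-screen separation that subtends angle_deg when viewed from distance_cm."""
    return 2 * distance_cm * math.tan(math.radians(angle_deg) / 2)


def spacing_multiple(pitch_cm: float, font_pt: float,
                     single_spacing_factor: float = 1.15) -> float:
    """Line-spacing multiple needed to reach pitch_cm with a font of font_pt.

    single_spacing_factor is an ASSUMED ratio of single-spaced line height
    to font size; it is not taken from the source.
    """
    single_line_cm = font_pt * single_spacing_factor * CM_PER_POINT
    return pitch_cm / single_line_cm


for distance in (60, 75):
    pitch = required_pitch_cm(2.0, distance)
    multiple = spacing_multiple(pitch, font_pt=20)
    print(f"{distance} cm: pitch {pitch:.2f} cm, spacing about {multiple:.1f}x")
```

Under the assumed factor, the results are close to the figures quoted in the text; a different single-spacing factor would shift the spacing multiples but not the required pitch.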
In an experiment like ours, where the only relevant target is one predefined line, this is not a problem. But in a reading study where the line being read needs to be identified without ambiguity, it is. In some cases, it may be possible to identify the line by using the horizontal dimension, provided adjacent lines have different indentations and lengths. However, given our result that counting lines required fewer fixations than the number of lines, one cannot in general exclude the possibility that adjacent lines can be read during the same fixation.
6.2 Human Behavior
Our experiment was not designed to study human behavior. However, the participants did behave in diverse and interesting ways. Two contrasting forms of behavior were easily discerned: avoiding unnecessary work and repeating work unnecessarily.
The results concerning avoiding unnecessary work are shown in Figure 14. The instructions given to the participants were to initially read the texts, and then to read the question and figure out the answer. But they soon learned that they did not really need to read the whole text to answer the questions. So they started to skip the text and go straight to the question. As shown in Figure 14, initially nearly all participants read or at least skimmed the text. From the second text, half already skipped it. After the fourth text, only one participant continued to read the text each time.
This result resonates with the literature on code comprehension. Although reading and understanding code takes up the majority of developers’ time [29, 57], there have been many observations implying that they may not really “read” the code in the conventional sense of the word. Instead, they employ the “as-needed” comprehension strategy of Littman et al. [27], use the opportunistic approach of Letovsky [25], or perform fact finding activities as suggested by LaToza et al. [24]. Rather than trying to understand the code, they may look for shortcuts that enable them to complete the task without investing the effort required to achieve actual understanding [26, 44].
The results concerning unnecessary work were described previously in Sections 5.3 and 5.4. The term unnecessary is of course subjective. We use it because of the extreme simplicity of our tasks: they require remembering only the line number and one to three letters. So we expected that participants would exhibit an efficient viewing pattern, where they read the question, find the target line, and focus on it. Instead, some of them returned to the question multiple times or counted the lines multiple times. Presumably this was done to confirm the line number and the letters to look for. This behavior is somewhat surprising given the simplicity of the tasks. Nevertheless, some of our participants needed to reassure themselves during the execution of the tasks that they were doing the right thing.
6.3 Reading Order
The phenomena discussed previously have an effect on reading order. In the context of reading code, Busjahn et al. [8] suggested two possible reading orders: “story order” (top-to-bottom, left-to-right) and “execution order” (tracing the execution of the program, including function calls and loops). Importantly, these possible orders were studied in the context of code comprehension tasks, where participants need to summarize the code or determine the value of a variable.
When performing code comprehension research, we need to be punctilious regarding the distinction between comprehension in the sense of understanding what the code does (its functionality), and comprehension in the sense of understanding how the code does it, that is, the code’s structure and the inter-relationships between different parts of the code. Reading patterns for these diverse goals may be completely different. Academic research often emphasizes the first, such as in studies of code summarization [1, 42]. But in real-life work, developers are typically more interested in the second, such as when performing code maintenance tasks.
As we see it, both “story order” and “execution order” are just simple and easy-to-define special cases of what we may call task order: the reading order needed to perform a given task. If the task is syntactic, such as to find all the instances of a certain variable name (what Binkley et al. [4] call a “where’s Waldo” problem), “story order” may be used. If the task is to trace the execution of the code and record the values assigned to a certain variable, the code will be read in “execution order,” including re-reading loop bodies multiple times in the way they would be executed.
But it is questionable whether either “story order” or “execution order” reflects developer behavior when performing other tasks. When we read a story for fun, we most probably indeed read it in “story order,” from beginning to end, passing through all the words on the way. When we read a news story we might do the same, or we might skim some paragraphs that seem less interesting [15]. But we rarely just read a piece of code, let alone a whole module or system. Usually we have some specific goal in mind, like fixing a bug or adding a feature. This requires a different approach: we first need to understand the structure of the code and then use this understanding to find the relevant location to perform the task [53]. Consequently, the reading order need not be directly related to the way the code is written or executed and, in particular, need not be linear.
Trying to characterize “task order” in general may be futile, as different tasks may require quite different reading patterns. However, many tasks probably include the following three components (foreshadowed by other works [46, 53]):
• Orient yourself to get a feel for the structure of the code. This may involve scanning the visible code.
• Search for the specific part of the code you need to work on to complete the task. This may include skipping complete blocks of the code based on using peripheral vision to identify beacons.
• Focus on the relevant code to complete the task. This may combine repeated reading as one learns the details of the code, and then modifying it by performing editing operations, which leads to different reading patterns [46].
What we saw in our experiment is apparently this sort of orient-search-focus task order, for the task of counting letters in a line of text. At a minimum, the participants in the experiment needed to scan the page to identify and read the question, search for and find the target line, and then focus on the target line to count the desired letters. The same conceptualization can be applied when interpreting observed patterns of reading code.
But our experiments also showed that this “task order” may be tainted by personal conditions and attitudes. For example, some people may be confident in their actions, whereas others are unsure about how to approach the task. People may also require constant validation during their work. This may explain the actions of our experimental participants who regressed to the question to re-check the letters, or re-counted the lines leading to the target line.
Moreover, in some cases, the task may be too hard or ill-defined, and participants may be unable to cope. This may lead to disorderly reading with many skips and reversals. Such reading patterns have indeed been observed in the past, where they were called thrashing or fumbling [10, 20, 46].
Due to such differences in approach and attitude, the specific reading patterns employed by different developers may be quite different. However, it may still be fruitful to look for commonalities between these reading patterns. This is because the commonalities can be expected to reflect the core activities that need to be performed to complete the task—namely, those that constitute the actual “task order.”
At the same time, we should stay mindful of the fact that variations are expected to exist. And indeed, one of the repeated results in the code eye tracking literature is the diversity of scan paths of different experimental participants performing the same task (e.g., [13, 20, 53]). These variations are also interesting, as they reflect the differences between participants. Studying the differences can help uncover the effects of knowledge, experience, and attitude. This includes both minor variations, such as random small regressions to ascertain the last word read, and major variations, such as either initially scanning the whole text or else not doing so.
6.4 Navigation and Peripheral Vision
In our experiment, once the question was read, the participants needed to find the relevant line. The results show a difference between counting and using visual cues. It appears that visual cues can be noted using peripheral vision, and there is no need to fixate on each line to check them.
The implication for code is that indentation and color-coded keywords matter not as direct aids for comprehension but as beacons for navigation. Crosby and Stelovsky [13] frown upon the practice of printing keywords in boldface, saying that “keywords are the least observed portions of a program’s text.” But this misses the point that clear visual identification of the keywords allows them to be identified at a glance, and helps developers to easily focus on the code between these keywords. Bauer et al. [2] claim that indentation has no effect on gaze pattern. However, they considered only aggregate metrics such as fixation duration, fixation rate, and saccade amplitude, and did not analyze gaze paths. We need to design experiments that involve navigation, and specifically navigation that can be aided by the code’s block structure and indentation, to really see whether indentation has an effect. Results by Talsma et al. [52] indicate that, at least for beginning students, highlighting the block structure of code helps them focus their attention and leads to more linear reading.
6.5 Additional Uses of the Experiment
Although the experiment was designed to measure the accuracy of identifying the line being read, it turns out that it can actually have wider uses as a component in other studies.
One use is for validation. It is quite common to assume that the output of an eye tracker and the fixation locations computed from it are valid. But such data may suffer from inaccuracies and systematic bias. As a result, counting fixations in a predefined area of interest may include fixations that actually represent looking at something else, and vice versa (called gaze uncertainty by Wang et al. [56]). Embedding a simple experiment like ours as a step in a larger experiment can provide independent testimony that the data is indeed valid (or not).
Another use is for recalibration. Given that our experiment requires the participants to focus on one specific line, performing this task in the context of a larger experiment can be used to check the calibration and measure the bias between the eye tracker output and the actual target. This can be used to compensate for drift during an experiment without having to resort to a disruptive full recalibration, essentially a realization of the “required fixation location” of Hornof and Halverson [18].
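The recalibration idea can be illustrated with a minimal sketch. It is not the procedure of Hornof and Halverson, nor code from our experiment: the function names are hypothetical, only the vertical coordinate is corrected (since the target is a whole line), and the median is our choice of estimator because it is robust to a few fixations that stray to the question or to other lines.

```python
from statistics import median


def estimate_vertical_offset(fixation_ys, target_line_y):
    """Median vertical bias between recorded fixations and the known target line.

    fixation_ys: y coordinates (e.g., pixels) of fixations recorded while the
    participant is assumed to be focusing on the target line.
    """
    if not fixation_ys:
        raise ValueError("need at least one fixation to estimate the offset")
    return median([y - target_line_y for y in fixation_ys])


def correct_fixations(fixation_ys, offset):
    """Subtract the estimated bias from subsequent fixation y coordinates."""
    return [y - offset for y in fixation_ys]


# Made-up data: the target line is at y = 100, and one fixation strays far away.
recorded = [103, 105, 104, 130, 106]
offset = estimate_vertical_offset(recorded, target_line_y=100)
corrected = correct_fixations(recorded, offset)
```

The offset estimated during an embedded focus task could then be applied to the fixations of the tasks that follow it, until the next focus task provides a fresh estimate.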
Finally, it should be noted that the accuracy of line identification can depend not only on the physical attributes of the stimuli (font size and line spacing) but also on the experimental participant. For some participants, certain sizes can be adequate, whereas for others, a larger size may be needed. Thus, focus experiments like ours can also be used as a criterion for excluding certain participants. By requiring the ratio of fixations that are assigned to the target line correctly to be above a certain threshold, the criterion becomes both quantifiable and directly relevant to reducing threats to the validity of the actual study.
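A minimal sketch of such an exclusion criterion follows. It makes two assumptions that are not taken from our analysis: fixations are assigned to the line whose vertical center is nearest, and the threshold of 0.8 is an arbitrary placeholder. All names are hypothetical.

```python
def assign_line(y, line_centers):
    """Index of the line whose vertical center is closest to y (assumed rule)."""
    return min(range(len(line_centers)), key=lambda i: abs(y - line_centers[i]))


def target_hit_ratio(fixation_ys, line_centers, target_index):
    """Fraction of fixations that are assigned to the target line."""
    if not fixation_ys:
        raise ValueError("no fixations recorded for this participant")
    hits = sum(1 for y in fixation_ys
               if assign_line(y, line_centers) == target_index)
    return hits / len(fixation_ys)


def keep_participant(fixation_ys, line_centers, target_index, threshold=0.8):
    """True if the participant meets the (placeholder) accuracy threshold."""
    return target_hit_ratio(fixation_ys, line_centers, target_index) >= threshold


# Made-up data: three lines, the middle one is the target.
centers = [100, 130, 160]
ys = [128, 131, 140, 118, 150]
ratio = target_hit_ratio(ys, centers, target_index=1)
```

In practice, the threshold would have to be chosen per study, for example based on how much misassignment the downstream analysis can tolerate.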