When done right, graphs can be appealing, informative, and of considerable value to an academic article. Unfortunately, researchers generally suck at making good graphs. I surmise that this is because researchers do not completely master their graphing software, and they are either too lazy or too busy to change this state of affairs. Consequently, the graphs that researchers produce are often no more than a distortion of the ideal Platonic graph that the researcher had in mind.
This compendium facilitates the creation of good graphs by presenting a set of concrete examples, ranging from the trivial to the advanced. The graphs can all be reproduced and adjusted by copy-pasting code into the R console.
Almost every example in this compendium is driven by the same philosophy: A good graph is a simple graph, in the Einsteinian sense that a graph should be made as simple as possible, but not simpler.
I will close with a request and a piece of advice. The request: if you create a clean graph in R that you believe is a candidate for inclusion in this compendium, please do not hesitate to contact me at EJ.Wagenmakers@gmail.com. Your contribution will be acknowledged explicitly, alongside the code you provided. The advice: when you create a clean graph in R, put it on Flickr (public license) before you sign away your copyright to a publisher. For an example, see Figure 1 from this paper.
This work has profited greatly from interactions with my colleagues, many of whom have contributed graphs of their own. I am also indebted to Quentin Gronau, who has added new graphs and beautified existing ones.
Producing clean graphs can be a challenging task. First you have to consider the best way to convey the information: a line graph, a histogram, a multi-panel plot; such conceptual dilemmas are not dealt with in this compendium, and instead we refer the reader to the chapters on creating graphs in the excellent book by Briscoe (1996). Second, you have to use computer software to translate the conceptual graph to a publication-ready figure. This is the phase where this compendium may be useful, because it brings together R code for producing a set of clean, publication-ready figures. Hopefully this will make it easy to copy-paste and adjust the code to suit your own needs.
In my experience, many graphs can be dramatically improved by adhering to the following guidelines: (1) invest sufficient time and effort in the process; (2) omit needless graphical elements, that is, make every element count; (3) judge the relative impact of the graphical elements and ensure that they are nicely balanced; (4) use large font sizes for all text; (5) deviate from the R default settings – with a little effort, you can do a lot better.
This compendium does not discuss figure headings. However, I will say that it is clearly desirable to have the main message of a figure be understood without being forced to read the main text. If possible, start your figure heading by stating what the figure is meant to demonstrate (i.e., its interpretation). For example, do not state “Popularity as a function of president height”; instead, state “Taller presidents are more popular”.
Finally, a note on color. Many graphs look better in color, but there are two complications. First, some academic journals do not publish manuscripts in color, at least not without charging a hefty price. Second, many readers and reviewers do not have a color printer. Below, some graphs have color, whereas others only use grey-scales. Of course this is one of the easiest things to adjust.
Based on this compendium, learning to create good graphs in R will be 80% copy-paste and 20% tinkering. Let’s go plot ourselves some graphs!
Whenever a researcher reports a correlation, it is imperative to plot the data. [Anscombe’s quartet](http://en.wikipedia.org/wiki/Anscombe's_quartet) (plotted below) is a famous demonstration of this fact.
This plot shows the relation between the height ratio of US presidents and the percentage of the popular vote. Note the large circles for the data, the thick line for the linear relation, and the large font size for the axis labels. Also, note that the line does not touch the y-axis (a subtlety that requires deviating from the default).
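The ingredients named above can be sketched in a few lines of base R. This is a minimal illustration, not the original code: the data are simulated stand-ins for the presidential data, and the variable names are assumptions. The fitted line is drawn with `lines()` over the observed x-range only, which is one way to keep it from running into the y-axis.

```r
# Simulated stand-in data; NOT the presidential data from the original figure
set.seed(1)
height_ratio <- runif(40, 0.9, 1.2)
vote_share   <- 50 + 30 * (height_ratio - 1.05) + rnorm(40, sd = 4)

fit <- lm(vote_share ~ height_ratio)

# Larger fonts and wider left margin than the R defaults
op <- par(cex.lab = 1.5, cex.axis = 1.3, mar = c(5, 5, 2, 2), las = 1)
plot(height_ratio, vote_share, pch = 21, bg = "grey", cex = 2, bty = "n",
     xlab = "Height ratio", ylab = "Popular vote (%)")

# Draw the fitted line only between the smallest and largest x-value;
# abline(fit) would instead span the whole plot region
x_range <- range(height_ratio)
y_hat   <- predict(fit, newdata = data.frame(height_ratio = x_range))
lines(x_range, y_hat, lwd = 3)
par(op)
```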
Histograms are relatively straightforward to create and to interpret. In fact, some people may even find them boring. Luckily, it is easy to increase the reader’s interest level by adding information to the plot. Below we illustrate various ways by which this may be accomplished.
When in doubt, add tick marks that showcase the individual data points. This is particularly useful when the number of data points is small. The code below is courtesy of Helen Steingroever. Note that the rug tick marks are jittered.
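As a rough sketch of the idea (simulated data, not Helen Steingroever's original code), base R's `rug()` and `jitter()` are enough to add jittered tick marks underneath a histogram:

```r
set.seed(2)
y <- rnorm(30, mean = 100, sd = 15)   # simulated scores, for illustration only

hist(y, main = "", xlab = "Score", col = "grey", border = "white",
     cex.lab = 1.5, cex.axis = 1.3, las = 1)

# jitter() adds a little noise so that tied or nearly identical
# observations do not hide behind a single tick mark
rug(jitter(y), ticksize = 0.05, lwd = 2)
```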
In R, it is easy to include a nonparametric density estimator. This requires setting `freq = F` in the histogram command, so that the bars are drawn on the density scale. Courtesy of Helen Steingroever.
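A minimal sketch with simulated data (not the original code) shows why the density scale matters: with `freq = FALSE` the bar areas sum to one, so the kernel density estimate from `density()` lands on the same scale as the bars. Writing `FALSE` in full is slightly safer than `F`, because `F` is an ordinary variable that can be overwritten.

```r
set.seed(3)
y <- rnorm(200)   # simulated data, for illustration only

# freq = FALSE puts the bars on the density scale (total area = 1)
h <- hist(y, freq = FALSE, main = "", xlab = "Value",
          col = "grey", border = "white", cex.lab = 1.5, las = 1)

# Overlay the nonparametric (kernel) density estimate
lines(density(y), lwd = 3)
```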
This example shows how to display the bar heights, using the function `l_ply`. Courtesy of Helen Steingroever and Quentin Gronau.
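The original relies on `l_ply`; as an alternative sketch that needs no extra package, the vectorised base function `text()` can place each count above its bar. The data below are simulated and the variable names are assumptions.

```r
set.seed(4)
y <- rpois(100, lambda = 4)   # simulated counts, for illustration only

# One bar per integer value; compute the histogram first so that the
# counts and bar midpoints are available for labelling
h <- hist(y, breaks = seq(-0.5, max(y) + 0.5, by = 1), plot = FALSE)

plot(h, main = "", xlab = "Count", col = "grey", border = "white",
     ylim = c(0, max(h$counts) * 1.15), cex.lab = 1.5, las = 1)

# pos = 3 places each label just above its (x, y) coordinate
text(h$mids, h$counts, labels = h$counts, pos = 3, cex = 1.2)
```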
The line plot is one of the most standard plots. Nevertheless, many researchers fail to realize that line plots deserve love and attention too.
This graph plots error bars with a user-defined function. More to the point, the lines are thick, and they do not overlap with the symbols (`type = "c"`). Note that the legend is not needed; the legend text could simply have been positioned near the associated graphical elements.
Similar to the above, this plot shows the distribution of the data with a user-defined boxplot function.
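One way such a plot could be assembled is sketched below. The means and standard errors are made up, and `error_bars()` is a hypothetical helper built on `arrows()`, not the user-defined function from the original code.

```r
# Hypothetical condition means and standard errors, for illustration only
x     <- 1:4
means <- c(520, 480, 455, 440)
se    <- c(15, 12, 14, 10)

# Hypothetical helper: vertical error bars with flat caps
error_bars <- function(x, y, err, ...) {
  arrows(x, y - err, x, y + err, angle = 90, code = 3, length = 0.05, ...)
}

# Set up an empty plot region first, then add the elements one by one
plot(x, means, type = "n", xaxt = "n", bty = "n",
     ylim = range(means - se, means + se),
     xlab = "Block", ylab = "Mean RT (ms)",
     cex.lab = 1.5, cex.axis = 1.3, las = 1)
axis(1, at = x, cex.axis = 1.3)

error_bars(x, means, se, lwd = 2)
lines(x, means, type = "c", lwd = 3)                # line segments with gaps at the points
points(x, means, pch = 21, bg = "white", cex = 2)   # symbols sit in the gaps
```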
By now this plot should look familiar. The distribution of the data is now indicated with a violin plot instead of a box plot. Courtesy of Henrik Singmann, who tweaked the results from the `vioplot` package. Warning: this is a lot of code.
In many psychological experiments, there are two dependent variables for each participant: mean response time (RT) and mean proportion of errors. This plot shows them both – RTs are on the left y-axis, and errors are on the right y-axis.
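A common base-R recipe for a second y-axis is to overlay a second plot with `par(new = TRUE)` and draw its axis on the right with `axis(4)`. The sketch below uses made-up numbers and is not the original code.

```r
# Hypothetical data, for illustration only
cond <- 1:3
rt   <- c(450, 520, 610)       # mean response times (ms)
err  <- c(0.04, 0.07, 0.12)    # mean proportion of errors

# Extra room in the right margin for the second axis label
op <- par(mar = c(5, 5, 2, 5), las = 1, cex.lab = 1.4)

plot(cond, rt, type = "b", lwd = 3, pch = 19, cex = 1.5, bty = "n",
     xaxt = "n", ylim = c(400, 650),
     xlab = "Difficulty", ylab = "Mean RT (ms)")
axis(1, at = cond, labels = c("Easy", "Medium", "Hard"))

# Overlay a second plot on the same region, without axes or labels
par(new = TRUE)
plot(cond, err, type = "b", lwd = 3, lty = 2, pch = 21, bg = "white",
     cex = 1.5, axes = FALSE, xlab = "", ylab = "", ylim = c(0, 0.2))
axis(4)   # right-hand axis for the error proportions
mtext("Proportion of errors", side = 4, line = 3.5, cex = 1.4, las = 0)
par(op)
```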
Like their histogram cousin, bar plots are intrinsically boring.
The title says it all. Note that the error bars are added with the `l_ply` function. Courtesy of Helen Steingroever and Quentin Gronau.
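The original adds the error bars with `l_ply`. As a package-free sketch of the same idea, `barplot()` returns the bar midpoints, which can be passed straight to the vectorised `arrows()` function. The group means and standard errors below are invented.

```r
# Hypothetical group means and standard errors, for illustration only
means <- c(A = 3.2, B = 4.1, C = 2.7)
se    <- c(0.30, 0.25, 0.40)

# barplot() returns the x-coordinates of the bar midpoints
mids <- barplot(means, ylim = c(0, 5), col = "grey", border = NA,
                ylab = "Mean rating", cex.lab = 1.5, cex.names = 1.4, las = 1)

# Vertical error bars with flat caps, one per bar
arrows(mids, means - se, mids, means + se,
       angle = 90, code = 3, length = 0.08, lwd = 2)
```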
Densities are ubiquitous, particularly for those who have a Bayesian inclination. As for the histogram and the bar plot, it is generally a good idea to add more information to the bare-bones plot.
This is a relatively standard plot. Note the thickness of the lines and the font size for the axis labels.
This plot adds a histogram to the density plot, but without needlessly displaying the vertical histogram lines as well. In addition, the code defines the extent to which the lines are transparent, so that both the density and the histogram remain visible, and one does not completely block the other from view.
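Transparency in base R is set through the alpha channel of a colour, for instance with `rgb(..., alpha = )`. The sketch below uses simulated data and approximates the effect by switching off the bar outlines with `border = NA`; the original code may achieve it differently.

```r
set.seed(5)
y <- rnorm(500)   # simulated data, for illustration only

fill <- rgb(0, 0, 1, alpha = 0.3)   # blue at 30% opacity

# border = NA suppresses the bar outlines, including the vertical lines
hist(y, freq = FALSE, breaks = 20, col = fill, border = NA,
     main = "", xlab = "Value", cex.lab = 1.5, las = 1)

# Semi-transparent density line, so neither element fully hides the other
lines(density(y), lwd = 3, col = rgb(0, 0, 0, alpha = 0.7))
```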
This plot adds text to the plot. Although this is generally trivial, this particular example contains a mathematical symbol that is tricky to display properly (unless, of course, you know how it works).
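Mathematical annotation in base R goes through `expression()` and `bquote()` (see `?plotmath`). The symbols below are arbitrary examples and not necessarily the one used in the original plot.

```r
# A standard normal density as a backdrop for the annotations
curve(dnorm(x), from = -4, to = 4, lwd = 3, bty = "n",
      xlab = expression(theta), ylab = "Density",
      cex.lab = 1.5, cex.axis = 1.3, las = 1)

# Fixed mathematical text: sigma squared equals 1
text(-2.7, 0.3, labels = expression(sigma^2 == 1), cex = 1.8)

# bquote() substitutes the current value of a variable via .( )
m   <- 0
lab <- bquote(hat(mu) == .(m))
text(2.7, 0.3, labels = lab, cex = 1.8)
```

Placing the label next to the curve, as here, also avoids the need for a legend.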
This is another example, featuring a nice Greek letter. Seriously, what is important here is that the labels are positioned next to the associated graphical element. This approach is more direct than creating a legend, when the reader has to decode the legend first, keep the symbols in working memory, and then turn attention to the graph itself. Bottom line: only use legends when you have to. Even then, you may find that the legend box almost never fulfills a useful function, and can safely be omitted.
It is cool to be able to highlight specific parts of a density by some color coding scheme. In this example, Ravi Selker shows how that can be done (hint: it’s the `polygon` function).
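The `polygon` trick boils down to tracing the curve over the region of interest and closing the shape along the x-axis. A minimal sketch with a standard normal density (the interval from 1 to 2.5 is an arbitrary choice, and this is not Ravi Selker's code):

```r
x <- seq(-4, 4, length.out = 400)
plot(x, dnorm(x), type = "l", lwd = 3, bty = "n",
     xlab = "x", ylab = "Density", cex.lab = 1.5, cex.axis = 1.3, las = 1)

# Region to highlight (arbitrary bounds)
lo <- 1
hi <- 2.5
xs <- seq(lo, hi, length.out = 100)

# Start on the axis at lo, follow the curve, and return to the axis at hi
polygon(c(lo, xs, hi), c(0, dnorm(xs), 0), col = "grey", border = NA)
lines(x, dnorm(x), lwd = 3)   # redraw the curve on top of the shading

# Annotate the shaded area with its probability mass
area <- pnorm(hi) - pnorm(lo)
text(2.9, 0.15, sprintf("P = %.2f", area), cex = 1.4)
```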
Mijke Rhemtulla also likes to highlight specific parts of a density. This is the first plot in a series, taken from one of Mijke’s stats courses.
Part 2…
Part 3…
Part 4… The take-home message from the last set of plots: use `polygon`, annotate the plot, and use large font sizes and thick graphical elements.
Michael Lee alerted me to a [“stacked densities plot”](http://nxn.se/post/97650612370/high-contrast-stacked-distribution-plots). Quentin Gronau did the work and shows how multiple densities can be displayed at the same time, while still being discriminable. Note the use of the `trans3d` function.
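The role of `trans3d` can be illustrated with a stripped-down sketch, which is an assumption about the general approach rather than Quentin Gronau's code. `persp()` sets up an invisible 3D scene and returns a viewing matrix, and `trans3d()` then projects each density into 2D coordinates that ordinary `polygon()` and `lines()` calls can draw. The three normal densities and the viewing angles are arbitrary.

```r
x   <- seq(-4, 8, length.out = 200)
mus <- c(0, 2, 4)   # means of three hypothetical normal densities

# Invisible flat surface; only the returned 4 x 4 viewing matrix is needed
pmat <- persp(x = range(x), y = c(0.5, 3.5), z = matrix(0, 2, 2),
              zlim = c(0, 0.45), theta = 15, phi = 25, expand = 0.6,
              box = FALSE, border = NA, col = NA)

# Draw from back to front so that nearer densities cover farther ones
for (k in rev(seq_along(mus))) {
  d     <- dnorm(x, mean = mus[k])
  top   <- trans3d(x, rep(k, length(x)), d, pmat)                  # the curve
  floor <- trans3d(x, rep(k, length(x)), rep(0, length(x)), pmat)  # its baseline
  polygon(c(top$x, rev(floor$x)), c(top$y, rev(floor$y)),
          col = "white", border = NA)
  lines(top, lwd = 2)
}
```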
It can be very informative to plot a function. This is relatively straightforward once you stick to the basic principles (thick lines, annotate the plot, large font sizes).
What did we say? Thick lines, annotate the plot, large font sizes!
What’s not to love about time series? In contrast to some of the previous plots, time series are virtually always interesting, almost mesmerizing. The bar plot compares to a time series as, well, a refrigerator compares to Marilyn Monroe. The reason, of course, is that time series are highly informative: they usually contain many observations; moreover, they show how particular variables change over time (it is a time series, after all). Enough of the talking – let’s turn to some examples.
Instead of giving a lecture about diffusion processes, I’ll point out that the lines are transparent. We’ve encountered this before but it was Guy Hawkins who showed me how to do this in R.
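A minimal sketch of transparent lines, using simulated random walks as a stand-in for the diffusion paths (this is not the original code): the alpha argument of `rgb()` makes each line partly see-through, so regions where many paths overlap appear darker.

```r
set.seed(6)
n_steps <- 200
n_paths <- 50

# Each column is one random walk: the cumulative sum of Gaussian steps
steps <- matrix(rnorm(n_steps * n_paths), nrow = n_steps, ncol = n_paths)
paths <- apply(steps, 2, cumsum)

# Black lines at 20% opacity
matplot(paths, type = "l", lty = 1, lwd = 2, bty = "n",
        col = rgb(0, 0, 0, alpha = 0.2),
        xlab = "Time step", ylab = "Accumulated evidence",
        cex.lab = 1.5, cex.axis = 1.3, las = 1)
```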
Helen Steingroever returns to us once again, this time with a choice profile for the Iowa gambling task. The plot conveys a lot of information: for one participant, the plot indicates the sequence of 100 choices among four choice alternatives, and whether or not each choice resulted in a win or a loss.
This plot shows the development of the Bayes factor (y-axis) as the data accumulate (x-axis). This procedure may give frequentists a heart attack but, in Bayes world, that’s just how we roll. What I like about the graph are the annotations on the right side of the plot, and the subtle horizontal lines that indicate Jeffreys’ criteria on the evidence. It took some time to figure out how to display the word “Evidence” in its current direction. To make this plot I “borrowed” code from Ruud Wetzels and Benjamin Scheibehenne.
And again the Bayesians flaunt their disdain for the silliness of sampling plans. The plot below shows the development of the Bayes factor (y-axis) with the number of digits from \(\pi\). As the digits accumulate, so does the evidence in favor of the null hypothesis (yes frequentists, you read that right – evidence in favor of the null hypothesis).
The plot shows the maximum evidence (in red), the actual evidence (for two different priors), and the area in which we can expect the Bayes factor to fall in \(95\%\) of the cases, should the null hypothesis hold. This is dirty frequentist reasoning of course, but the plot does show how it is possible to reject a null hypothesis even when the data provide a lot of support in its favor (i.e., the Jeffreys-Lindley paradox). Courtesy of Quentin Gronau.
To suitably impress the readership, any academic needs to be able to create a multi-panel graph. Below is a set of examples. When creating a multi-panel plot, the main challenge is to select the right number of panels (yes, you can have too many) so that the text and the symbols remain readable.
This is one of my favorite plots, highlighting the difference between discrete probability mass and continuous probability density. Credit goes to Michael Lee for conceptualizing the graph (it is presented in box 3.2 of our book) and to Quentin Gronau for executing it in R. Note the use of ablineclip for lines of distinct length and uniroot for finding the x-value that corresponds to five times the density of another x-value.
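The root-finding step can be sketched as follows. The standard normal density and the reference point `x0 = 2` are assumptions chosen for illustration, and base `segments()` stands in for `ablineclip` so that no extra package is needed.

```r
x0     <- 2                  # reference point (arbitrary choice)
target <- 5 * dnorm(x0)      # five times the density at x0

# Find x in [0, x0] where the density equals the target; the function
# changes sign on this interval, as uniroot() requires
root <- uniroot(function(x) dnorm(x) - target, interval = c(0, x0),
                tol = 1e-10)
x5 <- root$root

curve(dnorm(x), from = -4, to = 4, lwd = 3, bty = "n",
      xlab = "x", ylab = "Density", cex.lab = 1.5, cex.axis = 1.3, las = 1)

# Vertical lines of distinct length, each stopping at the curve
segments(x0, 0, x0, dnorm(x0), lwd = 2, lty = 2)
segments(x5, 0, x5, dnorm(x5), lwd = 2, lty = 2)
```

For the normal density the answer is also available in closed form, `sqrt(x0^2 - 2 * log(5))`, which gives a quick check on the numerical result.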
The only way to understand the title (and the plot) is to visit the Wikipedia entry on [Buffon’s needle](http://en.wikipedia.org/wiki/Buffon's_needle). Anyway, this is another two-panel plot, showing two posterior distributions for estimating \(\pi\) using an experiment that involved tossing a needle (ad nauseam).
Sometimes a graph is worth a thousand words. [Anscombe’s quartet](http://en.wikipedia.org/wiki/Anscombe's_quartet) famously drives home the idea that you should always plot your data. This code is based on the Anscombe plot in R. I personally don’t like `lapply` and similar complications – it may do the trick but when you have to describe the code as “magic” this signals a communication problem. Anyway, the point of the example is graphical display of course. As always, note the thick lines, the large symbols, and the large font size.
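A plain `for` loop over R's built-in `anscombe` data set is one way to draw the four panels without `lapply`. This is a sketch in the spirit of the description above, not the original code.

```r
# Four panels in a 2 x 2 grid, with larger fonts than the defaults
op <- par(mfrow = c(2, 2), mar = c(4.5, 4.5, 1, 1),
          cex.lab = 1.4, cex.axis = 1.2, las = 1)

slopes <- numeric(4)
for (i in 1:4) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]

  fit <- lm(y ~ x)
  slopes[i] <- coef(fit)[["x"]]

  # Same axis limits in every panel, so the panels are directly comparable
  plot(x, y, pch = 21, bg = "grey", cex = 2, bty = "n",
       xlim = c(3, 20), ylim = c(2, 14),
       xlab = paste0("x", i), ylab = paste0("y", i))
  abline(fit, lwd = 3)
}
par(op)
```

The four fitted slopes in `slopes` are nearly identical even though the panels look nothing alike, which is the whole point of the quartet.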
Each panel of this plot shows something very different: histogram, density, point plot, and function. I like the annotations too. With help from Quentin Gronau.
Ravi Selker enters the stage and presents a nine-panel plot of posterior predictives. Note the use of textGrob and arrangeGrob. Nice work.
Several cool plots do not fall neatly into the above categories.
This is a funnel plot, and it is courtesy of Mark Nieuwenstein. The code depends on the `meta` and `metafor` R packages.
Sacha Epskamp uses his `qgraph` package and shows how to display a network with nodes and connections.
Sacha Epskamp shows how to present the many outcomes from a questionnaire in a single graph.
Briscoe, M. H. (1996). Preparing scientific illustrations. Springer.