This worksheet will go through some point pattern analysis of bovine TB data in Cornwall.
The locations are supplied in a text file.
"x" "y" "year" "spoligotype" 178276 32534 1998 12 176862 33521 1997 12 173076 31586 1992 12 172052 32638 1989 12 173320 29896 2002 12 ...
Each row is a notification of Bovine TB, with year and a genetic type marker. Note the coordinates have been anonymised by random jittering. We read the file into a data frame:
btb = read.table("./Data/BTB_spoligotype_data_jittered.txt", head = TRUE) summary(btb)
## x y year spoligotype ## Min. :135537 Min. : 20732 Min. :1989 Min. : 9.0 ## 1st Qu.:173239 1st Qu.: 42034 1st Qu.:1996 1st Qu.: 9.0 ## Median :215143 Median : 69124 Median :1999 Median : 9.0 ## Mean :203500 Mean : 68476 Mean :1998 Mean :11.9 ## 3rd Qu.:225218 3rd Qu.: 87888 3rd Qu.:2000 3rd Qu.:15.0 ## Max. :242442 Max. :116537 Max. :2002 Max. :43.0
The spoligotype is a numeric variable, but really it's a category, so we'll convert it to a factor and check some basics:
btb$type = factor(paste0("Sp", btb$spoligotype)) table(btb$type)
## ## Sp10 Sp11 Sp12 Sp15 Sp17 Sp20 Sp21 Sp43 Sp9 ## 24 11 109 166 7 104 1 3 494
table(btb$year)
## ## 1989 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001 2002 ## 30 27 41 42 29 15 17 61 64 132 130 147 67 117
There are 9 spoligotypes so we create a 9-colour palette. We then use the type variable for the colour - this will get mapped to 0-9 to select the colour.
Since this isn't a spatial data object we plot the x and y coordinates as a scatterplot. Setting the aspect ratio to 1 avoids any distortion due to R stretching the plot to fit a square.
# base graphics require(RColorBrewer) palette(brewer.pal(9, "Spectral")) plot(btb$x, btb$y, col = btb$type, pch = 19, cex = 0.5, asp = 1)
The ggplot2 package looks at plots another way, building them up
as a sum of elements. Here we start by defining the data and where the x, y,
and colour come from. Then we add a geom_point
to show we want
a scatterplot and a coord_equal()
to set the aspect ratio.
# ggplot2 require(ggplot2) ggplot(btb, aes(x = x, y = y, col = type)) + geom_point() + coord_equal()
ggplot2 also lets us do faceting to map each type separately on a common scale:
ggplot(btb, aes(x = x, y = y, col = type)) + geom_point() + facet_wrap(~type) + coord_equal()
By converting the data frame into a SpatialPointsDataFrame we can use the sp graphics functions.
coordinates(btb) = ~x + y proj4string(btb) = CRS("+init=epsg:27700") sp.theme(set = TRUE, regions = list(col = brewer.pal(9, "Set1"))) spplot(btb, "type")
To do any point pattern analysis we need to define the region in which we have observed the points, known as the `window'. We have a file that roughly defines Cornwall:
cornwall = read.table("./Data/btbpoly.txt") plot(cornwall, type = "l")
A quick inspection of the spoligotype plots shows obvious segregation of the different types. There's no need for complicated statistics to show that (a scientist once said that if you needed statistics to analyse your experiment then you just hadn't collected enough data). Instead we will compare the cases before the year 2000 to those since then.
First create a new factor variable in the data frame for the decade:
btb$decade = factor(ifelse(btb$year < 2000, "D19", "D20"))
Now try and create this plot with ggplot2 graphics:
The spatstat package uses its own special objects for storing point-pattern data. These include the points, the window, and any attributes of each point, called `marks'.
We create a ppp
object from the columns of the data frame we just read in. Plotting one
of these shows it with the window.
require(spatstat) btbp = ppp(x = btb$x, y = btb$y, window = owin(poly = cornwall), marks = btb$decade) plot(btbp)
## D19 D20 ## 1 2
To compare the point pattern for each decade we will smooth the point pattern onto a grid using kernel smoothing. First we compute a bandwidth and then the density is estimated using that as a smoothing constant.
We will do this separately for the two decades, and then compute the ratio
of the post-2000 cases to the total number. Spatstat requires the use
of the eval.im
function to compute operations on its output
grids (unlike the raster
package which lets you use the normal
arithmetic operators).
bw19 = bw.diggle(btbp[btbp$marks == "D19", ]) bw19
## sigma ## 1079
i19 = density(btbp[btbp$marks == "D19", ], sigma = bw19) plot(i19, main = "Pre-2000")
bw20 = bw.diggle(btbp[btbp$marks == "D20", ]) bw20
## sigma ## 1290
i20 = density(btbp[btbp$marks == "D20", ], sigma = bw20) plot(i20, main = "Post-2000")
plot(eval.im(i20/(i19 + i20)), main = "Raw risk")
This output looks to be oversmoothed, with some very flat areas of zero and one. The optimum smoothing for a ratio is different to the individual point patterns. Spatstat has a function for this, so we compute a bandwidth for the ratio and plot that. Scale the output to be relative to the mean post-2000 case rate.
rrbw = bw.relrisk(btbp)
rrbw
## sigma ## 8721
rr = relrisk(btbp, sigma = rrbw) baseRisk = sum(btbp$marks == "D20")/length(btbp$marks) plot(eval.im(rr/baseRisk), col = rev(brewer.pal(10, "RdYlBu")), main = "Decade ratio")
This exploratory analysis shows that there seems to be an area in the centre of the county which has experienced increased BTB cases since 2000, and fewer cases in the far west.
The K-function for a point pattern is a function of distance, and is the expected number of points to be found within that distance of an arbitrary point of the pattern.
We can test if our BTB locations are completely random within Cornwall by a simulation test.
The envelope
function computes 99 simulations of points in the window and
compares the data with the range of the K-function from the simulations. This enables us
to test the hypothesis that the points are completely spatially random.
eK = envelope(btbp, Kest)
## Generating 99 simulations of CSR ... ## 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, ## 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, ## 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, ## 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, ## 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, ## 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, ## 91, 92, 93, 94, 95, 96, 97, 98, 99. ## ## Done.
plot(eK, . - pi * r^2 ~ r)
## lty col key label ## obs 1 1 obs K[obs](r) - pi * r^2 ## theo 2 2 theo K[theo](r) - pi * r^2 ## hi 1 8 hi K[hi](r) - pi * r^2 ## lo 1 8 lo K[lo](r) - pi * r^2 ## meaning ## obs observed value of K(r) for data pattern ## theo theoretical value of K(r) for CSR ## hi upper pointwise envelope of K(r) from simulations ## lo lower pointwise envelope of K(r) from simulations
We see the line goes well outside the simulation envelope
The BTB data are clearly more clustered than completely random.
For point patterns of two types, we can compute the cross-K function - the expected number of points of type 1 to be found within a distance of an arbitrary type 2 point, and vice-versa.
These statistics can help tell us if the two sets of points are a randomly labelled division of the union of the points into two sets.
The Kcross function for the two decades is computed and plotted like this. The function has a number of possible edge-correction algorithms and we use the best one here:
kC = Kcross(btbp, correction = "best") plot(kC, . - pi * r^2 ~ r)
## lty col key label ## iso 1 1 iso hat(K[list(D19, D20)]^{ iso})(r) - pi * r^2 ## theo 2 2 theo { K[list(D19, D20)]^{ pois }}(r) - pi * r^2 ## meaning ## iso Ripley isotropic correction estimate of Kcross["D19", "D20"](r) ## theo theoretical Poisson Kcross["D19", "D20"](r)
The result is plotted minus pi-r-squared to linearise it slightly, but it is clearly non-zero. However to test our hypothesis we must compare this Kcross with the result of 99 random labellings of our points into two classes.
The envelope
function does just that, and produces something we can plot to
show the data and the envelopes from the random labelling simulations. But subtracting the average
simulation value we can flatten the plot a bit more to make inspection a bit easier.
e = envelope(btbp, Kcross, i = "D20", j = "D19", simulate = expression(rlabel(btbp)), verbose = FALSE) plot(e, . - mmean ~ r)
## lty col key ## obs 1 #D53E4F obs ## mmean 2 #F46D43 mmean ## hi 1 #BEBEBE hi ## lo 1 #BEBEBE lo ## label ## obs K[list(D20, D19)][obs](r) - bar(K[list(D20, D19)])(r) ## mmean bar(K[list(D20, D19)])(r) - bar(K[list(D20, D19)])(r) ## hi K[list(D20, D19)][hi](r) - bar(K[list(D20, D19)])(r) ## lo K[list(D20, D19)][lo](r) - bar(K[list(D20, D19)])(r) ## meaning ## obs observed value of Kcross["D20", "D19"](r) for data pattern ## mmean sample mean of Kcross["D20", "D19"](r) from simulations ## hi upper pointwise envelope of Kcross["D20", "D19"](r) from simulations ## lo lower pointwise envelope of Kcross["D20", "D19"](r) from simulations
The red line for the data stays well within the simulation envelope
We cannot reject the hypothesis that the pre-2000 and post-2000 cases are not a random labelling of the case locations. It is possible then that the variation shown in the kernel smoothing plots is not statistically significant.
The GBIF database holds thousands of global species records. The dismo
package
has a function for downloading species records based on names, areas etc.
A set of badger records for Cornwall was downloaded and converted to a point shapefile.
Read the badger data in using readOGR
and transform to OS Grid.
Then create a spatstat ppp
object with Cornwall as a window.
require(rgdal) badgers = readOGR("./Data/", "badgers")
## OGR data source with driver: ESRI Shapefile ## Source: "./Data/", layer: "badgers" ## with 160 features and 3 fields ## Feature type: wkbPoint with 2 dimensions
badgers = spTransform(badgers, CRS("+init=epsg:27700")) badgerp = as.ppp(as.ppp(X = coordinates(badgers), W = owin(poly = cornwall)))
## Warning: 45 points were rejected as lying outside the specified window
## Warning: data contain duplicated points
plot(badgerp)
Some points are outside the region and are removed when doing as.ppp
twice.
Now we want to see if the badgers and the BTB cases are a random labelling or not. First
we create a new ppp
object that is the superimposition of the badgers and the BTB cases,
and then compute the cross-K function
bbt = superimpose(TB = unmark(btbp), BADGER = badgerp)
## Warning: data contain duplicated points
e = envelope(bbt, Kcross, i = "TB", j = "BADGER", simulate = expression(rlabel(bbt)), verbose = FALSE) plot(e, . - mmean ~ r)
## lty col key ## obs 1 #D53E4F obs ## mmean 2 #F46D43 mmean ## hi 1 #BEBEBE hi ## lo 1 #BEBEBE lo ## label ## obs K[list(TB, BADGER)][obs](r) - bar(K[list(TB, BADGER)])(r) ## mmean bar(K[list(TB, BADGER)])(r) - bar(K[list(TB, BADGER)])(r) ## hi K[list(TB, BADGER)][hi](r) - bar(K[list(TB, BADGER)])(r) ## lo K[list(TB, BADGER)][lo](r) - bar(K[list(TB, BADGER)])(r) ## meaning ## obs observed value of Kcross["TB", "BADGER"](r) for data pattern ## mmean sample mean of Kcross["TB", "BADGER"](r) from simulations ## hi upper pointwise envelope of Kcross["TB", "BADGER"](r) from simulations ## lo lower pointwise envelope of Kcross["TB", "BADGER"](r) from simulations
Here we see the red line goes way outside the simulation envelope
This is actually saying that there are fewer BTB cases than expected in the vicinity of a badger report. BUT since the badger data is very rough and not really a great survey of badger locations I wouldn't draw any scientific conclusions from this!
Points are often clustered in space because they occur where people live, or where the habitat is amenable. Points often cluster in time because for a period the conditions are favourable for the events to occur. Space-time clustering is when points cluster simultaneously in space and time - in otherwords there is an anomalous rate at some location at some period compared to the average spatial rate and temporal rate
We can test the null hypothesis of no spatial clustering with a function from the splancs
package.
Since this was written a long time ago it requires fairly primitive arguments - all data needs to
be converted into matrices - but essentially it just takes location and time of events, plus a spatial
and temporal boundary. A range of distances and times for computing the clustering statistic is also given.
To test the hypothesis, we run a Monte-Carlo test. By permuting the locations and times of the events we can break any space-time clustering without affecting the spatial or temporal clustering. Then we compute a clustering statistic based on K-functions, and compare the data with the simulations by histogram.
require(splancs) cornwallp = as.matrix(rbind(cornwall, cornwall[1, ])) stm = stmctest(as.matrix(coords(btbp)), btb$year, cornwallp, range(btb$year), seq(1000, 31000, len = 30), 1:4, nsim = 99, quiet = TRUE)
hist(stm$t, xlab = "Statistic", main = "Space-time cluster test") abline(v = stm$t0) y.8 <- par()$usr[4] * 0.8 text(stm$t0, y.8, "Data Statistic")
The data statistic ranks 95
amongst the simulations.
There is strong evidence for space-time clustering in the BTB data