Point Pattern Analysis

This worksheet will go through some point pattern analysis of bovine TB data in Cornwall.

Read the data

The locations are supplied in a text file.

"x" "y" "year" "spoligotype"
 178276 32534 1998 12
 176862 33521 1997 12
 173076 31586 1992 12
 172052 32638 1989 12
 173320 29896 2002 12

...

Each row is a notification of Bovine TB, with year and a genetic type marker. Note the coordinates have been anonymised by random jittering. We read the file into a data frame:

btb = read.table("./Data/BTB_spoligotype_data_jittered.txt", head = TRUE)
summary(btb)

##        x                y               year       spoligotype  
##  Min.   :135537   Min.   : 20732   Min.   :1989   Min.   : 9.0  
##  1st Qu.:173239   1st Qu.: 42034   1st Qu.:1996   1st Qu.: 9.0  
##  Median :215143   Median : 69124   Median :1999   Median : 9.0  
##  Mean   :203500   Mean   : 68476   Mean   :1998   Mean   :11.9  
##  3rd Qu.:225218   3rd Qu.: 87888   3rd Qu.:2000   3rd Qu.:15.0  
##  Max.   :242442   Max.   :116537   Max.   :2002   Max.   :43.0

The spoligotype is a numeric variable, but really it's a category, so we'll convert it to a factor and check some basics:

btb$type = factor(paste0("Sp", btb$spoligotype))
table(btb$type)

## 
## Sp10 Sp11 Sp12 Sp15 Sp17 Sp20 Sp21 Sp43  Sp9 
##   24   11  109  166    7  104    1    3  494

table(btb$year)

## 
## 1989 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001 2002 
##   30   27   41   42   29   15   17   61   64  132  130  147   67  117

Making maps

base graphics

There are 9 spoligotypes so we create a 9-colour palette. We then use the type variable for the colour - this will get mapped to 0-9 to select the colour.

Since this isn't a spatial data object we plot the x and y coordinates as a scatterplot. Setting the aspect ratio to 1 avoids any distortion due to R stretching the plot to fit a square.

# base graphics
require(RColorBrewer)
palette(brewer.pal(9, "Spectral"))
plot(btb$x, btb$y, col = btb$type, pch = 19, cex = 0.5, asp = 1)

ggplot2

The ggplot2 package looks at plots another way, building them up as a sum of elements. Here we start by defining the data and where the x, y, and colour come from. Then we add a geom_point to show we want a scatterplot and a coord_equal() to set the aspect ratio.

# ggplot2
require(ggplot2)
ggplot(btb, aes(x = x, y = y, col = type)) + geom_point() + coord_equal()

ggplot2 also lets us do faceting to map each type separately on a common scale:

ggplot(btb, aes(x = x, y = y, col = type)) + geom_point() + facet_wrap(~type) + 
    coord_equal()

sp graphics

By converting the data frame into a SpatialPointsDataFrame we can use the sp graphics functions.

coordinates(btb) = ~x + y
proj4string(btb) = CRS("+init=epsg:27700")
sp.theme(set = TRUE, regions = list(col = brewer.pal(9, "Set1")))
spplot(btb, "type")

Point pattern analysis

To do any point pattern analysis we need to define the region in which we have observed the points, known as the `window'. We have a file that roughly defines Cornwall:

cornwall = read.table("./Data/btbpoly.txt")
plot(cornwall, type = "l")

A quick inspection of the spoligotype plots shows obvious segregation of the different types. There's no need for complicated statistics to show that (a scientist once said that if you needed statistics to analyse your experiment then you just hadn't collected enough data). Instead we will compare the cases before the year 2000 to those since then.

First create a new factor variable in the data frame for the decade:

btb$decade = factor(ifelse(btb$year < 2000, "D19", "D20"))

Now try and create this plot with ggplot2 graphics:

The spatstat package uses its own special objects for storing point-pattern data. These include the points, the window, and any attributes of each point, called `marks'.

We create a ppp object from the columns of the data frame we just read in. Plotting one of these shows it with the window.

require(spatstat)
btbp = ppp(x = btb$x, y = btb$y, window = owin(poly = cornwall), marks = btb$decade)
plot(btbp)

## D19 D20 
##   1   2

To compare the point pattern for each decade we will smooth the point pattern onto a grid using kernel smoothing. First we compute a bandwidth and then the density is estimated using that as a smoothing constant.

We will do this separately for the two decades, and then compute the ratio of the post-2000 cases to the total number. Spatstat requires the use of the eval.im function to compute operations on its output grids (unlike the raster package which lets you use the normal arithmetic operators).

bw19 = bw.diggle(btbp[btbp$marks == "D19", ])
bw19

## sigma 
##  1079

i19 = density(btbp[btbp$marks == "D19", ], sigma = bw19)
plot(i19, main = "Pre-2000")

bw20 = bw.diggle(btbp[btbp$marks == "D20", ])
bw20

## sigma 
##  1290

i20 = density(btbp[btbp$marks == "D20", ], sigma = bw20)
plot(i20, main = "Post-2000")

plot(eval.im(i20/(i19 + i20)), main = "Raw risk")

This output looks to be oversmoothed, with some very flat areas of zero and one. The optimum smoothing for a ratio is different to the individual point patterns. Spatstat has a function for this, so we compute a bandwidth for the ratio and plot that. Scale the output to be relative to the mean post-2000 case rate.

rrbw = bw.relrisk(btbp)
rrbw

## sigma 
##  8721

rr = relrisk(btbp, sigma = rrbw)
baseRisk = sum(btbp$marks == "D20")/length(btbp$marks)
plot(eval.im(rr/baseRisk), col = rev(brewer.pal(10, "RdYlBu")), main = "Decade ratio")

Conclusion

This exploratory analysis shows that there seems to be an area in the centre of the county which has experienced increased BTB cases since 2000, and fewer cases in the far west.

K-Function pattern analysis

The K-function for a point pattern is a function of distance, and is the expected number of points to be found within that distance of an arbitrary point of the pattern.

We can test if our BTB locations are completely random within Cornwall by a simulation test. The envelope function computes 99 simulations of points in the window and compares the data with the range of the K-function from the simulations. This enables us to test the hypothesis that the points are completely spatially random.

eK = envelope(btbp, Kest)

## Generating 99 simulations of CSR  ...
## 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,
## 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30,
## 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45,
## 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60,
## 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75,
## 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90,
## 91, 92, 93, 94, 95, 96, 97, 98, 99.
## 
## Done.

plot(eK, . - pi * r^2 ~ r)

##      lty col  key                 label
## obs    1   1  obs  K[obs](r) - pi * r^2
## theo   2   2 theo K[theo](r) - pi * r^2
## hi     1   8   hi   K[hi](r) - pi * r^2
## lo     1   8   lo   K[lo](r) - pi * r^2
##                                                meaning
## obs            observed value of K(r) for data pattern
## theo                 theoretical value of K(r) for CSR
## hi   upper pointwise envelope of K(r) from simulations
## lo   lower pointwise envelope of K(r) from simulations

We see the line goes well outside the simulation envelope

Conclusion

The BTB data are clearly more clustered than completely random.

Bivariate Analysis

For point patterns of two types, we can compute the cross-K function - the expected number of points of type 1 to be found within a distance of an arbitrary type 2 point, and vice-versa.

These statistics can help tell us if the two sets of points are a randomly labelled division of the union of the points into two sets.

The Kcross function for the two decades is computed and plotted like this. The function has a number of possible edge-correction algorithms and we use the best one here:

kC = Kcross(btbp, correction = "best")
plot(kC, . - pi * r^2 ~ r)

##      lty col  key                                                    label
## iso    1   1  iso           hat(K[list(D19, D20)]^{    iso})(r) - pi * r^2
## theo   2   2 theo {    K[list(D19, D20)]^{        pois    }}(r) - pi * r^2
##                                                              meaning
## iso  Ripley isotropic correction estimate of Kcross["D19", "D20"](r)
## theo                     theoretical Poisson Kcross["D19", "D20"](r)

The result is plotted minus pi-r-squared to linearise it slightly, but it is clearly non-zero. However to test our hypothesis we must compare this Kcross with the result of 99 random labellings of our points into two classes.

The envelope function does just that, and produces something we can plot to show the data and the envelopes from the random labelling simulations. But subtracting the average simulation value we can flatten the plot a bit more to make inspection a bit easier.

e = envelope(btbp, Kcross, i = "D20", j = "D19", simulate = expression(rlabel(btbp)), 
    verbose = FALSE)
plot(e, . - mmean ~ r)

##       lty     col   key
## obs     1 #D53E4F   obs
## mmean   2 #F46D43 mmean
## hi      1 #BEBEBE    hi
## lo      1 #BEBEBE    lo
##                                                       label
## obs   K[list(D20, D19)][obs](r) - bar(K[list(D20, D19)])(r)
## mmean bar(K[list(D20, D19)])(r) - bar(K[list(D20, D19)])(r)
## hi     K[list(D20, D19)][hi](r) - bar(K[list(D20, D19)])(r)
## lo     K[list(D20, D19)][lo](r) - bar(K[list(D20, D19)])(r)
##                                                                    meaning
## obs             observed value of Kcross["D20", "D19"](r) for data pattern
## mmean              sample mean of Kcross["D20", "D19"](r) from simulations
## hi    upper pointwise envelope of Kcross["D20", "D19"](r) from simulations
## lo    lower pointwise envelope of Kcross["D20", "D19"](r) from simulations

The red line for the data stays well within the simulation envelope

Conclusion

We cannot reject the hypothesis that the pre-2000 and post-2000 cases are not a random labelling of the case locations. It is possible then that the variation shown in the kernel smoothing plots is not statistically significant.

Badgers

Badger Data

The GBIF database holds thousands of global species records. The dismo package has a function for downloading species records based on names, areas etc.

A set of badger records for Cornwall was downloaded and converted to a point shapefile.

Read the badger data in using readOGR and transform to OS Grid. Then create a spatstat ppp object with Cornwall as a window.

require(rgdal)
badgers = readOGR("./Data/", "badgers")

## OGR data source with driver: ESRI Shapefile 
## Source: "./Data/", layer: "badgers"
## with 160 features and 3 fields
## Feature type: wkbPoint with 2 dimensions

badgers = spTransform(badgers, CRS("+init=epsg:27700"))
badgerp = as.ppp(as.ppp(X = coordinates(badgers), W = owin(poly = cornwall)))

## Warning: 45 points were rejected as lying outside the specified window

## Warning: data contain duplicated points

plot(badgerp)

Some points are outside the region and are removed when doing as.ppp twice.

Now we want to see if the badgers and the BTB cases are a random labelling or not. First we create a new ppp object that is the superimposition of the badgers and the BTB cases, and then compute the cross-K function

bbt = superimpose(TB = unmark(btbp), BADGER = badgerp)

## Warning: data contain duplicated points

e = envelope(bbt, Kcross, i = "TB", j = "BADGER", simulate = expression(rlabel(bbt)), 
    verbose = FALSE)
plot(e, . - mmean ~ r)

##       lty     col   key
## obs     1 #D53E4F   obs
## mmean   2 #F46D43 mmean
## hi      1 #BEBEBE    hi
## lo      1 #BEBEBE    lo
##                                                           label
## obs   K[list(TB, BADGER)][obs](r) - bar(K[list(TB, BADGER)])(r)
## mmean bar(K[list(TB, BADGER)])(r) - bar(K[list(TB, BADGER)])(r)
## hi     K[list(TB, BADGER)][hi](r) - bar(K[list(TB, BADGER)])(r)
## lo     K[list(TB, BADGER)][lo](r) - bar(K[list(TB, BADGER)])(r)
##                                                                      meaning
## obs             observed value of Kcross["TB", "BADGER"](r) for data pattern
## mmean              sample mean of Kcross["TB", "BADGER"](r) from simulations
## hi    upper pointwise envelope of Kcross["TB", "BADGER"](r) from simulations
## lo    lower pointwise envelope of Kcross["TB", "BADGER"](r) from simulations

Here we see the red line goes way outside the simulation envelope

Conclusion

This is actually saying that there are fewer BTB cases than expected in the vicinity of a badger report. BUT since the badger data is very rough and not really a great survey of badger locations I wouldn't draw any scientific conclusions from this!

Space-Time Clustering Test

Points are often clustered in space because they occur where people live, or where the habitat is amenable. Points often cluster in time because for a period the conditions are favourable for the events to occur. Space-time clustering is when points cluster simultaneously in space and time - in otherwords there is an anomalous rate at some location at some period compared to the average spatial rate and temporal rate

We can test the null hypothesis of no spatial clustering with a function from the splancs package. Since this was written a long time ago it requires fairly primitive arguments - all data needs to be converted into matrices - but essentially it just takes location and time of events, plus a spatial and temporal boundary. A range of distances and times for computing the clustering statistic is also given.

To test the hypothesis, we run a Monte-Carlo test. By permuting the locations and times of the events we can break any space-time clustering without affecting the spatial or temporal clustering. Then we compute a clustering statistic based on K-functions, and compare the data with the simulations by histogram.

require(splancs)
cornwallp = as.matrix(rbind(cornwall, cornwall[1, ]))
stm = stmctest(as.matrix(coords(btbp)), btb$year, cornwallp, range(btb$year), 
    seq(1000, 31000, len = 30), 1:4, nsim = 99, quiet = TRUE)

hist(stm$t, xlab = "Statistic", main = "Space-time cluster test")
abline(v = stm$t0)
y.8 <- par()$usr[4] * 0.8
text(stm$t0, y.8, "Data Statistic")

The data statistic ranks 95 amongst the simulations.

Conclusion

There is strong evidence for space-time clustering in the BTB data