Analysing spatial data using R

Brunei R User Group Meetup 🇧🇳

Author
Affiliation

Universiti Brunei Darussalam

Published

March 9, 2024

https://bruneir.github.io/brm-spatial

Preliminaries

Welcome to the first Brunei R User Group meetup!


The RUGS mission is to facilitate the person-to-person exchange of knowledge in small group settings on a global scale. —R Consortium

"R" |> 
  rug("b", _, "unei")

About us

  • A group of UBD-ians and R enthusiasts
  • We want to create a community of R users in Brunei
  • Champion the Open Source cause

More events to come this year. Stay tuned!

Expectations

Outcomes

This is a hands-on, live-coding, lecture-style “workshop”. Expect to learn (or at the very least, see me do!)…

  1. What spatial data is and why it’s important.
  2. What statistical analysis can be done with spatial data.
  3. How to perform spatial analysis using R.

A basic understanding of R is assumed.

For some, maybe it will be a bit fast-paced (sorry in advance!). All the materials will be available online (see link on the right). Truthfully, I am not expecting you to walk away as an expert in spatial analysis, given the length of this talk. At the very least, however, I hope you become aware of the various spatial techniques and how they can be achieved in R.

Also… ChatGPT is a great way to accelerate your learning!

I’m very happy to answer questions afterwards! 😀

Getting started with R

I’ll just talk about these points briefly:

  • What’s the difference between R and RStudio?
  • Quick run through RStudio’s features
  • Set up a project
  • R Scripts vs Notebooks (.qmd or .Rmd)
  • Executing commands in R
# Try this out for yourself!

N <- 100
x <- runif(n = N, min = 0, max = 1)
head(x)  # Show the first 6 elements
[1] 0.5817537 0.2280617 0.1916819 0.4470871 0.8420693 0.1764505
mean(x)  # Calculate the mean of this vector
[1] 0.5381948
run_x <- rep(NA, length(x))
for (i in seq_along(x)) {
  run_x[i] <- sum(x[1:i])
}
head(run_x)
[1] 0.5817537 0.8098154 1.0014973 1.4485844 2.2906537 2.4671042
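
As an aside, R already has a vectorised function for running sums, so the loop above can be written in a single line (the loop is fine too, this is just an alternative):

head(cumsum(x))  # cumulative (running) sum, equivalent to run_x above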

Pipelines

Suppose you have a sequence of functions that you want to apply onto an object. For instance, doing the following in R:

Sample 100 observations from the N(0,1), then apply the exponential function, then apply the sine function, then take the mean

Code
mean(sin(exp(rnorm(100))))
[1] 0.6298813

This is a bit difficult to read! In programming, pipelines allow us to pass the output of one function directly into the first argument of the next function. The pipe operator is ‘|’ followed by a ‘>’. My font has ligatures, which is why it shows up as those triangles!

Code
rnorm(100) |>  # Random samples
  exp() |>     # exponential function
  sin() |>     # sine function
  mean()       # mean function
[1] 0.540876

For us humans, it’s much more readable, so it is preferred when creating data science notebooks.

List of packages

The power of R comes from its diverse range of user-contributed packages. To install a package in R, we type install.packages("package_name"). As an example, try installing the tidyverse package.

install.packages("tidyverse")

A bunch of things will happen on your screen that makes you look like a legit hacker. (It’s normal! Unless… there are some errors in the installation process 😅) Once that’s done, you will want to load the package using library() to start using it. Here’s a list of packages we will be using today. You’ll need to install all of them before we begin. In RStudio, there will be a prompt (yellow line at the top of the source pane) for you to install all these packages with a single click.
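
If you are not using RStudio (or simply prefer the console), you can install everything in one go; something like this should do it:

install.packages(c(
  "kernlab", "ggrepel", "osrm", "spdep",
  "CARBayes", "mapview", "scales", "tidyverse", "remotes"
))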

Code
library(kernlab)       # for GPR
library(ggrepel)       # nice labels in plots
library(osrm)          # OSM route calculator
library(spdep)         # Moran's test and more
library(CARBayes)      # spatial CAR models
library(mapview)       # interactive plots
library(scales)        # pretty formats
library(tidyverse)
theme_set(theme_bw())  # set the theme

Furthermore, there are packages that are not yet on CRAN, but are available on GitHub. Please install them using the remotes package.

remotes::install_github("propertypricebn/bruneimap")

Of course, don’t forget to load it. For more information, please check out the package’s GitHub page.
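
That is, once installed, load it like any other package:

library(bruneimap)  # provides the brn_sf, dis_sf, mkm_sf and kpg_sf map data used below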

Introduction

What we’ll learn
  • Importance of accounting for spatial effects in data.
  • Using the Moran’s I test to detect spatial autocorrelations.

Consider the distribution of house prices in Brunei1. I think it is well accepted that the prices of houses in Brunei are not uniformly distributed across the country, but are determined by location. Figure 1 (a) shows a faithful representation of house prices in Brunei, and we can clearly see a clustering effect. Many economists, realtors, and urban planners will undoubtedly tell you that there is some kind of “neighbouring effect” at play here. Houses closer to each other tend to exhibit similar properties.

Code
load("data/artificial_hsp.RData")

# forest_col <- adjustcolor("forestgreen", alpha = 0.9)
forest_col <- "grey"

plot_hsp <- function(x) {
  left_join(kpg_sf, x, by = join_by(id, kampong, mukim)) |>
    ggplot() +
    geom_sf(aes(fill = price, col = "")) +
    scale_fill_viridis_c(
      labels = scales::dollar, 
      name = "Predicted price", 
      alpha = 0.9, 
      na.value = forest_col
    ) +
    scale_colour_manual(values = "grey30") +              
    guides(colour = guide_legend(
      "No property\nsales", 
      override.aes = list(fill = forest_col, colour = forest_col)
    ))
}

p1 <- plot_hsp(hsp_bn)
p2 <- 
  hsp_bn |> 
  mutate(price = sample(price)) |> 
  plot_hsp()

p1
p2
(a) Original distribution
(b) Random shuffle
Figure 1: Spatial distribution of house prices in Brunei

When we perform statistical modelling and ignore the spatial component, effectively what we are assuming is that the prices of houses are independent of their location. Under this assumption, Figure 1 (b) is a valid representation of house prices in Brunei (uniformity of prices). This is a very dangerous thing to do, especially when the spatial effects are non-ignorable. We may get nonsensical results and inferences out of the resulting model.

Of course, the degree to which this assumption is violated depends on the context and the problem at hand, as well as the scale of the data. For the above-type problem where we have area-level data, one can use Moran’s I test of global spatial autocorrelation. It tests

\(H_0\): No spatial autocorrelation
\(H_1\): (Positive) Spatial autocorrelation present
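
For reference, the Moran’s I statistic for observations \(y_1, \dots, y_n\) with spatial weights \(w_{ij}\) is \[ I = \frac{n}{\sum_{i}\sum_{j} w_{ij}} \cdot \frac{\sum_{i}\sum_{j} w_{ij} (y_i - \bar y)(y_j - \bar y)}{\sum_{i} (y_i - \bar y)^2}, \] which has expectation \(-1/(n-1)\) under \(H_0\); values substantially larger than this indicate positive spatial autocorrelation (similar values cluster together).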

To test this in R, we can use the spdep::moran.test() function from the spdep package. There are two key ingredients here, the actual data itself (house prices), and also the neighbourhood relationship between all the areas (spatial structure). The code is as follows.

Code
# Prepare neighbourhood structure
hsp_sf <- left_join(kpg_sf, hsp_bn, by = join_by(id, kampong, mukim))
nb <- poly2nb(hsp_sf, queen = FALSE, snap = 1)
Wlist <- nb2listw(nb, zero.policy = TRUE, style = "B")

# Moran's test
moran.test(hsp_sf$price, listw = Wlist, zero.policy = TRUE, na.action = na.omit)

    Moran I test under randomisation

data:  hsp_sf$price  
weights: Wlist 
omitted: 7, 11, 13, 20, 57, 70, 77, 107, 146, 147, 167, 169, 171, 183, 185, 187, 198, 209, 218, 229, 231, 234, 249, 250, 253, 260, 262, 263, 264, 270, 271, 280, 281, 282, 283, 284, 285, 286, 287, 288, 290, 291, 292, 299, 305, 311, 312, 329, 353, 357, 358, 362, 378, 379, 380, 384, 388, 402, 404, 405, 407, 410, 411, 412, 413, 414, 415, 416, 417, 418, 419, 420, 421, 422, 423, 424, 425, 426, 427, 429, 431, 433, 438   

Moran I statistic standard deviate = 23.226, p-value < 2.2e-16
alternative hypothesis: greater
sample estimates:
Moran I statistic       Expectation          Variance 
      0.782368767      -0.002824859       0.001142894 

Types of GIS data

What we’ll learn
  • Types of GIS data and how these are handled in R.
  • Difference between spatial and non-spatial data analysis.
  • Importance of geocoding your data for spatial analysis.

Roughly speaking, there are 4 types of GIS data.

  1. Points
    • Have \((X, Y)\) coordinates (latitude and longitude, or projected coordinates) and are “zero-dimensional”.
    • E.g. shopping malls, hospitals, outbreaks, etc.
  2. Lines
    • A collection of points that form a path or a boundary. Has length.
    • E.g. roads, rivers, pipelines, etc.
  3. Polygons
    • A closed area made up of line segments or curves.
    • E.g. countries, districts, buildings, etc.
  4. Raster
    • Pixelated (or gridded) data where each pixel is associated with a geographical area and some measurement.
    • E.g. satellite images, elevation data, etc.

The first three are usually referred to as vector data. GIS data can be stored in various formats such as .shp or .geojson. The handling of GIS data (at least vector type data) is facilitated by the sf package (Pebesma and Bivand 2023) which uses the simple features standard.

Note

Simple features refers to a formal standard (ISO 19125-1:2004) that describes how objects in the real world can be represented in computers, with emphasis on the spatial geometry of these objects.
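
As a tiny illustration (a sketch with made-up coordinates, not part of the workshop data), a plain data frame of points can be converted into a simple features object using sf::st_as_sf():

Code
library(sf)
df <- data.frame(name = c("A", "B"), lon = c(114.9, 114.5), lat = c(4.9, 4.6))
pts_sf <- st_as_sf(df, coords = c("lon", "lat"), crs = 4326)  # CRS 4326 = WGS 84
pts_sf  # a POINT geometry column is now attached to the data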

It’s helpful to think about the shape of this spatial data set. As an example, here’s a random slice of 10 kampongs from the kampong-level population data for Brunei:

Code
left_join(
  kpg_sf, 
  bn_census2021, 
  by = join_by(id, kampong, mukim, district)
) |>
  select(
    kampong, population, geometry
  ) |>
  slice_sample(n = 10)
Simple feature collection with 10 features and 2 fields
Geometry type: POLYGON
Dimension:     XY
Bounding box:  xmin: 114.4212 ymin: 4.302264 xmax: 115.1735 ymax: 4.998173
Geodetic CRS:  WGS 84
# A tibble: 10 × 3
   kampong                   population                                 geometry
   <chr>                          <dbl>                            <POLYGON [°]>
 1 Pulau ...                         NA ((115.0435 4.997858, 115.0436 4.997846,…
 2 Kg. Baong                        351 ((114.9697 4.840602, 114.9698 4.840598,…
 3 Kg. Perdayan                      NA ((115.0759 4.779878, 115.0767 4.779845,…
 4 Kg. Seri Tanjong Belayang        113 ((115.0523 4.721034, 115.059 4.718149, …
 5 Kg. Kandang                      300 ((114.649 4.785566, 114.6497 4.784629, …
 6 Kg. Bang Bulan                    20 ((115.063 4.684746, 115.0633 4.684573, …
 7 Kg. Sukang                        57 ((114.6014 4.375045, 114.6015 4.374833,…
 8 Kg. Wasan                        441 ((114.8154 4.798696, 114.8157 4.798617,…
 9 Kg. Pandai Besi A                 87 ((114.9387 4.881099, 114.9383 4.880876,…
10 Kg. Terunan                       19 ((114.4643 4.423127, 114.4643 4.42307, …

Spatial data analysis must have these two components:

  1. The study variables (in the above example, this is population data).
  2. GIS data regarding that study variable.

If we only have 1 without 2, then it really is just a regular data analysis (stating the obvious). Adding the GIS data is a process called “geocoding” the data points.

Note

In R, geocoding using tidyverse can be achieved using the dplyr::left_join() or similar xxx_join() family of functions.

(MULTI)POINT data

What we’ll learn
  • Loading data sets in R using readr::read_csv().
  • Identifying data types and their implications.

Use the data from Jaafar and Sukri (2023) on the physicochemical characteristics and texture classification of soil in Bornean tropical heath forests affected by exotic Acacia mangium. There are three datasets provided.

  1. GIS data (WGS84 coordinates) of all study plots.
  2. Soil physicochemical property data. This contains details of the soil’s physical, chemical, and nutrient concentration properties for the three habitats studied.
  3. Soil texture classification. Provides details on the classification of the soil texture in the habitats studied.

We will first load the data sets in R.

Code
## Load the data sets
soil_gps <- read_csv(
  "data/8389823/GPS - Revised.csv", 
  # IMPORTANT!!! The csv file has latin1 encoding as opposed to UTF-8
  locale = readr::locale(encoding = "latin1")
)
  
soil_physico <- read_csv("data/8389823/Soil physicochemical properties.csv")
soil_texture <- read_csv("data/8389823/Soil texture classification.csv")

Clean up the point data

What we’ll learn
  • Highlighting the need for cleaning and preprocessing data.
  • Using glimpse() to peek at the data.
  • Using mutate() to change stuff in the data set.
  • Using str() to look at the structure of an R object.

Let’s take a look at the point data set.

glimpse(soil_gps)
Rows: 18
Columns: 5
$ Forest_type  <chr> "Kerangas", "Kerangas", "Kerangas", "Kerangas", "Kerangas…
$ Habitat_type <chr> "Intact", "Intact", "Intact", "Intact", "Intact", "Intact…
$ Plot_name    <chr> "KU1", "KU2", "KU3", "KU4", "KU5", "KU6", "KI1", "KI2", "…
$ Latitude     <chr> "4° 35' 53.40\"N", "4° 35' 38.37\"N", "4° 35' 53.89\"N", …
$ Longitude    <chr> "114° 30' 39.09\"E", "114° 31' 05.89\"E", "114° 30' 38.90…

The first three columns are essentially the identifiers of the plots (forest type, habitat type, and the unique identification code for the study plot). However, the latitude and longitude need a bit of cleaning up, because they are currently in character format. They need to be converted to a formal Degree Minute Second (DMS) class that R can understand. For this we will use the sp::char2dms() function.

As an example let’s take a look at the first latitude.

x <- soil_gps$Latitude[1]
x
[1] "4° 35' 53.40\"N"
# convert it using sp::char2dms() function
x <- sp::char2dms(x, chd = "°")
x
[1] 4d35'53.4"N
str(x)
Formal class 'DMS' [package "sp"] with 5 slots
  ..@ WS : logi FALSE
  ..@ deg: int 4
  ..@ min: int 35
  ..@ sec: num 53.4
  ..@ NS : logi TRUE

This is a special class that R understands as being a latitude on Earth. To convert it to decimal degrees, we just do as.numeric():

as.numeric(x)
[1] 4.598167

Now let’s do this for all the values in the soil_gps data. We will use the dplyr::mutate() function in a pipeline.

Code
soil_gps <-
  soil_gps |>
  mutate(
    Latitude = as.numeric(sp::char2dms(Latitude, chd = "°")),
    Longitude = as.numeric(sp::char2dms(Longitude, chd = "°"))
  )
soil_gps
# A tibble: 18 × 5
   Forest_type Habitat_type Plot_name Latitude Longitude
   <chr>       <chr>        <chr>        <dbl>     <dbl>
 1 Kerangas    Intact       KU1           4.60      115.
 2 Kerangas    Intact       KU2           4.59      115.
 3 Kerangas    Intact       KU3           4.60      115.
 4 Kerangas    Intact       KU4           4.63      114.
 5 Kerangas    Intact       KU5           4.60      115.
 6 Kerangas    Intact       KU6           4.60      115.
 7 Kerangas    Invaded      KI1           4.59      115.
 8 Kerangas    Invaded      KI2           4.59      115.
 9 Kerangas    Invaded      KI3           4.59      115.
10 Kerangas    Invaded      KI4           4.59      115.
11 Kerangas    Invaded      KI5           4.59      115.
12 Kerangas    Invaded      KI6           4.59      115.
13 Kerangas    Plantation   AP1           4.59      115.
14 Kerangas    Plantation   AP2           4.59      115.
15 Kerangas    Plantation   AP3           4.59      115.
16 Kerangas    Plantation   AP4           4.59      115.
17 Kerangas    Plantation   AP5           4.59      115.
18 Kerangas    Plantation   AP6           4.59      115.

Preliminary plot of the data

What we’ll learn
  • Plotting point data on a map of Brunei using ggplot2 and the bruneimap data sets.
  • Zooming in to a region of interest with coord_sf().

Using the data contained in the bruneimap package, we can plot the study areas on a map of Brunei. Use either the brn_sf, dis_sf, mkm_sf or kpg_sf data sets.

Code
ggplot(brn_sf) +
  geom_sf() +
  geom_point(data = soil_gps, aes(Longitude, Latitude)) 

We can zoom in a bit… but we have to manually figure out the correct bounding box (an alternative using st_bbox() is sketched after the code below).

Code
ggplot(mkm_sf) +
  geom_sf() +
  geom_sf(data = dis_sf, fill = NA, col = "black", linewidth = 1) +
  geom_point(data = soil_gps, aes(Longitude, Latitude)) +
  geom_text_repel(
    data = soil_gps,
    aes(Longitude, Latitude, label = Plot_name),
    box.padding = 0.5,
    max.overlaps = 30
  ) +
  coord_sf(
    xlim = c(114.4, 114.6),
    ylim = c(4.5, 4.7)
  )
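
Rather than hard-coding the limits, one option (a sketch) is to derive the bounding box from the points themselves using sf::st_bbox(), perhaps with a little padding:

Code
bbox <- soil_gps |>
  st_as_sf(coords = c("Longitude", "Latitude"), crs = 4326) |>
  st_bbox()

ggplot(mkm_sf) +
  geom_sf() +
  geom_point(data = soil_gps, aes(Longitude, Latitude)) +
  coord_sf(
    xlim = c(bbox["xmin"], bbox["xmax"]) + c(-0.05, 0.05),  # small padding
    ylim = c(bbox["ymin"], bbox["ymax"]) + c(-0.05, 0.05)
  )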

Merge with the study data

What we’ll learn
  • Using left_join() to merge two data sets together.
  • Using geom_jitter() to plot the study variables that are overlapping.

Let’s take a look at the data set.

glimpse(soil_physico)
Rows: 144
Columns: 16
$ Habitat_type              <chr> "Intact", "Intact", "Intact", "Intact", "Int…
$ Plot_name                 <chr> "KU1", "KU1", "KU1", "KU1", "KU1", "KU1", "K…
$ Subplot_name              <chr> "A", "A", "B", "B", "C", "C", "D", "D", "A",…
$ Soil_depth                <chr> "0-15", "30-50", "0-15", "30-50", "0-15", "3…
$ Nitrogen                  <dbl> 0.617, 0.188, 0.663, 0.200, 0.465, 0.255, 0.…
$ Phosphorus                <dbl> 0.248, 0.129, 0.259, 0.295, 0.172, 0.145, 0.…
$ Magnesium                 <dbl> 0.000, 0.045, 0.054, 0.035, 0.079, 0.043, 0.…
$ Calcium                   <dbl> 0.167, 0.187, 0.148, 0.113, 0.253, 0.229, 0.…
$ Potassium                 <dbl> 0.059, 0.037, 0.054, 0.022, 0.098, 0.033, 0.…
$ Exchangable_magnesium     <dbl> 0.009, 0.004, 0.007, 0.005, 0.029, 0.014, 0.…
$ Exchangable_calcium       <dbl> 0.010, 0.009, 0.008, 0.009, 0.109, 0.041, 0.…
$ Exchangable_potassium     <dbl> 0.101, 0.085, 0.092, 0.087, 0.101, 0.090, 0.…
$ Available_phosphorus      <dbl> 0.012, 0.012, 0.013, 0.012, 0.013, 0.014, 0.…
$ pH                        <dbl> 2.3, 2.7, 2.0, 2.0, 2.6, 2.5, 2.3, 2.1, 1.0,…
$ Gravimetric_water_content <dbl> 5.911, 3.560, 10.860, 5.082, 6.963, 4.549, 5…
$ Organic_matter            <dbl> 4.559, 1.399, 4.523, 2.309, 3.131, 2.209, 3.…
glimpse(soil_texture)
Rows: 144
Columns: 8
$ Habitat_type           <chr> "Intact", "Intact", "Intact", "Intact", "Intact…
$ Plot_name              <chr> "KU1", "KU1", "KU1", "KU1", "KU2", "KU2", "KU2"…
$ Subplot_name           <chr> "A", "B", "C", "D", "A", "B", "C", "D", "A", "B…
$ Soil_depth             <chr> "0-15", "0-15", "0-15", "0-15", "0-15", "0-15",…
$ Clay                   <dbl> 0.0, 0.0, 0.0, 0.0, 0.0, 2.5, 2.5, 2.5, 0.0, 2.…
$ Silt                   <dbl> 2.5, 0.0, 0.0, 2.5, 0.0, 0.0, 2.5, 2.5, 7.5, 7.…
$ Sand                   <dbl> 97.5, 100.0, 100.0, 97.5, 100.0, 97.5, 95.0, 95…
$ Texture_classification <chr> "Sand", "Sand", "Sand", "Sand", "Sand", "Sand",…

The soil_physico and soil_texture data sets share the same identifying columns (habitat type, plot, subplot, and soil depth), so we might as well merge them together. We will use the dplyr::left_join() function.

Code
# Actually I just want to merge these two together
soil_df <- left_join(
  soil_physico,
  soil_texture,
  by = join_by(Habitat_type, Plot_name, Subplot_name, Soil_depth)
)
soil_df
# A tibble: 144 × 20
   Habitat_type Plot_name Subplot_name Soil_depth Nitrogen Phosphorus Magnesium
   <chr>        <chr>     <chr>        <chr>         <dbl>      <dbl>     <dbl>
 1 Intact       KU1       A            0-15          0.617      0.248     0    
 2 Intact       KU1       A            30-50         0.188      0.129     0.045
 3 Intact       KU1       B            0-15          0.663      0.259     0.054
 4 Intact       KU1       B            30-50         0.2        0.295     0.035
 5 Intact       KU1       C            0-15          0.465      0.172     0.079
 6 Intact       KU1       C            30-50         0.255      0.145     0.043
 7 Intact       KU1       D            0-15          0.285      0.225     0.052
 8 Intact       KU1       D            30-50         0.057      0.207     0.031
 9 Intact       KU2       A            0-15          0.37       0.135     0.038
10 Intact       KU2       A            30-50         0.114      0.168     0.021
# ℹ 134 more rows
# ℹ 13 more variables: Calcium <dbl>, Potassium <dbl>,
#   Exchangable_magnesium <dbl>, Exchangable_calcium <dbl>,
#   Exchangable_potassium <dbl>, Available_phosphorus <dbl>, pH <dbl>,
#   Gravimetric_water_content <dbl>, Organic_matter <dbl>, Clay <dbl>,
#   Silt <dbl>, Sand <dbl>, Texture_classification <chr>

Once we’ve done that, the soil_df data set (the study variables) is actually missing the spatial data. We need to geocode it with the soil_gps data set. Again, dplyr::left_join() to the rescue!

Code
soil_df <- left_join(
  soil_df, 
  soil_gps,
  by = join_by(Habitat_type, Plot_name)
)

Now we’re in a position to plot the study variables on the map. Note that there are only 18 plots in the soil_gps data set, and each plot has repeated measurements. That means when we plot them, the points will overlap and look like a single point. So a good thing to do is to jitter the points so they are easier to see.

Code
ggplot(kpg_sf) +
  geom_sf(fill = NA) +
  geom_jitter(
    data = soil_df, 
    aes(Longitude, Latitude, col = Nitrogen, size = Nitrogen, 
        shape = Habitat_type),
    width = 0.001, height = 0.001, alpha = 0.7
  ) +
  coord_sf(
    xlim = c(114.46, 114.54),
    ylim = c(4.58, 4.64)
  ) +
  scale_color_viridis_c() +
  guides(size = "none")

Predictive models

What we’ll learn
  • Using kernlab::gausspr() to build a predictive Gaussian process regression (kriging) model.
  • Important to consult an expert in the field before building a model. GIGO!

At this stage, we probably want to consult an expert in this field, and ask what variables are important for predicting the nitrogen content in the soil, so that we can build a predictive model. In mathematics, a model is simply a relationship between variables, like so: \[ y = f(x_1,x_2,\dots,x_p) + \epsilon \] where the \(x_1,\dots,x_p\) are the input variables, and \(y\) is the output variable of interest (in this case, nitrogen concentrations). No model can perfectly account for this relationship, so we add a term \(\epsilon\) which represents the error in the model. Ideally this should be as small as possible.

Several models, ranging from classical statistical models to complex machine learning models, are possible:

  1. Linear regression model – lm()
  2. Generalised additive models (GAM) (Wood 2017) – mgcv::gam()
  3. Gaussian process regression (Kriging) – kernlab::gausspr()
  4. Random forests – randomForest::randomForest()
  5. Geographically weighted regression (GWR) – spgwr::gwr()

Let’s focus on Gaussian process regression (because that’s one that I know quite well). The idea of GPR is to model the relationship between the input variables and the output variable as a multivariate Gaussian distribution: \[ f(x) \sim \operatorname{N}\big(0, K(x,x')\big) \] where \(K(x,x')\) is the covariance function, which measures the similarity between the input variables \(x\) and \(x'\). The most common covariance function is the squared exponential function: \[ K(x,x') = \exp\left(-\frac{1}{2}\sum_{i=1}^p (x_i - x'_i)^2\right). \] Let \(d\) represent the “distance” between two points. Then the squared exponential kernel becomes very small when this distance \(d\) is large, and very large when \(d\) is small. Put another way, since elements of the matrix \(K\) represent co-variability, that means two points that are close together will behave similarly, and vice versa. This is very much in line with Tobler’s first law of geography: “Everything is related to everything else, but near things are more related than distant things”.

Code
x <- seq(-4, 4, length = 100)
y <- exp(-x^2)
tibble(d = x, Kxx = y) |>
  ggplot(aes(d, Kxx)) +
  geom_line() 

Given the assumed behaviour of our function \(f(x)\), and what is observed from the data, we can then make predictions about the nitrogen content at unobserved locations. Skipping over a whole lot of mathematics, let’s just fit this model in R.

Code
# Build a model to predict Nitrogen from all numeric variables. This is
# definitely not theory based, so just want to show the code.
soil_df <-
  soil_df |>
  select(where(is.numeric)) 
mod <- gausspr(Nitrogen ~ ., data = soil_df)
Using automatic sigma estimation (sigest) for RBF or laplace kernel 
Code
mod
Gaussian Processes object of class "gausspr" 
Problem type: regression 

Gaussian Radial Basis kernel function. 
 Hyperparameter : sigma =  0.0563047122723642 

Number of training instances learned : 144 
Train error : 0.111691693 

Having done that, we now want to prepare a prediction data frame. Essentially, we will rasterise the study area into a predefined grid. For the other variables, we will just set them at their mean values.

Code
xr <- c(114.4, 114.6)
xx <- seq(xr[1] - 0.01, xr[2] + 0.01, length = 100)

yr <- c(4.5, 4.7)
yy <- seq(yr[1] - 0.01, yr[2] + 0.01, length = 100)

mean_X <- 
  soil_df |>
  summarise(across(everything(), mean)) |>
  select(-Longitude, -Latitude)

pred_df <-
  expand_grid(
    Longitude = xx,
    Latitude = yy
  ) |>
  bind_cols(mean_X)

pred_df$ypred <- predict(mod, newdata = pred_df)

# Additional step: filter points that are outside of the Brunei land area.
pred_sf <- 
  pred_df |>
  st_as_sf(coords = c("Longitude", "Latitude"), crs = 4326) |>
  st_filter(y = brn_sf[1, ])

ggplot() +
  geom_raster(
    data = pred_sf,
    aes(fill = ypred, geometry = geometry),
    stat = "sf_coordinates",
    alpha = 0.8
  ) +
  # geom_raster(data = pred_df, aes(Longitude, Latitude, fill = ypred),
  #             alpha = 0.8) +
  geom_sf(data = kpg_sf, fill = NA, inherit.aes = FALSE, col = "black") +
  geom_sf(data = dis_sf, fill = NA, col = "black", linewidth = 1) +
  geom_point(data = soil_gps, aes(Longitude, Latitude, 
                                  shape = Habitat_type)) +
  geom_text_repel(
    data = soil_gps,
    aes(Longitude, Latitude, label = Plot_name),
    box.padding = 0.5,
    max.overlaps = 30
  ) +
  scale_fill_viridis_c() +
  scale_colour_viridis_c() +
  coord_sf(xlim = xr, ylim = yr)

Warning

Garbage In Garbage Out! Modelling is as much an art as it is a science. Careful consideration needs to be made as to what is considered a predictor of a variable.

Line data ((MULTI)LINESTRING)

What we’ll learn
  • Reading GeoJSON line data with sf::read_sf() and setting the CRS.
  • Exploring, filtering and plotting (MULTI)LINESTRING data such as road networks.

For this example, we’ll play with the road network shape file obtained from OpenStreetMaps. The data is in geojson format, so let’s import that into R.

Code
brd <- 
  read_sf("data/hotosm_brn_roads_lines_geojson/hotosm_brn_roads_lines_geojson.geojson") |>
  st_transform(4326)  # SET THE CRS!!! (WGS84)
glimpse(brd)
Rows: 25,570
Columns: 15
$ name       <chr> "Simpang 393", "Simpang 405", NA, NA, NA, NA, "Lebuhraya Tu…
$ `name:en`  <chr> NA, NA, NA, NA, NA, NA, "Tutong–Telisai Highway", NA, NA, N…
$ highway    <chr> "residential", "residential", "service", "residential", "tr…
$ surface    <chr> NA, NA, NA, NA, NA, "asphalt", "asphalt", NA, NA, NA, "asph…
$ smoothness <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ width      <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ lanes      <chr> NA, NA, NA, NA, NA, "1", "2", NA, NA, NA, "2", NA, NA, NA, …
$ oneway     <chr> NA, NA, NA, NA, NA, "yes", "yes", NA, NA, NA, "no", "yes", …
$ bridge     <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ layer      <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ source     <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ `name:ms`  <chr> NA, NA, NA, NA, NA, NA, "Lebuhraya Tutong–Telisai", NA, NA,…
$ osm_id     <int> 386886618, 481030903, 512405939, 664532755, 442044892, 6651…
$ osm_type   <chr> "ways_line", "ways_line", "ways_line", "ways_line", "ways_l…
$ geometry   <LINESTRING [°]> LINESTRING (114.6236 4.7910..., LINESTRING (114.…

There are 25,570 features in this data set, which may be a bit too much. Let’s try to focus on the major roads only. This information seems to be contained in the highway column. What’s in it?

Code
table(brd$highway)

     bridleway   construction       cycleway        footway  living_street 
             1             28             73            898             10 
      motorway  motorway_link           path     pedestrian        primary 
           116            152            140             60            865 
  primary_link    residential           road      secondary secondary_link 
           332           9023              1            446             79 
       service          steps       tertiary  tertiary_link          track 
          9876             53            586             59            442 
         trunk     trunk_link   unclassified 
           460            310           1560 

According to this wiki, in OpenStreetMap the major roads of a road network are sorted on an importance scale, from motorway down to quaternary road.

Code
brd_mjr <- 
  brd |>
  filter(highway %in% c("motorway", "trunk", "primary", "secondary")) 
brd_mjr
Simple feature collection with 1887 features and 14 fields
Geometry type: LINESTRING
Dimension:     XY
Bounding box:  xmin: 114.1906 ymin: 4.516642 xmax: 115.2021 ymax: 5.037115
Geodetic CRS:  WGS 84
# A tibble: 1,887 × 15
   name     `name:en` highway surface smoothness width lanes oneway bridge layer
 * <chr>    <chr>     <chr>   <chr>   <chr>      <chr> <chr> <chr>  <chr>  <chr>
 1 Lebuhra… Tutong–T… trunk   asphalt <NA>       <NA>  2     yes    <NA>   <NA> 
 2 Lebuhra… Tutong–T… trunk   asphalt <NA>       <NA>  3     yes    <NA>   <NA> 
 3 Jalan S… <NA>      primary asphalt <NA>       <NA>  2     yes    yes    1    
 4 Jalan S… <NA>      primary asphalt <NA>       <NA>  2     yes    <NA>   <NA> 
 5 Lebuh R… Seria–Be… trunk   asphalt <NA>       <NA>  2     yes    <NA>   <NA> 
 6 <NA>     <NA>      trunk   asphalt <NA>       <NA>  2     yes    <NA>   <NA> 
 7 <NA>     <NA>      primary asphalt <NA>       <NA>  1     yes    <NA>   <NA> 
 8 Lebuh R… Seria–Be… trunk   asphalt <NA>       <NA>  2     yes    yes    1    
 9 <NA>     <NA>      primary asphalt <NA>       <NA>  2     yes    <NA>   <NA> 
10 Lebuhra… Telisai–… trunk   asphalt <NA>       <NA>  2     yes    <NA>   <NA> 
# ℹ 1,877 more rows
# ℹ 5 more variables: source <chr>, `name:ms` <chr>, osm_id <int>,
#   osm_type <chr>, geometry <LINESTRING [°]>

And now a plot of these roads.

Code
ggplot() +
  geom_sf(data = brn_sf) +
  geom_sf(data = brd_mjr, aes(col = highway), size = 0.5) +
  # scale_colour_viridis_d(option = "turbo")
  ggsci::scale_colour_npg()

With this, I asked ChatGPT what kind of spatial analyses can be done on this data set. It said, when paired with appropriate data, we can do things like:

  1. Network Connectivity Analysis
    • Assess reachability and identify disconnected road network components.
  2. Accessibility and Service Area Analysis
    • Determine service areas and catchment areas for essential services.
  3. Traffic Simulation and Management
    • Simulate traffic flow to identify bottlenecks and suggest optimal routing.
  4. Environmental Impact Assessment
    • Estimate vehicular emissions and model noise pollution from roads.
  5. Urban and Regional Planning
    • Examine land use compatibility and assess infrastructure development needs.
  6. Safety Analysis
    • Identify accident hotspots and assess pedestrian safety.
  7. Economic Analysis
    • Evaluate economic accessibility and the impact of road projects.

Let’s pick one of these: Calculate the distance between the centroid of several regions and the major hospital in the Belait district. This analysis guides urban and healthcare planning by pinpointing areas with inadequate access to emergency services, enabling targeted infrastructure and service improvements.
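
As an aside, if we literally wanted region centroids as the origin points, a quick sketch (using the kpg_sf data from bruneimap) would be sf::st_centroid(); note that sf will warn that centroids computed on unprojected longitude/latitude polygons are only approximate. The analysis below samples points from the road network instead.

Code
belait_centroids <-
  kpg_sf |>
  filter(district == "Belait") |>
  st_centroid()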

Road networks in Belait region

What we’ll learn
  • Manipulating GIS data using sf::st_intersection() and the like. Useful for reorganising the spatial structure (without having to do this in QGIS or ArcGIS).
  • Sampling points from a line data set.
  • Calculating distances between points and lines using osrm package.

First we “crop” the road network to the Belait region.

Code
brd_belait <- st_intersection(
  brd,
  filter(dis_sf, name == "Belait")
)

ggplot(brd_belait) +
  geom_sf() +
  geom_sf(data = filter(dis_sf, name == "Belait"), fill = NA)

If we were to sample random points from the Belait polygon, we might get nonsensical locations such as extremely rural areas or forest reserves. So the idea is to sample random points from the road network itself. For this, we need a function that gets us a random point on a given path.

Code
get_random_point <- function(linestring) {
  coords <- st_coordinates(linestring)
  samp_coord <- coords[sample(nrow(coords), 1), , drop = FALSE]
  samp_coord[, 1:3]
}
get_random_point(brd_belait$geometry[1])
         X          Y         L1 
114.241433   4.594193   1.000000 

Once we have this function, we need to map() this function onto each of the linestrings in the brd_belait data set. The resulting list of points is too large! So we will just sample 100 points (you can experiment with this number).

Code
random_points <-
  map(brd_belait$geometry, get_random_point) |>
  bind_rows() |>
  slice_sample(n = 100)

What we have now is a data frame of 100 random points on the road network in the Belait district. We will use the osrm package to calculate the distance between these points and the Suri Seri Begawan Hospital in Kuala Belait. The output will be three things: 1) The duration (minutes); 2) The distance (km); and 3) a LINESTRING object that represents the path to get to the hospital. Unfortunately the osrmRoute() function is not vectorised, i.e. we have to do it one-by-one for each of the 100 points. Luckily, we can just make a for loop and store the results in a list.

Code
suriseri <- c(114.198778, 4.583444)

res <- list()
for (i in 1:100) {
  res[[i]] <- osrmRoute(src = random_points[i, 1:2], dst = suriseri, overview = "full")
}
res <- 
  bind_rows(res) |>
  as_tibble() |>
  st_as_sf()
res
Simple feature collection with 100 features and 4 fields
Geometry type: LINESTRING
Dimension:     XY
Bounding box:  xmin: 114.1878 ymin: 4.33654 xmax: 114.694 ymax: 4.69931
Geodetic CRS:  WGS 84
# A tibble: 100 × 5
   src   dst   duration distance                                        geometry
   <chr> <chr>    <dbl>    <dbl>                                <LINESTRING [°]>
 1 1     dst       8.83     7.01 (114.2325 4.58214, 114.2325 4.58214, 114.2325 …
 2 1     dst      17.2     11.7  (114.2925 4.60607, 114.2925 4.60608, 114.2927 …
 3 1     dst       3.68     2.26 (114.2148 4.5902, 114.2148 4.5902, 114.2142 4.…
 4 1     dst      21.7     22.3  (114.3483 4.62193, 114.3461 4.62121, 114.3446 …
 5 1     dst      32.4     35.5  (114.4511 4.65474, 114.4511 4.65475, 114.4514 …
 6 1     dst      30.9     36.7  (114.4614 4.6576, 114.4614 4.6576, 114.4627 4.…
 7 1     dst      19.1     19.1  (114.3266 4.59817, 114.3266 4.59817, 114.3268 …
 8 1     dst      34.8     37.7  (114.4822 4.67307, 114.4822 4.67307, 114.4815 …
 9 1     dst      25.3     23.2  (114.3541 4.62083, 114.3544 4.62097, 114.3545 …
10 1     dst      18.1     14.9  (114.3183 4.61294, 114.3183 4.61295, 114.3183 …
# ℹ 90 more rows

So with all that done, we can now plot the paths taken by the 100 random points to the hospital. The map gives us an indication of which areas are underserved by the hospital, and can guide urban and healthcare planning by pinpointing areas with inadequate access to emergency services, enabling targeted infrastructure and service improvements.

Code
ggplot(res) +
  # geom_point(data = random_points, aes(x = X, y = Y), col = "red") +
  geom_sf(data = filter(kpg_sf, district == "Belait"), fill = NA) +
  geom_sf(aes(col = duration), linewidth = 1.2, alpha = 0.7) +
  geom_point(x = suriseri[1], y = suriseri[2], col = "red3", pch = "X", 
             size = 3) +
  scale_colour_viridis_c() 

Improving the analysis

  • Weight analysis by populous areas. Outcalls to hospitals can be modelled using a Poisson distribution with the population as the rate parameter.
  • Use a more sophisticated routing algorithm that accounts for traffic conditions and road quality (am vs pm, weekends vs weekdays, etc.).
  • Simpler to analyse at the kampong or mukim level?

Areal data ((MULTI)POLYGONS)

What we’ll learn
  • Representing areal data as polygons and geocoding it using left_join().
  • Plotting choropleth maps with continuous and binned colour scales.

When your study data is made up of a finite number of non-overlapping areas, then you can represent them as polygons in R. This is the case for the kampong and mukim data in Brunei. As an example, let us look at the population of each kampong in Brunei. This dataset comes from the 2021 Brunei Census data (DEPS 2022).

Code
glimpse(bn_census2021)
Rows: 365
Columns: 11
$ id           <dbl> 1, 2, 3, 4, 5, 6, 8, 9, 10, 12, 14, 15, 16, 17, 18, 19, 2…
$ kampong      <chr> "Kg. Biang", "Kg. Amo", "Kg. Sibut", "Kg. Sumbiling Baru"…
$ mukim        <chr> "Mukim Amo", "Mukim Amo", "Mukim Amo", "Mukim Amo", "Muki…
$ district     <chr> "Temburong", "Temburong", "Temburong", "Temburong", "Temb…
$ population   <dbl> 75, 394, 192, 91, 108, 143, 199, 123, 95, 90, 92, 2427, 4…
$ pop_male     <dbl> 46, 218, 98, 48, 60, 68, 115, 65, 52, 46, 73, 1219, 252, …
$ pop_female   <dbl> 29, 176, 94, 43, 48, 75, 84, 58, 43, 44, 19, 1208, 150, 2…
$ pop_bruneian <dbl> 37, 280, 174, 55, 57, 64, 114, 88, 63, 35, 37, 1557, 235,…
$ pop_pr       <dbl> 33, 83, 17, 24, 41, 64, 64, 28, 29, 32, 2, 179, 3, 67, 32…
$ household    <dbl> 13, 83, 37, 23, 23, 23, 38, 26, 26, 23, 14, 517, 76, 691,…
$ occ_liv_q    <dbl> 13, 62, 27, 16, 22, 21, 37, 22, 12, 23, 14, 492, 71, 681,…

Each row of the data refers to a kampong-level observation. While there are unique identifiers to this (id, kampong, mukim, district), we would still need to geocode this data set so that we can do fun things like plot it on a map. Let’s use (again) left_join() to do this.

Code
bn_pop_sf <- 
  left_join(
    kpg_sf, 
    bn_census2021, 
    by = join_by(id, kampong, mukim, district)
  )

Great. Let’s take a look at the population column. It would be very interesting to see where most of the 440,704 people of Brunei live!

Code
ggplot(bn_pop_sf) +
  geom_sf(aes(fill = population)) +
  scale_fill_viridis_c(na.value = NA)

As expected, there are “hotspots” of population in the Brunei-Muara district, and to a lesser extent in the Belait district. We can make this graph a bit better by binning the population values. It seems to be dominated by a lot of these low value colours. Let’s take a look at this further by inspecting a histogram.

Code
ggplot(bn_pop_sf) +
  geom_histogram(aes(population), binwidth = 100)

So maybe we can bin the population into 4 categories: < 100, 101-1000, 1001-10000, and 10000+. For this we can directly use scale_fill_viridis_b() and adjust the breaks. Otherwise we would have to cut() the population column and then use scale_fill_manual() (a sketch of this alternative is shown after the code below). We also added the names of the top 10 most populous kampongs to the map using ggrepel::geom_label_repel().

Code
kpg_labels_sf <-
  bn_pop_sf |>
  arrange(desc(population)) |>
  slice_head(n = 10)

bn_pop_sf |>
  # filter(population > 50) |>
  ggplot() +
  geom_sf(aes(fill = population), col = NA, alpha = 0.8) +
  geom_sf(data = kpg_sf, fill = NA, col = "black") +
  ggrepel::geom_label_repel(
    data = kpg_labels_sf,
    aes(label = kampong, geometry = geometry),
    stat = "sf_coordinates",
    inherit.aes = FALSE,
    box.padding = 1,
    size = 2,
    max.overlaps = Inf
  ) +
  scale_fill_viridis_b(
    name = "Population",
    na.value = NA,
    labels = scales::comma,
    breaks = c(0, 100, 1000, 10000, 20000)
    # limits = c(0, 12000)
  ) +
  theme_bw()
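
For reference, the cut() route mentioned above might look something like this (a sketch; the hex colours are picked by hand to roughly mimic the viridis palette):

Code
bn_pop_sf |>
  mutate(pop_cat = cut(
    population,
    breaks = c(0, 100, 1000, 10000, Inf),
    labels = c("< 100", "101-1,000", "1,001-10,000", "> 10,000")
  )) |>
  ggplot() +
  geom_sf(aes(fill = pop_cat), col = NA) +
  scale_fill_manual(
    name = "Population",
    values = c("#440154", "#31688E", "#35B779", "#FDE725"),
    na.value = "grey90"
  )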

Conditional Autoregressive (CAR) models

What we’ll learn
  • Use spatial autoregressive models (CAR models) to quantify discrepancies in the variable of interest due to the area.
  • Use the CARBayes package to do this.
  • Plot interactive maps using mapview.

One of the most interesting things about areal data is that you might want to find out how the value of a variable of interest is impacted simply by being in a certain area. A possible model could be \[ y_i = \alpha + f(x_i) + \phi_i + \epsilon_i \] where \(y_i\) is the value of the variable of interest in area \(i\), \(\alpha\) is the intercept, \(f(x_i)\) is the covariate predictor, and \(\phi_i\) represents the “offset” to the average value of \(y_i\) due to the area \(i\). If you’re familiar with ANOVA, this feels similar to that, except that we impose a certain structure on these \(\phi_i\)s: \[ \phi_i \mid \boldsymbol{\phi}_{-i} \sim \operatorname{N} \left( \rho\sum_{j \sim i} w_{ij} \phi_j \ , \ \sigma^2 \right). \] Here \(\rho\) is a parameter that controls the amount of spatial autocorrelation in the model, and \(w_{ij}\) is assumed row-normalised. This means that these random effects \(\phi\) must take into account the structure of the neighbourhoods of the areas when estimating these values. This is the Conditional Autoregressive (CAR) model, and it is a very popular model for areal data.

Let’s take a peek at the hsp_bn data that we saw earlier.

Code
glimpse(hsp_bn)
Rows: 355
Columns: 8
$ id        <dbl> 1, 2, 3, 4, 5, 6, 182, 183, 184, 185, 187, 189, 191, 192, 19…
$ kampong   <chr> "Kg. Biang", "Kg. Amo", "Kg. Sibut", "Kg. Sumbiling Baru", "…
$ mukim     <chr> "Mukim Amo", "Mukim Amo", "Mukim Amo", "Mukim Amo", "Mukim A…
$ price     <dbl> 111334.17, 90746.26, 229180.29, 134933.77, 113434.67, 107893…
$ built_up  <dbl> 830, 820, 1300, 1300, 820, 840, 830, 1700, 1700, 810, 2200, …
$ land_size <dbl> 0.5150, 0.6140, 0.0421, 0.1800, 0.3100, 2.7200, 0.5800, 0.06…
$ beds      <dbl> 2, 2, 3, 3, 2, 2, 2, 3, 3, 2, 5, 4, 5, 5, 4, 4, 4, 3, 4, 6, …
$ baths     <dbl> 1, 1, 2, 2, 1, 1, 1, 2, 2, 1, 4, 4, 4, 4, 3, 4, 4, 5, 4, 5, …
Important

The hsp_bn data is an artificial data set that I produced for instructive purposes only. It should NOT be used for research purposes.

Suppose we are interested in building a linear house price model for Brunei that predicts the price of a house based on the built-up size, number of bedrooms, number of bathrooms, and the size of the land it sits on. This will look like this: \[ \texttt{price} = \alpha + \beta_1 \texttt{built\_up} + \beta_2 \texttt{beds} + \beta_3 \texttt{baths} + \beta_4 \texttt{land\_size} + \phi + \epsilon \] From the earlier plot (Figure 1), we saw that there are a lot of gaps in the kampong areas, which create “islands” of areas disconnected from each other. Ideally the areas are all interconnected in some way. One way to solve this is to work at the higher mukim level. However, since we have kampong-level data (multiple observations per mukim), we have two approaches:

  1. Aggregate the data upwards to the mukim-level. This involves summarising the data for each mukim based on the multiple kampong observations (e.g. using medians, means and modes).

  2. Utilise the kampong-level data directly, incorporating the multilevel structure into the model.

The first approach has the downside of reducing the number of observations to the number of mukims, which might not be ideal if you have very few mukims. We only have 37 observed mukims, so in terms of sample size this is not great. So, we will use the second approach.
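
For completeness, approach 1 might be sketched as follows (we will not use it here):

Code
# Sketch of approach 1: collapse the kampong observations to one row per mukim
hsp_mkm <-
  hsp_bn |>
  group_by(mukim) |>
  summarise(across(c(price, built_up, land_size, beds, baths), median))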

The first thing to do is to create the \(W\) matrix required for the CAR model, which is deduced from the neighbourhood structure of the mukims. This can be obtained using the spdep::poly2nb() function, as shown below.

Code
dat_sp <-
  hsp_bn |>
  group_by(mukim) |>
  summarise(mukim = first(mukim)) |>
  left_join(mkm_sf) |>
  st_as_sf() |>
  as("Spatial")
nb_mkm <- poly2nb(dat_sp, row.names = dat_sp$mukim, queen = FALSE)

# Plot
nb_sf <- nb2lines(nb_mkm, coords = sp::coordinates(dat_sp), as_sf = TRUE)
nb_sf <- st_set_crs(nb_sf, st_crs(dat_sp))
ggplot() +
  geom_sf(data = mkm_sf) +
  geom_sf(data = nb_sf, col = "red3")

This graph contains two disjoint subgraphs, because none of the borders of the mukims in Temburong touch any of the other mukims. This is an arbitrary decision, and we can manually change it if we want to. For instance, since the bridge is already built, we may consider Mukim Kota Batu and Mukim Bangar to be neighbours.

Code
i <- which(attr(nb_mkm, "region.id") == "Mukim Kota Batu")
j <- which(attr(nb_mkm, "region.id") == "Mukim Bangar")
nb_mkm[[i]] <- c(nb_mkm[[i]], j)
nb_mkm[[j]] <- c(nb_mkm[[j]], i)
nb_mkm
Neighbour list object:
Number of regions: 37 
Number of nonzero links: 152 
Percentage nonzero weights: 11.10299 
Average number of links: 4.108108 

The CAR model requires the \(W\) matrix, which is then obtained using the spdep::nb2mat() function. To estimate this model in R, we use the CARBayes package. This is done as follows:

Code
mod1 <- CARBayes::S.CARmultilevel(
  formula = I(price/1000) ~ I(built_up/1000) + beds + baths + land_size,
  data = hsp_bn,
  family = "gaussian",
  burnin = 1000,
  n.sample = 2000,
  ind.area = match(hsp_bn$mukim, dat_sp$mukim),
  W = nb2mat(nb_mkm, style = "B")
)
Code
mod1

#################
#### Model fitted
#################
Likelihood model - Gaussian (identity link function) 
Random effects model - Multilevel Leroux CAR
Regression equation - I(price/1000) ~ I(built_up/1000) + beds + baths + land_size


#################
#### MCMC details
#################
Total number of post burnin and thinned MCMC samples generated - 1000
Number of MCMC chains used - 1
Length of the burnin period used for each chain - 1000
Amount of thinning used - 1

############
#### Results
############
Posterior quantities and DIC

                      Mean      2.5%     97.5% n.effective Geweke.diag
(Intercept)       229.8222  211.4890  247.1329       480.6         0.6
I(built_up/1000)    9.1906    3.1481   15.3973       655.2        -0.8
beds                6.0099    0.9735   11.0969       621.3        -1.1
baths              -1.3629   -5.9237    3.3370       423.9         1.2
land_size          -9.0801  -15.3878   -2.6087       609.8         0.6
nu2               947.2506  811.8939 1107.0444       758.9         1.2
tau2             5507.2362 3253.4677 8837.4659       750.9         0.8
rho                 0.7651    0.5203    0.9749        32.0         0.7

DIC =  3481.558       p.d =  40.19155       LMPL =  -1744.1 

A note on scaling of the variables. We divide the price and built_up by 1000. This has the effect of rescaling the price to thousands of Brunei Dollars, and the effect of built_up is then measured per 1000 sqft. This is just to make the output more interpretable (and look nice), which we can do now.

Let’s take a look at the mukim-level estimates of the \(\phi\) values.

Code
phi <- apply(mod1$samples$phi, 2, mean)
names(phi) <- dat_sp$mukim
head(phi)
      Mukim Amo    Mukim Bangar Mukim Batu Apoi Mukim Berakas A Mukim Berakas B 
      -86.33861        26.07063       -65.31808        43.09296        16.75966 
    Mukim Bokok 
      -57.52087 

And if we wanted to visualise this interactively, we can use the mapview package.

Code
hsp_mapview <-
  left_join(mkm_sf, tibble(
    mukim = names(phi),
    phi = round(phi * 1000, -3)
  ))
mapview(
  hsp_mapview,
  zcol = "phi",
  layer.name = "Spatial effects (BND)",
  label = "mukim",
  alpha.regions = 0.9
)

Next steps

  • Usually want to compare against a model with no spatial effects, using e.g. CARBayes::S.glm(), and compare model fit (DIC, WAIC, predictive marginal log-likelihood); see the sketch after this list. This gives (hopefully) justification as to why a spatial model is needed.
  • Model spatial effects at the kampong level perhaps?
  • Are the variables sufficient for the model? Maybe there are other interaction effects not sufficiently accounted for (e.g. house type? age/condition of property? neighbourhood characteristics? )
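
For instance, a non-spatial comparison model with the same linear predictor could be fitted with CARBayes::S.glm() (a sketch, reusing the MCMC settings from before):

Code
mod0 <- CARBayes::S.glm(
  formula = I(price/1000) ~ I(built_up/1000) + beds + baths + land_size,
  data = hsp_bn,
  family = "gaussian",
  burnin = 1000,
  n.sample = 2000
)
mod0$modelfit  # DIC, LMPL, etc. to compare against mod1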

Summary

  1. Spatial data is data that has a spatial component to it, and can be represented as points, lines, polygons, or grids.

  2. When your study variables are spatially dependent, or spatially heterogeneous, you must account for this in your analyses.

  3. Depending on the type of spatial data you have, you can use different types of spatial models.

  • For modelling point data, you can use something as simple as linear regression (with X,Y coordinates as input) or more complex models like kriging.
  • For line data, the interest is usually in paths and distances.
  • For areal data, you can use spatial autoregressive models (CAR or SAR), or spatially varying coefficient models (e.g. GWR).
  4. R has a number of packages that can help you with spatial data analysis!

References

Bivand, Roger S, Edzer J Pebesma, Virgilio Gómez-Rubio, and Edzer Jan Pebesma. 2008. Applied Spatial Data Analysis with R. Vol. 747248717. Springer.
DEPS. 2022. “The Population and Housing Census Report (BPP) 2021: Demographic, Household and Housing Characteristics.” Department of Economic Planning and Statistics, Ministry of Finance and Economy, Brunei Darussalam.
Jaafar, Salwana Md, and Rahayu Sukmaria Sukri. 2023. “Data on the Physicochemical Characteristics and Texture Classification of Soil in Bornean Tropical Heath Forests Affected by Exotic Acacia Mangium.” Data in Brief 51 (December). https://doi.org/10.1016/j.dib.2023.109670.
Jamil, Haziq. 2024+. “A Spatio-Temporal Analysis of Property Prices in Brunei Darussalam.” 2024+. https://haziqj.ml/house-price/papers/jamil2024spatio/.
Moraga, Paula. 2019. Geospatial Health Data: Modeling and Visualization with R-INLA and Shiny. CRC Press.
Pebesma, Edzer, and Roger Bivand. 2023. Spatial Data Science: With Applications in R. 1st ed. New York: Chapman and Hall/CRC. https://doi.org/10.1201/9780429459016.
Wood, Simon N. 2017. Generalized Additive Models: An Introduction with R, Second Edition. 2nd ed. Boca Raton: Chapman and Hall/CRC. https://doi.org/10.1201/9781315370279.

Footnotes

  1. Artificial data generated for the purposes of this tutorial. Not to be used for research purposes!↩︎