The RUGS mission is to facilitate the person-to-person exchange of knowledge in small group settings on a global scale. —R Consortium
"R"|>rug("b", _, "unei")
About us
A group of UBD-ians and R enthusiasts
We want to create a community of R users in Brunei
Champion the Open Source cause
More events to come this year. Stay tuned!
Expectations
Outcomes
This is a hands-on, live-coding, lecture-style “workshop”. Expect to learn (or at the very least, see me do!)…
What spatial data is and why it’s important.
What statistical analysis can be done with spatial data.
How to perform spatial analysis using R.
A basic understanding of R is assumed.
For some, maybe it will be a bit fast-paced (sorry in advance!). All the materials will be available online (see link on the right). Truthfully, I am not expecting you to walk away as an expert in spatial analysis, given the length of this talk. At the very least, however, I hope you become aware of the various spatial techniques and how they can be achieved in R.
Also… ChatGPT is a great way to accelerate your learning!
I’m very happy to answer questions afterwards! 😀
Getting started with R
I’ll just talk about these points briefly:
What’s the difference between R and RStudio?
Quick run through RStudio’s features
Set up a project
R Scripts vs Notebooks (.qmd or .Rmd)
Executing commands in R
# Try this out for yourself!
N <- 100
x <- runif(n = N, min = 0, max = 1)
head(x)  # Show the first 6 elements
Pipelines
Suppose you have a sequence of functions that you want to apply to an object. For instance: sample 100 observations from the N(0,1), then apply the exponential function, then the sine function, then take the mean. As a nested call this is mean(sin(exp(rnorm(100)))), which is a bit difficult to read! In programming, pipelines let us pass the output of one function into the first argument of the next one. The pipe operator is ‘|’ followed by a ‘>’. My font has ligatures, that’s why it’s showing up as those triangles!
Code
rnorm(100) |>  # Random samples
  exp() |>     # exponential function
  sin() |>     # sine function
  mean()       # mean function
[1] 0.540876
For us humans, this is much more readable, so it is preferred when creating data science notebooks.
List of packages
The power of R comes from its diverse range of user-contributed packages. To install a package in R, we type install.packages("package_name"). As an example, try installing the tidyverse package.
A bunch of things will happen on your screen that makes you look like a legit hacker. (It’s normal! Unless… there are some errors in the installation process 😅) Once that’s done, you will want to load the package using library() to start using it. Here’s a list of packages we will be using today. You’ll need to install all of them before we begin. In RStudio, there will be a prompt (yellow line at the top of the source pane) for you to install all these packages with a single click.
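Code
library(kernlab)    # for GPR
library(ggrepel)    # nice labels in plots
library(osrm)       # OSM route calculator
library(spdep)      # Moran's test and more
library(CARBayes)   # spatial CAR models
library(mapview)    # interactive plots
library(scales)     # pretty formats
library(tidyverse)
theme_set(theme_bw())  # set the theme
Furthermore, there are packages that are not yet on CRAN, but are available on GitHub. Please install them using the remotes package, and don’t forget to load them afterwards. For more information, check out the package’s GitHub page (https://github.com/propertypricebn/bruneimap).
remotes::install_github("propertypricebn/bruneimap")
library(bruneimap)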
Introduction
What we’ll learn
Importance of accounting for spatial effects in data.
Using Moran’s I test to detect spatial autocorrelation.
Consider the distribution of house prices in Brunei1. I think it is well accepted that the prices of houses in Brunei are not uniformly distributed across the country, but are determined by location. Figure 1 (a) shows a faithful representation of house prices in Brunei, and we can clearly see a clustering effect. Many economists, realtors, and urban planners will undoubtedly tell you that there is some kind of “neighbouring effect” at play here. Houses closer to each other tend to exhibit more similar properties.
When we perform statistical modelling and ignore the spatial component, effectively what we are assuming is that the prices of houses are independent of their location. Under this assumption, Figure 1 (b) is a valid representation of house prices in Brunei (uniformity of prices). This is a very dangerous thing to do, especially when the spatial effects are non-ignorable. We may get nonsensical results and inferences out of the resulting model.
Of course, the degree to which this assumption is violated depends on the context and the problem at hand, as well as the scale of the data. For the above type of problem, where we have area-level data, one can use Moran’s I test of global spatial autocorrelation. It tests
\(H_0\): No spatial autocorrelation \(H_1\): (Positive) Spatial autocorrelation present
To test this in R, we can use the spdep::moran.test() function. There are two key ingredients here: the actual data itself (house prices), and the neighbourhood relationship between all the areas (the spatial structure). The code is as follows.
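Code
# Prepare neighbourhood structure
hsp_sf <- left_join(kpg_sf, hsp_bn, by = join_by(id, kampong, mukim))
nb <- poly2nb(hsp_sf, queen = FALSE, snap = 1)
Wlist <- nb2listw(nb, zero.policy = TRUE, style = "B")

# Moran's test
moran.test(hsp_sf$price, listw = Wlist, zero.policy = TRUE, na.action = na.omit)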
Types of GIS data
What we’ll learn
Types of GIS data and how these are handled in R.
Difference between spatial and non-spatial data analysis.
Importance of geocoding your data for spatial analysis.
Roughly speaking, there are 4 types of GIS data.
Points
Having \((X, Y)\) coordinates (latitude and longitude, or projected coordinates), and are “zero-dimensional”.
E.g. shopping malls, hospitals, outbreaks, etc.
Lines
A collection of points that form a path or a boundary. Has length.
E.g. roads, rivers, pipelines, etc.
Polygons
A closed area made up of line segments or curves.
E.g. countries, districts, buildings, etc.
Raster
Pixelated (or gridded) data where each pixel is associated with a geographical area and some measurement.
E.g. satellite images, elevation data, etc.
The first three are usually referred to as vector data. GIS data can be stored in various formats such as .shp or .geojson. The handling of GIS data (at least vector type data) is facilitated by the sf package (Pebesma and Bivand 2023) which uses the simple features standard.
Note
Simple features refers to a formal standard (ISO 19125-1:2004) that describes how objects in the real world can be represented in computers, with emphasis on the spatial geometry of these objects.
It’s helpful to think about the shape of this spatial data set. As an example, here’s a random slice of 10 kampong-level population data for Brunei:
Simple feature collection with 10 features and 2 fields
Geometry type: POLYGON
Dimension: XY
Bounding box: xmin: 114.4212 ymin: 4.302264 xmax: 115.1735 ymax: 4.998173
Geodetic CRS: WGS 84
# A tibble: 10 × 3
kampong population geometry
<chr> <dbl> <POLYGON [°]>
1 Pulau ... NA ((115.0435 4.997858, 115.0436 4.997846,…
2 Kg. Baong 351 ((114.9697 4.840602, 114.9698 4.840598,…
3 Kg. Perdayan NA ((115.0759 4.779878, 115.0767 4.779845,…
4 Kg. Seri Tanjong Belayang 113 ((115.0523 4.721034, 115.059 4.718149, …
5 Kg. Kandang 300 ((114.649 4.785566, 114.6497 4.784629, …
6 Kg. Bang Bulan 20 ((115.063 4.684746, 115.0633 4.684573, …
7 Kg. Sukang 57 ((114.6014 4.375045, 114.6015 4.374833,…
8 Kg. Wasan 441 ((114.8154 4.798696, 114.8157 4.798617,…
9 Kg. Pandai Besi A 87 ((114.9387 4.881099, 114.9383 4.880876,…
10 Kg. Terunan 19 ((114.4643 4.423127, 114.4643 4.42307, …
Spatial data analysis must have these two components:
The study variables (in the above example, this is population data).
GIS data regarding that study variable.
If we only have 1 without 2, then it really is just a regular data analysis (stating the obvious). Adding the GIS data is a process called “geocoding” the data points.
Note
In R, geocoding using tidyverse can be achieved using the dplyr::left_join() or similar xxx_join() family of functions.
(MULTI)POINT data
What we’ll learn
Loading data sets in R using readr::read_csv().
Identifying data types and their implications.
Use the data from Jaafar and Sukri (2023) on the physicochemical characteristics and texture classification of soil in Bornean tropical heath forests affected by exotic Acacia mangium. There are three datasets provided.
GIS data (WGS84 coordinates) of all study plots.
Soil physicochemical property data. This contains details of the soil physical and chemical properties and nutrient concentrations of the three habitats studied.
Soil texture classification. Provides details on the classification of the soil texture in the habitats studied.
We will first load the data sets in R.
Code
## Load the data sets
soil_gps <- read_csv(
  "data/8389823/GPS - Revised.csv",
  # IMPORTANT!!! The csv file has latin1 encoding as opposed to UTF-8
  locale = readr::locale(encoding = "latin1")
)
soil_physico <- read_csv("data/8389823/Soil physicochemical properties.csv")
soil_texture <- read_csv("data/8389823/Soil texture classification.csv")
Clean up the point data
What we’ll learn
Highlighting the need for cleaning and preprocessing data.
Using glimpse() to peek at the data.
Using mutate() to change stuff in the data set.
Using str() to look at the structure of an R object.
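Let’s take a look at the point data set.
glimpse(soil_gps)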
The first three columns are essentially the identifiers of the plots (forest type, habitat type, and the unique identification code for the study plot). However, the latitude and longitude need a bit of cleaning up, because they are currently in character format. They need to be converted to a formal Degree Minute Second (DMS) class that R can understand. For this we will use the sp::char2dms() function.
As an example let’s take a look at the first latitude.
x <- soil_gps$Latitude[1]
x
[1] "4° 35' 53.40\"N"
# convert it using the sp::char2dms() function
x <- sp::char2dms(x, chd = "°")
x
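This is a special class that R understands as being a latitude on Earth. To convert it to decimal degrees, we just call as.numeric(). Now let’s do this for all the values in the soil_gps data, using dplyr::mutate() in a pipeline.
Code
soil_gps <- soil_gps |>
  mutate(
    Latitude = as.numeric(sp::char2dms(Latitude, chd = "°")),
    Longitude = as.numeric(sp::char2dms(Longitude, chd = "°"))
  )
soil_gps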
Using the data contained in the bruneimap package, we can plot the study areas on a map of Brunei. Use either the brn_sf, dis_sf, mkm_sf or kpg_sf data sets.
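For example, plotting the GPS points over the national boundary:
Code
ggplot(brn_sf) +
  geom_sf() +
  geom_point(data = soil_gps, aes(Longitude, Latitude))
We can zoom in a bit, but we have to find the correct bounding box manually.
Code
ggplot(mkm_sf) +
  geom_sf() +
  geom_sf(data = dis_sf, fill = NA, col = "black", linewidth = 1) +
  geom_point(data = soil_gps, aes(Longitude, Latitude)) +
  geom_text_repel(
    data = soil_gps,
    aes(Longitude, Latitude, label = Plot_name),
    box.padding = 0.5,
    max.overlaps = 30
  ) +
  coord_sf(xlim = c(114.4, 114.6), ylim = c(4.5, 4.7))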
Merge with the study data
The soil_physico and soil_texture data sets contain the same columns, so we might as well merge them together. We will use the dplyr::left_join() function.
Code
# Actually I just want to merge these two together
soil_df <- left_join(
  soil_physico, soil_texture,
  by = join_by(Habitat_type, Plot_name, Subplot_name, Soil_depth)
)
soil_df
Once we’ve done that, the soil_df data set (the study variables) is actually missing the spatial data. We need to geocode it with the soil_gps data set. Again, dplyr::left_join() to the rescue!
Code
soil_df <- left_join(
  soil_df, soil_gps,
  by = join_by(Habitat_type, Plot_name)
)
Now we’re in a position to plot the study variables on the map. Note that there are only 18 plots in the soil_gps data set, and each plot has repeated measurements. That means when we plot it, the measurements will overlap and look like a single point. So a good thing to do is to jitter the points so they are easier to see, as in the plot below.
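Code
ggplot(kpg_sf) +
  geom_sf(fill = NA) +
  geom_jitter(
    data = soil_df,
    aes(Longitude, Latitude, col = Nitrogen, size = Nitrogen, shape = Habitat_type),
    width = 0.001, height = 0.001, alpha = 0.7
  ) +
  coord_sf(xlim = c(114.46, 114.54), ylim = c(4.58, 4.64)) +
  scale_color_viridis_c() +
  guides(size = "none")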
Predictive models
What we’ll learn
Using kernlab::gausspr() to build a predictive Gaussian process regression (kriging) model.
Important to consult an expert in the field before building a model. GIGO!
At this stage, we probably want to consult an expert in this field, and ask what variables are important for predicting the nitrogen content in the soil, so that we can build a predictive model. In mathematics, a model is simply a relationship between variables, like so: \[
y = f(x_1,x_2,\dots,x_p) + \epsilon
\] where the \(x_1,\dots,x_p\) are the input variables, and \(y\) is the output variable of interest (in this case, nitrogen concentrations). No model can perfectly account for this relationship, so we add a term \(\epsilon\) which represents the error in the model. Ideally this should be as small as possible.
Several models, ranging from classical statistical models to complex machine learning models, are possible:
Linear regression model: lm()
Generalised additive models (GAM) (Wood 2017): mgcv::gam()
Gaussian process regression (kriging): kernlab::gausspr()
Random forests: randomForest::randomForest()
Geographically weighted regression (GWR): spgwr::gwr()
Let’s focus on Gaussian process regression (because that’s one that I know quite well). The idea of GPR is to model the relationship between the input variables and the output variable as a multivariate Gaussian distribution: \[
f(x) \sim \operatorname{N}\big(0, K(x,x')\big)
\] where \(K(x,x')\) is the covariance function, which measures the similarity between the input variables \(x\) and \(x'\). The most common covariance function is the squared exponential function: \[
K(x,x') = \exp\left(-\frac{1}{2}\sum_{i=1}^p (x_i - x'_i)^2\right).
\] Let \(d\) represent the “distance” between two points. Then the squared exponential kernel becomes very small when this distance \(d\) is large, and approaches its maximum of 1 when \(d\) is small. Put another way, since elements of the matrix \(K\) represent co-variability, two points that are close together will behave similarly, and vice versa. This is very much in line with Tobler’s first law of geography: “Everything is related to everything else, but near things are more related than distant things”.
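To see this decay, we can plot the kernel value against the distance \(d\) (here plotting \(\exp(-d^2)\), as in the accompanying source):
Code
x <- seq(-4, 4, length = 100)
y <- exp(-x^2)
tibble(d = x, Kxx = y) |>
  ggplot(aes(d, Kxx)) +
  geom_line()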
Given the assumed behaviour of our function \(f(x)\), and what is observed from the data, we can then make predictions about the nitrogen content at unobserved locations. Skipping over a whole lot of mathematics, let’s just fit this model in R.
Code
# Build a model to predict Nitrogen from all numeric variables. This is
# definitely not theory based, so just want to show the code.
soil_df <- soil_df |> select(where(is.numeric))
mod <- gausspr(Nitrogen ~ ., data = soil_df)
Using automatic sigma estimation (sigest) for RBF or laplace kernel
Code
mod
Gaussian Processes object of class "gausspr"
Problem type: regression
Gaussian Radial Basis kernel function.
Hyperparameter : sigma = 0.0563047122723642
Number of training instances learned : 144
Train error : 0.111691693
Having done that, we now want to prepare a prediction data frame. Essentially, we will rasterise the study area into a predefined grid. For the other variables, we will just set them at their mean values.
Code
xr <- c(114.4, 114.6)
xx <- seq(xr[1] - 0.01, xr[2] + 0.01, length = 100)
yr <- c(4.5, 4.7)
yy <- seq(yr[1] - 0.01, yr[2] + 0.01, length = 100)
mean_X <- soil_df |>
  summarise(across(everything(), mean)) |>
  select(-Longitude, -Latitude)
pred_df <- expand_grid(Longitude = xx, Latitude = yy) |>
  bind_cols(mean_X)
pred_df$ypred <- predict(mod, newdata = pred_df)

# Additional step: filter points that are outside of the Brunei land area.
pred_sf <- pred_df |>
  st_as_sf(coords = c("Longitude", "Latitude"), crs = 4326) |>
  st_filter(y = brn_sf[1, ])

ggplot() +
  geom_raster(
    data = pred_sf,
    aes(fill = ypred, geometry = geometry),
    stat = "sf_coordinates",
    alpha = 0.8
  ) +
  # geom_raster(data = pred_df, aes(Longitude, Latitude, fill = ypred),
  #             alpha = 0.8) +
  geom_sf(data = kpg_sf, fill = NA, inherit.aes = FALSE, col = "black") +
  geom_sf(data = dis_sf, fill = NA, col = "black", linewidth = 1) +
  geom_point(data = soil_gps, aes(Longitude, Latitude, shape = Habitat_type)) +
  geom_text_repel(
    data = soil_gps,
    aes(Longitude, Latitude, label = Plot_name),
    box.padding = 0.5,
    max.overlaps = 30
  ) +
  scale_fill_viridis_c() +
  scale_colour_viridis_c() +
  coord_sf(xlim = xr, ylim = yr)
Warning
Garbage In Garbage Out! Modelling is as much an art as it is a science. Careful consideration needs to be made as to what is considered a predictor of a variable.
Line data ((MULTI)LINESTRING)
What we’ll learn
How to load spatial data sets using sf::read_sf() and editing the CRS using sf::st_transform().
How to filter data using dplyr::filter().
How to plot line data using ggplot2::geom_sf().
For this example, we’ll play with the road network shape file obtained from OpenStreetMap. The data is in geojson format, so let’s import that into R.
Code
brd <- read_sf("data/hotosm_brn_roads_lines_geojson/hotosm_brn_roads_lines_geojson.geojson") |>
  st_transform(4326)  # SET THE CRS!!! (WGS84)
glimpse(brd)
There are 25,570 features in this data set, which may be a bit too much. Let’s try to focus on the major roads only. This information seems to be contained in the highway column. What’s in it?
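Code
table(brd$highway)
According to the OpenStreetMap wiki, the major roads of a road network are sorted on an importance scale, from motorway down to quaternary road. Let’s keep only the motorway, trunk, primary, and secondary roads, and plot them.
Code
brd_mjr <- brd |>
  filter(highway %in% c("motorway", "trunk", "primary", "secondary"))
brd_mjr
Code
ggplot() +
  geom_sf(data = brn_sf) +
  geom_sf(data = brd_mjr, aes(col = highway), size = 0.5) +
  ggsci::scale_colour_npg()  # requires the ggsci package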
With this, I asked ChatGPT what kind of spatial analyses can be done on this data set. It said, when paired with appropriate data, we can do things like:
Network Connectivity Analysis
Assess reachability and identify disconnected road network components.
Accessibility and Service Area Analysis
Determine service areas and catchment areas for essential services.
Traffic Simulation and Management
Simulate traffic flow to identify bottlenecks and suggest optimal routing.
Environmental Impact Assessment
Estimate vehicular emissions and model noise pollution from roads.
Urban and Regional Planning
Examine land use compatibility and assess infrastructure development needs.
Safety Analysis
Identify accident hotspots and assess pedestrian safety.
Economic Analysis
Evaluate economic accessibility and the impact of road projects.
Let’s pick one of these: Calculate the distance between the centroid of several regions and the major hospital in the Belait district. This analysis guides urban and healthcare planning by pinpointing areas with inadequate access to emergency services, enabling targeted infrastructure and service improvements.
Road networks in Belait region
What we’ll learn
Manipulating GIS data using sf::st_intersection() and the like. Useful for reorganising the spatial structure (without having to do this in QGIS or ArcGIS).
Sampling points from a line data set.
Calculating distances between points and lines using the osrm package.
First we “crop” the road network to the Belait region.
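Code
brd_belait <- st_intersection(
  brd,
  filter(dis_sf, name == "Belait")
)
ggplot(brd_belait) +
  geom_sf() +
  geom_sf(data = filter(dis_sf, name == "Belait"), fill = NA)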
If we were to sample random points from the Belait polygon, we might get nonsensical locations such as extremely rural areas or forest reserves. So the idea is to sample random points from the road network itself. For this, we need a function that will get us a random point on the path itself.
Code
get_random_point <- function(linestring) {
  coords <- st_coordinates(linestring)
  samp_coord <- coords[sample(nrow(coords), 1), , drop = FALSE]
  samp_coord[, 1:3]
}
get_random_point(brd_belait$geometry[1])
X Y L1
114.241433 4.594193 1.000000
Once we have this function, we need to map() this function onto each of the linestrings in the brd_belait data set. The resulting list of points is too large! So we will just sample 100 points (you can experiment with this number).
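Code
random_points <- map(brd_belait$geometry, get_random_point) |>
  bind_rows() |>
  slice_sample(n = 100)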
What we have now is a data frame of 100 random points on the road network in the Belait district. We will use the osrm package to calculate the distance between these points and the Suri Seri Begawan Hospital in Kuala Belait. The output will be three things: 1) The duration (minutes); 2) The distance (km); and 3) a LINESTRING object that represents the path to get to the hospital. Unfortunately the osrmRoute() function is not vectorised, i.e. we have to do it one-by-one for each of the 100 points. Luckily, we can just make a for loop and store the results in a list.
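Code
suriseri <- c(114.198778, 4.583444)  # Suri Seri Begawan Hospital coordinates
res <- list()
for (i in 1:100) {
  res[[i]] <- osrmRoute(src = random_points[i, 1:2], dst = suriseri, overview = "full")
}
res <- bind_rows(res) |>
  as_tibble() |>
  st_as_sf()
res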
So with all that done, we can now plot the paths taken by the 100 random points to the hospital. The map gives us an indication of which areas are underserved by the hospital, and can guide urban and healthcare planning by pinpointing areas with inadequate access to emergency services, enabling targeted infrastructure and service improvements.
Code
ggplot(res) +
  # geom_point(data = random_points, aes(x = X, y = Y), col = "red") +
  geom_sf(data = filter(kpg_sf, district == "Belait"), fill = NA) +
  geom_sf(aes(col = duration), linewidth = 1.2, alpha = 0.7) +
  geom_point(x = suriseri[1], y = suriseri[2], col = "red3", pch = "X", size = 3) +
  scale_colour_viridis_c()
Improving the analysis
Weight analysis by populous areas. Outcalls to hospitals can be modelled using a Poisson distribution with the population as the rate parameter.
Use a more sophisticated routing algorithm that accounts for traffic conditions and road quality (am vs pm, weekends vs weekdays, etc.).
Simpler to analyse at the kampong or mukim level?
Areal data ((MULTI)POLYGONS)
What we’ll learn
Represent statistical data using colour mapping symbology (choropleth)
Use ggplot2::geom_label() or ggrepel::geom_label_repel() to add labels to the map
Using a binned colour scale, e.g. ggplot2::scale_fill_viridis_b()
When your study data is made up of a finite number of non-overlapping areas, then you can represent them as polygons in R. This is the case for the kampong and mukim data in Brunei. As an example, let us look at the population of each kampong in Brunei. This dataset comes from the 2021 Brunei Census data (DEPS 2022).
Each row of the data refers to a kampong-level observation. While there are unique identifiers to this (id, kampong, mukim, district), we would still need to geocode this data set so that we can do fun things like plot it on a map. Let’s use (again) left_join() to do this.
Code
bn_pop_sf <- left_join(
  kpg_sf, bn_census2021,
  by = join_by(id, kampong, mukim, district)
)
Great. Let’s take a look at the population column. It would be very interesting to see where most of the 440,704 people of Brunei live!
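Code
ggplot(bn_pop_sf) +
  geom_sf(aes(fill = population)) +
  scale_fill_viridis_c(na.value = NA)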
As expected, there are “hotspots” of population in the Brunei-Muara district, and to a lesser extent in the Belait district. We can make this graph a bit better by binning the population values. It seems to be dominated by a lot of these low value colours. Let’s take a look at this further by inspecting a histogram.
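Code
ggplot(bn_pop_sf) +
  geom_histogram(aes(population), binwidth = 100)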
So maybe we can bin the population into 4 categories: < 100, 101-1000, 1001-10000, and 10000+. For this we directly use the scale_fill_viridis_b() and adjust the breaks. Otherwise we would have to cut() the population column and then use scale_fill_manual(). We also added the names of the top 10 most populous kampongs to the map using ggrepel::geom_label_repel().
Code
kpg_labels_sf <- bn_pop_sf |>
  arrange(desc(population)) |>
  slice_head(n = 10)

bn_pop_sf |>
  # filter(population > 50) |>
  ggplot() +
  geom_sf(aes(fill = population), col = NA, alpha = 0.8) +
  geom_sf(data = kpg_sf, fill = NA, col = "black") +
  ggrepel::geom_label_repel(
    data = kpg_labels_sf,
    aes(label = kampong, geometry = geometry),
    stat = "sf_coordinates",
    inherit.aes = FALSE,
    box.padding = 1,
    size = 2,
    max.overlaps = Inf
  ) +
  scale_fill_viridis_b(
    name = "Population",
    na.value = NA,
    labels = scales::comma,
    breaks = c(0, 100, 1000, 10000, 20000)
    # limits = c(0, 12000)
  ) +
  theme_bw()
Conditional Autoregressive (CAR) models
What we’ll learn
Use spatial autoregressive models (CAR models) to quantify discrepancies in the variable of interest due to the area.
Use the CARBayes package to do this.
Plot interactive maps using mapview.
One of the most interesting things about areal data is that you might want to find out how the value of a variable of interest is impacted simply by being in a certain area. A possible model could be \[
y_i = \alpha + f(x_i) + \phi_i + \epsilon_i
\] where \(y_i\) is the value of the variable of interest in area \(i\), \(\alpha\) is the intercept, \(f(x_i)\) is the covariate predictor, and \(\phi_i\) represents the “offset” to the average value of \(y_i\) due to the area \(i\). If you’re familiar with ANOVA this feels similar to that, except that we impose a certain structure to these \(\phi_i\)s: \[
\phi_i \mid \boldsymbol{\phi}_{-i} \sim \operatorname{N} \left( \rho\sum_{j \sim i} w_{ij} \phi_j \ , \ \sigma^2 \right).
\] Here \(\rho\) is a parameter that controls the amount of spatial autocorrelation in the model, and \(w_{ij}\) is assumed row-normalised. This means that these random effects \(\phi\) must take into account the structure of the neighbourhoods of the areas when estimating these values. This is the Conditional Autoregressive (CAR) model, and it is a very popular model for areal data.
Let’s take a peek at the hsp_bn data that we saw earlier.
The hsp_bn data is an artificial data set that I produced for instructive purposes only. It should NOT be used for research purposes.
Suppose we are interested in building a linear house price model for Brunei that predicts the price of a house based on the built up size, number of bedrooms, number of bathrooms, and the size of the land it sits on. This will look like this: \[
\texttt{price} = \alpha + \beta_1 \texttt{built\_up} + \beta_2 \texttt{beds} + \beta_3 \texttt{baths} + \beta_4 \texttt{land\_size} + \phi + \epsilon
\] From the earlier plot (Figure 1), we saw that there are a lot of gaps in the kampong areas, which create “islands” of areas disconnected from each other. Ideally the areas should all be interconnected in some way. One way to solve this is to work at the higher mukim level. However, since we have kampong-level data (multiple observations per mukim), we have two approaches:
Aggregate the data upwards to the mukim-level. This involves summarising the data for each mukim based on the multiple kampong observations (e.g. using medians, means and modes).
Utilise the kampong-level data directly, incorporating the multilevel structure into the model.
The first approach has the downside of reducing the number of observations to the number of mukims, which might not be ideal if you have very few mukims. We only have 37 observed mukims so in terms of sample size strength, this is not great. So, we will use the second approach.
The first thing to do is to create the \(W\) matrix required for the CAR model, which is deduced from the neighbourhood structure of the mukims. This can be obtained using the spdep::poly2nb() function, as shown below.
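Code
dat_sp <- hsp_bn |>
  group_by(mukim) |>
  summarise(mukim = first(mukim)) |>
  left_join(mkm_sf) |>
  st_as_sf() |>
  as("Spatial")
nb_mkm <- poly2nb(dat_sp, row.names = dat_sp$mukim, queen = FALSE)

# Plot the neighbourhood structure
nb_sf <- nb2lines(nb_mkm, coords = sp::coordinates(dat_sp), as_sf = TRUE)
nb_sf <- st_set_crs(nb_sf, st_crs(dat_sp))
ggplot() +
  geom_sf(data = mkm_sf) +
  geom_sf(data = nb_sf, col = "red3")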
This graph contains 2 disjoint subgraphs, because none of the borders of the mukims in Temburong touch any of the other mukims. This is an arbitrary decision, and we can manually change it if we want to. Perhaps our reasoning is that, since the bridge has already been built, we can consider Mukim Kota Batu and Mukim Bangar to be neighbours.
Code
i <- which(attr(nb_mkm, "region.id") == "Mukim Kota Batu")
j <- which(attr(nb_mkm, "region.id") == "Mukim Bangar")
nb_mkm[[i]] <- c(nb_mkm[[i]], j)
nb_mkm[[j]] <- c(nb_mkm[[j]], i)
nb_mkm
Neighbour list object:
Number of regions: 37
Number of nonzero links: 152
Percentage nonzero weights: 11.10299
Average number of links: 4.108108
The CAR model requires the \(W\) matrix, which is then obtained using the spdep::nb2mat() function. To estimate this model in R, we use the CARBayes package. This is done as follows:
Code
mod1 <- CARBayes::S.CARmultilevel(
  formula = I(price / 1000) ~ I(built_up / 1000) + beds + baths + land_size,
  data = hsp_bn,
  family = "gaussian",
  burnin = 1000,
  n.sample = 2000,
  ind.area = match(hsp_bn$mukim, dat_sp$mukim),
  W = nb2mat(nb_mkm, style = "B")
)
Code
mod1
#################
#### Model fitted
#################
Likelihood model - Gaussian (identity link function)
Random effects model - Multilevel Leroux CAR
Regression equation - I(price/1000) ~ I(built_up/1000) + beds + baths + land_size
#################
#### MCMC details
#################
Total number of post burnin and thinned MCMC samples generated - 1000
Number of MCMC chains used - 1
Length of the burnin period used for each chain - 1000
Amount of thinning used - 1
############
#### Results
############
Posterior quantities and DIC
Mean 2.5% 97.5% n.effective Geweke.diag
(Intercept) 229.8222 211.4890 247.1329 480.6 0.6
I(built_up/1000) 9.1906 3.1481 15.3973 655.2 -0.8
beds 6.0099 0.9735 11.0969 621.3 -1.1
baths -1.3629 -5.9237 3.3370 423.9 1.2
land_size -9.0801 -15.3878 -2.6087 609.8 0.6
nu2 947.2506 811.8939 1107.0444 758.9 1.2
tau2 5507.2362 3253.4677 8837.4659 750.9 0.8
rho 0.7651 0.5203 0.9749 32.0 0.7
DIC = 3481.558 p.d = 40.19155 LMPL = -1744.1
A note on scaling of the variables. We divide the price and built_up by 1000. This has the effect of rescaling the price to thousands of Brunei Dollars, and the effect of built_up is then measured per 1000 sqft. This is just to make the output more interpretable (and look nice), which we can do now.
Let’s take a look at the mukim-level estimates of the \(\phi\) values.
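Code
phi <- apply(mod1$samples$phi, 2, mean)
names(phi) <- dat_sp$mukim
head(phi)
And if we want to visualise this interactively, we can use the mapview package.
Code
hsp_mapview <- left_join(
  mkm_sf,
  tibble(mukim = names(phi), phi = round(phi * 1000, -3))
)
mapview(
  hsp_mapview,
  zcol = "phi",
  layer.name = "Spatial effects (BND)",
  label = "mukim",
  alpha.regions = 0.9
)
Next steps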
Usually want to compare against a model with no spatial effects, using e.g. CARBayes::S.glm(), and compare model fit (DIC, WAIC, predictive marginal log-likelihood). This gives (hopefully) justification as to why a spatial model is needed.
Model spatial effects at the kampong level perhaps?
Are the variables sufficient for the model? Maybe there are other interaction effects not sufficiently accounted for (e.g. house type? age/condition of property? neighbourhood characteristics? )
Summary
Spatial data is data that has a spatial component to it, and can be represented as points, lines, polygons, or grids.
When your study variables are spatially dependent, or spatially heterogeneous, you must account for this in your analyses.
Depending on the type of spatial data you have, you can use different types of spatial models.
For modelling point data, you can use something as simple as linear regression (with X,Y coordinates as input) or more complex models like kriging.
For line data, the interest is usually in paths and distances.
For areal data, you can use spatial autoregressive models (CAR or SAR), or spatially varying coefficient models (e.g. GWR).
R has a number of packages that can help you with spatial data analysis!
Figure 1: Spatial distribution of house prices in Brunei. (a) Original distribution; (b) Random shuffle.
References
Bivand, Roger S., Edzer J. Pebesma, and Virgilio Gómez-Rubio. 2008. Applied Spatial Data Analysis with R. Springer.
DEPS. 2022. “The Population and Housing Census Report (BPP) 2021: Demographic, Household and Housing Characteristics.” Department of Economic Planning and Statistics, Ministry of Finance and Economy, Brunei Darussalam.
Jaafar, Salwana Md, and Rahayu Sukmaria Sukri. 2023. “Data on the Physicochemical Characteristics and Texture Classification of Soil in Bornean Tropical Heath Forests Affected by Exotic Acacia Mangium.” Data in Brief 51 (December). https://doi.org/10.1016/j.dib.2023.109670.
Moraga, Paula. 2019. Geospatial Health Data: Modeling and Visualization with R-INLA and Shiny. CRC Press.
Pebesma, Edzer, and Roger Bivand. 2023. Spatial Data Science: With Applications in R. 1st ed. New York: Chapman and Hall/CRC. https://doi.org/10.1201/9780429459016.
Wood, Simon N. 2017. Generalized Additive Models: An Introduction with R, Second Edition. 2nd ed. Boca Raton: Chapman and Hall/CRC. https://doi.org/10.1201/9781315370279.
Footnotes
Artificial data generated for the purposes of this tutorial. Not to be used for research purposes!↩︎