Introduction to R & RStudio

Brunei R User Group Meetup šŸ‡§šŸ‡³

Author

Hafeezul Raziq

Published

June 1, 2024

https://bruneir.github.io/brm-intro-to-r

Preliminaries

Welcome to the 3rd Brunei R User Group meetup!


The RUGS mission is to facilitate the person-to-person exchange of knowledge in small group settings on a global scale. ā€”R Consortium

"R" |> 
  rug("b", _, "unei")

About us

  • A group of UBD-ians and R enthusiasts
  • We want to create a community of R users in Brunei
  • Champion the Open Source cause

Past events

  • Analyzing Spatial Data using R [Workshop]
  • R>aya Meetup Sharing Sessions

Expectations

Todayā€™s Plan

This is a hands-on, live-coding, lecture-style ā€œworkshopā€. Expect to learn (or at the very least, see me do!)ā€¦

  1. What is R & RStudio? How can it ease the burden of repeated reporting?
  2. Basic functions for manipulating data
  3. Using R effectively
  4. More data manipulation
  5. Visualizing data
  6. A peek at advanced topics

1. Introduction to R & RStudio

When people consider switching to R, they usually think about it as a direct replacement for whatever tool theyā€™re currently using. While R can indeed replace software like Excel, SPSS, or Stata, it offers much more!

Suppose a current workflow that looks like this,

  1. Data analysis in SPSS (or a similar tool)
  2. Data visualization in Excel
  3. Report writing in Word

Have you ever encountered an error in the first step and had to go back through all three steps to fix it? Itā€™s quite frustrating, isnā€™t it?

R can get around this by combining data analysis, visualization, and reporting in one tool using RMarkdown. Any time you realize youā€™ve made a mistake, you just rerun your code and you get a new report. Think of the time it can save you!

Hence, R is a popular programming language, especially in certain fields such as data science, academic research, and statistics.

RStudio is an integrated development editor (IDE) for R. It is easier to write code using the editor.

Here are several reasons why you should use R:

  1. R is widely used among statisticians, especially academic statisticians.
    • If there is a new statistical procedure developed somewhere in academia, chances are that the code for it will be made available in R. This distinguishes R from, say, Python.
  2. R is commonly used for statistical analyses in many disciplines.
    • Other software, such as SPSS or SAS is also used and in some disciplines would be the primary choice for some discipline specific courses, but R is popular and its user base is growing.
  3. R is free.
    • You can install it and all optional packages on your computer at no cost. This is a big difference between R and SAS, SPSS, MATLAB, and most other statistical software.
  4. R is has a vibrant and growing community.
    • With the advent of the tidyverse and RStudio, R is a vibrant and growing community. We also have found the community to be extremely welcoming. The R ecosystem is one of its strengths.

In this workshop, we will dive into the fundamentals of R and RStudio.

2. Getting Started with R & RStudio

What weā€™ll learn
  • Installing the R Language.
  • Installing the RStudio.
  • Exploring the RStudio Interface
  • Packages & help() function

Follow the guidelines in Brunei R website - Under blog, ā€œHow to Install R and Rstudioā€

Once installed, launch RStudio and this is probably what youā€™ll see.

Notice the default panes:

  • Console (entire left)
  • Environment/History (tabbed in upper right)
  • Files/Plots/Packages/Help (tabbed in lower right)

Packages

Everything which is done in R is done by functions. Commonly used functions are grouped in packages. Installing different packages expand the functionality of R.

Packages are bundles of code that add new functions to R.

  • Base packages are installed with R but not loaded by default.

  • Contributed packages need to be downloaded, installed & loaded separately.

To install a package, say tidyverse, for the first time, type

install.packages("tidyverse")

To load the package, type the package name without quotation

library(tidyverse)

Recommended Packages

For data manipulation:

  • tidyverse - An opinionated collection of R packages designed for data science that share an underlying design philosophy, grammar, and data structures. This collection includes all the packages in this section, plus many more for data import, tidying, and visualization listed here.

For data visualization:

  • ggplot2 - Rā€™s famous package for making beautiful graphics. ggplot2 lets you use the grammar of graphics to build layered, customizable plots.

For reproducible reporting:

  • R Markdown - The perfect workflow for reproducible reporting. Write R code in your markdown reports. When you run render, R Markdown will replace the code with its results and then export your report as an HTML, pdf, or MS Word document, or a HTML or pdf slideshow. The result? Automated reporting. R Markdown is integrated straight into RStudio.

and many moreā€¦

Getting help

To access Rā€™s built-in help facility to get information on any function simply use the help() function. For example, to open the help page for our friend the mean() function.

help("mean")

or you can use the equivalent shortcut.

?mean

After you run the code, the help page is displayed in the ā€˜Helpā€™ tab in the Files pane (bottom right of RStudio).

To find what package a function belongs to, use the ?? operator.

??survfit

or you can always google it!

3. Basics of R Programming

What weā€™ll learn
  • Data types in R
  • Variables and assignment
  • Basic arithmetic operations
  • Working with vectors and basic vector operations

Important basics:

  • R is case sensitive i.e. A is not the same as a and anova is not the same as Anova.

  • Anything that follows a # symbol is interpreted as a comment and ignored by R. Comments should be used liberally throughout your code for both your own information and also to help your collaborators

  • In general, R is fairly tolerant of extra spaces inserted into your code, in fact using spaces is actively encouraged. However, spaces should not be inserted into operators i.e. <- should not read < - (note the space).

Data types

There are 6 basic types of data in R; numeric, integer, logical, character, complex and raw. However, in this workshop we will not be covering complex and raw as it is usually not widely used.

  1. Logical
    • Logical data take on the value of either TRUE or FALSE. Thereā€™s also another special type of logical called NA to represent missing values.
x <- TRUE
x
[1] TRUE
y <- FALSE
y
[1] FALSE
z <- NA
z
[1] NA

Logical Operators

Logical operators are used to combine conditional statements:

Operator Operation Vectorized?
x|y or Yes Element-wise Logical OR operator. It returns TRUE if one of the statement is TRUE
x & y and Yes Element-wise Logical AND operator. It returns TRUE if both elements are TRUE
!x not Yes Logical NOT - returns FALSE if statement is TRUE
x || y or No Logical OR operator. It returns TRUE if one of the statement is TRUE.
x && y and No Logical AND operator - Returns TRUE if both statements are TRUE

Comparison Operators

Comparison operators are used to compare two values:

Operator Comparison Vectorized?
x<y less than Yes
x>y greater than Yes
x <= y less than or equals to Yes
x >= y greater than or equals to Yes
x != y not equals to Yes
x == y equals to Yes
x %in% y contains Yes
  1. Numeric
    • Numeric data are real numbers that contain a decimal. The default numerical type are known as ā€œdoubleā€, which are floating point values.
x <- 2.6
x
[1] 2.6
class(x)
[1] "numeric"
typeof(x)
[1] "double"
  1. Integers
    • Integers are whole numbers (those numbers without a decimal point). It is represented by number and letter L: 1L, 2L, 3L.
x <- 1L
x
[1] 1
class(x)
[1] "integer"
  1. Character
    • Character data are used to represent string values. You can think of character strings as something like a word (or multiple words).
    x <- "Hello, World"
    class(x)
[1] "character"
    is.character(x)
[1] TRUE

A special type of character string is a Factor, which is a string but with additional attributes (like levels or an order). For example, Low, Medium and High which are denoted as factors where the computer record them as by 1, 2 and 3 respectively.

perf <- c("Low", "Medium", "High")
factor(perf)
[1] Low    Medium High  
Levels: High Low Medium

Hereā€™s a summary table of some of the logical test and coercion functions available to you.

Type Logical test Coercing
Logical is.logical as.logical
Double is.numeric as.numeric
Integer is.integer as.integer
Character is.character as.character
Factor is.factor as.factor
Complex is.complex as.complex

Variables and Assignment

Variables in R are used to store data values. You can create a variable using the assignment operator <- or =.

first_name <- "Hafeezul"
height <- 175.5

first_name
[1] "Hafeezul"
height
[1] 175.5
last_name = "Raziq"

last_name
[1] "Raziq"

If you use just one equal sign, R will assign a value to an object. However, TWO equal signs would give a different function.

x = 6   # This assigns the value 6 to x
x == 5  # This checks to see if x equals 5
[1] FALSE
Note

Best practice: Use <- for assignment to avoid confusion with the equality operator ==.

Basic Arithmetic Operations

R supports basic arithmetic operations, which are similar to those in other programming languages.

  1. Addition
5 + 2
[1] 7
  1. Subtraction
10 - 2
[1] 8
  1. Multiplication/Product
7 * 5
[1] 35
  1. Division/Quotient
20 / 4
[1] 5
  1. Exponential
10 ^ 2
[1] 100
  1. Modulus
  • Returns the remainder of the division.
10 %% 2
[1] 0

Basic Vector Operations

Vectors can be combined using the concatenate c() function.

numbers <- c(1,2,3)
numbers
[1] 1 2 3
rbaf <- c("Land Force", "Navy", "Air Force")
rbaf
[1] "Land Force" "Navy"       "Air Force" 
numbers <- c(1:10, 15:20)
numbers
 [1]  1  2  3  4  5  6  7  8  9 10 15 16 17 18 19 20

length(): Returns the number of elements in a vector.

length(rbaf)
[1] 3
length(numbers)
[1] 16

sum(): Returns the sum of all elements in a numeric vector.

sum(numbers)
[1] 160

mean(): Returns the average of the elements in a numeric vector.

mean(numbers)
[1] 10

Use square brackets [] to access elements by their index (starting from 1).

rbaf[2]
[1] "Navy"
rbaf[-2]
[1] "Land Force" "Air Force" 
rbaf[1:2]
[1] "Land Force" "Navy"      
numbers[11]
[1] 15

10-MINUTE BREAK

4. Data structures in R

What weā€™ll learn
  • Introduction to matrices, arrays, data frames and list.
  • Creating and manipulating data frames.
  • Accessing elements in data structures.

R offers various data structures for storing and manipulating data. The most commonly used ones are vector, matrices, arrays, data frames and list.

  1. Matrix
    • Matrix is a two-dimensional array. Alternatively, it is stacking multiple vectors of the same length.

To define a matrix from a vector, the syntax is matrix(vector, nrow, ncol, byrow). byrow is the way we fill the array. It is either TRUE or FALSE.

Size of matrix is rather complicated since it has two dimensions. There are three basics operations:

  • length(): total number of elements

  • ncol(): total number of columns

  • nrow(): total number of rows

z <- matrix(1:6, ncol = 2, byrow = TRUE)
z
     [,1] [,2]
[1,]    1    2
[2,]    3    4
[3,]    5    6
length(z)
[1] 6
ncol(z) # check number of columns
[1] 2
nrow(z) # check number of rows
[1] 3

The following code fills the matrix by column.

x <- matrix(1:20, nrow=5, ncol=4, byrow=FALSE)
x
     [,1] [,2] [,3] [,4]
[1,]    1    6   11   16
[2,]    2    7   12   17
[3,]    3    8   13   18
[4,]    4    9   14   19
[5,]    5   10   15   20

Extracting elements from matrix is similar to extraction in vector.

x[2,] #the second row
[1]  2  7 12 17
x[,1] #the first column
[1] 1 2 3 4 5
x[1,2] #first row, second column
[1] 6
mat <- matrix(c(85, 90, 88, 75, 80, 78, 95, 85, 89), nrow = 3, ncol = 3, byrow = TRUE)
colnames(mat) <- c("Physical", "Shooting", "Strategy")
rownames(mat) <- c("Hasbul", "Khalid", "Fitri")

mat
       Physical Shooting Strategy
Hasbul       85       90       88
Khalid       75       80       78
Fitri        95       85       89
  1. Array

Array behaves like matrix but it is multi-dimensional (more than 2). To define array from vector, the syntax is array(vector/input, c(nrow, ncol, nmatrix))

x<- array(1:12, c(2,3,2))
x
, , 1

     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6

, , 2

     [,1] [,2] [,3]
[1,]    7    9   11
[2,]    8   10   12
  1. Data frame
    • Data frame is most useful form of data type in R. It behaves like matrix but can contain vectors of different types. That is we can have vectors of characters and numeric together, which is not feasible under matrix or array.

      To visualize a data frame, one may consider a spreadsheet: Each column is a vector and each spreadsheet is a data frame ā€“ it is a collection of columns of cells.

# RBAF Personnel Data
rbaf_df <- data.frame(
  name = c("Hasbul", "Khalid", "Fitri"),
  rank = c("Lieutenant", "Sergeant", "Captain"),
  age = c(25, 30, 35)
)
rbaf_df
    name       rank age
1 Hasbul Lieutenant  25
2 Khalid   Sergeant  30
3  Fitri    Captain  35

```

Letā€™s add a column for years of service:

rbaf_df$service <- c(2, 10, 5) # adding a new column for years of service
rbaf_df
    name       rank age service
1 Hasbul Lieutenant  25       2
2 Khalid   Sergeant  30      10
3  Fitri    Captain  35       5

The $ operator is used to extract or subset a specific part of a data object in R.

rbaf_df$name
[1] "Hasbul" "Khalid" "Fitri" 

Removing the ā€˜Rankā€™ column:

rbaf_df$rank <- NULL
rbaf_df
    name age service
1 Hasbul  25       2
2 Khalid  30      10
3  Fitri  35       5

Use rbind() to add rows, such as:

new_row <- data.frame(name = "Hafeezul", age = 28, service = 1)
rbaf_df <- rbind(rbaf_df, new_row)

rbaf_df
      name age service
1   Hasbul  25       2
2   Khalid  30      10
3    Fitri  35       5
4 Hafeezul  28       1

Remove rows by sub-setting:

rbaf_df <- rbaf_df[-2, ]  # Removes the second row

5. Operators, Functions and Control Structures

What weā€™ll learn
  • Introduction to functions
  • Introduction to control structures (if-else statements, loops).
  • Example applications of control structures and loops.

Functions

Functions are defined by two components: the arguments (formals) and the code (body).

You can create your own functions using the function keyword.

square <- function(x) {
  return(x^2)
}

square(4)
[1] 16

Control structures

Control structures are used to manage the flow of execution in R scripts.

  1. if-else Statements
  • Conditional execution based on a logical test

Here is a common example for if-else statement.

x <- 3

if(x < 0){
  "x is negative"
} else if (x > 0) {
  "x is positive"
} else {
  "x is zero"
}
[1] "x is positive"

Another example, determining if a soldier is eligible for a promotion based on years of service.

service <- 6

if (service > 5) {
  "Eligible"
} else {
  "Not Eligible"
}
[1] "Eligible"
  1. for Loops

A for loop is the simplest, and most common type of loop in Rā€“given a vector iterate through the elements and evaluate the code block for each.

How to Understand For Loops:

  1. Initialization: The loop starts by initializing a variable to the first element in the sequence.

  2. Condition: The loop continues to run as long as there are elements left in the sequence.

  3. Increment: After each iteration, the loop moves to the next element in the sequence.

Example 1: Letā€™s calculate the sum of the first 10 natural numbers using a for loop. This example demonstrates how to use a loop to perform repetitive calculations.

# Initialize the sum variable
sum <- 0

# Loop through numbers from 1 to 10
for (i in 1:10) {
  sum <- sum + i  # Add the current number to the sum
}

# Print the result
print(paste("The sum of the first 10 natural numbers is:", sum))
[1] "The sum of the first 10 natural numbers is: 55"

Example 2: Generate the multiplication table for a given number using a for loop.

# Define the number for the multiplication table
number <- 5

# Loop through numbers 1 to 10 to generate the multiplication table
for (i in 1:10) {
    result <- number * i
    print(paste(number, "x", i, "=", result))
}
[1] "5 x 1 = 5"
[1] "5 x 2 = 10"
[1] "5 x 3 = 15"
[1] "5 x 4 = 20"
[1] "5 x 5 = 25"
[1] "5 x 6 = 30"
[1] "5 x 7 = 35"
[1] "5 x 8 = 40"
[1] "5 x 9 = 45"
[1] "5 x 10 = 50"
  1. while loops

A while loop repeatedly executes a block of code as long as a specified condition is TRUE (i.e. evaluates to FALSE). This type of loop is useful when you donā€™t know in advance how many times youā€™ll need to repeat the loop.

How to Understand while Loops:

  1. Condition: Before each iteration, the loop checks if the condition is true.

  2. Execution: If the condition is true, the loop executes the block of code.

  3. Update: After each iteration, some part of the code updates the condition.

Example 1:

# Initialize the countdown variable
countdown <- 10

# While the countdown is greater than zero
while (countdown > 0) {
    print(paste("Countdown:", countdown))  # Print the current countdown value
    countdown <- countdown - 1  # Decrease the countdown by 1
}
[1] "Countdown: 10"
[1] "Countdown: 9"
[1] "Countdown: 8"
[1] "Countdown: 7"
[1] "Countdown: 6"
[1] "Countdown: 5"
[1] "Countdown: 4"
[1] "Countdown: 3"
[1] "Countdown: 2"
[1] "Countdown: 1"
# Print a message when the countdown is complete
print("Countdown complete!")
[1] "Countdown complete!"

Exercise:

Write a set of conditional(s) that satisfies the following requirements,

  • If x is greater than 3 and y is less than or equal to 3 then print ā€œHello world!ā€
  • Otherwise if x is greater than 3 print ā€œ!dlrow olleHā€
  • If is x is less than or equal to 3 then print ā€œSomething elseā€¦ā€
  • Stop execution if x is odd and y is even and report an error, donā€™t print any of the text strings above.

Test out your code by trying various values of x and y.

10-MINUTE BREAK

6. Data Import and Export

What weā€™ll learn
  • Importing data from various file formats (CSV, Excel, etc.) into R.
  • Exporting data from R to different file formats

R provides various functions for importing data from different file formats, making it easy to work with external data sources.

Importing CSV files

The read.csv() function is used to read CSV files.

titanic <- read.csv("path/to/titanic.csv")

Importing Excel files

The readxl package provides functions to read Excel files. Install the package (if not already installed) and load it.

install.packages("readxl")
library(readxl)

titanic <- read_excel("path/to/titanic.xlsx")
head(titanic)  # Display the first few rows of the data

How to find the path to the data?

To find the path to load the data, make sure the file the data is in, can be seen on the Files panel in RStudio

Exporting to CSV files

The write.csv() function is used to write data to CSV files.

write.csv(titanic, "path/to/exported_titanic.csv", row.names = FALSE)

Exporting to Excel files

The writexl package provides functions to write data to Excel files. Install the package (if not already installed) and load it.

install.packages("writexl")
library(writexl)
write_xlsx(titanic, "path/to/exported_titanic.xlsx")

7. Data Visualization

What weā€™ll learn
  • Creating basic plots (scatter plots, bar plots, histograms, etc.).

R provides powerful tools for data visualization, allowing you to create various types of plots to explore and present your data.

Lets first take a look at the basic plot function.

help(plot)
library(datasets)  # Load built-in datasets

head(iris)         # Show the first six lines of iris data
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa
summary(iris)      # Summary statistics for iris data
  Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
 Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
 1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
 Median :5.800   Median :3.000   Median :4.350   Median :1.300  
 Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
 3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
 Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
       Species  
 setosa    :50  
 versicolor:50  
 virginica :50  
                
                
                
plot(iris)         # Scatterplot matrix for iris data

Now, lets use the Titanic data set, we can create several common types of plots.

  1. Scatter Plots

Scatter plots are useful for visualizing the relationship between two variables. Lets plot the relationship between age and fare in the Titanic data set.

# Scatter plot of age vs. fare
plot(titanic$Age, titanic$Fare,
     main = "Scatter Plot of Age vs. Fare",
     xlab = "Age",
     ylab = "Fare",
     col = "green4",
     pch = 1) # plotting symbols

  1. Bar Plots
    • Bar plots are useful for comparing. We will visualize the count of passengers in each class.
# Bar plot of passenger class
barplot(table(titanic$Pclass),
        main = "Passenger Class Distribution",
        xlab = "Class",
        ylab = "Count",
        col = "blue4")

  1. Histograms
    • Histograms are useful for visualizing the distribution of a single numeric variable. Lets plot the the distribution of ages in the Titanic data set.
# Histogram of passenger age
hist(titanic$Age, 
     main="Distribution of Ages on the Titanic", 
     xlab="Age",
     ylab = "Frequency",
     col="red3",
     breaks=10)

  1. Box Plots
    • Box plots are useful for visualizing the distribution and identifying outliers. Lets visualize the distribution of agesc by passenger class.
# Box plot of age by passenger class
boxplot(Age ~ Pclass,
        data = titanic,
        main = "Box Plot of Age by Passenger Class",
        xlab = "Class",
        ylab = "Age",
        col = c("orange", "purple", "cyan"),
        na.rm = TRUE)  # Remove NA values

Plotting multiple graphs in 1 plot

We can put multiple graphs in a single plot by setting some graphical parameters with the help of par() function.

# create a new plotting window and set the plotting area into a 1*2 array
par(mfrow = c(1, 2)) # c(row, column)

What else can R do?

Lets take a peek at some of the advanced topics.

  1. Linear Regression

  2. Spatial Analysis

  1. Correlation between data

  1. Quantitative Text Analysis

References

R for the Rest of Us: https://book.rfortherestofus.com/

Techincal Analysis with R (second edition): https://bookdown.org/kochiuyu/technical-analysis-with-r-second-edition2/

Probability, Statistics, and Data: A fresh approach using R: https://mathstat.slu.edu/~speegle/_book/

An Introduction to R: https://intro2r.com/

and many moreā€¦