install.packages("tidyverse")
Introduction to R & RStudio
Brunei R User Group Meetup 🇧🇳
https://bruneir.github.io/brm-intro-to-r
Preliminaries
Welcome to the 3rd Brunei R User Group meetup!
"The RUGS mission is to facilitate the person-to-person exchange of knowledge in small group settings on a global scale." - R Consortium
"R" |>
rug("b", _, "unei")
About us
- A group of UBD-ians and R enthusiasts
- We want to create a community of R users in Brunei
- Champion the Open Source cause
Past events
- Analyzing Spatial Data using R [Workshop]
- R>aya Meetup Sharing Sessions
Expectations
This is a hands-on, live-coding, lecture-style "workshop". Expect to learn (or at the very least, see me do!)...
- What is R & RStudio? How can it ease the burden of repeated reporting?
- Basic functions for manipulating data
- Using R effectively
- More data manipulation
- Visualizing data
- A peek at advanced topics
1. Introduction to R & RStudio
When people consider switching to R, they usually think of it as a direct replacement for whatever tool they're currently using. While R can indeed replace software like Excel, SPSS, or Stata, it offers much more!
Suppose your current workflow looks like this:
- Data analysis in SPSS (or a similar tool)
- Data visualization in Excel
- Report writing in Word
Have you ever encountered an error in the first step and had to go back through all three steps to fix it? It's quite frustrating, isn't it?
R gets around this by combining data analysis, visualization, and reporting in one tool using R Markdown. Any time you realize you've made a mistake, you just rerun your code and you get a new report. Think of the time it can save you!
This is one reason why R is a popular programming language, especially in fields such as data science, academic research, and statistics.
RStudio is an integrated development environment (IDE) for R. Writing code is easier in its editor than in the plain R console.
Here are several reasons why you should use R:
- R is widely used among statisticians, especially academic statisticians.
- If there is a new statistical procedure developed somewhere in academia, chances are that the code for it will be made available in R. This distinguishes R from, say, Python.
- R is commonly used for statistical analyses in many disciplines.
- Other software, such as SPSS or SAS, is also used, and in some disciplines it is the primary choice for discipline-specific courses, but R is popular and its user base is growing.
- R is free.
- You can install it and all optional packages on your computer at no cost. This is a big difference between R and SAS, SPSS, MATLAB, and most other statistical software.
- R has a vibrant and growing community.
- With the advent of the tidyverse and RStudio, the R community has grown quickly. We have also found the community to be extremely welcoming. The R ecosystem is one of its strengths.
In this workshop, we will dive into the fundamentals of R and RStudio.
2. Getting Started with R & RStudio
- Installing the R language
- Installing RStudio
- Exploring the RStudio interface
- Packages & the help() function
Follow the guide on the Brunei R website: under Blog, see "How to Install R and RStudio".
Once installed, launch RStudio; this is probably what you'll see.
Notice the default panes:
- Console (entire left)
- Environment/History (tabbed in upper right)
- Files/Plots/Packages/Help (tabbed in lower right)
Packages
Everything that is done in R is done by functions. Commonly used functions are grouped into packages, and installing additional packages expands the functionality of R.
Packages are bundles of code that add new functions to R.
Base packages are installed with R; the core ones (such as stats and graphics) are loaded automatically, while others (such as parallel) must be loaded with library().
Contributed packages need to be downloaded, installed, and loaded separately.
To install a package, say tidyverse, for the first time, type
install.packages("tidyverse")
To load the package, type the package name without quotation marks
library(tidyverse)
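In a script that is run many times, reinstalling a package on every run is unnecessary. The sketch below shows one possible pattern. The helper name ensure_package() is hypothetical (it is not part of R or of the workshop material), and it is demonstrated on stats, which ships with R, so nothing is downloaded.

```r
# Hypothetical helper: install a package only if it is not already available.
ensure_package <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)  # runs only when the package is missing
  }
  # Report (invisibly) whether the package can now be loaded
  invisible(requireNamespace(pkg, quietly = TRUE))
}

# "stats" ships with R, so this call does not trigger an installation
ensure_package("stats")
library(stats)
```

Replacing "stats" with "tidyverse" would install the tidyverse on first use and skip the installation afterwards.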
Recommended Packages
For data manipulation:
tidyverse
- An opinionated collection of R packages designed for data science that share an underlying design philosophy, grammar, and data structures. The collection includes ggplot2, plus many more packages for data import, tidying, and visualization.
For data visualization:
ggplot2
- R's famous package for making beautiful graphics. ggplot2 lets you use the grammar of graphics to build layered, customizable plots.
For reproducible reporting:
R Markdown
- The perfect workflow for reproducible reporting. Write R code in your markdown reports. When you run render, R Markdown replaces the code with its results and then exports your report as an HTML, PDF, or MS Word document, or as an HTML or PDF slideshow. The result? Automated reporting. R Markdown is integrated straight into RStudio.
and many more...
Getting help
To get information on any function, use R's built-in help() function. For example, to open the help page for our friend the mean() function:
help("mean")
or you can use the equivalent shortcut.
?mean
After you run the code, the help page is displayed in the "Help" tab of the Files pane (bottom right of RStudio).
If you do not know which package a function belongs to, use the ?? operator, which searches the help pages of all installed packages.
??survfit
or you can always google it!
3. Basics of R Programming
- Data types in R
- Variables and assignment
- Basic arithmetic operations
- Working with vectors and basic vector operations
Important basics:
- R is case sensitive, i.e. A is not the same as a, and anova is not the same as Anova.
- Anything that follows a # symbol is interpreted as a comment and ignored by R. Comments should be used liberally throughout your code, both for your own information and to help your collaborators.
- In general, R is fairly tolerant of extra spaces inserted into your code; in fact, using spaces is actively encouraged. However, spaces should not be inserted into operators, i.e. <- should not read < - (note the space).
Data types
There are six basic data types in R: numeric, integer, logical, character, complex, and raw. This workshop does not cover complex and raw, as they are not widely used.
- Logical
- Logical data take on the value of either TRUE or FALSE. There's also a special logical value, NA, which represents missing values.
x <- TRUE
x
[1] TRUE
y <- FALSE
y
[1] FALSE
z <- NA
z
[1] NA
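Because NA stands for a missing value, most calculations that involve it also return NA. The sketch below, using made-up scores rather than workshop data, shows how to detect missing values with is.na() and how the na.rm argument of mean() skips them.

```r
z <- NA
is.na(z)                      # TRUE: z is missing

scores <- c(80, NA, 90)       # illustrative values, not workshop data
is.na(scores)                 # FALSE  TRUE FALSE
mean(scores)                  # NA, because one value is missing
mean(scores, na.rm = TRUE)    # 85, computed from the non-missing values
```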
Logical Operators
Logical operators are used to combine conditional statements:
Operator | Operation | Vectorized? | Description
---|---|---|---
x \| y | or | Yes | Element-wise logical OR. Returns TRUE if at least one of the elements is TRUE.
x & y | and | Yes | Element-wise logical AND. Returns TRUE if both elements are TRUE.
!x | not | Yes | Logical NOT. Returns FALSE if the element is TRUE, and TRUE if it is FALSE.
x \|\| y | or | No | Logical OR for single values. Returns TRUE if at least one of the statements is TRUE.
x && y | and | No | Logical AND for single values. Returns TRUE if both statements are TRUE.
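The difference between the vectorized and non-vectorized operators in the table is easiest to see side by side. This is a small sketch with made-up values: & and | work element by element on whole vectors, while && and || combine single TRUE/FALSE values, which is what an if statement needs.

```r
a <- c(TRUE, TRUE, FALSE)
b <- c(TRUE, FALSE, FALSE)

and_result <- a & b      # TRUE FALSE FALSE
or_result  <- a | b      # TRUE  TRUE FALSE
not_result <- !a         # FALSE FALSE TRUE

# && and || combine single values (illustrative numbers below)
age <- 30
service <- 10
is_senior <- age > 25 && service > 5   # TRUE
```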
Comparison Operators
Comparison operators are used to compare two values:
Operator | Comparison | Vectorized?
---|---|---
x < y | less than | Yes
x > y | greater than | Yes
x <= y | less than or equal to | Yes
x >= y | greater than or equal to | Yes
x != y | not equal to | Yes
x == y | equal to | Yes
x %in% y | is an element of | Yes
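Since every comparison operator in the table is vectorized, comparing a vector with a single value returns one TRUE or FALSE per element. The following sketch uses made-up ages and ranks to illustrate this, including %in%, which checks each element on the left against the set on the right.

```r
ages <- c(25, 30, 35)

older <- ages > 28   # FALSE  TRUE  TRUE
ages == 30           # FALSE  TRUE FALSE
ages != 30           #  TRUE FALSE  TRUE

# %in% asks, for each element on the left, whether it appears on the right
ranks <- c("Sergeant", "Major")
known_ranks <- c("Lieutenant", "Sergeant", "Captain")
in_known <- ranks %in% known_ranks   # TRUE FALSE
```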
- Numeric
- Numeric data are real numbers that can contain a decimal point. The default numeric type is known as "double", which stores floating-point values.
x <- 2.6
x
[1] 2.6
class(x)
[1] "numeric"
typeof(x)
[1] "double"
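One practical consequence of floating-point storage is that decimal arithmetic can carry tiny rounding errors, so exact comparison with == may surprise you. A short sketch of the usual workaround, comparing within a tolerance with all.equal():

```r
total <- 0.1 + 0.2
total == 0.3                    # FALSE: tiny rounding error in the stored value
isTRUE(all.equal(total, 0.3))   # TRUE: equal within a small tolerance
```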
- Integers
- Integers are whole numbers (numbers without a decimal point). They are written as a number followed by the letter L: 1L, 2L, 3L.
x <- 1L
x
[1] 1
class(x)
[1] "integer"
- Character
- Character data are used to represent string values. You can think of character strings as something like a word (or multiple words).
x <- "Hello, World"
class(x)
[1] "character"
is.character(x)
[1] TRUE
A factor is a special kind of variable built from character strings, with additional attributes such as levels or an order. For example, Low, Medium and High can be stored as a factor, in which case R records each level internally as an integer code. By default the levels are sorted alphabetically (High, Low, Medium), as the output shows.
perf <- c("Low", "Medium", "High")
factor(perf)
[1] Low    Medium High
Levels: High Low Medium
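If the categories have a natural order, the levels can be set explicitly so that the internal codes follow that order. This is a minimal sketch using the levels argument of factor(); the variable name perf_ordered is only for illustration.

```r
perf <- c("Low", "Medium", "High")

# Set the level order explicitly instead of relying on alphabetical sorting
perf_ordered <- factor(perf, levels = c("Low", "Medium", "High"))
levels(perf_ordered)       # "Low" "Medium" "High"
as.integer(perf_ordered)   # 1 2 3

# With the default (alphabetical) levels the codes differ
as.integer(factor(perf))   # 2 3 1
```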
Here's a summary table of some of the logical test and coercion functions available to you.
Type | Logical test | Coercing
---|---|---
Logical | is.logical | as.logical
Double | is.numeric | as.numeric
Integer | is.integer | as.integer
Character | is.character | as.character
Factor | is.factor | as.factor
Complex | is.complex | as.complex
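A few of the coercion functions from the table in action. The values are made up for illustration; the last example shows that a value which cannot be converted becomes NA (R also issues a warning, which is silenced here with suppressWarnings()).

```r
as.numeric("3.14")     # 3.14
as.integer(2.9)        # 2 (the decimal part is dropped, not rounded)
as.character(175.5)    # "175.5"
as.logical("TRUE")     # TRUE
as.numeric(TRUE)       # 1

# Values that cannot be converted become NA
bad <- suppressWarnings(as.numeric("abc"))
is.na(bad)             # TRUE
```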
Variables and Assignment
Variables in R are used to store data values. You can create a variable using the assignment operator <- or =.
first_name <- "Hafeezul"
height <- 175.5

first_name
[1] "Hafeezul"
height
[1] 175.5

last_name = "Raziq"

last_name
[1] "Raziq"
A single equals sign assigns a value to an object, whereas two equals signs test whether two values are equal.
x = 6   # This assigns the value 6 to x
x == 5  # This checks to see if x equals 5
[1] FALSE
Best practice: Use <- for assignment to avoid confusion with the equality operator ==.
Basic Arithmetic Operations
R supports basic arithmetic operations, which are similar to those in other programming languages.
- Addition
5 + 2
[1] 7
- Subtraction
10 - 2
[1] 8
- Multiplication/Product
7 * 5
[1] 35
- Division/Quotient
20 / 4
[1] 5
- Exponentiation
10 ^ 2
[1] 100
- Modulus
- Returns the remainder of the division.
10 %% 2
[1] 0
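The modulus is easier to see with a division that leaves a remainder. The sketch below also shows the related integer-division operator %/%, and a common use of %% for testing whether a number is even; the variable names are only for illustration.

```r
10 %% 3     # 1, the remainder of 10 divided by 3
10 %/% 3    # 3, the whole-number part of the division

# A common use: checking whether a number is even
n <- 7
is_even <- n %% 2 == 0   # FALSE
```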
Basic Vector Operations
Vectors can be combined using the concatenate c() function.
numbers <- c(1, 2, 3)
numbers
[1] 1 2 3
rbaf <- c("Land Force", "Navy", "Air Force")
rbaf
[1] "Land Force" "Navy"       "Air Force"
numbers <- c(1:10, 15:20)
numbers
 [1]  1  2  3  4  5  6  7  8  9 10 15 16 17 18 19 20
length(): Returns the number of elements in a vector.
length(rbaf)
[1] 3
length(numbers)
[1] 16
sum(): Returns the sum of all elements in a numeric vector.
sum(numbers)
[1] 160
mean(): Returns the average of the elements in a numeric vector.
mean(numbers)
[1] 10
Use square brackets [] to access elements by their index (starting from 1).
rbaf[2]
[1] "Navy"
rbaf[-2]
[1] "Land Force" "Air Force"
rbaf[1:2]
[1] "Land Force" "Navy"
numbers[11]
[1] 15
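Beyond indexing by position, arithmetic and comparisons apply to every element of a vector at once, and a logical vector can itself be used inside the square brackets. A brief sketch reusing the numbers vector; the names doubled, big and n_big are only for illustration.

```r
numbers <- c(1:10, 15:20)

doubled <- numbers * 2            # every element is multiplied by 2
big <- numbers[numbers > 10]      # keep only the elements greater than 10
n_big <- sum(numbers > 10)        # TRUE counts as 1, so this counts the matches

big     # 15 16 17 18 19 20
n_big   # 6
```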
10-MINUTE BREAK
4. Data structures in R
- Introduction to matrices, arrays, data frames and lists.
- Creating and manipulating data frames.
- Accessing elements in data structures.
R offers various data structures for storing and manipulating data. The most commonly used ones are vectors, matrices, arrays, data frames, and lists.
- Matrix
- A matrix is a two-dimensional array. Equivalently, it can be viewed as a stack of vectors of the same length.
To define a matrix from a vector, the syntax is matrix(vector, nrow, ncol, byrow). The byrow argument controls how the matrix is filled: TRUE fills it row by row, while FALSE (the default) fills it column by column.
Because a matrix has two dimensions, its size can be described in three ways:
- length(): total number of elements
- ncol(): total number of columns
- nrow(): total number of rows
z <- matrix(1:6, ncol = 2, byrow = TRUE)
z
     [,1] [,2]
[1,]    1    2
[2,]    3    4
[3,]    5    6
length(z)
[1] 6
ncol(z) # check number of columns
[1] 2
nrow(z) # check number of rows
[1] 3
The following code fills the matrix by column.
x <- matrix(1:20, nrow = 5, ncol = 4, byrow = FALSE)
x
     [,1] [,2] [,3] [,4]
[1,]    1    6   11   16
[2,]    2    7   12   17
[3,]    3    8   13   18
[4,]    4    9   14   19
[5,]    5   10   15   20
Extracting elements from a matrix is similar to extracting them from a vector, except that you give a row index and a column index separated by a comma.
x[2, ] # the second row
[1]  2  7 12 17
x[, 1] # the first column
[1] 1 2 3 4 5
x[1, 2] # first row, second column
[1] 6
mat <- matrix(c(85, 90, 88, 75, 80, 78, 95, 85, 89), nrow = 3, ncol = 3, byrow = TRUE)
colnames(mat) <- c("Physical", "Shooting", "Strategy")
rownames(mat) <- c("Hasbul", "Khalid", "Fitri")
mat
       Physical Shooting Strategy
Hasbul       85       90       88
Khalid       75       80       78
Fitri        95       85       89
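Once a matrix has row and column names, elements can be extracted by name as well as by position, and helper functions such as rowMeans() and colMeans() summarize across a dimension. A short sketch using the same score matrix:

```r
mat <- matrix(c(85, 90, 88, 75, 80, 78, 95, 85, 89), nrow = 3, ncol = 3, byrow = TRUE)
colnames(mat) <- c("Physical", "Shooting", "Strategy")
rownames(mat) <- c("Hasbul", "Khalid", "Fitri")

mat["Khalid", "Shooting"]   # 80
mat[, "Strategy"]           # named vector: 88 78 89
rowMeans(mat)               # average score per person
colMeans(mat)               # average score per test: 85 85 85
```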
- Array
An array behaves like a matrix but can have more than two dimensions. To define an array from a vector, the syntax is array(vector, c(nrow, ncol, nmatrix)).
x <- array(1:12, c(2, 3, 2))
x
, , 1

     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6

, , 2

     [,1] [,2] [,3]
[1,]    7    9   11
[2,]    8   10   12
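Indexing an array works like indexing a matrix, with one extra index per additional dimension. A minimal sketch using the same array:

```r
x <- array(1:12, c(2, 3, 2))

dim(x)        # 2 3 2
x[1, 2, 2]    # 9: row 1, column 2 of the second layer
x[, , 1]      # the first layer, a 2 x 3 matrix
```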
- Data frame
A data frame is the most useful data structure in R. It behaves like a matrix, but its columns can be of different types. That is, character and numeric columns can sit side by side, which is not possible in a matrix or array.
To visualize a data frame, think of a spreadsheet: each column is a vector and the whole sheet is a data frame, that is, a collection of columns of cells.
# RBAF Personnel Data
rbaf_df <- data.frame(
  name = c("Hasbul", "Khalid", "Fitri"),
  rank = c("Lieutenant", "Sergeant", "Captain"),
  age = c(25, 30, 35)
)
rbaf_df
    name       rank age
1 Hasbul Lieutenant  25
2 Khalid   Sergeant  30
3  Fitri    Captain  35
Let's add a column for years of service:
rbaf_df$service <- c(2, 10, 5) # adding a new column for years of service
rbaf_df
    name       rank age service
1 Hasbul Lieutenant  25       2
2 Khalid   Sergeant  30      10
3  Fitri    Captain  35       5
The $ operator is used to extract or subset a specific part of a data object in R.
rbaf_df$name
[1] "Hasbul" "Khalid" "Fitri"
Removing the "rank" column:
rbaf_df$rank <- NULL
rbaf_df
    name age service
1 Hasbul  25       2
2 Khalid  30      10
3  Fitri  35       5
Use rbind() to add rows, such as:
new_row <- data.frame(name = "Hafeezul", age = 28, service = 1)
rbaf_df <- rbind(rbaf_df, new_row)

rbaf_df
      name age service
1   Hasbul  25       2
2   Khalid  30      10
3    Fitri  35       5
4 Hafeezul  28       1
Remove rows by sub-setting:
rbaf_df <- rbaf_df[-2, ] # Removes the second row
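Rows can also be selected by a condition rather than by position, which is usually what you want with real data. The sketch below rebuilds a small version of the data frame so that it runs on its own; the cut-off of five years and the name experienced are only for illustration.

```r
rbaf_df <- data.frame(
  name = c("Hasbul", "Khalid", "Fitri"),
  age = c(25, 30, 35),
  service = c(2, 10, 5)
)

# Keep rows where the condition is TRUE; the empty slot after the comma keeps all columns
experienced <- rbaf_df[rbaf_df$service >= 5, ]
experienced$name          # "Khalid" "Fitri"

# Select specific columns by name
rbaf_df[, c("name", "age")]

str(rbaf_df)              # compact overview of the column types
```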
5. Operators, Functions and Control Structures
- Introduction to functions
- Introduction to control structures (if-else statements, loops).
- Example applications of control structures and loops.
Functions
Functions are defined by two components: the arguments (formals) and the code (body).
You can create your own functions using the function keyword.
square <- function(x) {
  return(x^2)
}
square(4)
[1] 16
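Arguments can be given default values, which are used whenever the caller leaves them out. As a small sketch, the hypothetical function power() below generalizes square(); the name and its exponent argument are made up for this example.

```r
# Hypothetical helper: raise x to a power, squaring by default
power <- function(x, exponent = 2) {
  x^exponent
}

power(4)                # 16, uses the default exponent
power(4, exponent = 3)  # 64
power(1:3)              # 1 4 9, because ^ works element-wise
```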
Control structures
Control structures are used to manage the flow of execution in R scripts.
if-else Statements
- Conditional execution based on a logical test
Here is a common example of an if-else statement.
x <- 3

if (x < 0) {
  "x is negative"
} else if (x > 0) {
  "x is positive"
} else {
  "x is zero"
}
[1] "x is positive"
Another example: determining whether a soldier is eligible for a promotion based on years of service.
service <- 6

if (service > 5) {
  "Eligible"
} else {
  "Not Eligible"
}
[1] "Eligible"
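An if statement tests a single condition. To apply the same rule to a whole vector at once, base R offers the vectorized ifelse() function. A short sketch with made-up years of service:

```r
service <- c(2, 10, 5, 6)   # illustrative years of service
status <- ifelse(service > 5, "Eligible", "Not Eligible")
status   # "Not Eligible" "Eligible" "Not Eligible" "Eligible"
```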
for Loops
A for loop is the simplest and most common type of loop in R: given a vector, it iterates through the elements and evaluates the code block for each one.
How to Understand for Loops:
Initialization: The loop starts by initializing a variable to the first element in the sequence.
Condition: The loop continues to run as long as there are elements left in the sequence.
Increment: After each iteration, the loop moves to the next element in the sequence.
Example 1: Let's calculate the sum of the first 10 natural numbers using a for loop. This example demonstrates how to use a loop to perform repetitive calculations.
# Initialize the running total (named total to avoid masking the built-in sum() function)
total <- 0

# Loop through numbers from 1 to 10
for (i in 1:10) {
  total <- total + i # Add the current number to the running total
}
# Print the result
print(paste("The sum of the first 10 natural numbers is:", total))
[1] "The sum of the first 10 natural numbers is: 55"
Example 2: Generate the multiplication table for a given number using a for loop.
# Define the number for the multiplication table
number <- 5

# Loop through numbers 1 to 10 to generate the multiplication table
for (i in 1:10) {
  result <- number * i
  print(paste(number, "x", i, "=", result))
}
[1] "5 x 1 = 5"
[1] "5 x 2 = 10"
[1] "5 x 3 = 15"
[1] "5 x 4 = 20"
[1] "5 x 5 = 25"
[1] "5 x 6 = 30"
[1] "5 x 7 = 35"
[1] "5 x 8 = 40"
[1] "5 x 9 = 45"
[1] "5 x 10 = 50"
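Loops often walk over the positions of an existing vector rather than a fixed range. A common idiom for this is seq_along(), sketched below with the rbaf vector; the labels vector is only for illustration. The last line is a reminder that many simple loops, such as the sum in Example 1, have a one-line vectorized equivalent.

```r
rbaf <- c("Land Force", "Navy", "Air Force")
labels <- character(length(rbaf))   # pre-allocate the result vector

# seq_along(rbaf) gives 1, 2, 3: one index per element
for (i in seq_along(rbaf)) {
  labels[i] <- paste0(i, ". ", rbaf[i])
}
labels   # "1. Land Force" "2. Navy" "3. Air Force"

# Many loops have a vectorized equivalent
sum(1:10)   # 55
```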
while Loops
A while loop repeatedly executes a block of code as long as a specified condition is TRUE, and stops as soon as the condition evaluates to FALSE. This type of loop is useful when you don't know in advance how many times you'll need to repeat the loop.
How to Understand while Loops:
Condition: Before each iteration, the loop checks if the condition is true.
Execution: If the condition is true, the loop executes the block of code.
Update: After each iteration, some part of the code updates the variables that the condition depends on.
Example 1:
# Initialize the countdown variable
countdown <- 10

# While the countdown is greater than zero
while (countdown > 0) {
  print(paste("Countdown:", countdown)) # Print the current countdown value
  countdown <- countdown - 1 # Decrease the countdown by 1
}
[1] "Countdown: 10"
[1] "Countdown: 9"
[1] "Countdown: 8"
[1] "Countdown: 7"
[1] "Countdown: 6"
[1] "Countdown: 5"
[1] "Countdown: 4"
[1] "Countdown: 3"
[1] "Countdown: 2"
[1] "Countdown: 1"
# Print a message when the countdown is complete
print("Countdown complete!")
[1] "Countdown complete!"
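The countdown has a known number of steps, so it could also be written as a for loop. A while loop is more natural when the number of repetitions is not obvious up front. As a small sketch with made-up numbers, the loop below keeps doubling a value until it exceeds 1000 and counts how many steps that takes.

```r
value <- 1
steps <- 0

while (value <= 1000) {
  value <- value * 2   # update the variable the condition depends on
  steps <- steps + 1
}

value   # 1024
steps   # 10
```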
Exercise:
Write a set of conditional(s) that satisfies the following requirements:
- If x is greater than 3 and y is less than or equal to 3, then print "Hello world!"
- Otherwise, if x is greater than 3, print "!dlrow olleH"
- If x is less than or equal to 3, then print "Something else..."
- Stop execution if x is odd and y is even and report an error; don't print any of the text strings above.
Test out your code by trying various values of x and y.
10-MINUTE BREAK
6. Data Import and Export
- Importing data from various file formats (CSV, Excel, etc.) into R.
- Exporting data from R to different file formats
R provides various functions for importing data from different file formats, making it easy to work with external data sources.
Importing CSV files
The read.csv() function is used to read CSV files.
titanic <- read.csv("path/to/titanic.csv")
Importing Excel files
The readxl package provides functions to read Excel files. Install the package (if not already installed) and load it.
install.packages("readxl")
library(readxl)
titanic <- read_excel("path/to/titanic.xlsx")
head(titanic) # Display the first few rows of the data
How to find the path to the data?
To find the path to the data, make sure the folder containing the data file can be seen in the Files panel in RStudio.
Exporting to CSV files
The write.csv() function is used to write data to CSV files.
write.csv(titanic, "path/to/exported_titanic.csv", row.names = FALSE)
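To try the import and export functions without the Titanic file, the round trip below writes a small made-up data frame to a temporary CSV file and reads it back. The data and the names people, csv_path and people_again are assumptions for this sketch only, and the output shown assumes R 4.0 or later, where text columns are read as character strings.

```r
# A small illustrative data frame (not the Titanic data)
people <- data.frame(name = c("Hasbul", "Khalid"), age = c(25, 30))

# tempfile() returns a path to a temporary file, so nothing is overwritten
csv_path <- tempfile(fileext = ".csv")

# row.names = FALSE stops R from writing the row numbers as an extra column
write.csv(people, csv_path, row.names = FALSE)

people_again <- read.csv(csv_path)
names(people_again)   # "name" "age"
people_again$name     # "Hasbul" "Khalid"
nrow(people_again)    # 2
```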
Exporting to Excel files
The writexl package provides functions to write data to Excel files. Install the package (if not already installed) and load it.
install.packages("writexl")
library(writexl)
write_xlsx(titanic, "path/to/exported_titanic.xlsx")
7. Data Visualization
- Creating basic plots (scatter plots, bar plots, histograms, etc.).
R provides powerful tools for data visualization, allowing you to create various types of plots to explore and present your data.
Let's first take a look at the basic plot function.
help(plot)
library(datasets) # Load built-in datasets
head(iris) # Show the first six lines of iris data
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
summary(iris) # Summary statistics for iris data
Sepal.Length Sepal.Width Petal.Length Petal.Width
Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
Median :5.800 Median :3.000 Median :4.350 Median :1.300
Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
Species
setosa :50
versicolor:50
virginica :50
plot(iris) # Scatterplot matrix for iris data
Now, using the Titanic data set, we can create several common types of plots.
- Scatter Plots
- Scatter plots are useful for visualizing the relationship between two variables. Let's plot the relationship between age and fare in the Titanic data set.
# Scatter plot of age vs. fare
plot(titanic$Age, titanic$Fare,
main = "Scatter Plot of Age vs. Fare",
xlab = "Age",
ylab = "Fare",
col = "green4",
pch = 1) # plotting symbols
- Bar Plots
- Bar plots are useful for comparing counts or values across categories. We will visualize the count of passengers in each class.
# Bar plot of passenger class
barplot(table(titanic$Pclass),
main = "Passenger Class Distribution",
xlab = "Class",
ylab = "Count",
col = "blue4")
- Histograms
- Histograms are useful for visualizing the distribution of a single numeric variable. Let's plot the distribution of ages in the Titanic data set.
# Histogram of passenger age
hist(titanic$Age,
main="Distribution of Ages on the Titanic",
xlab="Age",
ylab = "Frequency",
col="red3",
breaks=10)
- Box Plots
- Box plots are useful for visualizing the distribution and identifying outliers. Let's visualize the distribution of ages by passenger class.
# Box plot of age by passenger class
# (boxplot() has no na.rm argument; missing ages are dropped automatically)
boxplot(Age ~ Pclass,
data = titanic,
main = "Box Plot of Age by Passenger Class",
xlab = "Class",
ylab = "Age",
col = c("orange", "purple", "cyan"))
Plotting multiple graphs in 1 plot
We can put multiple graphs in a single plot by setting some graphical parameters with the help of the par() function.
# create a new plotting window and set the plotting area into a 1*2 array
par(mfrow = c(1, 2)) # c(row, column)
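As a sketch of how the layout setting is used in practice, the example below draws two plots side by side with the built-in iris data (used here so that the code runs without the Titanic file). Saving the previous settings in old_par and restoring them afterwards is a common habit rather than a requirement.

```r
# Remember the current settings so they can be restored afterwards
old_par <- par(mfrow = c(1, 2))   # 1 row, 2 columns

hist(iris$Sepal.Length,
     main = "Sepal Length",
     xlab = "Length (cm)")
boxplot(Sepal.Length ~ Species,
        data = iris,
        main = "Sepal Length by Species")

par(old_par)   # back to one plot per window
```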
What else can R do?
Let's take a peek at some of the advanced topics.
- Linear Regression
- Spatial Analysis
- Correlation between data
- Quantitative Text Analysis
References
R for the Rest of Us: https://book.rfortherestofus.com/
Technical Analysis with R (second edition): https://bookdown.org/kochiuyu/technical-analysis-with-r-second-edition2/
Probability, Statistics, and Data: A fresh approach using R: https://mathstat.slu.edu/~speegle/_book/
An Introduction to R: https://intro2r.com/
and many more...