Skip to the content.

Basic Programming

Last Updated: 14, October, 2024 at 14:50

Basics of programming: variables and functions

Programming is basically (1) storing data, (2) performing operations on this data.

We will store data in so-called variables. We use functions to perform operations on the data. We will also learn about flow control which allows us to execute code depending on conditions or to repeat code. Finally, objects combine data and functions.

Basic operations: R as a calculator

R can perform the classic operations.

1 / 200 * 30
## [1] 0.15
(59 + 73 + 2) / 3
## [1] 44.66667
sin(pi / 2)
## [1] 1

Logical operations

5 > 6
## [1] FALSE
5 + 1 == 6 #NOTE: I am using == to check equality!
## [1] TRUE
1234 != 1234
## [1] FALSE

Storing data in variables

Assigning data to a variable

R use <- to make assignments. This is a pain to type. You could use = but it will cause confusion later on.

my_variable <- 5

Variables (also called values) come in many types (or classes). The very basic ones are the following:

my_logical <- TRUE
my_character <- 'this is just a piece of text'
my_numeric <- 1.23455

These are very simple data types. We will often used much more complex ones when working with actual data.

Name <- c("Jon", "Bill", "Maria", "Ben", "Tina")
Age <- c(23, 41, 32, 58, 26)
my_data_frame <- data.frame(Name, Age)

R studio shows values and data separately in the Environment window. However, this is just a visualization used by R studio. You can use this window to inspect variables!

Note on naming variables

Try to use descriptive names for variables. And try to stick to a naming convention that works for you - preferably one that makes your code easy to read.

i_like_snake_case <- 'snake_case'
otherPeopleUseCamelCase <- 'CamelCase'
some.people.use.periods <- 'periods.are.allowed'
And_aFew.People_DONTLIKEconventions <- 'Madness, Madness, I tell you!'

From R for Data Science:

There’s an implied contract between you and R: it will do the tedious computation for you, but in return, you must be completely precise in your instructions. Typos matter. Case matters.

Also, it is important that you use names that are not keywords or functions in R. For example, the following is a bad idea:

#length <- 15 ## THIS IS A BAD IDEA

Trick 1: Using the up and down keys

You can use the up and down keys to navigate through the history of commands you’ve entered. This is a very useful feature.

The vector

R is another basic variable in R. It’s the simplest type of variable that actually allows you to store something recognizable as ‘data’. We will spend some time on vectors as they are a good place to start to work with relatively simple data. Also, understanding how to work with vectors makes working with more complex data easier. Much of the operations you can do on vectors, which are 1D, can also be done on 2D data frames.

Creating a vector manually

a_vector <- c(1, 5, 4, 9, 0) # Technically an atomic vector
another_one <- c(1, 5.4, TRUE, "hello") # Technically, this is a list

Creating a vector using the : operator

x <- 1:7
y <- 2:-2

Using seq to make a vector

step_size <- seq(1, 10, by=0.25)
length_specified <- seq(1, 10, length.out = 20)

Indexing vectors

Every item in a vector has an index. Vector indices in R start from 1, unlike most programming languages where index start from 0.

my_longer_vector <- c(1, 2, 'three', '4', 'V', 6, 7, 8)

You can use the [] to select (multiple) elements from a vector.

my_single_element <- my_longer_vector[5]
the_start <- my_longer_vector[1:3]
my_part_of_vector <- my_longer_vector[c(1, 2, 5)] # I'm using a vector to select parts of a vector. Life is funny.

You can also use [] to overwrite a part of a vector

my_longer_vector[1:3] <- c('replace', 'this', 'now')

Logical vectors

vector1 <- c(1,5,6,7,2,3,5,4,6,8,1,9,0,1)
binary_vector <- vector1 > 5
binary_vector
##  [1] FALSE FALSE  TRUE  TRUE FALSE FALSE FALSE FALSE  TRUE  TRUE FALSE  TRUE
## [13] FALSE FALSE
some_other_vector <- seq(from = 0, to = 100, length.out = length(vector1))
selected <- some_other_vector[binary_vector]
selected
## [1] 15.38462 23.07692 61.53846 69.23077 84.61538

Another trick

Before we go on, I want to share a simple trick. Using an IDE like Rstudio makes life easier (or at least it should). One of the benefits of the IDE is tab-completion.

[DEMO GOES HERE]

Functions

Now we know how to store data, we can start manipulating the data using functions.

Functions take 0 or more inputs (also called arguments), perform some operation (i.e., the function of the function), and return some output. This output can be complex and consist of multiple parts. This are generic ways in which functions are used:

output <- function_name(arg1 = val1, arg2 = val2, ...)
output <- function_name(val1, val2, ...)

We’ve already encountered a function:

output<-seq(from = 1, to= 123, by = 0.123)

How do we know which arguments a function can take? Using the help:

?seq

Some very simple functions that might be useful.

a <- max(output)
b <- mean(output)
c <- min(output)
d <- ceiling(output)
e <- sd(output)

Here is a function which returns more complex data. At this point, you’re not supposed to know what this function does (it fits a regression line). The point is that it returns complex data with multiple fields.

x <- runif(100)
y <- 10 + 5 * x + rnorm(100)
result <- lm(y ~ x)
print(result)
## 
## Call:
## lm(formula = y ~ x)
## 
## Coefficients:
## (Intercept)            x  
##      10.126        4.731

Flow control in R

You could write all R scripts as a serial statements of functions. However, to fully exploit the power of programming, you would need to learn about flow control. Flow control refers to (1) executing bits of code depending on a condition, and (2) iteratively executing pieces of code.

This is another introduction to flow control.

This is a short script which does one thing after another.

data <- read.csv('data/wages1833.csv')
data$average <- ((data$mnum * data$mwage) + (data$fnum * data$fwage)) / (data$mnum + data$fnum)
model <- lm(data$average ~ data$age)
result <- summary(model)
plot(data$age, data$average )

Overview

Uses of the different flow commands
Keyword Use Example 1 Example 2
if (or, else if) Execute some steps if a condition is true (or false) If the value of a variable is larger than 5, print it to the screen. If the result of a statistical test is significant, add a symbol to the graph.
For Repeat some steps for each item in collection, such as a vector. For each value in a vector, print the value to the screen. Repeat something exactly n times.
While Repeat some steps as long as something is true (or false). As long as the value of a variable is smaller than 5, generate a new value for it.

While your data has outliers, remove them.


Uses of the different flow commands

The if statement

This is the basic anatomy of an if statement

if (expression) {
   #statement to execute if condition is true
}

Example:

my_number <-12

if (my_number < 20){
  x <- sprintf('%i is less than 20', my_number)
  print(x)
}
## [1] "12 is less than 20"

The if else statement

There is also an if-else variant of this,

a  <- -5
 
# condition
if(a > 0)
{
    print("Positive Number")
}else{
    print("negative number")
}
## [1] "negative number"

Rewriting the previous one on 1 line (maybe that makes it easier to read?)

a  <- -5
 
# condition
if(a > 0){print("Positive Number")}else{print("negative number")}
## [1] "negative number"

The else if statement

a <- 200
b <- 33

if (b > a) {
  print("b is greater than a")
} else if (a == b) {
  print("a and b are equal")
} else {
  print("a is greater than b")
} 
## [1] "a is greater than b"

Overview of if statements

The for loop

The for loop iterates over a sequence.

my_vector <- runif(5)
for (x in my_vector) {
  y <- x * 3
  print(y)
}
## [1] 2.625455
## [1] 1.246434
## [1] 2.315482
## [1] 1.473068
## [1] 0.3961015

Just to drive the point home, another example:

fruits <- list("apple", "banana", "cherry")
for (x in fruits) {
  print(x)
} 
## [1] "apple"
## [1] "banana"
## [1] "cherry"

One very common use of the for loop is to iterate a bit of code exactly n times.

number_of_time_i_want_to_repeat_this <-10
for (x in 1:10) {
  print('This is being repeated!')
} 
## [1] "This is being repeated!"
## [1] "This is being repeated!"
## [1] "This is being repeated!"
## [1] "This is being repeated!"
## [1] "This is being repeated!"
## [1] "This is being repeated!"
## [1] "This is being repeated!"
## [1] "This is being repeated!"
## [1] "This is being repeated!"
## [1] "This is being repeated!"

You can use a break statement to break the loop at any point.

number_of_time_i_want_to_repeat_this <-10
for (x in 1:10) {
  print('This is being repeated!')
  if (x > 7){
    print('I quit!')
    break
  }
} 
## [1] "This is being repeated!"
## [1] "This is being repeated!"
## [1] "This is being repeated!"
## [1] "This is being repeated!"
## [1] "This is being repeated!"
## [1] "This is being repeated!"
## [1] "This is being repeated!"
## [1] "This is being repeated!"
## [1] "I quit!"

The while loop

The while repeats a piece of code if something is true and as long as it is true.

i <- 1
while (i < 6) {
  print(i)
  i <- i + 1
} 
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5

The break keyword

You can use break to exit a loop at any time

i <- 1
while (i < 100000) {
  print(i)
  i <- i + 1
  if (i > 5){break}
} 
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5

Exercises

Note on vector preallocation

This piece of code builds a vector by appending numbers to the end of it.

repeats <- 10000
startTime <- Sys.time()
my_vector <- c()
for (i in 0:repeats){
  x <- runif(1)
  vector <- append(vector, x)
}
endTime <- Sys.time()
print(sprintf('Duration: %.2f', endTime - startTime))
## [1] "Duration: 0.45"

This piece of code preallocates a vector and is more efficient.

repeats <- 10000
startTime <- Sys.time()
my_vector <- numeric(repeats)
for (i in 0:repeats){
  x <- runif(1)
  vector[i] <- x
  }
endTime <- Sys.time()
print(sprintf('Duration: %.2f', endTime - startTime))
## [1] "Duration: 0.02"

Working with text: the paste() function

x <- runif(100)
y <- 10 + 5 * x + rnorm(100)
result <- lm(y ~ x)
summary(result)
## 
## Call:
## lm(formula = y ~ x)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.57325 -0.51700 -0.07491  0.53852  2.19317 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   9.8760     0.1567   63.03   <2e-16 ***
## x             5.0109     0.2767   18.11   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8576 on 98 degrees of freedom
## Multiple R-squared:  0.7699, Adjusted R-squared:  0.7676 
## F-statistic: 327.9 on 1 and 98 DF,  p-value: < 2.2e-16
test1 <- paste(10000)
test2<-paste(result$coefficients[1], result$coefficients[2], sep = ', ')
test3<-paste('The coefficients are: ', result$coefficients[1],  ', ', result$coefficients[2], sep='')
print(test1)
## [1] "10000"
print(test2)
## [1] "9.87603331162598, 5.0108914630349"
print(test3)
## [1] "The coefficients are: 9.87603331162598, 5.0108914630349"
for (x in 1:10) {print(test3)}
## [1] "The coefficients are: 9.87603331162598, 5.0108914630349"
## [1] "The coefficients are: 9.87603331162598, 5.0108914630349"
## [1] "The coefficients are: 9.87603331162598, 5.0108914630349"
## [1] "The coefficients are: 9.87603331162598, 5.0108914630349"
## [1] "The coefficients are: 9.87603331162598, 5.0108914630349"
## [1] "The coefficients are: 9.87603331162598, 5.0108914630349"
## [1] "The coefficients are: 9.87603331162598, 5.0108914630349"
## [1] "The coefficients are: 9.87603331162598, 5.0108914630349"
## [1] "The coefficients are: 9.87603331162598, 5.0108914630349"
## [1] "The coefficients are: 9.87603331162598, 5.0108914630349"

Solutions to exercises

One

Write a for loop that iterates over the numbers 1 to 7 and prints the cube of each number using print().

for(i in 1:7){
  print(i^2)
  }
## [1] 1
## [1] 4
## [1] 9
## [1] 16
## [1] 25
## [1] 36
## [1] 49

Two

Write a while loop that prints out standard random normal numbers (use rnorm()) but stops (breaks) if you get a number bigger than 1.

Option 1

value <- 0
counter <-0
while(value < 1)
{
  value <- rnorm(1)
  counter <- counter + 1
}
print(value)
## [1] 1.489717
print(counter)
## [1] 8

Option 2

counter <-0
while(TRUE)
{
  value <- rnorm(1)
  counter <- counter + 1
  if (value > 1){break}
}
print(value)
## [1] 1.587473
print(counter)
## [1] 1

Three

Using a for loop simulate the flip a coin twenty times, keeping track of the individual outcomes (1 = heads, 0 = tails) in a vector.

repeats <- 20
outcomes <- character(repeats)
for(i in 1:repeats)
{
  outcome <- sample(c('H','T'), 1)
  outcomes[i] <- outcome
}
outcomes
##  [1] "H" "H" "T" "T" "H" "T" "H" "H" "H" "H" "H" "H" "T" "H" "T" "H" "T" "T" "T"
## [20] "H"

You could do this in one line (but that was not the exercise).

repeats <- 20
outcomes <- sample(c('H','T'), repeats, replace = TRUE)
outcomes
##  [1] "H" "H" "T" "H" "T" "T" "H" "H" "H" "T" "H" "T" "H" "H" "T" "T" "H" "H" "H"
## [20] "H"

Four

Use a while loop to investigate the number of terms required before the series 1 ,reaches above 10 million.

product <- 1
term <- 0

while(product < 10000000)
{
  term <- term + 1
  product <- product * term
  
}
print(term)
## [1] 11
# Check
1:term
##  [1]  1  2  3  4  5  6  7  8  9 10 11
cumprod(1:term)>10000000
##  [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE