Skip to the content.

Reading Data

Last Updated: 24, October, 2024 at 08:58

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0      ✔ purrr   1.0.2 
## ✔ tibble  3.1.8      ✔ dplyr   1.0.10
## ✔ tidyr   1.2.1      ✔ stringr 1.4.1 
## ✔ readr   2.1.3      ✔ forcats 0.5.2 
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

Before we begin…

Download the following data files to your computer:

Using the tidyverse vs built-in data reading functions

Default function

In base R, the default function for reading tabular data from a CSV file is read.csv(). This function is widely used, but it has some quirks:

Example usage:

data <- read.csv("data/pakistan_intellectual_capital.csv")

The tidyverse way

The tidyverse provides its own functions to read data. For example, the nearly-equivalent function to read.csv() is read_csv() (from the readr package). Notice the small difference in the function name!

The tidyverse provides a more user-friendly function for reading CSV files: read_csv() from the readr package. It addresses many of the limitations of read.csv() and offers a more intuitive experience for modern data analysis:

Example usage:

data <- read_csv("data/pakistan_intellectual_capital.csv")
## New names:
## Rows: 1142 Columns: 13
## ── Column specification
## ──────────────────────────────────────────────────────── Delimiter: "," chr
## (10): Teacher Name, University Currently Teaching, Department, Province ... dbl
## (3): ...1, S#, Year
## ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
## Specify the column types or set `show_col_types = FALSE` to quiet this message.
## • `` -> `...1`

Throughout this class, I will try to use the tidyverse functions for reading data.

Reading data from excel using readxl

Excel spreadsheets are often used and easy ways to store data.

Explore the data in transit-data.xlsx. (Please note the organization of my project folder.)

Note that:

Read the first sheet of data:

library(readxl)
data <-read_excel("data/transit-data.xlsx") #Be sure to use the correct relative path here.

Look at the help for the read_excel function:

?read_excel

Let’s specify the data range we want to read.

info <-read_excel("data/transit-data.xlsx", sheet = 'info', range = 'B1:C7')
info <-read_excel("data/transit-data.xlsx", sheet = 'info', range = cell_cols('B:C'))

Let’s now read the data in the second sheet:

more_data <- read_excel("data/transit-data.xlsx", sheet = 'transport data')
## New names:
## • `` -> `...2`
## • `` -> `...3`
## • `` -> `...5`
## • `` -> `...6`
## • `` -> `...8`

We need to skip the first line of the data.

more_data <- read_excel("data/transit-data.xlsx", sheet = 'transport data', skip=1)

The column names are human readable but not very handy for coding. There is a quick way to solve this.

Tip: Only use underscores or dots in column names. This will make your life easier when coding.

colnames(more_data) <- make.names(colnames(more_data))

Reading data from a comma separated text file

R has functions for reading in text files. However, the functions provided by tidyverse are often more powerful. When reading in data, R reports on the column types. Information about the different types and their labels can be found here: Column Types

Note: The default report on the column types is a bit disorganized in this view. It will look better in your console.

data <- read_csv('data/pakistan_intellectual_capital.csv')
## New names:
## Rows: 1142 Columns: 13
## ── Column specification
## ──────────────────────────────────────────────────────── Delimiter: "," chr
## (10): Teacher Name, University Currently Teaching, Department, Province ... dbl
## (3): ...1, S#, Year
## ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
## Specify the column types or set `show_col_types = FALSE` to quiet this message.
## • `` -> `...1`

Tibbles

Reading in data using tidyverse functions generates tibbles, instead of R’s traditional data.frame. These are a newer version of the classic data frame with features tweaked to make life a bit easier.

print(class(data))
## [1] "spec_tbl_df" "tbl_df"      "tbl"         "data.frame"

Convert to an old school data frame.

old_school<-as.data.frame(data)

Some interesting options when using read_csv()

No column names? No problem!

If data has no column names, use col_names = FALSE

read_csv("1,2,3\n4,5,6", col_names = FALSE)
## Rows: 2 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (3): X1, X2, X3
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

## # A tibble: 2 × 3
##      X1    X2    X3
##   <dbl> <dbl> <dbl>
## 1     1     2     3
## 2     4     5     6

You can also directly set the column names in this case.

read_csv("1,2,3\n4,5,6", col_names = c("x", "y", "z"))
## Rows: 2 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (3): x, y, z
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

## # A tibble: 2 × 3
##       x     y     z
##   <dbl> <dbl> <dbl>
## 1     1     2     3
## 2     4     5     6

Specifying missing data

Another option that commonly needs tweaking is na: this specifies the value (or values) that are used to represent missing values in your file:

read_csv("a,b,c\n1,2,.", na = ".")
## Rows: 1 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (2): a, b
## lgl (1): c
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

## # A tibble: 1 × 3
##       a     b c    
##   <dbl> <dbl> <lgl>
## 1     1     2 NA

You can also specify multiple values that should be considered as missing data.

data <- read_csv("a,b,c\n1,2,.\n5,6,NA\n8,0,1", na = c("NA", "."))
## Rows: 3 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (3): a, b, c
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(data)
## # A tibble: 3 × 3
##       a     b     c
##   <dbl> <dbl> <dbl>
## 1     1     2    NA
## 2     5     6    NA
## 3     8     0     1

Reading from URL

You can read data directly from a URL as well.

url<-'https://raw.githubusercontent.com/dvanderelst-python-class/python-class/fall2022/11_Pandas_Statistics/data/pizzasize.csv'
data <- read_csv(url)
## Rows: 250 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): Store, CrustDescription, Topping
## dbl (2): ID, Diameter
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Reading files not seperated by commas

read_csv() expects comma separated field. For data separated by other characters use these:

Exploring data

After reading in the data, it’s a good idea to explore the data. Use either the environment tab in Rstudio or use functions that give you some info about your data.

“This is a critical step that should always be performed. It is simple but it is vital. You should make numerical summaries such as means, standard deviations (SDs), maximum and minimum, correlations and whatever else is appropriate to the specific dataset. Equally important are graphical summaries.”

(Faraway, J. J. (2004). Linear models with R. Chapman and Hall/CRC.).

One of the most simple things you can do is inspect your data using a text editor. When text files are large, it’s a good idea to get an editor that can handle large files. I suggest Sublime Text Editor. This editor can handle much larger files than you can open (as text) in Rstudio. Keep the file open while you’re cleaning it!

Let’s read in some data:

data <- read_csv('data/pakistan_intellectual_capital.csv') # n_max is the number of rows to read
## New names:
## Rows: 1142 Columns: 13
## ── Column specification
## ──────────────────────────────────────────────────────── Delimiter: "," chr
## (10): Teacher Name, University Currently Teaching, Department, Province ... dbl
## (3): ...1, S#, Year
## ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
## Specify the column types or set `show_col_types = FALSE` to quiet this message.
## • `` -> `...1`

Column names

colnames(data)
##  [1] "...1"                                     
##  [2] "S#"                                       
##  [3] "Teacher Name"                             
##  [4] "University Currently Teaching"            
##  [5] "Department"                               
##  [6] "Province University Located"              
##  [7] "Designation"                              
##  [8] "Terminal Degree"                          
##  [9] "Graduated from"                           
## [10] "Country"                                  
## [11] "Year"                                     
## [12] "Area of Specialization/Research Interests"
## [13] "Other Information"

Head and tail

head(data)
## # A tibble: 6 × 13
##    ...1  `S#` Teacher …¹ Unive…² Depar…³ Provi…⁴ Desig…⁵ Termi…⁶ Gradu…⁷ Country
##   <dbl> <dbl> <chr>      <chr>   <chr>   <chr>   <chr>   <chr>   <chr>   <chr>  
## 1     2     3 Dr. Abdul… Univer… Comput… Baloch… Assist… PhD     Asian … Thaila…
## 2     4     5 Dr. Wahee… Univer… Comput… Baloch… Assist… PhD     Asian … Thaila…
## 3     5     6 Dr. Junai… Univer… Comput… Baloch… Assist… PhD     Asian … Thaila…
## 4     6     7 Dr. Mahee… Univer… Comput… Baloch… Assist… PhD     Asian … Thaila…
## 5    24    25 Samina Az… Sardar… Comput… Baloch… Lectur… BS      Baloch… Pakist…
## 6    25    26 Nausheed … Sardar… Comput… Baloch… Lectur… MCS     Univer… Pakist…
## # … with 3 more variables: Year <dbl>,
## #   `Area of Specialization/Research Interests` <chr>,
## #   `Other Information` <chr>, and abbreviated variable names ¹​`Teacher Name`,
## #   ²​`University Currently Teaching`, ³​Department,
## #   ⁴​`Province University Located`, ⁵​Designation, ⁶​`Terminal Degree`,
## #   ⁷​`Graduated from`
tail(data)
## # A tibble: 6 × 13
##    ...1  `S#` Teacher …¹ Unive…² Depar…³ Provi…⁴ Desig…⁵ Termi…⁶ Gradu…⁷ Country
##   <dbl> <dbl> <chr>      <chr>   <chr>   <chr>   <chr>   <chr>   <chr>   <chr>  
## 1  1971  1972 Dr. Khali… Ghulam… Comput… KPK     Profes… PhD     Concor… Canada 
## 2  1974  1975 Dr. Ahmar… Ghulam… Comput… KPK     Associ… PhD     JNU     South …
## 3  1975  1976 Dr. Fawad… Ghulam… Comput… KPK     Associ… PhD     Grenob… France 
## 4  1977  1978 Dr. Rasha… Ghulam… Comput… KPK     Assist… PhD     Florid… USA    
## 5  1979  1980 Dr. Shaha… Ghulam… Comput… KPK     Assist… PhD     Ghulam… Pakist…
## 6  1980  1981 Dr. Sajid… Ghulam… Comput… KPK     Assist… PhD     Seoul … South …
## # … with 3 more variables: Year <dbl>,
## #   `Area of Specialization/Research Interests` <chr>,
## #   `Other Information` <chr>, and abbreviated variable names ¹​`Teacher Name`,
## #   ²​`University Currently Teaching`, ³​Department,
## #   ⁴​`Province University Located`, ⁵​Designation, ⁶​`Terminal Degree`,
## #   ⁷​`Graduated from`

Summary

Get a quick summary

summary(data)
##       ...1              S#         Teacher Name      
##  Min.   :   2.0   Min.   :   3.0   Length:1142       
##  1st Qu.: 689.2   1st Qu.: 690.2   Class :character  
##  Median :1087.5   Median :1088.5   Mode  :character  
##  Mean   :1054.4   Mean   :1055.4                     
##  3rd Qu.:1476.8   3rd Qu.:1477.8                     
##  Max.   :1980.0   Max.   :1981.0                     
##                                                      
##  University Currently Teaching  Department        Province University Located
##  Length:1142                   Length:1142        Length:1142                
##  Class :character              Class :character   Class :character           
##  Mode  :character              Mode  :character   Mode  :character           
##                                                                              
##                                                                              
##                                                                              
##                                                                              
##  Designation        Terminal Degree    Graduated from       Country         
##  Length:1142        Length:1142        Length:1142        Length:1142       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##       Year      Area of Specialization/Research Interests Other Information 
##  Min.   :1983   Length:1142                               Length:1142       
##  1st Qu.:2008   Class :character                          Class :character  
##  Median :2012   Mode  :character                          Mode  :character  
##  Mean   :2010                                                               
##  3rd Qu.:2014                                                               
##  Max.   :2018                                                               
##  NA's   :653

Glimpse

The function glimpse gives you also a quick overview:

glimpse(data)
## Rows: 1,142
## Columns: 13
## $ ...1                                        <dbl> 2, 4, 5, 6, 24, 25, 26, 27…
## $ `S#`                                        <dbl> 3, 5, 6, 7, 25, 26, 27, 28…
## $ `Teacher Name`                              <chr> "Dr. Abdul Basit", "Dr. Wa…
## $ `University Currently Teaching`             <chr> "University of Balochistan…
## $ Department                                  <chr> "Computer Science & IT", "…
## $ `Province University Located`               <chr> "Balochistan", "Balochista…
## $ Designation                                 <chr> "Assistant Professor", "As…
## $ `Terminal Degree`                           <chr> "PhD", "PhD", "PhD", "PhD"…
## $ `Graduated from`                            <chr> "Asian Institute of Techno…
## $ Country                                     <chr> "Thailand", "Thailand", "T…
## $ Year                                        <dbl> NA, NA, NA, NA, 2005, 2008…
## $ `Area of Specialization/Research Interests` <chr> "Software Engineering & DB…
## $ `Other Information`                         <chr> NA, NA, NA, NA, NA, NA, NA…

Skim

And what about this beautiful exploratory tool?

library(skimr)
skim(data)
   
Name data
Number of rows 1142
Number of columns 13
_______________________  
Column type frequency:  
character 10
numeric 3
________________________  
Group variables None

Data summary

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
Teacher Name 0 1.00 5 40 0 1133 0
University Currently Teaching 0 1.00 7 71 0 63 0
Department 0 1.00 7 43 0 17 0
Province University Located 0 1.00 3 11 0 5 0
Designation 19 0.98 4 39 0 46 0
Terminal Degree 4 1.00 2 30 0 41 0
Graduated from 0 1.00 3 88 0 347 0
Country 0 1.00 2 18 0 35 0
Area of Specialization/Research Interests 519 0.55 3 477 0 570 0
Other Information 1018 0.11 8 132 0 51 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
…1 0 1.00 1054.35 520.20 2 689.25 1087.5 1476.75 1980 ▅▅▆▇▅
S# 0 1.00 1055.35 520.20 3 690.25 1088.5 1477.75 1981 ▅▅▆▇▅
Year 653 0.43 2010.46 5.58 1983 2008.00 2012.0 2014.00 2018 ▁▁▁▆▇
skim_tee(data)
## ── Data Summary ────────────────────────
##                            Values
## Name                       data  
## Number of rows             1142  
## Number of columns          13    
## _______________________          
## Column type frequency:           
##   character                10    
##   numeric                  3     
## ________________________         
## Group variables            None  
## 
## ── Variable type: character ────────────────────────────────────────────────────
##    skim_variable                             n_missing complete_rate min max
##  1 Teacher Name                                      0         1       5  40
##  2 University Currently Teaching                     0         1       7  71
##  3 Department                                        0         1       7  43
##  4 Province University Located                       0         1       3  11
##  5 Designation                                      19         0.983   4  39
##  6 Terminal Degree                                   4         0.996   2  30
##  7 Graduated from                                    0         1       3  88
##  8 Country                                           0         1       2  18
##  9 Area of Specialization/Research Interests       519         0.546   3 477
## 10 Other Information                              1018         0.109   8 132
##    empty n_unique whitespace
##  1     0     1133          0
##  2     0       63          0
##  3     0       17          0
##  4     0        5          0
##  5     0       46          0
##  6     0       41          0
##  7     0      347          0
##  8     0       35          0
##  9     0      570          0
## 10     0       51          0
## 
## ── Variable type: numeric ──────────────────────────────────────────────────────
##   skim_variable n_missing complete_rate  mean     sd   p0   p25   p50   p75 p100
## 1 ...1                  0         1     1054. 520.      2  689. 1088. 1477. 1980
## 2 S#                    0         1     1055. 520.      3  690. 1088. 1478. 1981
## 3 Year                653         0.428 2010.   5.58 1983 2008  2012  2014  2018
##   hist 
## 1 ▅▅▆▇▅
## 2 ▅▅▆▇▅
## 3 ▁▁▁▆▇

Tabulating

You can use the janitor package to get a quick overview of categorical variables, i.e., tabulating categorical variables. It’s somewhat confusing that the percent column is not a percentage but a proportion.

library(janitor)
## 
## Attaching package: 'janitor'

## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test
tabyl(data, `Province University Located`)
##  Province University Located   n    percent
##                  Balochistan  16 0.01401051
##                      Capital 247 0.21628722
##                          KPK  86 0.07530648
##                       Punjab 449 0.39316988
##                        Sindh 344 0.30122592

https://chatgpt.com/c/6707c7ce-00c0-800a-b35a-a647ab7503be The janitor package is also useful for other things, like cleaning up column names. This is a link to an online article that describes some of the uses of the package on medium.com.

Addressing a single variable

A simple way to get variables from a tibble is using the $ operator.

my_variable <- data$Year
summary(my_variable)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    1983    2008    2012    2010    2014    2018     653

However, this does not work if your variables names have names with a space (or other weird characters). This does work, however,

my_variable <- data['Terminal Degree']

Creating new variables

Often, you will want to create new variables based on existing ones to make subsequent processing easier.

data['very_simple'] <- 1
data['very_simple']
## # A tibble: 1,142 × 1
##    very_simple
##          <dbl>
##  1           1
##  2           1
##  3           1
##  4           1
##  5           1
##  6           1
##  7           1
##  8           1
##  9           1
## 10           1
## # … with 1,132 more rows

However, we can do more interesting things.

data['has_phd'] <- data['Terminal Degree'] == 'PhD'
data['has_phd']
## # A tibble: 1,142 × 1
##    has_phd
##    <lgl>  
##  1 TRUE   
##  2 TRUE   
##  3 TRUE   
##  4 TRUE   
##  5 FALSE  
##  6 FALSE  
##  7 FALSE  
##  8 FALSE  
##  9 FALSE  
## 10 FALSE  
## # … with 1,132 more rows

We can calculate new variables.

data['years_since_graduation'] <- 2022 - data$Year

Creating new variables using mutate

The function mutate is the Tidyverse-approach to creating new variables. Let’s use the pizza data for this demo.

url<-'https://raw.githubusercontent.com/dvanderelst-python-class/python-class/fall2022/11_Pandas_Statistics/data/pizzasize.csv'
data <- read_csv(url)
## Rows: 250 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): Store, CrustDescription, Topping
## dbl (2): ID, Diameter
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Mathematical Operations on Existing Columns

Here are number of slides (not by me) that also explain this: https://tinyurl.com/4a6hek9f

Let’s create a new column giving the surface area of each column.

data <- mutate(data, area = pi * (Diameter/2)^2)
data
## # A tibble: 250 × 6
##       ID Store     CrustDescription Topping       Diameter  area
##    <dbl> <chr>     <chr>            <chr>            <dbl> <dbl>
##  1     1 Dominos   ThinNCrispy      Supreme           29.4  679.
##  2     2 Dominos   ThinNCrispy      BBQMeatlovers     29.6  690.
##  3     3 Dominos   DeepPan          Hawaiian          27.1  575.
##  4     4 Dominos   ThinNCrispy      Supreme           27.4  592.
##  5     5 Dominos   ClassicCrust     Hawaiian          26.6  555.
##  6     6 Dominos   DeepPan          BBQMeatlovers     27.2  579.
##  7     7 EagleBoys MidCrust         SuperSupremo      29.2  667.
##  8     8 EagleBoys DeepPan          Hawaiian          28.8  651.
##  9     9 EagleBoys ThinCrust        BBQMeatlovers     30.0  709.
## 10    10 EagleBoys DeepPan          BBQMeatlovers     29.4  678.
## # … with 240 more rows

For fun: transmute only keeps the new variable.

data <- transmute(data, area = pi * (Diameter/2)^2)
data
## # A tibble: 250 × 1
##     area
##    <dbl>
##  1  679.
##  2  690.
##  3  575.
##  4  592.
##  5  555.
##  6  579.
##  7  667.
##  8  651.
##  9  709.
## 10  678.
## # … with 240 more rows

Logical/Conditional Variables using case_when

Tip: You can use mutate to created categories in your data.

data <- read_delim("data/cars.txt", delim = " ")
## Rows: 93 Columns: 26
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: " "
## chr  (6): make, model, type, cylinders, rearseat, luggage
## dbl (20): min_price, mid_price, max_price, mpg_city, mpg_hgw, airbag, drive,...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
data <- mutate(data, mpg_difference = mpg_hgw - mpg_city,
         fuel_efficiency_category = case_when(
           mpg_difference < 5 ~ "Efficient",
           mpg_difference >= 5 & mpg_difference <= 10 ~ "Moderate",
           mpg_difference > 10 ~ "Inefficient"
         ))
head(data)
## # A tibble: 6 × 28
##   make  model type  min_p…¹ mid_p…² max_p…³ mpg_c…⁴ mpg_hgw airbag drive cylin…⁵
##   <chr> <chr> <chr>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>  <dbl> <dbl> <chr>  
## 1 Acura Inte… Small    12.9    15.9    18.8      25      31      0     1 4      
## 2 Acura Lege… Mids…    29.2    33.9    38.7      18      25      2     1 6      
## 3 Audi  90    Comp…    25.9    29.1    32.3      20      26      1     1 6      
## 4 Audi  100   Mids…    30.8    37.7    44.6      19      26      2     1 6      
## 5 BMW   535i  Mids…    23.7    30      36.2      22      30      1     0 4      
## 6 Buick Cent… Mids…    14.2    15.7    17.3      22      31      1     1 4      
## # … with 17 more variables: engine <dbl>, horsepower <dbl>, rpm <dbl>,
## #   rpmile <dbl>, manual <dbl>, tank <dbl>, passengers <dbl>, length <dbl>,
## #   wheelbase <dbl>, width <dbl>, uturn <dbl>, rearseat <chr>, luggage <chr>,
## #   weight <dbl>, domestic <dbl>, mpg_difference <dbl>,
## #   fuel_efficiency_category <chr>, and abbreviated variable names ¹​min_price,
## #   ²​mid_price, ³​max_price, ⁴​mpg_city, ⁵​cylinders

Logical/Conditional Variables using if_else

data <- mutate(data, small_engine = if_else(engine < 2, 1, 0))
names(data)
##  [1] "make"                     "model"                   
##  [3] "type"                     "min_price"               
##  [5] "mid_price"                "max_price"               
##  [7] "mpg_city"                 "mpg_hgw"                 
##  [9] "airbag"                   "drive"                   
## [11] "cylinders"                "engine"                  
## [13] "horsepower"               "rpm"                     
## [15] "rpmile"                   "manual"                  
## [17] "tank"                     "passengers"              
## [19] "length"                   "wheelbase"               
## [21] "width"                    "uturn"                   
## [23] "rearseat"                 "luggage"                 
## [25] "weight"                   "domestic"                
## [27] "mpg_difference"           "fuel_efficiency_category"
## [29] "small_engine"
data <- mutate(data, `small tank` = if_else(tank < 15, 1, 0))
names(data)
##  [1] "make"                     "model"                   
##  [3] "type"                     "min_price"               
##  [5] "mid_price"                "max_price"               
##  [7] "mpg_city"                 "mpg_hgw"                 
##  [9] "airbag"                   "drive"                   
## [11] "cylinders"                "engine"                  
## [13] "horsepower"               "rpm"                     
## [15] "rpmile"                   "manual"                  
## [17] "tank"                     "passengers"              
## [19] "length"                   "wheelbase"               
## [21] "width"                    "uturn"                   
## [23] "rearseat"                 "luggage"                 
## [25] "weight"                   "domestic"                
## [27] "mpg_difference"           "fuel_efficiency_category"
## [29] "small_engine"             "small tank"

Logical/Conditional Variables using cut

The cut function is useful for creating categories based on continuous variables. You can find more information about the cut() function here.

data <- mutate(data, category = cut(min_price,
                              breaks = c(0, 10, 20, 30, 40),
                              labels = c("Low", "Medium", "High", "Very High")))
glimpse(data)
## Rows: 93
## Columns: 31
## $ make                     <chr> "Acura", "Acura", "Audi", "Audi", "BMW", "Bui…
## $ model                    <chr> "Integra", "Legend", "90", "100", "535i", "Ce…
## $ type                     <chr> "Small", "Midsize", "Compact", "Midsize", "Mi…
## $ min_price                <dbl> 12.9, 29.2, 25.9, 30.8, 23.7, 14.2, 19.9, 22.…
## $ mid_price                <dbl> 15.9, 33.9, 29.1, 37.7, 30.0, 15.7, 20.8, 23.…
## $ max_price                <dbl> 18.8, 38.7, 32.3, 44.6, 36.2, 17.3, 21.7, 24.…
## $ mpg_city                 <dbl> 25, 18, 20, 19, 22, 22, 19, 16, 19, 16, 16, 2…
## $ mpg_hgw                  <dbl> 31, 25, 26, 26, 30, 31, 28, 25, 27, 25, 25, 3…
## $ airbag                   <dbl> 0, 2, 1, 2, 1, 1, 1, 1, 1, 1, 2, 0, 1, 2, 0, …
## $ drive                    <dbl> 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, …
## $ cylinders                <chr> "4", "6", "6", "6", "4", "4", "6", "6", "6", …
## $ engine                   <dbl> 1.8, 3.2, 2.8, 2.8, 3.5, 2.2, 3.8, 5.7, 3.8, …
## $ horsepower               <dbl> 140, 200, 172, 172, 208, 110, 170, 180, 170, …
## $ rpm                      <dbl> 6300, 5500, 5500, 5500, 5700, 5200, 4800, 400…
## $ rpmile                   <dbl> 2890, 2335, 2280, 2535, 2545, 2565, 1570, 132…
## $ manual                   <dbl> 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, …
## $ tank                     <dbl> 13.2, 18.0, 16.9, 21.1, 21.1, 16.4, 18.0, 23.…
## $ passengers               <dbl> 5, 5, 5, 6, 4, 6, 6, 6, 5, 6, 5, 5, 5, 4, 6, …
## $ length                   <dbl> 177, 195, 180, 193, 186, 189, 200, 216, 198, …
## $ wheelbase                <dbl> 102, 115, 102, 106, 109, 105, 111, 116, 108, …
## $ width                    <dbl> 68, 71, 67, 70, 69, 69, 74, 78, 73, 73, 74, 6…
## $ uturn                    <dbl> 37, 38, 37, 37, 39, 41, 42, 45, 41, 43, 44, 3…
## $ rearseat                 <chr> "26.5", "30.0", "28.0", "31.0", "27.0", "28.0…
## $ luggage                  <chr> "11", "15", "14", "17", "13", "16", "17", "21…
## $ weight                   <dbl> 2705, 3560, 3375, 3405, 3640, 2880, 3470, 410…
## $ domestic                 <dbl> 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ mpg_difference           <dbl> 6, 7, 6, 7, 8, 9, 9, 9, 8, 9, 9, 11, 9, 9, 8,…
## $ fuel_efficiency_category <chr> "Moderate", "Moderate", "Moderate", "Moderate…
## $ small_engine             <dbl> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ `small tank`             <dbl> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ category                 <fct> Medium, High, High, Very High, High, Medium, …

Recode values

recode() is a method to change the observations within a variable. It can be used together with mutate().

data <- mutate(data,
               category = recode(cylinders,
                                 '3' = "Low Power",
                                 '4' = "Low Power",
                                 '6' = "Moderate Power",
                                 '8' = "High Power"))
glimpse(data)
## Rows: 93
## Columns: 31
## $ make                     <chr> "Acura", "Acura", "Audi", "Audi", "BMW", "Bui…
## $ model                    <chr> "Integra", "Legend", "90", "100", "535i", "Ce…
## $ type                     <chr> "Small", "Midsize", "Compact", "Midsize", "Mi…
## $ min_price                <dbl> 12.9, 29.2, 25.9, 30.8, 23.7, 14.2, 19.9, 22.…
## $ mid_price                <dbl> 15.9, 33.9, 29.1, 37.7, 30.0, 15.7, 20.8, 23.…
## $ max_price                <dbl> 18.8, 38.7, 32.3, 44.6, 36.2, 17.3, 21.7, 24.…
## $ mpg_city                 <dbl> 25, 18, 20, 19, 22, 22, 19, 16, 19, 16, 16, 2…
## $ mpg_hgw                  <dbl> 31, 25, 26, 26, 30, 31, 28, 25, 27, 25, 25, 3…
## $ airbag                   <dbl> 0, 2, 1, 2, 1, 1, 1, 1, 1, 1, 2, 0, 1, 2, 0, …
## $ drive                    <dbl> 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, …
## $ cylinders                <chr> "4", "6", "6", "6", "4", "4", "6", "6", "6", …
## $ engine                   <dbl> 1.8, 3.2, 2.8, 2.8, 3.5, 2.2, 3.8, 5.7, 3.8, …
## $ horsepower               <dbl> 140, 200, 172, 172, 208, 110, 170, 180, 170, …
## $ rpm                      <dbl> 6300, 5500, 5500, 5500, 5700, 5200, 4800, 400…
## $ rpmile                   <dbl> 2890, 2335, 2280, 2535, 2545, 2565, 1570, 132…
## $ manual                   <dbl> 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, …
## $ tank                     <dbl> 13.2, 18.0, 16.9, 21.1, 21.1, 16.4, 18.0, 23.…
## $ passengers               <dbl> 5, 5, 5, 6, 4, 6, 6, 6, 5, 6, 5, 5, 5, 4, 6, …
## $ length                   <dbl> 177, 195, 180, 193, 186, 189, 200, 216, 198, …
## $ wheelbase                <dbl> 102, 115, 102, 106, 109, 105, 111, 116, 108, …
## $ width                    <dbl> 68, 71, 67, 70, 69, 69, 74, 78, 73, 73, 74, 6…
## $ uturn                    <dbl> 37, 38, 37, 37, 39, 41, 42, 45, 41, 43, 44, 3…
## $ rearseat                 <chr> "26.5", "30.0", "28.0", "31.0", "27.0", "28.0…
## $ luggage                  <chr> "11", "15", "14", "17", "13", "16", "17", "21…
## $ weight                   <dbl> 2705, 3560, 3375, 3405, 3640, 2880, 3470, 410…
## $ domestic                 <dbl> 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ mpg_difference           <dbl> 6, 7, 6, 7, 8, 9, 9, 9, 8, 9, 9, 11, 9, 9, 8,…
## $ fuel_efficiency_category <chr> "Moderate", "Moderate", "Moderate", "Moderate…
## $ small_engine             <dbl> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ `small tank`             <dbl> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ category                 <chr> "Low Power", "Moderate Power", "Moderate Powe…

Exercises

It might be a good idea to start your first R script while working on these exercises.

Exercise 1

Exercise 2

Download the wages1833.csv data file (from the data folder) and save it to your computer. This file contains data on the wages of Lancashire cotton factory workers in 1833. For each age category, the file lists the following:

More info on the data can be found in this paper: Boot, H.M. 1995. How Skilled Were the Lancashire Cotton Factory Workers in 1833? Economic History Review 48: 283-303.

Write a script that does the following:

Exercise 3

Read in the wages1833.csv data file using the read_csv() function. Create a new column called age_group, which categorizes workers into the following groups based on their age:

Exercise 4

Read in the wages1833.csv file using read_csv(). Create a new column called wage_difference, which gives the absolute difference between the average male wage (mwage) and the average female wage (fwage). Create a second new column called higher_wage, which indicates whether males or females earned more on average in each row (“male” or “female”). Print the first few rows of the data.

Possible solutions

Exercise 1

data <- read_delim("data/cars.txt", delim = " ")
## Rows: 93 Columns: 26
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: " "
## chr  (6): make, model, type, cylinders, rearseat, luggage
## dbl (20): min_price, mid_price, max_price, mpg_city, mpg_hgw, airbag, drive,...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(data)
## # A tibble: 6 × 26
##   make  model type  min_p…¹ mid_p…² max_p…³ mpg_c…⁴ mpg_hgw airbag drive cylin…⁵
##   <chr> <chr> <chr>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>  <dbl> <dbl> <chr>  
## 1 Acura Inte… Small    12.9    15.9    18.8      25      31      0     1 4      
## 2 Acura Lege… Mids…    29.2    33.9    38.7      18      25      2     1 6      
## 3 Audi  90    Comp…    25.9    29.1    32.3      20      26      1     1 6      
## 4 Audi  100   Mids…    30.8    37.7    44.6      19      26      2     1 6      
## 5 BMW   535i  Mids…    23.7    30      36.2      22      30      1     0 4      
## 6 Buick Cent… Mids…    14.2    15.7    17.3      22      31      1     1 4      
## # … with 15 more variables: engine <dbl>, horsepower <dbl>, rpm <dbl>,
## #   rpmile <dbl>, manual <dbl>, tank <dbl>, passengers <dbl>, length <dbl>,
## #   wheelbase <dbl>, width <dbl>, uturn <dbl>, rearseat <chr>, luggage <chr>,
## #   weight <dbl>, domestic <dbl>, and abbreviated variable names ¹​min_price,
## #   ²​mid_price, ³​max_price, ⁴​mpg_city, ⁵​cylinders
colnames(data)
##  [1] "make"       "model"      "type"       "min_price"  "mid_price" 
##  [6] "max_price"  "mpg_city"   "mpg_hgw"    "airbag"     "drive"     
## [11] "cylinders"  "engine"     "horsepower" "rpm"        "rpmile"    
## [16] "manual"     "tank"       "passengers" "length"     "wheelbase" 
## [21] "width"      "uturn"      "rearseat"   "luggage"    "weight"    
## [26] "domestic"
data <- mutate(data, mpg_difference = mpg_hgw - mpg_city)
head(data)
## # A tibble: 6 × 27
##   make  model type  min_p…¹ mid_p…² max_p…³ mpg_c…⁴ mpg_hgw airbag drive cylin…⁵
##   <chr> <chr> <chr>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>  <dbl> <dbl> <chr>  
## 1 Acura Inte… Small    12.9    15.9    18.8      25      31      0     1 4      
## 2 Acura Lege… Mids…    29.2    33.9    38.7      18      25      2     1 6      
## 3 Audi  90    Comp…    25.9    29.1    32.3      20      26      1     1 6      
## 4 Audi  100   Mids…    30.8    37.7    44.6      19      26      2     1 6      
## 5 BMW   535i  Mids…    23.7    30      36.2      22      30      1     0 4      
## 6 Buick Cent… Mids…    14.2    15.7    17.3      22      31      1     1 4      
## # … with 16 more variables: engine <dbl>, horsepower <dbl>, rpm <dbl>,
## #   rpmile <dbl>, manual <dbl>, tank <dbl>, passengers <dbl>, length <dbl>,
## #   wheelbase <dbl>, width <dbl>, uturn <dbl>, rearseat <chr>, luggage <chr>,
## #   weight <dbl>, domestic <dbl>, mpg_difference <dbl>, and abbreviated
## #   variable names ¹​min_price, ²​mid_price, ³​max_price, ⁴​mpg_city, ⁵​cylinders
glimpse(data)
## Rows: 93
## Columns: 27
## $ make           <chr> "Acura", "Acura", "Audi", "Audi", "BMW", "Buick", "Buic…
## $ model          <chr> "Integra", "Legend", "90", "100", "535i", "Century", "L…
## $ type           <chr> "Small", "Midsize", "Compact", "Midsize", "Midsize", "M…
## $ min_price      <dbl> 12.9, 29.2, 25.9, 30.8, 23.7, 14.2, 19.9, 22.6, 26.3, 3…
## $ mid_price      <dbl> 15.9, 33.9, 29.1, 37.7, 30.0, 15.7, 20.8, 23.7, 26.3, 3…
## $ max_price      <dbl> 18.8, 38.7, 32.3, 44.6, 36.2, 17.3, 21.7, 24.9, 26.3, 3…
## $ mpg_city       <dbl> 25, 18, 20, 19, 22, 22, 19, 16, 19, 16, 16, 25, 25, 19,…
## $ mpg_hgw        <dbl> 31, 25, 26, 26, 30, 31, 28, 25, 27, 25, 25, 36, 34, 28,…
## $ airbag         <dbl> 0, 2, 1, 2, 1, 1, 1, 1, 1, 1, 2, 0, 1, 2, 0, 0, 0, 1, 1…
## $ drive          <dbl> 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 2, 0, 0…
## $ cylinders      <chr> "4", "6", "6", "6", "4", "4", "6", "6", "6", "8", "8", …
## $ engine         <dbl> 1.8, 3.2, 2.8, 2.8, 3.5, 2.2, 3.8, 5.7, 3.8, 4.9, 4.6, …
## $ horsepower     <dbl> 140, 200, 172, 172, 208, 110, 170, 180, 170, 200, 295, …
## $ rpm            <dbl> 6300, 5500, 5500, 5500, 5700, 5200, 4800, 4000, 4800, 4…
## $ rpmile         <dbl> 2890, 2335, 2280, 2535, 2545, 2565, 1570, 1320, 1690, 1…
## $ manual         <dbl> 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1…
## $ tank           <dbl> 13.2, 18.0, 16.9, 21.1, 21.1, 16.4, 18.0, 23.0, 18.8, 1…
## $ passengers     <dbl> 5, 5, 5, 6, 4, 6, 6, 6, 5, 6, 5, 5, 5, 4, 6, 7, 8, 6, 2…
## $ length         <dbl> 177, 195, 180, 193, 186, 189, 200, 216, 198, 206, 204, …
## $ wheelbase      <dbl> 102, 115, 102, 106, 109, 105, 111, 116, 108, 114, 111, …
## $ width          <dbl> 68, 71, 67, 70, 69, 69, 74, 78, 73, 73, 74, 66, 68, 74,…
## $ uturn          <dbl> 37, 38, 37, 37, 39, 41, 42, 45, 41, 43, 44, 38, 39, 43,…
## $ rearseat       <chr> "26.5", "30.0", "28.0", "31.0", "27.0", "28.0", "30.5",…
## $ luggage        <chr> "11", "15", "14", "17", "13", "16", "17", "21", "14", "…
## $ weight         <dbl> 2705, 3560, 3375, 3405, 3640, 2880, 3470, 4105, 3495, 3…
## $ domestic       <dbl> 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ mpg_difference <dbl> 6, 7, 6, 7, 8, 9, 9, 9, 8, 9, 9, 11, 9, 9, 8, 5, 5, 9, …

Exercise 2

data <- read_csv("data/wages1833.csv")
## New names:
## Rows: 51 Columns: 6
## ── Column specification
## ──────────────────────────────────────────────────────── Delimiter: "," dbl
## (6): ...1, age, mnum, mwage, fnum, fwage
## ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
## Specify the column types or set `show_col_types = FALSE` to quiet this message.
## • `` -> `...1`
data <- mutate(data, diff = mnum - fnum, diff_pct = 100 * (mwage - fwage) / fwage)
head(data)
## # A tibble: 6 × 8
##    ...1   age  mnum mwage  fnum fwage  diff diff_pct
##   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>    <dbl>
## 1     1    10   204  30.5   122    35    82  -12.9  
## 2     2    11   195  37.8   198    38    -3   -0.526
## 3     3    12   245  43     241    44     4   -2.27 
## 4     4    13   233  50.5   233    46     0    9.78 
## 5     5    14   256  56.5   236    59    20   -4.24 
## 6     6    15   240  63     215    68    25   -7.35
glimpse(data)
## Rows: 51
## Columns: 8
## $ ...1     <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18…
## $ age      <dbl> 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 2…
## $ mnum     <dbl> 204, 195, 245, 233, 256, 240, 204, 141, 164, 135, 92, 180, 14…
## $ mwage    <dbl> 30.5, 37.8, 43.0, 50.5, 56.5, 63.0, 83.5, 88.5, 141.0, 138.3,…
## $ fnum     <dbl> 122, 198, 241, 233, 236, 215, 256, 245, 279, 251, 209, 251, 1…
## $ fwage    <dbl> 35, 38, 44, 46, 59, 68, 72, 78, 90, 98, 97, 96, 102, 100, 101…
## $ diff     <dbl> 82, -3, 4, 0, 20, 25, -52, -104, -115, -116, -117, -71, -47, …
## $ diff_pct <dbl> -12.8571429, -0.5263158, -2.2727273, 9.7826087, -4.2372881, -…

Exercise 3

data <- read_csv("data/wages1833.csv")
## New names:
## Rows: 51 Columns: 6
## ── Column specification
## ──────────────────────────────────────────────────────── Delimiter: "," dbl
## (6): ...1, age, mnum, mwage, fnum, fwage
## ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
## Specify the column types or set `show_col_types = FALSE` to quiet this message.
## • `` -> `...1`
data <- mutate(data, age_group = case_when(
  age < 14 ~ "child",
  age >= 14 & age <= 20 ~ "young",
  age > 20 ~ "adult"
))
head(data)
## # A tibble: 6 × 7
##    ...1   age  mnum mwage  fnum fwage age_group
##   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>    
## 1     1    10   204  30.5   122    35 child    
## 2     2    11   195  37.8   198    38 child    
## 3     3    12   245  43     241    44 child    
## 4     4    13   233  50.5   233    46 child    
## 5     5    14   256  56.5   236    59 young    
## 6     6    15   240  63     215    68 young
glimpse(data)
## Rows: 51
## Columns: 7
## $ ...1      <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 1…
## $ age       <dbl> 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, …
## $ mnum      <dbl> 204, 195, 245, 233, 256, 240, 204, 141, 164, 135, 92, 180, 1…
## $ mwage     <dbl> 30.5, 37.8, 43.0, 50.5, 56.5, 63.0, 83.5, 88.5, 141.0, 138.3…
## $ fnum      <dbl> 122, 198, 241, 233, 236, 215, 256, 245, 279, 251, 209, 251, …
## $ fwage     <dbl> 35, 38, 44, 46, 59, 68, 72, 78, 90, 98, 97, 96, 102, 100, 10…
## $ age_group <chr> "child", "child", "child", "child", "young", "young", "young…

Exercise 4

data <- read_csv("data/wages1833.csv")
## New names:
## Rows: 51 Columns: 6
## ── Column specification
## ──────────────────────────────────────────────────────── Delimiter: "," dbl
## (6): ...1, age, mnum, mwage, fnum, fwage
## ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
## Specify the column types or set `show_col_types = FALSE` to quiet this message.
## • `` -> `...1`
data <- mutate(data, wage_difference = abs(mwage - fwage), higher_wage = ifelse(mwage > fwage, 'male', 'female'))
head(data)
## # A tibble: 6 × 8
##    ...1   age  mnum mwage  fnum fwage wage_difference higher_wage
##   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>           <dbl> <chr>      
## 1     1    10   204  30.5   122    35           4.5   female     
## 2     2    11   195  37.8   198    38           0.200 female     
## 3     3    12   245  43     241    44           1     female     
## 4     4    13   233  50.5   233    46           4.5   male       
## 5     5    14   256  56.5   236    59           2.5   female     
## 6     6    15   240  63     215    68           5     female
glimpse(data)
## Rows: 51
## Columns: 8
## $ ...1            <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,…
## $ age             <dbl> 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23…
## $ mnum            <dbl> 204, 195, 245, 233, 256, 240, 204, 141, 164, 135, 92, …
## $ mwage           <dbl> 30.5, 37.8, 43.0, 50.5, 56.5, 63.0, 83.5, 88.5, 141.0,…
## $ fnum            <dbl> 122, 198, 241, 233, 236, 215, 256, 245, 279, 251, 209,…
## $ fwage           <dbl> 35, 38, 44, 46, 59, 68, 72, 78, 90, 98, 97, 96, 102, 1…
## $ wage_difference <dbl> 4.5, 0.2, 1.0, 4.5, 2.5, 5.0, 11.5, 10.5, 51.0, 40.3, …
## $ higher_wage     <chr> "female", "female", "female", "male", "female", "femal…