Skip to the content.

Reading Data

Last Updated: 04, November, 2025 at 09:13

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.1     ✔ stringr   1.5.2
## ✔ ggplot2   4.0.0     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.1.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Before we begin…

Download the following data files to your computer from the GradStats > Data Wrangling > Data.

Using the tidyverse vs built-in data reading functions

Default function

In base R, the default function for reading tabular data from a CSV file is read.csv(). This function is widely used, but it has some quirks:

Example usage:

data <- read.csv("data/pakistan_intellectual_capital.csv")

The tidyverse way

The tidyverse provides its own functions to read data. For example, the nearly-equivalent function to read.csv() is read_csv() (from the readr package). Notice the small difference in the function name!

The tidyverse provides a more user-friendly function for reading CSV files: read_csv() from the readr package. It addresses many of the limitations of read.csv() and offers a more intuitive experience for modern data analysis:

Example usage:

data <- read_csv("data/pakistan_intellectual_capital.csv")
## New names:
## Rows: 1142 Columns: 13
## ── Column specification
## ──────────────────────────────────────────────────────── Delimiter: "," chr
## (10): Teacher Name, University Currently Teaching, Department, Province ... dbl
## (3): ...1, S#, Year
## ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
## Specify the column types or set `show_col_types = FALSE` to quiet this message.
## • `` -> `...1`

This is just a quick example. We’ll look more at how to read in data from CSV below. Throughout this class, I will try to use the tidyverse functions for reading data.

Note on paths

Since we’re working with in a RStudio Project, your working directory is automatically set to your project folder. This makes managing file paths much simpler! In this case, I assume you organized your project with the following structure (the project name MyRProject can differ):

MyRProject/
├── MyRProject.Rproj
├── data/
│   ├── transit-data.xlsx
│   ├── pakistan_intellectual_capital.csv
│   ├── cars.txt
│   └── wages1833.csv
├── scripts/
└── output/

R will still find the files as long as you use paths relative to your project folder. This means you can place your scripts in a separate scripts/ folder and your data in a data/ folder without any issues. You could also keep the scripts and data in the top folder.

You can verify where R is looking for files using:

getwd()
## [1] "/home/dieter/Dropbox/GraduateStatistics/GradStats/02 Data Wrangling"

With your data organized in a data/ folder, you can use relative paths to read files. The path is “relative” because it’s relative to your project folder. When you share your project with others or move it to a different computer, the code will still work as long as the folder structure stays the same.

# This looks for the 'data' folder inside your project directory
data <- read_csv("data/pakistan_intellectual_capital.csv")

Note on file path separators

R accepts forward slashes (/) in file paths on all operating systems, including Windows. This is the recommended approach for cross-platform compatibility:

Bottom line: Always use forward slashes in your R code, even on Windows.

Reading data from excel using readxl

Excel spreadsheets are often used and easy way to store data.

Explore the data in transit-data.xlsx. (Please note the organization of my project folder.)

Note that:

Read the first sheet of data:

library(readxl)
data <-read_excel("data/transit-data.xlsx") #Be sure to use the correct relative path here.

Look at the help for the read_excel function:

#Let's specify the data range we want to read.
info <-read_excel("data/transit-data.xlsx", sheet = 'info', range = 'B1:C7')
info <-read_excel("data/transit-data.xlsx", sheet = 'info', range = cell_cols('B:C'))

Let’s now read the data in the second sheet:

more_data <- read_excel("data/transit-data.xlsx", sheet = 'transport data')
## New names:
## • `` -> `...2`
## • `` -> `...3`
## • `` -> `...5`
## • `` -> `...6`
## • `` -> `...8`

We need to skip the first line of the data.

more_data <- read_excel("data/transit-data.xlsx", sheet = 'transport data', skip=1)

The column names are human readable but not very handy for coding. There is a quick way to solve this.

Pro Tip: Only use underscores or dots in column names. This will make your life easier when coding.

more_data <- read_excel("data/transit-data.xlsx", sheet = 'transport data', skip=1)
colnames(more_data)
## [1] "sender location"    "sender latitude"    "sender longitude"  
## [4] "receiver location"  "receiver latitude"  "receiver longitude"
## [7] "date"               "number of items"
colnames(more_data) <- make.names(colnames(more_data))
colnames(more_data)
## [1] "sender.location"    "sender.latitude"    "sender.longitude"  
## [4] "receiver.location"  "receiver.latitude"  "receiver.longitude"
## [7] "date"               "number.of.items"

Another, more modern, way to do this is using the janitor package (we will use this package again below).

library(janitor)
## 
## Attaching package: 'janitor'

## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test
more_data <- read_excel("data/transit-data.xlsx", sheet = 'transport data', skip=1)
names(more_data)
## [1] "sender location"    "sender latitude"    "sender longitude"  
## [4] "receiver location"  "receiver latitude"  "receiver longitude"
## [7] "date"               "number of items"
names(more_data) <- make_clean_names(names(more_data))
names(more_data)
## [1] "sender_location"    "sender_latitude"    "sender_longitude"  
## [4] "receiver_location"  "receiver_latitude"  "receiver_longitude"
## [7] "date"               "number_of_items"

Reading data from a comma separated text file

R has functions for reading in text files. However, the functions provided by tidyverse are often more powerful. When reading in data, R reports on the column types. Information about the different types and their labels can be found here: Column Types

Note: The default report on the column types is a bit disorganized in this view. It will look better in your console.

data <- read_csv('data/pakistan_intellectual_capital.csv')
## New names:
## Rows: 1142 Columns: 13
## ── Column specification
## ──────────────────────────────────────────────────────── Delimiter: "," chr
## (10): Teacher Name, University Currently Teaching, Department, Province ... dbl
## (3): ...1, S#, Year
## ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
## Specify the column types or set `show_col_types = FALSE` to quiet this message.
## • `` -> `...1`
# The read functions in the tidyverse are trying to be helpful by printing some output about the how the data was interpreted during reading. If you don't like the output, you can switch this off.
# Now the function only reports it has 'repaired` a variable name.
data <- read_csv('data/pakistan_intellectual_capital.csv', show_col_types = FALSE)
## New names:
## • `` -> `...1`

Note that this file has stray line endings that look like empty lines. This explains why the datafile has 1236 lines but the data read in has only 1142 lines.

You can check whether there were any problems during reading the data using the problems() function. In this case, no problems are reported.

# Check for problems
probs <- problems(data)
print(probs)
## # A tibble: 0 × 5
## # ℹ 5 variables: row <int>, col <int>, expected <chr>, actual <chr>, file <chr>

Tibbles

Reading in data using tidyverse functions generates tibbles, instead of R’s traditional data.frame. These are a newer version of the classic data frame with features tweaked to make life a bit easier.

print(class(data))
## [1] "spec_tbl_df" "tbl_df"      "tbl"         "data.frame"

Convert to an old school data frame.

old_school<-as.data.frame(data)

Some interesting options when using read_csv()

No column names? No problem!

If data has no column names, use col_names = FALSE

read_csv("1,2,3\n4,5,6", col_names = FALSE)
## Rows: 2 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (3): X1, X2, X3
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

## # A tibble: 2 × 3
##      X1    X2    X3
##   <dbl> <dbl> <dbl>
## 1     1     2     3
## 2     4     5     6

You can also directly set the column names in this case.

read_csv("1,2,3\n4,5,6", col_names = c("x", "y", "z"))
## Rows: 2 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (3): x, y, z
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

## # A tibble: 2 × 3
##       x     y     z
##   <dbl> <dbl> <dbl>
## 1     1     2     3
## 2     4     5     6

Specifying missing data

Another option that commonly needs tweaking is na: this specifies the value (or values) that are used to represent missing values in your file:

read_csv("a,b,c\n1,2,.", na = ".")
## Rows: 1 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (2): a, b
## lgl (1): c
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

## # A tibble: 1 × 3
##       a     b c    
##   <dbl> <dbl> <lgl>
## 1     1     2 NA

You can also specify multiple values that should be considered as missing data.

data <- read_csv("a,b,c\n1,2,.\n5,6,NA\n8,0,1", na = c("NA", "."))
## Rows: 3 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (3): a, b, c
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(data)
## # A tibble: 3 × 3
##       a     b     c
##   <dbl> <dbl> <dbl>
## 1     1     2    NA
## 2     5     6    NA
## 3     8     0     1

Reading from URL

You can read data directly from a URL as well.

url<-'https://raw.githubusercontent.com/dvanderelst-python-class/python-class/fall2022/11_Pandas_Statistics/data/pizzasize.csv'
data <- read_csv(url)
## Rows: 250 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): Store, CrustDescription, Topping
## dbl (2): ID, Diameter
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Getting rid of that pesky index column

Many datasets have a first column that is some index. This is typically not needed. And sometimes, these aren’t given names in the data file. Here is a way to avoid reading these in.

data <- read_csv('data/pakistan_intellectual_capital.csv')
## New names:
## Rows: 1142 Columns: 13
## ── Column specification
## ──────────────────────────────────────────────────────── Delimiter: "," chr
## (10): Teacher Name, University Currently Teaching, Department, Province ... dbl
## (3): ...1, S#, Year
## ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
## Specify the column types or set `show_col_types = FALSE` to quiet this message.
## • `` -> `...1`
names(data)
##  [1] "...1"                                     
##  [2] "S#"                                       
##  [3] "Teacher Name"                             
##  [4] "University Currently Teaching"            
##  [5] "Department"                               
##  [6] "Province University Located"              
##  [7] "Designation"                              
##  [8] "Terminal Degree"                          
##  [9] "Graduated from"                           
## [10] "Country"                                  
## [11] "Year"                                     
## [12] "Area of Specialization/Research Interests"
## [13] "Other Information"
# Don't read the first column
data <- read_csv('data/pakistan_intellectual_capital.csv', col_select = -1)
## New names:
## Rows: 1142 Columns: 12
## ── Column specification
## ──────────────────────────────────────────────────────── Delimiter: "," chr
## (10): Teacher Name, University Currently Teaching, Department, Province ... dbl
## (2): S#, Year
## ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
## Specify the column types or set `show_col_types = FALSE` to quiet this message.
## • `` -> `...1`
names(data)
##  [1] "S#"                                       
##  [2] "Teacher Name"                             
##  [3] "University Currently Teaching"            
##  [4] "Department"                               
##  [5] "Province University Located"              
##  [6] "Designation"                              
##  [7] "Terminal Degree"                          
##  [8] "Graduated from"                           
##  [9] "Country"                                  
## [10] "Year"                                     
## [11] "Area of Specialization/Research Interests"
## [12] "Other Information"

Reading files not separated by commas

read_csv() expects comma separated field. For data separated by other characters use these:

Exploring data

After reading in the data, it’s a good idea to explore the data. Use either the environment tab in Rstudio or use functions that give you some info about your data.

“This is a critical step that should always be performed. It is simple but it is vital. You should make numerical summaries such as means, standard deviations (SDs), maximum and minimum, correlations and whatever else is appropriate to the specific dataset. Equally important are graphical summaries.”

(Faraway, J. J. (2004). Linear models with R. Chapman and Hall/CRC.).

One of the most simple things you can do is inspect your data using a text editor. When text files are large, it’s a good idea to get an editor that can handle large files. I suggest Sublime Text Editor. This editor can handle much larger files than you can open (as text) in Rstudio. Keep the file open while you’re cleaning it!

There are many text editors out there. Shop around and find one you like. Some other suggestions:

Let’s read in some data:

data <- read_csv('data/pakistan_intellectual_capital.csv')
## New names:
## Rows: 1142 Columns: 13
## ── Column specification
## ──────────────────────────────────────────────────────── Delimiter: "," chr
## (10): Teacher Name, University Currently Teaching, Department, Province ... dbl
## (3): ...1, S#, Year
## ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
## Specify the column types or set `show_col_types = FALSE` to quiet this message.
## • `` -> `...1`

Column names

names(data)
##  [1] "...1"                                     
##  [2] "S#"                                       
##  [3] "Teacher Name"                             
##  [4] "University Currently Teaching"            
##  [5] "Department"                               
##  [6] "Province University Located"              
##  [7] "Designation"                              
##  [8] "Terminal Degree"                          
##  [9] "Graduated from"                           
## [10] "Country"                                  
## [11] "Year"                                     
## [12] "Area of Specialization/Research Interests"
## [13] "Other Information"

Head and tail

head(data)
## # A tibble: 6 × 13
##    ...1  `S#` `Teacher Name`      `University Currently Teaching`     Department
##   <dbl> <dbl> <chr>               <chr>                               <chr>     
## 1     2     3 Dr. Abdul Basit     University of Balochistan           Computer …
## 2     4     5 Dr. Waheed Noor     University of Balochistan           Computer …
## 3     5     6 Dr. Junaid Baber    University of Balochistan           Computer …
## 4     6     7 Dr. Maheen Bakhtyar University of Balochistan           Computer …
## 5    24    25 Samina Azim         Sardar Bahadur Khan Women's Univer… Computer …
## 6    25    26 Nausheed Saeed      Sardar Bahadur Khan Women's Univer… Computer …
## # ℹ 8 more variables: `Province University Located` <chr>, Designation <chr>,
## #   `Terminal Degree` <chr>, `Graduated from` <chr>, Country <chr>, Year <dbl>,
## #   `Area of Specialization/Research Interests` <chr>,
## #   `Other Information` <chr>
tail(data)
## # A tibble: 6 × 13
##    ...1  `S#` `Teacher Name`         `University Currently Teaching` Department 
##   <dbl> <dbl> <chr>                  <chr>                           <chr>      
## 1  1971  1972 Dr. Khalid J Siddiqui  Ghulam Ishaq Khan Institute     Computer S…
## 2  1974  1975 Dr. Ahmar Rashid       Ghulam Ishaq Khan Institute     Computer S…
## 3  1975  1976 Dr. Fawad Hussain      Ghulam Ishaq Khan Institute     Computer S…
## 4  1977  1978 Dr. Rashad M Jillani   Ghulam Ishaq Khan Institute     Computer S…
## 5  1979  1980 Dr. Shahabuddin Ansari Ghulam Ishaq Khan Institute     Computer S…
## 6  1980  1981 Dr. Sajid Anwar        Ghulam Ishaq Khan Institute     Computer S…
## # ℹ 8 more variables: `Province University Located` <chr>, Designation <chr>,
## #   `Terminal Degree` <chr>, `Graduated from` <chr>, Country <chr>, Year <dbl>,
## #   `Area of Specialization/Research Interests` <chr>,
## #   `Other Information` <chr>

Summary

Get a quick summary

summary(data)
##       ...1              S#         Teacher Name      
##  Min.   :   2.0   Min.   :   3.0   Length:1142       
##  1st Qu.: 689.2   1st Qu.: 690.2   Class :character  
##  Median :1087.5   Median :1088.5   Mode  :character  
##  Mean   :1054.4   Mean   :1055.4                     
##  3rd Qu.:1476.8   3rd Qu.:1477.8                     
##  Max.   :1980.0   Max.   :1981.0                     
##                                                      
##  University Currently Teaching  Department        Province University Located
##  Length:1142                   Length:1142        Length:1142                
##  Class :character              Class :character   Class :character           
##  Mode  :character              Mode  :character   Mode  :character           
##                                                                              
##                                                                              
##                                                                              
##                                                                              
##  Designation        Terminal Degree    Graduated from       Country         
##  Length:1142        Length:1142        Length:1142        Length:1142       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##       Year      Area of Specialization/Research Interests Other Information 
##  Min.   :1983   Length:1142                               Length:1142       
##  1st Qu.:2008   Class :character                          Class :character  
##  Median :2012   Mode  :character                          Mode  :character  
##  Mean   :2010                                                               
##  3rd Qu.:2014                                                               
##  Max.   :2018                                                               
##  NA's   :653

Glimpse

The function glimpse gives you also a quick overview:

glimpse(data)
## Rows: 1,142
## Columns: 13
## $ ...1                                        <dbl> 2, 4, 5, 6, 24, 25, 26, 27…
## $ `S#`                                        <dbl> 3, 5, 6, 7, 25, 26, 27, 28…
## $ `Teacher Name`                              <chr> "Dr. Abdul Basit", "Dr. Wa…
## $ `University Currently Teaching`             <chr> "University of Balochistan…
## $ Department                                  <chr> "Computer Science & IT", "…
## $ `Province University Located`               <chr> "Balochistan", "Balochista…
## $ Designation                                 <chr> "Assistant Professor", "As…
## $ `Terminal Degree`                           <chr> "PhD", "PhD", "PhD", "PhD"…
## $ `Graduated from`                            <chr> "Asian Institute of Techno…
## $ Country                                     <chr> "Thailand", "Thailand", "T…
## $ Year                                        <dbl> NA, NA, NA, NA, 2005, 2008…
## $ `Area of Specialization/Research Interests` <chr> "Software Engineering & DB…
## $ `Other Information`                         <chr> NA, NA, NA, NA, NA, NA, NA…

Skim

And what about this beautiful exploratory tool?

library(skimr)
skim(data)
   
Name data
Number of rows 1142
Number of columns 13
_______________________  
Column type frequency:  
character 10
numeric 3
________________________  
Group variables None

Data summary

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
Teacher Name 0 1.00 5 40 0 1133 0
University Currently Teaching 0 1.00 7 71 0 63 0
Department 0 1.00 7 43 0 17 0
Province University Located 0 1.00 3 11 0 5 0
Designation 19 0.98 4 39 0 46 0
Terminal Degree 4 1.00 2 30 0 41 0
Graduated from 0 1.00 3 88 0 347 0
Country 0 1.00 2 18 0 35 0
Area of Specialization/Research Interests 519 0.55 3 477 0 570 0
Other Information 1018 0.11 8 132 0 51 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
…1 0 1.00 1054.35 520.20 2 689.25 1087.5 1476.75 1980 ▅▅▆▇▅
S# 0 1.00 1055.35 520.20 3 690.25 1088.5 1477.75 1981 ▅▅▆▇▅
Year 653 0.43 2010.46 5.58 1983 2008.00 2012.0 2014.00 2018 ▁▁▁▆▇
skim_tee(data)
## ── Data Summary ────────────────────────
##                            Values
## Name                       data  
## Number of rows             1142  
## Number of columns          13    
## _______________________          
## Column type frequency:           
##   character                10    
##   numeric                  3     
## ________________________         
## Group variables            None  
## 
## ── Variable type: character ────────────────────────────────────────────────────
##    skim_variable                             n_missing complete_rate min max
##  1 Teacher Name                                      0         1       5  40
##  2 University Currently Teaching                     0         1       7  71
##  3 Department                                        0         1       7  43
##  4 Province University Located                       0         1       3  11
##  5 Designation                                      19         0.983   4  39
##  6 Terminal Degree                                   4         0.996   2  30
##  7 Graduated from                                    0         1       3  88
##  8 Country                                           0         1       2  18
##  9 Area of Specialization/Research Interests       519         0.546   3 477
## 10 Other Information                              1018         0.109   8 132
##    empty n_unique whitespace
##  1     0     1133          0
##  2     0       63          0
##  3     0       17          0
##  4     0        5          0
##  5     0       46          0
##  6     0       41          0
##  7     0      347          0
##  8     0       35          0
##  9     0      570          0
## 10     0       51          0
## 
## ── Variable type: numeric ──────────────────────────────────────────────────────
##   skim_variable n_missing complete_rate  mean     sd   p0   p25   p50   p75 p100
## 1 ...1                  0         1     1054. 520.      2  689. 1088. 1477. 1980
## 2 S#                    0         1     1055. 520.      3  690. 1088. 1478. 1981
## 3 Year                653         0.428 2010.   5.58 1983 2008  2012  2014  2018
##   hist 
## 1 ▅▅▆▇▅
## 2 ▅▅▆▇▅
## 3 ▁▁▁▆▇

Tabulating

You can use the janitor package to get a quick overview of categorical variables, i.e., tabulating categorical variables. It’s somewhat confusing that the percent column is not a percentage but a proportion.

library(janitor)
tabyl(data, `Province University Located`)
##  Province University Located   n    percent
##                  Balochistan  16 0.01401051
##                      Capital 247 0.21628722
##                          KPK  86 0.07530648
##                       Punjab 449 0.39316988
##                        Sindh 344 0.30122592

Addressing a single variable

A simple way to get variables from a tibble is using the $ operator.

my_variable <- data$Year
summary(my_variable)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    1983    2008    2012    2010    2014    2018     653

However, this does not work if your variables names have names with a space (or other weird characters). This does work, however,

my_variable <- data['Terminal Degree']

This table gives some more details about variables can be addressed.

Expression Works Returns / Behavior
data$Year Yes Column vector
data[["Year"]] Yes Column vector
data["Year"] Yes One-column data.frame / tibble
data$Terminal Degree No Error: space in name — not syntactically valid
data$'Terminal Degree' No Error: quotes not allowed after $
data$"Terminal Degree" No Error: same as above
data[terminal degree] No Error: object ‘terminal’ not found — needs quotes
data[["Terminal Degree"]] Yes Column vector
data["Terminal Degree"] Yes One-column data.frame / tibble

Creating new variables

Often, you will want to create new variables based on existing ones to make subsequent processing easier.

data['very_simple'] <- 1
data['very_simple']
## # A tibble: 1,142 × 1
##    very_simple
##          <dbl>
##  1           1
##  2           1
##  3           1
##  4           1
##  5           1
##  6           1
##  7           1
##  8           1
##  9           1
## 10           1
## # ℹ 1,132 more rows

However, we can do more interesting things.

data['has_phd'] <- data['Terminal Degree'] == 'PhD'
data['has_phd']
## # A tibble: 1,142 × 1
##    has_phd
##    <lgl>  
##  1 TRUE   
##  2 TRUE   
##  3 TRUE   
##  4 TRUE   
##  5 FALSE  
##  6 FALSE  
##  7 FALSE  
##  8 FALSE  
##  9 FALSE  
## 10 FALSE  
## # ℹ 1,132 more rows

We can calculate new variables.

data['years_since_graduation'] <- 2022 - data$Year

Creating new variables using mutate

The function mutate is the Tidyverse-approach to creating new variables. Let’s use the pizza data for this demo.

The stuff below looks complex. But once you understand the examples, you can create quite complex new variables based on existing ones. These are essentially a few recipes that you can mix and match to create new variables and which can be very powerful (and save you a lot of time messing around in excel).

# Here, I read in data from another repository on GitHub, which I created for another class.
url<-'https://raw.githubusercontent.com/dvanderelst-python-class/python-class/fall2022/11_Pandas_Statistics/data/pizzasize.csv'
data <- read_csv(url)
## Rows: 250 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): Store, CrustDescription, Topping
## dbl (2): ID, Diameter
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Mathematical Operations on Existing Columns

Here are number of slides (not by me) that also explain this: https://tinyurl.com/4a6hek9f

Let’s create a new column giving the surface area of each column.

data <- mutate(data, area = pi * (Diameter/2)^2)
data
## # A tibble: 250 × 6
##       ID Store     CrustDescription Topping       Diameter  area
##    <dbl> <chr>     <chr>            <chr>            <dbl> <dbl>
##  1     1 Dominos   ThinNCrispy      Supreme           29.4  679.
##  2     2 Dominos   ThinNCrispy      BBQMeatlovers     29.6  690.
##  3     3 Dominos   DeepPan          Hawaiian          27.1  575.
##  4     4 Dominos   ThinNCrispy      Supreme           27.4  592.
##  5     5 Dominos   ClassicCrust     Hawaiian          26.6  555.
##  6     6 Dominos   DeepPan          BBQMeatlovers     27.2  579.
##  7     7 EagleBoys MidCrust         SuperSupremo      29.2  667.
##  8     8 EagleBoys DeepPan          Hawaiian          28.8  651.
##  9     9 EagleBoys ThinCrust        BBQMeatlovers     30.0  709.
## 10    10 EagleBoys DeepPan          BBQMeatlovers     29.4  678.
## # ℹ 240 more rows

For fun: transmute only keeps the new variable. Usually, it’s better to keep all variables.

data <- transmute(data, area = pi * (Diameter/2)^2)
data
## # A tibble: 250 × 1
##     area
##    <dbl>
##  1  679.
##  2  690.
##  3  575.
##  4  592.
##  5  555.
##  6  579.
##  7  667.
##  8  651.
##  9  709.
## 10  678.
## # ℹ 240 more rows

Logical/Conditional Variables using case_when

Tip: You can use mutate to created categories in your data.

# Step 1: Read the data
data <- read_delim("data/cars.txt", delim = " ")
## Rows: 93 Columns: 26
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: " "
## chr  (6): make, model, type, cylinders, rearseat, luggage
## dbl (20): min_price, mid_price, max_price, mpg_city, mpg_hgw, airbag, drive,...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Step 2: Create mpg_difference
data <- mutate(data, mpg_difference = mpg_hgw - mpg_city)

# Step 3: Create fuel_efficiency_category based on mpg_difference
data <- mutate(data,
  fuel_efficiency_category = case_when(
    mpg_difference < 5 ~ "Efficient",
    mpg_difference >= 5 & mpg_difference <= 10 ~ "Moderate",
    mpg_difference > 10 ~ "Inefficient"
  )
)

# Step 4: Preview the result
head(data)
## # A tibble: 6 × 28
##   make  model  type  min_price mid_price max_price mpg_city mpg_hgw airbag drive
##   <chr> <chr>  <chr>     <dbl>     <dbl>     <dbl>    <dbl>   <dbl>  <dbl> <dbl>
## 1 Acura Integ… Small      12.9      15.9      18.8       25      31      0     1
## 2 Acura Legend Mids…      29.2      33.9      38.7       18      25      2     1
## 3 Audi  90     Comp…      25.9      29.1      32.3       20      26      1     1
## 4 Audi  100    Mids…      30.8      37.7      44.6       19      26      2     1
## 5 BMW   535i   Mids…      23.7      30        36.2       22      30      1     0
## 6 Buick Centu… Mids…      14.2      15.7      17.3       22      31      1     1
## # ℹ 18 more variables: cylinders <chr>, engine <dbl>, horsepower <dbl>,
## #   rpm <dbl>, rpmile <dbl>, manual <dbl>, tank <dbl>, passengers <dbl>,
## #   length <dbl>, wheelbase <dbl>, width <dbl>, uturn <dbl>, rearseat <chr>,
## #   luggage <chr>, weight <dbl>, domestic <dbl>, mpg_difference <dbl>,
## #   fuel_efficiency_category <chr>

This code creates a new variable fuel_efficiency_category based on the values in mpg_difference. Step 3 is a bit dense. Let’s break it down:

Here is another way of doing the same thing which might look cleaner:

# Step 1: Read and prepare data
data <- read_delim("data/cars.txt", delim = " ")
## Rows: 93 Columns: 26
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: " "
## chr  (6): make, model, type, cylinders, rearseat, luggage
## dbl (20): min_price, mid_price, max_price, mpg_city, mpg_hgw, airbag, drive,...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
data <- mutate(data, mpg_difference = mpg_hgw - mpg_city)

# Step 2: Build the case_when() expression separately
efficiency_labels <- case_when(
  data$mpg_difference < 5 ~ "Efficient",
  data$mpg_difference >= 5 & data$mpg_difference <= 10 ~ "Moderate",
  data$mpg_difference > 10 ~ "Inefficient"
)

# Step 3: Inspect what case_when() produced
head(efficiency_labels)
## [1] "Moderate" "Moderate" "Moderate" "Moderate" "Moderate" "Moderate"
# → A character vector of labels, same length as data

# Step 4: Add that vector to the dataframe
data$fuel_efficiency_category <- efficiency_labels

# Step 5: Check the result
head(data)
## # A tibble: 6 × 28
##   make  model  type  min_price mid_price max_price mpg_city mpg_hgw airbag drive
##   <chr> <chr>  <chr>     <dbl>     <dbl>     <dbl>    <dbl>   <dbl>  <dbl> <dbl>
## 1 Acura Integ… Small      12.9      15.9      18.8       25      31      0     1
## 2 Acura Legend Mids…      29.2      33.9      38.7       18      25      2     1
## 3 Audi  90     Comp…      25.9      29.1      32.3       20      26      1     1
## 4 Audi  100    Mids…      30.8      37.7      44.6       19      26      2     1
## 5 BMW   535i   Mids…      23.7      30        36.2       22      30      1     0
## 6 Buick Centu… Mids…      14.2      15.7      17.3       22      31      1     1
## # ℹ 18 more variables: cylinders <chr>, engine <dbl>, horsepower <dbl>,
## #   rpm <dbl>, rpmile <dbl>, manual <dbl>, tank <dbl>, passengers <dbl>,
## #   length <dbl>, wheelbase <dbl>, width <dbl>, uturn <dbl>, rearseat <chr>,
## #   luggage <chr>, weight <dbl>, domestic <dbl>, mpg_difference <dbl>,
## #   fuel_efficiency_category <chr>

Logical/Conditional Variables using if_else

data <- mutate(data, small_engine = if_else(engine < 2, 1, 0))
head(data)
## # A tibble: 6 × 29
##   make  model  type  min_price mid_price max_price mpg_city mpg_hgw airbag drive
##   <chr> <chr>  <chr>     <dbl>     <dbl>     <dbl>    <dbl>   <dbl>  <dbl> <dbl>
## 1 Acura Integ… Small      12.9      15.9      18.8       25      31      0     1
## 2 Acura Legend Mids…      29.2      33.9      38.7       18      25      2     1
## 3 Audi  90     Comp…      25.9      29.1      32.3       20      26      1     1
## 4 Audi  100    Mids…      30.8      37.7      44.6       19      26      2     1
## 5 BMW   535i   Mids…      23.7      30        36.2       22      30      1     0
## 6 Buick Centu… Mids…      14.2      15.7      17.3       22      31      1     1
## # ℹ 19 more variables: cylinders <chr>, engine <dbl>, horsepower <dbl>,
## #   rpm <dbl>, rpmile <dbl>, manual <dbl>, tank <dbl>, passengers <dbl>,
## #   length <dbl>, wheelbase <dbl>, width <dbl>, uturn <dbl>, rearseat <chr>,
## #   luggage <chr>, weight <dbl>, domestic <dbl>, mpg_difference <dbl>,
## #   fuel_efficiency_category <chr>, small_engine <dbl>
data <- mutate(data, `small tank` = if_else(tank < 15, 1, 0))
head(data)
## # A tibble: 6 × 30
##   make  model  type  min_price mid_price max_price mpg_city mpg_hgw airbag drive
##   <chr> <chr>  <chr>     <dbl>     <dbl>     <dbl>    <dbl>   <dbl>  <dbl> <dbl>
## 1 Acura Integ… Small      12.9      15.9      18.8       25      31      0     1
## 2 Acura Legend Mids…      29.2      33.9      38.7       18      25      2     1
## 3 Audi  90     Comp…      25.9      29.1      32.3       20      26      1     1
## 4 Audi  100    Mids…      30.8      37.7      44.6       19      26      2     1
## 5 BMW   535i   Mids…      23.7      30        36.2       22      30      1     0
## 6 Buick Centu… Mids…      14.2      15.7      17.3       22      31      1     1
## # ℹ 20 more variables: cylinders <chr>, engine <dbl>, horsepower <dbl>,
## #   rpm <dbl>, rpmile <dbl>, manual <dbl>, tank <dbl>, passengers <dbl>,
## #   length <dbl>, wheelbase <dbl>, width <dbl>, uturn <dbl>, rearseat <chr>,
## #   luggage <chr>, weight <dbl>, domestic <dbl>, mpg_difference <dbl>,
## #   fuel_efficiency_category <chr>, small_engine <dbl>, `small tank` <dbl>

Logical/Conditional Variables using cut

The cut function is useful for creating categories based on continuous variables. You can find more information about the cut() function here.

data <- mutate(data, category = cut(min_price,
                              breaks = c(0, 10, 20, 30, 40),
                              labels = c("Low", "Medium", "High", "Very High")))
glimpse(data)
## Rows: 93
## Columns: 31
## $ make                     <chr> "Acura", "Acura", "Audi", "Audi", "BMW", "Bui…
## $ model                    <chr> "Integra", "Legend", "90", "100", "535i", "Ce…
## $ type                     <chr> "Small", "Midsize", "Compact", "Midsize", "Mi…
## $ min_price                <dbl> 12.9, 29.2, 25.9, 30.8, 23.7, 14.2, 19.9, 22.…
## $ mid_price                <dbl> 15.9, 33.9, 29.1, 37.7, 30.0, 15.7, 20.8, 23.…
## $ max_price                <dbl> 18.8, 38.7, 32.3, 44.6, 36.2, 17.3, 21.7, 24.…
## $ mpg_city                 <dbl> 25, 18, 20, 19, 22, 22, 19, 16, 19, 16, 16, 2…
## $ mpg_hgw                  <dbl> 31, 25, 26, 26, 30, 31, 28, 25, 27, 25, 25, 3…
## $ airbag                   <dbl> 0, 2, 1, 2, 1, 1, 1, 1, 1, 1, 2, 0, 1, 2, 0, …
## $ drive                    <dbl> 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, …
## $ cylinders                <chr> "4", "6", "6", "6", "4", "4", "6", "6", "6", …
## $ engine                   <dbl> 1.8, 3.2, 2.8, 2.8, 3.5, 2.2, 3.8, 5.7, 3.8, …
## $ horsepower               <dbl> 140, 200, 172, 172, 208, 110, 170, 180, 170, …
## $ rpm                      <dbl> 6300, 5500, 5500, 5500, 5700, 5200, 4800, 400…
## $ rpmile                   <dbl> 2890, 2335, 2280, 2535, 2545, 2565, 1570, 132…
## $ manual                   <dbl> 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, …
## $ tank                     <dbl> 13.2, 18.0, 16.9, 21.1, 21.1, 16.4, 18.0, 23.…
## $ passengers               <dbl> 5, 5, 5, 6, 4, 6, 6, 6, 5, 6, 5, 5, 5, 4, 6, …
## $ length                   <dbl> 177, 195, 180, 193, 186, 189, 200, 216, 198, …
## $ wheelbase                <dbl> 102, 115, 102, 106, 109, 105, 111, 116, 108, …
## $ width                    <dbl> 68, 71, 67, 70, 69, 69, 74, 78, 73, 73, 74, 6…
## $ uturn                    <dbl> 37, 38, 37, 37, 39, 41, 42, 45, 41, 43, 44, 3…
## $ rearseat                 <chr> "26.5", "30.0", "28.0", "31.0", "27.0", "28.0…
## $ luggage                  <chr> "11", "15", "14", "17", "13", "16", "17", "21…
## $ weight                   <dbl> 2705, 3560, 3375, 3405, 3640, 2880, 3470, 410…
## $ domestic                 <dbl> 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ mpg_difference           <dbl> 6, 7, 6, 7, 8, 9, 9, 9, 8, 9, 9, 11, 9, 9, 8,…
## $ fuel_efficiency_category <chr> "Moderate", "Moderate", "Moderate", "Moderate…
## $ small_engine             <dbl> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ `small tank`             <dbl> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ category                 <fct> Medium, High, High, Very High, High, Medium, …

Recode values

recode() is a method to change the observations within a variable. It can be used together with mutate().

data <- mutate(data,
               category = recode(cylinders,
                                 '3' = "Low Power",
                                 '4' = "Low Power",
                                 '6' = "Moderate Power",
                                 '8' = "High Power"))
glimpse(data)
## Rows: 93
## Columns: 31
## $ make                     <chr> "Acura", "Acura", "Audi", "Audi", "BMW", "Bui…
## $ model                    <chr> "Integra", "Legend", "90", "100", "535i", "Ce…
## $ type                     <chr> "Small", "Midsize", "Compact", "Midsize", "Mi…
## $ min_price                <dbl> 12.9, 29.2, 25.9, 30.8, 23.7, 14.2, 19.9, 22.…
## $ mid_price                <dbl> 15.9, 33.9, 29.1, 37.7, 30.0, 15.7, 20.8, 23.…
## $ max_price                <dbl> 18.8, 38.7, 32.3, 44.6, 36.2, 17.3, 21.7, 24.…
## $ mpg_city                 <dbl> 25, 18, 20, 19, 22, 22, 19, 16, 19, 16, 16, 2…
## $ mpg_hgw                  <dbl> 31, 25, 26, 26, 30, 31, 28, 25, 27, 25, 25, 3…
## $ airbag                   <dbl> 0, 2, 1, 2, 1, 1, 1, 1, 1, 1, 2, 0, 1, 2, 0, …
## $ drive                    <dbl> 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, …
## $ cylinders                <chr> "4", "6", "6", "6", "4", "4", "6", "6", "6", …
## $ engine                   <dbl> 1.8, 3.2, 2.8, 2.8, 3.5, 2.2, 3.8, 5.7, 3.8, …
## $ horsepower               <dbl> 140, 200, 172, 172, 208, 110, 170, 180, 170, …
## $ rpm                      <dbl> 6300, 5500, 5500, 5500, 5700, 5200, 4800, 400…
## $ rpmile                   <dbl> 2890, 2335, 2280, 2535, 2545, 2565, 1570, 132…
## $ manual                   <dbl> 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, …
## $ tank                     <dbl> 13.2, 18.0, 16.9, 21.1, 21.1, 16.4, 18.0, 23.…
## $ passengers               <dbl> 5, 5, 5, 6, 4, 6, 6, 6, 5, 6, 5, 5, 5, 4, 6, …
## $ length                   <dbl> 177, 195, 180, 193, 186, 189, 200, 216, 198, …
## $ wheelbase                <dbl> 102, 115, 102, 106, 109, 105, 111, 116, 108, …
## $ width                    <dbl> 68, 71, 67, 70, 69, 69, 74, 78, 73, 73, 74, 6…
## $ uturn                    <dbl> 37, 38, 37, 37, 39, 41, 42, 45, 41, 43, 44, 3…
## $ rearseat                 <chr> "26.5", "30.0", "28.0", "31.0", "27.0", "28.0…
## $ luggage                  <chr> "11", "15", "14", "17", "13", "16", "17", "21…
## $ weight                   <dbl> 2705, 3560, 3375, 3405, 3640, 2880, 3470, 410…
## $ domestic                 <dbl> 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ mpg_difference           <dbl> 6, 7, 6, 7, 8, 9, 9, 9, 8, 9, 9, 11, 9, 9, 8,…
## $ fuel_efficiency_category <chr> "Moderate", "Moderate", "Moderate", "Moderate…
## $ small_engine             <dbl> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ `small tank`             <dbl> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ category                 <chr> "Low Power", "Moderate Power", "Moderate Powe…

Choosing the right function inside mutate()

We have seen that using the cut(), case_when(), if_else(), and recode() functions inside mutate() allows you to create new variables based on different types of conditions. These functions overlap in what they can do but each have their own best cases in which to use them. Here is a summary and overview.

Function Use when you need to… Think of it as… Best for… Notes / Tip
recode() Replace specific known values with new ones a look-up table Renaming or relabeling categories (e.g., “M” → “Male”) Fastest and clearest when you already know all the possible old values.
if_else() Choose between two categories or outcomes based on a condition a simple yes/no switch Binary variables (e.g., pass/fail, above/below threshold) Stricter version of base R’s ifelse() — safer types. If you have more than two categories, use case_when().
case_when() Assign values based on several logical conditions a series of if-else statements Multi-category classification (e.g., low/medium/high) Clearer and more readable than nested if_else(); stop at the first TRUE condition.
cut() Split a numeric variable into ranges an automatic binning tool Turning continuous data into ordinal categories (e.g., income brackets, age groups) Requires setting breaks and optional labels; returns a factor.

Exercises

It might be a good idea to start a new R script while working on these exercises.

Exercise 1

Exercise 2

Download the wages1833.csv data file (from the data folder) and save it to your computer. This file contains data on the wages of Lancashire cotton factory workers in 1833. For each age category, the file lists the following:

More info on the data can be found in this paper: Boot, H.M. 1995. How Skilled Were the Lancashire Cotton Factory Workers in 1833? Economic History Review 48: 283-303.

Write a script that does the following:

Exercise 3

Read in the wages1833.csv data file using the read_csv() function. Create a new column called age_group, which categorizes workers into the following groups based on their age:

Exercise 4

Read in the wages1833.csv file using read_csv(). Create a new column called wage_difference, which gives the absolute difference between the average male wage (mwage) and the average female wage (fwage). Create a second new column called higher_wage, which indicates whether males or females earned more on average in each row (“male” or “female”). Print the first few rows of the data.

Possible solutions

Exercise 1

data <- read_delim("data/cars.txt", delim = " ")
## Rows: 93 Columns: 26
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: " "
## chr  (6): make, model, type, cylinders, rearseat, luggage
## dbl (20): min_price, mid_price, max_price, mpg_city, mpg_hgw, airbag, drive,...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(data)
## # A tibble: 6 × 26
##   make  model  type  min_price mid_price max_price mpg_city mpg_hgw airbag drive
##   <chr> <chr>  <chr>     <dbl>     <dbl>     <dbl>    <dbl>   <dbl>  <dbl> <dbl>
## 1 Acura Integ… Small      12.9      15.9      18.8       25      31      0     1
## 2 Acura Legend Mids…      29.2      33.9      38.7       18      25      2     1
## 3 Audi  90     Comp…      25.9      29.1      32.3       20      26      1     1
## 4 Audi  100    Mids…      30.8      37.7      44.6       19      26      2     1
## 5 BMW   535i   Mids…      23.7      30        36.2       22      30      1     0
## 6 Buick Centu… Mids…      14.2      15.7      17.3       22      31      1     1
## # ℹ 16 more variables: cylinders <chr>, engine <dbl>, horsepower <dbl>,
## #   rpm <dbl>, rpmile <dbl>, manual <dbl>, tank <dbl>, passengers <dbl>,
## #   length <dbl>, wheelbase <dbl>, width <dbl>, uturn <dbl>, rearseat <chr>,
## #   luggage <chr>, weight <dbl>, domestic <dbl>
colnames(data)
##  [1] "make"       "model"      "type"       "min_price"  "mid_price" 
##  [6] "max_price"  "mpg_city"   "mpg_hgw"    "airbag"     "drive"     
## [11] "cylinders"  "engine"     "horsepower" "rpm"        "rpmile"    
## [16] "manual"     "tank"       "passengers" "length"     "wheelbase" 
## [21] "width"      "uturn"      "rearseat"   "luggage"    "weight"    
## [26] "domestic"
data <- mutate(data, mpg_difference = mpg_hgw - mpg_city)
head(data)
## # A tibble: 6 × 27
##   make  model  type  min_price mid_price max_price mpg_city mpg_hgw airbag drive
##   <chr> <chr>  <chr>     <dbl>     <dbl>     <dbl>    <dbl>   <dbl>  <dbl> <dbl>
## 1 Acura Integ… Small      12.9      15.9      18.8       25      31      0     1
## 2 Acura Legend Mids…      29.2      33.9      38.7       18      25      2     1
## 3 Audi  90     Comp…      25.9      29.1      32.3       20      26      1     1
## 4 Audi  100    Mids…      30.8      37.7      44.6       19      26      2     1
## 5 BMW   535i   Mids…      23.7      30        36.2       22      30      1     0
## 6 Buick Centu… Mids…      14.2      15.7      17.3       22      31      1     1
## # ℹ 17 more variables: cylinders <chr>, engine <dbl>, horsepower <dbl>,
## #   rpm <dbl>, rpmile <dbl>, manual <dbl>, tank <dbl>, passengers <dbl>,
## #   length <dbl>, wheelbase <dbl>, width <dbl>, uturn <dbl>, rearseat <chr>,
## #   luggage <chr>, weight <dbl>, domestic <dbl>, mpg_difference <dbl>
glimpse(data)
## Rows: 93
## Columns: 27
## $ make           <chr> "Acura", "Acura", "Audi", "Audi", "BMW", "Buick", "Buic…
## $ model          <chr> "Integra", "Legend", "90", "100", "535i", "Century", "L…
## $ type           <chr> "Small", "Midsize", "Compact", "Midsize", "Midsize", "M…
## $ min_price      <dbl> 12.9, 29.2, 25.9, 30.8, 23.7, 14.2, 19.9, 22.6, 26.3, 3…
## $ mid_price      <dbl> 15.9, 33.9, 29.1, 37.7, 30.0, 15.7, 20.8, 23.7, 26.3, 3…
## $ max_price      <dbl> 18.8, 38.7, 32.3, 44.6, 36.2, 17.3, 21.7, 24.9, 26.3, 3…
## $ mpg_city       <dbl> 25, 18, 20, 19, 22, 22, 19, 16, 19, 16, 16, 25, 25, 19,…
## $ mpg_hgw        <dbl> 31, 25, 26, 26, 30, 31, 28, 25, 27, 25, 25, 36, 34, 28,…
## $ airbag         <dbl> 0, 2, 1, 2, 1, 1, 1, 1, 1, 1, 2, 0, 1, 2, 0, 0, 0, 1, 1…
## $ drive          <dbl> 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 2, 0, 0…
## $ cylinders      <chr> "4", "6", "6", "6", "4", "4", "6", "6", "6", "8", "8", …
## $ engine         <dbl> 1.8, 3.2, 2.8, 2.8, 3.5, 2.2, 3.8, 5.7, 3.8, 4.9, 4.6, …
## $ horsepower     <dbl> 140, 200, 172, 172, 208, 110, 170, 180, 170, 200, 295, …
## $ rpm            <dbl> 6300, 5500, 5500, 5500, 5700, 5200, 4800, 4000, 4800, 4…
## $ rpmile         <dbl> 2890, 2335, 2280, 2535, 2545, 2565, 1570, 1320, 1690, 1…
## $ manual         <dbl> 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1…
## $ tank           <dbl> 13.2, 18.0, 16.9, 21.1, 21.1, 16.4, 18.0, 23.0, 18.8, 1…
## $ passengers     <dbl> 5, 5, 5, 6, 4, 6, 6, 6, 5, 6, 5, 5, 5, 4, 6, 7, 8, 6, 2…
## $ length         <dbl> 177, 195, 180, 193, 186, 189, 200, 216, 198, 206, 204, …
## $ wheelbase      <dbl> 102, 115, 102, 106, 109, 105, 111, 116, 108, 114, 111, …
## $ width          <dbl> 68, 71, 67, 70, 69, 69, 74, 78, 73, 73, 74, 66, 68, 74,…
## $ uturn          <dbl> 37, 38, 37, 37, 39, 41, 42, 45, 41, 43, 44, 38, 39, 43,…
## $ rearseat       <chr> "26.5", "30.0", "28.0", "31.0", "27.0", "28.0", "30.5",…
## $ luggage        <chr> "11", "15", "14", "17", "13", "16", "17", "21", "14", "…
## $ weight         <dbl> 2705, 3560, 3375, 3405, 3640, 2880, 3470, 4105, 3495, 3…
## $ domestic       <dbl> 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ mpg_difference <dbl> 6, 7, 6, 7, 8, 9, 9, 9, 8, 9, 9, 11, 9, 9, 8, 5, 5, 9, …

Exercise 2

data <- read_csv("data/wages1833.csv")
## New names:
## Rows: 51 Columns: 6
## ── Column specification
## ──────────────────────────────────────────────────────── Delimiter: "," dbl
## (6): ...1, age, mnum, mwage, fnum, fwage
## ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
## Specify the column types or set `show_col_types = FALSE` to quiet this message.
## • `` -> `...1`
data <- mutate(data, diff = mnum - fnum, diff_pct = 100 * (mwage - fwage) / fwage)
head(data)
## # A tibble: 6 × 8
##    ...1   age  mnum mwage  fnum fwage  diff diff_pct
##   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>    <dbl>
## 1     1    10   204  30.5   122    35    82  -12.9  
## 2     2    11   195  37.8   198    38    -3   -0.526
## 3     3    12   245  43     241    44     4   -2.27 
## 4     4    13   233  50.5   233    46     0    9.78 
## 5     5    14   256  56.5   236    59    20   -4.24 
## 6     6    15   240  63     215    68    25   -7.35
glimpse(data)
## Rows: 51
## Columns: 8
## $ ...1     <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18…
## $ age      <dbl> 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 2…
## $ mnum     <dbl> 204, 195, 245, 233, 256, 240, 204, 141, 164, 135, 92, 180, 14…
## $ mwage    <dbl> 30.5, 37.8, 43.0, 50.5, 56.5, 63.0, 83.5, 88.5, 141.0, 138.3,…
## $ fnum     <dbl> 122, 198, 241, 233, 236, 215, 256, 245, 279, 251, 209, 251, 1…
## $ fwage    <dbl> 35, 38, 44, 46, 59, 68, 72, 78, 90, 98, 97, 96, 102, 100, 101…
## $ diff     <dbl> 82, -3, 4, 0, 20, 25, -52, -104, -115, -116, -117, -71, -47, …
## $ diff_pct <dbl> -12.8571429, -0.5263158, -2.2727273, 9.7826087, -4.2372881, -…

Exercise 3

data <- read_csv("data/wages1833.csv")
## New names:
## Rows: 51 Columns: 6
## ── Column specification
## ──────────────────────────────────────────────────────── Delimiter: "," dbl
## (6): ...1, age, mnum, mwage, fnum, fwage
## ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
## Specify the column types or set `show_col_types = FALSE` to quiet this message.
## • `` -> `...1`
data <- mutate(data, age_group = case_when(
  age < 14 ~ "child",
  age >= 14 & age <= 20 ~ "young",
  age > 20 ~ "adult"
))
head(data)
## # A tibble: 6 × 7
##    ...1   age  mnum mwage  fnum fwage age_group
##   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>    
## 1     1    10   204  30.5   122    35 child    
## 2     2    11   195  37.8   198    38 child    
## 3     3    12   245  43     241    44 child    
## 4     4    13   233  50.5   233    46 child    
## 5     5    14   256  56.5   236    59 young    
## 6     6    15   240  63     215    68 young
glimpse(data)
## Rows: 51
## Columns: 7
## $ ...1      <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 1…
## $ age       <dbl> 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, …
## $ mnum      <dbl> 204, 195, 245, 233, 256, 240, 204, 141, 164, 135, 92, 180, 1…
## $ mwage     <dbl> 30.5, 37.8, 43.0, 50.5, 56.5, 63.0, 83.5, 88.5, 141.0, 138.3…
## $ fnum      <dbl> 122, 198, 241, 233, 236, 215, 256, 245, 279, 251, 209, 251, …
## $ fwage     <dbl> 35, 38, 44, 46, 59, 68, 72, 78, 90, 98, 97, 96, 102, 100, 10…
## $ age_group <chr> "child", "child", "child", "child", "young", "young", "young…

Exercise 4

data <- read_csv("data/wages1833.csv")
## New names:
## Rows: 51 Columns: 6
## ── Column specification
## ──────────────────────────────────────────────────────── Delimiter: "," dbl
## (6): ...1, age, mnum, mwage, fnum, fwage
## ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
## Specify the column types or set `show_col_types = FALSE` to quiet this message.
## • `` -> `...1`
data <- mutate(data, wage_difference = abs(mwage - fwage), higher_wage = ifelse(mwage > fwage, 'male', 'female'))
head(data)
## # A tibble: 6 × 8
##    ...1   age  mnum mwage  fnum fwage wage_difference higher_wage
##   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>           <dbl> <chr>      
## 1     1    10   204  30.5   122    35           4.5   female     
## 2     2    11   195  37.8   198    38           0.200 female     
## 3     3    12   245  43     241    44           1     female     
## 4     4    13   233  50.5   233    46           4.5   male       
## 5     5    14   256  56.5   236    59           2.5   female     
## 6     6    15   240  63     215    68           5     female
glimpse(data)
## Rows: 51
## Columns: 8
## $ ...1            <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,…
## $ age             <dbl> 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23…
## $ mnum            <dbl> 204, 195, 245, 233, 256, 240, 204, 141, 164, 135, 92, …
## $ mwage           <dbl> 30.5, 37.8, 43.0, 50.5, 56.5, 63.0, 83.5, 88.5, 141.0,…
## $ fnum            <dbl> 122, 198, 241, 233, 236, 215, 256, 245, 279, 251, 209,…
## $ fwage           <dbl> 35, 38, 44, 46, 59, 68, 72, 78, 90, 98, 97, 96, 102, 1…
## $ wage_difference <dbl> 4.5, 0.2, 1.0, 4.5, 2.5, 5.0, 11.5, 10.5, 51.0, 40.3, …
## $ higher_wage     <chr> "female", "female", "female", "male", "female", "femal…