Simple Example of Linear Model
Last updated: 14 November 2024 at 09:02
This file shows the code used for the output on the slides.
A first example
Before starting, download the following data sets:
vik_table_9_2.csv
body.csv
Read the data and create variables
data<-read.csv('data/vik_table_9_2.csv')
# Let's create two variables that could represent the age and weight of penguins
data$age<- 15 - data$X1
data$weight<-data$Y + 45
plot(data$age, data$weight)
Run a simple linear model
Here, we fit a simple linear model to the data. The line model1<-lm(weight ~ age, data=data) fits a linear model where weight is the dependent variable and age is the predictor variable. The line summary(model1) provides the results of the model.
model1<-lm(weight ~ age, data=data) # fit this model: weight = B0 + B1 * age + error
summary(model1)
##
## Call:
## lm(formula = weight ~ age, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.1635 -0.3365 0.1121 0.7804 1.4907
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 43.6308 1.2031 36.267 6.04e-12 ***
## age 0.7757 0.1208 6.421 7.63e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.25 on 10 degrees of freedom
## Multiple R-squared: 0.8048, Adjusted R-squared: 0.7853
## F-statistic: 41.23 on 1 and 10 DF, p-value: 7.627e-05
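Rather than reading the numbers off the printed summary, you can also extract them from the fitted object. A minimal sketch using standard R accessors (coef, confint, and the r.squared element of the summary):
coef(model1)               # named vector with the intercept (B0) and the slope for age (B1)
confint(model1)            # 95% confidence intervals for both coefficients
summary(model1)$r.squared  # the Multiple R-squared reported above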
Get the F-value components
If we wish to know how the F-value was calculated, we can ask for the ANOVA table.
anova(model1)
## Analysis of Variance Table
##
## Response: weight
## Df Sum Sq Mean Sq F value Pr(>F)
## age 1 64.383 64.383 41.227 7.627e-05 ***
## Residuals 10 15.617 1.562
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
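As a check, the F value is just the ratio of the two mean squares in this table. A minimal sketch using the rounded values printed above:
ms_age       <- 64.383 / 1    # Mean Sq for age = Sum Sq / Df
ms_residuals <- 15.617 / 10   # Mean Sq for the residuals
ms_age / ms_residuals         # F value, approximately 41.23
pf(ms_age / ms_residuals, df1 = 1, df2 = 10, lower.tail = FALSE)  # p-value, approximately 7.6e-05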
Explicit model comparison
All model assessment is based on comparing models. Here we explicitly compare the simplest model (zero predictors, $y_i = \bar{y}$) with a more complex model (a model with one predictor). By default, R compares any model you fit to the simplest model using an F-test. However, here we explicitly fit the simplest model ourselves and then compare it to the more complex model.
Simple model
First, run the simple model…
data$average_weight <- mean(data$weight) # Create a new variable which is the average of all weights
simple_model <- lm(weight ~ average_weight, data = data)
summary(simple_model)
##
## Call:
## lm(formula = weight ~ average_weight, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.00 -2.25 0.00 2.25 4.00
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 51.0000 0.7785 65.51 1.3e-15 ***
## average_weight NA NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.697 on 11 degrees of freedom
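Because average_weight is the same value for every row, R drops it from the fit (hence the "(1 not defined because of singularities)" note), so simple_model is effectively an intercept-only model. As an aside, the same model can be written more directly with an intercept-only formula (the object name intercept_only is just for illustration):
intercept_only <- lm(weight ~ 1, data = data)  # intercept-only model: every observation is predicted by the mean
summary(intercept_only)                        # the intercept estimate equals mean(data$weight), i.e. 51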
More complex model
Let’s run the more complex model (again)…
complex_model<-lm(weight ~ age, data=data)
Compare the models
Compare the following output with that of the complex model.
anova(simple_model, complex_model)
## Analysis of Variance Table
##
## Model 1: weight ~ average_weight
## Model 2: weight ~ age
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 11 80.000
## 2 10 15.617 1 64.383 41.227 7.627e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(complex_model)
## Analysis of Variance Table
##
## Response: weight
## Df Sum Sq Mean Sq F value Pr(>F)
## age 1 64.383 64.383 41.227 7.627e-05 ***
## Residuals 10 15.617 1.562
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
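The F value in the model-comparison table can also be reproduced by hand from the residual sums of squares shown above. A minimal sketch using the rounded values from the output:
rss_simple  <- 80.000   # RSS of the simple (intercept-only) model
rss_complex <- 15.617   # RSS of the model with age
df_diff     <- 1        # the complex model has one extra parameter
df_resid    <- 10       # residual degrees of freedom of the complex model
((rss_simple - rss_complex) / df_diff) / (rss_complex / df_resid)  # F value, approximately 41.23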
Second example
data <- read.csv('data/body.csv')
data<-data[1:20,]
head(data)
## Biacromial Biiliac Bitrochanteric ChestDepth ChestDia ElbowDia WristDia
## 1 42.9 26.0 31.5 17.7 28.0 13.1 10.4
## 2 43.7 28.5 33.5 16.9 30.8 14.0 11.8
## 3 40.1 28.2 33.3 20.9 31.7 13.9 10.9
## 4 44.3 29.9 34.0 18.4 28.2 13.9 11.2
## 5 42.5 29.9 34.0 21.5 29.4 15.2 11.6
## 6 43.3 27.0 31.5 19.6 31.3 14.0 11.5
## KneeDia AnkleDia Shoulder Chest Waist Navel Hip Thigh Bicep Forearm Knee
## 1 18.8 14.1 106.2 89.5 71.5 74.5 93.5 51.5 32.5 26.0 34.5
## 2 20.6 15.1 110.5 97.0 79.0 86.5 94.8 51.5 34.4 28.0 36.5
## 3 19.7 14.1 115.1 97.5 83.2 82.9 95.0 57.3 33.4 28.8 37.0
## 4 20.9 15.0 104.5 97.0 77.8 78.8 94.0 53.0 31.0 26.2 37.0
## 5 20.7 14.9 107.5 97.5 80.0 82.5 98.5 55.4 32.0 28.4 37.7
## 6 18.8 13.9 119.8 99.9 82.5 80.1 95.3 57.5 33.0 28.0 36.6
## Calf Ankle Wrist Age Weight Height Gender
## 1 36.5 23.5 16.5 21 65.6 174.0 1
## 2 37.5 24.5 17.0 23 71.8 175.3 1
## 3 37.3 21.9 16.9 28 80.7 193.5 1
## 4 34.8 23.0 16.6 23 72.6 186.5 1
## 5 38.6 24.4 18.0 22 78.8 187.2 1
## 6 36.1 23.5 16.9 21 74.8 181.5 1
Relationship between two variables: correlation
correlation_test<-cor.test(data$Wrist, data$Age)
correlation_test
##
## Pearson's product-moment correlation
##
## data: data$Wrist and data$Age
## t = 1.0465, df = 18, p-value = 0.3092
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.2271038 0.6166544
## sample estimates:
## cor
## 0.2394848
correlation_test$estimate**2
## cor
## 0.05735295
Relationship between two variables: linear model
Notice how the t-test for the slope parameter (here, Age) gives the same result as the t-test for the correlation.
model2<-lm(data$Wrist~data$Age)
summary(model2)
##
## Call:
## lm(formula = data$Wrist ~ data$Age)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.1170 -0.7277 -0.2141 0.7566 1.5516
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 15.73914 1.56466 10.059 8.16e-09 ***
## data$Age 0.06860 0.06555 1.047 0.309
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8297 on 18 degrees of freedom
## Multiple R-squared: 0.05735, Adjusted R-squared: 0.004984
## F-statistic: 1.095 on 1 and 18 DF, p-value: 0.3092
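To check this numerically, we can pull the test statistic and R-squared out of the two objects created above (a quick sketch, no new data involved):
correlation_test$statistic                           # t from the correlation test, 1.0465
summary(model2)$coefficients["data$Age", "t value"]  # t for the Age slope, the same value
correlation_test$estimate^2                          # squared correlation, approximately 0.057
summary(model2)$r.squared                            # Multiple R-squared, the same value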