Stepwise Regression: A Practical Approach for Model Selection using R
Stepwise selection, forward selection, and backward elimination are all methods used in the context of building statistical models, specifically regression models, where the goal is to select the most relevant predictors.
In this section, I’ll introduce one by one. Let’s generate one dataset.
df= tibble::tibble(
height= c(86, 102, 104, 99, 97, 97, 109, 104, 99, 91, 97, 99, 107, 104,
97, 97, 102, 99, 97, 104),
stem_biomass= c(351.214, 327.133, 263.747, 436.844, 358.629, 400.098,
421.424, 655.545, 367.028, 424.114, 295.702, 558.76,
480.61, 459.416, 291.962, 391.503, 461.778, 644.39,
488.287, 425.033),
agw=c(30.693, 49.118, 25.473, 29.237, 26.486, 28.547, 24.763, 23.882,
29.215, 28.785, 28.779, 26.71, 23.864, 25.387, 29.463, 28.609,
29.259, 20.889, 27.726, 27.658),
gn= c(14488, 11556, 10389, 23492, 22889, 20322, 16820, 42889, 17564,
20569, 15636, 32545, 23712, 30183, 15980, 20807, 20719, 34173,
23543, 28911),
grain_yield= c(2844.291, 2710.761, 2604.396, 3757.808, 2393.898,
3373.228, 4163.714, 6199.606, 2480.774, 3433.071,
2776.903, 5549.672, 5440.748, 4420.801, 2982.743,
3825.262, 4487.73, 6445.997, 4661.089, 4329.462),
)
head(df, 3)
height stem_biomass agw gn grain_yield
86 351.214 30.693 14488 2844.291
102 327.133 49.118 11556 2710.761
104 263.747 25.473 10389 2604.396
.
.
.
This dataset includes grain yield data, along with measurements of stem biomass, grain weight (agw), and grain number (gn). I would now like to determine which variables are the most critical factors in influencing the final grain yield.
■ Forward selection
In forward selection, variables are added one at a time, starting with the one that shows the strongest individual correlation with the dependent variable. Each subsequent variable selected is the one that provides the most significant improvement to the model fit, given the variables already included.
First, I will fit grain_yield
with each variable, and check the p-value of each linear regression.
summary(lm(grain_yield~height, data=df))
summary(lm(grain_yield~stem_biomass, data=df))
summary(lm(grain_yield~agw, data=df))
summary(lm(grain_yield~gn, data=df))
If variables are few, we can run one by one, but what if there are 100 variables? In this case, fitting one by one would be impossible. So, I’ll use the below code to check p-value.
p_values= sapply(names(df)[-which(names(df)== "grain_yield")], function(col)
{
model= lm(paste("grain_yield ~", col), data = df)
if (ncol(summary(model)$coefficients) >= 4) {
return(summary(model)$coefficients[2, 4])
} else {
return(NA)
}
})
p_values
height: 1.053156e-01
stem_biomass: 1.225464e-09
agw: 2.369808e-02
gn: 5.193683e-06
The linear regression between grain_yield
and stem_biomass
shows the highest p-value. Therefore, I will include stem_biomass
in the model. Next, I will construct models of the form lm(grain_yield ~ stem_biomass + each variable)
, fitting them one by one.
summary(lm(grain_yield~stem_biomass+height, data=df))
summary(lm(grain_yield~stem_biomass+agw, data=df))
summary(lm(grain_yield~stem_biomass+gn, data=df))
but, I’ll use the below code.
variables= setdiff(names(df), c("grain_yield", "stem_biomass"))
models= list()
for (var in variables) {
formula= as.formula(paste("grain_yield ~ stem_biomass +", var))
models[[var]]= lm(formula, data= df)
}
for (var in variables) {
cat("Results for model with", var, ":\n")
model_summary= summary(models[[var]])
coefficients_table= coef(summary(models[[var]]))
print(coefficients_table[, c("Estimate", "Pr(>|t|)")])
cat("\n")
}
Results for model with height :
Estimate Pr(>|t|)
(Intercept) -4616.46524 0.016528184544381
stem_biomass 10.52786 0.000000001280972
height 41.02991 0.034541958269213
Results for model with agw :
Estimate Pr(>|t|)
(Intercept) 63.69489 0.94521601282533
stem_biomass 10.48991 0.00000002761506
agw -20.52979 0.34384107763943
Results for model with gn :
Estimate Pr(>|t|)
(Intercept) -784.585874872 0.12100528523
stem_biomass 11.583992403 0.00009429109
gn -0.008782746 0.77255321232
In the regression analysis of grain_yield
with stem_biomass
, the predictor height
shows the highest p-value. Therefore, I will include height
in the model. My next model will be lm(grain_yield ~ stem_biomass + height + each variable)
, fitting them one by one.
summary(lm(grain_yield~stem_biomass+height+agw, data=df))
summary(lm(grain_yield~stem_biomass+height+gn, data=df))
but, I’ll use the below code.
variables= setdiff(names(df), c("grain_yield", "stem_biomass", "height"))
models= list()
for (var in variables) {
formula= as.formula(paste("grain_yield ~ stem_biomass + height +", var))
models[[var]]= lm(formula, data= df)
}
for (var in variables) {
cat("Results for model with", var, ":\n")
model_summary= summary(models[[var]])
# Extracting and printing coefficients
coefficients_table= coef(summary(models[[var]]))
print(coefficients_table[, c("Estimate", "Pr(>|t|)")]) #
cat("\n")
}
Results for model with agw :
Estimate Pr(>|t|)
(Intercept) -3834.21681 0.06660764193619
stem_biomass 10.12976 0.00000002242056
height 39.76713 0.04215087183397
agw -17.26284 0.37908152809412
Results for model with gn :
Estimate Pr(>|t|)
(Intercept) -4736.04840903 0.01812754502
stem_biomass 11.33642812 0.00004623727
height 41.43274211 0.03798926318
gn -0.01182006 0.66585598379
Now, no variables are significant, so I’ll stop here. Therefore, ‘stem biomass’ and ‘crop height’ are likely more relevant to the final grain yield. Let’s run this model.
summary(lm(formula = grain_yield ~ stem_biomass + height, data= df))
Call:
lm(formula = grain_yield ~ stem_biomass + height, data = df)
Residuals:
Min 1Q Median 3Q Max
-828.7 -223.5 107.3 224.9 607.2
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -4616.4652 1736.1608 -2.659 0.0165 *
stem_biomass 10.5279 0.8913 11.812 1.28e-09 ***
height 41.0299 17.8566 2.298 0.0345 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 402.1 on 17 degrees of freedom
Multiple R-squared: 0.9065, Adjusted R-squared: 0.8955
F-statistic: 82.4 on 2 and 17 DF, p-value: 1.787e-09
Therefore, the model will be: grain yield= -4616.47 + 10.53*stem biomass + 41.03*height
.
Of course, you can perform this fitting process manually using Excel, but using R would be much easier and more convenient. Also, you don’t have to fit the models one by one. R provides code for forward selection. I introduced the above process to explain the principle.
In R, forward selection begins with a model that includes only the intercept. Thus, I will create an intercept-only model.
intercept= lm (grain_yield~ 1, data=df)
Next, I will create another model by fitting all variables with grain yield.
Fit= lm (grain_yield~ ., data=df)
Then, I’ll run the following code.
step (intercept, direction='forward', scope= formula(Fit))
The first thing we need to check is the AIC, which assesses the fit of a model to the data. A lower AIC indicates a model with a better fit. Therefore, only variables with a lower AIC (286.01) will be selected, such as stem biomass. In the second step, with grain_yield ~ stem_biomass
, the AIC is 246.02. Again, only variables with a lower AIC than 246.02 will be selected, which in this case is height. In the third step, for grain_yield ~ stem_biomass + height
, the AIC is 242.61, and no further variables with a lower AIC are found, so the selection process stops here.
Start: AIC=286.01
grain_yield ~ 1
Df Sum of Sq RSS AIC
+ stem_biomass 1 25788950 3601637 246.02
+ gn 1 20384602 9005985 264.35
+ agw 1 7443836 21946751 282.17
+ height 1 4088299 25302288 285.01
<none> 29390587 286.01
Step: AIC=246.02
grain_yield ~ stem_biomass
Df Sum of Sq RSS AIC
+ height 1 853486 2748151 242.61
<none> 3601637 246.02
+ agw 1 190266 3411371 246.94
+ gn 1 18181 3583456 247.92
Step: AIC=242.61
grain_yield ~ stem_biomass + height
Df Sum of Sq RSS AIC
<none> 2748151 242.61
+ agw 1 133721 2614430 243.62
+ gn 1 32848 2715303 244.37
Call:
lm(formula = grain_yield ~ stem_biomass + height, data= df)
Coefficients:
(Intercept) stem_biomass height
-4616.47 10.53 41.03
Therefore, the model will be grain yield = -4616.47 + 10.53*stem biomass + 41.03*height
.
■ Backward Elimination
Conversely, backward elimination starts with all possible predictors included in the model. It systematically removes the least significant variable one at a time, the one whose absence least affects the model fit.
Let’s fitting all variables with grain yield.
summary(lm(grain_yield~., data=df))
Call:
lm(formula = grain_yield ~ ., data = df)
Residuals:
Min 1Q Median 3Q Max
-852.41 -221.25 23.27 277.65 557.77
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -3.865e+03 1.981e+03 -1.951 0.0699 .
height 4.016e+01 1.831e+01 2.193 0.0445 *
stem_biomass 1.137e+01 2.047e+00 5.553 5.53e-05 ***
agw -2.089e+01 2.009e+01 -1.040 0.3147
gn -1.932e-02 2.775e-02 -0.696 0.4970
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 410.9 on 15 degrees of freedom
Multiple R-squared: 0.9138, Adjusted R-squared: 0.8908
F-statistic: 39.77 on 4 and 15 DF, p-value: 8.134e-08
Grain number shows the lowest p-value, so let’s discard it.
reduced_model1= lm(grain_yield ~ . - gn, data=df)
summary(reduced_model1)
Call:
lm(formula = grain_yield ~ . - gn, data = df)
Residuals:
Min 1Q Median 3Q Max
-835.53 -186.11 81.98 234.98 563.38
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -3834.2168 1947.9588 -1.968 0.0666 .
height 39.7671 18.0069 2.208 0.0422 *
stem_biomass 10.1298 0.9983 10.147 2.24e-08 ***
agw -17.2628 19.0828 -0.905 0.3791
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 404.2 on 16 degrees of freedom
Multiple R-squared: 0.911, Adjusted R-squared: 0.8944
F-statistic: 54.62 on 3 and 16 DF, p-value: 1.256e-08
Again!! grain weight shows the lowest p-value, so let’s discard it.
reduced_model2= update(reduced_model1, . ~ . - agw)
summary(reduced_model2)
Call:
lm(formula = grain_yield ~ height + stem_biomass, data = df)
Residuals:
Min 1Q Median 3Q Max
-828.7 -223.5 107.3 224.9 607.2
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -4616.4652 1736.1608 -2.659 0.0165 *
height 41.0299 17.8566 2.298 0.0345 *
stem_biomass 10.5279 0.8913 11.812 1.28e-09 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 402.1 on 17 degrees of freedom
Multiple R-squared: 0.9065, Adjusted R-squared: 0.8955
F-statistic: 82.4 on 2 and 17 DF, p-value: 1.787e-09
Again, height shows the lowest p-value, and it is significant. Therefore, I’ll stop here. Crop height and stem biomass appear to be relevant to the final grain yield.
R provides code for backward elimination. First, I will create a model by fitting all variables with grain yield.
Fit= lm (grain_yield~ ., data=df)
and will run the following code.
step(Fit, direction="backward")
Start: AIC=244.98
grain_yield ~ height + stem_biomass + agw + gn
Df Sum of Sq RSS AIC
- gn 1 81809 2614430 243.62
- agw 1 182681 2715303 244.37
<none> 2532621 244.98
- height 1 811986 3344608 248.54
- stem_biomass 1 5205677 7738298 265.32
Step: AIC=243.62
grain_yield ~ height + stem_biomass + agw
Df Sum of Sq RSS AIC
- agw 1 133721 2748151 242.61
<none> 2614430 243.62
- height 1 796941 3411371 246.94
- stem_biomass 1 16823248 19437678 281.74
Step: AIC=242.61
grain_yield ~ height + stem_biomass
Df Sum of Sq RSS AIC
<none> 2748151 242.61
- height 1 853486 3601637 246.02
- stem_biomass 1 22554137 25302288 285.01
Call:
lm(formula = grain_yield ~ height + stem_biomass, data = df)
Coefficients:
(Intercept) height stem_biomass
-4616.47 41.03 10.53
The outcome is the same as forward selection.
■ Stepwise Selection
In the process described above, forward and backward selection resulted in the same outcomes; however, this might not always be the case. In such instances, stepwise selection would be a good option. Stepwise selection combines elements of both forward selection and backward elimination.
Fit= lm (grain_yield~ ., data=df)
step(Fit, direction="both", scope=formula(Fit))
Start: AIC=244.98
grain_yield ~ height + stem_biomass + agw + gn
Df Sum of Sq RSS AIC
- gn 1 81809 2614430 243.62
- agw 1 182681 2715303 244.37
<none> 2532621 244.98
- height 1 811986 3344608 248.54
- stem_biomass 1 5205677 7738298 265.32
Step: AIC=243.62
grain_yield ~ height + stem_biomass + agw
Df Sum of Sq RSS AIC
- agw 1 133721 2748151 242.61
<none> 2614430 243.62
+ gn 1 81809 2532621 244.98
- height 1 796941 3411371 246.94
- stem_biomass 1 16823248 19437678 281.74
Step: AIC=242.61
grain_yield ~ height + stem_biomass
Df Sum of Sq RSS AIC
<none> 2748151 242.61
+ agw 1 133721 2614430 243.62
+ gn 1 32848 2715303 244.37
- height 1 853486 3601637 246.02
- stem_biomass 1 22554137 25302288 285.01
Call:
lm(formula = grain_yield ~ height + stem_biomass, data = df)
Coefficients:
(Intercept) height stem_biomass
-4616.47 41.03 10.53
Stepwise selection also results in the same outcome. This is because the data size is small.
Here is more big data!!
library (readr)
github= "https://raw.githubusercontent.com/agronomy4future/raw_data_practice/main/stepwise_regression_practice_numeric.csv"
df=data.frame(read_csv(url(github), show_col_types= FALSE))
This dataset contains grain yield data and macro-nutrient measurements taken from different tissues (stem, leaf, and head) of a cereal crop at a specific growth stage. I would now like to determine which variables are the most critical factors relevant to the final grain yield.
We know how to proceed with forward, backward, and stepwise selection.
Forward selection
intercept= lm (grain_yield~ 1, data=df)
Fit= lm (grain_yield~ ., data=df)
step(intercept, direction='forward', scope=formula(Fit))
Call:
lm(formula = grain_yield ~ MG_Leaf + CA_Stem + MG_Head + P_Head +
MG_Stem, data = df)
Coefficients:
(Intercept) MG_Leaf CA_Stem MG_Head P_Head MG_Stem
76.67000 0.09146 -7.93317 42.77140 -13.04946 8.97255
Backward elimination
Fit= lm (grain_yield~ ., data=df)
step(Fit, direction="backward", scope=formula(Fit))
Call:
lm(formula = grain_yield ~ K_Head + MG_Head + N_Leaf + P_Leaf +
MG_Stem + CA_Stem, data = df)
Coefficients:
(Intercept) K_Head MG_Head N_Leaf P_Leaf MG_Stem CA_Stem
79.7346 -1.3973 20.6184 0.7051 -8.9931 12.3271 -10.7195
Stepwise selection.
Fit= lm (grain_yield~ ., data=df)
step(Fit, direction="both", scope=formula(Fit))
Call:
lm(formula = grain_yield ~ K_Head + MG_Head + N_Leaf + P_Leaf +
MG_Stem + CA_Stem, data = df)
Coefficients:
(Intercept) K_Head MG_Head N_Leaf P_Leaf MG_Stem CA_Stem
79.7346 -1.3973 20.6184 0.7051 -8.9931 12.3271 -10.7195
In stepwise selection, the model will be grain yield= 79.73 -1.40*K_Head + 20.62*MG_Head + 0.71*N_Leaf -8.99*P_Leaf + 12.33*MG_Stem -10.72*CA_Stem
summary(lm (grain_yield~ K_Head + MG_Head + N_Leaf + P_Leaf + MG_Stem + CA_Stem, data=df))
Call:
lm(formula = grain_yield ~ K_Head + MG_Head + N_Leaf + P_Leaf +
MG_Stem + CA_Stem, data = df)
Residuals:
Min 1Q Median 3Q Max
-19.704 -8.393 1.044 8.178 30.385
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 79.7346 8.2464 9.669 9.38e-13 ***
K_Head -1.3973 0.6251 -2.235 0.030184 *
MG_Head 20.6184 5.1675 3.990 0.000230 ***
N_Leaf 0.7051 0.2606 2.706 0.009453 **
P_Leaf -8.9931 4.3574 -2.064 0.044576 *
MG_Stem 12.3271 2.9510 4.177 0.000127 ***
CA_Stem -10.7195 1.6104 -6.656 2.72e-08 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 11.47 on 47 degrees of freedom
Multiple R-squared: 0.7631, Adjusted R-squared: 0.7328
F-statistic: 25.23 on 6 and 47 DF, p-value: 3.759e-13
Now, forward and backward selection result in different variables affecting the final grain yield. Therefore, I will use stepwise selection. According to this method, the magnesium content in the cereal spike (head) and in the stem would be most relevant to the final grain yield.