Help Center

R Language

How to perform stepwise variable selection in logistic regression model?

Preparing the data

Data set: PimaIndiansDiabetes2 [in mlbench package], introduced in Chapter @ref(classification-in-r), for predicting the probability of being diabetes positive based on multiple clinical variables.

We’ll randomly split the data into training set (80% for building a predictive model) and test set (20% for evaluating the model). Make sure to set seed for reproductibility.

# Load the data and remove NAs

data("PimaIndiansDiabetes2", package = "mlbench")

PimaIndiansDiabetes2 <- na.omit(PimaIndiansDiabetes2)

# Inspect the data

sample_n(PimaIndiansDiabetes2, 3)

# Split the data into training and test set

set.seed(123)

training.samples <- PimaIndiansDiabetes2$diabetes %>%

createDataPartition(p = 0.8, list = FALSE)

train.data <- PimaIndiansDiabetes2[training.samples, ]

test.data <- PimaIndiansDiabetes2[-training.samples, ]

Computing stepwise logistique regression

The stepwise logistic regression can be easily computed using the R function stepAIC() available in the MASS package. It performs model selection by AIC. It has an option called direction, which can have the following values: “both”, “forward”, “backward” (see Chapter @ref(stepwise-regression)).

Quick start R code

library(MASS)

# Fit the model

model <- glm(diabetes ~., data = train.data, family = binomial) %>%

stepAIC(trace = FALSE)

# Summarize the final selected model

summary(model)

# Make predictions

probabilities <- model %>% predict(test.data, type = "response")

predicted.classes <- ifelse(probabilities > 0.5, "pos", "neg")

# Model accuracy

mean(predicted.classes==test.data$diabetes)

Full logistic regression model

Full model incorporating all predictors:

full.model <- glm(diabetes ~., data = train.data, family = binomial)

coef(full.model)

## (Intercept) pregnant glucose pressure triceps insulin

## -9.50372 0.04571 0.04230 -0.00700 0.01858 -0.00159

## mass pedigree age

## 0.04502 0.96845 0.04256

Perform stepwise variable selection

Select the most contributive variables:

library(MASS)

step.model <- full.model %>% stepAIC(trace = FALSE)

coef(step.model)

## (Intercept) glucose mass pedigree age

## -9.5612 0.0379 0.0523 0.9697 0.0529

The function chose a final model in which one variable has been removed from the original full model. Dropped predictor is: triceps.

Compare the full and the stepwise models

Here, we’ll compare the performance of the full and the stepwise logistic models. The best model is defined as the model that has the lowest classification error rate in predicting the class of new test data:

Prediction accuracy of the full logistic regression model:

# Make predictions

probabilities <- full.model %>% predict(test.data, type = "response")

predicted.classes <- ifelse(probabilities > 0.5, "pos", "neg")

# Prediction accuracy

observed.classes <- test.data$diabetes

mean(predicted.classes == observed.classes)

## [1] 0.808

Prediction accuracy of the stepwise logistic regression model:

# Make predictions

probabilities <- predict(step.model, test.data, type = "response")

predicted.classes <- ifelse(probabilities > 0.5, "pos", "neg")