Predicting Employee Attrition Using Logistic Regression in R
"Talented people will find a way to make a living.
How can you be certain it's with you and your company?"
Employee attrition has become a critical concern for a company's competitive edge. Finding, hiring, and training new employees is costly, so it is usually more cost-effective for a company to retain its current staff. To keep staff for longer, a company must maintain a good working environment.
Whether an employee stays with or leaves a firm is a binary outcome: "YES" or "NO". Logistic regression analyzes the relationship between predictor variables and a categorical response variable. It models a dependent variable that has only two possible outcomes as a function of one or more independent variables. The sigmoid function, also known as the logistic function, produces an 'S'-shaped curve that maps any real-valued number to a value between 0 and 1.
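$$p = \frac{1}{1 + e^{-y}}$$
$$e^{-y} = \frac{1 - p}{p}$$
$$y = \log\left(\frac{p}{1 - p}\right)$$
$$\log\left(\frac{p}{1 - p}\right) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_n X_n$$
where p is the probability of the outcome (here, that the employee leaves), y is the linear combination of the predictors X1 to Xn, and β0 to βn are the coefficients to be estimated.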
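The relationship between the sigmoid and the log-odds form can be checked numerically. The following is a minimal sketch in base R; the helper names `sigmoid` and `logit` are introduced here purely for illustration and are not part of the original analysis.

```r
# Illustrative helpers (names are assumptions, not from the original post)
sigmoid <- function(y) 1 / (1 + exp(-y))   # maps any real number into (0, 1)
logit   <- function(p) log(p / (1 - p))    # log-odds, the inverse of the sigmoid

y <- c(-5, -1, 0, 1, 5)
p <- sigmoid(y)
print(round(p, 4))

# Applying the logit to the probabilities recovers the original values
print(all.equal(logit(p), y))

# Base R provides the same pair of functions as plogis() and qlogis()
print(all.equal(p, plogis(y)))
```

Large negative inputs land near 0, large positive inputs near 1, and an input of 0 maps to exactly 0.5.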
Employee attrition is the dependent categorical variable here, so logistic regression is a natural choice for predicting and analyzing it.
I'll show you how to use R to assess employee attrition in five simple stages.
· Exploration of data set
· Data pre-processing
· Dividing the data into two parts: "training" and "testing"
· Using the "training data set" to build the model
· Using the "testing data set" to conduct the accuracy test
Exploration of Data set
This data set was obtained from Kaggle. It contains 14,999 observations and 10 variables. Of these ten variables, attrition ("left") is the dependent variable.
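The post does not show how the `HR` object is created. One plausible way, assuming the Kaggle data has been downloaded as a CSV file, is `read.csv`. The sketch below writes a two-row stand-in file to a temporary location so that it runs on its own; the rows are made up, and `HR_demo` is a hypothetical name used only here.

```r
# Stand-in CSV written to a temporary file so the sketch runs on its own;
# the rows are made up and only mimic a few columns of the real data
tmp <- tempfile(fileext = ".csv")
writeLines(c("satisfaction_level,Department,salary,left",
             "0.38,sales,low,1",
             "0.80,technical,medium,0"), tmp)

# With the real data, replace tmp with the path to the downloaded file
HR_demo <- read.csv(tmp, stringsAsFactors = FALSE)

print(dim(HR_demo))   # number of rows, then number of columns
str(HR_demo)          # type of each column
```

With `stringsAsFactors = FALSE`, text columns are read as character, which matches the conversion step described later in the pre-processing.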
Data pre-processing
1. Listing the variable names in the data set
# Variable names in the data set
variable.names(HR)
2. Summarizing all variables in the data set
# Summary of all the variables (for numeric columns: minimum, quartiles, median, mean, maximum)
summary(HR)
3. Detecting missing values
Let's check whether there are any missing values in the data set.
# missmap() comes from the Amelia package
library(Amelia)
missmap(HR, main = "Missing values vs observed")
Result: no missing values were observed.
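As a lighter-weight cross-check that needs no extra package, missing values can also be counted per column in base R. The sketch below uses a small made-up data frame (`toy`) rather than the real `HR` data.

```r
# Made-up data for illustration only; not taken from the HR data set
toy <- data.frame(
  satisfaction_level = c(0.38, NA, 0.11, 0.72),
  salary             = c("low", "medium", NA, "high"),
  left               = c(1, 0, 1, 0)
)

# Number of missing values in each column
na_per_column <- colSums(is.na(toy))
print(na_per_column)

# TRUE if any value at all is missing
print(anyNA(toy))
```

On a data set with no missing values, every count is 0 and `anyNA` returns FALSE.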
4. Changing data types
Before fitting the model, the character variables are converted to factors so that R treats them as categorical.
Department and salary are the two character variables; they are in columns 8 and 9, respectively.
# Convert Department and salary to factors for the analysis
HR[, c(8, 9)] <- lapply(HR[, c(8, 9)], as.factor)
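Selecting columns by position only works while the column order stays the same. As an alternative sketch, the same conversion can select the columns by name and then verify the result. The data frame `toy` below is made up for illustration.

```r
# Made-up data for illustration only
toy <- data.frame(
  Department = c("sales", "technical", "sales", "hr"),
  salary     = c("low", "medium", "low", "high"),
  left       = c(1, 0, 1, 0),
  stringsAsFactors = FALSE
)

# Convert the two character columns to factors, selecting them by name
cols <- c("Department", "salary")
toy[cols] <- lapply(toy[cols], as.factor)

# Confirm the new types and inspect the category labels
print(sapply(toy, class))
print(levels(toy$salary))
```

By default the factor levels are sorted alphabetically, and the first level becomes the reference category when the factor is used in a regression.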
Dividing the data into "training" and "testing"
To check how well a model predicts data it has not seen, the dataset is divided into two parts:
· Data set for Training
· Data set for Testing
First, we’ll develop our model with the Training Data set and then verify its accuracy with the Testing Data set.
# Randomly label each row as "Training" (about 80%) or "Testing" (about 20%)
ran <- sample(x = c("Training", "Testing"), size = nrow(HR), replace = TRUE, prob = c(0.8, 0.2))
TrainingData1 <- HR[ran == "Training", ] # training data set
TestingData1 <- HR[ran == "Testing", ] # testing data set
# Number of rows in the training data set
nrow(TrainingData1)
# Number of rows in the testing data set
nrow(TestingData1)
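Because `sample` draws the labels at random, the two parts change on every run and their sizes are only approximately 80% and 20%. A minimal sketch of a reproducible split follows, using a made-up data frame; the seed value 123 is arbitrary.

```r
# Made-up data for illustration only; 1000 rows standing in for HR
toy <- data.frame(id = 1:1000, left = rep(c(0, 1), times = 500))

set.seed(123)  # arbitrary seed, chosen only so the split can be reproduced
ran <- sample(c("Training", "Testing"), size = nrow(toy),
              replace = TRUE, prob = c(0.8, 0.2))

train <- toy[ran == "Training", ]
test  <- toy[ran == "Testing", ]

# The two parts are disjoint and together cover every row
print(nrow(train) + nrow(test) == nrow(toy))

# Share of rows in each part: close to 0.8 and 0.2, but not exact
print(round(prop.table(table(ran)), 3))
```

Calling `set.seed` with the same value before `sample` yields the same split each time, which makes later results comparable across runs.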
Building up the model
Next, we build the model in two simple steps:
· Identify the variables that are independent.
· Build the model by incorporating TRAINING data into the equation.
# Dependent variable is attrition (left); columns 1 to 9 hold the independent variables
independent_variables <- colnames(HR[, 1:9])
# Display all the independent variables in the data set
independent_variables
Then, using the "glm" function, we fit a logistic regression model on the training data.
# Fit the logistic regression model on the training data using glm()
# Variables are referenced by bare name so that glm() takes them from TrainingData1
model1 <- glm(left ~ satisfaction_level + last_evaluation + number_project + average_montly_hours + time_spend_company + Work_accident + promotion_last_5years + Department + salary, data = TrainingData1, family = binomial(link = "logit"))
# Summary of the logistic regression model
summary(model1)
As the summary table shows, employee satisfaction, last evaluation, number of projects, average monthly hours, time spent in the company, work accident, promotion in the last five years, and salary are all important factors influencing employee attrition. An organisation that focuses on these areas is likely to lose fewer employees.
Confusion Matrix
# Run the test data through the model to get predicted probabilities
res1 <- predict(model1, newdata = TestingData1, type = "response")
head(res1)
# Validate the model: confusion matrix on the test set with a 0.5 cut-off
confusionMatrix <- table(Actual_value = TestingData1$left, predicted_value = res1 > 0.5)
confusionMatrix
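A confusion matrix is often condensed into a single accuracy figure. The sketch below shows that calculation on made-up labels and probabilities, not on the HR results; the vectors `actual` and `predicted` are hypothetical.

```r
# Made-up labels and predicted probabilities, for illustration only
actual    <- c(0, 0, 0, 0, 1, 1, 1, 0, 1, 0)
predicted <- c(0.10, 0.40, 0.70, 0.20, 0.80, 0.30, 0.90, 0.05, 0.60, 0.45)

# Rows are actual values, columns are predictions at a 0.5 cut-off
cm <- table(Actual_value = actual, predicted_value = predicted > 0.5)
print(cm)

# Accuracy: correctly classified cases divided by all cases.
# diag() is only valid here because both FALSE and TRUE columns are present.
accuracy <- sum(diag(cm)) / sum(cm)
print(accuracy)
```

In this toy example 8 of the 10 cases are classified correctly, giving an accuracy of 0.8. When one class is much rarer than the other, accuracy alone can be misleading, so the off-diagonal cells are worth reading as well.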
Interpretation
Judging by the confusion matrix, our logistic regression model performs quite well. The same method can be applied to any employee attrition dataset, helping a company gain a competitive edge and manage its resources effectively.
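Beyond overall performance, the coefficients of a logistic regression can be read as odds ratios by exponentiating them. The following is a minimal sketch on simulated data, not on the HR data set; the variable `satisfaction`, the simulated effect size, and the seed are all assumptions made for illustration.

```r
# Simulated data for illustration only; not the HR data set
set.seed(1)  # arbitrary seed so the simulation can be reproduced
n <- 500
satisfaction <- runif(n)
# Simulate leaving with log-odds that decrease as satisfaction rises
left <- rbinom(n, size = 1, prob = plogis(1 - 4 * satisfaction))

fit <- glm(left ~ satisfaction, family = binomial(link = "logit"))

# Exponentiated coefficients are odds ratios
odds_ratios <- exp(coef(fit))
print(round(odds_ratios, 3))
```

An odds ratio below 1 means the odds of leaving fall as the predictor increases, and a value above 1 means they rise.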
Course: Master of Business Administration
Amrita School of Business, Coimbatore
Amrita Vishwa Vidyapeetham.
"This blog is a part of the assignments done in the course Data Analysis using R and Python"