Telecom Churn Prediction
By Khulood Nasher
Introduction
What is customer churn? Customer churn, or customer attrition, is the loss of customers that occurs when they stop using a company's service. The churn rate for a given period is the number of customers lost during that period divided by the number of customers the company had at the beginning of that period.
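As a small worked example of this definition, the sketch below computes the churn rate for one period. The function name and the customer counts are hypothetical and chosen only for illustration.

```python
def churn_rate(customers_at_start: int, customers_lost: int) -> float:
    """Return the churn rate for one period as a fraction between 0 and 1."""
    if customers_at_start <= 0:
        raise ValueError("customers_at_start must be positive")
    return customers_lost / customers_at_start


# Hypothetical example: 1,000 customers at the start of the month, 50 lost during it.
monthly_rate = churn_rate(customers_at_start=1000, customers_lost=50)
print(f"Monthly churn rate: {monthly_rate:.1%}")
```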
Why is it important to calculate customer churn?
Because a business depends on customers who keep using its service, and keeping existing customers is less expensive than bringing in new ones.
More Important Questions:
In this project, I tried to figure out the answers to the following important questions:
Q1: What percentage of customers churn?
Q2: What are the common variables among churn customers?
Q3: What is the predicted percentage of customer churn?
Q4: How to reduce customer churn?
To answer these questions, I followed the data science project cycle known as OSEMN (Obtain, Scrub, Explore, Model, iNterpret). I tried to predict which customers have a high probability of churning and to set recommendations for keeping them.
About Telco data
- Data were obtained from https://www.kaggle.com/dpr1988/telecom-churn-dataset
- Each row represents a customer, and each column contains the customer’s attributes described in the column Metadata.
- The raw data contains 7043 rows which are customers and 21 columns that are features.
- The “Churn” column is our target.
- customerID: A unique identifier for each customer, so it has 7043 distinct values.
- gender: Whether the customer is a male or a female.
- Senior citizen: Whether the customer is a senior citizen or not (1, 0)
- Partner: Whether the customer has a partner or not (Yes, No)
- Dependents: Whether the customer has dependents or not (Yes, No)
- tenure: Number of months the customer has stayed with the company
- PhoneService: Whether the customer has a phone service or not (Yes, No)
- MultipleLines: Whether the customer has multiple lines or not (Yes, No, No phone service)
- InternetService: Customer’s internet service provider (DSL, Fiber optic, No)
- OnlineSecurity: Whether the customer has online security or not (Yes, No, No internet service)
- OnlineBackup: Whether the customer has an online backup or not (Yes, No, No internet service)
- DeviceProtection: Whether the customer has device protection or not (Yes, No, No internet service)
- TechSupport: Whether the customer has tech support or not (Yes, No, No internet service)
- StreamingTV: Whether the customer has streaming TV or not (Yes, No, No internet service)
- StreamingMovies: Whether the customer has streaming movies or not (Yes, No, No internet service)
- Contract: The contract term of the customer (Month-to-month, One year, Two years)
- PaperlessBilling: Whether the customer has paperless billing or not (Yes, No)
- PaymentMethod: The customer’s payment method (Electronic check, Mailed check, Bank transfer (automatic), Credit card (automatic))
- MonthlyCharges: The amount charged to the customer monthly
- TotalCharges: The total amount charged to the customer
- Churn: Whether the customer churned or not (Yes or No)
Explore Data
I examined and visualized every categorical column and every numeric column. To see my project, please click here.
First, I explored the target column with a pie plot.
Exploring the target column, “Churn”, shows that the target is imbalanced: approximately 73% of customers are “No” and 27% are “Yes”. This imbalance will affect our predictions and must be addressed during modeling.
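The class shares can also be checked numerically. The sketch below uses a toy series with the same 73/27 split rather than the real data, and assumes the data is held in pandas.

```python
import pandas as pd

# Toy stand-in for the target column: 73 "No" and 27 "Yes" (not the real data).
churn = pd.Series(["No"] * 73 + ["Yes"] * 27, name="Churn")

# normalize=True returns class shares instead of raw counts.
class_share = churn.value_counts(normalize=True)
print(class_share)
```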
Then, I explored the categorical features.
gender: Males and females appear to be equally distributed with respect to churn.
So I expect that gender is not an important feature.
senior citizens: There are far fewer senior citizens than younger customers, but a larger proportion of senior citizens churn. In absolute numbers, the churn plot shows more young people churning.
I think seniorcitizen is an important feature.
Contract: Many more people are on a month-to-month contract, and a large proportion of this group has churned. People with a one-year contract churn less, and people with a two-year contract are the least likely to churn.
I can tell that contract is one of the most important features.
PaymentMethod: Electronic check is the most common payment method, and a large proportion of the customers who use it have churned.
Important categorical Features
In conclusion, based on our analyses, we can see that the more services are added, the fewer people churn. Gender is not a good feature, and phone service and multiple lines aren't important.
The important features are:
seniorcitizen, partner, dependents, internetservice, contract, paperlessbilling, and paymentmethod.
Then, I visualized the numeric features
- We can observe that the greater totalcharges is, the less churn occurs.
- totalcharges looks like a very important feature with respect to churn.
- totalcharges is skewed to the right, so taking its logarithm will help bring the distribution closer to normal.
- Churn fluctuates across monthlycharges without a clear trend.
- As a customer's tenure increases, churn mostly decreases.
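To illustrate the point about taking the logarithm of a right-skewed column, here is a minimal sketch. The values and the small skewness helper are made up for illustration, and np.log1p is one reasonable choice rather than necessarily the transform used in the project.

```python
import numpy as np

# Hypothetical right-skewed charges: many small values and a few very large ones.
total_charges = np.array([20.0, 45.0, 80.0, 150.0, 300.0, 1200.0, 8000.0])


def skewness(values: np.ndarray) -> float:
    """Sample skewness: mean of cubed z-scores (positive means a long right tail)."""
    centered = values - values.mean()
    return float((centered ** 3).mean() / values.std() ** 3)


# log1p computes log(1 + x), which is also safe if a charge is exactly 0.
log_charges = np.log1p(total_charges)

print(f"skewness before: {skewness(total_charges):.2f}")
print(f"skewness after:  {skewness(log_charges):.2f}")
```

The skewness after the transform is smaller, meaning the long right tail has been compressed.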
Modeling Data
Data preprocessing
Encoding features
Since our target variable (churn) is imbalanced, I first established a baseline classification accuracy on the training data with DummyClassifier.
I used the most_frequent strategy, which always predicts the majority class, to calculate the baseline accuracy.
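A minimal sketch of such a baseline is shown below. The labels are a toy array with a 73/27 split, not the project's X_train and y_train, so the printed number only illustrates that the baseline equals the share of the majority class.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Toy training labels with a 73/27 split (0 = stayed, 1 = churned); not the real data.
y_train = np.array([0] * 73 + [1] * 27)
# DummyClassifier ignores the features, so a placeholder column is enough here.
X_train = np.zeros((len(y_train), 1))

baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_train, y_train)

baseline_accuracy = baseline.score(X_train, y_train)
print(f"Baseline accuracy: {baseline_accuracy:.2f}")
```

Any real model should beat this number to be considered useful.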
To prepare the dataset for modeling, we needed to encode the categorical features as numbers so that the algorithms can work with the data. For a two-valued column, this means mapping “Yes” and “No” to 1 and 0. For columns with more categories, one-hot encoding creates a separate 0/1 column for each category.
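One-hot encoding with pandas might look like the sketch below. The rows are toy data, and using pandas.get_dummies is an assumption about tooling, not a statement of how the project encoded its features.

```python
import pandas as pd

# Toy rows loosely based on two of the columns described earlier (not the real data).
df = pd.DataFrame({
    "Partner": ["Yes", "No", "Yes"],
    "Contract": ["Month-to-month", "One year", "Two year"],
})

# One 0/1 column is created per category; dtype=int gives 0/1 instead of booleans.
encoded = pd.get_dummies(df, columns=["Partner", "Contract"], dtype=int)
print(encoded.columns.tolist())
```

Each original column is replaced by columns named after the original column and the category, such as Partner_Yes.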
Define X and y and Splitting data to training and testing
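A minimal sketch of this step follows, using a toy frame. The column values, test_size, and random_state are illustrative assumptions, not the settings used in the project.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy encoded frame; the real one would hold all encoded features plus the target.
data = pd.DataFrame({
    "tenure": [1, 5, 12, 24, 36, 48, 60, 72, 3, 8],
    "monthlycharges": [70, 85, 60, 55, 90, 45, 30, 25, 95, 80],
    "churn": [1, 1, 0, 0, 0, 0, 0, 0, 1, 1],
})

X = data.drop(columns="churn")  # features
y = data["churn"]               # target

# stratify=y keeps the churn ratio similar in both splits; test_size and
# random_state are illustrative values, not the ones used in the project.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
print(X_train.shape, X_test.shape)
```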
Modeling With Random Forest Classifier
I tried several models, such as logistic regression, support vector machine, and decision tree. The best model was random forest.
So I modeled with a random forest, setting n_estimators=10 and leaving the rest of the parameters at their defaults.
With n_estimators=10 alone, I ran into an overfitting problem: the training accuracy was 98.1% while the testing accuracy was 84.6%. I fixed the overfitting by tuning more of the random forest's parameters, which I searched for with a grid search.
The grid search suggested {‘criterion’: ‘gini’, ‘max_depth’: 20, ‘min_samples_leaf’: 15, ‘min_samples_split’: 50, ‘n_estimators’: 10} as the best parameters. So I plugged them into the random forest and checked the metrics.
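A grid search of this kind can be sketched with scikit-learn's GridSearchCV as below. The data is synthetic, and the grid values, cv, and scoring are assumptions chosen only so that the reported best parameters fall inside the grid; the project's actual grid is not reproduced here.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic, imbalanced stand-in for the encoded Telco features (not the real data).
X, y = make_classification(
    n_samples=500, n_features=10, weights=[0.73, 0.27], random_state=0
)

# Hypothetical grid: the values are illustrative, not the project's actual grid.
param_grid = {
    "criterion": ["gini", "entropy"],
    "max_depth": [10, 20],
    "min_samples_leaf": [5, 15],
    "min_samples_split": [10, 50],
    "n_estimators": [10],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=3,
    scoring="accuracy",
)
search.fit(X, y)

print(search.best_params_)
print(f"Best cross-validated accuracy: {search.best_score_:.3f}")
```

Larger min_samples_leaf and min_samples_split values limit how finely each tree can split, which is one way such a search can reduce overfitting.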
Interpreting the Results
With the suggested best parameters, the model improved: both the training and the testing accuracy were about 86%, so the gap between them closed. The recall for churned customers is also 86%, which corresponds to a false-negative rate of 14%. Precision, recall, and F1 are balanced between the “No” and “Yes” churn classes at about 86%. The important features are internetservice_Fiber optic, tenure, paymentmethod_Electronic check, contract_Two year, totalcharges, seniorcitizen_Yes, paperlessbilling_Yes, and phoneservice_Yes. I think this is a good model, and I will trust it to predict which customers will churn.
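The relationship between recall and the false-negative rate can be checked on a small example. The labels below are toy values, not the project's predictions.

```python
from sklearn.metrics import confusion_matrix, recall_score

# Toy labels (1 = churned, 0 = stayed); not the project's predictions.
y_true = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 1, 0, 0, 0, 0, 0, 1]

# For binary labels, ravel() yields tn, fp, fn, tp in that order.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

recall = recall_score(y_true, y_pred)   # tp / (tp + fn)
false_negative_rate = fn / (fn + tp)    # equals 1 - recall
print(f"recall={recall:.2f}, false-negative rate={false_negative_rate:.2f}")
```

A false negative here is a customer who churned but was predicted to stay, which is the costly case for a retention campaign.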
ROC Curve for Random Forest Classifier
The ROC plot shows the true positive rate (the share of actually churned customers that my model correctly predicts as churned) against the false positive rate (the share of customers who did not churn but whom my model predicts as churned). The further the curve is from the diagonal line and the closer it is to the top-left corner, the better the model separates the two classes. AUC, the area under this curve, summarizes that ability: closer to 1 is better, and 0.5 is no better than random guessing. My model's AUC is 0.85, which indicates that it separates churners from non-churners well.
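For readers who want to compute these quantities, a minimal sketch with scikit-learn follows. The labels and scores are toy values, not the project's model output.

```python
from sklearn.metrics import roc_auc_score, roc_curve

# Toy labels and predicted churn probabilities (1 = churned); not project data.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

# Each threshold on the score gives one (false positive rate, true positive rate) point.
fpr, tpr, thresholds = roc_curve(y_true, y_score)
auc = roc_auc_score(y_true, y_score)

print("FPR:", fpr)
print("TPR:", tpr)
print(f"AUC: {auc:.2f}")
```

In this toy case the AUC is 0.75, because three of the four churner/non-churner pairs are ranked in the right order by the score.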
Conclusion
In conclusion, predicting customer churn is a very important task that can add income for telecom companies. As we can see, some variables are negatively correlated with the predicted target (Churn), while others are positively correlated. A negative correlation means that churn decreases as the variable increases.
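The sign of a correlation can be illustrated with a toy example. The tenure and churn values below are invented to show a negative relationship and are not taken from the dataset.

```python
import numpy as np

# Toy data: customers with longer tenure churn less often (1 = churned).
tenure = np.array([1, 2, 3, 6, 12, 24, 36, 48, 60, 72])
churned = np.array([1, 1, 1, 1, 0, 1, 0, 0, 0, 0])

# Pearson correlation between a numeric feature and the 0/1 target.
r = np.corrcoef(tenure, churned)[0, 1]
print(f"correlation(tenure, churn) = {r:.2f}")
```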
Based on our EDA, we can see that the more services are added, the fewer people churn. Gender is not a useful feature, and phone service and multiple lines aren't important.
The important features are:
seniorcitizen, partner, dependents, internetservice, contract, paperlessbilling, paymentmethod, totalcharges, and tenure.
By the end of this project I can set the following recommendations:
Recommendation
- Offer incentives for one/two-year contracts.
- Offer seasonal incentives on monthly charges for multiple lines when customers bring in dependents or partners.
- Offer free services, such as streaming services, to customers who stay.
- Stay competitive with internet services.
- Special offers for senior citizens.
- Be proactive when changing business plans.
Future Work
- Ask the telecom company to provide the zip codes of churned customers, in order to search for local competitors and provide better offers in the areas where customers churned to bring them back.
- Investigate fiber optic internet further with regard to its geographical distribution and its charges.
- Try neural network modeling for better accuracy.
If you have any suggestions, feel free to reach out at khuloodnasher@yahoo.com