AI+ Tutorial
Customer Churn Case Study
Using data to predict the probability of customer churn
In our last two posts, we have spent some time going through evaluation metrics for classification models.
In this post, we will work through a classification problem: customer churn. The dataset is part of a Data Science Nigeria (DSN) Pre-Bootcamp hackathon hosted on Zindi.
The goal is to predict customers that are likely to CHURN or stop using the network.
BACKGROUND
Expresso is an African telecommunications company that provides customers with airtime and mobile data bundles. The objective of this challenge is to develop a machine learning model to predict the likelihood of each Expresso customer “churning,” i.e. becoming inactive and not making any transactions for 90 days.
Customer Churn occurs when customers stop using the network. With this dataset we will try to understand customer behavior in order to accurately predict which customers are more likely to churn in the future.
BUSINESS IMPLICATION
These kinds of analyses are important to a service company. They equip the company to properly allocate resources to prevent customers from leaving due to dissatisfaction or other reasons. For instance, the company may decide to give bonuses or reduce service charges in order to keep customers.
In this dataset, customers who churn are denoted by 1 and those who do not churn by 0. The trick is that the expected solution should not be zeros and ones; the expected result is the probability of a customer churning, which can range from 0 to 1 inclusive. This information can be found in the Evaluation section of the competition, which I advise you to always read before starting so your hard work does not go down the drain.
If your results are not probabilities, you will find yourself with high error values. Did I mention that the evaluation metric is Logarithmic Loss? If you went through our post Evaluation Metrics I — Classification, you should know that a good classifier has log loss values close to 0.
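As a quick refresher (the metric is covered in more detail in that post), binary log loss over N predictions, where y_i is the true label and p_i is the predicted probability of churn, is:

LogLoss = -(1/N) * sum over i of [ y_i * log(p_i) + (1 - y_i) * log(1 - p_i) ]

Confident, correct probabilities push this value toward 0, while confident but wrong ones inflate it sharply.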
This tutorial is beginner friendly and is meant to provide a template for approaching machine learning problems. We do not tackle processes such as Feature Engineering or go in depth into feature selection. So don't sweat it.
However, the winning solutions for the competition are linked below. While this is a straightforward tutorial, in competitions and real-life applications a lot of work is put into the data to draw meaningful insight, with emphasis on meaningful. A method commonly used in competitions is ensemble learning, in which multiple models are trained on the data and then combined through processes such as stacking or bagging in order to improve the solution. As they say, two good heads are better than one.
This solution is separated into the following parts:
- Import Libraries
- Load Dataset
- Data preparation and cleaning
- Feature selection
- Model training
- Prediction
- Model Evaluation
DATA COLLECTION
One of the brighter sides of competitions is that we do not have to go through the hassle of obtaining our own data. In fact, it is against the rules to use data other than what is strictly outlined for the competition. Data is the backbone of machine learning and, as you might have guessed, very important. There are sites with free datasets, often in ready-to-use comma-separated values (CSV) format, such as Kaggle or Zindi.
Data can be obtained from various sources through various processes. One of these is extracting data from web pages, a process known as Web Scraping. Be careful though: there are ethics surrounding scraping websites, especially where private information is concerned.
IMPORT LIBRARIES
We will be using Python for our data analysis. Python has libraries that are handy for reading, visualizing and manipulating data. The key libraries are:
· Numpy
· Pandas
· Matplotlib and
· Seaborn
We also have to import our model. We will be making use of the Logistic Regression model from scikit-learn.
To import a library, simply use an import statement with the library name. For convenience, longer names can be shortened using import … as …, which binds the library to a shorter alias that is easy to use throughout the program.
Conventionally, numpy is imported as np and pandas as pd. So, wherever you see np, we are referring to numpy and likewise, pd is referring to pandas.
To import a particular module or object from a library, the from keyword is used. Importing a full library can be cumbersome and time consuming, especially when we only need a single function or class. Hence, instead of importing the full sklearn.linear_model package, we simply import what we need: LogisticRegression.
It is advisable to import all libraries at the beginning or top of the code. Nothing special, just neat coding.
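For reference, a minimal import block along those lines might look like this (the exact set depends on what you end up using later in your notebook):

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LogisticRegression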
DATA PREPARATION AND CLEANING
Data from different sources comes in different formats, which can be structured or unstructured. If you carry out web scraping, you are likely to end up with unstructured data. Lucky for us, our data is structured, so we only have to worry about missing values. Lucky for you, there are missing categorical and numerical values, so you get to see us handle both.
We will need to look at some information about the dataset to determine whether there are null values. One way to check is df.isnull().sum(), where df stands for the pandas DataFrame the data is read into. The .isnull() method returns Boolean values: if a value is null, it returns True. Chaining .sum() adds up all of the True instances to give the total number of null values in each column.
A quicker way is to use df.info(), which returns a quick summary of the dataset. The info() method reports:
- The number of rows in the dataset
- The number of columns
- The data type of each column
- The number of non-null values in each column
Of course, if the number of non-null values does not equal the total number of entries (rows) in a column, then we have missing values.
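A quick sketch of both checks, assuming the training file (the file name below is a placeholder; use the one from the competition page) is read into a pandas DataFrame named df:

df = pd.read_csv('Train.csv')   # placeholder file name
print(df.isnull().sum())        # missing values per column
df.info()                       # rows, columns, dtypes and non-null counts in one summary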
For the purpose of this tutorial we will try two approaches. For the REGION column, we will replace the null values with the mode value of the column while for the TOP_PACK column, we will replace the null values with a constant value.
For the numerical columns with missing data, we will use the fillna function to replace all null values with a constant value: -99999.
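A sketch of how this could look with pandas; the REGION and TOP_PACK column names come from the dataset, while the constant placeholder values are our own choices:

# Categorical columns: the mode for REGION, a constant for TOP_PACK
df['REGION'] = df['REGION'].fillna(df['REGION'].mode()[0])
df['TOP_PACK'] = df['TOP_PACK'].fillna('Unknown')   # constant placeholder, our choice

# Numerical columns: fill every remaining null with a constant
num_cols = df.select_dtypes(include='number').columns
df[num_cols] = df[num_cols].fillna(-99999)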
DATA EXPLORATION
The purpose of data exploration is to gain key insight into the data: get to know your data better. Data exploration is the process of using visualizations to understand the relationships between features and the target variable. Visualizations are fun! They turn a boring notebook into a beautiful one. Common visualization packages include:
- Matplotlib
- Seaborn
- Plotly
- ggplot
- Bokeh
…and so much more. Even pandas can be used for basic visualizations.
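As a small example of what basic exploration can look like (assuming the target column is named CHURN, as in the competition data), seaborn can plot the class balance in two lines:

sns.countplot(x='CHURN', data=df)   # how many customers churned vs. stayed
plt.show()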
FEATURE SELECTION
Features are the inputs to the model. In this dataset, the features are the columns; they are also called attributes. Our job is to determine which features provide the most insight: quality over quantity. A large number of features does not equate to better model predictions.
Another step carried out on features is Feature Engineering, the process of deriving new features from the data. Available features are usually combined to form new ones that produce better results. This can be done in a number of ways, such as adding features together, averaging them, or multiplying, dividing, subtracting and scaling the existing features.
Feature Engineering is not within the scope of this tutorial, but feature engineering is an important technique in data analysis.
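Just to make the idea concrete, here is a one-line illustration that is not part of this tutorial's pipeline; the column names are assumptions based on the Expresso data:

# Hypothetical engineered feature: average revenue per recharge
df['REVENUE_PER_RECHARGE'] = df['REVENUE'] / (df['FREQUENCE_RECH'] + 1)   # +1 avoids division by zero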
MODEL TRAINING
Time to get our hands dirty with our model. To train the model, we simply fit it to the training features. This is the model's first contact with the data. This is the easy part: we have picked our features (maybe even engineered new ones), so all we have to do is fit the algorithm to the data and voila! The model is trained. This shouldn't take long, though it can take a little longer depending on the number of features.
Here we use the train_test_split function from sklearn.model_selection to split the train data. Why? The train data is the only part of the dataset that is labeled (this is supervised classification, and logistic regression is a supervised classification algorithm). So we train the model with one portion and then test it on the held-out portion of the train data.
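A minimal sketch of this step, assuming the cleaned data is in a DataFrame named df with a CHURN target column; for simplicity we keep only numeric columns, since categorical columns would need encoding before logistic regression can use them:

from sklearn.model_selection import train_test_split

X = df.select_dtypes(include='number').drop(columns=['CHURN'])
y = df['CHURN']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LogisticRegression(max_iter=1000)   # a higher max_iter helps the solver converge
model.fit(X_train, y_train)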
PREDICTION
If you thought model fitting was easy, this is even easier. After fitting the model, we can now make predictions on our data. There is a catch, however: remember that our output should be the probability that a customer churns, not a binary output of 0 or 1.
Here, we will make use of the predict_proba function to return the probability values instead.
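Continuing the sketch above, predict_proba returns one column of probabilities per class, so we keep the second column, which is the probability of the positive class (churn = 1):

churn_probabilities = model.predict_proba(X_test)[:, 1]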
MODEL EVALUATION
We are not done yet. Now we have to compare our predictions with our actual values to determine our model’s prediction performance.
The evaluation metric by which we gauge our model’s performance is Log loss. We can import the log loss metric from sklearn.metrics and use it to evaluate our model performance.
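Continuing the same sketch, we compare the predicted probabilities with the held-out labels:

from sklearn.metrics import log_loss

score = log_loss(y_test, churn_probabilities)
print('Log loss:', round(score, 4))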
Our log_loss value is 0.30****. Not bad… But we can definitely do better. The winning solution had a log_loss value of 0.24.
The accuracy score of the model is 0.859, which also leaves room for improvement.
IMPROVING MODEL PERFORMANCE
Moving on, we can tweak the model parameters: maybe increase the maximum number of iterations or plug in other parameter values. A visit to the scikit-learn Logistic Regression documentation will help.
Another way the accuracy can be improved is through feature selection and engineering. Did we choose the right features in the first place? Maybe we can take some out, or bring back some that we dropped. The TENURE column can also be worked on so that its values are integers. These are some starting points.
Tried the above options and your model is no better? Try Ensemble Learning. You may be surprised by how well your combined algorithms perform compared to one good algorithm.
You may have noticed we merged the two datasets, train and test, and then split them again. This is not strictly necessary, but remember that whatever operation you carry out on one dataset must also be carried out on the other; the model will only perform as expected when the test data is prepared the same way as the training data. If you keep the datasets apart, you will have to carry out every operation twice.
Winning Solutions:
First Place Solution by Maxprop
Second place Solution by Kolatimi Dave
Bonus:
Solution by Aifenaike
CONCLUSION
In this tutorial, we have applied our knowledge of evaluation metrics. We have determined that although the model does a good job classifying labels, it can be further improved using some of the recommendations outlined above.
We have also looked at a practical example of applying machine learning to business problems. With this we round up our discussion on evaluation metrics for classification models.