AI+ Tutorial

Customer Churn Case Study

Using data to predict the probability of customer churn




These kinds of analyses are important to service companies. They equip a company to properly allocate resources to prevent customers from leaving due to dissatisfaction or other reasons. For instance, the company may decide to give bonuses or reduce service charges in order to retain customers.

  • Load Dataset
  • Data preparation and cleaning
  • Feature selection
  • Model training
  • Prediction
  • Model Evaluation


One of the brighter sides of competitions is that we do not have to go through the hassle of obtaining our own data. In fact, it is against the rules to use data other than that strictly outlined for the competition. Data is the backbone of machine learning and, as you might have guessed, very important. There are sites with free datasets, often in ready-to-use comma-separated values (CSV) format, such as Kaggle and Zindi.


We will be using Python for our data analysis. Python has libraries that are handy for reading, visualizing, and manipulating data, such as pandas for tabular data and NumPy for numerical computation.


Data from different sources comes in different formats, structured or unstructured. If you carry out web scraping, you are likely to end up with unstructured data. Luckily for us, our data is structured, so we only have to worry about missing values. Luckily for you, the data contains both missing categorical and missing numerical values, so you get to see how we handle each kind.

  • Number of columns
  • The data type of each column
  • The total non-null values of each column
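A minimal sketch of this inspect-and-clean step is below. The real competition file and column names are not given here, so the filename in the comment and the toy columns (`tenure`, `contract`, `churn`) are assumptions; the pattern of filling numerical gaps with the median and categorical gaps with the mode is one common approach, not necessarily the exact one used in the competition.

```python
import numpy as np
import pandas as pd

# In the competition you would load the provided file, e.g.:
#   df = pd.read_csv("train.csv")   # filename is an assumption
# Here we build a tiny frame with the same kinds of gaps for illustration.
df = pd.DataFrame({
    "tenure":   [1, 24, np.nan, 60],                      # numerical, one missing
    "contract": ["Monthly", None, "Yearly", "Monthly"],   # categorical, one missing
    "churn":    [1, 0, 0, 1],
})

# Prints the number of columns, each column's dtype, and its non-null count
df.info()

# Fill numerical gaps with the median, categorical gaps with the mode
df["tenure"] = df["tenure"].fillna(df["tenure"].median())
df["contract"] = df["contract"].fillna(df["contract"].mode()[0])
```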


The purpose of data exploration is to gain key insight into the data — get to know your data better. Data exploration is the process of using visualizations to understand the relationship between features and target variables. Visualizations are fun! They turn a boring notebook into a beautiful one. Common visualization packages include:

  • Seaborn
  • Plotly
  • ggplot
  • Bokeh
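As a small taste of what such a visualization looks like, here is a hedged Seaborn sketch on a toy frame (the `contract` and `churn` columns are stand-ins, not the competition's actual schema). A count plot split by the target is a quick way to see how churn varies across a categorical feature.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen (no display needed)
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Toy churn data standing in for the competition dataset (assumption)
df = pd.DataFrame({
    "contract": ["Monthly", "Yearly", "Monthly", "Monthly", "Yearly"],
    "churn":    [1, 0, 1, 0, 0],
})

# Bars per contract type, colored by whether the customer churned
ax = sns.countplot(data=df, x="contract", hue="churn")
ax.set_title("Churn by contract type")
plt.savefig("churn_by_contract.png")
```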


Features are the inputs to the model. In our dataset, the features are simply the columns; they are also called attributes. Our job is to determine which features provide the most insight, favoring quality over quantity: a large number of features does not equate to better model predictions.
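One way to rank features by how informative they are is a mutual-information score, sketched below on toy columns (the feature names and values are assumptions; the article does not specify which selection method was used).

```python
import pandas as pd
from sklearn.feature_selection import mutual_info_classif

# Toy features; the real dataset has many more columns (assumption)
df = pd.DataFrame({
    "tenure":         [1, 2, 3, 50, 48, 60, 24, 2],
    "monthly_charge": [80, 75, 90, 20, 25, 22, 30, 85],
    "churn":          [1, 1, 1, 0, 0, 0, 0, 1],
})
X = df.drop(columns="churn")
y = df["churn"]

# Higher score = the feature carries more information about churn
scores = pd.Series(mutual_info_classif(X, y, random_state=0), index=X.columns)
print(scores.sort_values(ascending=False))
```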


Time to get our hands dirty with our model. To train the model, we simply fit it to the training features. This is the model's first contact with the data, and it is the easy part: we have picked our features (maybe even engineered new ones), so all we have to do is fit the algorithm to the data and voilà, the model is trained. This shouldn't take long, although with a large number of features it can take a little longer.
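The fitting step really is one call. A minimal sketch with synthetic data standing in for the prepared feature matrix (the real features come from the steps above):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the prepared feature matrix (assumption)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)  # churn label driven by two features

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# The "easy part": one fit() call trains the model
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
print(f"train accuracy: {model.score(X_train, y_train):.2f}")
```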


If you thought model fitting was easy, this is way easier. After fitting, we can make predictions. There is a catch, however: remember that our output should be the probability that a customer churns, not a binary label of 0 or 1.
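In scikit-learn, that catch means using `predict_proba` rather than `predict`. A small self-contained sketch (again with synthetic stand-in data):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (assumption)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] > 0).astype(int)

model = LogisticRegression().fit(X, y)

# predict() would give hard 0/1 labels; predict_proba() gives probabilities.
# Column 1 is P(churn = 1), which is what the competition expects.
churn_prob = model.predict_proba(X)[:, 1]
print(churn_prob[:5])
```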


We are not done yet. Now we have to compare our predictions with our actual values to determine our model’s prediction performance.
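For a probabilistic output, log loss and ROC AUC are two standard ways to make that comparison; the sketch below uses synthetic data, since the competition's exact metric and hold-out split are not shown here.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption)
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 3))
y = (X[:, 0] + rng.normal(scale=0.5, size=400) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
model = LogisticRegression().fit(X_train, y_train)
proba = model.predict_proba(X_test)[:, 1]

# Log loss penalizes confident wrong probabilities; ROC AUC measures ranking
print(f"log loss: {log_loss(y_test, proba):.3f}")
print(f"ROC AUC:  {roc_auc_score(y_test, proba):.3f}")
```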


Moving on, we can tweak the model's hyperparameters, perhaps increasing the maximum number of iterations or adjusting other parameters. A visit to the scikit-learn Logistic Regression documentation will help.
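Rather than tweaking by hand, a grid search can try several settings at once; this is a sketch of the idea (the parameter values are illustrative, not the competition's winning settings).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (assumption)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] > 0).astype(int)

# Try a few regularization strengths and iteration caps,
# scored by (negative) log loss via cross-validation
grid = GridSearchCV(
    LogisticRegression(),
    param_grid={"C": [0.1, 1.0, 10.0], "max_iter": [200, 1000]},
    scoring="neg_log_loss",
    cv=3,
)
grid.fit(X, y)
print(grid.best_params_)
```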

Winning Solutions:

First Place Solution by Maxprop


In this tutorial, we have applied our knowledge of evaluation metrics. We have determined that although the model does a good job classifying labels, it can be further improved using some of the recommendations outlined above.

Data Science Nigeria PH community. Data Science and Artificial Intelligence
