A beginner data-scientist insight
Supervised machine learning problems are most likely the first problems you will encounter as you begin your data science career.
In supervised machine learning, we use input data (features) and output data (labels) from a training dataset to build a model that will (if all goes well) generalize to new input data and accurately predict new output values.
The two common types of supervised ML problems are Classification and Regression.
Classification and Regression are both predictive modeling tasks. Predictive models are trained on historical data in order to predict future behavior: the output label for a new input value.
In classification problems, our goal is to build a model that can accurately predict a class label based on selected attributes or features.
Examples of classification problems are:
· Predicting which passengers survived the sinking of the Titanic
· Classifying emails as spam or not spam
· Classifying (Iris) flower species
Note that there may be two class labels (binary classification) or more (multiclass classification). In either case, the class labels are discrete and limited to a fixed number of possible values.
In the Titanic example above, the class label ‘survived’ can be represented by the numbers 0 and 1, where 0 indicates those who did not survive and 1 indicates those who survived. This is binary classification.
For Iris flower classification (or any flower, in fact), the different species can be represented by the integers [0, 1, 2, …, n-1], one for each of the n species. This is multiclass classification.
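As a small illustration of how class labels map to integers, here is a minimal sketch in plain Python. The helper name encode_labels and the sample species list are made up for this example; libraries such as Scikit-learn ship their own label encoders.

```python
def encode_labels(labels):
    """Map each distinct label to an integer in [0, n-1], in sorted order."""
    classes = sorted(set(labels))
    mapping = {name: index for index, name in enumerate(classes)}
    return [mapping[name] for name in labels], mapping


# Hypothetical sample of Iris species names (three classes -> multiclass).
species = ["setosa", "virginica", "versicolor", "setosa", "virginica"]
encoded, mapping = encode_labels(species)

print(mapping)
print(encoded)
```

With only two distinct labels, the same helper would produce the 0/1 encoding used in the Titanic example.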
Common classification algorithms are:
· Naive Bayes
· Support Vector Machines
· Decision Tree
· K-Nearest Neighbors
· Ensemble Learning etc.
The general steps for solving a classification problem are outlined below.
1. Create or import dataset
When training models, it is advisable to divide the data into training and testing datasets. Evaluating the model on data it did not see during training makes it possible to detect overfitting and to check that the model generalizes to new data.
2. Import Classifier
This can be any of the aforementioned algorithms. The Scikit-learn library, which is widely used in data science and machine learning, contains implementations of a number of classification algorithms.
3. Assign the classifier to a variable (the short name is optional)
clf = classifier()
This makes the model easier to use. Classifier names can be very long, so assigning the classifier to the short name ‘clf’ saves typing the full name each time.
4. Fit the model
clf.fit(features_train, labels_train)
The model is fit with the input and output data of the training dataset.
5. Run prediction with the features
To predict values, the clf.predict() method is called on new input data.
6. Determine accuracy
The accuracy of a classification model is the number of points that were correctly classified divided by the total number of points.
accuracy = (No. of points correctly classified)/(total no. of points classified)
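Of course, most libraries provide built-in ways of computing the accuracy of a model. Scikit-learn classifiers, for example, have a score() method that returns the mean accuracy on the given data. It is always good to go through the documentation of the algorithm for more details.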
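The steps above can be sketched end to end with Scikit-learn. This is only one possible setup: the choice of the Iris dataset, the Gaussian Naive Bayes classifier, the 25% test split and the fixed random_state are assumptions made for the example, not requirements.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Import the dataset and split it into training and testing sets.
features, labels = load_iris(return_X_y=True)
features_train, features_test, labels_train, labels_test = train_test_split(
    features, labels, test_size=0.25, random_state=0, stratify=labels
)

# Import a classifier and assign it to the short name `clf`.
clf = GaussianNB()

# Fit the model on the training features and labels.
clf.fit(features_train, labels_train)

# Run prediction on features the model has not seen.
pred = clf.predict(features_test)

# Accuracy = points correctly classified / total points classified.
correct = sum(int(p == t) for p, t in zip(pred, labels_test))
accuracy = correct / len(labels_test)

# The library helper computes the same ratio.
library_accuracy = accuracy_score(labels_test, pred)

print(f"manual accuracy:  {accuracy:.3f}")
print(f"library accuracy: {library_accuracy:.3f}")
```

Swapping in another classifier (for example a decision tree) should only require changing the import and the line that creates clf, since Scikit-learn classifiers share the same fit and predict interface.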
Regression is used to determine the relationship between independent variables and a dependent variable. This relationship is in the form of an equation which can then be applied to new data.
In regression, the predicted value is a continuous number rather than a class label. The simplest form of regression is linear regression, where we assume that the input and output variables have a linear relationship.
Using independent variables (input features), we are able to predict the dependent variable (output).
The equation for a simple linear regression is:
y = mx + b
Where y is the dependent variable or output
m is the slope
x is the independent variable or input feature
b is the intercept or bias
Note that naming conventions vary between data scientists and statisticians, but the terms refer to the same things.
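To make the equation concrete, the following sketch estimates m and b with the ordinary least-squares formulas in plain Python and then applies the fitted equation to a new input. The helper name fit_line and the data points are made up for illustration.

```python
def fit_line(xs, ys):
    """Return slope m and intercept b of the least-squares line y = m*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-deviations of x and y, and sum of squared deviations of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx
    b = mean_y - m * mean_x
    return m, b


# Made-up points that lie exactly on the line y = 2x + 1.
xs = [0, 1, 2, 3, 4]
ys = [1, 3, 5, 7, 9]
m, b = fit_line(xs, ys)

# Apply the fitted equation to a new input value.
x_new = 10
y_new = m * x_new + b

print(f"slope m = {m}, intercept b = {b}, prediction for x = {x_new}: {y_new}")
```

Real data rarely falls exactly on a line, in which case the same formulas return the line that minimizes the sum of squared differences between the actual and predicted y values.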
Examples of regression problems are:
· Predicting the bonus of employees in a company using attributes such as salary, number of years in the company and so on.
· Predicting one’s net worth from their age, race or level of education.
Linear Regression template:
The template for regression is quite similar to that of classification so I will spare you the boring details. Below is the outline for Linear Regression.
1. Import the Regression Algorithm
2. Assign the regression to the variable ‘reg’
3. Fit the model on the training data
4. Run prediction
5. Determine the slope of the regression (m)
6. Determine intercept of the regression (b)
7. Determine error
The error is the difference between the actual value and the value obtained from the model.
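The outline above might look like this with Scikit-learn's LinearRegression. The bonus figures are invented for the example (they follow bonus = 100 × years + 50), and the "actual" values for the new inputs are hypothetical observations used only to show how the error is computed.

```python
from sklearn.linear_model import LinearRegression

# Made-up training data: years in the company -> bonus (arbitrary units).
years_train = [[1], [2], [3], [4], [5]]
bonus_train = [150.0, 250.0, 350.0, 450.0, 550.0]

# Import the regression algorithm and assign it to the variable `reg`.
reg = LinearRegression()

# Fit the model on the training data.
reg.fit(years_train, bonus_train)

# Run prediction on new input values.
years_new = [[6], [7]]
bonus_actual = [640.0, 760.0]  # hypothetical observed values
bonus_pred = reg.predict(years_new)

# Slope (m) and intercept (b) of the fitted line.
slope = reg.coef_[0]
intercept = reg.intercept_

# Error = actual value - value obtained from the model.
errors = [actual - predicted for actual, predicted in zip(bonus_actual, bonus_pred)]

print(f"slope: {slope:.2f}, intercept: {intercept:.2f}")
print("errors:", [round(e, 2) for e in errors])
```

Note that Scikit-learn expects the features as a two-dimensional structure (one row per sample), which is why each year is wrapped in its own list.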
We summarize the differences between Classification and Regression under the following subsections:
1. Output of the model
In classification, the output is a discrete class label. If a continuous value is returned, it is not meaningful as a quantity in itself; it is the probability that the input belongs to a given class.
In regression, the output is a continuous numeric value, and that value is itself the prediction.
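As a rough illustration of the classification case, the sketch below turns predicted probabilities into discrete 0/1 labels. The probability values, the 0.5 threshold and the helper name to_class_label are assumptions for the example; real classifiers may apply a different decision rule.

```python
def to_class_label(probability, threshold=0.5):
    """Convert a predicted probability of the positive class into a 0/1 label."""
    return 1 if probability >= threshold else 0


# Hypothetical probabilities of survival produced by some binary classifier.
survival_probabilities = [0.92, 0.13, 0.50, 0.48]
predicted_labels = [to_class_label(p) for p in survival_probabilities]

print(predicted_labels)
```

A regression model would skip this conversion, because its continuous output is already the final answer.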
2. Evaluation of the model
In classification, we evaluate the model by its accuracy in predicting the correct output labels, while in regression we measure how close the predicted values are to the actual values. This can be done with the sum of squared errors (SSE) or the R-squared score.
Always remember to read the documentation of the algorithm you decide to use.
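The two kinds of evaluation can be compared side by side in plain Python. The function names and all values below are made up for illustration; in practice a library such as Scikit-learn provides equivalent metrics.

```python
def accuracy(actual, predicted):
    """Fraction of labels that were predicted correctly."""
    correct = sum(1 for a, p in zip(actual, predicted) if a == p)
    return correct / len(actual)


def sum_of_squared_errors(actual, predicted):
    """Sum of squared differences between actual and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted))


def r_squared(actual, predicted):
    """1 - SSE / total sum of squares around the mean of the actual values."""
    mean_actual = sum(actual) / len(actual)
    total = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - sum_of_squared_errors(actual, predicted) / total


# Classification: compare discrete labels.
labels_actual = [1, 0, 1, 1]
labels_predicted = [1, 0, 0, 1]
acc = accuracy(labels_actual, labels_predicted)

# Regression: compare continuous values.
values_actual = [3.0, 5.0, 7.0, 9.0]
values_predicted = [2.5, 5.5, 7.0, 9.0]
sse = sum_of_squared_errors(values_actual, values_predicted)
r2 = r_squared(values_actual, values_predicted)

print(f"accuracy: {acc}, SSE: {sse}, R-squared: {r2}")
```

A lower SSE means the predictions are closer to the actual values, while an R-squared closer to 1 means the model explains more of the variation in the output.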