Using Functions in Data Analysis

Implementing functions in your data analysis process.


Functions are not a new idea in programming. In fact, most programming languages have built-in functions that allow programmers to carry out different tasks.

A function is a sequence of statements grouped together in a block to perform an action or a series of actions.

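Some built-in functions in Python include:

  • min()
  • max()
  • str()
  • float()

and many more. Others, such as cos(x), sin(x) and tan(x), are not built in; they are provided by the standard math module and must be imported before use.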
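
Here is a short sketch of a few of these in use. The values list is made up for illustration. Note that the trigonometric functions are reached through the standard math module rather than called on their own.

```python
import math

values = [3, 7, 1]

print(min(values))     # smallest item in the list
print(max(values))     # largest item in the list
print(str(7))          # the number 7 converted to text
print(float('2.5'))    # the text '2.5' converted to a float
print(math.cos(0))     # cosine of 0 radians, from the math module
```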

We can also define our own functions. To define a function, we use the keyword def followed by a function name, parentheses and a colon. Once a function is defined, it can be called as many times as needed throughout the program.

Function naming rules are the same as those for variables. A function name may contain letters, numbers and underscores, but it cannot begin with a number and cannot contain spaces or other punctuation marks such as hyphens. Avoid giving the same name to two different functions, and do not use a variable with the same name as your function in the same program, because the later definition replaces the earlier one. Keywords cannot be used as function names.
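
Whether a name is legal can be checked directly in Python. The sketch below uses the string method isidentifier() and the standard keyword module. The helper is_valid_function_name and the candidate names are invented purely for illustration.

```python
import keyword

def is_valid_function_name(name):
    """Return True if name could legally follow the def keyword."""
    # A valid name uses only letters, digits and underscores, does not
    # start with a digit, and is not a reserved keyword.
    return name.isidentifier() and not keyword.iskeyword(name)

for candidate in ['some_func2', '2some_func', 'some-func', 'class']:
    print(candidate, is_valid_function_name(candidate))
```

Only the first candidate passes: the second starts with a number, the third contains a hyphen and the fourth is a keyword.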

def some_func2(): 

The empty parentheses indicate that the function above does not take any arguments. To define a function that takes arguments, we simply place parameters between the parentheses.

def some_func3(a, b):

A function is made up of a header, which is the first line, and a body, which is indented and can contain any number of statements.

def some_func4(a, b):
    print(a, b)

Why Use Functions

  • Functions increase the efficiency of a data science workflow by eliminating repetition and redundant lines of code.
  • Making corrections to code is easier, as you only have to debug the function and the corrections apply wherever the function is called.
  • Using functions makes the program smaller and more compact.
  • Most importantly, functions are reusable and can be ‘called’ multiple times across the program.

Execution of Functions

Python code is executed line by line. When we define a function, Python stores it and runs it whenever the function is called. In this way, we can use one function multiple times during our workflow without having to rewrite the code.

To call a function, we simply write the function name followed by parentheses, with any required arguments inside them.

Functions can be as simple as printing ‘Hello World’ or so complex that they make use of nested loops and other functions.
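
As a small sketch of one function being called repeatedly and from inside another function, consider the example below. The names square and sum_of_squares are made up for this illustration.

```python
def square(x):
    # Return the square of a single number
    return x * x

def sum_of_squares(values):
    # Reuse square() for every value instead of repeating the arithmetic
    total = 0
    for v in values:
        total += square(v)
    return total

print(square(3))
print(sum_of_squares([1, 2, 3]))
```

Here square is written once but runs four times: once directly and three times inside the loop.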

Creating a Function

To create a function in Python, use the def keyword followed by the function name, parentheses and a colon.

def new_func():
    print('Hello World')
new_func()
>> Hello World

This simple function prints ‘Hello World’ whenever it is called.

Function Arguments

Functions can be defined so that they take arguments. Parameters serve as placeholders and indicate that the function should be called with values within the parentheses.

def func(name):
    print('Hello', name)
func('James')
>> Hello James
func(9)
>> Hello 9

The function above prints ‘Hello’ followed by whatever argument is passed to the function when it is called.

The number of arguments put into the function matters.

def add(a, b):
    total = a + b
    print(total)
add(1, 3)
>> 4

A TypeError is raised if a function receives fewer arguments than the number specified in the function definition.

add(2)
>> TypeError: add() missing 1 required positional argument: 'b'

In the example above, an error occurs because the function expects two values but receives only one.

add(1,3,4)
>> TypeError: add() takes 2 positional arguments but 3 were given

Likewise, an error is raised if you pass in more arguments than the function requires.

If you are uncertain of the number of arguments that will be passed into the function, you can define it to accept an arbitrary number of arguments. To do so, simply place an asterisk before the parameter name.
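
A minimal sketch of an arbitrary-argument function is shown below. The name add_all and the parameter name numbers are chosen for illustration; any valid parameter name works after the asterisk.

```python
def add_all(*numbers):
    # numbers arrives as a tuple holding every positional argument
    total = 0
    for n in numbers:
        total += n
    return total

print(add_all(1, 3))
print(add_all(1, 3, 4))
print(add_all())
```

Unlike add(a, b), this version accepts two, three or even zero arguments without raising a TypeError.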

Adding Functions to your Data Science workflow

The aim of using functions is to make our work a little easier. Therefore, not everything needs a function. You should look out for tasks you have to carry out more than once or twice in your analysis process. Rather than having to copy and paste or rewrite the same block of code, you can easily make it into a function. Some of these tasks include:

Filling Null Values

No one wants to spend a lot of time going through each column to fill in missing values. What if you have a thousand columns (an exaggeration, I know)? You will find that this task is repetitive and can become quite cumbersome when you have a large dataset. These are the types of tasks you want to automate, and creating a function helps you do just that.

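def fillvalues(data):
    for col in data.columns:
        # Assign the result back to the column; fillna(inplace=True) on a
        # selected column may not update the DataFrame in recent pandas versions
        data[col] = data[col].fillna(-999)

fillvalues(df)  # df is an existing pandas DataFrame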

The simple function above iterates through all the columns in a dataset and replaces every missing value with -999, or any other number you specify. It removes the need to write a separate line of code for each column that contains missing values.

The function can be modified for datasets with categorical data by introducing an if/else statement and specifying the value to be filled.

This function works well with numerical columns because, often, the null values across different columns can be filled with the same value, typically an out-of-range number such as -99999. For categorical columns, it is common practice to fill missing values with the most frequent value, also called the modal value. The code will look similar to the example below.

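def fillvalues(data):
    for col in data.columns:
        if data[col].dtypes == float:
            # Float column: fill with an out-of-range number
            data[col] = data[col].fillna(-99999)
        else:
            # Any other column: fill with its most frequent value
            mode_val = data[col].mode()[0]
            data[col] = data[col].fillna(mode_val)

fillvalues(df)  # df is an existing pandas DataFrame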

Here I have extracted the modal value of each column and saved it in the variable mode_val. This value is computed for each non-float column and replaces that column's missing values. You can also fill missing categorical values with a constant, in which case you only have to specify the value. Likewise, you can replace missing numerical values with the median or mode of the column. Have fun with it: take this sample code and try it on different datasets or scenarios.

This is code I have used several times during my data analysis process. It makes it easier for me to loop through the columns and fill in the values automatically.

One precaution I always take is to view the data type of each column in the dataset. This can easily be done using the pandas DataFrame info() method. It prints the number of entries (rows) in the dataset, the columns in the dataset, and the non-null count and data type of each column.
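
One such variation is sketched below: numerical columns get their own median and all other columns get a constant. The function name fill_missing, the cat_value parameter and the toy DataFrame are all invented for this example, and the numeric check uses pandas.api.types.is_numeric_dtype instead of the float comparison above.

```python
import pandas as pd

def fill_missing(data, cat_value='Unknown'):
    # Numerical columns get their own median; everything else gets cat_value
    for col in data.columns:
        if pd.api.types.is_numeric_dtype(data[col]):
            data[col] = data[col].fillna(data[col].median())
        else:
            data[col] = data[col].fillna(cat_value)
    return data

# Toy data, invented for this example
df = pd.DataFrame({
    'age': [20.0, float('nan'), 40.0],
    'city': ['Lagos', None, 'Abuja'],
})

df = fill_missing(df)
print(df)
```

The missing age becomes 30.0, the median of 20.0 and 40.0, and the missing city becomes 'Unknown'.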

With this information, I am able to modify my function to suit the dataset.
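
As a rough sketch of that check, the snippet below inspects a toy DataFrame (invented for this example) and separates the numerical columns from the rest with select_dtypes, so that a fill function can treat the two groups differently.

```python
import pandas as pd

# Toy data, invented for this example
df = pd.DataFrame({
    'age': [20.0, float('nan'), 40.0],
    'city': ['Lagos', None, 'Abuja'],
})

df.info()  # prints row count, column names, non-null counts and dtypes

# Split the column names by data type
numeric_cols = df.select_dtypes(include='number').columns.tolist()
other_cols = df.select_dtypes(exclude='number').columns.tolist()

print(numeric_cols)
print(other_cols)
```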

Reading Text files

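def filereader(file):
    count = 0
    # The with block closes the file automatically when the loop is done
    with open(file) as f:
        for line in f:
            count += 1
            print(line, end='')  # each line already ends with a newline
    return count

filereader('sample.txt')  # 'sample.txt' is a placeholder; pass the path to your own text file

The function above opens the specified text file, iterates through its contents and prints each line, after which it returns the number of lines in the text file.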
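
To try the idea without a file of your own, the sketch below writes three lines to a temporary file and counts them. The helper count_lines is a hypothetical, print-free variant of the reader above, and the file contents are made up.

```python
import os
import tempfile

def count_lines(path):
    # Open the file, count its lines and close it automatically
    with open(path) as f:
        return sum(1 for _ in f)

# Create a throwaway text file with three lines
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write('first line\nsecond line\nthird line\n')
    path = tmp.name

n_lines = count_lines(path)
print(n_lines)

os.remove(path)  # clean up the temporary file
```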

Helper Function

This is a code excerpt from Aifenaike’s Data Science Nigeria Bootcamp Qualification Hackathon on Loan default.

Here, the cross validation loop and other processes are built for training and stacking that will be done throughout our workflow…

This is a more complex function which is used to train stacked models. The full code can be viewed on his GitHub page.

Final Thoughts

Coding re-usable functions is one of our ‘9 Habits of Effective Data Scientists’. Another effective habit is the use of comments and documentation. Comments are important in functions. The more complex our function is, the more comments are needed to serve as a guide. Remember, comments are for your benefit as well as anyone who would seek to read or contribute to your work.

You can use single inline comments or multi-line comments. To write comments across multiple lines, simply enclose them in three single or double quotation marks; placed as the first statement in a function, such a string becomes the function's docstring. The comments should include the purpose of the function (function of the function, LOL), the function's parameters and the expected arguments.

def total(a, b):
    """
    This function calculates the sum of two arguments
    a is an int or float
    b is an int or float
    """
    # Just adding an in-line comment for fun
    return a + b
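
A triple-quoted string placed first in a function body is stored on the function itself, so it can be read back later. A minimal sketch, using a shortened version of the total function:

```python
def total(a, b):
    """Return the sum of a and b (each an int or float)."""
    return a + b

# The docstring is available through the __doc__ attribute
print(total.__doc__)
print(total(2, 3))
```

The built-in help() function displays the same text, which is handy when you return to a function months after writing it.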

Writing a function is not the easiest thing to do and can take some debugging to get it right. Bear in mind that one good function can be used multiple times while you only have to debug one set of code.

If you have not been making use of functions, now is a good time to start. If you have been making use of functions, what tasks do you use them for and how has this improved your data science workflow?

Data Science Nigeria PH community. Data Science and Artificial Intelligence
