One thing we can all agree on as very important for Data Science and Machine Learning is Data. Lots of Data. And with lots of data come lots of problems.
With the rise in Artificial Intelligence, one term we hear all too often is Big Data. What is Big Data?
Big Data is used to refer to structured and unstructured data of very large volume that are difficult to process using traditional computing devices. However, Big Data is an evolving term and differs among organizations and time of use.
With new computing power and the advent of Cloud computing, what we considered Big Data ten years ago may not be so big. You see, Big Data has more to do with computing power than the size of the data itself.
I remember growing up, 200MB was big data for me. Expensive data. If I tried to download a song, I would check how big it was in MB; 2–10 MB did it for me. Anything much bigger and I would just have to wait for the radio to play the song. Yes, I was a miser, but in my defense, I wasn't earning any money myself. Now movie streaming is a thing and I cannot survive a day on just 200MB. In school I spend an average of 3–5 thousand Naira on data, depending on how broke I am. We know those one thousand naira for 1.5GB don't take us anywhere.
Before I tell you my whole life story, what I am trying to say is that Big Data is relative. I bet what a small business considers Big Data is small data for Facebook or Google.
Big Data is defined using 3 Vs (or 5 Vs): Volume, Velocity and Variety, plus Veracity and Value.
Volume: Big Data is big in size. Like I said earlier, a large volume of data in a small company can be peanuts in a big one. The volume of data collected is fast-growing, and Big Data may as well become Bigger Data. That was a joke. Haha.
Velocity: the speed with which the data is obtained. With sensors and other technology, we can now get data in real time. Let me use Google as an example. When you make a search, you see results in a fraction of a second (depending on your network speed too), and you can even see how long it took. In writing this piece, of course, I had to google what is Big Data and I got About 9,390,000,000 results (0.64 seconds). Talk about BIG and FAST!
Variety: data can be collected from different sources, and data from different sources come in different formats (structured or unstructured) and data types. As they say, variety is the spice of life.
Veracity: is our data accurate? Is it truly representative of reality? That is what veracity is: how accurate and trustworthy our data is.
Value: we would not look through millions or billions of records if we didn't hope to get something out of them. Data exploration, feature extraction, model training and prediction are all done for a purpose; if that purpose is not being fulfilled, the data has no value to us.
Big Data has Big Problems
There are a lot of challenges in handling Big Data. These challenges may stem from the data itself, the algorithm, the computing device, the data collection point and so on. One of these problems is missing data. When there are missing values in a dataset, one is faced with a couple of options:
- Drop the observations with missing data
- Fill in the missing data
- Assume the missing data are a variable
- Ignore the attribute with missing data
- Ignore instances of missing data
Missing data may stem from a failure at the data collection point (e.g. a faulty sensor), human error (failing to fill in needed data) and so on.
The solution applied will depend on the amount of missing data, the attributes with the missing data and the category of missing data.
When faced with a dataset, one of the first things you should do is check for missing values. If missing values exist, ascertain which columns or features have them. If a column is not of interest to the decision-making process, you can simply drop the feature as a whole. Recall that columns and features are the same thing in this respect.
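As a quick sketch of that first check (using pandas and a small made-up dataset), isnull().sum() counts the nulls in each column:

```python
import numpy as np
import pandas as pd

# A tiny, made-up dataset for illustration
df = pd.DataFrame({
    "age": [25, np.nan, 31, 47],
    "city": ["Lagos", "Abuja", None, "Ibadan"],
    "score": [88.0, 92.5, np.nan, 75.0],
})

# Count missing values per column (feature)
missing_per_column = df.isnull().sum()
print(missing_per_column)
```

Columns that show a count of zero are clean; the others are the ones you need to make a decision about.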
In Python, you can use the pandas dropna() function to drop observations with null values. However, this method can lead to the loss of a large portion of the data and should be used carefully.
I once used the dropna() function on a dataset and half of the data was cut. You should go through all the features for missing values to determine which ones are ‘droppable’ and which you need to use the pandas fillna() function on.
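Here is a small sketch of that trade-off on a made-up dataset, so you can see how much data dropna() would actually throw away before committing to it:

```python
import numpy as np
import pandas as pd

# Made-up dataset with nulls scattered across two columns
df = pd.DataFrame({
    "age": [25, np.nan, 31, 47, np.nan, 19],
    "income": [50_000, 62_000, np.nan, 71_000, 45_000, np.nan],
})

rows_before = len(df)
dropped = df.dropna()          # drops any row containing at least one null
rows_after = len(dropped)

fraction_lost = 1 - rows_after / rows_before
print(f"Fraction of rows lost: {fraction_lost:.0%}")
```

In this toy example two thirds of the rows vanish, exactly the kind of surprise described above; if the fraction lost is that high, fillna() is usually the better tool.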
The approach to missing data also depends on the data type. In a column of categorical data, missing values can easily be replaced with the mode value, the value that occurs most often. If the data type is numerical, the null values can be replaced with the mean or median of the values in the feature column. Classification and regression models can also be applied to impute the missing categorical or numerical values respectively.
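A minimal sketch of mode and mean imputation with fillna(), on a made-up two-column dataset:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "colour": ["red", "blue", None, "red"],   # categorical feature
    "height": [1.6, np.nan, 1.8, 1.7],        # numerical feature
})

# Categorical: fill with the mode (the most frequent value)
df["colour"] = df["colour"].fillna(df["colour"].mode()[0])

# Numerical: fill with the mean (median also works and is robust to outliers)
df["height"] = df["height"].fillna(df["height"].mean())
print(df)
```

Note that mode() returns a Series (there can be ties), which is why the example takes its first element.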
Another method is to replace the null/missing values with constant values. Again, the constant value would depend on the data type of the column or feature.
If the missing data are very small relative to the dataset, you may choose to ignore them or treat them as a variable of the dataset. That is, train the model with the null values as separate values in the dataset.
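Filling with a constant is also one way to treat missingness as its own variable: pick a sentinel value (the label "unknown" below is just an illustrative choice) and the missing entries become another category the model can learn from. A quick sketch:

```python
import pandas as pd

df = pd.DataFrame({"colour": ["red", "blue", None, "red", None]})

# Replace nulls with a constant sentinel; the missing entries now
# form their own category instead of being imputed away
df["colour"] = df["colour"].fillna("unknown")

counts = df["colour"].value_counts()
print(counts)
```

If "unknown" turns out to be predictive, that tells you the missingness itself carries information.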
You may choose to compare the model's accuracy with the null values left in, then replace the null values and fit the model again to determine the effect the null values have on model accuracy.
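One way to run that comparison is to fit the same model twice, once on rows with nulls dropped and once on imputed data. The sketch below uses a synthetic dataset and scikit-learn; the fit_and_score helper and the 20% missingness rate are illustrative assumptions, not part of any standard recipe:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 500

# Synthetic data: the label depends on both features
X = pd.DataFrame({"f1": rng.normal(size=n), "f2": rng.normal(size=n)})
y = (X["f1"] + X["f2"] > 0).astype(int)

# Knock out roughly 20% of f1 at random to simulate missing data
mask = rng.random(n) < 0.2
X_missing = X.copy()
X_missing.loc[mask, "f1"] = np.nan

def fit_and_score(features, labels):
    X_tr, X_te, y_tr, y_te = train_test_split(features, labels, random_state=0)
    model = LogisticRegression().fit(X_tr, y_tr)
    return accuracy_score(y_te, model.predict(X_te))

# Strategy A: drop the rows that contain nulls
dropped = X_missing.dropna()
acc_drop = fit_and_score(dropped, y[dropped.index])

# Strategy B: impute nulls with the column mean
imputed = X_missing.fillna(X_missing["f1"].mean())
acc_impute = fit_and_score(imputed, y)

print(f"drop: {acc_drop:.2f}, impute: {acc_impute:.2f}")
```

Whichever strategy scores better on held-out data is the one to prefer for that dataset; the answer genuinely varies from problem to problem.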
Note that for any method you choose to use, errors may abound. Watch out for those. You can also combine a number of methods on a dataset.
Remember, do not be afraid to play with your data.
References and additional reading
- Introduction to Machine Learning with Python by Andreas C. Müller and Sarah Guido
- Machine Learning Mastery tutorial - Link