Tools and skills you need in Data Science
There has been a debate on whether Data Science is the ‘sexiest job of the 21st century’ or not. We will not be taking part in that debate today. Rather, for those looking to get a start in data science and, by extension, Machine Learning, or those just looking to get ahead in the field, we have curated the tools you need as a data scientist.
These tools will be broken down into different categories for convenience (more mine than yours). We will also look at the basic skills a data scientist should possess and how to build an amazing Data Science portfolio.
That sounds like a lot, so let’s get started.
A database is a collection of data. Types of Databases include:
- Hierarchical database
- Network database
- Object-oriented databases
- Relational databases
- NoSQL databases
The most common database in use, and our focus, is the relational database.
Structured Query Language (SQL) is the standard language for communicating with a relational database and the language used by relational Database Management Systems.
A Database Management System (DBMS) is used to interact with the database through queries. The DBMS can be used to create, access, organize and modify databases, and to extract information from them. Popular DBMS include MySQL, PostgreSQL, Microsoft SQL Server and Oracle.
SQL is fairly easy to learn once you understand its syntax and query format. It is also important to note that the syntax may differ across the various DBMS. The difference is usually very small though and transitioning across the various DBMS should not pose any serious problems.
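As a minimal sketch of what a query looks like, the example below uses Python's built-in sqlite3 module with an in-memory database. The `sales` table, its columns and its rows are made up for illustration; other DBMS such as MySQL or PostgreSQL would use similar statements, with small differences in syntax.

```python
import sqlite3

# In-memory SQLite database; nothing is written to disk.
conn = sqlite3.connect(":memory:")

# Hypothetical table and rows, invented for this example.
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales (region, amount) VALUES (?, ?)",
    [("North", 120.0), ("South", 80.0), ("North", 50.0)],
)

# Total sales per region, largest total first.
query = """
    SELECT region, SUM(amount) AS total
    FROM sales
    GROUP BY region
    ORDER BY total DESC
"""
totals = conn.execute(query).fetchall()
conn.close()

print(totals)
```

The SELECT, FROM, GROUP BY and ORDER BY clauses shown here are the parts of the query format that tend to carry over from one DBMS to another.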
I have seen articles and videos comparing R and Python, discussing which is better to use for data analysis or which a data scientist should start with. I understand that these questions come from a place of worry and the need to get things right. I would however like to state that I see these languages as tools — a means to an end. There are people who will swear and die by R and likewise for Python.
Whichever you choose, both R and Python have large communities which you can rely on for help. For me, getting help is a huge factor.
The R programming language is very good for statistics and computational analysis, especially when you are working with large datasets. It also has different data types such as scalars, vectors, matrices, data frames, arrays and lists. R has numerous packages for working with data, from pre-processing to data analysis and machine learning.
Integrated Development Environments (IDEs) for running and editing R code include:
- StatET for R
- ESS etc
Python, on the other hand, is a general-purpose language. When I started learning Python, it was not because of data science, and I cannot remember exactly what it was for. Back then I just wanted to learn programming because it was the ‘it’ thing and I had heard Python was easy to learn (I hope some of you can relate). And yes, it is easy to learn the basics of Python. Python also has libraries which are very useful when working with data, and it has gained much ground in machine learning and AI.
Python IDEs include:
- Jupyter Notebook
But do not be fooled by the back and forth between R and Python. There are other programming languages which can be used for data analysis. I’ve discussed this in a previous post, along with the factors you should consider when deciding which language you want to work with.
BIG DATA TOOLS
Due to the nature of ‘Big Data’, conventional databases are not enough, hence Big Data tools. The field of Big Data deals with large amounts of data that are difficult to process using traditional data analysis tools or techniques.
Big Data can be structured or unstructured and can include different data types.
Big Data tools include:
- Data Lakes
Cloud computing has gained ground over the years, with the most popular providers being AWS, Google Cloud and Microsoft Azure. These cloud services provide on-demand computing resources such as data storage, servers, databases, networking and computing power, reducing or totally removing the need for users to manage them. The services are all run by the provider and the user can access them at a price.
The three main types of cloud computing are:
- Infrastructure as a Service (IaaS)
- Platform as a Service (PaaS)
- Software as a Service (SaaS)
The advantage of using these cloud services is that users can work with large amounts of data without having to worry about the computational power of their devices and they also do not need to maintain their own servers.
Cloud computing services also offer low-code or no-code environments for deploying machine learning models.
These Cloud services promise:
- Scalability and flexibility
- Data protection
- Data accessibility at any time from anywhere
- Cost efficiency (check prices for cloud service of choice)
- Networking capabilities
HOW TO ‘LEARN’ DATA SCIENCE
How long does it take to become a data scientist?
It is not easy to say because individuals have different learning abilities. I will however give a timeline.
If you do choose to go down the Python path, learning the basics can take you anywhere between a week and a month. Python is quite easy to learn, especially because you are not expected to learn all parts of Python. The areas you should focus on are:
- Python data types and structures (integers, floats, strings, lists, dictionaries)
- Using functions in Python
- Accessing web data with Python
- Python libraries such as NumPy, pandas and SciPy that can be used to manipulate data
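To give a feel for the first two items, here is a small sketch that uses only built-in types. The values and the `average_score` function are hypothetical names made up for this illustration.

```python
# Basic types: a string, an integer and a float.
name = "Ada"
age = 36
height_m = 1.68

# A list keeps items in order; a dictionary maps keys to values.
scores = [72, 88, 95]
person = {"name": name, "age": age, "height_m": height_m}


def average_score(values):
    """Return the mean of a list of numbers, or None for an empty list."""
    if not values:
        return None
    return sum(values) / len(values)


mean_score = average_score(scores)
print(person["name"], mean_score)
```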
If you dedicate 3–4 hours a week, you should be able to learn all these. I will not recommend any particular course because there are so many good ones available, both paid and free.
Next you should learn SQL, which is quite easy. Like I said above, once you get the syntax of SQL, you can spend 2–3 weeks learning it, or even less. Remember, the syntax may vary for different DBMS but the structure remains the same.
Even more than learning all these, I would advise you to take a refresher course or two on statistics and Linear Algebra.
I find that a lot of times, people focus more on playing with the ‘big tools’ and don’t bother learning the basics. While there are libraries which can handle these aspects for us, it is still good to understand the basics.
This would work much the same with R. R can prove daunting to beginners, but once you get the hang of it, I believe it will be much easier.
Most importantly, I would advise you to start working on projects as soon as possible. Working on projects will allow you to put what you learn into practice. They say the best way to learn is by doing, right?
Data Science Portfolio
A Data Science Portfolio offers a quick way to show your skill set and expertise before, during and after an interview.
No, I do not expect you to have one of those clear folder bags filled with data science projects. The easiest way to build a portfolio as a data scientist is to have a blog or a GitHub account. They both offer different benefits but with a GitHub account, you do not have to bother yourself with maintaining a blog.
When uploading projects, you should make sure to include a summary of the project or documentation. In GitHub, you upload a README explaining the use of your project, how it can be set up (if need be) and any extra work that can be done. Another easy way is to contribute to repositories. You can also set up your GitHub account so that it better reflects your work.
However you choose to showcase your work, here are some basic skills you should highlight in your work:
Data cleaning ability
Data in the real world is often messy and unstructured. You should highlight the fact that you are able to work with a wide variety of data. Here you can also show how you get your data, be it web scraping, reading a file from the web (using urllib and the like) or using data in your system.
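As one hedged illustration of loading data, the sketch below parses CSV text with Python's standard csv module. The `load_rows` helper and the sample text are made up for this example; text read from the web or from a file in your system could be handled the same way once it is available as a text file-like object.

```python
import csv
import io


def load_rows(text_source):
    """Read CSV rows from a text file-like object into a list of dicts."""
    reader = csv.DictReader(text_source)
    return [dict(row) for row in reader]


# Sample CSV text standing in for a downloaded or local file.
sample = io.StringIO("city,temp_c\nLagos,31\nAbuja,29\n")
rows = load_rows(sample)
print(rows)
```

Note that every value comes back as a string, so converting columns to numbers is usually one of the first cleaning steps.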
Also, you should show how you deal with missing data, as missing data is often something we cannot avoid in datasets. Do you replace the missing values or flat out remove them?
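Both options can be sketched in a few lines. The example below uses made-up readings in which None marks a missing value, then either removes the missing entries or replaces them with the mean of the observed ones. Mean replacement is only one possible choice, assumed here for illustration.

```python
from statistics import mean

# Hypothetical readings; None marks a missing value.
readings = [4.0, None, 6.0, 8.0, None]

# Option 1: remove the missing values.
dropped = [x for x in readings if x is not None]

# Option 2: replace each missing value with the mean of the observed ones.
fill_value = mean(dropped)
filled = [fill_value if x is None else x for x in readings]

print(dropped)
print(filled)
```

Removing values shrinks the dataset, while replacing them keeps its size but introduces estimated values, so it helps to state in your project which approach you took and why.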
Ability to extract insight from data
There is no particular way to do this. It is going to be a combination of different methods, Exploratory Data Analysis and good old intuition to make sense of the data.
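A first pass at exploratory analysis often starts with simple summaries. The sketch below groups made-up order values by a hypothetical `category` field and computes the count, mean and median per group, using only the standard library.

```python
from collections import defaultdict
from statistics import mean, median

# Hypothetical records, invented for this example.
orders = [
    {"category": "books", "value": 12.0},
    {"category": "books", "value": 18.0},
    {"category": "games", "value": 40.0},
    {"category": "games", "value": 60.0},
    {"category": "games", "value": 20.0},
]

# Group the values by category.
groups = defaultdict(list)
for order in orders:
    groups[order["category"]].append(order["value"])

# Summarize each group with a count, mean and median.
summary = {
    category: {
        "count": len(values),
        "mean": mean(values),
        "median": median(values),
    }
    for category, values in groups.items()
}

for category, stats in summary.items():
    print(category, stats)
```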
Don’t try to do anything too complex to look like an ‘expert’. Remember to use comments in your code and, when possible, use functions. Your code should be simple and efficient.
You can carry out feature selection and engineering and then fit a model to the data.
Data presentation/Visualization
Don’t forget to visualize your data using the appropriate charts. There are self-service visualization tools that do not require any coding, such as Power BI or Tableau.
There are also visualization libraries in Python and R that you can use to visualize your data. For more tips on how to visualize your datasets, you can check our previous post: Simple Guide to Data Visualization.
Much of the work done with data is to draw insight from it. The insight drawn from data is then used to predict future events. To do this, we often have to build models with available data. These models can then be used to predict the future, all things being equal. This aspect is usually known as Machine Learning. Machine Learning is a broad area and is divided into two major areas: Supervised and Unsupervised.
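As a minimal sketch of the supervised case, the example below fits a straight line to a handful of made-up points with ordinary least squares and uses it to predict an unseen value. The data and the `fit_line` helper are assumptions for illustration; in practice a machine learning library would usually handle this step.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Sum of cross-deviations and sum of squared deviations of x.
    sum_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sum_xx = sum((x - x_mean) ** 2 for x in xs)
    slope = sum_xy / sum_xx
    intercept = y_mean - slope * x_mean
    return slope, intercept


# Hypothetical training data that follows y = 2x + 1 exactly.
hours = [1.0, 2.0, 3.0, 4.0]
score = [3.0, 5.0, 7.0, 9.0]

slope, intercept = fit_line(hours, score)

# Predict the label for an input the model has not seen.
prediction = slope * 5.0 + intercept
print(slope, intercept, prediction)
```

The known `score` values act as labels, which is what makes this supervised; an unsupervised method would look for structure in `hours` alone, without any labels.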
Machine Learning Libraries
An end-to-end data science project will often begin with a business problem, move on to obtaining and exploring data, and then to building and deploying machine learning models with which to carry out predictive analysis.
We end on a good note. Know that there are many resources for learning data science and machine learning. Some of my best Python lessons were taken on YouTube.
If you do decide to join ‘the sexiest job of the 21st century’, you should do it right.