Tools and skills you need in Data Science


There has been a debate on whether Data Science is the ‘sexiest job of the 21st century’ or not. We will not be taking part in that debate today. Rather, for those looking to get a start in data science and, by extension, machine learning, or if you are just looking to get ahead in the field, we have curated the tools you need as a data scientist.

These tools are broken down into categories for convenience (more mine than yours). We will also look at the basic skills a data scientist should possess and how to build an amazing Data Science portfolio.

That sounds like a lot, so let’s get started.

DATABASES

  • Hierarchical databases
  • Network databases
  • Object-oriented databases
  • Relational databases
  • NoSQL databases

The most common type of database in use, and our focus here, is the relational database.

Structured Query Language (SQL) is the standard language for communicating with a relational database, and it is the language that Database Management Systems use.

A Database Management System (DBMS) is used to interact with the database through queries. A DBMS can be used to create, access, organize and modify databases and to extract information from them. Popular DBMSs include MySQL, PostgreSQL, Microsoft SQL Server and Oracle.

SQL is fairly easy to learn once you understand its syntax and query format. Note that the syntax differs slightly across the various DBMSs. The differences are usually small, though, so moving from one DBMS to another should not pose any serious problems.
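To give a feel for the query format, here is a minimal sketch using Python's built-in sqlite3 module. SQLite is not one of the DBMSs listed above; I am assuming it here only because it runs without a server. The orders table, the sample rows and the top_customers function are all made up for illustration.

```python
import sqlite3


def top_customers(rows, min_total):
    """Load (customer, amount) rows into an in-memory database and return
    the customers whose summed orders reach min_total, largest first."""
    conn = sqlite3.connect(":memory:")  # throwaway database, nothing saved to disk
    try:
        # CREATE: define a table (hypothetical schema, for illustration only)
        conn.execute(
            "CREATE TABLE orders (customer TEXT NOT NULL, amount REAL NOT NULL)"
        )
        # INSERT: add the rows, using ? placeholders instead of string formatting
        conn.executemany(
            "INSERT INTO orders (customer, amount) VALUES (?, ?)", rows
        )
        # SELECT: aggregate per customer, filter the groups, sort the result
        cursor = conn.execute(
            """
            SELECT customer, SUM(amount) AS total
            FROM orders
            GROUP BY customer
            HAVING SUM(amount) >= ?
            ORDER BY total DESC
            """,
            (min_total,),
        )
        return cursor.fetchall()
    finally:
        conn.close()


# Made-up sample data
sample_orders = [("Ada", 120.0), ("Bola", 40.0), ("Ada", 30.0), ("Chidi", 75.0)]
print(top_customers(sample_orders, 70))
```

The SELECT, GROUP BY, HAVING and ORDER BY clauses used here are standard SQL, so the same query shape should carry over to other DBMSs with little or no change.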

LANGUAGES

Whichever you choose, both R and Python have large communities you can rely on for help. For me, that’s a huge factor: getting help.

The R programming language is very good for statistics and computational analysis, especially when you are working with large datasets. It also has different data types such as scalars, vectors, matrices, data frames, arrays and lists. R has numerous packages for working with data, from pre-processing to data analysis and machine learning.

Integrated Development Environments (IDEs) for running and editing R code include:

  • RStudio
  • Rattle
  • StatET for R
  • ESS etc

Python, on the other hand, is a general-purpose language. When I started learning Python, it was not because of data science, and I cannot even remember what it was for. Back then I just wanted to learn programming because it was the ‘it’ thing and I had heard Python was easy to learn (I hope some of you can relate). And yes, it is easy to learn the basics of Python. Python also has libraries that are very useful when working with data, and it has gained much ground in machine learning and AI.

Python IDEs include:

  • Jupyter Notebook
  • Spyder
  • PyCharm

But do not be fooled by the back and forth between R and Python. There are other programming languages that can be used for data analysis. I’ve discussed this in a previous post, along with the factors you should consider when deciding which language you want to work with.

BIG DATA TOOLS

Big Data can be structured or unstructured and can include different data types.

Big Data tools include:

  • Hadoop
  • Data Lakes
  • Spark
  • Cassandra
  • MongoDB

CLOUD

The three main types of cloud computing are:

  • Infrastructure as a Service
  • Platform as a Service
  • Software as a Service

The advantage of using these cloud services is that users can work with large amounts of data without having to worry about the computational power of their devices and they also do not need to maintain their own servers.

Cloud computing services also offer low-code or no-code environments for deploying machine learning models.

These Cloud services promise:

  • Scalability and flexibility
  • Data protection
  • Data accessibility at any time from anywhere
  • Cost efficiency (check prices for cloud service of choice)
  • Networking capabilities

HOW TO ‘LEARN’ DATA SCIENCE

How long this takes is not easy to say because people learn at different paces. I will, however, give a rough timeline.

If you do choose to go down the Python path, this can take you anywhere between a week and a month. Python is quite easy to learn, especially because you are not expected to learn all parts of Python. The areas you should focus on are:

  • Python data types and structures (integers, floats, strings, lists, dictionaries)
  • Using functions in Python
  • Accessing web data with Python
  • Python libraries such as NumPy, pandas and SciPy that can be used to manipulate data
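As a rough idea of what the first few areas look like in practice, here is a small sketch that uses only the standard library. The JSON string is a made-up stand-in for what a web API might return, and the names payload, average and summarise are my own, not from any course.

```python
import json

# Hypothetical JSON payload, standing in for a response fetched from the web
payload = (
    '[{"name": "Ada", "scores": [82, 91, 77]},'
    ' {"name": "Bola", "scores": [65, 70]}]'
)


def average(values):
    """Return the arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


def summarise(raw_json):
    """Parse the JSON text and map each name to that person's average score."""
    records = json.loads(raw_json)  # a list of dictionaries
    return {record["name"]: average(record["scores"]) for record in records}


averages = summarise(payload)
for name, avg in averages.items():
    print(f"{name}: {avg:.1f}")
```

In a few lines this touches strings, lists, dictionaries, floats and functions, which covers most of what you need before moving on to the data libraries.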

If you dedicate 3–4 hours a week, you should be able to learn all these. I will not recommend any particular course because there are so many good ones available — paid and free.

Next you should learn SQL, which is quite easy. As I said above, once you get the syntax, you can spend 2–3 weeks learning SQL, or even less. Remember, the syntax may vary across DBMSs but the structure remains the same.

Even more than learning all these, I would advise you to take a refresher course or two on statistics and Linear Algebra.

I find that a lot of times, people focus more on playing with the ‘big tools’ and don’t bother learning the basics. While there are libraries which can handle these aspects for us, it is still good to understand the basics.
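To illustrate what I mean by understanding the basics, here is a small sketch that computes a mean and a sample standard deviation by hand and then checks them against Python's built-in statistics module. The numbers are made up, and the function names are my own.

```python
import math
import statistics


def mean(values):
    """Arithmetic mean: the sum of the values divided by how many there are."""
    return sum(values) / len(values)


def sample_std(values):
    """Sample standard deviation, dividing by n - 1 rather than n."""
    centre = mean(values)
    squared_deviations = sum((value - centre) ** 2 for value in values)
    return math.sqrt(squared_deviations / (len(values) - 1))


data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # made-up measurements

# The hand-written versions should agree with the library versions
print(mean(data), statistics.mean(data))
print(sample_std(data), statistics.stdev(data))
```

Writing the formula out once makes it much clearer what the library call is doing for you, for example why the sample version divides by n - 1.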

The same approach works with R. R can prove daunting to beginners, but once you get the hang of it, I believe it becomes much easier.

Most importantly, I would advise you to start working on projects as soon as possible. Working on projects will allow you to put what you learn into practice. They say the best way to learn is by doing, right?

DATA SCIENCE PORTFOLIO

No, I do not expect you to have one of those clear folder bags filled with data science projects. The easiest way to build a portfolio as a data scientist is to have a blog or a GitHub account. They both offer different benefits but with a GitHub account, you do not have to bother yourself with maintaining a blog.

When uploading projects, you should make sure to include a summary of the project or some documentation. On GitHub, you upload a README explaining the use of your project, how it can be set up (if need be) and any extra work that can be done. Another easy way is to contribute to repositories. You can also set up your GitHub account so that it better reflects your work.

However you choose to showcase your work, here are some basic skills you should highlight in your work:

Data cleaning ability

You should also show how you deal with missing data, as missing data is often something we cannot avoid in datasets. Do you replace the missing values or flat out remove them?
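Both options can be sketched in a few lines of plain Python. I am assuming here that missing values are stored as None and that the column mean is a sensible replacement, which will not be true for every dataset. The ages list is made up.

```python
def drop_missing(values):
    """Option 1: remove the missing entries (represented here as None)."""
    return [value for value in values if value is not None]


def fill_with_mean(values):
    """Option 2: replace each missing entry with the mean of the observed ones."""
    observed = drop_missing(values)
    if not observed:
        raise ValueError("cannot impute: every value is missing")
    fill_value = sum(observed) / len(observed)
    return [fill_value if value is None else value for value in values]


ages = [23, None, 31, 27, None]  # made-up column with two missing entries

print(drop_missing(ages))    # shorter list, missing rows removed
print(fill_with_mean(ages))  # same length, gaps filled with the mean
```

Dropping keeps only real observations but shrinks the dataset, while filling keeps every row at the cost of inventing values. Whichever you pick, say why in your project.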

Ability to extract insight from data

Don’t try to do anything too complex just to look like an ‘expert’. Remember to use comments in your code and, when possible, use functions. Your code should be simple and efficient.

You can carry out feature selection and engineering and then fit a model to the data.
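Here is a tiny sketch of that idea, using made-up house data. It engineers one feature (area, from length and width) and fits a straight line with ordinary least squares written by hand. In a real project you would more likely reach for a library, so treat the fit_line function and the numbers as illustrative assumptions only.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Spread of x, and how x and y vary together
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up rows: (length in metres, width in metres, price)
houses = [(10, 5, 100), (12, 5, 120), (8, 10, 160), (10, 10, 200)]

# Feature engineering: combine two raw columns into one more useful feature
areas = [length * width for length, width, _ in houses]
prices = [price for _, _, price in houses]

slope, intercept = fit_line(areas, prices)
print(f"price is roughly {slope:.2f} * area + {intercept:.2f}")
```

Neither length nor width alone lines up with price in this toy data, but the engineered area feature does, which is the whole point of feature engineering.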

Data presentation/ Visualization

There are also visualization libraries in Python and R that you can use to visualize your data. For more tips on how to visualize your datasets, you can check our previous post: Simple Guide to Data Visualization.

Predictive Analysis

Machine Learning Libraries

  • TensorFlow
  • Keras
  • PyTorch

CONCLUSION

We end on a good note. Know that there are many resources for learning data science and machine learning. Some of my best Python lessons were taken on YouTube.

If you do decide to join ‘the sexiest job of the 21st century’, you should do it right.

Data Science Nigeria PH community. Data Science and Artificial Intelligence
