Understanding data is a critical step in machine learning. Before developing and training a model, it's essential to gain insight into the data we're working with. The broader process involves Data Exploration, Data Visualization, Data Preprocessing, Target Variable Analysis, Correlation Analysis, Outlier Detection, and more.
We won't cover all of that here. My goal is simply to walk you through the basic steps I follow whenever I start working with a dataset for a machine learning problem.
What is Machine Learning Data
Machine learning data usually refers to the dataset, a collection of examples, used to train, validate, and test a model. It consists of input features (also called independent variables or predictors) and corresponding output values (also known as target or dependent variables), which are what the model aims to predict.
Whenever you work with a dataset, you should ask yourself the following basic questions.
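To make the distinction concrete, here is a minimal, purely illustrative sketch of how features and target are usually separated in pandas. The file name 'train.csv' and the target column 'Survived' come from the Titanic dataset introduced below; we load it properly in the next section.

import pandas as pd

# Load a dataset that contains both the input features and the target column
df = pd.read_csv('train.csv')

# Input features (independent variables / predictors): everything except the target
X = df.drop(columns=['Survived'])

# Target (dependent variable): the value the model will learn to predict
y = df['Survived']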
- How big is the data?
- What does the data look like?
- What are the data types of the columns?
- Are there any missing values?
- How does the data look mathematically?
- Are there any duplicate rows in the data?
- What is the correlation between the columns?
We will go through these steps one by one using an example: the famous Titanic dataset, which you can find on Kaggle or download from the link below. I am assuming you have some basic knowledge of pandas as well.
Download CSV here TITANIC
How Big is the Data
First things first, we will import the libraries we need and preview the data. We are working with the train.csv file only.
import pandas as pd
df = pd.read_csv('train.csv')
df.shape
(891, 12)
Here you want to know the shape of your data, i.e., how many rows and columns it contains. You may have to deal with a very large dataset, and you should know how many rows and columns you are working with before going any further.
What Does the Data Look Like
The second question is what the data looks like and what features it has. A quick preview answers both.
df.head()
This shows a preview of the top 5 rows by default. You can also use sample(), which returns random rows instead. Run the code below:
df.sample(5)
Honestly, I prefer the sample() method; it gives a better overall idea of the variety of data in the dataset.
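One small note: sample() returns different rows on every run. If you want the same random preview each time (for example in a shared notebook), you can pass a seed; random_state=42 below is just an arbitrary choice.

df.sample(5, random_state=42)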
What are the Data Types of Columns
Whenever you work with a dataset, you should know the data type of every column. Let's look at some information about the columns.
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 PassengerId 891 non-null int64
1 Survived 891 non-null int64
2 Pclass 891 non-null int64
3 Name 891 non-null object
4 Sex 891 non-null object
5 Age 714 non-null float64
6 SibSp 891 non-null int64
7 Parch 891 non-null int64
8 Ticket 891 non-null object
9 Fare 891 non-null float64
10 Cabin 204 non-null object
11 Embarked 889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
This shows how many non-null entries each column has, which already tells us whether any null values are present. It also shows the memory usage. The Dtype column on the right lists each column's data type (int64, float64, or object): some columns are numerical and others are categorical.
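If you want to separate the numerical columns from the categorical (object) ones programmatically rather than reading them off the info() output, select_dtypes is one convenient way to do it.

# Numerical columns (int64 and float64 in this dataset)
num_cols = df.select_dtypes(include='number').columns.tolist()

# Categorical / text columns stored with the object dtype
cat_cols = df.select_dtypes(include='object').columns.tolist()

print(num_cols)
print(cat_cols)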
We can also reduce memory usage by changing column types where the defaults are larger than needed. For example, Survived and Pclass only ever contain a handful of small integers, so they don't need full 64-bit integers, and Age doesn't need float64 precision (it can't simply become a plain int, though, because it contains missing values and a few fractional ages for infants). On a small dataset like this the savings won't make any difference, but on a big dataset choosing sensible dtypes can save a significant amount of memory.
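As a rough sketch of what that downcasting could look like (done on a copy here so the original DataFrame and the outputs shown later stay unchanged); the exact savings will vary:

# Memory footprint with the default dtypes
print(df.memory_usage(deep=True).sum())

df_small = df.copy()

# Survived is only 0/1 and Pclass is only 1/2/3, so int8 is plenty
df_small['Survived'] = df_small['Survived'].astype('int8')
df_small['Pclass'] = df_small['Pclass'].astype('int8')

# Age has NaNs and fractional values, so keep it a float, just a smaller one
df_small['Age'] = df_small['Age'].astype('float32')

# Memory footprint after downcasting
print(df_small.memory_usage(deep=True).sum())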
Are There Any Missing Values
This step is very important, as there may be many missing or null values in the data. You can already spot them in the info() output above, but there is a more direct way to count them.
df.isnull().sum()
PassengerId 0
Survived 0
Pclass 0
Name 0
Sex 0
Age 177
SibSp 0
Parch 0
Ticket 0
Fare 0
Cabin 687
Embarked 2
dtype: int64
As you can see, we now know up front how many missing values each column has: 177 in Age, 687 in Cabin, and 2 in Embarked.
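What you do about these gaps depends on the column and the problem. As one common, minimal approach (not the only one): fill Age with its median, fill Embarked with its most frequent value, and drop Cabin since roughly three quarters of it is missing. The sketch below works on a copy so the raw data stays untouched.

df_clean = df.copy()

# Numeric column: fill missing ages with the median age
df_clean['Age'] = df_clean['Age'].fillna(df_clean['Age'].median())

# Categorical column: fill the two missing ports with the most common value
df_clean['Embarked'] = df_clean['Embarked'].fillna(df_clean['Embarked'].mode()[0])

# Cabin is mostly empty, so drop it for now
df_clean = df_clean.drop(columns=['Cabin'])

df_clean.isnull().sum()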
How Does the Data Look Mathematically
Let's say we want to know the maximum and minimum value of each column, or the mean, the count, and so on. The describe() method gives all of this at once.
df.describe()
PassengerId Survived Pclass Age SibSp Parch Fare
count 891.000000 891.000000 891.000000 714.000000 891.000000 891.000000 891.000000
mean 446.000000 0.383838 2.308642 29.699118 0.523008 0.381594 32.204208
std 257.353842 0.486592 0.836071 14.526497 1.102743 0.806057 49.693429
min 1.000000 0.000000 1.000000 0.420000 0.000000 0.000000 0.000000
25% 223.500000 0.000000 2.000000 20.125000 0.000000 0.000000 7.910400
50% 446.000000 0.000000 3.000000 28.000000 0.000000 0.000000 14.454200
75% 668.500000 1.000000 3.000000 38.000000 1.000000 0.000000 31.000000
max 891.000000 1.000000 3.000000 80.000000 8.000000 6.000000 512.329200
As you can see, we get a summary of the main statistics: count, mean, standard deviation, min, max, and the quartiles. You can already derive some insights here; for example, 25% of passengers are aged 20.125 or less.
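describe() skips the text columns by default. If you also want a quick summary of the categorical columns (count, number of unique values, most frequent value and its frequency), you can ask for the object dtypes explicitly:

df.describe(include='object')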
Are There Any Duplicate Rows in the Data
The next question is whether there are any duplicate rows in the data. Let's check with the following code.
df.duplicated().sum()
0
If there were duplicate rows, we could simply drop them. Luckily, there are no duplicates here.
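For completeness, dropping duplicates is a one-liner (shown here only as a sketch, since in this dataset there is nothing to drop):

df = df.drop_duplicates()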
What is the Correlation Between Columns
Finding the correlation between columns is important, as it tells us how any two columns relate to each other: roughly speaking, whether a change in one tends to go together with a change in the other. Here we use the Pearson correlation coefficient, which takes values between -1 and 1.
df.corr(numeric_only = True)
PassengerId Survived Pclass Age SibSp Parch Fare
PassengerId 1.000000 -0.005007 -0.035144 0.036847 -0.057527 -0.001652 0.012658
Survived -0.005007 1.000000 -0.338481 -0.077221 -0.035322 0.081629 0.257307
Pclass -0.035144 -0.338481 1.000000 -0.369226 0.083081 0.018443 -0.549500
Age 0.036847 -0.077221 -0.369226 1.000000 -0.308247 -0.189119 0.096067
SibSp -0.057527 -0.035322 0.083081 -0.308247 1.000000 0.414838 0.159651
Parch -0.001652 0.081629 0.018443 -0.189119 0.414838 1.000000 0.216225
Fare 0.012658 0.257307 -0.549500 0.096067 0.159651 0.216225 1.000000
You can also check the correlation of one specific column against all the others with the code below.
df.corr(numeric_only = True)['Survived']
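A couple of optional follow-ups: sorting that column makes the strongest relationships with Survived easier to spot, and if you have seaborn and matplotlib installed (they are not used anywhere else in this article), a heatmap is a common way to visualise the whole matrix.

# Strongest positive relationships with Survived first
df.corr(numeric_only=True)['Survived'].sort_values(ascending=False)

# Optional visualisation, assuming seaborn and matplotlib are installed
import matplotlib.pyplot as plt
import seaborn as sns

sns.heatmap(df.corr(numeric_only=True), annot=True, cmap='coolwarm')
plt.show()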
That's it. By going through all these steps you can get a good understanding of your data.
Conclusion
By thoroughly understanding your data, you can make informed decisions during the modeling process, select appropriate algorithms, and improve the overall performance of your machine learning model. I hope you learned something; let me know in the comments below. Also check out my next article on the EDA process.