Understanding data is a critical step in machine learning. Before developing and training a model, it's essential to gain insight into the data we're working with. The broader process involves Data Exploration, Data Visualization, Data Preprocessing, Target Variable Analysis, Correlation Analysis, Outlier Detection, and more.
We won't cover all of that here. My goal is simply to walk you through the basic steps I follow whenever I start working with a dataset for a machine learning problem.
What is Machine Learning Data
Machine learning data usually refers to the dataset, a collection of examples, used to train, validate, and test a model. It consists of input features (also called independent variables or predictors) and corresponding output values (also known as target or dependent variables), which are what the model aims to predict.
Whenever you work with a dataset, you should ask yourself the following basic questions.
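To make the distinction concrete, here is a minimal, purely illustrative sketch of how features and target are usually separated in pandas. The file name 'train.csv' and the target column 'Survived' come from the Titanic dataset introduced below; we load it properly in the next section.

import pandas as pd

# Load a dataset that contains both the input features and the target column
df = pd.read_csv('train.csv')

# Input features (independent variables / predictors): everything except the target
X = df.drop(columns=['Survived'])

# Target (dependent variable): the value the model will learn to predict
y = df['Survived']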
- How big is the data?
- What does the data look like?
- What are the data types of the columns?
- Are there any missing values?
- How does the data look mathematically?
- Are there any duplicate rows in the data?
- What is the correlation between the columns?
We will go through these steps one by one using an example: the famous Titanic dataset, which you can find on Kaggle or download from the link below. I am assuming you have some basic knowledge of pandas as well.
Download CSV here TITANIC
How Big is the Data
First things first, we will import the libraries we need and preview the data. We are working with the train.csv file only.
import pandas as pd
df = pd.read_csv('train.csv')
df.shape
(891, 12)
Here you want to know the shape of your data, i.e., how many rows and columns it contains. You may have to deal with a very large dataset, and you should know how many rows and columns you are working with before going any further.
What Does the Data Look Like
The second question is what the data looks like and what features it has. A quick preview answers both.
df.head()
This shows a preview of the top 5 rows by default. You can also use sample(), which returns random rows instead. Run the code below:
df.sample(5)
Honestly, I prefer the sample() method; it gives a better overall idea of the variety of data in the dataset.
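One small note: sample() returns different rows on every run. If you want the same random preview each time (for example in a shared notebook), you can pass a seed; random_state=42 below is just an arbitrary choice.

df.sample(5, random_state=42)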
What are the Data Types of Columns
Whenever you work with a dataset, you should know the data type of every column. Let's look at some information about the columns.
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 PassengerId 891 non-null int64
1 Survived 891 non-null int64
2 Pclass 891 non-null int64
3 Name 891 non-null object
4 Sex 891 non-null object
5 Age 714 non-null float64
6 SibSp 891 non-null int64
7 Parch 891 non-null int64
8 Ticket 891 non-null object
9 Fare 891 non-null float64
10 Cabin 204 non-null object
11 Embarked 889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
This shows how many non-null entries each column has, which already tells us whether any null values are present. It also shows the memory usage. The Dtype column on the right lists each column's data type (int64, float64, or object): some columns are numerical and others are categorical.
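If you want to separate the numerical columns from the categorical (object) ones programmatically rather than reading them off the info() output, select_dtypes is one convenient way to do it.

# Numerical columns (int64 and float64 in this dataset)
num_cols = df.select_dtypes(include='number').columns.tolist()

# Categorical / text columns stored with the object dtype
cat_cols = df.select_dtypes(include='object').columns.tolist()

print(num_cols)
print(cat_cols)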
We can also reduce memory usage by changing column types where the defaults are larger than needed. For example, Survived and Pclass only ever contain a handful of small integers, so they don't need full 64-bit integers, and Age doesn't need float64 precision (it can't simply become a plain int, though, because it contains missing values and a few fractional ages for infants). On a small dataset like this the savings won't make any difference, but on a big dataset choosing sensible dtypes can save a significant amount of memory.
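As a rough sketch of what that downcasting could look like (done on a copy here so the original DataFrame and the outputs shown later stay unchanged); the exact savings will vary:

# Memory footprint with the default dtypes
print(df.memory_usage(deep=True).sum())

df_small = df.copy()

# Survived is only 0/1 and Pclass is only 1/2/3, so int8 is plenty
df_small['Survived'] = df_small['Survived'].astype('int8')
df_small['Pclass'] = df_small['Pclass'].astype('int8')

# Age has NaNs and fractional values, so keep it a float, just a smaller one
df_small['Age'] = df_small['Age'].astype('float32')

# Memory footprint after downcasting
print(df_small.memory_usage(deep=True).sum())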
Are There Any Missing Values
This step is very important, as there may be many missing or null values in the data. You can already spot them in the info() output above, but there is a more direct way to count them.
df.isnull().sum()
PassengerId 0
Survived 0
Pclass 0
Name 0
Sex 0
Age 177
SibSp 0
Parch 0
Ticket 0
Fare 0
Cabin 687
Embarked 2
dtype: int64
As you can see, we now know up front how many missing values each column has: 177 in Age, 687 in Cabin, and 2 in Embarked.
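What you do about these gaps depends on the column and the problem. As one common, minimal approach (not the only one): fill Age with its median, fill Embarked with its most frequent value, and drop Cabin since roughly three quarters of it is missing. The sketch below works on a copy so the raw data stays untouched.

df_clean = df.copy()

# Numeric column: fill missing ages with the median age
df_clean['Age'] = df_clean['Age'].fillna(df_clean['Age'].median())

# Categorical column: fill the two missing ports with the most common value
df_clean['Embarked'] = df_clean['Embarked'].fillna(df_clean['Embarked'].mode()[0])

# Cabin is mostly empty, so drop it for now
df_clean = df_clean.drop(columns=['Cabin'])

df_clean.isnull().sum()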
How Does the Data Look Mathematically
Let's say we want to know the maximum and minimum value of each column, or the mean, the count, and so on. The describe() method gives all of this at once.
df.describe()
PassengerId Survived Pclass Age SibSp Parch Fare
count 891.000000 891.000000 891.000000 714.000000 891.000000 891.000000 891.000000
mean 446.000000 0.383838 2.308642 29.699118 0.523008 0.381594 32.204208
std 257.353842 0.486592 0.836071 14.526497 1.102743 0.806057 49.693429
min 1.000000 0.000000 1.000000 0.420000 0.000000 0.000000 0.000000
25% 223.500000 0.000000 2.000000 20.125000 0.000000 0.000000 7.910400
50% 446.000000 0.000000 3.000000 28.000000 0.000000 0.000000 14.454200
75% 668.500000 1.000000 3.000000 38.000000 1.000000 0.000000 31.000000
max 891.000000 1.000000 3.000000 80.000000 8.000000 6.000000 512.329200
As you can see, we get a summary of the main statistics: count, mean, standard deviation, min, max, and the quartiles. You can already derive some insights here; for example, 25% of passengers are aged 20.125 or less.
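describe() skips the text columns by default. If you also want a quick summary of the categorical columns (count, number of unique values, most frequent value and its frequency), you can ask for the object dtypes explicitly:

df.describe(include='object')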
Are There Any Duplicate Rows in the Data
The next question is whether there are any duplicate rows in the data. Let's check with the following code.
df.duplicated().sum()
0
If there were duplicate rows, we could simply drop them. Luckily, there are no duplicates here.
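For completeness, dropping duplicates is a one-liner (shown here only as a sketch, since in this dataset there is nothing to drop):

df = df.drop_duplicates()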
What is the Correlation Between Columns
Finding the correlation between columns is important, as it tells us how any two columns relate to each other: roughly speaking, whether a change in one tends to go together with a change in the other. Here we use the Pearson correlation coefficient, which takes values between -1 and 1.
df.corr(numeric_only = True)
PassengerId Survived Pclass Age SibSp Parch Fare
PassengerId 1.000000 -0.005007 -0.035144 0.036847 -0.057527 -0.001652 0.012658
Survived -0.005007 1.000000 -0.338481 -0.077221 -0.035322 0.081629 0.257307
Pclass -0.035144 -0.338481 1.000000 -0.369226 0.083081 0.018443 -0.549500
Age 0.036847 -0.077221 -0.369226 1.000000 -0.308247 -0.189119 0.096067
SibSp -0.057527 -0.035322 0.083081 -0.308247 1.000000 0.414838 0.159651
Parch -0.001652 0.081629 0.018443 -0.189119 0.414838 1.000000 0.216225
Fare 0.012658 0.257307 -0.549500 0.096067 0.159651 0.216225 1.000000
You can also check the correlation of one specific column against all the others with the code below.
df.corr(numeric_only = True)['Survived']
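A couple of optional follow-ups: sorting that column makes the strongest relationships with Survived easier to spot, and if you have seaborn and matplotlib installed (they are not used anywhere else in this article), a heatmap is a common way to visualise the whole matrix.

# Strongest positive relationships with Survived first
df.corr(numeric_only=True)['Survived'].sort_values(ascending=False)

# Optional visualisation, assuming seaborn and matplotlib are installed
import matplotlib.pyplot as plt
import seaborn as sns

sns.heatmap(df.corr(numeric_only=True), annot=True, cmap='coolwarm')
plt.show()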
That's it. By going through all these steps you can get a good understanding of your data.
Conclusion
By thoroughly understanding your data, you can make informed decisions during the modeling process, select appropriate algorithms, and improve the overall performance of your machine learning model. I hope you learned something; let me know in the comments below. Also check out my next article on the EDA process.