Biva

These techniques allow you to identify relationships, dependencies, and interactions between variables in your dataset. We have already seen univariate analysis in previous article. Here we will explore more on bivariate and multivariate analysis using different plots and draw relations between them.

Hello Everyone, we are on the journey to do EDA, and by plotting various graphs we can perform some analysis. we will see all those one by one.

Download dataset CSV files here.

What is EDA in Machine Learning

It stands for Exploratory Data Analysis. It is an important initial step that we do in data analysis that involves examining and understanding the dataset to gain insights, identify patterns. EDA helps in uncovering the underlying structure of the data and informing modeling decisions.

Before going further we should know types of data we deal with. There are usually two types of data we work with numerical and categorical.

Numerical e.g. : Age, weight, number of students, count of passengers, etc.

Categorical e.g. : Gender, Colors, Grades, country, etc.

For demonstrating my points I will take help of different datasets here like, titanic, flight, iris, tips.

What is Bivariate Analysis

Bivariate analysis includes plotting scatter plot for visualize the relationship between two numerical variables, correlation analysis to calculate linear relationship between two numerical variables, and sometimes box plot as well.

What is Multivariate Analysis

Multivariate analysis includes Heatmaps and Pair Plots to visualize the correlations between multiple numerical variables. Dimensionality reduction of the dataset helps visualize the relationships between variables in a lower-dimensional space which helps in identifying clusters or patterns within the data.

Import all necessary libraries and read datasets.

import pandas as pd
import seaborn as sns

This tips dataset loaded from seaborn.

tips = sns.load_dataset('tips')

Preview of tips

tips.sample(5)
	total_bill	tip	sex	smoker	day	time	size
194	16.58	       4.00	Male	Yes	Thur	Lunch	2
171	15.81	       3.16	Male	Yes	Sat	Dinner	2
181	23.33	       5.65	Male	Yes	Sun	Dinner	2
238	35.83	       4.67	Female	No	Sat	Dinner	3
9	14.78	       3.23	Male	No	Sun	Dinner	2

Read titanic, flight, and iris dataset as well and preview it.

titanic = pd.read_csv('train.csv')
flights = sns.load_dataset('flights')
flights.head()
iris = sns.load_dataset('iris')
iris

1. Scatterplot (Numerical – Numerical)

For tips data This is a good example of multivariate analysis, as here we are going to find some linear relationship between two variables. On scatter plot we usually keep categorical variables on x axis and numerical on y axis. For this particular example, we are having both numerical columns x and y.

sns.scatterplot(x=tips['total_bill'], y=tips['tip'])

You can extent it to multivariate by passing more parameters of variable like below.

sns.scatterplot(x=tips['total_bill'], y=tips['tip'], hue=tips['sex'], style=tips['smoker'], size=tips['size'])

2. Bar Plot (Numerical – Categorical)

Let’s use titanic dataset for this bar plot and draw relation.

sns.barplot(x=titanic['Pclass'], y=titanic['Age'], hue=titanic['Sex'])

3. Box Plot (Numerical – Categorical)

Here we are plotting a graph between male and female where you will find some outliers as well.

sns.boxplot(x=titanic['Sex'], y=titanic['Age'])

Extend it to,

sns.boxplot(x=titanic['Sex'], y=titanic['Age'],hue=titanic['Survived'])

Here you can see male with young age survived more, and in case of female with older age survived more.

4. histplot (Numerical – Categorical)

sns.histplot(titanic[titanic['Survived']==0]['Age'], kde=True)
sns.histplot(titanic[titanic['Survived']==1]['Age'], kde=True)

As you can see in above graph that they tried to save young kids, that’s why survival probability of young kids are little more compare to older ones.

5. HeatMap (Categorical – Categorical)

Let’s say if we wanna find out in each Pclass how many people survived and how many did not.

sns.heatmap(pd.crosstab(titanic['Pclass'],titanic['Survived']))

The below code is calculation percentage of survival of each class.

(titanic.groupby(['Embarked']).mean()['Survived']*100)

6. ClusterMap (Categorical – Categorical)

sns.clustermap(pd.crosstab(titanic['Parch'],titanic['Survived']))

7. Pairplot

sns.pairplot(iris,hue='species')

It shows color encoded plots.

8. Lineplot (Numerical – Numerical)

may_flights = flights.query("month == 'May'")
sns.lineplot(data=may_flights, x="year", y="passengers")

Here it shows every year in month of May how many passengers took flight.

Full code is here.

Conclusion

These are the basic tools you can use to do EDA process. I hope you must learn something. These analysis help in understanding the complexity and relationships within the data, which can guide feature selection, model building. Please check out the whole notebook for full code.

By Akshay Tekam

software developer, Data science enthusiast, content creator.

Leave a Reply

Your email address will not be published. Required fields are marked *