
Univariate analysis helps you understand the distribution, central tendency, dispersion, and other key characteristics of each variable in your dataset. Put simply, it is the analysis of a single column.

Univariate analysis is one of the fundamental techniques of EDA, alongside bivariate and multivariate analysis. We will plot some cool graphs here and use them to understand the dataset.

What is EDA in Machine Learning

EDA stands for Exploratory Data Analysis. It is an important first step in data analysis: we examine the dataset to gain insights and identify patterns. EDA helps uncover the underlying structure of the data and informs modeling decisions.

What is Univariate Analysis

As I stated above, it is the analysis of a single column, or variable, in a dataset. The word ‘uni’ means single, and ‘variate’ means variable. Before going further, we should know the types of data we deal with. There are usually two: numerical and categorical.

Numerical, e.g.: age, weight, number of students, count of passengers, etc.

Categorical, e.g.: gender, colors, grades, country, etc.

So the first question to ask in univariate analysis is which type of data a column holds: numerical or categorical.
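As a rough sketch of that first question, the snippet below guesses the type of each column from its dtype and its number of distinct values. The helper name `guess_column_type`, the `max_categories` threshold, and the tiny sample table are all assumptions made up for illustration; they are not real Titanic rows, and the threshold would need tuning on a real dataset.

```python
import pandas as pd

# Tiny made-up sample (NOT real Titanic rows), just to make the sketch runnable.
sample = pd.DataFrame({
    "Age": [22.0, 38.0, 26.0, 35.0, 54.0, 2.0],
    "Fare": [7.25, 71.28, 7.92, 53.10, 51.86, 21.07],
    "Sex": ["male", "female", "female", "female", "male", "male"],
    "Pclass": [3, 1, 3, 1, 1, 3],
})

def guess_column_type(series, max_categories=3):
    """Hypothetical helper: label a column 'categorical' or 'numerical'.

    Non-numeric columns are treated as categorical. Numeric columns with
    very few distinct values (such as a class code) are also treated as
    categorical. The threshold is an arbitrary assumption.
    """
    if not pd.api.types.is_numeric_dtype(series):
        return "categorical"
    if series.nunique() <= max_categories:
        return "categorical"
    return "numerical"

column_types = {name: guess_column_type(sample[name]) for name in sample.columns}
print(column_types)
```

A heuristic like this is only a starting point. A column such as a class code is stored as a number but behaves like a category, so it is worth checking each guess by eye.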

Let’s walk through the next steps of univariate analysis using our favorite example, the Titanic dataset. This is a very famous dataset that is commonly used to predict the survival of passengers.

Download dataset CSV here TITANIC

In this dataset, each column holds information about the passengers, and the Survived column shows whether a passenger survived or not.

Categorical Data

Here we will learn how to deal with categorical variables, if you have any. First, these are the columns you will find in the dataset:

  • PassengerId: ID of each passenger.
  • Survived: Whether the passenger survived or not (0 = no, 1 = yes).
  • Pclass: Ticket class: 1, 2, or 3.
  • Name: Name of each passenger.
  • Sex: Gender of each passenger.
  • Age: How old each passenger is.
  • SibSp: Number of siblings or spouses aboard with the passenger.
  • Parch: Number of parents or children aboard with the passenger.
  • Ticket: Ticket number.
  • Fare: Cost of the ticket.
  • Cabin: Cabin number. It is missing for many passengers.
  • Embarked: Port where each passenger boarded (C = Cherbourg, Q = Queenstown, S = Southampton).

Now we have to gain knowledge of each column independently using univariate analysis.

We should look only at the columns that can tell us something about the survival of passengers. The categorical columns I can find are Survived, Pclass, Sex, SibSp, Parch, Cabin, and Embarked. For demo purposes we will work on only a few of them: Survived and Pclass.

So we will draw a count plot to get the frequency of each category in the ‘Survived’ column, first using the plotting built into pandas.

import pandas as pd

df = pd.read_csv('train.csv')

#count plot
df['Survived'].value_counts().plot(kind = 'bar')

This shows two bars, one for each category. You can draw the same plot with seaborn:

import seaborn as sns

# one bar per category of 'Survived'
sns.countplot(data=df, x='Survived')

So you get the idea: whenever you have a categorical column, try a count plot on it. You can derive quick insights from count plots. In the case of ‘Sex’, there were more male passengers than female. In the case of ‘Pclass’, you can see which class carried the most passengers, and computing a mean for each class (of ‘Survived’, for example) can suggest why people preferred a specific class.

If you want the same information as percentages, you can use a pie chart as well.

df['Survived'].value_counts().plot(kind = 'pie', autopct='%.2f')

As you can see, about 62% of the passengers died and only about 38% survived.
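If you want the percentages as numbers rather than reading them off the chart, `value_counts(normalize=True)` gives the share of each category. The sketch below uses a small made-up column (not the real ‘Survived’ data) so that it runs on its own; on the real dataset you would call the same methods on `df['Survived']`.

```python
import pandas as pd

# Made-up outcome column (NOT the real data): five 0s and three 1s.
survived = pd.Series([0, 0, 1, 0, 1, 0, 1, 0], name="Survived")

counts = survived.value_counts()                            # absolute frequencies
percentages = survived.value_counts(normalize=True) * 100   # shares in percent

# Put both views side by side in one small table.
summary = pd.DataFrame({"count": counts, "percent": percentages.round(2)})
print(summary)
```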

Numerical Data

Now let’s see how to plot graphs for numerical columns.

Histogram: It displays the frequency or count of data points falling within specified intervals or bins.

import matplotlib.pyplot as plt
# 'Age' has missing values, so drop them before binning
plt.hist(df['Age'].dropna())
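To see what a histogram actually counts, here is a small sketch that computes the bin counts directly with `numpy.histogram`. The ages and the 20-year bin edges are made up for illustration and are not taken from the Titanic data.

```python
import numpy as np
import pandas as pd

# Made-up ages (NOT real passenger data); one value is missing on purpose.
ages = pd.Series([2, 15, 22, 24, 29, 31, 35, 38, 47, 62, np.nan])

# Drop the missing value, then count how many ages fall into each 20-year bin.
bin_edges = [0, 20, 40, 60, 80]
counts, edges = np.histogram(ages.dropna(), bins=bin_edges)

# Each bar of the histogram is one of these counts.
for left, right, n in zip(edges[:-1], edges[1:], counts):
    print(f"{left}-{right}: {n}")
```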

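You can plot the same histogram with a KDE (kernel density estimate) curve drawn on top. By default the Y axis still shows counts; pass stat='density' to histplot if you want a density scale instead.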

sns.histplot(df['Age'], kde=True)

Boxplot: It shows the median and quartiles, whiskers that reach toward the minimum and maximum, and any potential outliers as separate points. Box plots provide a concise summary of the data’s central tendency, spread, and skewness.

# pass the column as x to draw a horizontal box plot
sns.boxplot(x=df['Age'])
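The numbers behind a box plot can be computed by hand. The sketch below finds the quartiles and flags outliers with the common 1.5 × IQR rule, which is also the default whisker length in seaborn. The ages are made up for illustration, with one deliberately extreme value.

```python
import pandas as pd

# Made-up ages (NOT real passenger data), with one deliberately extreme value.
ages = pd.Series([18, 21, 22, 24, 25, 27, 29, 30, 33, 80])

q1 = ages.quantile(0.25)      # lower edge of the box
median = ages.quantile(0.50)  # line inside the box
q3 = ages.quantile(0.75)      # upper edge of the box
iqr = q3 - q1                 # interquartile range (height of the box)

# Points beyond 1.5 * IQR from the box are drawn as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = ages[(ages < lower_fence) | (ages > upper_fence)]

print(f"Q1={q1}, median={median}, Q3={q3}, IQR={iqr}")
print("Outliers:", outliers.tolist())
```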

You can also calculate the min, max, mean, skewness, and so on.

df['Age'].min()
df['Age'].max()
df['Age'].mean()
df['Age'].skew()
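If you would rather get these statistics in one go, `describe()` returns most of them at once, and you can add skewness and the number of missing values yourself. This is a minimal sketch on made-up ages (not real passenger data); the extra `skew` and `missing` rows are my own additions, not part of what `describe()` returns.

```python
import numpy as np
import pandas as pd

# Made-up ages (NOT real passenger data); one value is missing on purpose.
ages = pd.Series([22, 38, 26, 35, np.nan, 54, 2, 27], name="Age")

summary = ages.describe()               # count, mean, std, min, quartiles, max
summary["skew"] = ages.skew()           # asymmetry of the distribution
summary["missing"] = ages.isna().sum()  # how many values are absent
print(summary)
```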

Conclusion

By applying univariate analysis, you gain insights into each variable’s characteristics and can identify potential issues or patterns in the data. Comment below if you have any questions about this topic.

By Akshay Tekam

software developer, Data science enthusiast, content creator.
