data analysis

“Data is the new oil” is a famous saying nowadays. So how can we use this data to solve our business problems?

We can draw useful insights from the data and use them to improve the business and solve its problems. So again, how can we get useful insights from data?

Simple: by analyzing the data. In this article, we discuss the essential statistical data analysis techniques in Machine Learning.

Data Analysis

In Machine Learning, Data Analysis is the process of inspecting, cleansing, transforming, and modeling data with the goal of discovering useful information, informing conclusions, and supporting decision making. It is used in many interdisciplinary fields such as Artificial Intelligence, Pattern Recognition, Neural Networks, etc.

In a Machine Learning pipeline, these data analysis steps come under the data preparation process.

Once you have the data available, you should work through the following basic checks on the dataset.

  1. Load data from a file and read it
  2. Preview the data
  3. Check the shape of the data
  4. Understand basic information about the data (info)
  5. Describe summary statistics of the numerical columns
  6. Check if there are any null values
  7. Check for duplicate values
  8. Check the value counts of each value in the columns
  9. Check the percentage of each category in a column using visualization
  10. Detect outliers using plots

Load Data From File and Read It

import pandas as pd

file_path = "data.csv"
df = pd.read_csv(file_path)
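
If you do not have a data.csv file at hand, a small in-memory dataset is enough to try the steps in this article. The sketch below is only an illustration: the CSV text, its column names, and the name sample_df are made up here and are not part of any real dataset. It relies on the fact that read_csv() accepts a file-like object as well as a file path.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for "data.csv" (made-up columns and values).
csv_text = """Name,Age,Salary,Department
Asha,25,30000,Sales
Ravi,32,45000,IT
Meera,,52000,IT
Ravi,32,45000,IT
"""

# read_csv() takes a path or a file-like object, so StringIO works here.
sample_df = pd.read_csv(io.StringIO(csv_text))
print(sample_df)
```

The empty Age field in the third data row is read as a missing value (NaN), and the last row repeats the second one, so this small frame can be used to try the null and duplicate checks as well.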

Show Preview of the Data

To preview the data in a pandas DataFrame, you can use the head() method. It displays the first five rows of the DataFrame by default, giving you a quick overview of the data. Alternatively, you can use the sample() method, which returns randomly selected rows from the dataset.

print(df.head())
print(df.sample(5))
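
Both methods take the number of rows to return. As a minimal sketch on made-up data (the sample_df frame and its values are assumptions for illustration), the example below also uses tail() to look at the end of the data and passes random_state to sample() so that the random preview is reproducible.

```python
import pandas as pd

# Hypothetical data for illustration only.
sample_df = pd.DataFrame({
    "Age": list(range(20, 30)),
    "Salary": list(range(30000, 40000, 1000)),
})

first_three = sample_df.head(3)   # first 3 rows
last_three = sample_df.tail(3)    # last 3 rows
# A fixed random_state makes sample() return the same rows on every run.
random_two = sample_df.sample(2, random_state=42)

print(first_three)
print(last_three)
print(random_two)
```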

Check Shape of Data

To check the shape of a pandas DataFrame, you can use the shape attribute. The shape attribute returns a tuple containing the number of rows and columns in the DataFrame.

print(df.shape)

Understand Basic Information About the Data

To understand basic information about the data in a pandas DataFrame, you can use the info() method. The info() method provides a concise summary of the DataFrame, including the data types of each column, the number of non-null values, and memory usage.

# info() prints the summary itself and returns None, so no print() is needed
df.info()

Describe Summary Statistics of Numerical Columns

To describe the summary statistics of numerical columns in a pandas DataFrame, you can use the describe() method. The describe() method provides descriptive statistics for each numeric column in the DataFrame, including count, mean, standard deviation, minimum, quartiles, and maximum values.

print(df.describe())
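
By default, describe() skips non-numeric columns when the DataFrame has numeric ones. A minimal sketch of how to include them, assuming a made-up frame with one numeric and one text column: passing include="all" adds the number of unique values, the most frequent value (top), and its frequency (freq) for the text column.

```python
import pandas as pd

# Hypothetical data for illustration only.
sample_df = pd.DataFrame({
    "Age": [25, 32, 41, 32],
    "Department": ["Sales", "IT", "IT", "HR"],
})

# Numeric columns only (the default behaviour for a mixed DataFrame).
numeric_summary = sample_df.describe()

# All columns: statistics that do not apply to a column are shown as NaN.
full_summary = sample_df.describe(include="all")

print(numeric_summary)
print(full_summary)
```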

Check if There Are Any Null Values

To check if there are any null values in a pandas DataFrame, you can use the isnull() method. The isnull() method returns a DataFrame of the same shape as the original one, where each element is a Boolean value indicating whether it is a null value (True) or not (False). Chaining the any() method then reports, for each column, whether it contains at least one null value. Chaining the sum() method instead gives the number of null values in each column.

print(df.isnull().any())

Or

print(df.isnull().sum())
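
Raw counts are easier to judge as a share of the column. The sketch below, on made-up data, computes the percentage of null values per column and selects the rows that contain at least one null value. The sample_df frame is an assumption for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data for illustration only.
sample_df = pd.DataFrame({
    "Age": [25, np.nan, 41, 32],
    "Salary": [30000, 45000, np.nan, np.nan],
})

# The mean of a Boolean column is the fraction of True values.
null_percent = sample_df.isnull().mean() * 100
print(null_percent)

# any(axis=1) checks across the columns of each row.
rows_with_nulls = sample_df[sample_df.isnull().any(axis=1)]
print(rows_with_nulls)
```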

Check for Duplicate Values

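To check for duplicate values in a pandas DataFrame, you can use the duplicated() method. It returns a Boolean Series that marks every row that repeats an earlier row. Chaining any() tells you whether any duplicate rows exist, and chaining sum() counts them.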

print(df.duplicated().any())

Or

print(df.duplicated().sum())
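
Once you know duplicates exist, you usually want to look at them and decide whether to drop them. A minimal sketch on made-up data (the sample_df frame is an assumption): keep=False marks every copy of a repeated row, not only the later ones, and drop_duplicates() returns a new DataFrame with the repeats removed.

```python
import pandas as pd

# Hypothetical data for illustration only.
sample_df = pd.DataFrame({
    "Name": ["Asha", "Ravi", "Meera", "Ravi"],
    "Age": [25, 32, 41, 32],
})

# keep=False flags all copies of a duplicated row, including the first one.
duplicate_rows = sample_df[sample_df.duplicated(keep=False)]
print(duplicate_rows)

# drop_duplicates() keeps the first occurrence of each row by default.
deduplicated = sample_df.drop_duplicates()
print(deduplicated)
```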

Check Value Counts of Each Value in the Columns

To check the value counts of each unique value in each column of a pandas DataFrame, you can use the value_counts() method on each column. The value_counts() method returns a Series containing the count of each unique value in the column.

df['column_name'].value_counts()

You can also use a for loop to print the value counts of every column.

for column in df.columns:
    print(df[column].value_counts())
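
value_counts() can also return proportions instead of raw counts. The sketch below uses a made-up Series of department names (an assumption for illustration) and passes normalize=True to get the share of each category, which is then scaled to a percentage.

```python
import pandas as pd

# Hypothetical data for illustration only.
departments = pd.Series(["IT", "Sales", "IT", "HR", "IT"], name="Department")

counts = departments.value_counts()
# normalize=True divides each count by the total number of values.
shares = departments.value_counts(normalize=True) * 100

print(counts)
print(shares.round(2))
```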

Check Percentage of Each Category in a Column Using Visualization

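A pie chart shows the share of each category in a column. Pass the value counts of the column to plt.pie().

import matplotlib.pyplot as plt

counts = df['column_name'].value_counts()
# Label each slice with its category and show its share as a percentage
plt.pie(counts, labels=counts.index, autopct="%0.2f%%")
plt.show()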

Outliers Detection Using Plots

Outliers are data points that significantly differ from the majority of the data. To detect outliers using plots, you can use various graphical methods such as box plots, scatter plots, and histograms. These plots can help visualize the distribution of data and identify observations that fall far away from the typical values.

Box Plot:

import matplotlib.pyplot as plt

df.boxplot()
plt.title("Box Plot - Outlier Detection")
plt.show()
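
With matplotlib's default settings, a box plot draws as individual points the values that lie more than 1.5 times the interquartile range (IQR) below the first quartile or above the third quartile. The same rule can be applied without a plot. This is a minimal sketch on a made-up Age column, and the 1.5 multiplier is the conventional default rather than a fixed requirement.

```python
import pandas as pd

# Hypothetical data for illustration only; 95 is the planted outlier.
ages = pd.Series([22, 25, 27, 29, 30, 31, 33, 35, 95], name="Age")

q1 = ages.quantile(0.25)  # first quartile
q3 = ages.quantile(0.75)  # third quartile
iqr = q3 - q1

# Fences used by the 1.5 * IQR rule.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = ages[(ages < lower_fence) | (ages > upper_fence)]
print(outliers)
```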

Scatter Plot:

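Suppose the dataset has Age and Salary columns that are related to each other. In a scatter plot of the two, points that sit far away from the main cluster are potential outliers.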

plt.figure(figsize=(8, 6))
plt.scatter(df['Age'], df['Salary'])
plt.title("Scatter Plot - Outlier Detection")
plt.xlabel("Age")
plt.ylabel("Salary")
plt.show()

Histogram:

A histogram shows the frequency distribution of a numerical variable. Outliers can be seen as isolated bars far away from the main distribution.

plt.figure(figsize=(8, 6))
plt.hist(df['Age'], bins=10)
plt.title("Histogram - Outlier Detection")
plt.xlabel("Age")
plt.ylabel("Frequency")
plt.show()
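
The bin counts behind a histogram can also be inspected directly. As a sketch on a made-up list of ages (an assumption for illustration), np.histogram() returns the count in each bin and the bin edges. A non-empty bin separated from the rest by a run of empty bins corresponds to the isolated bar described above.

```python
import numpy as np

# Hypothetical data for illustration only; 95 is the planted outlier.
ages = [22, 25, 27, 29, 30, 31, 33, 35, 95]

# Same number of bins as the plot above.
counts, edges = np.histogram(ages, bins=10)

for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"{left:6.1f} to {right:6.1f}: {count}")
```

Here the first two bins hold eight of the nine values and the last bin holds a single value, with seven empty bins in between.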

There are many more ways to do basic analysis of datasets using visual and non-visual analysis.

By Akshay Tekam

software developer, Data science enthusiast, content creator.
