
Correlation analysis is a powerful statistical technique used to examine the relationships between variables in a dataset. It helps us understand how variables are related and provides insights into their dependencies. In this article, we will explore how to create a correlation matrix using the pandas library in Python. By leveraging pandas’ functionalities, we can easily calculate and visualize correlations to gain valuable insights from our data.

Correlation is a statistical measure of the relationship between two variables. In pandas, the df.corr() method produces the correlation matrix by computing the pairwise correlation of every pair of columns in the DataFrame. Missing (NA) values are excluded automatically. Columns with non-numeric data types were silently ignored by older pandas releases; since pandas 2.0 they must be dropped first or excluded by passing numeric_only=True.
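To make the handling of missing values and non-numeric columns concrete, here is a minimal sketch. The column names and values are hypothetical, and select_dtypes is used to pick the numeric columns because it behaves the same across pandas versions (in recent releases, df.corr(numeric_only=True) should give the same result).

```python
import numpy as np
import pandas as pd

# Hypothetical data: one missing value and one text column.
df = pd.DataFrame({
    "height": [150.0, 160.0, 170.0, 180.0, np.nan],
    "weight": [50.0, 58.0, 69.0, 80.0, 90.0],
    "name": ["a", "b", "c", "d", "e"],
})

# Keep only the numeric columns before calling corr().
numeric = df.select_dtypes(include="number")
corr = numeric.corr()
print(corr)

# The row with the missing height is skipped for the height/weight pair,
# so the result matches a calculation on the complete rows only.
complete = numeric.dropna()
print(complete["height"].corr(complete["weight"]))
```

The text column never enters the calculation, and the incomplete row is left out only for the pairs that involve the missing value.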

What is Correlation

Correlation is a measure of the statistical relationship between two variables. It indicates how changes in one variable are associated with changes in another. The correlation coefficient, typically denoted by the symbol “r,” ranges from -1 to +1. A positive correlation (r > 0) suggests a direct relationship, while a negative correlation (r < 0) indicates an inverse relationship. A correlation value close to 0 signifies a weak or no relationship between the variables.
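As a quick illustration of the sign of r, the sketch below compares one variable against two others using Series.corr. The numbers are made up for the example: one series rises with the first variable and the other falls.

```python
import pandas as pd

# Hypothetical values chosen so the direction of each relationship is obvious.
hours_studied = pd.Series([1, 2, 3, 4, 5])
exam_score = pd.Series([52, 60, 71, 80, 88])   # rises as hours rise
hours_of_tv = pd.Series([9, 8, 6, 5, 2])       # falls as hours rise

r_pos = hours_studied.corr(exam_score)
r_neg = hours_studied.corr(hours_of_tv)

print("hours vs score:", round(r_pos, 3))   # positive, close to +1
print("hours vs tv:", round(r_neg, 3))      # negative, close to -1
```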

Creating a Correlation Matrix with Pandas

Here’s a step-by-step guide:

Step 1: Import the necessary libraries.

Step 2: Load the dataset.

Step 3: Calculate the correlation matrix.

Step 4: Visualize the correlation matrix.

Step 5: Interpret the correlation matrix.
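The five steps can be sketched end to end as follows. This is only one possible arrangement: the dataset is a small inline table with hypothetical values standing in for a real file, and the plot uses plain matplotlib and is saved under an arbitrary file name.

```python
# Step 1: import the necessary libraries.
import pandas as pd
import matplotlib.pyplot as plt

# Step 2: load the dataset. A small inline table (hypothetical values)
# stands in here for reading a real file.
df = pd.DataFrame({
    "temperature": [18, 21, 24, 27, 30, 33],
    "ice_cream_sales": [120, 150, 190, 240, 300, 330],
    "coat_sales": [80, 70, 50, 30, 20, 10],
})

# Step 3: calculate the correlation matrix (Pearson is the default method).
corr = df.corr()

# Step 4: visualize the matrix as a grid of colored cells.
fig, ax = plt.subplots()
image = ax.imshow(corr.to_numpy(), cmap="coolwarm", vmin=-1, vmax=1)
ax.set_xticks(range(len(corr.columns)))
ax.set_xticklabels(corr.columns, rotation=45, ha="right")
ax.set_yticks(range(len(corr.index)))
ax.set_yticklabels(corr.index)
colorbar = fig.colorbar(image, ax=ax)
colorbar.set_label("correlation coefficient")
fig.tight_layout()
fig.savefig("correlation_matrix.png")  # arbitrary output file name

# Step 5: interpret. Values near +1 or -1 indicate strong linear
# relationships; values near 0 indicate weak or no linear relationship.
print(corr.round(2))
```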

Example 1:  

import pandas as pd 

data = {'A': [45, 37, 42], 
        'B': [38, 31, 26], 
        'C': [10, 15, 17] 
        } 

df = pd.DataFrame(data) 
  
# Compute the pairwise Pearson correlation of all columns
corrMatrix = df.corr()
print(corrMatrix)

          A         B         C
A  1.000000  0.458388 -0.583324
B  0.458388  1.000000 -0.989268
C -0.583324 -0.989268  1.000000

The values on the diagonal show the correlation of each variable with itself, so every diagonal entry is 1.

Example 2: 

import pandas as pd 

data = {'A': [45, 37, 42, 50], 
        'B': [38, 31, 26, 90], 
        'C': [10, 15, 17, 100], 
        'D': [60, 99, 23, 56], 
        'E': [76, 98, 78, 90] 
        } 
  
df = pd.DataFrame(data) 
  
# Compute the pairwise Pearson correlation of all columns
corrMatrix = df.corr()
print(corrMatrix)

          A         B         C         D         E
A  1.000000  0.830705  0.769591 -0.440535 -0.324389
B  0.830705  1.000000  0.972514 -0.007424  0.256854
C  0.769591  0.972514  1.000000 -0.092702  0.309316
D -0.440535 -0.007424 -0.092702  1.000000  0.771163
E -0.324389  0.256854  0.309316  0.771163  1.000000
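With five columns the matrix already holds ten distinct pairs, so reading it by eye gets harder. One possible way to interpret it, sketched here with the Example 2 data, is to list each pair once and sort by the absolute value of the coefficient. The variable names pairs and ranked are just illustrative.

```python
from itertools import combinations

import pandas as pd

data = {'A': [45, 37, 42, 50],
        'B': [38, 31, 26, 90],
        'C': [10, 15, 17, 100],
        'D': [60, 99, 23, 56],
        'E': [76, 98, 78, 90]}

corr = pd.DataFrame(data).corr()

# Visit each unordered pair of columns once (this skips the diagonal
# and the mirrored lower half of the matrix).
pairs = {(a, b): corr.loc[a, b] for a, b in combinations(corr.columns, 2)}

# Rank pairs from strongest to weakest, ignoring the sign.
ranked = pd.Series(pairs).sort_values(key=abs, ascending=False)
print(ranked)
```

For this data the strongest pair is B and C, and the weakest is B and D, which matches the matrix above.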

Mathematical Formula for Correlation

r = (Σ((X – μX) * (Y – μY))) / (sqrt(Σ((X – μX)^2) * Σ((Y – μY)^2)))

In this formula:

  • Σ is the summation symbol: the expression that follows it is added up over all data points.
  • X and Y are the variables for which we want to calculate the correlation.
  • μX and μY are the means (averages) of the X and Y variables, respectively.
  • (X – μX) and (Y – μY) represent the deviations of each value from their respective means.
  • sqrt() denotes the square root function.

To compute the correlation coefficient using this formula, you need to calculate the mean of both variables and then sum the products of the deviations from the mean. This sum is divided by the square root of the product of the sum of squared deviations for both variables.

The resulting correlation coefficient, “r,” will range from -1 to +1. A value close to +1 indicates a strong positive correlation, while a value close to -1 suggests a strong negative correlation. A value of 0 signifies no linear correlation between the variables.
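The calculation described above can be written out directly and checked against pandas. This is a minimal sketch using NumPy; the helper name pearson_r is made up for this example, and the data are columns A and B from Example 1.

```python
import numpy as np
import pandas as pd


def pearson_r(x, y):
    """Pearson correlation computed term by term from the formula."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()                      # deviations (X - μX)
    dy = y - y.mean()                      # deviations (Y - μY)
    numerator = (dx * dy).sum()            # sum of products of deviations
    denominator = np.sqrt((dx ** 2).sum() * (dy ** 2).sum())
    return numerator / denominator


a = [45, 37, 42]
b = [38, 31, 26]

manual = pearson_r(a, b)
from_pandas = pd.Series(a).corr(pd.Series(b))

print(manual)
print(from_pandas)
```

Both values should agree with the A/B entry of the matrix in Example 1 (0.458388), up to floating-point rounding.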

Correlation Heatmap

A correlation heatmap displays a correlation matrix as a grid of colored cells. The same variables label both the rows and the columns, and the color of each cell encodes the correlation coefficient of the corresponding pair. Because the coefficients run from -1 to +1, a diverging color scale centered at 0 is a common choice, although a single-hue scale also works. Heatmaps make patterns easy to spot, such as groups of strongly related variables, and, as with any heatmap, a color bar beside the grid shows which colors correspond to which values.

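Example (a general heatmap of benchmark scores, using seaborn's sample "glue" dataset):

import seaborn as sns
import matplotlib.pyplot as plt

# Load the sample dataset (downloaded on first use) and pivot it into a
# Model x Task table of scores; recent pandas needs keyword arguments here.
glue = sns.load_dataset("glue").pivot(index="Model", columns="Task", values="Score")

# Draw the table as a heatmap; seaborn adds a color bar by default.
sns.heatmap(glue)
plt.show()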
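The same function can draw an actual correlation matrix. The sketch below reuses the Example 2 data; the color map, the annotation format, and the output file name are choices made for this illustration rather than requirements.

```python
import pandas as pd
import seaborn as sns

data = {'A': [45, 37, 42, 50],
        'B': [38, 31, 26, 90],
        'C': [10, 15, 17, 100],
        'D': [60, 99, 23, 56],
        'E': [76, 98, 78, 90]}

corr = pd.DataFrame(data).corr()

# Fix the color scale to the full range of r and center it at 0, so that
# positive and negative correlations get visually distinct colors.
ax = sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm",
                 vmin=-1, vmax=1, center=0, square=True)
ax.set_title("Correlation matrix")
ax.figure.tight_layout()
ax.figure.savefig("correlation_heatmap.png")  # arbitrary output file name
```

Setting annot=True prints each coefficient inside its cell, which helps when the colors of two cells are hard to tell apart.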

Conclusion

Correlation matrices play a crucial role in data analysis, allowing us to understand the relationships between variables. With the help of pandas, we can calculate a correlation matrix in a single call and visualize it with a heatmap. Understanding the relationships within our data helps us identify patterns, choose variables for further modeling, and make better-informed predictions. Thanks to pandas' flexibility and intuitive syntax, exploring correlations is a straightforward process.

Remember, correlation does not imply causation. While correlations provide valuable insights, additional analysis and domain knowledge are necessary to draw meaningful conclusions and make informed decisions based on the correlations observed.

By Akshay Tekam

software developer, Data science enthusiast, content creator.
