Understanding Index in Pandas

In pandas, the index is a fundamental component of a DataFrame or Series. It provides labels that identify and reference individual rows or elements in the data structure; the labels are often unique, but pandas does not require this. The index serves as a labeled axis along which data can be accessed, aligned, and manipulated.
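
As a small illustration of labeled access and alignment, the sketch below uses two made-up Series. The labels and values are invented for this example and are not taken from the cast data used later.

```python
import pandas as pd

# Two small Series with string labels as the index (values are made up)
a = pd.Series([10, 20, 30], index=['x', 'y', 'z'])
b = pd.Series([1, 2, 3], index=['z', 'y', 'w'])

# Access an element by its label
print(a.loc['y'])

# Arithmetic aligns on labels, not on position;
# labels present in only one Series produce NaN
total = a + b
print(total)
```

Here 'y' and 'z' exist in both Series and are added by label, while 'x' and 'w' exist in only one of them and come out as NaN.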

Also Check Out: Basics Of Pandas

The index is a very important tool in pandas. It is used to organize the data and to provide fast access to it. In this section, data-access times are compared for data with and without indexing. A Jupyter notebook is used because the ‘%%time’ and ‘%%timeit’ magics make it easy to compare the time required for various access operations.

Note: The CSV files can be downloaded from the link below.

Link: https://bitbucket.org/pythondsp/pandasguide/downloads/

Creating Index

import pandas as pd
cast = pd.read_csv('cast.csv', index_col=None)
cast.head()

                  title  year      name   type                character     n
0        Closet Monster  2015  Buffy #1  actor                  Buffy 4  31.0
1       Suuri illusioni  1985    Homo $  actor                   Guests  22.0
2   Battle of the Sexes  2017   $hutter  actor          Bobby Riggs Fan  10.0
3  Secret in Their Eyes  2015   $hutter  actor          2002 Dodger Fan   NaN
4            Steve Jobs  2015   $hutter  actor  1988 Opera House Patron   NaN
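
The effect of ‘index_col’ can be tried without downloading the file. The sketch below is only an illustration: it reads a two-row CSV string (built from the first two rows shown above, with fewer columns) through io.StringIO instead of 'cast.csv'.

```python
import io

import pandas as pd

# Tiny in-memory stand-in for cast.csv (only three of the columns)
csv_text = """title,year,name
Closet Monster,2015,Buffy #1
Suuri illusioni,1985,Homo $
"""

# index_col=None keeps the default integer index
default = pd.read_csv(io.StringIO(csv_text), index_col=None)
print(default.index)

# index_col='title' uses that column as the index while reading
indexed = pd.read_csv(io.StringIO(csv_text), index_col='title')
print(indexed.index.name)
print(list(indexed.columns))
```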

The %%time magic command is used in Jupyter Notebook or IPython to measure the execution time of a code cell. When this command is placed at the beginning of a cell, it records the time taken to execute the entire cell and displays the result. It is a convenient way to quickly assess the performance of code and identify any potential bottlenecks.

%%time
# data access without indexing
cast[cast['title']=='Macbeth']

CPU times: total: 15.6 ms
Wall time: 22.1 ms

         title  year                 name     type        character     n
12868  Macbeth  2015       Darren Adamson    actor          Soldier   NaN
22302  Macbeth  1916  Spottiswoode Aitken    actor           Duncan   4.0
25855  Macbeth  1948          Robert Alan    actor   Third Murderer   NaN
26990  Macbeth  2016        John Albasiny    actor           Doctor   NaN
38090  Macbeth  1948       William Alland    actor  Second Murderer  18.0
40639  Macbeth  1997         Stevie Allen    actor         Murderer  21.0
60543  Macbeth  2014          Moyo Akand?  actress            Witch   NaN
63776  Macbeth  1916           Mary Alden  actress     Lady Macduff   6.0

CPU times: This indicates the total CPU time spent executing the code, including both user time and system time. It measures how long the CPU was actually busy with the code. In our example only the total is shown.
Wall time: This represents the wall-clock time or the real-world time taken to execute the code. It includes the CPU time as well as any time spent waiting for I/O operations or other external factors.
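
The difference between the two measurements can be reproduced outside Jupyter with the standard ‘time’ module. The helper name ‘timed’ below is hypothetical and only serves this sketch; it is not how ‘%%time’ itself is implemented.

```python
import time


def timed(func):
    """Run func once and return (wall_seconds, cpu_seconds)."""
    wall_start = time.perf_counter()   # wall-clock timer
    cpu_start = time.process_time()    # CPU time of this process (user + system)
    func()
    cpu = time.process_time() - cpu_start
    wall = time.perf_counter() - wall_start
    return wall, cpu


# Sleeping consumes wall-clock time but almost no CPU time
wall, cpu = timed(lambda: time.sleep(0.2))
print(f"wall: {wall:.3f} s, cpu: {cpu:.3f} s")
```

A cell that mostly waits (for disk, network, or sleep) shows a wall time well above its CPU time, which is the situation described above.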

‘%%timeit’ gives more precise results because it runs the cell many times and reports the average time; however, it does not display the cell's output.

%%timeit
# data access without indexing
cast[cast['title']=='Macbeth']

12.8 ms ± 2.07 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Let’s break down the result of %%timeit:

12.8 ms: This is the average time taken per loop iteration. It indicates the average execution time of the code snippet.
± 2.07 ms: This represents the standard deviation, which provides a measure of the variability or spread of the execution times across multiple runs. A higher standard deviation indicates more variability in the execution times.
per loop: This means the reported time is for a single execution of the code snippet. The total time of each run is divided by the number of loops in that run.
(mean ± std. dev. of 7 runs, 10 loops each): This specifies the number of times the code snippet was measured (7 runs) and the number of times it was executed in a loop during each measurement (10 loops). In this case, the average time was calculated based on 7 runs, with 10 loops per run.
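
A similar summary can be produced in a plain Python script with the standard ‘timeit’ module. This is a rough sketch, not a copy of what ‘%%timeit’ does internally: the DataFrame ‘df’ is made up, the loop and run counts are fixed by hand, and using the population standard deviation is an assumption.

```python
import statistics
import timeit

import pandas as pd

# Small made-up frame standing in for the cast data
df = pd.DataFrame({'title': ['A', 'B', 'A', 'C'] * 1000,
                   'year': range(4000)})

loops, runs = 10, 7

# Each entry of `totals` is the total time for `loops` executions
totals = timeit.repeat(lambda: df[df['title'] == 'A'],
                       number=loops, repeat=runs)

# Divide by the loop count to get the time of one execution
per_loop = [t / loops for t in totals]

mean = statistics.mean(per_loop)
std = statistics.pstdev(per_loop)
print(f"{mean * 1e3:.3f} ms ± {std * 1e3:.3f} ms per loop "
      f"(mean ± std. dev. of {runs} runs, {loops} loops each)")
```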

‘set_index’ can be used to create an index for the data. Note that in the code below ‘title’ is set as the index, so the index numbers are replaced by ‘title’ (see the first column).

# a single column can also be passed as a plain string:
# c = cast.set_index('title')
# the list form below extends to multiple index columns
c = cast.set_index(['title'])
c.head(4)

                      year      name   type        character     n
title
Closet Monster        2015  Buffy #1  actor          Buffy 4  31.0
Suuri illusioni       1985    Homo $  actor           Guests  22.0
Battle of the Sexes   2017   $hutter  actor  Bobby Riggs Fan  10.0
Secret in Their Eyes  2015   $hutter  actor  2002 Dodger Fan   NaN
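
The behaviour of ‘set_index’ can be checked on a tiny frame. The data below is made up for illustration (it only reuses the column names of the cast data); the sketch shows that the string and single-element list forms give the same result and that the original DataFrame is left unchanged.

```python
import pandas as pd

df = pd.DataFrame({
    'title': ['Macbeth', 'Hamlet', 'Macbeth'],
    'year': [1948, 1996, 2015],
    'n': [18.0, 1.0, None],
})

by_str = df.set_index('title')      # single column given as a string
by_list = df.set_index(['title'])   # same column given in a list
print(by_str.equals(by_list))

# set_index returns a new DataFrame; df keeps its default integer index
print(list(df.index))
print(by_str.index.name)
```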

To take advantage of this index, use ‘.loc’ for label-based lookups.

%%time
# data access with indexing
# note that on an unsorted index any improvement is minor
c.loc['Macbeth']

CPU times: total: 15.6 ms
Wall time: 51.4 ms

         year                 name     type        character     n
title
Macbeth  2015       Darren Adamson    actor          Soldier   NaN
Macbeth  1916  Spottiswoode Aitken    actor           Duncan   4.0
Macbeth  1948          Robert Alan    actor   Third Murderer   NaN
Macbeth  2016        John Albasiny    actor           Doctor   NaN
Macbeth  1948       William Alland    actor  Second Murderer  18.0
Macbeth  1997         Stevie Allen    actor         Murderer  21.0
Macbeth  2014          Moyo Akand?  actress            Witch   NaN
Macbeth  1916           Mary Alden  actress     Lady Macduff   6.0

%%timeit
# data access with indexing
# note that there is minor performance improvement
c.loc['Macbeth']

Indexing alone may give only a minor improvement here, and a single ‘%%time’ reading is noisy, so ‘%%timeit’ is the better basis for comparison. The speed increases further if the index is in sorted order.

Next, we will sort the index and perform the filter operation:

cs = cast.set_index(['title']).sort_index()
cs.tail(4)

%%time
# data access with a sorted index
# lookups on a sorted index are usually much faster; a single %%time run is noisy
cs.loc['Macbeth']

CPU times: total: 46.9 ms
Wall time: 55 ms
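
Whether an index is sorted can be checked with the ‘is_monotonic_increasing’ property. The sketch below uses a small made-up frame; the names ‘c’ and ‘cs’ mirror the ones above but refer to this toy data, not to the cast data.

```python
import pandas as pd

df = pd.DataFrame({
    'title': ['Macbeth', 'Hamlet', 'Othello', 'Macbeth'],
    'year': [1948, 1996, 1995, 2015],
})

c = df.set_index('title')   # index in original (unsorted) order
cs = c.sort_index()         # index in sorted order

print(c.index.is_monotonic_increasing)
print(cs.index.is_monotonic_increasing)

# The same rows come back either way; only the lookup speed may differ
same = sorted(c.loc['Macbeth', 'year']) == sorted(cs.loc['Macbeth', 'year'])
print(same)
```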

Multiple Index

Further, the data can have multiple index levels.

# data with two index levels, i.e. title and n
cm = cast.set_index(['title', 'n']).sort_index()
cm.tail(30)
cm.loc['Macbeth']

      year                 name     type        character
n
4.0   1916  Spottiswoode Aitken    actor           Duncan
6.0   1916           Mary Alden  actress     Lady Macduff
18.0  1948       William Alland    actor  Second Murderer
21.0  1997         Stevie Allen    actor         Murderer
NaN   2015       Darren Adamson    actor          Soldier
NaN   1948          Robert Alan    actor   Third Murderer
NaN   2016        John Albasiny    actor           Doctor
NaN   2014          Moyo Akand?  actress            Witch

In the above result, ‘title’ no longer appears in the index because it has already been used for the selection; the remaining level, ‘n’, can be used for further filtering. Let’s filter the data again using the second index level as well.

# show Macbeth with ranking 4-18
cm.loc['Macbeth'].loc[4:18]

      year                 name     type        character
n
4.0   1916  Spottiswoode Aitken    actor           Duncan
6.0   1916           Mary Alden  actress     Lady Macduff
18.0  1948       William Alland    actor  Second Murderer
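
The chained ‘.loc’ calls can also be written as a single ‘.loc’ call with a tuple. The sketch below uses a small made-up frame (the actor names are placeholders) and assumes the index has been sorted, which slicing on a multi-level index generally needs.

```python
import pandas as pd

df = pd.DataFrame({
    'title': ['Macbeth', 'Macbeth', 'Macbeth', 'Hamlet'],
    'n': [4.0, 6.0, 18.0, 1.0],
    'name': ['actor A', 'actor B', 'actor C', 'actor D'],
})
cm = df.set_index(['title', 'n']).sort_index()

# Chained form, as in the text: the result is indexed by 'n' only
chained = cm.loc['Macbeth'].loc[4.0:18.0]

# Single call: the tuple gives one selector per level
# (a label for 'title', a slice for 'n'); both levels stay in the index
direct = cm.loc[('Macbeth', slice(4.0, 18.0)), :]

print(chained)
print(direct)
```

Both forms select the same rows; the single call avoids creating the intermediate DataFrame.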

If a single label matches exactly one row, a Series is returned (instead of a DataFrame).

# show Macbeth with ranking 4
cm.loc['Macbeth'].loc[4]

year                        1916
name         Spottiswoode Aitken
type                       actor
character                 Duncan
Name: 4.0, dtype: object
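
If a DataFrame is wanted even for a single match, the label can be passed inside a list. The frame below is made up for this sketch and only mirrors the structure of ‘cm’.

```python
import pandas as pd

df = pd.DataFrame({
    'title': ['Macbeth', 'Macbeth', 'Hamlet'],
    'n': [4.0, 6.0, 1.0],
    'name': ['actor A', 'actor B', 'actor C'],
})
cm = df.set_index(['title', 'n']).sort_index()

row = cm.loc['Macbeth'].loc[4.0]      # scalar label, one match: Series
frame = cm.loc['Macbeth'].loc[[4.0]]  # list of labels: DataFrame

print(type(row).__name__)
print(type(frame).__name__)
```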

Reset Index

The index can be reset using the ‘reset_index’ command. Let’s look at the ‘cm’ DataFrame again.

The ‘cm’ DataFrame has two index levels; one of them, n, is removed from the index using the ‘reset_index’ command.

# remove 'n' from index
cm = cm.reset_index('n')
cm.head(2)

                        n  year           name     type          character
title
#1 Serial Killer     17.0  2013  Michael Alton    actor  Detective Roberts
#DigitalLivesMatter   NaN  2016     Rashan Ali  actress      News Reporter
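
‘reset_index’ has a few variants worth knowing. The sketch below uses a small made-up frame to compare moving one level back to a column, moving all levels back, and discarding a level with ‘drop=True’.

```python
import pandas as pd

df = pd.DataFrame({
    'title': ['Macbeth', 'Hamlet'],
    'n': [4.0, 1.0],
    'year': [1916, 1996],
})
cm = df.set_index(['title', 'n']).sort_index()

only_title = cm.reset_index('n')           # 'n' becomes a column again
flat = cm.reset_index()                    # every level becomes a column
dropped = cm.reset_index('n', drop=True)   # 'n' is discarded entirely

print(list(only_title.columns))
print(list(flat.columns))
print(list(dropped.columns))
```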

Conclusion

In conclusion, the index is a crucial component in pandas that provides a labeled axis for referencing and manipulating data in a DataFrame or Series. Here are the key points to remember about the index in pandas:

The index labels the rows or elements in a DataFrame or Series, allowing for efficient data access, alignment, and manipulation. Labels are often unique, but duplicates are allowed, as the repeated ‘Macbeth’ titles above show.
The default index is an integer-based index that starts from 0 and increments by 1 for each row. However, you can define custom indexes using different types, such as integers, dates, or hierarchical levels.
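Index objects are immutable: individual labels cannot be changed in place. This lets pandas share an index safely between objects and supports operations that rely on it, such as alignment and merging. To change the labels, assign a whole new index or use methods such as set_index(), reset_index(), or rename().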
Label-based indexing using .loc[] and integer-based indexing using .iloc[] allow you to retrieve rows, columns, or individual elements based on their index labels or positions.
The index plays a crucial role in aligning data during operations between multiple DataFrames or Series, matching rows based on their index labels.
You can set an existing column as the index using the set_index() method, providing a meaningful label or unique identifier for the rows.
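Pandas provides various types of indexes, such as RangeIndex, DatetimeIndex, and MultiIndex, to handle different types of data and enable advanced indexing and slicing operations. (Int64Index existed in older versions and was removed in pandas 2.0; integer labels are now held in a plain Index with an integer dtype.)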
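
Two of the points above, label-based versus position-based access and the immutability of Index objects, can be seen in a short sketch. The Series and its labels are made up for illustration.

```python
import pandas as pd

s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])

by_label = s.loc['b']     # label-based access
by_position = s.iloc[1]   # position-based access
print(by_label, by_position)

# Index objects do not support item assignment
try:
    s.index[0] = 'z'
    mutated = True
except TypeError:
    mutated = False
print(mutated)

# Replacing the whole index at once is allowed
s.index = ['x', 'y', 'z']
print(list(s.index))
```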

Understanding and utilizing the index effectively in pandas allows for efficient data exploration, analysis, and manipulation. It enables you to leverage the power of labeled indexing and align data seamlessly, unlocking the full potential of pandas for data manipulation and analysis tasks.