In pandas, the index is a fundamental component of a DataFrame or Series. It provides labels that identify and reference individual rows or elements in the data structure; the labels are often unique, but they do not have to be. The index serves as a labeled axis along which data can be accessed, aligned, and manipulated.
Also Check Out: Basics Of Pandas
The index is a very important tool in pandas. It is used to organize the data and to provide fast access to it. In this section, data-access times are compared for data with and without indexing. A Jupyter notebook is used for this topic, as ‘%%timeit’ makes it very easy to compare the time required for various access operations.
Note: The CSV files can be downloaded from the link below.
Link: https://bitbucket.org/pythondsp/pandasguide/downloads/
Creating Index
import pandas as pd
cast = pd.read_csv('cast.csv', index_col=None)
cast.head()
title year name type character n
0 Closet Monster 2015 Buffy #1 actor Buffy 4 31.0
1 Suuri illusioni 1985 Homo $ actor Guests 22.0
2 Battle of the Sexes 2017 $hutter actor Bobby Riggs Fan 10.0
3 Secret in Their Eyes 2015 $hutter actor 2002 Dodger Fan NaN
4 Steve Jobs 2015 $hutter actor 1988 Opera House Patron NaN
The %%time magic command is used in Jupyter Notebook or IPython to measure the execution time of a code cell. When this command is placed at the beginning of a cell, it records the time taken to execute the entire cell and displays the result. It is a convenient way to quickly assess the performance of code and identify any potential bottlenecks.
%%time
# data access without indexing
cast[cast['title']=='Macbeth']
CPU times: total: 15.6 ms
Wall time: 22.1 ms
title year name type character n
12868 Macbeth 2015 Darren Adamson actor Soldier NaN
22302 Macbeth 1916 Spottiswoode Aitken actor Duncan 4.0
25855 Macbeth 1948 Robert Alan actor Third Murderer NaN
26990 Macbeth 2016 John Albasiny actor Doctor NaN
38090 Macbeth 1948 William Alland actor Second Murderer 18.0
40639 Macbeth 1997 Stevie Allen actor Murderer 21.0
60543 Macbeth 2014 Moyo Akand? actress Witch NaN
63776 Macbeth 1916 Mary Alden actress Lady Macduff 6.0
- CPU times: This indicates the total CPU time spent executing the code, including both user time and system time. It is a measure of the actual time the CPU was active for the code execution. In our example, only the total time is shown.
- Wall time: This represents the wall-clock time or the real-world time taken to execute the code. It includes the CPU time as well as any time spent waiting for I/O operations or other external factors.
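The difference between the two timings can be reproduced outside Jupyter with the standard library. The following is a minimal sketch, not what ‘%%time’ does internally; the helper name `measure` is hypothetical and used only for illustration.

```python
import time


def measure(func):
    """Return (wall_seconds, cpu_seconds) for a single call of func."""
    wall_start = time.perf_counter()   # wall-clock timer
    cpu_start = time.process_time()    # user + system CPU time of this process
    func()
    cpu = time.process_time() - cpu_start
    wall = time.perf_counter() - wall_start
    return wall, cpu


# Sleeping uses wall-clock time but almost no CPU time.
wall_sleep, cpu_sleep = measure(lambda: time.sleep(0.2))

# A busy loop uses both.
wall_busy, cpu_busy = measure(lambda: sum(i * i for i in range(200_000)))

print(f"sleep: wall={wall_sleep:.3f}s cpu={cpu_sleep:.3f}s")
print(f"busy:  wall={wall_busy:.3f}s cpu={cpu_busy:.3f}s")
```

For the sleeping call, the wall time is much larger than the CPU time, which is the same effect as waiting for I/O.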
‘%%timeit’ can be used for more precise results, as it runs the cell several times and displays the average time; however, it does not show the output of the cell.
%%timeit
# data access without indexing
cast[cast['title']=='Macbeth']
12.8 ms ± 2.07 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Let’s break down the result of %%timeit:
- 12.8 ms: This is the average time taken per loop iteration. It indicates the average execution time of the code snippet.
- ± 2.07 ms: This represents the standard deviation, which provides a measure of the variability or spread of the execution times across multiple runs. A higher standard deviation indicates more variability in the execution times.
- per loop: This means the reported time is for a single execution of the code snippet (one loop iteration), not for a whole run of several loops.
- (mean ± std. dev. of 7 runs, 10 loops each): This specifies the number of times the code snippet was measured (7 runs) and the number of times it was executed in a loop during each measurement (10 loops). In this case, the average time was calculated based on 7 runs, with 10 loops per run.
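The same kind of summary can be approximated with the standard ‘timeit’ module. This is only a sketch of the idea; the function `snippet` is a hypothetical stand-in workload, and the exact statistics used by ‘%%timeit’ may differ in detail.

```python
import statistics
import timeit


def snippet():
    # Hypothetical stand-in workload; in the article this is the DataFrame filter.
    return sorted(range(1000), reverse=True)


runs, loops = 7, 10  # same shape as "7 runs, 10 loops each"

# Each entry is the total time of one run, i.e. `loops` consecutive calls.
totals = timeit.repeat(snippet, repeat=runs, number=loops)

# Divide by the loop count to get the time of a single execution in each run.
per_loop = [t / loops for t in totals]

mean = statistics.mean(per_loop)
std = statistics.pstdev(per_loop)
print(f"{mean * 1e6:.1f} us ± {std * 1e6:.1f} us per loop "
      f"(mean ± std. dev. of {runs} runs, {loops} loops each)")
```

Here the snippet is executed 70 times in total: 7 runs of 10 loops each.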
‘set_index’ can be used to create an index for the data. Note that, in the code below, ‘title’ is set as the index, so the row numbers are replaced by ‘title’ (see the first column).
# a single column name can also be passed as a string:
# c = cast.set_index('title')
# a list is required when several columns are set as the index
c = cast.set_index(['title'])
c.head(4)
year name type character n
title
Closet Monster 2015 Buffy #1 actor Buffy 4 31.0
Suuri illusioni 1985 Homo $ actor Guests 22.0
Battle of the Sexes 2017 $hutter actor Bobby Riggs Fan 10.0
Secret in Their Eyes 2015 $hutter actor 2002 Dodger Fan NaN
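The two ways of calling ‘set_index’ can be compared on a small example. The DataFrame below is a hypothetical miniature of the cast data, made up only for illustration.

```python
import pandas as pd

# Hypothetical miniature of the cast data, for illustration only.
df = pd.DataFrame({
    "title": ["Macbeth", "Hamlet", "Macbeth"],
    "year": [1948, 1996, 2015],
    "n": [18.0, 1.0, None],
})

by_str = df.set_index("title")                # single column given as a string
by_list = df.set_index(["title"])             # same column as a one-element list
two_level = df.set_index(["title", "year"])   # several columns need a list

print(by_str.equals(by_list))         # both calls give the same result
print(by_str.index.name)
print(list(two_level.index.names))
print(df.columns.tolist())            # the original DataFrame is not modified
```

Note that ‘set_index’ returns a new DataFrame; the original keeps its default integer index.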
To take advantage of the above index, use ‘.loc’ to select rows by their label.
%%time
# data access with indexing
# note that there is minor performance improvement
c.loc['Macbeth']
CPU times: total: 15.6 ms
Wall time: 51.4 ms
year name type character n
title
Macbeth 2015 Darren Adamson actor Soldier NaN
Macbeth 1916 Spottiswoode Aitken actor Duncan 4.0
Macbeth 1948 Robert Alan actor Third Murderer NaN
Macbeth 2016 John Albasiny actor Doctor NaN
Macbeth 1948 William Alland actor Second Murderer 18.0
Macbeth 1997 Stevie Allen actor Murderer 21.0
Macbeth 2014 Moyo Akand? actress Witch NaN
Macbeth 1916 Mary Alden actress Lady Macduff 6.0
%%timeit
# data access with indexing
# note that there is minor performance improvement
c.loc['Macbeth']
Indexing can improve lookup performance, and the speed increases further if the index is in sorted order. Note that a single ‘%%time’ measurement is noisy, so ‘%%timeit’ is the more reliable way to compare.
Next, we will sort the index and perform the filter operation:
cs = cast.set_index(['title']).sort_index()
cs.tail(4)
%%time
# data access with indexing
# note that there is huge performance improvement
cs.loc['Macbeth']
CPU times: total: 46.9 ms
Wall time: 55 ms
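The three access methods can also be compared on data that does not require the CSV file. This is a rough sketch on synthetic data (the titles and sizes are assumptions chosen for illustration); the measured times depend on the machine and the pandas version, so no numbers are claimed here.

```python
import timeit

import numpy as np
import pandas as pd

# Synthetic stand-in for the cast data (hypothetical values).
rng = np.random.default_rng(0)
size = 100_000
titles = [f"title_{i}" for i in rng.integers(0, 5_000, size=size)]
df = pd.DataFrame({"title": titles, "n": np.arange(size)})
key = df["title"].iloc[0]  # a title that is guaranteed to exist

unsorted = df.set_index("title")
sorted_ = unsorted.sort_index()

# A sorted index reports itself as monotonic.
print(unsorted.index.is_monotonic_increasing)
print(sorted_.index.is_monotonic_increasing)

candidates = {
    "boolean mask": lambda: df[df["title"] == key],
    "unsorted index": lambda: unsorted.loc[key],
    "sorted index": lambda: sorted_.loc[key],
}
for label, func in candidates.items():
    # Best of 3 runs, 10 loops each, reported per loop.
    best = min(timeit.repeat(func, repeat=3, number=10)) / 10
    print(f"{label:>15}: {best * 1e3:.3f} ms per loop")
```

All three approaches select the same rows; only the lookup strategy differs.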
Multiple Index
Further, the data can have more than one index level (a MultiIndex).
# data with two index levels, i.e. title and n
cm = cast.set_index(['title', 'n']).sort_index()
cm.tail(30)
cm.loc['Macbeth']
year name type character
n
4.0 1916 Spottiswoode Aitken actor Duncan
6.0 1916 Mary Alden actress Lady Macduff
18.0 1948 William Alland actor Second Murderer
21.0 1997 Stevie Allen actor Murderer
NaN 2015 Darren Adamson actor Soldier
NaN 1948 Robert Alan actor Third Murderer
NaN 2016 John Albasiny actor Doctor
NaN 2014 Moyo Akand? actress Witch
In the above result, ‘title’ no longer appears in the index because it was used for the selection; the remaining index level, ‘n’, can be used for further filtering. Let’s filter the data again using the second index level as well.
# show Macbeth with ranking 4-18
cm.loc['Macbeth'].loc[4:18]
year name type character
n
4.0 1916 Spottiswoode Aitken actor Duncan
6.0 1916 Mary Alden actress Lady Macduff
18.0 1948 William Alland actor Second Murderer
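The chained ‘.loc’ calls can also be written as a single selection with a tuple. The sketch below uses a small hypothetical DataFrame (`cm_small`, without missing ‘n’ values) rather than the real cast data; note that the single-call form keeps both index levels in the result.

```python
import pandas as pd

# Hypothetical miniature of the cast data, for illustration only.
df = pd.DataFrame({
    "title": ["Hamlet", "Macbeth", "Macbeth", "Macbeth", "Macbeth"],
    "n": [1.0, 4.0, 6.0, 18.0, 21.0],
    "name": ["A", "B", "C", "D", "E"],
})
cm_small = df.set_index(["title", "n"]).sort_index()

# Chained form, as used in the article.
chained = cm_small.loc["Macbeth"].loc[4:18]

# Single call: (label on the first level, slice on the second level).
single = cm_small.loc[("Macbeth", slice(4, 18)), :]

# pd.IndexSlice allows the usual 4:18 slice syntax inside the tuple.
idx = pd.IndexSlice
same = cm_small.loc[idx["Macbeth", 4:18], :]

print(chained)
print(single)
```

Slicing on a MultiIndex like this relies on the index being sorted, which is why ‘sort_index’ is applied first.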
If only one row matches, a Series is returned (instead of a DataFrame).
# show Macbeth with ranking 4
cm.loc['Macbeth'].loc[4]
year 1916
name Spottiswoode Aitken
type actor
character Duncan
Name: 4.0, dtype: object
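If a DataFrame is needed regardless of the number of matches, the label can be passed inside a list. A minimal sketch on hypothetical data:

```python
import pandas as pd

# Hypothetical data with one unique and one repeated index label.
df = pd.DataFrame(
    {"year": [1916, 1916, 1948], "name": ["Aitken", "Alden", "Alland"]},
    index=pd.Index([4.0, 6.0, 6.0], name="n"),
)

one = df.loc[4]          # single matching row -> Series
many = df.loc[6]         # repeated label -> DataFrame
always_df = df.loc[[4]]  # list of labels -> DataFrame, even for one match

print(type(one).__name__)
print(type(many).__name__)
print(type(always_df).__name__)
```

The list form is useful when later code expects DataFrame methods and columns.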
Reset Index
The index can be reset using the ‘reset_index’ command. Let’s look at the ‘cm’ DataFrame again.
The ‘cm’ DataFrame has two index levels; one of these, i.e. ‘n’, is moved back to a regular column using the ‘reset_index’ command.
# remove 'n' from index
cm = cm.reset_index('n')
cm.head(2)
n year name type character
title
#1 Serial Killer 17.0 2013 Michael Alton actor Detective Roberts
#DigitalLivesMatter NaN 2016 Rashan Ali actress News Reporter
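‘reset_index’ has a few common variants, shown below on a small hypothetical DataFrame (`cm_small` is made up for illustration and is not the real cast data).

```python
import pandas as pd

# Hypothetical miniature of the cast data, for illustration only.
df = pd.DataFrame({
    "title": ["Hamlet", "Macbeth"],
    "n": [1.0, 4.0],
    "year": [1996, 1916],
})
cm_small = df.set_index(["title", "n"])

only_n = cm_small.reset_index("n")              # move level 'n' back to a column
flat = cm_small.reset_index()                   # move every level back to columns
dropped = cm_small.reset_index("n", drop=True)  # discard level 'n' entirely

print(only_n.columns.tolist(), only_n.index.name)
print(flat.columns.tolist())
print(dropped.columns.tolist(), dropped.index.name)
```

With ‘drop=True’ the values of ‘n’ are lost, so it should only be used when that level is no longer needed.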
Conclusion
In conclusion, the index is a crucial component in pandas that provides a labeled axis for referencing and manipulating data in a DataFrame or Series. Here are the key points to remember about the index in pandas:
- The index labels the rows or elements in a DataFrame or Series, allowing for efficient data access, alignment, and manipulation. Labels are often unique but may repeat, as with the ‘Macbeth’ rows above.
- The default index is an integer-based index that starts from 0 and increments by 1 for each row. However, you can define custom indexes using different types, such as integers, dates, or hierarchical levels.
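- Index objects are immutable, meaning their individual labels cannot be changed in place once created; to change the index, you replace it as a whole (for example with set_index() or reset_index()). This makes it safe to share an index between objects and supports operations that rely on indexing, such as alignment and merging.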
- Label-based indexing using .loc[] and integer-based indexing using .iloc[] allow you to retrieve rows, columns, or individual elements based on their index labels or positions.
- The index plays a crucial role in aligning data during operations between multiple DataFrames or Series, matching rows based on their index labels.
- You can set an existing column as the index using the set_index() method, providing a meaningful label or unique identifier for the rows.
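- Pandas provides various types of indexes, such as RangeIndex, DatetimeIndex, and MultiIndex, to handle different types of data and enable advanced indexing and slicing operations. (Int64Index existed in older versions but was removed in pandas 2.0, where integer labels are held in a plain Index with an integer dtype.)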
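A few of the points above (alignment, immutability, and ‘.loc’ versus ‘.iloc’) can be checked with a short sketch. The Series and labels here are hypothetical and chosen only for illustration.

```python
import pandas as pd

a = pd.Series([1, 2, 3], index=["x", "y", "z"])
b = pd.Series([10, 20, 30], index=["z", "y", "w"])

# Addition aligns on labels, not positions; labels present on only
# one side produce NaN in the result.
total = a + b
print(total)

# Individual labels of an Index cannot be changed in place.
try:
    a.index[0] = "q"
    mutated = True
except TypeError:
    mutated = False
print(mutated)

# Replacing the whole index is allowed.
a.index = ["p", "q", "r"]

# .loc selects by label, .iloc by position.
print(a.loc["q"], a.iloc[1])
```

Both selections in the last line return the same element, once by its label and once by its position.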
Understanding and utilizing the index effectively in pandas allows for efficient data exploration, analysis, and manipulation. It enables you to leverage the power of labeled indexing and align data seamlessly, unlocking the full potential of pandas for data manipulation and analysis tasks.