Understanding Groupby in Pandas

In pandas, the groupby() function is used to create groups of data based on one or more columns in a DataFrame. It is a powerful operation that enables various data aggregation and analysis tasks.

Table of Contents

Data can be grouped by columns-headers. Further, custom formats can be defined to group the various elements of the DataFrame.

Also Check Out: Basics Of Pandas

Here’s an overview of the groupby() function in pandas:

grouped = df.groupby(by, axis=0, level=None, as_index=True, sort=True, group_keys=True, squeeze=False, observed=False, dropna=True)

Parameters

by: Specifies the column(s) or label(s) by which the DataFrame should be grouped. It can be a single column name or a list of column names.
axis: Specifies the axis to perform the grouping on. By default, it is set to 0, indicating grouping by rows. Use axis=1 for grouping by columns.
level: If the DataFrame has a multi-level index, you can specify the level(s) to group by.
as_index: Specifies whether to use the grouped column(s) as the index of the resulting DataFrame. By default, it is set to True.
sort: Specifies whether to sort the resulting groups by the group keys. By default, it is set to True.
group_keys: Specifies whether to include the group keys in the resulting DataFrame index. By default, it is set to True.
squeeze: Specifies whether to return a DataFrame or a Series when there is only one group. By default, it is set to False.
observed: Specifies whether to include only observed values in the resulting groups for categorical data. By default, it is set to False.
dropna: Specifies whether to exclude groups with missing values from the grouping. By default, it is set to True.

Note: CSV files can be downloaded from below link

Link: https://bitbucket.org/pythondsp/pandasguide/downloads/

GroupBy with Column-Names

Here Count Values, the value of movies/year were counted using ‘count_values()’ method. Same can be
achieve by ‘groupby’ method as well. The ‘groupby’ command return an object, and we need to an additional functionality to it to get some results. For example, in below code, data is grouped by ‘year’ and then size() command is used. The size() option counts the total number for rows for each year; therefore the result of below code is same as ‘count_values()’ command.

cg = c.groupby(['year']).size()
cg.plot()
plt.show()

Further, groupby option can take multiple parameters for grouping. For example, we want to group the
movies of the actor ‘Aaron Abrams’ based on year.

c = casts
cf = c[c['name'] == 'Aaron Abrams']
cf.groupby(['year']).size().head()

year
2003    2
2004    2
2005    2
2006    1
2007    2
dtype: int64

Above list shows that year-2003 is found in two rows with name-entry as ‘Aaron Abrams’. In the other word, he did 2 movies in 2003.

Next, we want to see the list of movies as well, then we can pass two parameters in the list as shown below:

cf.groupby(['year', 'title']).size().head()

year  title                               
2003  The In-Laws                             1
      The Visual Bible: The Gospel of John    1
2004  Resident Evil: Apocalypse               1
      Siblings                                1
2005  Cinderella Man                          1
dtype: int64

In above code, the groupby operation is performed on the ‘year’ first and then on ‘title’. In the other word, first all the movies are grouped by year. After that, the result of this groupby is again grouped based on titles. Note that, first group command arranged the year in order i.e. 2003, 2004 and 2005 etc. then next group command arranged the title in alphabetical order.

Next, we want to do grouping based on maximum ratings in a year; i.e. we want to group the items by year and see the maximum rating in those years:

c.groupby(['year']).n.max().head()

year
1912     6.0
1913    14.0
1914    39.0
1915    14.0
1916    35.0
Name: n, dtype: float64

Above results show that the maximum rating in year 1912 is 6 for Aaron Abrams.

Similarly, we can check for the minimum rating:

c.groupby(['year']).n.min().head()

year
1912    6.0
1913    1.0
1914    1.0
1915    1.0
1916    1.0
Name: n, dtype: float64

Lastly, we want to check the mean rating each year:

c.groupby(['year']).n.mean().head()

year
1912 6.000000
1913 4.142857
1914 7.085106
1915 4.236111
1916 5.037736
Name: n, dtype: float64

GroupBy with Custom Field

Suppose we want to group the data based on decades, then we need to create a custom groupby field.

# decade conversion : 1985//10 = 198, 198*10 = 1980
decade = c['year']//10*10
c_dec = c.groupby(decade).n.size()
c_dec.head()

year
1910     669
1920    1121
1930    3448
1940    3997
1950    3892
Name: n, dtype: int64

Above results shows the total number of movies in each decade.

What is Unstack in Pandas

Before understanding the unstack, let’s consider one case from cast.csv file. In following code, the data is grouped by decade and type i.e. actor and actress.

Note: CSV files can be downloaded from below link

Link: https://bitbucket.org/pythondsp/pandasguide/downloads/

c = casts
c.groupby( [c['year']//10*10, 'type'] ).size().head(8)

year  type   
1910  actor       384
      actress     285
1920  actor       710
      actress     411
1930  actor      2628
      actress     820
1940  actor      3014
      actress     983
dtype: int64

Now we want to compare and plot the total number of actors and actresses in each decade. One solution to this problem is to grab even and odd rows separately and plot the data, which is quite complicated operation if types has more varieties e.g. new-actor, new-actress and teen-actors etc. A simple solution to such problem is the ‘unstack’, which allows to create a new DataFrame based on the grouped Dataframe, as shown below.

Since we want a plot based on actors and actress, therefore first we need to group the data based on ‘type’ as below:

c = casts
c_decade = c.groupby( ['type', c['year']//10*10] ).size()
c_decade

type     year
actor    1910      384
         1920      710
         1930     2628
         1940     3014
         1950     2877
         1960     2775
         1970     3044
         1980     3565
         1990     5108
         2000    10368
         2010    15523
         2020        4
actress  1910      285
         1920      411
         1930      820
         1940      983
         1950     1015
         1960      968
         1970     1299
         1980     1989
         1990     2544
         2000     5831
         2010     8853
         2020        3
dtype: int64

Now we can create a new DataFrame using ‘unstack’ command. The ‘unstack’ command creates a new
DataFrame based on index.

c_decade.unstack()

year	1910	1920	1930	1940	1950	1960	1970	1980	1990  ...
type												
actor	384	710	2628	3014	2877	2775	3044	3565	5108 ...
actress	285	411	820	983	1015	968	1299	1989	2544 ...

c_decade.unstack().plot()
plt.show()
c_decade.unstack().plot(kind='bar')
plt.show()

To plot the data side by side, use unstack(0) option as shown below (by default unstack(-1) is used)

c_decade.unstack(0)

type	actor	actress
year		
1910	384	285
1920	710	411
1930	2628	820
1940	3014	983
1950	2877	1015
1960	2775	968
1970	3044	1299
1980	3565	1989
1990	5108	2544
2000	10368	5831
2010	15523	8853
2020	4	3

c_decade.unstack(0).plot(kind='bar')
plt.show()

Conclusion

In conclusion, the groupby() function in pandas is a powerful tool for grouping data in a DataFrame based on one or more columns. The groupby() function in pandas is a fundamental tool for data analysis and exploration. It enables you to gain insights into the underlying patterns and relationships within your data by grouping and aggregating it based on specific criteria.