Understanding Merge in Pandas

In pandas, the merge() function is used to combine two or more DataFrames based on a common column or index. It allows you to perform database-style joins on the data, similar to the SQL merge operation.

Table of Contents

Usually, different data of same project are available in various files. To get the useful information from these files, we need to combine these files. Also, we need to merge to different data in the same file to get some specific information. In this article, we will understand these two merges i.e. merge with different file and merge with same file.

Also Check Out: Basics Of Pandas

Here’s an overview of the merge() function in pandas:

merged_df = pd.merge(left, right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=True, suffixes=('_x', '_y'), copy=True, indicator=False, validate=None)

Parameters

left and right: The DataFrames to be merged.
how: Specifies the type of merge to perform. It can be 'inner' (default), 'outer', ‘left’, or ‘right’.
on: The column(s) to merge on. If the column(s) have the same name in both DataFrames, you can specify a single column name as a string. For merging on multiple columns, provide a list of column names.
left_on and right_on: Column(s) in the left and right DataFrames to merge on, respectively, when the column names are different.
left_index and right_index: Specifies whether to use the index of the left or right DataFrame as the merge key.
sort: Specifies whether to sort the resulting DataFrame by the merge key(s).
suffixes: Specifies the suffixes to add to overlapping column names in case of a column name conflict.
copy: Specifies whether to create a new copy of the data or perform the merge in place.
indicator: Specifies whether to include a special column indicating the source of each row (e.g., ‘both’ for rows present in both DataFrames).
validate: Specifies whether to check if merge keys are present in both DataFrames (‘one_to_one’, ‘one_to_many’, ‘many_to_one’, or ‘many_to_many’).

Note: CSV files can be downloaded from below link

Link: https://bitbucket.org/pythondsp/pandasguide/downloads/

Merge with Different Files

We will merge the data of two table i.e. ‘release_dates.csv’ and ‘cast.csv’. The ‘release_dates.csv’
file contains the release date of movies in different countries.

First, load the ‘release_dates.csv’ file, which contains the release dates of some of the movies, listed in
‘cast.csv’. Following are the content of ‘release_dates.csv’ file:

release = pd.read_csv('release_dates.csv', index_col=None)
release.head()

	                title	year	country	       date
0	#73, Shaanthi Nivaasa	2007	India	       2007-06-15
1	#Beings	                2015	Romania	       2015-01-29
2	#Declimax	        2018	Netherlands    2018-01-21
3	#Ewankosau saranghaeyo	2015	Philippines    2015-01-21
4	#Horror	                2015	USA	       2015-11-20

casts.head()

                     title	year	name	type	character	        n
0	    Closet Monster	2015	Buffy # actor	Buffy 4	               31.0
1	   Suuri illusioni	1985	Homo $	actor	Guests	               22.0
2	Battle of the Sexes	2017	$hutter	actor	Bobby Riggs Fan	       10.0
3	Secret in Their Eyes	2015	$hutter	actor	2002 Dodger Fan	        NaN
4	Steve Jobs	        2015	$hutter	actor	1988 Opera House Patron	NaN

Let’s we want to see the release date of the movie ‘Amelia’. For this first, filter out the Amelia from the
DataFrame ‘cast’ as below. There are only two entries for the movie Amelia.

c_amelia = casts[ casts['title'] == 'Amelia']
c_amelia.head()

	title	year	       name	 type	character	n
5767	Amelia	2009	Aaron Abrams	actor	Slim Gordon	8.0
23319	Amelia	2009	Jeremy Akerman	actor	Sheriff	        19.0

Next, we will see the entries of movie ‘Amelia’ in release dates as below. In the below result, we can see that there are two different release years for the movie i.e. 1966 and 2009.

release [ release['title'] == 'Amelia' ].head()

	title	year	country	    date
20543	Amelia	1966	Mexico	    1966-03-10
20544	Amelia	2009	Canada	    2009-10-23
20545	Amelia	2009	USA	    2009-10-23
20546	Amelia	2009	Australia   2009-11-12
20547	Amelia	2009	Singapore   2009-11-12

Since there is not entry for Amelia-1966 in casts DataFrame, therefore merge command will not merge the Amelia-1966 release dates. In following results, we can see that only Amelia 2009 release dates are merges with casts DataFrame.

c_amelia.merge(release).head()

	title	year	name	        type	character	n	country	        date
0	Amelia	2009	Aaron Abrams	actor	Slim Gordon	8.0	Canada	        2009-10-23
1	Amelia	2009	Aaron Abrams	actor	Slim Gordon	8.0	USA	        2009-10-23
2	Amelia	2009	Aaron Abrams	actor	Slim Gordon	8.0	Australia	2009-11-12
3	Amelia	2009	Aaron Abrams	actor	Slim Gordon	8.0	Singapore	2009-11-12
4	Amelia	2009	Aaron Abrams	actor	Slim Gordon	8.0	Ireland	        2009-11-13

Merge Table with Itself

In pandas, you can merge a table with itself, also known as a self-join, to combine rows based on common values in one or more columns. This can be useful when you want to compare or analyze relationships within a single DataFrame.

Suppose, we want see the list of co-actors in the movies. For this, we need to merge the table with itself based on the title and year, as shown below. In the below code, co-star for actor ‘Aaron Abrams’ are displayed.

First, filter out the results for ‘Aaron Abrams’:

c = casts[ casts['name']=='Aaron Abrams' ]
c.head(2)

	           title	year	       name	type	character	n
5765	   #FromJennifer	2017	Aaron Abrams	actor	Ralph Sinclair	NaN
5766	388 Arletta Avenue	2011	Aaron Abrams	actor	Alex	        4.0

Next, to find the co-stars, merge the DataFrame with itself based on ‘title’ and ‘year’ i.e. for being a co-star, the name of the movie and the year must be same. Note that ‘casts’ is used inside the bracket instead of c.

c.merge(casts, on=['title', 'year']).head()

                title	   year	 name_x	       type_x	character_x	n_x	name_y	         type_y	character_y	n_y
0	#FromJennifer	   2017	 Aaron Abrams	actor	Ralph Sinclair	NaN	Aaron Abrams	  actor	Ralph Sinclair	NaN
1	#FromJennifer	   2017	 Aaron Abrams	actor	Ralph Sinclair	NaN	Christian Ackerman actor	Simon	NaN
2	388 Arletta Avenue 2011	 Aaron Abrams	actor	Alex	        4.0	Graham Abbey	  actor	Officer #2	8.0
3	388 Arletta Avenue 2011	 Aaron Abrams	actor	Alex	        4.0	Aaron Abrams	  actor	Alex	        4.0
4	Amelia	           2009	 Aaron Abrams	actor	Slim Gordon	8.0	Aaron Abrams	  actor	Slim Gordon	8.0

The problem with above joining is that it displays the ‘Aaron Abrams’ as his co-actor as well (see first row). This problem can be avoided as below.

c_costar = c.merge (casts, on=['title', 'year'])
c_costar = c_costar[c_costar['name_y'] != 'Aaron Abrams']
c_costar.head()

            title	year	name_x	      type_x	character_x	n_x	name_y	         type_y	character_y	n_y
1    #FromJennifer	2017	Aaron Abrams  actor	Ralph Sinclair	NaN	Christian Ackerman actor  Simon	        NaN
2    388 Arletta Avenue	2011	Aaron Abrams  actor	Alex	        4.0	Graham Abbey	  actor	  Officer #2	8.0
5    Amelia	        2009	Aaron Abrams  actor	Slim Gordon	8.0	Jeremy Akerman	  actor	  Sheriff	19.0
8    Cinderella Man	2005	Aaron Abrams  actor	1928 Fan	67.0	Nick Alachiotis	  actor	 Baer Cornerman	38.0
9    Cinderella Man	2005	Aaron Abrams  actor	1928 Fan	67.0	Nick Alachiotis	  actor	 Undercard Boxer - Feldman	38.0

Conclusion

By performing a self-join, you can gain insights into hierarchical relationships, compare values within the same DataFrame, or analyze patterns and dependencies within the data. It allows you to leverage the power of joins to explore and understand the relationships between rows within a single DataFrame.