In pandas, the merge() function is used to combine two or more DataFrames based on a common column or index. It allows you to perform database-style joins on the data, similar to the SQL merge operation.
Usually, different data of same project are available in various files. To get the useful information from these files, we need to combine these files. Also, we need to merge to different data in the same file to get some specific information. In this article, we will understand these two merges i.e. merge with different file and merge with same file.
Also Check Out: Basics Of Pandas
Here’s an overview of the merge() function in pandas:
merged_df = pd.merge(left, right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=True, suffixes=('_x', '_y'), copy=True, indicator=False, validate=None)
Parameters
- left and right: The DataFrames to be merged.
- how: Specifies the type of merge to perform. It can be
'
inner'
(default),'
outer'
, ‘left’, or ‘right’. - on: The column(s) to merge on. If the column(s) have the same name in both DataFrames, you can specify a single column name as a string. For merging on multiple columns, provide a list of column names.
- left_on and right_on: Column(s) in the left and right DataFrames to merge on, respectively, when the column names are different.
- left_index and right_index: Specifies whether to use the index of the left or right DataFrame as the merge key.
- sort: Specifies whether to sort the resulting DataFrame by the merge key(s).
- suffixes: Specifies the suffixes to add to overlapping column names in case of a column name conflict.
- copy: Specifies whether to create a new copy of the data or perform the merge in place.
- indicator: Specifies whether to include a special column indicating the source of each row (e.g., ‘both’ for rows present in both DataFrames).
- validate: Specifies whether to check if merge keys are present in both DataFrames (‘one_to_one’, ‘one_to_many’, ‘many_to_one’, or ‘many_to_many’).
Note: CSV files can be downloaded from below link
Link: https://bitbucket.org/pythondsp/pandasguide/downloads/
Merge with Different Files
We will merge the data of two table i.e. ‘release_dates.csv’ and ‘cast.csv’. The ‘release_dates.csv’
file contains the release date of movies in different countries.
First, load the ‘release_dates.csv’ file, which contains the release dates of some of the movies, listed in
‘cast.csv’. Following are the content of ‘release_dates.csv’ file:
release = pd.read_csv('release_dates.csv', index_col=None)
release.head()
title year country date
0 #73, Shaanthi Nivaasa 2007 India 2007-06-15
1 #Beings 2015 Romania 2015-01-29
2 #Declimax 2018 Netherlands 2018-01-21
3 #Ewankosau saranghaeyo 2015 Philippines 2015-01-21
4 #Horror 2015 USA 2015-11-20
casts.head()
title year name type character n
0 Closet Monster 2015 Buffy # actor Buffy 4 31.0
1 Suuri illusioni 1985 Homo $ actor Guests 22.0
2 Battle of the Sexes 2017 $hutter actor Bobby Riggs Fan 10.0
3 Secret in Their Eyes 2015 $hutter actor 2002 Dodger Fan NaN
4 Steve Jobs 2015 $hutter actor 1988 Opera House Patron NaN
Let’s we want to see the release date of the movie ‘Amelia’. For this first, filter out the Amelia from the
DataFrame ‘cast’ as below. There are only two entries for the movie Amelia.
c_amelia = casts[ casts['title'] == 'Amelia']
c_amelia.head()
title year name type character n
5767 Amelia 2009 Aaron Abrams actor Slim Gordon 8.0
23319 Amelia 2009 Jeremy Akerman actor Sheriff 19.0
Next, we will see the entries of movie ‘Amelia’ in release dates as below. In the below result, we can see that there are two different release years for the movie i.e. 1966 and 2009.
release [ release['title'] == 'Amelia' ].head()
title year country date
20543 Amelia 1966 Mexico 1966-03-10
20544 Amelia 2009 Canada 2009-10-23
20545 Amelia 2009 USA 2009-10-23
20546 Amelia 2009 Australia 2009-11-12
20547 Amelia 2009 Singapore 2009-11-12
Since there is not entry for Amelia-1966 in casts DataFrame, therefore merge command will not merge the Amelia-1966 release dates. In following results, we can see that only Amelia 2009 release dates are merges with casts DataFrame.
c_amelia.merge(release).head()
title year name type character n country date
0 Amelia 2009 Aaron Abrams actor Slim Gordon 8.0 Canada 2009-10-23
1 Amelia 2009 Aaron Abrams actor Slim Gordon 8.0 USA 2009-10-23
2 Amelia 2009 Aaron Abrams actor Slim Gordon 8.0 Australia 2009-11-12
3 Amelia 2009 Aaron Abrams actor Slim Gordon 8.0 Singapore 2009-11-12
4 Amelia 2009 Aaron Abrams actor Slim Gordon 8.0 Ireland 2009-11-13
Merge Table with Itself
In pandas, you can merge a table with itself, also known as a self-join, to combine rows based on common values in one or more columns. This can be useful when you want to compare or analyze relationships within a single DataFrame.
Suppose, we want see the list of co-actors in the movies. For this, we need to merge the table with itself based on the title and year, as shown below. In the below code, co-star for actor ‘Aaron Abrams’ are displayed.
First, filter out the results for ‘Aaron Abrams’:
c = casts[ casts['name']=='Aaron Abrams' ]
c.head(2)
title year name type character n
5765 #FromJennifer 2017 Aaron Abrams actor Ralph Sinclair NaN
5766 388 Arletta Avenue 2011 Aaron Abrams actor Alex 4.0
Next, to find the co-stars, merge the DataFrame with itself based on ‘title’ and ‘year’ i.e. for being a co-star, the name of the movie and the year must be same. Note that ‘casts’ is used inside the bracket instead of c.
c.merge(casts, on=['title', 'year']).head()
title year name_x type_x character_x n_x name_y type_y character_y n_y
0 #FromJennifer 2017 Aaron Abrams actor Ralph Sinclair NaN Aaron Abrams actor Ralph Sinclair NaN
1 #FromJennifer 2017 Aaron Abrams actor Ralph Sinclair NaN Christian Ackerman actor Simon NaN
2 388 Arletta Avenue 2011 Aaron Abrams actor Alex 4.0 Graham Abbey actor Officer #2 8.0
3 388 Arletta Avenue 2011 Aaron Abrams actor Alex 4.0 Aaron Abrams actor Alex 4.0
4 Amelia 2009 Aaron Abrams actor Slim Gordon 8.0 Aaron Abrams actor Slim Gordon 8.0
The problem with above joining is that it displays the ‘Aaron Abrams’ as his co-actor as well (see first row). This problem can be avoided as below.
c_costar = c.merge (casts, on=['title', 'year'])
c_costar = c_costar[c_costar['name_y'] != 'Aaron Abrams']
c_costar.head()
title year name_x type_x character_x n_x name_y type_y character_y n_y
1 #FromJennifer 2017 Aaron Abrams actor Ralph Sinclair NaN Christian Ackerman actor Simon NaN
2 388 Arletta Avenue 2011 Aaron Abrams actor Alex 4.0 Graham Abbey actor Officer #2 8.0
5 Amelia 2009 Aaron Abrams actor Slim Gordon 8.0 Jeremy Akerman actor Sheriff 19.0
8 Cinderella Man 2005 Aaron Abrams actor 1928 Fan 67.0 Nick Alachiotis actor Baer Cornerman 38.0
9 Cinderella Man 2005 Aaron Abrams actor 1928 Fan 67.0 Nick Alachiotis actor Undercard Boxer - Feldman 38.0
Conclusion
By performing a self-join, you can gain insights into hierarchical relationships, compare values within the same DataFrame, or analyze patterns and dependencies within the data. It allows you to leverage the power of joins to explore and understand the relationships between rows within a single DataFrame.