Cfrgtkky

Question

How to perform a (LEFT|RIGHT|FULL) (INNER|OUTER) join with pandas?

How do I add NaNs for missing rows after merge?

How do I get rid of NaNs after merging?

Can I merge on the index?

How do I merge multiple DataFrames?

merge? join? concat? update? Who? What? Why?!

... and more. I've seen these recurring questions asking about various facets of the pandas merge functionality. Most of the information regarding merge and its various use cases today is fragmented across dozens of badly worded, unsearchable posts. The aim here is to collate some of the more important points for posterity.

This QnA is meant to be the next instalment in a series of helpful user-guides on common pandas idioms (see this post on pivoting, and this post on concatenation, which I will be touching on, later).

Please note that this post is not meant to be a replacement for the documentation, so please read that as well! Some of the examples are taken from there.

Note regarding duplicate closures: Older, upvotes posts can be closed in favour of newer, more canonical, or up-to-date QnAs as per meta's policy; see case 1, case 2, case 3 and case 4 on MSE. — Dec 8 at 3:13
Information on (efficiently) computing the cartesian product of two (or more) DataFrames can be found at this link: Performant cartesian product (CROSS JOIN) of two pandas DataFrames — Dec 10 at 3:39

score 67 · Accepted Answer · 2018-12-12 07:21:25Z

This post aims to give readers a primer on SQL-flavoured merging with pandas, how to use it, and when not to use it.

In particular, here's what this post will go through:

The basics - types of joins (LEFT, RIGHT, OUTER, INNER)
- merging with different column names
- avoiding duplicate merge key column in output

Merging with index under different conditions
- effectively using your named index
- merge key as the index of one and column of another

Multiway merges on columns and indexes (unique and non-unique)

Notable alternatives to merge and join

What this post will not go through:

Performance-related discussions and timings (for now). Mostly notable mentions of better alternatives, wherever appropriate.

Handling suffixes, removing extra columns, renaming outputs, and other specific use cases. There are other (read: better) posts that deal with that, so figure it out!

Note

Most examples default to FULL INNER JOIN operations while demonstrating various features, unless otherwise specified.

Furthermore, all the DataFrames here can be copied and replicated so you can play with them. Also, see this post on how to read DataFrames from your clipboard.

Enough Talk, just show me how to use `merge`!

Setup

np.random.seed(0)

left = pd.DataFrame({'key': ['A', 'B', 'C', 'D'], 'value': np.random.randn(4)})    

right = pd.DataFrame({'key': ['B', 'D', 'E', 'F'], 'value': np.random.randn(4)})



left



  key     value

0   A  1.494079

1   B -0.205158

2   C  0.313068

3   D -0.854096



right



  key     value

0   B -2.552990

1   D  0.653619

2   E  0.864436

3   F -0.742165

For the sake of simplicity, the key column has the same name (for now).

To perform a FULL INNER JOIN, call pd.merge specifying the left DataFrame, the right DataFrame, and the join key.

pd.merge(left, right, on='key')



  key   value_x   value_y

0   B  0.400157  1.867558

1   D  2.240893 -0.977278

In more recent versions of pandas (v0.21 or so), merge is now a first order function, so you can call DataFrame.merge.

left.merge(right, on='key')

# Or, if you want to be explicit

# left.merge(right, on='key', how='inner')



  key   value_x   value_y

0   B  0.400157  1.867558

1   D  2.240893 -0.977278

A LEFT OUTER JOIN would be done by specifying how='left'.

left.merge(right, on='key', how='left')



  key   value_x   value_y

0   A  1.764052       NaN

1   B  0.400157  1.867558

2   C  0.978738       NaN

3   D  2.240893 -0.977278

Carefully note the placement of NaNs here. If you specify how='left', then only keys from left are used, and missing data from right is replaced by NaN.

And similarly, specify how='right'...

left.merge(right, on='key', how='right')



  key   value_x   value_y

0   B  0.400157  1.867558

1   D  2.240893 -0.977278

2   E       NaN  0.950088

3   F       NaN -0.151357

...for a RIGHT OUTER JOIN. Here, keys from right are used, and missing data from left is replaced by NaN.

Finally, specify how='outer'...

left.merge(right, on='key', how='outer')



  key   value_x   value_y

0   A  1.764052       NaN

1   B  0.400157  1.867558

2   C  0.978738       NaN

3   D  2.240893 -0.977278

4   E       NaN  0.950088

5   F       NaN -0.151357

...for a FULL OUTER JOIN. This uses the keys from both frames, and NaNs are inserted for missing rows in both.

The documentation summarises these various merges nicely:

enter image description here

Other JOINs - LEFT, RIGHT, ANTI-JOINs

If you need LEFT JOINs and RIGHT JOINs in two steps.

For LEFT JOIN, start by performing a LEFT OUTER JOIN and then filtering rows coming from left only,

(left.merge(right, on='key', how='left', indicator=True)

     .query('_merge == "left_only"')

     .drop('_merge', 1))



  key   value_x  value_y

0   A  1.764052      NaN

2   C  0.978738      NaN

Where,

left.merge(right, on='key', how='left', indicator=True)



  key   value_x   value_y     _merge

0   A  1.764052       NaN  left_only

1   B  0.400157  1.867558       both

2   C  0.978738       NaN  left_only

3   D  2.240893 -0.977278       both

And similarly, for a RIGHT JOIN,

(left.merge(right, on='key', how='right', indicator=True)

     .query('_merge == "right_only"')

     .drop('_merge', 1))



  key  value_x   value_y

2   E      NaN  0.950088

3   F      NaN -0.151357

Lastly, if you are required to do a merge that only retains keys from the left or right, but not both (IOW, performing an ANTI-JOIN), you can do this in similar fashion—

(left.merge(right, on='key', how='outer', indicator=True)

     .query('_merge != "both"')

     .drop('_merge', 1))



  key   value_x   value_y

0   A  1.764052       NaN

2   C  0.978738       NaN

4   E       NaN  0.950088

5   F       NaN -0.151357

Different names for key columns

If the key columns are named differently—for example, left has keyLeft, and right has keyRight instead of key—then you will have to specify left_on and right_on as arguments instead of on:

left2 = left.rename({'key':'keyLeft'}, axis=1)

right2 = right.rename({'key':'keyRight'}, axis=1)



left2



  keyLeft     value

0       A  0.706573

1       B  0.010500

2       C  1.785870

3       D  0.126912



right2



  keyRight     value

0        B  0.401989

1        D  1.883151

2        E -1.347759

3        F -1.270485

left2.merge(right2, left_on='keyLeft', right_on='keyRight', how='inner')



  keyLeft   value_x keyRight   value_y

0       B  0.010500        B  0.401989

1       D  0.126912        D  1.883151

Avoiding duplicate key column in output

When merging on keyLeft from left and keyRight from right, if you only want either of the keyLeft or keyRight (but not both) in the output, you can start by setting the index as a preliminary step.

left3 = left2.set_index('keyLeft')

left3.merge(right2, left_index=True, right_on='keyRight')



    value_x keyRight   value_y

0  0.010500        B  0.401989

1  0.126912        D  1.883151

Contrast this with the output of the command just before (thst is, the output of left2.merge(right2, left_on='keyLeft', right_on='keyRight', how='inner')), you'll notice keyLeft is missing. You can figure out what column to keep based on which frame's index is set as the key. This may matter when, say, performing some OUTER JOIN operation.

Merging only a single column from one of the DataFrames

For example, consider

right3 = right.assign(newcol=np.arange(len(right)))

right3

  key     value  newcol

0   B  1.867558       0

1   D -0.977278       1

2   E  0.950088       2

3   F -0.151357       3

If you are required to merge only "new_val" (without any of the other columns), you can usually just subset columns before merging:

left.merge(right3[['key', 'newcol']], on='key')



  key     value  newcol

0   B  0.400157       0

1   D  2.240893       1

If you're doing a LEFT OUTER JOIN, a more performant solution would involve map:

# left['newcol'] = left['key'].map(right3.set_index('key')['newcol']))

left.assign(newcol=left['key'].map(right3.set_index('key')['newcol']))



  key     value  newcol

0   A  1.764052     NaN

1   B  0.400157     0.0

2   C  0.978738     NaN

3   D  2.240893     1.0

As mentioned, this is similar to, but faster than

left.merge(right3[['key', 'newcol']], on='key', how='left')



  key     value  newcol

0   A  1.764052     NaN

1   B  0.400157     0.0

2   C  0.978738     NaN

3   D  2.240893     1.0

Merging on multiple columns

To join on more than one column, specify a list for on (or left_on and right_on, as appropriate).

left.merge(right, on=['key1', 'key2'] ...)

Or, in the event the names are different,

left.merge(right, left_on=['lkey1', 'lkey2'], right_on=['rkey1', 'rkey2'])

Other useful merge* operations and functions

Merging a DataFrame with Series on index: See this answer.

Besides merge, DataFrame.update and DataFrame.combine_first are also used in certain cases to update one DataFrame with another.

pd.merge_ordered is a useful function for ordered JOINs.

pd.merge_asof (read: merge_asOf) is useful for approximate joins.

This section only covers the very basics, and is designed to only whet your appetite. For more examples and cases, see the documentation on merge, join, and concat as well as the links to the function specs.

Index-based *-JOIN (+ index-column `merge`s)

Setup

left = pd.DataFrame({'value': np.random.randn(4)}, index=['A', 'B', 'C', 'D'])    

right = pd.DataFrame({'value': np.random.randn(4)}, index=['B', 'D', 'E', 'F'])

left.index.name = right.index.name = 'idxkey'



left

           value

idxkey          

A       2.269755

B      -1.454366

C       0.045759

D      -0.187184



right

           value

idxkey          

B       1.532779

D       1.469359

E       0.154947

F       0.378163

Typically, a merge on index would look like this:

left.merge(right, left_index=True, right_index=True)



         value_x   value_y

idxkey                    

B       0.410599  0.761038

D       1.454274  0.121675

Support for index names

If your index is named, then v0.23 users can also specify the level name to on (or left_on and right_on as necessary).

left.merge(right, on='idxkey')



         value_x   value_y

idxkey                    

B       0.410599  0.761038

D       1.454274  0.121675

Merge with index of one, columns of another

It is possible (and quite simple) to use the index of one, and the column of another, to perform a merge. For example,

left.merge(right, left_on='key1', right_index=True)

Or vice versa (right_on=... and left_index=True`).

right2 = right.reset_index().rename({'idxkey' : 'colkey'}, axis=1)

right2



  colkey     value

0      B  1.222445

1      D  0.208275

2      E  0.976639

3      F  0.356366



left.merge(right2, left_index=True, right_on='colkey')



    value_x colkey   value_y

0 -1.070753      B  1.222445

1 -0.403177      D  0.208275

In this special case, the index for left is named, so you can also use the index name with left_on, like this:

left.merge(right2, left_on='idxkey', right_on='colkey')



    value_x colkey   value_y

0 -1.070753      B  1.222445

1 -0.403177      D  0.208275

DataFrame.join

Besides these, there is another succinct option. You can use DataFrame.join which defaults to joins on the index. DataFrame.join does a LEFT OUTER JOIN by default, so how='inner' is necessary here.

left.join(right, how='inner', lsuffix='_x', rsuffix='_y')



         value_x   value_y

idxkey                    

B       0.410599  0.761038

D       1.454274  0.121675

Note that I needed to specify the lsuffix and rsuffix arguments since join would otherwise error out:

left.join(right)

ValueError: columns overlap but no suffix specified: Index(['value'], dtype='object')

Since the column names are the same. This would not be a problem if they were differently named.

left.rename(columns={'value':'leftvalue'}).join(right, how='inner')



        leftvalue     value

idxkey                     

B       -1.454366  1.532779

D       -0.187184  1.469359

pd.concat

Lastly, as an alternative for index-based joins, you can use pd.concat:

pd.concat([left, right], axis=1, sort=False, join='inner')



           value     value

idxkey                    

B      -1.980796  1.230291

D       0.156349  1.202380

Omit join='inner' if you need a FULL OUTER JOIN (the default):

pd.concat([left, right], axis=1, sort=False)



      value     value

A -0.887786       NaN

B -1.980796  1.230291

C -0.347912       NaN

D  0.156349  1.202380

E       NaN -0.387327

F       NaN -0.302303

For more information, see this canonical post on pd.concat by @piRSquared.

Generalizing: `merge`ing multiple DataFrames

Setup

A = pd.DataFrame({'key': ['A', 'B', 'C', 'D'], 'valueA': np.random.randn(4)})    

B = pd.DataFrame({'key': ['B', 'D', 'E', 'F'], 'valueB': np.random.randn(4)})

C = pd.DataFrame({'key': ['D', 'E', 'J', 'C'], 'valueC': np.ones(4)})

dfs = [A, B, C]

Oftentimes, the situation arises when multiple DataFrames are to be merged together. Naively, this can be done by chaining merge calls:

A.merge(B, on='key').merge(C, on='key')



  key    valueA    valueB  valueC

0   D  0.922207 -1.099401     1.0

However, this quickly gets out of hand for many DataFrames. Furthermore, it may be necessary to generalise for an unknown number of DataFrames. To do this, one often used simple trick is with functools.reduce, and you can use it to achieve a FULL INNER JOIN like so:

from functools import reduce

reduce(pd.merge, dfs)



  key    valueA    valueB  valueC

0   D  0.465662  1.488252     1.0

Note that every column besides the "key" column should be differently named for this to work out-of-box. Otherwise, you may need to use a lambda.

For a FULL OUTER JOIN, you can curry pd.merge using functools.partial:

from functools import partial

outer_merge = partial(pd.merge, how='outer')

reduce(outer_merge, dfs)



  key    valueA    valueB  valueC

0   A  0.056165       NaN     NaN

1   B -1.165150 -1.536244     NaN

2   C  0.900826       NaN     1.0

3   D  0.465662  1.488252     1.0

4   E       NaN  1.895889     1.0

5   F       NaN  1.178780     NaN

6   J       NaN       NaN     1.0

As you may have noticed, this is quite powerful—you can also use it to control column names during merge. Simply add more keyword arguments as needed:

partial(pd.merge, how='outer', left_index=True, right_on=...)

The alternative: pd.concat

If your column values are unique, then it makes sense to use pd.concat, this is faster than a two-at-a-time multi-way merge.

pd.concat([

    df.set_index('key') for df in dfs], axis=1, join='inner'

).reset_index()



  key    valueA    valueB  valueC

0   D  0.465662  1.488252     1.0

Multiway merge on unique indexes

If you are merging multiple DataFrames on unique indexes, you should once again prefer pd.concat for better performance.

# Note, the "key" column values are unique, so the index is unique.

A2 = A.set_index('key')

B2 = B.set_index('key')

C2 = C.set_index('key')

dfs2 = [A2, B2, C2]

pd.concat(dfs2, axis=1, sort=False, join='inner')



       valueA    valueB  valueC

key                            

D    0.922207 -1.099401     1.0

As always, omit join='inner' for a FULL OUTER JOIN.

Multiway merge on indexes with duplicates
concat is fast, but has its shortcomings. It cannot handle duplicates.

A3 = pd.DataFrame({'key': ['A', 'B', 'C', 'D', 'D'], 'valueA': np.random.randn(5)})

pd.concat([df.set_index('key') for df in [A3, B, C]], axis=1, join='inner')

ValueError: Shape of passed values is (3, 4), indices imply (3, 2)

In this situation, join is the best option, since it can handle non-unique indexes (join calls merge under the hood).

# For inner join. For left join, pass `pd.DataFrame.join` directly to `reduce`. 

inner_join = partial(pd.DataFrame.join, how='inner')

reduce(inner_join, [A3.set_index('key'), B2, C2])



       valueA    valueB  valueC

key                            

D   -0.674333 -1.099401     1.0

D    0.031831 -1.099401     1.0

This should be a canonical post as these kinds of questions keep popping up and sometimes never marked as duplicate +1 — Dec 7 at 9:54
@jpp As an alternative to separate canonical posts, another option (that I've seen done before) would be to split each topic into separate answers, and have a table of contents in the question. Would that make more sense? — Dec 8 at 3:24
@AlexRiley The thought has been niggling at my mind over the past few days, and I have been giving serious thought into splitting this behemoth into three smaller canonicals - one on merging basics, another on index based merge, and the third on computing the cartesian product. I will be more than happy to create new canonicals and re-mark duplicates as appropriate. — Dec 9 at 16:26
Looks great! Thanks for taking the feedback on board and acting on it so quickly - the new post is (again) a great addition to SO. It's tricky to know how to best split the topic of merging into useful/usable posts, but I think your suggestion of basics/index/Cartesian-product is the right way to go. — Dec 10 at 11:31
@S.V I don't think align is appropriate for this post, but I can explain how it works and hopefully you'll understand why. align is used to make the indexes of two DataFrames compatible with each other by inserting NaNs for missing keys. It's almost like both frames are reindexed on the union of the indices. The return value of left.align(right) is (left_aligned, right_aligned). Hope that makes sense. — 2 days ago

score 67 · Accepted Answer · 2018-12-12 07:21:25Z

This post aims to give readers a primer on SQL-flavoured merging with pandas, how to use it, and when not to use it.

In particular, here's what this post will go through:

The basics - types of joins (LEFT, RIGHT, OUTER, INNER)
- merging with different column names
- avoiding duplicate merge key column in output

Merging with index under different conditions
- effectively using your named index
- merge key as the index of one and column of another

Multiway merges on columns and indexes (unique and non-unique)

Notable alternatives to merge and join

What this post will not go through:

Performance-related discussions and timings (for now). Mostly notable mentions of better alternatives, wherever appropriate.

Handling suffixes, removing extra columns, renaming outputs, and other specific use cases. There are other (read: better) posts that deal with that, so figure it out!

Note

Most examples default to FULL INNER JOIN operations while demonstrating various features, unless otherwise specified.

Furthermore, all the DataFrames here can be copied and replicated so you can play with them. Also, see this post on how to read DataFrames from your clipboard.

Enough Talk, just show me how to use `merge`!

Setup

np.random.seed(0)

left = pd.DataFrame({'key': ['A', 'B', 'C', 'D'], 'value': np.random.randn(4)})    

right = pd.DataFrame({'key': ['B', 'D', 'E', 'F'], 'value': np.random.randn(4)})



left



  key     value

0   A  1.494079

1   B -0.205158

2   C  0.313068

3   D -0.854096



right



  key     value

0   B -2.552990

1   D  0.653619

2   E  0.864436

3   F -0.742165

For the sake of simplicity, the key column has the same name (for now).

To perform a FULL INNER JOIN, call pd.merge specifying the left DataFrame, the right DataFrame, and the join key.

pd.merge(left, right, on='key')



  key   value_x   value_y

0   B  0.400157  1.867558

1   D  2.240893 -0.977278

In more recent versions of pandas (v0.21 or so), merge is now a first order function, so you can call DataFrame.merge.

left.merge(right, on='key')

# Or, if you want to be explicit

# left.merge(right, on='key', how='inner')



  key   value_x   value_y

0   B  0.400157  1.867558

1   D  2.240893 -0.977278

A LEFT OUTER JOIN would be done by specifying how='left'.

left.merge(right, on='key', how='left')



  key   value_x   value_y

0   A  1.764052       NaN

1   B  0.400157  1.867558

2   C  0.978738       NaN

3   D  2.240893 -0.977278

Carefully note the placement of NaNs here. If you specify how='left', then only keys from left are used, and missing data from right is replaced by NaN.

And similarly, specify how='right'...

left.merge(right, on='key', how='right')



  key   value_x   value_y

0   B  0.400157  1.867558

1   D  2.240893 -0.977278

2   E       NaN  0.950088

3   F       NaN -0.151357

...for a RIGHT OUTER JOIN. Here, keys from right are used, and missing data from left is replaced by NaN.

Finally, specify how='outer'...

left.merge(right, on='key', how='outer')



  key   value_x   value_y

0   A  1.764052       NaN

1   B  0.400157  1.867558

2   C  0.978738       NaN

3   D  2.240893 -0.977278

4   E       NaN  0.950088

5   F       NaN -0.151357

...for a FULL OUTER JOIN. This uses the keys from both frames, and NaNs are inserted for missing rows in both.

The documentation summarises these various merges nicely:

enter image description here

Other JOINs - LEFT, RIGHT, ANTI-JOINs

If you need LEFT JOINs and RIGHT JOINs in two steps.

For LEFT JOIN, start by performing a LEFT OUTER JOIN and then filtering rows coming from left only,

(left.merge(right, on='key', how='left', indicator=True)

     .query('_merge == "left_only"')

     .drop('_merge', 1))



  key   value_x  value_y

0   A  1.764052      NaN

2   C  0.978738      NaN

Where,

left.merge(right, on='key', how='left', indicator=True)



  key   value_x   value_y     _merge

0   A  1.764052       NaN  left_only

1   B  0.400157  1.867558       both

2   C  0.978738       NaN  left_only

3   D  2.240893 -0.977278       both

And similarly, for a RIGHT JOIN,

(left.merge(right, on='key', how='right', indicator=True)

     .query('_merge == "right_only"')

     .drop('_merge', 1))



  key  value_x   value_y

2   E      NaN  0.950088

3   F      NaN -0.151357

Lastly, if you are required to do a merge that only retains keys from the left or right, but not both (IOW, performing an ANTI-JOIN), you can do this in similar fashion—

(left.merge(right, on='key', how='outer', indicator=True)

     .query('_merge != "both"')

     .drop('_merge', 1))



  key   value_x   value_y

0   A  1.764052       NaN

2   C  0.978738       NaN

4   E       NaN  0.950088

5   F       NaN -0.151357

Different names for key columns

If the key columns are named differently—for example, left has keyLeft, and right has keyRight instead of key—then you will have to specify left_on and right_on as arguments instead of on:

left2 = left.rename({'key':'keyLeft'}, axis=1)

right2 = right.rename({'key':'keyRight'}, axis=1)



left2



  keyLeft     value

0       A  0.706573

1       B  0.010500

2       C  1.785870

3       D  0.126912



right2



  keyRight     value

0        B  0.401989

1        D  1.883151

2        E -1.347759

3        F -1.270485

left2.merge(right2, left_on='keyLeft', right_on='keyRight', how='inner')



  keyLeft   value_x keyRight   value_y

0       B  0.010500        B  0.401989

1       D  0.126912        D  1.883151

Avoiding duplicate key column in output

When merging on keyLeft from left and keyRight from right, if you only want either of the keyLeft or keyRight (but not both) in the output, you can start by setting the index as a preliminary step.

left3 = left2.set_index('keyLeft')

left3.merge(right2, left_index=True, right_on='keyRight')



    value_x keyRight   value_y

0  0.010500        B  0.401989

1  0.126912        D  1.883151

Contrast this with the output of the command just before (thst is, the output of left2.merge(right2, left_on='keyLeft', right_on='keyRight', how='inner')), you'll notice keyLeft is missing. You can figure out what column to keep based on which frame's index is set as the key. This may matter when, say, performing some OUTER JOIN operation.

Merging only a single column from one of the DataFrames

For example, consider

right3 = right.assign(newcol=np.arange(len(right)))

right3

  key     value  newcol

0   B  1.867558       0

1   D -0.977278       1

2   E  0.950088       2

3   F -0.151357       3

If you are required to merge only "new_val" (without any of the other columns), you can usually just subset columns before merging:

left.merge(right3[['key', 'newcol']], on='key')



  key     value  newcol

0   B  0.400157       0

1   D  2.240893       1

If you're doing a LEFT OUTER JOIN, a more performant solution would involve map:

# left['newcol'] = left['key'].map(right3.set_index('key')['newcol']))

left.assign(newcol=left['key'].map(right3.set_index('key')['newcol']))



  key     value  newcol

0   A  1.764052     NaN

1   B  0.400157     0.0

2   C  0.978738     NaN

3   D  2.240893     1.0

As mentioned, this is similar to, but faster than

left.merge(right3[['key', 'newcol']], on='key', how='left')



  key     value  newcol

0   A  1.764052     NaN

1   B  0.400157     0.0

2   C  0.978738     NaN

3   D  2.240893     1.0

Merging on multiple columns

To join on more than one column, specify a list for on (or left_on and right_on, as appropriate).

left.merge(right, on=['key1', 'key2'] ...)

Or, in the event the names are different,

left.merge(right, left_on=['lkey1', 'lkey2'], right_on=['rkey1', 'rkey2'])

Other useful merge* operations and functions

Merging a DataFrame with Series on index: See this answer.

Besides merge, DataFrame.update and DataFrame.combine_first are also used in certain cases to update one DataFrame with another.

pd.merge_ordered is a useful function for ordered JOINs.

pd.merge_asof (read: merge_asOf) is useful for approximate joins.

This section only covers the very basics, and is designed to only whet your appetite. For more examples and cases, see the documentation on merge, join, and concat as well as the links to the function specs.

Index-based *-JOIN (+ index-column `merge`s)

Setup

left = pd.DataFrame({'value': np.random.randn(4)}, index=['A', 'B', 'C', 'D'])    

right = pd.DataFrame({'value': np.random.randn(4)}, index=['B', 'D', 'E', 'F'])

left.index.name = right.index.name = 'idxkey'



left

           value

idxkey          

A       2.269755

B      -1.454366

C       0.045759

D      -0.187184



right

           value

idxkey          

B       1.532779

D       1.469359

E       0.154947

F       0.378163

Typically, a merge on index would look like this:

left.merge(right, left_index=True, right_index=True)



         value_x   value_y

idxkey                    

B       0.410599  0.761038

D       1.454274  0.121675

Support for index names

If your index is named, then v0.23 users can also specify the level name to on (or left_on and right_on as necessary).

left.merge(right, on='idxkey')



         value_x   value_y

idxkey                    

B       0.410599  0.761038

D       1.454274  0.121675

Merge with index of one, columns of another

It is possible (and quite simple) to use the index of one, and the column of another, to perform a merge. For example,

left.merge(right, left_on='key1', right_index=True)

Or vice versa (right_on=... and left_index=True`).

right2 = right.reset_index().rename({'idxkey' : 'colkey'}, axis=1)

right2



  colkey     value

0      B  1.222445

1      D  0.208275

2      E  0.976639

3      F  0.356366



left.merge(right2, left_index=True, right_on='colkey')



    value_x colkey   value_y

0 -1.070753      B  1.222445

1 -0.403177      D  0.208275

In this special case, the index for left is named, so you can also use the index name with left_on, like this:

left.merge(right2, left_on='idxkey', right_on='colkey')



    value_x colkey   value_y

0 -1.070753      B  1.222445

1 -0.403177      D  0.208275

DataFrame.join

Besides these, there is another succinct option. You can use DataFrame.join which defaults to joins on the index. DataFrame.join does a LEFT OUTER JOIN by default, so how='inner' is necessary here.

left.join(right, how='inner', lsuffix='_x', rsuffix='_y')



         value_x   value_y

idxkey                    

B       0.410599  0.761038

D       1.454274  0.121675

Note that I needed to specify the lsuffix and rsuffix arguments since join would otherwise error out:

left.join(right)

ValueError: columns overlap but no suffix specified: Index(['value'], dtype='object')

Since the column names are the same. This would not be a problem if they were differently named.

left.rename(columns={'value':'leftvalue'}).join(right, how='inner')



        leftvalue     value

idxkey                     

B       -1.454366  1.532779

D       -0.187184  1.469359

pd.concat

Lastly, as an alternative for index-based joins, you can use pd.concat:

pd.concat([left, right], axis=1, sort=False, join='inner')



           value     value

idxkey                    

B      -1.980796  1.230291

D       0.156349  1.202380

Omit join='inner' if you need a FULL OUTER JOIN (the default):

pd.concat([left, right], axis=1, sort=False)



      value     value

A -0.887786       NaN

B -1.980796  1.230291

C -0.347912       NaN

D  0.156349  1.202380

E       NaN -0.387327

F       NaN -0.302303

For more information, see this canonical post on pd.concat by @piRSquared.

Generalizing: `merge`ing multiple DataFrames

Setup

A = pd.DataFrame({'key': ['A', 'B', 'C', 'D'], 'valueA': np.random.randn(4)})    

B = pd.DataFrame({'key': ['B', 'D', 'E', 'F'], 'valueB': np.random.randn(4)})

C = pd.DataFrame({'key': ['D', 'E', 'J', 'C'], 'valueC': np.ones(4)})

dfs = [A, B, C]

Oftentimes, the situation arises when multiple DataFrames are to be merged together. Naively, this can be done by chaining merge calls:

A.merge(B, on='key').merge(C, on='key')



  key    valueA    valueB  valueC

0   D  0.922207 -1.099401     1.0

However, this quickly gets out of hand for many DataFrames. Furthermore, it may be necessary to generalise for an unknown number of DataFrames. To do this, one often used simple trick is with functools.reduce, and you can use it to achieve a FULL INNER JOIN like so:

from functools import reduce

reduce(pd.merge, dfs)



  key    valueA    valueB  valueC

0   D  0.465662  1.488252     1.0

Note that every column besides the "key" column should be differently named for this to work out-of-box. Otherwise, you may need to use a lambda.

For a FULL OUTER JOIN, you can curry pd.merge using functools.partial:

from functools import partial

outer_merge = partial(pd.merge, how='outer')

reduce(outer_merge, dfs)



  key    valueA    valueB  valueC

0   A  0.056165       NaN     NaN

1   B -1.165150 -1.536244     NaN

2   C  0.900826       NaN     1.0

3   D  0.465662  1.488252     1.0

4   E       NaN  1.895889     1.0

5   F       NaN  1.178780     NaN

6   J       NaN       NaN     1.0

As you may have noticed, this is quite powerful—you can also use it to control column names during merge. Simply add more keyword arguments as needed:

partial(pd.merge, how='outer', left_index=True, right_on=...)

The alternative: pd.concat

If your column values are unique, then it makes sense to use pd.concat, this is faster than a two-at-a-time multi-way merge.

pd.concat([

    df.set_index('key') for df in dfs], axis=1, join='inner'

).reset_index()



  key    valueA    valueB  valueC

0   D  0.465662  1.488252     1.0

Multiway merge on unique indexes

If you are merging multiple DataFrames on unique indexes, you should once again prefer pd.concat for better performance.

# Note, the "key" column values are unique, so the index is unique.

A2 = A.set_index('key')

B2 = B.set_index('key')

C2 = C.set_index('key')

dfs2 = [A2, B2, C2]

pd.concat(dfs2, axis=1, sort=False, join='inner')



       valueA    valueB  valueC

key                            

D    0.922207 -1.099401     1.0

As always, omit join='inner' for a FULL OUTER JOIN.

Multiway merge on indexes with duplicates
concat is fast, but has its shortcomings. It cannot handle duplicates.

A3 = pd.DataFrame({'key': ['A', 'B', 'C', 'D', 'D'], 'valueA': np.random.randn(5)})

pd.concat([df.set_index('key') for df in [A3, B, C]], axis=1, join='inner')

ValueError: Shape of passed values is (3, 4), indices imply (3, 2)

In this situation, join is the best option, since it can handle non-unique indexes (join calls merge under the hood).

# For inner join. For left join, pass `pd.DataFrame.join` directly to `reduce`. 

inner_join = partial(pd.DataFrame.join, how='inner')

reduce(inner_join, [A3.set_index('key'), B2, C2])



       valueA    valueB  valueC

key                            

D   -0.674333 -1.099401     1.0

D    0.031831 -1.099401     1.0

This should be a canonical post as these kinds of questions keep popping up and sometimes never marked as duplicate +1 — Dec 7 at 9:54
@jpp As an alternative to separate canonical posts, another option (that I've seen done before) would be to split each topic into separate answers, and have a table of contents in the question. Would that make more sense? — Dec 8 at 3:24
@AlexRiley The thought has been niggling at my mind over the past few days, and I have been giving serious thought into splitting this behemoth into three smaller canonicals - one on merging basics, another on index based merge, and the third on computing the cartesian product. I will be more than happy to create new canonicals and re-mark duplicates as appropriate. — Dec 9 at 16:26
Looks great! Thanks for taking the feedback on board and acting on it so quickly - the new post is (again) a great addition to SO. It's tricky to know how to best split the topic of merging into useful/usable posts, but I think your suggestion of basics/index/Cartesian-product is the right way to go. — Dec 10 at 11:31
@S.V I don't think align is appropriate for this post, but I can explain how it works and hopefully you'll understand why. align is used to make the indexes of two DataFrames compatible with each other by inserting NaNs for missing keys. It's almost like both frames are reindexed on the union of the indices. The return value of left.align(right) is (left_aligned, right_aligned). Hope that makes sense. — 2 days ago

搜尋此網誌

Cfrgtkky

Pandas Merging 101

This question has an open bounty worth +500
reputation from piRSquared ending in 3 days.

This question has an open bounty worth +500
reputation from piRSquared ending in 3 days.

This question has an open bounty worth +500
reputation from piRSquared ending in 3 days.

This question has an open bounty worth +500
reputation from piRSquared ending in 3 days.

1 Answer
1

Enough Talk, just show me how to use `merge`!

Index-based *-JOIN (+ index-column `merge`s)

Generalizing: `merge`ing multiple DataFrames

Your Answer

Post as a guest

1 Answer
1

1 Answer
1

Enough Talk, just show me how to use `merge`!

Index-based *-JOIN (+ index-column `merge`s)

Generalizing: `merge`ing multiple DataFrames

Enough Talk, just show me how to use `merge`!

Index-based *-JOIN (+ index-column `merge`s)

Generalizing: `merge`ing multiple DataFrames

Enough Talk, just show me how to use `merge`!

Index-based *-JOIN (+ index-column `merge`s)

Generalizing: `merge`ing multiple DataFrames

Enough Talk, just show me how to use `merge`!

Index-based *-JOIN (+ index-column `merge`s)

Generalizing: `merge`ing multiple DataFrames

Post as a guest

Popular posts from this blog

mysqli_query(): Empty query in /home/lucindabrummitt/public_html/blog/wp-includes/wp-db.php on line 1924

Multiple Hosts connection in Hyperledger Fabric

How to send String Array data to Server using php in android

Pandas Merging 101

This question has an open bounty worth +500 reputation from piRSquared ending in 3 days.

This question has an open bounty worth +500 reputation from piRSquared ending in 3 days.

This question has an open bounty worth +500 reputation from piRSquared ending in 3 days.

This question has an open bounty worth +500 reputation from piRSquared ending in 3 days.

1 Answer 1

Enough Talk, just show me how to use merge!

Index-based *-JOIN (+ index-column merges)

Generalizing: mergeing multiple DataFrames

Your Answer

Sign up or log in

Post as a guest

Post as a guest

1 Answer 1

1 Answer 1

Enough Talk, just show me how to use merge!

Index-based *-JOIN (+ index-column merges)

Generalizing: mergeing multiple DataFrames

Enough Talk, just show me how to use merge!

Index-based *-JOIN (+ index-column merges)

Generalizing: mergeing multiple DataFrames

Enough Talk, just show me how to use merge!

Index-based *-JOIN (+ index-column merges)

Generalizing: mergeing multiple DataFrames

Enough Talk, just show me how to use merge!

Index-based *-JOIN (+ index-column merges)

Generalizing: mergeing multiple DataFrames

Sign up or log in

Post as a guest

Post as a guest

Sign up or log in

Post as a guest

Sign up or log in

Post as a guest

Sign up or log in

Post as a guest

Popular posts from this blog

mysqli_query(): Empty query in /home/lucindabrummitt/public_html/blog/wp-includes/wp-db.php on line 1924

Multiple Hosts connection in Hyperledger Fabric

How to send String Array data to Server using php in android

This question has an open bounty worth +500
reputation from piRSquared ending in 3 days.

This question has an open bounty worth +500
reputation from piRSquared ending in 3 days.

This question has an open bounty worth +500
reputation from piRSquared ending in 3 days.

This question has an open bounty worth +500
reputation from piRSquared ending in 3 days.

1 Answer
1

Enough Talk, just show me how to use `merge`!

Index-based *-JOIN (+ index-column `merge`s)

Generalizing: `merge`ing multiple DataFrames

1 Answer
1

1 Answer
1

Enough Talk, just show me how to use `merge`!

Index-based *-JOIN (+ index-column `merge`s)

Generalizing: `merge`ing multiple DataFrames

Enough Talk, just show me how to use `merge`!

Index-based *-JOIN (+ index-column `merge`s)

Generalizing: `merge`ing multiple DataFrames

Enough Talk, just show me how to use `merge`!

Index-based *-JOIN (+ index-column `merge`s)

Generalizing: `merge`ing multiple DataFrames

Enough Talk, just show me how to use `merge`!

Index-based *-JOIN (+ index-column `merge`s)

Generalizing: `merge`ing multiple DataFrames