Pandas Merging 101
up vote
59
down vote
favorite
- How to perform a (
LEFT
|RIGHT
|FULL
) (INNER
|OUTER
) join with pandas? - How do I add NaNs for missing rows after merge?
- How do I get rid of NaNs after merging?
- Can I merge on the index?
- How do I merge multiple DataFrames?
merge
?join
?concat
?update
? Who? What? Why?!
... and more. I've seen these recurring questions asking about various facets of the pandas merge functionality. Most of the information regarding merge and its various use cases today is fragmented across dozens of badly worded, unsearchable posts. The aim here is to collate some of the more important points for posterity.
This QnA is meant to be the next instalment in a series of helpful user-guides on common pandas idioms (see this post on pivoting, and this post on concatenation, which I will be touching on, later).
Please note that this post is not meant to be a replacement for the documentation, so please read that as well! Some of the examples are taken from there.
python pandas join merge
This question has an open bounty worth +500
reputation from piRSquared ending in 3 days.
One or more of the answers is exemplary and worthy of an additional bounty.
This Q&A provides a needed reference for the many disparate questions regarding merging and joining. It is well written and comprehensive.
add a comment |
up vote
59
down vote
favorite
- How to perform a (
LEFT
|RIGHT
|FULL
) (INNER
|OUTER
) join with pandas? - How do I add NaNs for missing rows after merge?
- How do I get rid of NaNs after merging?
- Can I merge on the index?
- How do I merge multiple DataFrames?
merge
?join
?concat
?update
? Who? What? Why?!
... and more. I've seen these recurring questions asking about various facets of the pandas merge functionality. Most of the information regarding merge and its various use cases today is fragmented across dozens of badly worded, unsearchable posts. The aim here is to collate some of the more important points for posterity.
This QnA is meant to be the next instalment in a series of helpful user-guides on common pandas idioms (see this post on pivoting, and this post on concatenation, which I will be touching on, later).
Please note that this post is not meant to be a replacement for the documentation, so please read that as well! Some of the examples are taken from there.
python pandas join merge
This question has an open bounty worth +500
reputation from piRSquared ending in 3 days.
One or more of the answers is exemplary and worthy of an additional bounty.
This Q&A provides a needed reference for the many disparate questions regarding merging and joining. It is well written and comprehensive.
1
Note regarding duplicate closures: Older, upvotes posts can be closed in favour of newer, more canonical, or up-to-date QnAs as per meta's policy; see case 1, case 2, case 3 and case 4 on MSE.
– coldspeed
Dec 8 at 3:13
Information on (efficiently) computing the cartesian product of two (or more) DataFrames can be found at this link: Performant cartesian product (CROSS JOIN) of two pandas DataFrames
– coldspeed
Dec 10 at 3:39
add a comment |
up vote
59
down vote
favorite
up vote
59
down vote
favorite
- How to perform a (
LEFT
|RIGHT
|FULL
) (INNER
|OUTER
) join with pandas? - How do I add NaNs for missing rows after merge?
- How do I get rid of NaNs after merging?
- Can I merge on the index?
- How do I merge multiple DataFrames?
merge
?join
?concat
?update
? Who? What? Why?!
... and more. I've seen these recurring questions asking about various facets of the pandas merge functionality. Most of the information regarding merge and its various use cases today is fragmented across dozens of badly worded, unsearchable posts. The aim here is to collate some of the more important points for posterity.
This QnA is meant to be the next instalment in a series of helpful user-guides on common pandas idioms (see this post on pivoting, and this post on concatenation, which I will be touching on, later).
Please note that this post is not meant to be a replacement for the documentation, so please read that as well! Some of the examples are taken from there.
python pandas join merge
- How to perform a (
LEFT
|RIGHT
|FULL
) (INNER
|OUTER
) join with pandas? - How do I add NaNs for missing rows after merge?
- How do I get rid of NaNs after merging?
- Can I merge on the index?
- How do I merge multiple DataFrames?
merge
?join
?concat
?update
? Who? What? Why?!
... and more. I've seen these recurring questions asking about various facets of the pandas merge functionality. Most of the information regarding merge and its various use cases today is fragmented across dozens of badly worded, unsearchable posts. The aim here is to collate some of the more important points for posterity.
This QnA is meant to be the next instalment in a series of helpful user-guides on common pandas idioms (see this post on pivoting, and this post on concatenation, which I will be touching on, later).
Please note that this post is not meant to be a replacement for the documentation, so please read that as well! Some of the examples are taken from there.
python pandas join merge
python pandas join merge
edited Dec 10 at 3:12
asked Dec 6 at 6:41
coldspeed
114k18104182
114k18104182
This question has an open bounty worth +500
reputation from piRSquared ending in 3 days.
One or more of the answers is exemplary and worthy of an additional bounty.
This Q&A provides a needed reference for the many disparate questions regarding merging and joining. It is well written and comprehensive.
This question has an open bounty worth +500
reputation from piRSquared ending in 3 days.
One or more of the answers is exemplary and worthy of an additional bounty.
This Q&A provides a needed reference for the many disparate questions regarding merging and joining. It is well written and comprehensive.
1
Note regarding duplicate closures: Older, upvotes posts can be closed in favour of newer, more canonical, or up-to-date QnAs as per meta's policy; see case 1, case 2, case 3 and case 4 on MSE.
– coldspeed
Dec 8 at 3:13
Information on (efficiently) computing the cartesian product of two (or more) DataFrames can be found at this link: Performant cartesian product (CROSS JOIN) of two pandas DataFrames
– coldspeed
Dec 10 at 3:39
add a comment |
1
Note regarding duplicate closures: Older, upvotes posts can be closed in favour of newer, more canonical, or up-to-date QnAs as per meta's policy; see case 1, case 2, case 3 and case 4 on MSE.
– coldspeed
Dec 8 at 3:13
Information on (efficiently) computing the cartesian product of two (or more) DataFrames can be found at this link: Performant cartesian product (CROSS JOIN) of two pandas DataFrames
– coldspeed
Dec 10 at 3:39
1
1
Note regarding duplicate closures: Older, upvotes posts can be closed in favour of newer, more canonical, or up-to-date QnAs as per meta's policy; see case 1, case 2, case 3 and case 4 on MSE.
– coldspeed
Dec 8 at 3:13
Note regarding duplicate closures: Older, upvotes posts can be closed in favour of newer, more canonical, or up-to-date QnAs as per meta's policy; see case 1, case 2, case 3 and case 4 on MSE.
– coldspeed
Dec 8 at 3:13
Information on (efficiently) computing the cartesian product of two (or more) DataFrames can be found at this link: Performant cartesian product (CROSS JOIN) of two pandas DataFrames
– coldspeed
Dec 10 at 3:39
Information on (efficiently) computing the cartesian product of two (or more) DataFrames can be found at this link: Performant cartesian product (CROSS JOIN) of two pandas DataFrames
– coldspeed
Dec 10 at 3:39
add a comment |
1 Answer
1
active
oldest
votes
up vote
67
down vote
accepted
This post aims to give readers a primer on SQL-flavoured merging with pandas, how to use it, and when not to use it.
In particular, here's what this post will go through:
The basics - types of joins (LEFT, RIGHT, OUTER, INNER)
- merging with different column names
- avoiding duplicate merge key column in output
- Merging with index under different conditions
- effectively using your named index
- merge key as the index of one and column of another
- Multiway merges on columns and indexes (unique and non-unique)
- Notable alternatives to
merge
andjoin
What this post will not go through:
- Performance-related discussions and timings (for now). Mostly notable mentions of better alternatives, wherever appropriate.
- Handling suffixes, removing extra columns, renaming outputs, and other specific use cases. There are other (read: better) posts that deal with that, so figure it out!
Note
Most examples default to FULL INNER JOIN operations while demonstrating various features, unless otherwise specified.
Furthermore, all the DataFrames here can be copied and replicated so you can play with them. Also, see this post on how to read DataFrames from your clipboard.
Enough Talk, just show me how to use merge
!
Setup
np.random.seed(0)
left = pd.DataFrame({'key': ['A', 'B', 'C', 'D'], 'value': np.random.randn(4)})
right = pd.DataFrame({'key': ['B', 'D', 'E', 'F'], 'value': np.random.randn(4)})
left
key value
0 A 1.494079
1 B -0.205158
2 C 0.313068
3 D -0.854096
right
key value
0 B -2.552990
1 D 0.653619
2 E 0.864436
3 F -0.742165
For the sake of simplicity, the key column has the same name (for now).
To perform a FULL INNER JOIN, call pd.merge
specifying the left DataFrame, the right DataFrame, and the join key.
pd.merge(left, right, on='key')
key value_x value_y
0 B 0.400157 1.867558
1 D 2.240893 -0.977278
In more recent versions of pandas (v0.21 or so), merge
is now a first order function, so you can call DataFrame.merge
.
left.merge(right, on='key')
# Or, if you want to be explicit
# left.merge(right, on='key', how='inner')
key value_x value_y
0 B 0.400157 1.867558
1 D 2.240893 -0.977278
A LEFT OUTER JOIN would be done by specifying how='left'
.
left.merge(right, on='key', how='left')
key value_x value_y
0 A 1.764052 NaN
1 B 0.400157 1.867558
2 C 0.978738 NaN
3 D 2.240893 -0.977278
Carefully note the placement of NaNs here. If you specify how='left'
, then only keys from left
are used, and missing data from right
is replaced by NaN.
And similarly, specify how='right'
...
left.merge(right, on='key', how='right')
key value_x value_y
0 B 0.400157 1.867558
1 D 2.240893 -0.977278
2 E NaN 0.950088
3 F NaN -0.151357
...for a RIGHT OUTER JOIN. Here, keys from right
are used, and missing data from left
is replaced by NaN.
Finally, specify how='outer'
...
left.merge(right, on='key', how='outer')
key value_x value_y
0 A 1.764052 NaN
1 B 0.400157 1.867558
2 C 0.978738 NaN
3 D 2.240893 -0.977278
4 E NaN 0.950088
5 F NaN -0.151357
...for a FULL OUTER JOIN. This uses the keys from both frames, and NaNs are inserted for missing rows in both.
The documentation summarises these various merges nicely:
Other JOINs - LEFT, RIGHT, ANTI-JOINs
If you need LEFT JOINs and RIGHT JOINs in two steps.
For LEFT JOIN, start by performing a LEFT OUTER JOIN and then filtering rows coming from left
only,
(left.merge(right, on='key', how='left', indicator=True)
.query('_merge == "left_only"')
.drop('_merge', 1))
key value_x value_y
0 A 1.764052 NaN
2 C 0.978738 NaN
Where,
left.merge(right, on='key', how='left', indicator=True)
key value_x value_y _merge
0 A 1.764052 NaN left_only
1 B 0.400157 1.867558 both
2 C 0.978738 NaN left_only
3 D 2.240893 -0.977278 both
And similarly, for a RIGHT JOIN,
(left.merge(right, on='key', how='right', indicator=True)
.query('_merge == "right_only"')
.drop('_merge', 1))
key value_x value_y
2 E NaN 0.950088
3 F NaN -0.151357
Lastly, if you are required to do a merge that only retains keys from the left or right, but not both (IOW, performing an ANTI-JOIN), you can do this in similar fashion—
(left.merge(right, on='key', how='outer', indicator=True)
.query('_merge != "both"')
.drop('_merge', 1))
key value_x value_y
0 A 1.764052 NaN
2 C 0.978738 NaN
4 E NaN 0.950088
5 F NaN -0.151357
Different names for key columns
If the key columns are named differently—for example, left
has keyLeft
, and right
has keyRight
instead of key
—then you will have to specify left_on
and right_on
as arguments instead of on
:
left2 = left.rename({'key':'keyLeft'}, axis=1)
right2 = right.rename({'key':'keyRight'}, axis=1)
left2
keyLeft value
0 A 0.706573
1 B 0.010500
2 C 1.785870
3 D 0.126912
right2
keyRight value
0 B 0.401989
1 D 1.883151
2 E -1.347759
3 F -1.270485
left2.merge(right2, left_on='keyLeft', right_on='keyRight', how='inner')
keyLeft value_x keyRight value_y
0 B 0.010500 B 0.401989
1 D 0.126912 D 1.883151
Avoiding duplicate key column in output
When merging on keyLeft
from left
and keyRight
from right
, if you only want either of the keyLeft
or keyRight
(but not both) in the output, you can start by setting the index as a preliminary step.
left3 = left2.set_index('keyLeft')
left3.merge(right2, left_index=True, right_on='keyRight')
value_x keyRight value_y
0 0.010500 B 0.401989
1 0.126912 D 1.883151
Contrast this with the output of the command just before (thst is, the output of left2.merge(right2, left_on='keyLeft', right_on='keyRight', how='inner')
), you'll notice keyLeft
is missing. You can figure out what column to keep based on which frame's index is set as the key. This may matter when, say, performing some OUTER JOIN operation.
Merging only a single column from one of the DataFrames
For example, consider
right3 = right.assign(newcol=np.arange(len(right)))
right3
key value newcol
0 B 1.867558 0
1 D -0.977278 1
2 E 0.950088 2
3 F -0.151357 3
If you are required to merge only "new_val" (without any of the other columns), you can usually just subset columns before merging:
left.merge(right3[['key', 'newcol']], on='key')
key value newcol
0 B 0.400157 0
1 D 2.240893 1
If you're doing a LEFT OUTER JOIN, a more performant solution would involve map
:
# left['newcol'] = left['key'].map(right3.set_index('key')['newcol']))
left.assign(newcol=left['key'].map(right3.set_index('key')['newcol']))
key value newcol
0 A 1.764052 NaN
1 B 0.400157 0.0
2 C 0.978738 NaN
3 D 2.240893 1.0
As mentioned, this is similar to, but faster than
left.merge(right3[['key', 'newcol']], on='key', how='left')
key value newcol
0 A 1.764052 NaN
1 B 0.400157 0.0
2 C 0.978738 NaN
3 D 2.240893 1.0
Merging on multiple columns
To join on more than one column, specify a list for on
(or left_on
and right_on
, as appropriate).
left.merge(right, on=['key1', 'key2'] ...)
Or, in the event the names are different,
left.merge(right, left_on=['lkey1', 'lkey2'], right_on=['rkey1', 'rkey2'])
Other useful merge*
operations and functions
Merging a DataFrame with Series on index: See this answer.Besides
merge
,DataFrame.update
andDataFrame.combine_first
are also used in certain cases to update one DataFrame with another.pd.merge_ordered
is a useful function for ordered JOINs.pd.merge_asof
(read: merge_asOf) is useful for approximate joins.
This section only covers the very basics, and is designed to only whet your appetite. For more examples and cases, see the documentation on merge
, join
, and concat
as well as the links to the function specs.
Index-based *-JOIN (+ index-column merge
s)
Setup
left = pd.DataFrame({'value': np.random.randn(4)}, index=['A', 'B', 'C', 'D'])
right = pd.DataFrame({'value': np.random.randn(4)}, index=['B', 'D', 'E', 'F'])
left.index.name = right.index.name = 'idxkey'
left
value
idxkey
A 2.269755
B -1.454366
C 0.045759
D -0.187184
right
value
idxkey
B 1.532779
D 1.469359
E 0.154947
F 0.378163
Typically, a merge on index would look like this:
left.merge(right, left_index=True, right_index=True)
value_x value_y
idxkey
B 0.410599 0.761038
D 1.454274 0.121675
Support for index names
If your index is named, then v0.23 users can also specify the level name to on
(or left_on
and right_on
as necessary).
left.merge(right, on='idxkey')
value_x value_y
idxkey
B 0.410599 0.761038
D 1.454274 0.121675
Merge with index of one, columns of another
It is possible (and quite simple) to use the index of one, and the column of another, to perform a merge. For example,
left.merge(right, left_on='key1', right_index=True)
Or vice versa (right_on=...
and left_index=True`).
right2 = right.reset_index().rename({'idxkey' : 'colkey'}, axis=1)
right2
colkey value
0 B 1.222445
1 D 0.208275
2 E 0.976639
3 F 0.356366
left.merge(right2, left_index=True, right_on='colkey')
value_x colkey value_y
0 -1.070753 B 1.222445
1 -0.403177 D 0.208275
In this special case, the index for left
is named, so you can also use the index name with left_on
, like this:
left.merge(right2, left_on='idxkey', right_on='colkey')
value_x colkey value_y
0 -1.070753 B 1.222445
1 -0.403177 D 0.208275
DataFrame.join
Besides these, there is another succinct option. You can use DataFrame.join
which defaults to joins on the index. DataFrame.join
does a LEFT OUTER JOIN by default, so how='inner'
is necessary here.
left.join(right, how='inner', lsuffix='_x', rsuffix='_y')
value_x value_y
idxkey
B 0.410599 0.761038
D 1.454274 0.121675
Note that I needed to specify the lsuffix
and rsuffix
arguments since join
would otherwise error out:
left.join(right)
ValueError: columns overlap but no suffix specified: Index(['value'], dtype='object')
Since the column names are the same. This would not be a problem if they were differently named.
left.rename(columns={'value':'leftvalue'}).join(right, how='inner')
leftvalue value
idxkey
B -1.454366 1.532779
D -0.187184 1.469359
pd.concat
Lastly, as an alternative for index-based joins, you can use pd.concat
:
pd.concat([left, right], axis=1, sort=False, join='inner')
value value
idxkey
B -1.980796 1.230291
D 0.156349 1.202380
Omit join='inner'
if you need a FULL OUTER JOIN (the default):
pd.concat([left, right], axis=1, sort=False)
value value
A -0.887786 NaN
B -1.980796 1.230291
C -0.347912 NaN
D 0.156349 1.202380
E NaN -0.387327
F NaN -0.302303
For more information, see this canonical post on pd.concat
by @piRSquared.
Generalizing: merge
ing multiple DataFrames
Setup
A = pd.DataFrame({'key': ['A', 'B', 'C', 'D'], 'valueA': np.random.randn(4)})
B = pd.DataFrame({'key': ['B', 'D', 'E', 'F'], 'valueB': np.random.randn(4)})
C = pd.DataFrame({'key': ['D', 'E', 'J', 'C'], 'valueC': np.ones(4)})
dfs = [A, B, C]
Oftentimes, the situation arises when multiple DataFrames are to be merged together. Naively, this can be done by chaining merge
calls:
A.merge(B, on='key').merge(C, on='key')
key valueA valueB valueC
0 D 0.922207 -1.099401 1.0
However, this quickly gets out of hand for many DataFrames. Furthermore, it may be necessary to generalise for an unknown number of DataFrames. To do this, one often used simple trick is with functools.reduce
, and you can use it to achieve a FULL INNER JOIN like so:
from functools import reduce
reduce(pd.merge, dfs)
key valueA valueB valueC
0 D 0.465662 1.488252 1.0
Note that every column besides the "key" column should be differently named for this to work out-of-box. Otherwise, you may need to use a lambda
.
For a FULL OUTER JOIN, you can curry pd.merge
using functools.partial
:
from functools import partial
outer_merge = partial(pd.merge, how='outer')
reduce(outer_merge, dfs)
key valueA valueB valueC
0 A 0.056165 NaN NaN
1 B -1.165150 -1.536244 NaN
2 C 0.900826 NaN 1.0
3 D 0.465662 1.488252 1.0
4 E NaN 1.895889 1.0
5 F NaN 1.178780 NaN
6 J NaN NaN 1.0
As you may have noticed, this is quite powerful—you can also use it to control column names during merge. Simply add more keyword arguments as needed:
partial(pd.merge, how='outer', left_index=True, right_on=...)
The alternative: pd.concat
If your column values are unique, then it makes sense to use pd.concat
, this is faster than a two-at-a-time multi-way merge.
pd.concat([
df.set_index('key') for df in dfs], axis=1, join='inner'
).reset_index()
key valueA valueB valueC
0 D 0.465662 1.488252 1.0
Multiway merge on unique indexes
If you are merging multiple DataFrames on unique indexes, you should once again prefer pd.concat
for better performance.
# Note, the "key" column values are unique, so the index is unique.
A2 = A.set_index('key')
B2 = B.set_index('key')
C2 = C.set_index('key')
dfs2 = [A2, B2, C2]
pd.concat(dfs2, axis=1, sort=False, join='inner')
valueA valueB valueC
key
D 0.922207 -1.099401 1.0
As always, omit join='inner'
for a FULL OUTER JOIN.
Multiway merge on indexes with duplicatesconcat
is fast, but has its shortcomings. It cannot handle duplicates.
A3 = pd.DataFrame({'key': ['A', 'B', 'C', 'D', 'D'], 'valueA': np.random.randn(5)})
pd.concat([df.set_index('key') for df in [A3, B, C]], axis=1, join='inner')
ValueError: Shape of passed values is (3, 4), indices imply (3, 2)
In this situation, join
is the best option, since it can handle non-unique indexes (join
calls merge
under the hood).
# For inner join. For left join, pass `pd.DataFrame.join` directly to `reduce`.
inner_join = partial(pd.DataFrame.join, how='inner')
reduce(inner_join, [A3.set_index('key'), B2, C2])
valueA valueB valueC
key
D -0.674333 -1.099401 1.0
D 0.031831 -1.099401 1.0
2
This should be a canonical post as these kinds of questions keep popping up and sometimes never marked as duplicate +1
– EdChum
Dec 7 at 9:54
@jpp As an alternative to separate canonical posts, another option (that I've seen done before) would be to split each topic into separate answers, and have a table of contents in the question. Would that make more sense?
– coldspeed
Dec 8 at 3:24
3
@AlexRiley The thought has been niggling at my mind over the past few days, and I have been giving serious thought into splitting this behemoth into three smaller canonicals - one on merging basics, another on index based merge, and the third on computing the cartesian product. I will be more than happy to create new canonicals and re-mark duplicates as appropriate.
– coldspeed
Dec 9 at 16:26
1
Looks great! Thanks for taking the feedback on board and acting on it so quickly - the new post is (again) a great addition to SO. It's tricky to know how to best split the topic of merging into useful/usable posts, but I think your suggestion of basics/index/Cartesian-product is the right way to go.
– Alex Riley
Dec 10 at 11:31
1
@S.V I don't thinkalign
is appropriate for this post, but I can explain how it works and hopefully you'll understand why.align
is used to make the indexes of two DataFrames compatible with each other by inserting NaNs for missing keys. It's almost like both frames are reindexed on the union of the indices. The return value ofleft.align(right)
is(left_aligned, right_aligned)
. Hope that makes sense.
– coldspeed
2 days ago
|
show 2 more comments
Your Answer
StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");
StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "1"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);
StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});
function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});
}
});
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53645882%2fpandas-merging-101%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
1 Answer
1
active
oldest
votes
1 Answer
1
active
oldest
votes
active
oldest
votes
active
oldest
votes
up vote
67
down vote
accepted
This post aims to give readers a primer on SQL-flavoured merging with pandas, how to use it, and when not to use it.
In particular, here's what this post will go through:
The basics - types of joins (LEFT, RIGHT, OUTER, INNER)
- merging with different column names
- avoiding duplicate merge key column in output
- Merging with index under different conditions
- effectively using your named index
- merge key as the index of one and column of another
- Multiway merges on columns and indexes (unique and non-unique)
- Notable alternatives to
merge
andjoin
What this post will not go through:
- Performance-related discussions and timings (for now). Mostly notable mentions of better alternatives, wherever appropriate.
- Handling suffixes, removing extra columns, renaming outputs, and other specific use cases. There are other (read: better) posts that deal with that, so figure it out!
Note
Most examples default to FULL INNER JOIN operations while demonstrating various features, unless otherwise specified.
Furthermore, all the DataFrames here can be copied and replicated so you can play with them. Also, see this post on how to read DataFrames from your clipboard.
Enough Talk, just show me how to use merge
!
Setup
np.random.seed(0)
left = pd.DataFrame({'key': ['A', 'B', 'C', 'D'], 'value': np.random.randn(4)})
right = pd.DataFrame({'key': ['B', 'D', 'E', 'F'], 'value': np.random.randn(4)})
left
key value
0 A 1.494079
1 B -0.205158
2 C 0.313068
3 D -0.854096
right
key value
0 B -2.552990
1 D 0.653619
2 E 0.864436
3 F -0.742165
For the sake of simplicity, the key column has the same name (for now).
To perform a FULL INNER JOIN, call pd.merge
specifying the left DataFrame, the right DataFrame, and the join key.
pd.merge(left, right, on='key')
key value_x value_y
0 B 0.400157 1.867558
1 D 2.240893 -0.977278
In more recent versions of pandas (v0.21 or so), merge
is now a first order function, so you can call DataFrame.merge
.
left.merge(right, on='key')
# Or, if you want to be explicit
# left.merge(right, on='key', how='inner')
key value_x value_y
0 B 0.400157 1.867558
1 D 2.240893 -0.977278
A LEFT OUTER JOIN would be done by specifying how='left'
.
left.merge(right, on='key', how='left')
key value_x value_y
0 A 1.764052 NaN
1 B 0.400157 1.867558
2 C 0.978738 NaN
3 D 2.240893 -0.977278
Carefully note the placement of NaNs here. If you specify how='left'
, then only keys from left
are used, and missing data from right
is replaced by NaN.
And similarly, specify how='right'
...
left.merge(right, on='key', how='right')
key value_x value_y
0 B 0.400157 1.867558
1 D 2.240893 -0.977278
2 E NaN 0.950088
3 F NaN -0.151357
...for a RIGHT OUTER JOIN. Here, keys from right
are used, and missing data from left
is replaced by NaN.
Finally, specify how='outer'
...
left.merge(right, on='key', how='outer')
key value_x value_y
0 A 1.764052 NaN
1 B 0.400157 1.867558
2 C 0.978738 NaN
3 D 2.240893 -0.977278
4 E NaN 0.950088
5 F NaN -0.151357
...for a FULL OUTER JOIN. This uses the keys from both frames, and NaNs are inserted for missing rows in both.
The documentation summarises these various merges nicely:
Other JOINs - LEFT, RIGHT, ANTI-JOINs
If you need LEFT JOINs and RIGHT JOINs in two steps.
For LEFT JOIN, start by performing a LEFT OUTER JOIN and then filtering rows coming from left
only,
(left.merge(right, on='key', how='left', indicator=True)
.query('_merge == "left_only"')
.drop('_merge', 1))
key value_x value_y
0 A 1.764052 NaN
2 C 0.978738 NaN
Where,
left.merge(right, on='key', how='left', indicator=True)
key value_x value_y _merge
0 A 1.764052 NaN left_only
1 B 0.400157 1.867558 both
2 C 0.978738 NaN left_only
3 D 2.240893 -0.977278 both
And similarly, for a RIGHT JOIN,
(left.merge(right, on='key', how='right', indicator=True)
.query('_merge == "right_only"')
.drop('_merge', 1))
key value_x value_y
2 E NaN 0.950088
3 F NaN -0.151357
Lastly, if you are required to do a merge that only retains keys from the left or right, but not both (IOW, performing an ANTI-JOIN), you can do this in similar fashion—
(left.merge(right, on='key', how='outer', indicator=True)
.query('_merge != "both"')
.drop('_merge', 1))
key value_x value_y
0 A 1.764052 NaN
2 C 0.978738 NaN
4 E NaN 0.950088
5 F NaN -0.151357
Different names for key columns
If the key columns are named differently—for example, left
has keyLeft
, and right
has keyRight
instead of key
—then you will have to specify left_on
and right_on
as arguments instead of on
:
left2 = left.rename({'key':'keyLeft'}, axis=1)
right2 = right.rename({'key':'keyRight'}, axis=1)
left2
keyLeft value
0 A 0.706573
1 B 0.010500
2 C 1.785870
3 D 0.126912
right2
keyRight value
0 B 0.401989
1 D 1.883151
2 E -1.347759
3 F -1.270485
left2.merge(right2, left_on='keyLeft', right_on='keyRight', how='inner')
keyLeft value_x keyRight value_y
0 B 0.010500 B 0.401989
1 D 0.126912 D 1.883151
Avoiding duplicate key column in output
When merging on keyLeft
from left
and keyRight
from right
, if you only want either of the keyLeft
or keyRight
(but not both) in the output, you can start by setting the index as a preliminary step.
left3 = left2.set_index('keyLeft')
left3.merge(right2, left_index=True, right_on='keyRight')
value_x keyRight value_y
0 0.010500 B 0.401989
1 0.126912 D 1.883151
Contrast this with the output of the command just before (thst is, the output of left2.merge(right2, left_on='keyLeft', right_on='keyRight', how='inner')
), you'll notice keyLeft
is missing. You can figure out what column to keep based on which frame's index is set as the key. This may matter when, say, performing some OUTER JOIN operation.
Merging only a single column from one of the DataFrames
For example, consider
right3 = right.assign(newcol=np.arange(len(right)))
right3
key value newcol
0 B 1.867558 0
1 D -0.977278 1
2 E 0.950088 2
3 F -0.151357 3
If you are required to merge only "new_val" (without any of the other columns), you can usually just subset columns before merging:
left.merge(right3[['key', 'newcol']], on='key')
key value newcol
0 B 0.400157 0
1 D 2.240893 1
If you're doing a LEFT OUTER JOIN, a more performant solution would involve map
:
# left['newcol'] = left['key'].map(right3.set_index('key')['newcol']))
left.assign(newcol=left['key'].map(right3.set_index('key')['newcol']))
key value newcol
0 A 1.764052 NaN
1 B 0.400157 0.0
2 C 0.978738 NaN
3 D 2.240893 1.0
As mentioned, this is similar to, but faster than
left.merge(right3[['key', 'newcol']], on='key', how='left')
key value newcol
0 A 1.764052 NaN
1 B 0.400157 0.0
2 C 0.978738 NaN
3 D 2.240893 1.0
Merging on multiple columns
To join on more than one column, specify a list for on
(or left_on
and right_on
, as appropriate).
left.merge(right, on=['key1', 'key2'] ...)
Or, in the event the names are different,
left.merge(right, left_on=['lkey1', 'lkey2'], right_on=['rkey1', 'rkey2'])
Other useful merge*
operations and functions
Merging a DataFrame with Series on index: See this answer.Besides
merge
,DataFrame.update
andDataFrame.combine_first
are also used in certain cases to update one DataFrame with another.pd.merge_ordered
is a useful function for ordered JOINs.pd.merge_asof
(read: merge_asOf) is useful for approximate joins.
This section only covers the very basics, and is designed to only whet your appetite. For more examples and cases, see the documentation on merge
, join
, and concat
as well as the links to the function specs.
Index-based *-JOIN (+ index-column merge
s)
Setup
left = pd.DataFrame({'value': np.random.randn(4)}, index=['A', 'B', 'C', 'D'])
right = pd.DataFrame({'value': np.random.randn(4)}, index=['B', 'D', 'E', 'F'])
left.index.name = right.index.name = 'idxkey'
left
value
idxkey
A 2.269755
B -1.454366
C 0.045759
D -0.187184
right
value
idxkey
B 1.532779
D 1.469359
E 0.154947
F 0.378163
Typically, a merge on index would look like this:
left.merge(right, left_index=True, right_index=True)
value_x value_y
idxkey
B 0.410599 0.761038
D 1.454274 0.121675
Support for index names
If your index is named, then v0.23 users can also specify the level name to on
(or left_on
and right_on
as necessary).
left.merge(right, on='idxkey')
value_x value_y
idxkey
B 0.410599 0.761038
D 1.454274 0.121675
Merge with index of one, columns of another
It is possible (and quite simple) to use the index of one, and the column of another, to perform a merge. For example,
left.merge(right, left_on='key1', right_index=True)
Or vice versa (right_on=...
and left_index=True`).
right2 = right.reset_index().rename({'idxkey' : 'colkey'}, axis=1)
right2
colkey value
0 B 1.222445
1 D 0.208275
2 E 0.976639
3 F 0.356366
left.merge(right2, left_index=True, right_on='colkey')
value_x colkey value_y
0 -1.070753 B 1.222445
1 -0.403177 D 0.208275
In this special case, the index for left
is named, so you can also use the index name with left_on
, like this:
left.merge(right2, left_on='idxkey', right_on='colkey')
value_x colkey value_y
0 -1.070753 B 1.222445
1 -0.403177 D 0.208275
DataFrame.join
Besides these, there is another succinct option. You can use DataFrame.join
which defaults to joins on the index. DataFrame.join
does a LEFT OUTER JOIN by default, so how='inner'
is necessary here.
left.join(right, how='inner', lsuffix='_x', rsuffix='_y')
value_x value_y
idxkey
B 0.410599 0.761038
D 1.454274 0.121675
Note that I needed to specify the lsuffix
and rsuffix
arguments since join
would otherwise error out:
left.join(right)
ValueError: columns overlap but no suffix specified: Index(['value'], dtype='object')
Since the column names are the same. This would not be a problem if they were differently named.
left.rename(columns={'value':'leftvalue'}).join(right, how='inner')
leftvalue value
idxkey
B -1.454366 1.532779
D -0.187184 1.469359
pd.concat
Lastly, as an alternative for index-based joins, you can use pd.concat
:
pd.concat([left, right], axis=1, sort=False, join='inner')
value value
idxkey
B -1.980796 1.230291
D 0.156349 1.202380
Omit join='inner'
if you need a FULL OUTER JOIN (the default):
pd.concat([left, right], axis=1, sort=False)
value value
A -0.887786 NaN
B -1.980796 1.230291
C -0.347912 NaN
D 0.156349 1.202380
E NaN -0.387327
F NaN -0.302303
For more information, see this canonical post on pd.concat
by @piRSquared.
Generalizing: merge
ing multiple DataFrames
Setup
A = pd.DataFrame({'key': ['A', 'B', 'C', 'D'], 'valueA': np.random.randn(4)})
B = pd.DataFrame({'key': ['B', 'D', 'E', 'F'], 'valueB': np.random.randn(4)})
C = pd.DataFrame({'key': ['D', 'E', 'J', 'C'], 'valueC': np.ones(4)})
dfs = [A, B, C]
Oftentimes, the situation arises when multiple DataFrames are to be merged together. Naively, this can be done by chaining merge
calls:
A.merge(B, on='key').merge(C, on='key')
key valueA valueB valueC
0 D 0.922207 -1.099401 1.0
However, this quickly gets out of hand for many DataFrames. Furthermore, it may be necessary to generalise for an unknown number of DataFrames. To do this, one often used simple trick is with functools.reduce
, and you can use it to achieve a FULL INNER JOIN like so:
from functools import reduce
reduce(pd.merge, dfs)
key valueA valueB valueC
0 D 0.465662 1.488252 1.0
Note that every column besides the "key" column should be differently named for this to work out-of-box. Otherwise, you may need to use a lambda
.
For a FULL OUTER JOIN, you can curry pd.merge
using functools.partial
:
from functools import partial
outer_merge = partial(pd.merge, how='outer')
reduce(outer_merge, dfs)
key valueA valueB valueC
0 A 0.056165 NaN NaN
1 B -1.165150 -1.536244 NaN
2 C 0.900826 NaN 1.0
3 D 0.465662 1.488252 1.0
4 E NaN 1.895889 1.0
5 F NaN 1.178780 NaN
6 J NaN NaN 1.0
As you may have noticed, this is quite powerful—you can also use it to control column names during merge. Simply add more keyword arguments as needed:
partial(pd.merge, how='outer', left_index=True, right_on=...)
The alternative: pd.concat
If your column values are unique, then it makes sense to use pd.concat
, this is faster than a two-at-a-time multi-way merge.
pd.concat([
df.set_index('key') for df in dfs], axis=1, join='inner'
).reset_index()
key valueA valueB valueC
0 D 0.465662 1.488252 1.0
Multiway merge on unique indexes
If you are merging multiple DataFrames on unique indexes, you should once again prefer pd.concat
for better performance.
# Note, the "key" column values are unique, so the index is unique.
A2 = A.set_index('key')
B2 = B.set_index('key')
C2 = C.set_index('key')
dfs2 = [A2, B2, C2]
pd.concat(dfs2, axis=1, sort=False, join='inner')
valueA valueB valueC
key
D 0.922207 -1.099401 1.0
As always, omit join='inner'
for a FULL OUTER JOIN.
Multiway merge on indexes with duplicatesconcat
is fast, but has its shortcomings. It cannot handle duplicates.
A3 = pd.DataFrame({'key': ['A', 'B', 'C', 'D', 'D'], 'valueA': np.random.randn(5)})
pd.concat([df.set_index('key') for df in [A3, B, C]], axis=1, join='inner')
ValueError: Shape of passed values is (3, 4), indices imply (3, 2)
In this situation, join
is the best option, since it can handle non-unique indexes (join
calls merge
under the hood).
# For inner join. For left join, pass `pd.DataFrame.join` directly to `reduce`.
inner_join = partial(pd.DataFrame.join, how='inner')
reduce(inner_join, [A3.set_index('key'), B2, C2])
valueA valueB valueC
key
D -0.674333 -1.099401 1.0
D 0.031831 -1.099401 1.0
2
This should be a canonical post as these kinds of questions keep popping up and sometimes never marked as duplicate +1
– EdChum
Dec 7 at 9:54
@jpp As an alternative to separate canonical posts, another option (that I've seen done before) would be to split each topic into separate answers, and have a table of contents in the question. Would that make more sense?
– coldspeed
Dec 8 at 3:24
3
@AlexRiley The thought has been niggling at my mind over the past few days, and I have been giving serious thought into splitting this behemoth into three smaller canonicals - one on merging basics, another on index based merge, and the third on computing the cartesian product. I will be more than happy to create new canonicals and re-mark duplicates as appropriate.
– coldspeed
Dec 9 at 16:26
1
Looks great! Thanks for taking the feedback on board and acting on it so quickly - the new post is (again) a great addition to SO. It's tricky to know how to best split the topic of merging into useful/usable posts, but I think your suggestion of basics/index/Cartesian-product is the right way to go.
– Alex Riley
Dec 10 at 11:31
1
@S.V I don't thinkalign
is appropriate for this post, but I can explain how it works and hopefully you'll understand why.align
is used to make the indexes of two DataFrames compatible with each other by inserting NaNs for missing keys. It's almost like both frames are reindexed on the union of the indices. The return value ofleft.align(right)
is(left_aligned, right_aligned)
. Hope that makes sense.
– coldspeed
2 days ago
|
show 2 more comments
up vote
67
down vote
accepted
This post aims to give readers a primer on SQL-flavoured merging with pandas, how to use it, and when not to use it.
In particular, here's what this post will go through:
The basics - types of joins (LEFT, RIGHT, OUTER, INNER)
- merging with different column names
- avoiding duplicate merge key column in output
- Merging with index under different conditions
- effectively using your named index
- merge key as the index of one and column of another
- Multiway merges on columns and indexes (unique and non-unique)
- Notable alternatives to
merge
andjoin
What this post will not go through:
- Performance-related discussions and timings (for now). Mostly notable mentions of better alternatives, wherever appropriate.
- Handling suffixes, removing extra columns, renaming outputs, and other specific use cases. There are other (read: better) posts that deal with that, so figure it out!
Note
Most examples default to FULL INNER JOIN operations while demonstrating various features, unless otherwise specified.
Furthermore, all the DataFrames here can be copied and replicated so you can play with them. Also, see this post on how to read DataFrames from your clipboard.
Enough Talk, just show me how to use merge
!
Setup
np.random.seed(0)
left = pd.DataFrame({'key': ['A', 'B', 'C', 'D'], 'value': np.random.randn(4)})
right = pd.DataFrame({'key': ['B', 'D', 'E', 'F'], 'value': np.random.randn(4)})
left
key value
0 A 1.494079
1 B -0.205158
2 C 0.313068
3 D -0.854096
right
key value
0 B -2.552990
1 D 0.653619
2 E 0.864436
3 F -0.742165
For the sake of simplicity, the key column has the same name (for now).
To perform a FULL INNER JOIN, call pd.merge
specifying the left DataFrame, the right DataFrame, and the join key.
pd.merge(left, right, on='key')
key value_x value_y
0 B 0.400157 1.867558
1 D 2.240893 -0.977278
In more recent versions of pandas (v0.21 or so), merge
is now a first order function, so you can call DataFrame.merge
.
left.merge(right, on='key')
# Or, if you want to be explicit
# left.merge(right, on='key', how='inner')
key value_x value_y
0 B 0.400157 1.867558
1 D 2.240893 -0.977278
A LEFT OUTER JOIN would be done by specifying how='left'
.
left.merge(right, on='key', how='left')
key value_x value_y
0 A 1.764052 NaN
1 B 0.400157 1.867558
2 C 0.978738 NaN
3 D 2.240893 -0.977278
Carefully note the placement of NaNs here. If you specify how='left'
, then only keys from left
are used, and missing data from right
is replaced by NaN.
And similarly, specify how='right'
...
left.merge(right, on='key', how='right')
key value_x value_y
0 B 0.400157 1.867558
1 D 2.240893 -0.977278
2 E NaN 0.950088
3 F NaN -0.151357
...for a RIGHT OUTER JOIN. Here, keys from right
are used, and missing data from left
is replaced by NaN.
Finally, specify how='outer'
...
left.merge(right, on='key', how='outer')
key value_x value_y
0 A 1.764052 NaN
1 B 0.400157 1.867558
2 C 0.978738 NaN
3 D 2.240893 -0.977278
4 E NaN 0.950088
5 F NaN -0.151357
...for a FULL OUTER JOIN. This uses the keys from both frames, and NaNs are inserted for missing rows in both.
The documentation summarises these various merges nicely:
Other JOINs - LEFT, RIGHT, ANTI-JOINs
If you need LEFT JOINs and RIGHT JOINs in two steps.
For LEFT JOIN, start by performing a LEFT OUTER JOIN and then filtering rows coming from left
only,
(left.merge(right, on='key', how='left', indicator=True)
.query('_merge == "left_only"')
.drop('_merge', 1))
key value_x value_y
0 A 1.764052 NaN
2 C 0.978738 NaN
Where,
left.merge(right, on='key', how='left', indicator=True)
key value_x value_y _merge
0 A 1.764052 NaN left_only
1 B 0.400157 1.867558 both
2 C 0.978738 NaN left_only
3 D 2.240893 -0.977278 both
And similarly, for a RIGHT JOIN,
(left.merge(right, on='key', how='right', indicator=True)
.query('_merge == "right_only"')
.drop('_merge', 1))
key value_x value_y
2 E NaN 0.950088
3 F NaN -0.151357
Lastly, if you are required to do a merge that only retains keys from the left or right, but not both (IOW, performing an ANTI-JOIN), you can do this in similar fashion—
(left.merge(right, on='key', how='outer', indicator=True)
.query('_merge != "both"')
.drop('_merge', 1))
key value_x value_y
0 A 1.764052 NaN
2 C 0.978738 NaN
4 E NaN 0.950088
5 F NaN -0.151357
Different names for key columns
If the key columns are named differently—for example, left
has keyLeft
, and right
has keyRight
instead of key
—then you will have to specify left_on
and right_on
as arguments instead of on
:
left2 = left.rename({'key':'keyLeft'}, axis=1)
right2 = right.rename({'key':'keyRight'}, axis=1)
left2
keyLeft value
0 A 0.706573
1 B 0.010500
2 C 1.785870
3 D 0.126912
right2
keyRight value
0 B 0.401989
1 D 1.883151
2 E -1.347759
3 F -1.270485
left2.merge(right2, left_on='keyLeft', right_on='keyRight', how='inner')
keyLeft value_x keyRight value_y
0 B 0.010500 B 0.401989
1 D 0.126912 D 1.883151
Avoiding duplicate key column in output
When merging on keyLeft
from left
and keyRight
from right
, if you only want either of the keyLeft
or keyRight
(but not both) in the output, you can start by setting the index as a preliminary step.
left3 = left2.set_index('keyLeft')
left3.merge(right2, left_index=True, right_on='keyRight')
value_x keyRight value_y
0 0.010500 B 0.401989
1 0.126912 D 1.883151
Contrast this with the output of the command just before (thst is, the output of left2.merge(right2, left_on='keyLeft', right_on='keyRight', how='inner')
), you'll notice keyLeft
is missing. You can figure out what column to keep based on which frame's index is set as the key. This may matter when, say, performing some OUTER JOIN operation.
Merging only a single column from one of the DataFrames
For example, consider
right3 = right.assign(newcol=np.arange(len(right)))
right3
key value newcol
0 B 1.867558 0
1 D -0.977278 1
2 E 0.950088 2
3 F -0.151357 3
If you are required to merge only "new_val" (without any of the other columns), you can usually just subset columns before merging:
left.merge(right3[['key', 'newcol']], on='key')
key value newcol
0 B 0.400157 0
1 D 2.240893 1
If you're doing a LEFT OUTER JOIN, a more performant solution would involve map
:
# left['newcol'] = left['key'].map(right3.set_index('key')['newcol']))
left.assign(newcol=left['key'].map(right3.set_index('key')['newcol']))
key value newcol
0 A 1.764052 NaN
1 B 0.400157 0.0
2 C 0.978738 NaN
3 D 2.240893 1.0
As mentioned, this is similar to, but faster than
left.merge(right3[['key', 'newcol']], on='key', how='left')
key value newcol
0 A 1.764052 NaN
1 B 0.400157 0.0
2 C 0.978738 NaN
3 D 2.240893 1.0
Merging on multiple columns
To join on more than one column, specify a list for on
(or left_on
and right_on
, as appropriate).
left.merge(right, on=['key1', 'key2'] ...)
Or, in the event the names are different,
left.merge(right, left_on=['lkey1', 'lkey2'], right_on=['rkey1', 'rkey2'])
Other useful merge*
operations and functions
Merging a DataFrame with Series on index: See this answer.Besides
merge
,DataFrame.update
andDataFrame.combine_first
are also used in certain cases to update one DataFrame with another.pd.merge_ordered
is a useful function for ordered JOINs.pd.merge_asof
(read: merge_asOf) is useful for approximate joins.
This section only covers the very basics, and is designed to only whet your appetite. For more examples and cases, see the documentation on merge
, join
, and concat
as well as the links to the function specs.
Index-based *-JOIN (+ index-column merge
s)
Setup
left = pd.DataFrame({'value': np.random.randn(4)}, index=['A', 'B', 'C', 'D'])
right = pd.DataFrame({'value': np.random.randn(4)}, index=['B', 'D', 'E', 'F'])
left.index.name = right.index.name = 'idxkey'
left
value
idxkey
A 2.269755
B -1.454366
C 0.045759
D -0.187184
right
value
idxkey
B 1.532779
D 1.469359
E 0.154947
F 0.378163
Typically, a merge on index would look like this:
left.merge(right, left_index=True, right_index=True)
value_x value_y
idxkey
B 0.410599 0.761038
D 1.454274 0.121675
Support for index names
If your index is named, then v0.23 users can also specify the level name to on
(or left_on
and right_on
as necessary).
left.merge(right, on='idxkey')
value_x value_y
idxkey
B 0.410599 0.761038
D 1.454274 0.121675
Merge with index of one, columns of another
It is possible (and quite simple) to use the index of one, and the column of another, to perform a merge. For example,
left.merge(right, left_on='key1', right_index=True)
Or vice versa (right_on=...
and left_index=True`).
right2 = right.reset_index().rename({'idxkey' : 'colkey'}, axis=1)
right2
colkey value
0 B 1.222445
1 D 0.208275
2 E 0.976639
3 F 0.356366
left.merge(right2, left_index=True, right_on='colkey')
value_x colkey value_y
0 -1.070753 B 1.222445
1 -0.403177 D 0.208275
In this special case, the index for left
is named, so you can also use the index name with left_on
, like this:
left.merge(right2, left_on='idxkey', right_on='colkey')
value_x colkey value_y
0 -1.070753 B 1.222445
1 -0.403177 D 0.208275
DataFrame.join
Besides these, there is another succinct option. You can use DataFrame.join
which defaults to joins on the index. DataFrame.join
does a LEFT OUTER JOIN by default, so how='inner'
is necessary here.
left.join(right, how='inner', lsuffix='_x', rsuffix='_y')
value_x value_y
idxkey
B 0.410599 0.761038
D 1.454274 0.121675
Note that I needed to specify the lsuffix
and rsuffix
arguments since join
would otherwise error out:
left.join(right)
ValueError: columns overlap but no suffix specified: Index(['value'], dtype='object')
Since the column names are the same. This would not be a problem if they were differently named.
left.rename(columns={'value':'leftvalue'}).join(right, how='inner')
leftvalue value
idxkey
B -1.454366 1.532779
D -0.187184 1.469359
pd.concat
Lastly, as an alternative for index-based joins, you can use pd.concat
:
pd.concat([left, right], axis=1, sort=False, join='inner')
value value
idxkey
B -1.980796 1.230291
D 0.156349 1.202380
Omit join='inner'
if you need a FULL OUTER JOIN (the default):
pd.concat([left, right], axis=1, sort=False)
value value
A -0.887786 NaN
B -1.980796 1.230291
C -0.347912 NaN
D 0.156349 1.202380
E NaN -0.387327
F NaN -0.302303
For more information, see this canonical post on pd.concat
by @piRSquared.
Generalizing: merge
ing multiple DataFrames
Setup
A = pd.DataFrame({'key': ['A', 'B', 'C', 'D'], 'valueA': np.random.randn(4)})
B = pd.DataFrame({'key': ['B', 'D', 'E', 'F'], 'valueB': np.random.randn(4)})
C = pd.DataFrame({'key': ['D', 'E', 'J', 'C'], 'valueC': np.ones(4)})
dfs = [A, B, C]
Oftentimes, the situation arises when multiple DataFrames are to be merged together. Naively, this can be done by chaining merge
calls:
A.merge(B, on='key').merge(C, on='key')
key valueA valueB valueC
0 D 0.922207 -1.099401 1.0
However, this quickly gets out of hand for many DataFrames. Furthermore, it may be necessary to generalise for an unknown number of DataFrames. To do this, one often used simple trick is with functools.reduce
, and you can use it to achieve a FULL INNER JOIN like so:
from functools import reduce
reduce(pd.merge, dfs)
key valueA valueB valueC
0 D 0.465662 1.488252 1.0
Note that every column besides the "key" column should be differently named for this to work out-of-box. Otherwise, you may need to use a lambda
.
For a FULL OUTER JOIN, you can curry pd.merge
using functools.partial
:
from functools import partial
outer_merge = partial(pd.merge, how='outer')
reduce(outer_merge, dfs)
key valueA valueB valueC
0 A 0.056165 NaN NaN
1 B -1.165150 -1.536244 NaN
2 C 0.900826 NaN 1.0
3 D 0.465662 1.488252 1.0
4 E NaN 1.895889 1.0
5 F NaN 1.178780 NaN
6 J NaN NaN 1.0
As you may have noticed, this is quite powerful—you can also use it to control column names during merge. Simply add more keyword arguments as needed:
partial(pd.merge, how='outer', left_index=True, right_on=...)
The alternative: pd.concat
If your column values are unique, then it makes sense to use pd.concat
, this is faster than a two-at-a-time multi-way merge.
pd.concat([
df.set_index('key') for df in dfs], axis=1, join='inner'
).reset_index()
key valueA valueB valueC
0 D 0.465662 1.488252 1.0
Multiway merge on unique indexes
If you are merging multiple DataFrames on unique indexes, you should once again prefer pd.concat
for better performance.
# Note, the "key" column values are unique, so the index is unique.
A2 = A.set_index('key')
B2 = B.set_index('key')
C2 = C.set_index('key')
dfs2 = [A2, B2, C2]
pd.concat(dfs2, axis=1, sort=False, join='inner')
valueA valueB valueC
key
D 0.922207 -1.099401 1.0
As always, omit join='inner'
for a FULL OUTER JOIN.
Multiway merge on indexes with duplicatesconcat
is fast, but has its shortcomings. It cannot handle duplicates.
A3 = pd.DataFrame({'key': ['A', 'B', 'C', 'D', 'D'], 'valueA': np.random.randn(5)})
pd.concat([df.set_index('key') for df in [A3, B, C]], axis=1, join='inner')
ValueError: Shape of passed values is (3, 4), indices imply (3, 2)
In this situation, join
is the best option, since it can handle non-unique indexes (join
calls merge
under the hood).
# For inner join. For left join, pass `pd.DataFrame.join` directly to `reduce`.
inner_join = partial(pd.DataFrame.join, how='inner')
reduce(inner_join, [A3.set_index('key'), B2, C2])
valueA valueB valueC
key
D -0.674333 -1.099401 1.0
D 0.031831 -1.099401 1.0
2
This should be a canonical post as these kinds of questions keep popping up and sometimes never marked as duplicate +1
– EdChum
Dec 7 at 9:54
@jpp As an alternative to separate canonical posts, another option (that I've seen done before) would be to split each topic into separate answers, and have a table of contents in the question. Would that make more sense?
– coldspeed
Dec 8 at 3:24
3
@AlexRiley The thought has been niggling at my mind over the past few days, and I have been giving serious thought into splitting this behemoth into three smaller canonicals - one on merging basics, another on index based merge, and the third on computing the cartesian product. I will be more than happy to create new canonicals and re-mark duplicates as appropriate.
– coldspeed
Dec 9 at 16:26
1
Looks great! Thanks for taking the feedback on board and acting on it so quickly - the new post is (again) a great addition to SO. It's tricky to know how to best split the topic of merging into useful/usable posts, but I think your suggestion of basics/index/Cartesian-product is the right way to go.
– Alex Riley
Dec 10 at 11:31
1
@S.V I don't thinkalign
is appropriate for this post, but I can explain how it works and hopefully you'll understand why.align
is used to make the indexes of two DataFrames compatible with each other by inserting NaNs for missing keys. It's almost like both frames are reindexed on the union of the indices. The return value ofleft.align(right)
is(left_aligned, right_aligned)
. Hope that makes sense.
– coldspeed
2 days ago
|
show 2 more comments
up vote
67
down vote
accepted
up vote
67
down vote
accepted
This post aims to give readers a primer on SQL-flavoured merging with pandas, how to use it, and when not to use it.
In particular, here's what this post will go through:
The basics - types of joins (LEFT, RIGHT, OUTER, INNER)
- merging with different column names
- avoiding duplicate merge key column in output
- Merging with index under different conditions
- effectively using your named index
- merge key as the index of one and column of another
- Multiway merges on columns and indexes (unique and non-unique)
- Notable alternatives to
merge
andjoin
What this post will not go through:
- Performance-related discussions and timings (for now). Mostly notable mentions of better alternatives, wherever appropriate.
- Handling suffixes, removing extra columns, renaming outputs, and other specific use cases. There are other (read: better) posts that deal with that, so figure it out!
Note
Most examples default to FULL INNER JOIN operations while demonstrating various features, unless otherwise specified.
Furthermore, all the DataFrames here can be copied and replicated so you can play with them. Also, see this post on how to read DataFrames from your clipboard.
Enough Talk, just show me how to use merge
!
Setup
np.random.seed(0)
left = pd.DataFrame({'key': ['A', 'B', 'C', 'D'], 'value': np.random.randn(4)})
right = pd.DataFrame({'key': ['B', 'D', 'E', 'F'], 'value': np.random.randn(4)})
left
key value
0 A 1.494079
1 B -0.205158
2 C 0.313068
3 D -0.854096
right
key value
0 B -2.552990
1 D 0.653619
2 E 0.864436
3 F -0.742165
For the sake of simplicity, the key column has the same name (for now).
To perform a FULL INNER JOIN, call pd.merge
specifying the left DataFrame, the right DataFrame, and the join key.
pd.merge(left, right, on='key')
key value_x value_y
0 B 0.400157 1.867558
1 D 2.240893 -0.977278
In more recent versions of pandas (v0.21 or so), merge
is now a first order function, so you can call DataFrame.merge
.
left.merge(right, on='key')
# Or, if you want to be explicit
# left.merge(right, on='key', how='inner')
key value_x value_y
0 B 0.400157 1.867558
1 D 2.240893 -0.977278
A LEFT OUTER JOIN would be done by specifying how='left'
.
left.merge(right, on='key', how='left')
key value_x value_y
0 A 1.764052 NaN
1 B 0.400157 1.867558
2 C 0.978738 NaN
3 D 2.240893 -0.977278
Carefully note the placement of NaNs here. If you specify how='left'
, then only keys from left
are used, and missing data from right
is replaced by NaN.
And similarly, specify how='right'
...
left.merge(right, on='key', how='right')
key value_x value_y
0 B 0.400157 1.867558
1 D 2.240893 -0.977278
2 E NaN 0.950088
3 F NaN -0.151357
...for a RIGHT OUTER JOIN. Here, keys from right
are used, and missing data from left
is replaced by NaN.
Finally, specify how='outer'
...
left.merge(right, on='key', how='outer')
key value_x value_y
0 A 1.764052 NaN
1 B 0.400157 1.867558
2 C 0.978738 NaN
3 D 2.240893 -0.977278
4 E NaN 0.950088
5 F NaN -0.151357
...for a FULL OUTER JOIN. This uses the keys from both frames, and NaNs are inserted for missing rows in both.
The documentation summarises these various merges nicely:
Other JOINs - LEFT, RIGHT, ANTI-JOINs
If you need LEFT JOINs and RIGHT JOINs in two steps.
For LEFT JOIN, start by performing a LEFT OUTER JOIN and then filtering rows coming from left
only,
(left.merge(right, on='key', how='left', indicator=True)
.query('_merge == "left_only"')
.drop('_merge', 1))
key value_x value_y
0 A 1.764052 NaN
2 C 0.978738 NaN
Where,
left.merge(right, on='key', how='left', indicator=True)
key value_x value_y _merge
0 A 1.764052 NaN left_only
1 B 0.400157 1.867558 both
2 C 0.978738 NaN left_only
3 D 2.240893 -0.977278 both
And similarly, for a RIGHT JOIN,
(left.merge(right, on='key', how='right', indicator=True)
.query('_merge == "right_only"')
.drop('_merge', 1))
key value_x value_y
2 E NaN 0.950088
3 F NaN -0.151357
Lastly, if you are required to do a merge that only retains keys from the left or right, but not both (IOW, performing an ANTI-JOIN), you can do this in similar fashion—
(left.merge(right, on='key', how='outer', indicator=True)
.query('_merge != "both"')
.drop('_merge', 1))
key value_x value_y
0 A 1.764052 NaN
2 C 0.978738 NaN
4 E NaN 0.950088
5 F NaN -0.151357
Different names for key columns
If the key columns are named differently—for example, left
has keyLeft
, and right
has keyRight
instead of key
—then you will have to specify left_on
and right_on
as arguments instead of on
:
left2 = left.rename({'key':'keyLeft'}, axis=1)
right2 = right.rename({'key':'keyRight'}, axis=1)
left2
keyLeft value
0 A 0.706573
1 B 0.010500
2 C 1.785870
3 D 0.126912
right2
keyRight value
0 B 0.401989
1 D 1.883151
2 E -1.347759
3 F -1.270485
left2.merge(right2, left_on='keyLeft', right_on='keyRight', how='inner')
keyLeft value_x keyRight value_y
0 B 0.010500 B 0.401989
1 D 0.126912 D 1.883151
Avoiding duplicate key column in output
When merging on keyLeft
from left
and keyRight
from right
, if you only want either of the keyLeft
or keyRight
(but not both) in the output, you can start by setting the index as a preliminary step.
left3 = left2.set_index('keyLeft')
left3.merge(right2, left_index=True, right_on='keyRight')
value_x keyRight value_y
0 0.010500 B 0.401989
1 0.126912 D 1.883151
Contrast this with the output of the command just before (thst is, the output of left2.merge(right2, left_on='keyLeft', right_on='keyRight', how='inner')
), you'll notice keyLeft
is missing. You can figure out what column to keep based on which frame's index is set as the key. This may matter when, say, performing some OUTER JOIN operation.
Merging only a single column from one of the DataFrames
For example, consider
right3 = right.assign(newcol=np.arange(len(right)))
right3
key value newcol
0 B 1.867558 0
1 D -0.977278 1
2 E 0.950088 2
3 F -0.151357 3
If you are required to merge only "new_val" (without any of the other columns), you can usually just subset columns before merging:
left.merge(right3[['key', 'newcol']], on='key')
key value newcol
0 B 0.400157 0
1 D 2.240893 1
If you're doing a LEFT OUTER JOIN, a more performant solution would involve map
:
# left['newcol'] = left['key'].map(right3.set_index('key')['newcol']))
left.assign(newcol=left['key'].map(right3.set_index('key')['newcol']))
key value newcol
0 A 1.764052 NaN
1 B 0.400157 0.0
2 C 0.978738 NaN
3 D 2.240893 1.0
As mentioned, this is similar to, but faster than
left.merge(right3[['key', 'newcol']], on='key', how='left')
key value newcol
0 A 1.764052 NaN
1 B 0.400157 0.0
2 C 0.978738 NaN
3 D 2.240893 1.0
Merging on multiple columns
To join on more than one column, specify a list for on
(or left_on
and right_on
, as appropriate).
left.merge(right, on=['key1', 'key2'] ...)
Or, in the event the names are different,
left.merge(right, left_on=['lkey1', 'lkey2'], right_on=['rkey1', 'rkey2'])
Other useful merge*
operations and functions
Merging a DataFrame with Series on index: See this answer.Besides
merge
,DataFrame.update
andDataFrame.combine_first
are also used in certain cases to update one DataFrame with another.pd.merge_ordered
is a useful function for ordered JOINs.pd.merge_asof
(read: merge_asOf) is useful for approximate joins.
This section only covers the very basics, and is designed to only whet your appetite. For more examples and cases, see the documentation on merge
, join
, and concat
as well as the links to the function specs.
Index-based *-JOIN (+ index-column merge
s)
Setup
left = pd.DataFrame({'value': np.random.randn(4)}, index=['A', 'B', 'C', 'D'])
right = pd.DataFrame({'value': np.random.randn(4)}, index=['B', 'D', 'E', 'F'])
left.index.name = right.index.name = 'idxkey'
left
value
idxkey
A 2.269755
B -1.454366
C 0.045759
D -0.187184
right
value
idxkey
B 1.532779
D 1.469359
E 0.154947
F 0.378163
Typically, a merge on index would look like this:
left.merge(right, left_index=True, right_index=True)
value_x value_y
idxkey
B 0.410599 0.761038
D 1.454274 0.121675
Support for index names
If your index is named, then v0.23 users can also specify the level name to on
(or left_on
and right_on
as necessary).
left.merge(right, on='idxkey')
value_x value_y
idxkey
B 0.410599 0.761038
D 1.454274 0.121675
Merge with index of one, columns of another
It is possible (and quite simple) to use the index of one, and the column of another, to perform a merge. For example,
left.merge(right, left_on='key1', right_index=True)
Or vice versa (right_on=...
and left_index=True`).
right2 = right.reset_index().rename({'idxkey' : 'colkey'}, axis=1)
right2
colkey value
0 B 1.222445
1 D 0.208275
2 E 0.976639
3 F 0.356366
left.merge(right2, left_index=True, right_on='colkey')
value_x colkey value_y
0 -1.070753 B 1.222445
1 -0.403177 D 0.208275
In this special case, the index for left
is named, so you can also use the index name with left_on
, like this:
left.merge(right2, left_on='idxkey', right_on='colkey')
value_x colkey value_y
0 -1.070753 B 1.222445
1 -0.403177 D 0.208275
DataFrame.join
Besides these, there is another succinct option. You can use DataFrame.join
which defaults to joins on the index. DataFrame.join
does a LEFT OUTER JOIN by default, so how='inner'
is necessary here.
left.join(right, how='inner', lsuffix='_x', rsuffix='_y')
value_x value_y
idxkey
B 0.410599 0.761038
D 1.454274 0.121675
Note that I needed to specify the lsuffix
and rsuffix
arguments since join
would otherwise error out:
left.join(right)
ValueError: columns overlap but no suffix specified: Index(['value'], dtype='object')
Since the column names are the same. This would not be a problem if they were differently named.
left.rename(columns={'value':'leftvalue'}).join(right, how='inner')
leftvalue value
idxkey
B -1.454366 1.532779
D -0.187184 1.469359
pd.concat
Lastly, as an alternative for index-based joins, you can use pd.concat
:
pd.concat([left, right], axis=1, sort=False, join='inner')
value value
idxkey
B -1.980796 1.230291
D 0.156349 1.202380
Omit join='inner'
if you need a FULL OUTER JOIN (the default):
pd.concat([left, right], axis=1, sort=False)
value value
A -0.887786 NaN
B -1.980796 1.230291
C -0.347912 NaN
D 0.156349 1.202380
E NaN -0.387327
F NaN -0.302303
For more information, see this canonical post on pd.concat
by @piRSquared.
Generalizing: merge
ing multiple DataFrames
Setup
A = pd.DataFrame({'key': ['A', 'B', 'C', 'D'], 'valueA': np.random.randn(4)})
B = pd.DataFrame({'key': ['B', 'D', 'E', 'F'], 'valueB': np.random.randn(4)})
C = pd.DataFrame({'key': ['D', 'E', 'J', 'C'], 'valueC': np.ones(4)})
dfs = [A, B, C]
Oftentimes, the situation arises when multiple DataFrames are to be merged together. Naively, this can be done by chaining merge
calls:
A.merge(B, on='key').merge(C, on='key')
key valueA valueB valueC
0 D 0.922207 -1.099401 1.0
However, this quickly gets out of hand for many DataFrames. Furthermore, it may be necessary to generalise for an unknown number of DataFrames. To do this, one often used simple trick is with functools.reduce
, and you can use it to achieve a FULL INNER JOIN like so:
from functools import reduce
reduce(pd.merge, dfs)
key valueA valueB valueC
0 D 0.465662 1.488252 1.0
Note that every column besides the "key" column should be differently named for this to work out-of-box. Otherwise, you may need to use a lambda
.
For a FULL OUTER JOIN, you can curry pd.merge
using functools.partial
:
from functools import partial
outer_merge = partial(pd.merge, how='outer')
reduce(outer_merge, dfs)
key valueA valueB valueC
0 A 0.056165 NaN NaN
1 B -1.165150 -1.536244 NaN
2 C 0.900826 NaN 1.0
3 D 0.465662 1.488252 1.0
4 E NaN 1.895889 1.0
5 F NaN 1.178780 NaN
6 J NaN NaN 1.0
As you may have noticed, this is quite powerful—you can also use it to control column names during merge. Simply add more keyword arguments as needed:
partial(pd.merge, how='outer', left_index=True, right_on=...)
The alternative: pd.concat
If your column values are unique, then it makes sense to use pd.concat
, this is faster than a two-at-a-time multi-way merge.
pd.concat([
df.set_index('key') for df in dfs], axis=1, join='inner'
).reset_index()
key valueA valueB valueC
0 D 0.465662 1.488252 1.0
Multiway merge on unique indexes
If you are merging multiple DataFrames on unique indexes, you should once again prefer pd.concat
for better performance.
# Note, the "key" column values are unique, so the index is unique.
A2 = A.set_index('key')
B2 = B.set_index('key')
C2 = C.set_index('key')
dfs2 = [A2, B2, C2]
pd.concat(dfs2, axis=1, sort=False, join='inner')
valueA valueB valueC
key
D 0.922207 -1.099401 1.0
As always, omit join='inner'
for a FULL OUTER JOIN.
Multiway merge on indexes with duplicatesconcat
is fast, but has its shortcomings. It cannot handle duplicates.
A3 = pd.DataFrame({'key': ['A', 'B', 'C', 'D', 'D'], 'valueA': np.random.randn(5)})
pd.concat([df.set_index('key') for df in [A3, B, C]], axis=1, join='inner')
ValueError: Shape of passed values is (3, 4), indices imply (3, 2)
In this situation, join
is the best option, since it can handle non-unique indexes (join
calls merge
under the hood).
# For inner join. For left join, pass `pd.DataFrame.join` directly to `reduce`.
inner_join = partial(pd.DataFrame.join, how='inner')
reduce(inner_join, [A3.set_index('key'), B2, C2])
valueA valueB valueC
key
D -0.674333 -1.099401 1.0
D 0.031831 -1.099401 1.0
This post aims to give readers a primer on SQL-flavoured merging with pandas, how to use it, and when not to use it.
In particular, here's what this post will go through:
The basics - types of joins (LEFT, RIGHT, OUTER, INNER)
- merging with different column names
- avoiding duplicate merge key column in output
- Merging with index under different conditions
- effectively using your named index
- merge key as the index of one and column of another
- Multiway merges on columns and indexes (unique and non-unique)
- Notable alternatives to
merge
andjoin
What this post will not go through:
- Performance-related discussions and timings (for now). Mostly notable mentions of better alternatives, wherever appropriate.
- Handling suffixes, removing extra columns, renaming outputs, and other specific use cases. There are other (read: better) posts that deal with that, so figure it out!
Note
Most examples default to FULL INNER JOIN operations while demonstrating various features, unless otherwise specified.
Furthermore, all the DataFrames here can be copied and replicated so you can play with them. Also, see this post on how to read DataFrames from your clipboard.
Enough Talk, just show me how to use merge
!
Setup
np.random.seed(0)
left = pd.DataFrame({'key': ['A', 'B', 'C', 'D'], 'value': np.random.randn(4)})
right = pd.DataFrame({'key': ['B', 'D', 'E', 'F'], 'value': np.random.randn(4)})
left
key value
0 A 1.494079
1 B -0.205158
2 C 0.313068
3 D -0.854096
right
key value
0 B -2.552990
1 D 0.653619
2 E 0.864436
3 F -0.742165
For the sake of simplicity, the key column has the same name (for now).
To perform a FULL INNER JOIN, call pd.merge
specifying the left DataFrame, the right DataFrame, and the join key.
pd.merge(left, right, on='key')
key value_x value_y
0 B 0.400157 1.867558
1 D 2.240893 -0.977278
In more recent versions of pandas (v0.21 or so), merge
is now a first order function, so you can call DataFrame.merge
.
left.merge(right, on='key')
# Or, if you want to be explicit
# left.merge(right, on='key', how='inner')
key value_x value_y
0 B 0.400157 1.867558
1 D 2.240893 -0.977278
A LEFT OUTER JOIN would be done by specifying how='left'
.
left.merge(right, on='key', how='left')
key value_x value_y
0 A 1.764052 NaN
1 B 0.400157 1.867558
2 C 0.978738 NaN
3 D 2.240893 -0.977278
Carefully note the placement of NaNs here. If you specify how='left'
, then only keys from left
are used, and missing data from right
is replaced by NaN.
And similarly, specify how='right'
...
left.merge(right, on='key', how='right')
key value_x value_y
0 B 0.400157 1.867558
1 D 2.240893 -0.977278
2 E NaN 0.950088
3 F NaN -0.151357
...for a RIGHT OUTER JOIN. Here, keys from right
are used, and missing data from left
is replaced by NaN.
Finally, specify how='outer'
...
left.merge(right, on='key', how='outer')
key value_x value_y
0 A 1.764052 NaN
1 B 0.400157 1.867558
2 C 0.978738 NaN
3 D 2.240893 -0.977278
4 E NaN 0.950088
5 F NaN -0.151357
...for a FULL OUTER JOIN. This uses the keys from both frames, and NaNs are inserted for missing rows in both.
The documentation summarises these various merges nicely:
Other JOINs - LEFT, RIGHT, ANTI-JOINs
If you need LEFT JOINs and RIGHT JOINs in two steps.
For LEFT JOIN, start by performing a LEFT OUTER JOIN and then filtering rows coming from left
only,
(left.merge(right, on='key', how='left', indicator=True)
.query('_merge == "left_only"')
.drop('_merge', 1))
key value_x value_y
0 A 1.764052 NaN
2 C 0.978738 NaN
Where,
left.merge(right, on='key', how='left', indicator=True)
key value_x value_y _merge
0 A 1.764052 NaN left_only
1 B 0.400157 1.867558 both
2 C 0.978738 NaN left_only
3 D 2.240893 -0.977278 both
And similarly, for a RIGHT JOIN,
(left.merge(right, on='key', how='right', indicator=True)
.query('_merge == "right_only"')
.drop('_merge', 1))
key value_x value_y
2 E NaN 0.950088
3 F NaN -0.151357
Lastly, if you are required to do a merge that only retains keys from the left or right, but not both (IOW, performing an ANTI-JOIN), you can do this in similar fashion—
(left.merge(right, on='key', how='outer', indicator=True)
.query('_merge != "both"')
.drop('_merge', 1))
key value_x value_y
0 A 1.764052 NaN
2 C 0.978738 NaN
4 E NaN 0.950088
5 F NaN -0.151357
Different names for key columns
If the key columns are named differently—for example, left
has keyLeft
, and right
has keyRight
instead of key
—then you will have to specify left_on
and right_on
as arguments instead of on
:
left2 = left.rename({'key':'keyLeft'}, axis=1)
right2 = right.rename({'key':'keyRight'}, axis=1)
left2
keyLeft value
0 A 0.706573
1 B 0.010500
2 C 1.785870
3 D 0.126912
right2
keyRight value
0 B 0.401989
1 D 1.883151
2 E -1.347759
3 F -1.270485
left2.merge(right2, left_on='keyLeft', right_on='keyRight', how='inner')
keyLeft value_x keyRight value_y
0 B 0.010500 B 0.401989
1 D 0.126912 D 1.883151
Avoiding duplicate key column in output
When merging on keyLeft
from left
and keyRight
from right
, if you only want either of the keyLeft
or keyRight
(but not both) in the output, you can start by setting the index as a preliminary step.
left3 = left2.set_index('keyLeft')
left3.merge(right2, left_index=True, right_on='keyRight')
value_x keyRight value_y
0 0.010500 B 0.401989
1 0.126912 D 1.883151
Contrast this with the output of the command just before (thst is, the output of left2.merge(right2, left_on='keyLeft', right_on='keyRight', how='inner')
), you'll notice keyLeft
is missing. You can figure out what column to keep based on which frame's index is set as the key. This may matter when, say, performing some OUTER JOIN operation.
Merging only a single column from one of the DataFrames
For example, consider
right3 = right.assign(newcol=np.arange(len(right)))
right3
key value newcol
0 B 1.867558 0
1 D -0.977278 1
2 E 0.950088 2
3 F -0.151357 3
If you are required to merge only "new_val" (without any of the other columns), you can usually just subset columns before merging:
left.merge(right3[['key', 'newcol']], on='key')
key value newcol
0 B 0.400157 0
1 D 2.240893 1
If you're doing a LEFT OUTER JOIN, a more performant solution would involve map
:
# left['newcol'] = left['key'].map(right3.set_index('key')['newcol']))
left.assign(newcol=left['key'].map(right3.set_index('key')['newcol']))
key value newcol
0 A 1.764052 NaN
1 B 0.400157 0.0
2 C 0.978738 NaN
3 D 2.240893 1.0
As mentioned, this is similar to, but faster than
left.merge(right3[['key', 'newcol']], on='key', how='left')
key value newcol
0 A 1.764052 NaN
1 B 0.400157 0.0
2 C 0.978738 NaN
3 D 2.240893 1.0
Merging on multiple columns
To join on more than one column, specify a list for on
(or left_on
and right_on
, as appropriate).
left.merge(right, on=['key1', 'key2'] ...)
Or, in the event the names are different,
left.merge(right, left_on=['lkey1', 'lkey2'], right_on=['rkey1', 'rkey2'])
Other useful merge*
operations and functions
Merging a DataFrame with Series on index: See this answer.Besides
merge
,DataFrame.update
andDataFrame.combine_first
are also used in certain cases to update one DataFrame with another.pd.merge_ordered
is a useful function for ordered JOINs.pd.merge_asof
(read: merge_asOf) is useful for approximate joins.
This section only covers the very basics, and is designed to only whet your appetite. For more examples and cases, see the documentation on merge
, join
, and concat
as well as the links to the function specs.
Index-based *-JOIN (+ index-column merge
s)
Setup
left = pd.DataFrame({'value': np.random.randn(4)}, index=['A', 'B', 'C', 'D'])
right = pd.DataFrame({'value': np.random.randn(4)}, index=['B', 'D', 'E', 'F'])
left.index.name = right.index.name = 'idxkey'
left
value
idxkey
A 2.269755
B -1.454366
C 0.045759
D -0.187184
right
value
idxkey
B 1.532779
D 1.469359
E 0.154947
F 0.378163
Typically, a merge on index would look like this:
left.merge(right, left_index=True, right_index=True)
value_x value_y
idxkey
B 0.410599 0.761038
D 1.454274 0.121675
Support for index names
If your index is named, then v0.23 users can also specify the level name to on
(or left_on
and right_on
as necessary).
left.merge(right, on='idxkey')
value_x value_y
idxkey
B 0.410599 0.761038
D 1.454274 0.121675
Merge with index of one, columns of another
It is possible (and quite simple) to use the index of one, and the column of another, to perform a merge. For example,
left.merge(right, left_on='key1', right_index=True)
Or vice versa (right_on=...
and left_index=True`).
right2 = right.reset_index().rename({'idxkey' : 'colkey'}, axis=1)
right2
colkey value
0 B 1.222445
1 D 0.208275
2 E 0.976639
3 F 0.356366
left.merge(right2, left_index=True, right_on='colkey')
value_x colkey value_y
0 -1.070753 B 1.222445
1 -0.403177 D 0.208275
In this special case, the index for left
is named, so you can also use the index name with left_on
, like this:
left.merge(right2, left_on='idxkey', right_on='colkey')
value_x colkey value_y
0 -1.070753 B 1.222445
1 -0.403177 D 0.208275
DataFrame.join
Besides these, there is another succinct option. You can use DataFrame.join
which defaults to joins on the index. DataFrame.join
does a LEFT OUTER JOIN by default, so how='inner'
is necessary here.
left.join(right, how='inner', lsuffix='_x', rsuffix='_y')
value_x value_y
idxkey
B 0.410599 0.761038
D 1.454274 0.121675
Note that I needed to specify the lsuffix
and rsuffix
arguments since join
would otherwise error out:
left.join(right)
ValueError: columns overlap but no suffix specified: Index(['value'], dtype='object')
Since the column names are the same. This would not be a problem if they were differently named.
left.rename(columns={'value':'leftvalue'}).join(right, how='inner')
leftvalue value
idxkey
B -1.454366 1.532779
D -0.187184 1.469359
pd.concat
Lastly, as an alternative for index-based joins, you can use pd.concat
:
pd.concat([left, right], axis=1, sort=False, join='inner')
value value
idxkey
B -1.980796 1.230291
D 0.156349 1.202380
Omit join='inner'
if you need a FULL OUTER JOIN (the default):
pd.concat([left, right], axis=1, sort=False)
value value
A -0.887786 NaN
B -1.980796 1.230291
C -0.347912 NaN
D 0.156349 1.202380
E NaN -0.387327
F NaN -0.302303
For more information, see this canonical post on pd.concat
by @piRSquared.
Generalizing: merge
ing multiple DataFrames
Setup
A = pd.DataFrame({'key': ['A', 'B', 'C', 'D'], 'valueA': np.random.randn(4)})
B = pd.DataFrame({'key': ['B', 'D', 'E', 'F'], 'valueB': np.random.randn(4)})
C = pd.DataFrame({'key': ['D', 'E', 'J', 'C'], 'valueC': np.ones(4)})
dfs = [A, B, C]
Oftentimes, the situation arises when multiple DataFrames are to be merged together. Naively, this can be done by chaining merge
calls:
A.merge(B, on='key').merge(C, on='key')
key valueA valueB valueC
0 D 0.922207 -1.099401 1.0
However, this quickly gets out of hand for many DataFrames. Furthermore, it may be necessary to generalise for an unknown number of DataFrames. To do this, one often used simple trick is with functools.reduce
, and you can use it to achieve a FULL INNER JOIN like so:
from functools import reduce
reduce(pd.merge, dfs)
key valueA valueB valueC
0 D 0.465662 1.488252 1.0
Note that every column besides the "key" column should be differently named for this to work out-of-box. Otherwise, you may need to use a lambda
.
For a FULL OUTER JOIN, you can curry pd.merge
using functools.partial
:
from functools import partial
outer_merge = partial(pd.merge, how='outer')
reduce(outer_merge, dfs)
key valueA valueB valueC
0 A 0.056165 NaN NaN
1 B -1.165150 -1.536244 NaN
2 C 0.900826 NaN 1.0
3 D 0.465662 1.488252 1.0
4 E NaN 1.895889 1.0
5 F NaN 1.178780 NaN
6 J NaN NaN 1.0
As you may have noticed, this is quite powerful—you can also use it to control column names during merge. Simply add more keyword arguments as needed:
partial(pd.merge, how='outer', left_index=True, right_on=...)
The alternative: pd.concat
If your column values are unique, then it makes sense to use pd.concat
, this is faster than a two-at-a-time multi-way merge.
pd.concat([
df.set_index('key') for df in dfs], axis=1, join='inner'
).reset_index()
key valueA valueB valueC
0 D 0.465662 1.488252 1.0
Multiway merge on unique indexes
If you are merging multiple DataFrames on unique indexes, you should once again prefer pd.concat
for better performance.
# Note, the "key" column values are unique, so the index is unique.
A2 = A.set_index('key')
B2 = B.set_index('key')
C2 = C.set_index('key')
dfs2 = [A2, B2, C2]
pd.concat(dfs2, axis=1, sort=False, join='inner')
valueA valueB valueC
key
D 0.922207 -1.099401 1.0
As always, omit join='inner'
for a FULL OUTER JOIN.
Multiway merge on indexes with duplicatesconcat
is fast, but has its shortcomings. It cannot handle duplicates.
A3 = pd.DataFrame({'key': ['A', 'B', 'C', 'D', 'D'], 'valueA': np.random.randn(5)})
pd.concat([df.set_index('key') for df in [A3, B, C]], axis=1, join='inner')
ValueError: Shape of passed values is (3, 4), indices imply (3, 2)
In this situation, join
is the best option, since it can handle non-unique indexes (join
calls merge
under the hood).
# For inner join. For left join, pass `pd.DataFrame.join` directly to `reduce`.
inner_join = partial(pd.DataFrame.join, how='inner')
reduce(inner_join, [A3.set_index('key'), B2, C2])
valueA valueB valueC
key
D -0.674333 -1.099401 1.0
D 0.031831 -1.099401 1.0
edited yesterday
answered Dec 6 at 6:41
coldspeed
114k18104182
114k18104182
2
This should be a canonical post as these kinds of questions keep popping up and sometimes never marked as duplicate +1
– EdChum
Dec 7 at 9:54
@jpp As an alternative to separate canonical posts, another option (that I've seen done before) would be to split each topic into separate answers, and have a table of contents in the question. Would that make more sense?
– coldspeed
Dec 8 at 3:24
3
@AlexRiley The thought has been niggling at my mind over the past few days, and I have been giving serious thought into splitting this behemoth into three smaller canonicals - one on merging basics, another on index based merge, and the third on computing the cartesian product. I will be more than happy to create new canonicals and re-mark duplicates as appropriate.
– coldspeed
Dec 9 at 16:26
1
Looks great! Thanks for taking the feedback on board and acting on it so quickly - the new post is (again) a great addition to SO. It's tricky to know how to best split the topic of merging into useful/usable posts, but I think your suggestion of basics/index/Cartesian-product is the right way to go.
– Alex Riley
Dec 10 at 11:31
1
@S.V I don't thinkalign
is appropriate for this post, but I can explain how it works and hopefully you'll understand why.align
is used to make the indexes of two DataFrames compatible with each other by inserting NaNs for missing keys. It's almost like both frames are reindexed on the union of the indices. The return value ofleft.align(right)
is(left_aligned, right_aligned)
. Hope that makes sense.
– coldspeed
2 days ago
|
show 2 more comments
2
This should be a canonical post as these kinds of questions keep popping up and sometimes never marked as duplicate +1
– EdChum
Dec 7 at 9:54
@jpp As an alternative to separate canonical posts, another option (that I've seen done before) would be to split each topic into separate answers, and have a table of contents in the question. Would that make more sense?
– coldspeed
Dec 8 at 3:24
3
@AlexRiley The thought has been niggling at my mind over the past few days, and I have been giving serious thought into splitting this behemoth into three smaller canonicals - one on merging basics, another on index based merge, and the third on computing the cartesian product. I will be more than happy to create new canonicals and re-mark duplicates as appropriate.
– coldspeed
Dec 9 at 16:26
1
Looks great! Thanks for taking the feedback on board and acting on it so quickly - the new post is (again) a great addition to SO. It's tricky to know how to best split the topic of merging into useful/usable posts, but I think your suggestion of basics/index/Cartesian-product is the right way to go.
– Alex Riley
Dec 10 at 11:31
1
@S.V I don't thinkalign
is appropriate for this post, but I can explain how it works and hopefully you'll understand why.align
is used to make the indexes of two DataFrames compatible with each other by inserting NaNs for missing keys. It's almost like both frames are reindexed on the union of the indices. The return value ofleft.align(right)
is(left_aligned, right_aligned)
. Hope that makes sense.
– coldspeed
2 days ago
2
2
This should be a canonical post as these kinds of questions keep popping up and sometimes never marked as duplicate +1
– EdChum
Dec 7 at 9:54
This should be a canonical post as these kinds of questions keep popping up and sometimes never marked as duplicate +1
– EdChum
Dec 7 at 9:54
@jpp As an alternative to separate canonical posts, another option (that I've seen done before) would be to split each topic into separate answers, and have a table of contents in the question. Would that make more sense?
– coldspeed
Dec 8 at 3:24
@jpp As an alternative to separate canonical posts, another option (that I've seen done before) would be to split each topic into separate answers, and have a table of contents in the question. Would that make more sense?
– coldspeed
Dec 8 at 3:24
3
3
@AlexRiley The thought has been niggling at my mind over the past few days, and I have been giving serious thought into splitting this behemoth into three smaller canonicals - one on merging basics, another on index based merge, and the third on computing the cartesian product. I will be more than happy to create new canonicals and re-mark duplicates as appropriate.
– coldspeed
Dec 9 at 16:26
@AlexRiley The thought has been niggling at my mind over the past few days, and I have been giving serious thought into splitting this behemoth into three smaller canonicals - one on merging basics, another on index based merge, and the third on computing the cartesian product. I will be more than happy to create new canonicals and re-mark duplicates as appropriate.
– coldspeed
Dec 9 at 16:26
1
1
Looks great! Thanks for taking the feedback on board and acting on it so quickly - the new post is (again) a great addition to SO. It's tricky to know how to best split the topic of merging into useful/usable posts, but I think your suggestion of basics/index/Cartesian-product is the right way to go.
– Alex Riley
Dec 10 at 11:31
Looks great! Thanks for taking the feedback on board and acting on it so quickly - the new post is (again) a great addition to SO. It's tricky to know how to best split the topic of merging into useful/usable posts, but I think your suggestion of basics/index/Cartesian-product is the right way to go.
– Alex Riley
Dec 10 at 11:31
1
1
@S.V I don't think
align
is appropriate for this post, but I can explain how it works and hopefully you'll understand why. align
is used to make the indexes of two DataFrames compatible with each other by inserting NaNs for missing keys. It's almost like both frames are reindexed on the union of the indices. The return value of left.align(right)
is (left_aligned, right_aligned)
. Hope that makes sense.– coldspeed
2 days ago
@S.V I don't think
align
is appropriate for this post, but I can explain how it works and hopefully you'll understand why. align
is used to make the indexes of two DataFrames compatible with each other by inserting NaNs for missing keys. It's almost like both frames are reindexed on the union of the indices. The return value of left.align(right)
is (left_aligned, right_aligned)
. Hope that makes sense.– coldspeed
2 days ago
|
show 2 more comments
Thanks for contributing an answer to Stack Overflow!
- Please be sure to answer the question. Provide details and share your research!
But avoid …
- Asking for help, clarification, or responding to other answers.
- Making statements based on opinion; back them up with references or personal experience.
To learn more, see our tips on writing great answers.
Some of your past answers have not been well-received, and you're in danger of being blocked from answering.
Please pay close attention to the following guidance:
- Please be sure to answer the question. Provide details and share your research!
But avoid …
- Asking for help, clarification, or responding to other answers.
- Making statements based on opinion; back them up with references or personal experience.
To learn more, see our tips on writing great answers.
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53645882%2fpandas-merging-101%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
1
Note regarding duplicate closures: Older, upvotes posts can be closed in favour of newer, more canonical, or up-to-date QnAs as per meta's policy; see case 1, case 2, case 3 and case 4 on MSE.
– coldspeed
Dec 8 at 3:13
Information on (efficiently) computing the cartesian product of two (or more) DataFrames can be found at this link: Performant cartesian product (CROSS JOIN) of two pandas DataFrames
– coldspeed
Dec 10 at 3:39