Pandas de-duplication and return list of indexes that were duplicates
I have a pandas dataframe with 500k rows, structured like this, where the document column contains strings:

   document_id                                           document
0            0                               Here is our forecast
1            1  Traveling to have a business meeting takes the...
2            2                      test successful. way to go!!!
3            3  Randy, Can you send me a schedule of the salar...
4            4                  Let's shoot for Tuesday at 11:45.

When I de-dupe the dataframe based on the contents of the document column using df.drop_duplicates(subset='document'), I end up with half the number of documents.

Now that I have my original dataframe and a second dataframe with the unique set of document values, I would like to compare the two to get a list of document_id's that are duplicates.

For example, if the associated document for document_ids 4, 93, and 275 is 'Let's shoot for Tuesday at 11:45.', then how do I get a dataframe with document in one column and the list of associated duplicate document_id's in another column?

   document_ids                           document
...
4  [4, 93, 275]  Let's shoot for Tuesday at 11:45.

I know that I could use a for loop, compare each document to every other document in the dataframe, and save all matches, but I am trying to avoid iterating over 500k lines multiple times. What is the most pythonic way of going about this?
python pandas dataframe
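For context, a minimal sketch of the behavior described above, using a made-up five-row frame (the ids and text are illustrative, not the real data):

```python
import pandas as pd

# Hypothetical miniature of the 500k-row frame described in the question.
df = pd.DataFrame({
    'document_id': [0, 1, 2, 3, 4],
    'document': [
        'Here is our forecast',
        "Let's shoot for Tuesday at 11:45.",
        'test successful. way to go!!!',
        "Let's shoot for Tuesday at 11:45.",
        "Let's shoot for Tuesday at 11:45.",
    ],
})

# drop_duplicates keeps only the first row for each distinct document,
# which is why the row count shrinks.
unique = df.drop_duplicates(subset='document')
print(len(df), len(unique))  # 5 3
```

The deduplicated frame tells you which documents are unique, but by itself it no longer records which document_ids were collapsed together, which is the information the question is after.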
I think you should go back to your original duplicated DataFrame and do df.groupby('document').document_id.apply(list)
– ALollz
Nov 20 '18 at 21:19
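A small sketch of what that one-liner produces, on a toy frame standing in for the original (pre-dedupe) data:

```python
import pandas as pd

# Toy stand-in for the original DataFrame, before any drop_duplicates.
df = pd.DataFrame({
    'document_id': [0, 1, 2, 3, 4],
    'document': ['a', 'b', 'a', 'c', 'a'],
})

# One row per distinct document, with every document_id that carried
# that document collected into a list.
grouped = df.groupby('document').document_id.apply(list).reset_index()
# 'a' -> [0, 2, 4], 'b' -> [1], 'c' -> [3]
```

Note that this keeps singletons too ('b' and 'c' each get a one-element list); filtering with duplicated(keep=False) first, as the answer below does, restricts the result to documents that actually repeat.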
asked Nov 20 '18 at 21:16
Gerrrald1
1 Answer
I would like to compare the two to get a list of document_id's that are duplicates.

You should be able to do this using your "initial" DataFrame with .duplicated(keep=False). Here's an example:

In [1]: import pandas as pd

In [2]: df = pd.DataFrame({
   ...:     'document_id': range(10),
   ...:     'document': list('abcabcdedb')  # note: 'e' is not duplicated
   ...: })

In [3]: dupes = df.document.duplicated(keep=False)

In [4]: df.loc[dupes].groupby('document')['document_id'].apply(list).reset_index()
Out[4]:
  document document_id
0        a      [0, 3]
1        b   [1, 4, 9]
2        c      [2, 5]
3        d      [6, 8]
Awesome, you make it look easy haha, makes sense now. I was using the default of keep='first'.
– Gerrrald1
Nov 22 '18 at 1:08
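The keep parameter the comment mentions is the crux: a quick sketch of the difference between the default and keep=False:

```python
import pandas as pd

s = pd.Series(['a', 'b', 'a', 'c', 'a'])

# keep='first' (the default): the first occurrence of each value is NOT
# flagged, so the 'a' at position 0 comes back False.
print(s.duplicated().tolist())            # [False, False, True, False, True]

# keep=False: every member of a duplicated group is flagged, which is what
# lets the groupby in the answer see all three 'a' rows.
print(s.duplicated(keep=False).tolist())  # [True, False, True, False, True]
```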
answered Nov 20 '18 at 23:12
Brad Solomon