Pandas de-duplication and return list of indexes that were duplicates












2















I have a pandas dataframe with 500k rows. Structured like this, where the document column are strings:



   document_id                                           document
0 0 Here is our forecast
1 1 Traveling to have a business meeting takes the...
2 2 test successful. way to go!!!
3 3 Randy, Can you send me a schedule of the salar...
4 4 Let's shoot for Tuesday at 11:45.


When I de-dupe the dataframe based on the contents of the document column using df.drop_duplicates(subset='document'), I end up with half the number of documents.



Now that I have my original dataframe and a second dataframe with the unique set of document values, I would like to compare the two to get a list of document_id's that are duplicates.



For example, if the associated document for document_id 4, 93, and 275 are all 'Let's shoot for Tuesday at 11:45.', then how do I get a dataframe with document in one column, and list of associated duplicate document_id's in another column?



     document_ids                                           document    
...
4 [4, 93, 275] Let's shoot for Tuesday at 11:45.


I know that I could use a for loop, and compare each document every other document in the dataframe, and save all matches, but I am trying to avoid iterating over 500k lines multiple times. What instead is the most pythonic way of going about this?










share|improve this question


















  • 2





    I think you should go back to your original duplicated DataFrame and do df.groupby('document').document_id.apply(list)

    – ALollz
    Nov 20 '18 at 21:19
















2















I have a pandas dataframe with 500k rows. Structured like this, where the document column are strings:



   document_id                                           document
0 0 Here is our forecast
1 1 Traveling to have a business meeting takes the...
2 2 test successful. way to go!!!
3 3 Randy, Can you send me a schedule of the salar...
4 4 Let's shoot for Tuesday at 11:45.


When I de-dupe the dataframe based on the contents of the document column using df.drop_duplicates(subset='document'), I end up with half the number of documents.



Now that I have my original dataframe and a second dataframe with the unique set of document values, I would like to compare the two to get a list of document_id's that are duplicates.



For example, if the associated document for document_id 4, 93, and 275 are all 'Let's shoot for Tuesday at 11:45.', then how do I get a dataframe with document in one column, and list of associated duplicate document_id's in another column?



     document_ids                                           document    
...
4 [4, 93, 275] Let's shoot for Tuesday at 11:45.


I know that I could use a for loop, and compare each document every other document in the dataframe, and save all matches, but I am trying to avoid iterating over 500k lines multiple times. What instead is the most pythonic way of going about this?










share|improve this question


















  • 2





    I think you should go back to your original duplicated DataFrame and do df.groupby('document').document_id.apply(list)

    – ALollz
    Nov 20 '18 at 21:19














2












2








2








I have a pandas dataframe with 500k rows. Structured like this, where the document column are strings:



   document_id                                           document
0 0 Here is our forecast
1 1 Traveling to have a business meeting takes the...
2 2 test successful. way to go!!!
3 3 Randy, Can you send me a schedule of the salar...
4 4 Let's shoot for Tuesday at 11:45.


When I de-dupe the dataframe based on the contents of the document column using df.drop_duplicates(subset='document'), I end up with half the number of documents.



Now that I have my original dataframe and a second dataframe with the unique set of document values, I would like to compare the two to get a list of document_id's that are duplicates.



For example, if the associated document for document_id 4, 93, and 275 are all 'Let's shoot for Tuesday at 11:45.', then how do I get a dataframe with document in one column, and list of associated duplicate document_id's in another column?



     document_ids                                           document    
...
4 [4, 93, 275] Let's shoot for Tuesday at 11:45.


I know that I could use a for loop, and compare each document every other document in the dataframe, and save all matches, but I am trying to avoid iterating over 500k lines multiple times. What instead is the most pythonic way of going about this?










share|improve this question














I have a pandas dataframe with 500k rows. Structured like this, where the document column are strings:



   document_id                                           document
0 0 Here is our forecast
1 1 Traveling to have a business meeting takes the...
2 2 test successful. way to go!!!
3 3 Randy, Can you send me a schedule of the salar...
4 4 Let's shoot for Tuesday at 11:45.


When I de-dupe the dataframe based on the contents of the document column using df.drop_duplicates(subset='document'), I end up with half the number of documents.



Now that I have my original dataframe and a second dataframe with the unique set of document values, I would like to compare the two to get a list of document_id's that are duplicates.



For example, if the associated document for document_id 4, 93, and 275 are all 'Let's shoot for Tuesday at 11:45.', then how do I get a dataframe with document in one column, and list of associated duplicate document_id's in another column?



     document_ids                                           document    
...
4 [4, 93, 275] Let's shoot for Tuesday at 11:45.


I know that I could use a for loop, and compare each document every other document in the dataframe, and save all matches, but I am trying to avoid iterating over 500k lines multiple times. What instead is the most pythonic way of going about this?







python pandas dataframe






share|improve this question













share|improve this question











share|improve this question




share|improve this question










asked Nov 20 '18 at 21:16









Gerrrald1Gerrrald1

396




396








  • 2





    I think you should go back to your original duplicated DataFrame and do df.groupby('document').document_id.apply(list)

    – ALollz
    Nov 20 '18 at 21:19














  • 2





    I think you should go back to your original duplicated DataFrame and do df.groupby('document').document_id.apply(list)

    – ALollz
    Nov 20 '18 at 21:19








2




2





I think you should go back to your original duplicated DataFrame and do df.groupby('document').document_id.apply(list)

– ALollz
Nov 20 '18 at 21:19





I think you should go back to your original duplicated DataFrame and do df.groupby('document').document_id.apply(list)

– ALollz
Nov 20 '18 at 21:19












1 Answer
1






active

oldest

votes


















1















I would like to compare the two to get a list of document_id's that are duplicates.




You should be able to do this using your "initial" DataFrame with .duplicated(keep=False). Here's an example:



In [1]: import pandas as pd                                                                                                                                   

In [2]: df = pd.DataFrame({
...: 'document_id': range(10),
...: 'document': list('abcabcdedb') # msg 'e' is not duplicated
...: })

In [3]: dupes = df.document.duplicated(keep=False)
In [4]: df.loc[dupes].groupby('document')['document_id'].apply(list).reset_index()
Out[4]:
document document_id
0 a [0, 3]
1 b [1, 4, 9]
2 c [2, 5]
3 d [6, 8]





share|improve this answer



















  • 1





    Awesome, you make it look easy haha, makes sense now, I was using the default of keep=first.

    – Gerrrald1
    Nov 22 '18 at 1:08













Your Answer






StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");

StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "1"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});

function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});


}
});














draft saved

draft discarded


















StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53401655%2fpandas-de-duplication-and-return-list-of-indexes-that-were-duplicates%23new-answer', 'question_page');
}
);

Post as a guest















Required, but never shown

























1 Answer
1






active

oldest

votes








1 Answer
1






active

oldest

votes









active

oldest

votes






active

oldest

votes









1















I would like to compare the two to get a list of document_id's that are duplicates.




You should be able to do this using your "initial" DataFrame with .duplicated(keep=False). Here's an example:



In [1]: import pandas as pd                                                                                                                                   

In [2]: df = pd.DataFrame({
...: 'document_id': range(10),
...: 'document': list('abcabcdedb') # msg 'e' is not duplicated
...: })

In [3]: dupes = df.document.duplicated(keep=False)
In [4]: df.loc[dupes].groupby('document')['document_id'].apply(list).reset_index()
Out[4]:
document document_id
0 a [0, 3]
1 b [1, 4, 9]
2 c [2, 5]
3 d [6, 8]





share|improve this answer



















  • 1





    Awesome, you make it look easy haha, makes sense now, I was using the default of keep=first.

    – Gerrrald1
    Nov 22 '18 at 1:08


















1















I would like to compare the two to get a list of document_id's that are duplicates.




You should be able to do this using your "initial" DataFrame with .duplicated(keep=False). Here's an example:



In [1]: import pandas as pd                                                                                                                                   

In [2]: df = pd.DataFrame({
...: 'document_id': range(10),
...: 'document': list('abcabcdedb') # msg 'e' is not duplicated
...: })

In [3]: dupes = df.document.duplicated(keep=False)
In [4]: df.loc[dupes].groupby('document')['document_id'].apply(list).reset_index()
Out[4]:
document document_id
0 a [0, 3]
1 b [1, 4, 9]
2 c [2, 5]
3 d [6, 8]





share|improve this answer



















  • 1





    Awesome, you make it look easy haha, makes sense now, I was using the default of keep=first.

    – Gerrrald1
    Nov 22 '18 at 1:08
















1












1








1








I would like to compare the two to get a list of document_id's that are duplicates.




You should be able to do this using your "initial" DataFrame with .duplicated(keep=False). Here's an example:



In [1]: import pandas as pd                                                                                                                                   

In [2]: df = pd.DataFrame({
...: 'document_id': range(10),
...: 'document': list('abcabcdedb') # msg 'e' is not duplicated
...: })

In [3]: dupes = df.document.duplicated(keep=False)
In [4]: df.loc[dupes].groupby('document')['document_id'].apply(list).reset_index()
Out[4]:
document document_id
0 a [0, 3]
1 b [1, 4, 9]
2 c [2, 5]
3 d [6, 8]





share|improve this answer














I would like to compare the two to get a list of document_id's that are duplicates.




You should be able to do this using your "initial" DataFrame with .duplicated(keep=False). Here's an example:



In [1]: import pandas as pd                                                                                                                                   

In [2]: df = pd.DataFrame({
...: 'document_id': range(10),
...: 'document': list('abcabcdedb') # msg 'e' is not duplicated
...: })

In [3]: dupes = df.document.duplicated(keep=False)
In [4]: df.loc[dupes].groupby('document')['document_id'].apply(list).reset_index()
Out[4]:
document document_id
0 a [0, 3]
1 b [1, 4, 9]
2 c [2, 5]
3 d [6, 8]






share|improve this answer












share|improve this answer



share|improve this answer










answered Nov 20 '18 at 23:12









Brad SolomonBrad Solomon

13.7k83486




13.7k83486








  • 1





    Awesome, you make it look easy haha, makes sense now, I was using the default of keep=first.

    – Gerrrald1
    Nov 22 '18 at 1:08
















  • 1





    Awesome, you make it look easy haha, makes sense now, I was using the default of keep=first.

    – Gerrrald1
    Nov 22 '18 at 1:08










1




1





Awesome, you make it look easy haha, makes sense now, I was using the default of keep=first.

– Gerrrald1
Nov 22 '18 at 1:08







Awesome, you make it look easy haha, makes sense now, I was using the default of keep=first.

– Gerrrald1
Nov 22 '18 at 1:08






















draft saved

draft discarded




















































Thanks for contributing an answer to Stack Overflow!


  • Please be sure to answer the question. Provide details and share your research!

But avoid



  • Asking for help, clarification, or responding to other answers.

  • Making statements based on opinion; back them up with references or personal experience.


To learn more, see our tips on writing great answers.




draft saved


draft discarded














StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53401655%2fpandas-de-duplication-and-return-list-of-indexes-that-were-duplicates%23new-answer', 'question_page');
}
);

Post as a guest















Required, but never shown





















































Required, but never shown














Required, but never shown












Required, but never shown







Required, but never shown

































Required, but never shown














Required, but never shown












Required, but never shown







Required, but never shown







Popular posts from this blog

How to change which sound is reproduced for terminal bell?

Can I use Tabulator js library in my java Spring + Thymeleaf project?

Title Spacing in Bjornstrup Chapter, Removing Chapter Number From Contents