Pandas de-duplication and return list of indexes that were duplicates

I have a pandas dataframe with 500k rows, structured like this, where the document column contains strings:



   document_id                                            document
0            0                                Here is our forecast
1            1  Traveling to have a business meeting takes the...
2            2                       test successful. way to go!!!
3            3  Randy, Can you send me a schedule of the salar...
4            4                   Let's shoot for Tuesday at 11:45.


When I de-dupe the dataframe based on the contents of the document column using df.drop_duplicates(subset='document'), I end up with half the number of documents.
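
For reference, a minimal sketch of that de-dupe step on toy data (the column names match the question; the data itself is made up for illustration):

    import pandas as pd

    # toy frame with repeated document text
    df = pd.DataFrame({
        'document_id': [0, 1, 2, 3],
        'document': ['hello', 'world', 'hello', 'world'],
    })

    # drop_duplicates keeps the first occurrence of each distinct document string
    unique_docs = df.drop_duplicates(subset='document')
    print(unique_docs)
    #    document_id document
    # 0            0    hello
    # 1            1    world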



Now that I have my original dataframe and a second dataframe with the unique set of document values, I would like to compare the two to get a list of document_id's that are duplicates.



For example, if the associated document for document_ids 4, 93, and 275 is 'Let's shoot for Tuesday at 11:45.', how do I get a dataframe with the document in one column and a list of the associated duplicate document_id's in another column?



     document_ids                           document
...
4    [4, 93, 275]  Let's shoot for Tuesday at 11:45.


I know that I could use a for loop and compare each document to every other document in the dataframe, saving all matches, but I am trying to avoid iterating over 500k lines multiple times. What is the most pythonic way of going about this?

python pandas dataframe

asked Nov 20 '18 at 21:16 by Gerrrald1

  • I think you should go back to your original duplicated DataFrame and do df.groupby('document').document_id.apply(list) – ALollz, Nov 20 '18 at 21:19
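
On toy data, the comment's suggestion looks like the sketch below (made-up frame for illustration). Note that with no filtering it also returns single-element lists for documents that are not duplicated, which is what the answer below addresses:

    import pandas as pd

    df = pd.DataFrame({
        'document_id': [0, 1, 2, 3, 4],
        'document': ['a', 'b', 'a', 'c', 'b'],
    })

    # collect every document_id under its document text
    ids_per_doc = df.groupby('document').document_id.apply(list)
    print(ids_per_doc)
    # document
    # a    [0, 2]
    # b    [1, 4]
    # c       [3]
    # Name: document_id, dtype: object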

1 Answer

I would like to compare the two to get a list of document_id's that are duplicates.




You should be able to do this using your "initial" DataFrame with .duplicated(keep=False). Here's an example:



In [1]: import pandas as pd

In [2]: df = pd.DataFrame({
   ...:     'document_id': range(10),
   ...:     'document': list('abcabcdedb')  # msg 'e' is not duplicated
   ...: })

In [3]: dupes = df.document.duplicated(keep=False)

In [4]: df.loc[dupes].groupby('document')['document_id'].apply(list).reset_index()
Out[4]:
  document document_id
0        a      [0, 3]
1        b   [1, 4, 9]
2        c      [2, 5]
3        d      [6, 8]

answered Nov 20 '18 at 23:12 by Brad Solomon

  • Awesome, you make it look easy haha, makes sense now, I was using the default of keep=first. – Gerrrald1, Nov 22 '18 at 1:08
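
As the comment notes, the keep argument is the crux here: the default keep='first' leaves the first occurrence of each value unmarked, while keep=False flags every member of a duplicate group. A quick demonstration on a toy Series:

    import pandas as pd

    s = pd.Series(['a', 'b', 'a', 'c', 'a'])

    # default keep='first': the first 'a' is not flagged
    print(s.duplicated().tolist())            # [False, False, True, False, True]

    # keep=False: all three 'a' rows are flagged
    print(s.duplicated(keep=False).tolist())  # [True, False, True, False, True]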
