Union of two pandas DataFrames












8















Say I have two data frames:



df1:



  A
0 a
1 b


df2:



  A
0 a
1 c


I want the result to be the union of the two frames with an extra column showing the source data frame that the row belongs to. In case of duplicates, duplicates should be removed and the respective extra column should show both sources:



  A  B
0 a df1, df2
1 b df1
2 c df2


I can get the concatenated data frame (df3) without duplicates as follows:



import pandas as pd
df3=pd.concat([df1,df2],ignore_index=True).drop_duplicates().reset_index(drop=True)


I can't think of/find a method to have control over what element goes where. How can I add the extra column?



Thank you very much for any tips.










share|improve this question





























    8















    Say I have two data frames:



    df1:



      A
    0 a
    1 b


    df2:



      A
    0 a
    1 c


    I want the result to be the union of the two frames with an extra column showing the source data frame that the row belongs to. In case of duplicates, duplicates should be removed and the respective extra column should show both sources:



      A  B
    0 a df1, df2
    1 b df1
    2 c df2


    I can get the concatenated data frame (df3) without duplicates as follows:



    import pandas as pd
    df3=pd.concat([df1,df2],ignore_index=True).drop_duplicates().reset_index(drop=True)


    I can't think of/find a method to have control over what element goes where. How can I add the extra column?



    Thank you very much for any tips.










    share|improve this question



























      8












      8








      8


      1






      Say I have two data frames:



      df1:



        A
      0 a
      1 b


      df2:



        A
      0 a
      1 c


      I want the result to be the union of the two frames with an extra column showing the source data frame that the row belongs to. In case of duplicates, duplicates should be removed and the respective extra column should show both sources:



        A  B
      0 a df1, df2
      1 b df1
      2 c df2


      I can get the concatenated data frame (df3) without duplicates as follows:



      import pandas as pd
      df3=pd.concat([df1,df2],ignore_index=True).drop_duplicates().reset_index(drop=True)


      I can't think of/find a method to have control over what element goes where. How can I add the extra column?



      Thank you very much for any tips.










      share|improve this question
















      Say I have two data frames:



      df1:



        A
      0 a
      1 b


      df2:



        A
      0 a
      1 c


      I want the result to be the union of the two frames with an extra column showing the source data frame that the row belongs to. In case of duplicates, duplicates should be removed and the respective extra column should show both sources:



        A  B
      0 a df1, df2
      1 b df1
      2 c df2


      I can get the concatenated data frame (df3) without duplicates as follows:



      import pandas as pd
      df3=pd.concat([df1,df2],ignore_index=True).drop_duplicates().reset_index(drop=True)


      I can't think of/find a method to have control over what element goes where. How can I add the extra column?



      Thank you very much for any tips.







      python pandas dataframe merge concatenation






      share|improve this question















      share|improve this question













      share|improve this question




      share|improve this question








      edited Jan 22 at 23:44









      coldspeed

      127k23129216




      127k23129216










      asked Jan 22 at 19:32









      Leon RaiLeon Rai

      557




      557
























          3 Answers
          3






          active

          oldest

          votes


















          11














          Merge with an indicator argument, and remap the result:



          m = {'left_only': 'df1', 'right_only': 'df2', 'both': 'df1, df2'}

          result = df1.merge(df2, on=['A'], how='outer', indicator='B')
          result['B'] = result['B'].map(m)

          result
          A B
          0 a df1, df2
          1 b df1
          2 c df2





          share|improve this answer
























          • Nice and succinct!

            – cph_sto
            Jan 22 at 19:53











          • @cph_sto Thank you! Upvoted back.

            – coldspeed
            Jan 22 at 19:55






          • 1





            I have learnt a lot from you.

            – cph_sto
            Jan 22 at 19:56











          • Excellent! Could you add how to do the same for intersection? outer->inner?

            – Leon Rai
            Jan 22 at 20:28








          • 1





            @LeonRai df1.merge(df2, on=['A'], how='inner').assign(B='df1, df2') (since intersection implies membership in both)

            – coldspeed
            Jan 22 at 20:29



















          2














          We use outer join to solve this -



          df1 = pd.DataFrame({'A':['a','b']})
          df2 = pd.DataFrame({'A':['a','c']})
          df1['col1']='df1'
          df2['col2']='df2'
          df=pd.merge(df1, df2, on=['A'], how="outer").fillna('')
          df['B']=df['col1']+','+df['col2']
          df['B'] = df['B'].str.strip(',')
          df=df[['A','B']]
          df

          A B
          0 a df1,df2
          1 b df1
          2 c df2





          share|improve this answer
























          • thank you for the answer!

            – Leon Rai
            Jan 22 at 20:31











          • pleasure Leon :)

            – cph_sto
            Jan 22 at 20:38



















          2














          Use the command below:



          df3 = pd.concat([df1.assign(source='df1'), df2.assign(source='df2')]) 
          .groupby('A')
          .aggregate(list)
          .reset_index()


          The result will be:



             A      source
          0 a [df1, df2]
          1 b [df1]
          2 c [df2]


          The assign will add a column named source with value df1 and df2 to your dataframes. groupby command groups rows with same A value to single row. aggregate command describes how to aggregate other columns (source) for each group of rows with same A. I have used list aggregate function so that the source column be the list of values with same A.






          share|improve this answer


























          • thank you for the answer!

            – Leon Rai
            Jan 22 at 20:31











          Your Answer






          StackExchange.ifUsing("editor", function () {
          StackExchange.using("externalEditor", function () {
          StackExchange.using("snippets", function () {
          StackExchange.snippets.init();
          });
          });
          }, "code-snippets");

          StackExchange.ready(function() {
          var channelOptions = {
          tags: "".split(" "),
          id: "1"
          };
          initTagRenderer("".split(" "), "".split(" "), channelOptions);

          StackExchange.using("externalEditor", function() {
          // Have to fire editor after snippets, if snippets enabled
          if (StackExchange.settings.snippets.snippetsEnabled) {
          StackExchange.using("snippets", function() {
          createEditor();
          });
          }
          else {
          createEditor();
          }
          });

          function createEditor() {
          StackExchange.prepareEditor({
          heartbeatType: 'answer',
          autoActivateHeartbeat: false,
          convertImagesToLinks: true,
          noModals: true,
          showLowRepImageUploadWarning: true,
          reputationToPostImages: 10,
          bindNavPrevention: true,
          postfix: "",
          imageUploader: {
          brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
          contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
          allowUrls: true
          },
          onDemand: true,
          discardSelector: ".discard-answer"
          ,immediatelyShowMarkdownHelp:true
          });


          }
          });














          draft saved

          draft discarded


















          StackExchange.ready(
          function () {
          StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f54315222%2funion-of-two-pandas-dataframes%23new-answer', 'question_page');
          }
          );

          Post as a guest















          Required, but never shown

























          3 Answers
          3






          active

          oldest

          votes








          3 Answers
          3






          active

          oldest

          votes









          active

          oldest

          votes






          active

          oldest

          votes









          11














          Merge with an indicator argument, and remap the result:



          m = {'left_only': 'df1', 'right_only': 'df2', 'both': 'df1, df2'}

          result = df1.merge(df2, on=['A'], how='outer', indicator='B')
          result['B'] = result['B'].map(m)

          result
          A B
          0 a df1, df2
          1 b df1
          2 c df2





          share|improve this answer
























          • Nice and succinct!

            – cph_sto
            Jan 22 at 19:53











          • @cph_sto Thank you! Upvoted back.

            – coldspeed
            Jan 22 at 19:55






          • 1





            I have learnt a lot from you.

            – cph_sto
            Jan 22 at 19:56











          • Excellent! Could you add how to do the same for intersection? outer->inner?

            – Leon Rai
            Jan 22 at 20:28








          • 1





            @LeonRai df1.merge(df2, on=['A'], how='inner').assign(B='df1, df2') (since intersection implies membership in both)

            – coldspeed
            Jan 22 at 20:29
















          11














          Merge with an indicator argument, and remap the result:



          m = {'left_only': 'df1', 'right_only': 'df2', 'both': 'df1, df2'}

          result = df1.merge(df2, on=['A'], how='outer', indicator='B')
          result['B'] = result['B'].map(m)

          result
          A B
          0 a df1, df2
          1 b df1
          2 c df2





          share|improve this answer
























          • Nice and succinct!

            – cph_sto
            Jan 22 at 19:53











          • @cph_sto Thank you! Upvoted back.

            – coldspeed
            Jan 22 at 19:55






          • 1





            I have learnt a lot from you.

            – cph_sto
            Jan 22 at 19:56











          • Excellent! Could you add how to do the same for intersection? outer->inner?

            – Leon Rai
            Jan 22 at 20:28








          • 1





            @LeonRai df1.merge(df2, on=['A'], how='inner').assign(B='df1, df2') (since intersection implies membership in both)

            – coldspeed
            Jan 22 at 20:29














          11












          11








          11







          Merge with an indicator argument, and remap the result:



          m = {'left_only': 'df1', 'right_only': 'df2', 'both': 'df1, df2'}

          result = df1.merge(df2, on=['A'], how='outer', indicator='B')
          result['B'] = result['B'].map(m)

          result
          A B
          0 a df1, df2
          1 b df1
          2 c df2





          share|improve this answer













          Merge with an indicator argument, and remap the result:



          m = {'left_only': 'df1', 'right_only': 'df2', 'both': 'df1, df2'}

          result = df1.merge(df2, on=['A'], how='outer', indicator='B')
          result['B'] = result['B'].map(m)

          result
          A B
          0 a df1, df2
          1 b df1
          2 c df2






          share|improve this answer












          share|improve this answer



          share|improve this answer










          answered Jan 22 at 19:37









          coldspeedcoldspeed

          127k23129216




          127k23129216













          • Nice and succinct!

            – cph_sto
            Jan 22 at 19:53











          • @cph_sto Thank you! Upvoted back.

            – coldspeed
            Jan 22 at 19:55






          • 1





            I have learnt a lot from you.

            – cph_sto
            Jan 22 at 19:56











          • Excellent! Could you add how to do the same for intersection? outer->inner?

            – Leon Rai
            Jan 22 at 20:28








          • 1





            @LeonRai df1.merge(df2, on=['A'], how='inner').assign(B='df1, df2') (since intersection implies membership in both)

            – coldspeed
            Jan 22 at 20:29



















          • Nice and succinct!

            – cph_sto
            Jan 22 at 19:53











          • @cph_sto Thank you! Upvoted back.

            – coldspeed
            Jan 22 at 19:55






          • 1





            I have learnt a lot from you.

            – cph_sto
            Jan 22 at 19:56











          • Excellent! Could you add how to do the same for intersection? outer->inner?

            – Leon Rai
            Jan 22 at 20:28








          • 1





            @LeonRai df1.merge(df2, on=['A'], how='inner').assign(B='df1, df2') (since intersection implies membership in both)

            – coldspeed
            Jan 22 at 20:29

















          Nice and succinct!

          – cph_sto
          Jan 22 at 19:53





          Nice and succinct!

          – cph_sto
          Jan 22 at 19:53













          @cph_sto Thank you! Upvoted back.

          – coldspeed
          Jan 22 at 19:55





          @cph_sto Thank you! Upvoted back.

          – coldspeed
          Jan 22 at 19:55




          1




          1





          I have learnt a lot from you.

          – cph_sto
          Jan 22 at 19:56





          I have learnt a lot from you.

          – cph_sto
          Jan 22 at 19:56













          Excellent! Could you add how to do the same for intersection? outer->inner?

          – Leon Rai
          Jan 22 at 20:28







          Excellent! Could you add how to do the same for intersection? outer->inner?

          – Leon Rai
          Jan 22 at 20:28






          1




          1





          @LeonRai df1.merge(df2, on=['A'], how='inner').assign(B='df1, df2') (since intersection implies membership in both)

          – coldspeed
          Jan 22 at 20:29





          @LeonRai df1.merge(df2, on=['A'], how='inner').assign(B='df1, df2') (since intersection implies membership in both)

          – coldspeed
          Jan 22 at 20:29













          2














          We use outer join to solve this -



          df1 = pd.DataFrame({'A':['a','b']})
          df2 = pd.DataFrame({'A':['a','c']})
          df1['col1']='df1'
          df2['col2']='df2'
          df=pd.merge(df1, df2, on=['A'], how="outer").fillna('')
          df['B']=df['col1']+','+df['col2']
          df['B'] = df['B'].str.strip(',')
          df=df[['A','B']]
          df

          A B
          0 a df1,df2
          1 b df1
          2 c df2





          share|improve this answer
























          • thank you for the answer!

            – Leon Rai
            Jan 22 at 20:31











          • pleasure Leon :)

            – cph_sto
            Jan 22 at 20:38
















          2














          We use outer join to solve this -



          df1 = pd.DataFrame({'A':['a','b']})
          df2 = pd.DataFrame({'A':['a','c']})
          df1['col1']='df1'
          df2['col2']='df2'
          df=pd.merge(df1, df2, on=['A'], how="outer").fillna('')
          df['B']=df['col1']+','+df['col2']
          df['B'] = df['B'].str.strip(',')
          df=df[['A','B']]
          df

          A B
          0 a df1,df2
          1 b df1
          2 c df2





          share|improve this answer
























          • thank you for the answer!

            – Leon Rai
            Jan 22 at 20:31











          • pleasure Leon :)

            – cph_sto
            Jan 22 at 20:38














          2












          2








          2







          We use outer join to solve this -



          df1 = pd.DataFrame({'A':['a','b']})
          df2 = pd.DataFrame({'A':['a','c']})
          df1['col1']='df1'
          df2['col2']='df2'
          df=pd.merge(df1, df2, on=['A'], how="outer").fillna('')
          df['B']=df['col1']+','+df['col2']
          df['B'] = df['B'].str.strip(',')
          df=df[['A','B']]
          df

          A B
          0 a df1,df2
          1 b df1
          2 c df2





          share|improve this answer













          We use outer join to solve this -



          df1 = pd.DataFrame({'A':['a','b']})
          df2 = pd.DataFrame({'A':['a','c']})
          df1['col1']='df1'
          df2['col2']='df2'
          df=pd.merge(df1, df2, on=['A'], how="outer").fillna('')
          df['B']=df['col1']+','+df['col2']
          df['B'] = df['B'].str.strip(',')
          df=df[['A','B']]
          df

          A B
          0 a df1,df2
          1 b df1
          2 c df2






          share|improve this answer












          share|improve this answer



          share|improve this answer










          answered Jan 22 at 19:52









          cph_stocph_sto

          1,691320




          1,691320













          • thank you for the answer!

            – Leon Rai
            Jan 22 at 20:31











          • pleasure Leon :)

            – cph_sto
            Jan 22 at 20:38



















          • thank you for the answer!

            – Leon Rai
            Jan 22 at 20:31











          • pleasure Leon :)

            – cph_sto
            Jan 22 at 20:38

















          thank you for the answer!

          – Leon Rai
          Jan 22 at 20:31





          thank you for the answer!

          – Leon Rai
          Jan 22 at 20:31













          pleasure Leon :)

          – cph_sto
          Jan 22 at 20:38





          pleasure Leon :)

          – cph_sto
          Jan 22 at 20:38











          2














          Use the command below:



          df3 = pd.concat([df1.assign(source='df1'), df2.assign(source='df2')]) 
          .groupby('A')
          .aggregate(list)
          .reset_index()


          The result will be:



             A      source
          0 a [df1, df2]
          1 b [df1]
          2 c [df2]


          The assign will add a column named source with value df1 and df2 to your dataframes. groupby command groups rows with same A value to single row. aggregate command describes how to aggregate other columns (source) for each group of rows with same A. I have used list aggregate function so that the source column be the list of values with same A.






          share|improve this answer


























          • thank you for the answer!

            – Leon Rai
            Jan 22 at 20:31
















          2














          Use the command below:



          df3 = pd.concat([df1.assign(source='df1'), df2.assign(source='df2')]) 
          .groupby('A')
          .aggregate(list)
          .reset_index()


          The result will be:



             A      source
          0 a [df1, df2]
          1 b [df1]
          2 c [df2]


          The assign will add a column named source with value df1 and df2 to your dataframes. groupby command groups rows with same A value to single row. aggregate command describes how to aggregate other columns (source) for each group of rows with same A. I have used list aggregate function so that the source column be the list of values with same A.






          share|improve this answer


























          • thank you for the answer!

            – Leon Rai
            Jan 22 at 20:31














          2












          2








          2







          Use the command below:



          df3 = pd.concat([df1.assign(source='df1'), df2.assign(source='df2')]) 
          .groupby('A')
          .aggregate(list)
          .reset_index()


          The result will be:



             A      source
          0 a [df1, df2]
          1 b [df1]
          2 c [df2]


          The assign will add a column named source with value df1 and df2 to your dataframes. groupby command groups rows with same A value to single row. aggregate command describes how to aggregate other columns (source) for each group of rows with same A. I have used list aggregate function so that the source column be the list of values with same A.






          share|improve this answer















          Use the command below:



          df3 = pd.concat([df1.assign(source='df1'), df2.assign(source='df2')]) 
          .groupby('A')
          .aggregate(list)
          .reset_index()


          The result will be:



             A      source
          0 a [df1, df2]
          1 b [df1]
          2 c [df2]


          The assign will add a column named source with value df1 and df2 to your dataframes. groupby command groups rows with same A value to single row. aggregate command describes how to aggregate other columns (source) for each group of rows with same A. I have used list aggregate function so that the source column be the list of values with same A.







          share|improve this answer














          share|improve this answer



          share|improve this answer








          edited Jan 22 at 20:05

























          answered Jan 22 at 19:46









          Narges AyoubiNarges Ayoubi

          564




          564













          • thank you for the answer!

            – Leon Rai
            Jan 22 at 20:31



















          • thank you for the answer!

            – Leon Rai
            Jan 22 at 20:31

















          thank you for the answer!

          – Leon Rai
          Jan 22 at 20:31





          thank you for the answer!

          – Leon Rai
          Jan 22 at 20:31


















          draft saved

          draft discarded




















































          Thanks for contributing an answer to Stack Overflow!


          • Please be sure to answer the question. Provide details and share your research!

          But avoid



          • Asking for help, clarification, or responding to other answers.

          • Making statements based on opinion; back them up with references or personal experience.


          To learn more, see our tips on writing great answers.




          draft saved


          draft discarded














          StackExchange.ready(
          function () {
          StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f54315222%2funion-of-two-pandas-dataframes%23new-answer', 'question_page');
          }
          );

          Post as a guest















          Required, but never shown





















































          Required, but never shown














          Required, but never shown












          Required, but never shown







          Required, but never shown

































          Required, but never shown














          Required, but never shown












          Required, but never shown







          Required, but never shown







          Popular posts from this blog

          Biblatex bibliography style without URLs when DOI exists (in Overleaf with Zotero bibliography)

          ComboBox Display Member on multiple fields

          Is it possible to collect Nectar points via Trainline?