How do I create a new column in pandas from the difference of two string columns?























How can I create a new column in pandas that is the result of the difference of two other columns consisting of strings?



I have one column titled "Good_Address" which has entries like "123 Fake Street Apt 101" and another column titled "Bad_Address" which has entries like "123 Fake Street". I want the output in column "Address_Difference" to be " Apt 101".



I've tried doing:



import pandas as pd
data = pd.read_csv("AddressFile.csv")
data['Address Difference'] = data['GOOD_ADR1'].replace(data['BAD_ADR1'],'')
data['Address Difference']


but this does not work. It seems that the result is just equal to "123 Fake Street Apt 101" (the good address in the example above).



I've also tried:



data['Address Difference'] = data['GOOD_ADR1'].str.replace(data['BAD_ADR1'],'')


but this yields an error saying 'Series' objects are mutable, thus they cannot be hashed.



Any help would be appreciated.



Thanks

















      python regex pandas














      edited Nov 16 at 19:30









      Vaishali











      asked Nov 13 at 20:19









      L. Taylor

























          3 Answers







































            Use replace with a regex built from the bad address:

            data['Address Difference'] = data['GOOD_ADR1'].replace(regex=r'(?i)' + data['BAD_ADR1'], value="")
















            answered Nov 13 at 20:25









            W-B













            • Seems to work pretty well in most cases. Let me append a question to this: if I see a good address like 123 N Main St Apt 101 and a bad address like 123 North Main St, why does this code return a value equal to the good address (i.e. 123 N Main St Apt 101)?
              – L. Taylor
              Nov 13 at 20:39










            • @L.Taylor Since there is no match, it returns what you have in the good address column; pandas treats it as one string, '123 N Main St Apt 101'.
              – W-B
              Nov 13 at 20:41
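
            The exchange above can be reproduced without pandas. This is a minimal sketch, assuming Python's standard `re` module as a stand-in for the regex replace in the answer; the helper name `strip_pattern` and the lower-case sample pattern are illustrative only.

```python
import re

def strip_pattern(good, bad):
    """Remove the first case-insensitive occurrence of `bad` from `good`."""
    # (?i) turns on case-insensitive matching, as in the answer's pattern.
    return re.sub("(?i)" + bad, "", good, count=1)

# The bad address occurs inside the good one (ignoring case), so it is removed.
matched = strip_pattern("123 Fake Street Apt 101", "123 fake street")

# "123 North Main St" never occurs inside "123 N Main St Apt 101",
# so nothing is substituted and the input comes back unchanged.
unmatched = strip_pattern("123 N Main St Apt 101", "123 North Main St")

print(repr(matched))
print(repr(unmatched))
```

            A regex replace only removes text that the pattern actually matches, so an abbreviation such as "N" for "North" leaves the good address untouched.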




















            I'd use a function that we can map across the inputs. This should be fast.

            The function uses str.find to check whether the other string is a substring. If str.find returns -1, the substring was not found and the input is returned unchanged. Otherwise, the substring is cut out using the position where it was found and its length.



            def rm(x, y):
                # Position of y inside x, or -1 if y is not a substring
                i = x.find(y)
                if i > -1:
                    j = len(y)
                    # Keep everything before and after the matched span
                    return x[:i] + x[i+j:]
                else:
                    return x

            df['Address Difference'] = [*map(rm, df.GOOD_ADR1, df.BAD_ADR1)]

            df

                      BAD_ADR1                GOOD_ADR1  Address Difference
            0  123 Fake Street  123 Fake Street Apt 101             Apt 101
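
            A self-contained sketch of the same idea on plain Python lists, so it runs without a CSV file or pandas. The third sample address is made up for illustration; with a DataFrame, the resulting list would be assigned to a column as in the answer.

```python
def rm(x, y):
    """Return x with the first occurrence of y removed, or x unchanged."""
    i = x.find(y)  # -1 means y is not a substring of x
    if i > -1:
        return x[:i] + x[i + len(y):]
    return x

good = ["123 Fake Street Apt 101", "123 N Main St Apt 101", "Suite 5 77 Oak Ave"]
bad = ["123 Fake Street", "123 North Main St", "77 Oak Ave"]

# map pairs the two lists element by element, like the two columns
diff = list(map(rm, good, bad))
print(diff)
```

            Because str.find does a literal search, characters such as `.` or `(` in an address need no escaping here.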














            edited Nov 13 at 20:41

























            answered Nov 13 at 20:26









            piRSquared













            • Very cool. I guess this would be very expensive, computationally speaking?
              – Datanovice
              Nov 13 at 20:28

            • No. You should try it. In fact, I clock it at 1000 times faster on a dataframe of length 10,000.
              – piRSquared
              Nov 13 at 20:29

            • Will do! Thanks, sir. I will add this to my code base for reference!
              – Datanovice
              Nov 13 at 21:28


















                You can remove the bad-address part from the good address:

                df['Address_Difference'] = df['Good_Address'].replace(df['Bad_Address'], '', regex=True).str.strip()

                       Bad_Address             Good_Address  Address_Difference
                0  123 Fake Street  123 Fake Street Apt 101             Apt 101
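
                One caveat worth sketching: with regex=True the bad address is interpreted as a pattern, so characters such as `.` or `(` are not matched literally. The example below uses Python's `re` module rather than pandas to show the effect, with `re.escape` plus `str.strip` as one possible remedy. The helper name `difference` and the sample address are hypothetical.

```python
import re

def difference(good, bad, literal=True):
    """Remove `bad` from `good` and trim surrounding whitespace."""
    # re.escape makes every character in `bad` match itself.
    pattern = re.escape(bad) if literal else bad
    return re.sub(pattern, "", good).strip()

good = "12 St. John (Rear) Apt 3"
bad = "12 St. John (Rear)"

# Unescaped, "(Rear)" is a regex group that matches "Rear" without
# parentheses, so the pattern does not match and only strip() applies.
as_regex = difference(good, bad, literal=False)

# Escaped, the bad address is matched character for character.
as_literal = difference(good, bad)

print(repr(as_regex))
print(repr(as_literal))
```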
















                answered Nov 13 at 20:25









                Vaishali
