Replace all emojis from a given unicode string

I have a list of Unicode symbols from the emoji package. My end goal is a function that takes a Unicode string as input, e.g. some👩😌thing, and removes all emojis from it, giving "something". Below is a demonstration of what I want to achieve:

from emoji import UNICODE_EMOJI
text = u'some👩😌thing'
exclude_list = UNICODE_EMOJI.keys()
output = ...  # placeholder; the desired result is u'something'

While trying to do this, I came across some strange behavior, demonstrated below. I believe that if the code below is fixed, I will be able to achieve my end goal.

import regex as re
print u'\U0001F469' # 👩
print u'\U0001F60C' # 😌
print u'\U0001F469\U0001F60C' # 👩😌

text = u'some\U0001F469\U0001F60Cthing'
print text # some👩😌thing

# Removing "👩😌" works
print re.sub(ur'[\U0001f469\U0001F60C]+', u'', text) # something
# Removing only "👩" doesn't work
print re.sub(ur'[\U0001f469]+', u'', text) # some�thing






regex python-2.7 unicode






asked Nov 19 '18 at 22:20 by dimitris93
edited Nov 19 '18 at 22:30 by dimitris93








  • 2





    I can reproduce this bug in Python 2.7.10 on Mac OS X 10.13.6. Inserting repr in the print statement shows that the result of re.sub is u'some\ude0cthing'. (Incidentally, sys.maxunicode is 65535.)

    – jwodder
    Nov 19 '18 at 22:27








  • 5





    Python 2 doesn't have very good Unicode support for characters outside the BMP. If you really need it you need a build of Python with 32-bit Unicode characters. Otherwise start using the later versions of Python 3.

    – Mark Ransom
    Nov 19 '18 at 22:28








  • 3





    @jwodder: beat me to it 😊 I bet these are internally stored as 2-byte surrogate characters, and the re treats them as 2 separate characters as well – i.e., removing the \u0001's first.

    – usr2564301
    Nov 19 '18 at 22:30






  • 1





    Oh wait, not the \u0001 – it never sees these. It substitutes the proper surrogates first, and then replaces these. Bracketed by [..] so one at a time, first the high order then the low order one.

    – usr2564301
    Nov 19 '18 at 22:32








  • 2





    Screams text formatting error to me. Would it be prohibitive to convert all the unicode-compliant text to ascii and then search for the appropriate strings? Not elegant, but might get around the 2-byte surrogate-esque issues?

    – Mark_Anderson
    Nov 19 '18 at 22:33
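
The comments above hinge on whether the interpreter is a "narrow" build, which is what sys.maxunicode reports. A minimal sketch for checking this, written in Python 3 syntax as an assumption (the thread itself uses Python 2.7); is_narrow_build is a hypothetical helper name, not something from the thread. On Python 3.3 and later it should always report a wide build.

```python
import sys


def is_narrow_build():
    """Return True when non-BMP characters are stored as surrogate pairs."""
    # Narrow builds cap code points at U+FFFF; wide builds go up to U+10FFFF.
    return sys.maxunicode == 0xFFFF


print("sys.maxunicode = %#x" % sys.maxunicode)
print("narrow build: %s" % is_narrow_build())
```

On the Python 2.7.10 build mentioned in the first comment (sys.maxunicode of 65535), the helper would return True.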
















3 Answers
In most builds of Python 2.7 (the so-called narrow builds), Unicode code points above 0xFFFF are stored as a surrogate pair, meaning Python actually sees them as two characters. You can prove this to yourself with len(u'\U0001F469'), which returns 2 on such a build.

The best way to solve this is to move to a version of Python that treats those code points as a single entity rather than a surrogate pair. You can compile Python 2.7 for this, and Python 3.3 and later do it automatically.

To create a regular expression for the replacement, join all the emoji together with |, escaping each one first because some entries contain regex metacharacters. Since the list of characters is already encoded with surrogate pairs, this produces the proper pattern.

# assumes re, exclude_list and text from the question
subs = u'|'.join(re.escape(e) for e in exclude_list)
print re.sub(subs, u'', text)
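
To make the surrogate-pair arithmetic concrete, here is a small sketch in Python 3 (an assumption; the answer targets Python 2.7). The names to_surrogate_pair and utf16_units are hypothetical helpers for illustration only. They show the two 16-bit units a narrow build would store for a single emoji.

```python
def to_surrogate_pair(code_point):
    """Split a code point above U+FFFF into its UTF-16 high/low surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not in a supplementary plane")
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)     # top 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits of the offset
    return high, low


def utf16_units(text):
    """Return text as a list of UTF-16 code units, as a narrow build stores it."""
    raw = text.encode("utf-16-be")
    return [int.from_bytes(raw[i:i + 2], "big") for i in range(0, len(raw), 2)]


print([hex(u) for u in to_surrogate_pair(0x1F469)])  # ['0xd83d', '0xdc69']
print(len(utf16_units("\U0001F469")))                # 2
```

Both emoji from the question share the same high surrogate, 0xD83D, which is why removing one of them with a character class can damage the other.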






answered Nov 19 '18 at 22:39 by Mark Ransom (edited Nov 19 '18 at 23:07)













    • I did not know that 2.7 could be compiled with 4 byte unicode support, and I definitely did not know that Ubuntu was distributing python2.7 with that enabled. Learn something new every day.

      – CJ59
      Nov 19 '18 at 22:49











    • @CJ59 that's one of the best reasons to contribute to StackOverflow, the opportunities to learn are endless. Hardly a day goes by when I don't learn something new myself.

      – Mark Ransom
      Nov 19 '18 at 22:52



















The old 2.7 regex engine gets confused because:

  1. Narrow builds of Python 2.7 store Unicode strings as 16-bit units, so code points above U+FFFF are automatically replaced by surrogate pairs.

  2. Before the regex "sees" your Python string, Python has already split each large Unicode code point into two separate characters (each on its own a valid, but incomplete, single Unicode character).

  3. That means that [\U0001f469]+ is really a character class of 2 characters, \ud83d and \udc69. It removes every occurrence of either one, including the \ud83d that starts 😌, and leaves that emoji's second half, \ude0c, behind. That leads to your badly formed output.

This fixes it:

print re.sub(ur'(\U0001f469|\U0001F60C)+', u'', text)  # something
# Removing only "👩" did not work with a character class; with a group it does:
print re.sub(ur'(\U0001f469)+', u'', text)  # some😌thing

because now the regex engine sees the exact same sequence of characters, surrogate pairs or otherwise, that you are looking for.

If you want to remove all emoji in the exclude_list, you can explicitly loop over its contents and replace them one by one:

exclude_list = UNICODE_EMOJI.keys()

for bad in exclude_list:  # or simply "for bad in UNICODE_EMOJI" if you gotta catch them all
    if bad in text:
        print 'Removing ' + bad
        text = text.replace(bad, '')
print text

Output:

Removing 👩
Removing 😌
something

(This also shows the intermediate results as proof it works; you only need the replace line in the loop.)
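
The difference between the character class and the group can be simulated on any interpreter by working on UTF-16 code units directly. The sketch below is in Python 3 (an assumption; the answer is about Python 2.7), and strip_char_class and strip_group are hypothetical helpers that only mimic what the two patterns do on a narrow build. They are not the regex engine itself.

```python
def utf16_units(text):
    """Return text as a list of UTF-16 code units (how a narrow build stores it)."""
    raw = text.encode("utf-16-be")
    return [int.from_bytes(raw[i:i + 2], "big") for i in range(0, len(raw), 2)]


def strip_char_class(units, emoji_units):
    """Mimic [emoji]+ on a narrow build: every unit in the class is removed."""
    members = set(emoji_units)
    return [u for u in units if u not in members]


def strip_group(units, emoji_units):
    """Mimic (emoji)+ on a narrow build: only the whole unit sequence is removed."""
    kept, i, n = [], 0, len(emoji_units)
    while i < len(units):
        if units[i:i + n] == emoji_units:
            i += n              # skip one complete occurrence
        else:
            kept.append(units[i])
            i += 1
    return kept


text_units = utf16_units("some\U0001F469\U0001F60Cthing")
woman_units = utf16_units("\U0001F469")           # [0xD83D, 0xDC69]

broken = strip_char_class(text_units, woman_units)
fixed = strip_group(text_units, woman_units)

print([hex(u) for u in broken[4:5]])   # ['0xde0c'], an orphaned low surrogate
print([hex(u) for u in fixed[4:6]])    # ['0xd83d', '0xde0c'], the intact pair
```

The orphaned 0xDE0C unit matches the u'some\ude0cthing' result reported in the comments on the question.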







answered Nov 19 '18 at 22:39 by usr2564301 (edited Nov 19 '18 at 23:00)













    • Yes, creating a regular expression from a literal works great. The question expressed a desire to generate the expression from a list of characters, and it turns out that's much harder. I haven't cracked it yet.

      – Mark Ransom
      Nov 19 '18 at 22:48












































    To remove all emojis from the input string using the current approach, use



    import re
    from emoji import UNICODE_EMOJI
    text = u'some👩😌thing'
    exclude_list = UNICODE_EMOJI.keys()
    rx = ur"(?:{})+".format("|".join(map(re.escape,exclude_list)))
    print re.sub(rx, u'', text)
    # => u'something'


    If you do not re.escape the emoji strings, you will get a "nothing to repeat" error: some entries contain regex metacharacters (for example, the keycap emoji that start with a literal *), and these then act as operators inside the alternation group. That is why map(re.escape, exclude_list) is required.



    Tested in Python 2.7.12 (default, Nov 12 2018, 14:36:49)
    [GCC 5.4.0 20160609] on linux2.
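The effect of re.escape can be checked in isolation with a sketch like the following. The three-entry sample_exclude list is an assumption standing in for UNICODE_EMOJI.keys(); it includes the keycap asterisk, which begins with a plain *. The snippet is written to run on Python 3 as well.

```python
import re

# Hypothetical stand-in for UNICODE_EMOJI.keys();
# u'*\ufe0f\u20e3' is the keycap asterisk emoji.
sample_exclude = [u'\U0001F469', u'\U0001F60C', u'*\ufe0f\u20e3']

try:
    re.compile(u'(?:{})+'.format(u'|'.join(sample_exclude)))
    unescaped_ok = True
except re.error:
    unescaped_ok = False  # the "*" right after "|" has nothing to repeat

# Escaping each entry turns "*" into a literal character.
rx = re.compile(u'(?:{})+'.format(u'|'.join(map(re.escape, sample_exclude))))
cleaned = rx.sub(u'', u'some\U0001F469\U0001F60Cthing*\ufe0f\u20e3')
print(cleaned)  # something
```

Compiling the pattern once and reusing rx is a small convenience when many strings need cleaning; it does not change the result.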
















        answered Nov 19 '18 at 23:01









        Wiktor Stribiżew





























