Replace all emojis from a given unicode string

I have a list of unicode symbols from the emoji package. My end goal is a function that takes a unicode string as input, e.g. some👩😌thing, and removes all emojis, giving "something". Below is a demonstration of what I want to achieve:

from emoji import UNICODE_EMOJI

text = 'some👩😌thing'

exclude_list = UNICODE_EMOJI.keys()

output = ... = 'something'

While trying to do this, I came across a strange behavior, demonstrated below. I believe that if the code below is fixed, I will be able to achieve my end goal.

import regex as re

print u'\U0001F469'                     # 👩
print u'\U0001F60C'                     # 😌
print u'\U0001F469\U0001F60C'           # 👩😌

text = u'some\U0001F469\U0001F60Cthing'
print text                              # some👩😌thing

# Removing "👩😌" works
print re.sub(ur'[\U0001f469\U0001F60C]+', u'', text)  # something

# Removing only "👩" doesn't work
print re.sub(ur'[\U0001f469]+', u'', text)            # some�thing

edited Nov 19 '18 at 22:30

asked Nov 19 '18 at 22:20

dimitris93


2

I can reproduce this bug in Python 2.7.10 on Mac OS X 10.13.6. Inserting repr in the print statement shows that the result of re.sub is u'some\ude0cthing'. (Incidentally, sys.maxunicode is 65535.)

– jwodder
Nov 19 '18 at 22:27

5

Python 2 doesn't have very good Unicode support for characters outside the BMP. If you really need it you need a build of Python with 32-bit Unicode characters. Otherwise start using the later versions of Python 3.

– Mark Ransom
Nov 19 '18 at 22:28

3

@jwodder: beat me to it 😊 I bet these are internally stored as 2-byte surrogate characters, and the re treats them as 2 separate characters as well – i.e., removing the \u0001's first.

– usr2564301
Nov 19 '18 at 22:30

1

Oh wait, not the \u0001 – it never sees these. It substitutes the proper surrogates first, and then replaces these. Bracketed by [..] so one at a time, first the high order then the low order one.

– usr2564301
Nov 19 '18 at 22:32

2

Screams text formatting error to me. Would it be prohibitive to convert all the unicode-compliant text to ascii and then search for the appropriate strings? Not elegant, but might get around the 2-byte surrogate-esque issues?

– Mark_Anderson
Nov 19 '18 at 22:33

regex python-2.7 unicode


3 Answers

In many builds of Python 2.7 (so-called narrow builds), Unicode code points above U+FFFF are stored as a surrogate pair, meaning Python actually sees them as two characters. You can prove this to yourself with len(u'\U0001F469'), which returns 2 on such a build.
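As a rough illustration of what a narrow build stores, the sketch below computes the UTF-16 surrogate pair for a code point. It is written for Python 3, which is an assumption since the question uses 2.7, and `surrogate_pair` is a hypothetical helper name, not part of any library.

```python
def surrogate_pair(code_point):
    """Return the (high, low) UTF-16 surrogates for a code point above U+FFFF."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not in the supplementary planes")
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits of the offset
    return high, low


high, low = surrogate_pair(0x1F469)   # U+1F469, the woman emoji from the question
print(hex(high), hex(low))            # 0xd83d 0xdc69

# Cross-check against Python's own UTF-16 encoder.
encoded = "\U0001F469".encode("utf-16-be")
assert encoded == bytes([high >> 8, high & 0xFF, low >> 8, low & 0xFF])
```

Both emoji in the question share the high surrogate 0xD83D, which matters for the regex behavior discussed in the other answers.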

The best way to solve this is to move to a version of Python that treats those code points as a single entity rather than a surrogate pair. You can compile Python 2.7 for this (a wide build, configured with --enable-unicode=ucs4), and Python 3.3 and later do it automatically.
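For comparison, here is a minimal sketch of the question's two substitutions on Python 3.3 or later, where each emoji is a single character. It uses the standard `re` module rather than the `regex` package from the question; that swap is an assumption made to keep the example dependency-free.

```python
import re

text = "some\U0001F469\U0001F60Cthing"  # some👩😌thing

# Each emoji is one character here, so a character class behaves as expected.
assert len("\U0001F469") == 1

both_removed = re.sub("[\U0001F469\U0001F60C]+", "", text)
only_woman_removed = re.sub("[\U0001F469]+", "", text)

print(both_removed)               # something
print(ascii(only_woman_removed))  # 'some\U0001f60cthing'
```

`ascii()` is used for the second result only so that the output does not depend on the terminal's encoding.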

To create a regular expression to use for the replacement, simply join all the characters together with |. Since the list of characters is already encoded with surrogate pairs, this produces the proper string.

subs = u'|'.join(exclude_list)

print re.sub(subs, u'', text)

edited Nov 19 '18 at 23:07

answered Nov 19 '18 at 22:39

Mark Ransom


I did not know that 2.7 could be compiled with 4 byte unicode support, and I definitely did not know that Ubuntu was distributing python2.7 with that enabled. Learn something new every day.

– CJ59
Nov 19 '18 at 22:49

@CJ59 that's one of the best reasons to contribute to StackOverflow, the opportunities to learn are endless. Hardly a day goes by when I don't learn something new myself.

– Mark Ransom
Nov 19 '18 at 22:52


The old 2.7 regex engine gets confused because:

Python 2.7 uses a forced word-based Unicode storage, in which certain Unicode codepoints are automatically substituted by surrogate pairs.

Before the regex "sees" your Python string, Python already helpfully parsed your large Unicode codepoints into two separate characters (each on its own a valid – but incomplete – single Unicode character).

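That means that [\U0001f469]+ is really a character class of two characters, the high and the low surrogate of 👩, and each of them is matched on its own. The class removes both halves of 👩 and also the high surrogate of 😌 (the two emoji share it), but not the low surrogate of 😌. That leftover lone surrogate is your badly formed output.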
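The effect can be reproduced on Python 3 by spelling out the surrogates by hand. This is only a simulation of a narrow 2.7 build, under the assumption that both the pattern and the text reach the regex engine as separate surrogate code units:

```python
import re

# U+1F469 is the pair D83D DC69; U+1F60C is the pair D83D DE0C.
text = "some\ud83d\udc69\ud83d\ude0cthing"

# Roughly what the engine receives for the single-emoji character class
# on a narrow build: a class holding two separate surrogates.
pattern = "[\ud83d\udc69]+"

result = re.sub(pattern, "", text)

# ascii() is used because a lone surrogate cannot be encoded as UTF-8 for printing.
print(ascii(result))  # 'some\ude0cthing'
```

The leftover `\ude0c` is the same value reported in the comments on the question.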

This fixes it:

print re.sub(ur'(\U0001f469|\U0001F60C)+', u'', text)  # something

# Removing only "👩" now works as well
print re.sub(ur'(\U0001f469)+', u'', text)            # some😌thing

because now the regex engine sees the exact same sequence of characters – surrogate pairs or otherwise – that you are looking for.

If you want to remove all emoji from the exclude_list, you can explicitly loop over its contents and replace one by one:

exclude_list = UNICODE_EMOJI.keys()

for bad in exclude_list:  # or simply "for bad in UNICODE_EMOJI" if you gotta catch them all
    if bad in text:
        print 'Removing ' + bad
        text = text.replace(bad, '')

print text

Output:

Removing 👩
Removing 😌
something

(This also shows the intermediate results as proof it works; you only need the replace line in the loop.)

edited Nov 19 '18 at 23:00

answered Nov 19 '18 at 22:39

usr2564301


Yes, creating a regular expression from a literal works great. The question expressed a desire to generate the expression from a list of characters, and it turns out that's much harder. I haven't cracked it yet.

– Mark Ransom
Nov 19 '18 at 22:48


To remove all emojis from the input string using the current approach, use

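# -*- coding: utf-8 -*-
# (the coding line is needed in Python 2 because of the emoji literal below)
import re
from emoji import UNICODE_EMOJI

text = u'some👩😌thing'
exclude_list = UNICODE_EMOJI.keys()

# Escape every emoji, then join them into one non-capturing alternation
rx = ur"(?:{})+".format("|".join(map(re.escape, exclude_list)))
print re.sub(rx, u'', text)
# => something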

If you do not re.escape the emoji chars, you will get a "nothing to repeat" error, because some entries contain regex metacharacters (a literal * directly after the | alternation operator, for instance) that get parsed as operators inside the group. That is why map(re.escape, exclude_list) is required.
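The same idea as a self-contained Python 3 sketch. The short `exclude_list` here is a hypothetical stand-in for `UNICODE_EMOJI.keys()`, and `build_emoji_pattern` is an illustrative name, not part of the emoji package. The keycap entry is included only because it starts with `*`, which shows why escaping is needed.

```python
import re


def build_emoji_pattern(emojis):
    """Compile one alternation that matches any entry in `emojis`."""
    # Longest first, so multi-character sequences win over their own prefixes.
    ordered = sorted(emojis, key=len, reverse=True)
    # re.escape neutralises metacharacters such as the '*' in the keycap entry.
    return re.compile("(?:{})+".format("|".join(map(re.escape, ordered))))


# Hypothetical, deliberately tiny stand-in for UNICODE_EMOJI.keys().
exclude_list = ["\U0001F469", "\U0001F60C", "*\uFE0F\u20E3"]

pattern = build_emoji_pattern(exclude_list)
cleaned = pattern.sub("", "some\U0001F469\U0001F60Cthing *\uFE0F\u20E3")
print(repr(cleaned))  # 'something '
```

Sorting longest-first is a design choice added here rather than something the answer does: in an alternation the first branch that matches wins, so a shorter emoji listed before a longer sequence that begins with it would leave the rest of that sequence behind.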

Tested in Python 2.7.12 (default, Nov 12 2018, 14:36:49)
[GCC 5.4.0 20160609] on linux2.

answered Nov 19 '18 at 23:01

Wiktor Stribiżew
