Replace all emojis from a given unicode string
I have a list of Unicode symbols from the `emoji` package. My end goal is to create a function that takes a Unicode string as input, e.g. `some👩😌thing`, and removes all emojis, returning `something`. Below is a demonstration of what I want to achieve:

    from emoji import UNICODE_EMOJI
    text = 'some👩😌thing'
    exclude_list = UNICODE_EMOJI.keys()
    output = ... = 'something'

While trying to do this, I came across a strange behavior, demonstrated below. I believe that if the code below is fixed, I will be able to achieve my end goal.

    import regex as re
    print u'\U0001F469' # 👩
    print u'\U0001F60C' # 😌
    print u'\U0001F469\U0001F60C' # 👩😌
    text = u'some\U0001F469\U0001F60Cthing'
    print text # some👩😌thing
    # Removing "👩😌" works
    print re.sub(ur'[\U0001f469\U0001F60C]+', u'', text) # something
    # Removing only "👩" doesn't work
    print re.sub(ur'[\U0001f469]+', u'', text) # some�thing

regex python-2.7 unicode
asked Nov 19 '18 at 22:20 by dimitris93 (edited Nov 19 '18 at 22:30 by dimitris93)
2
I can reproduce this bug in Python 2.7.10 on Mac OS X 10.13.6. Inserting `repr` in the `print` statement shows that the result of `re.sub` is `u'some\ude0cthing'`. (Incidentally, `sys.maxunicode` is 65535.)
– jwodder
Nov 19 '18 at 22:27
5
Python 2 doesn't have very good Unicode support for characters outside the BMP. If you really need it, you need a build of Python with 32-bit Unicode characters. Otherwise start using the later versions of Python 3.
– Mark Ransom
Nov 19 '18 at 22:28
3
@jwodder: beat me to it 😊 I bet these are internally stored as 2-byte surrogate characters, and `re` treats them as 2 separate characters as well – i.e., removing the `\u0001`'s first.
– usr2564301
Nov 19 '18 at 22:30
1
Oh wait, not the `\u0001` – it never sees these. It substitutes the proper surrogates first, and then replaces these. Bracketed by `[..]`, so one at a time, first the high-order then the low-order one.
– usr2564301
Nov 19 '18 at 22:32
2
Screams text formatting error to me. Would it be prohibitive to convert all the Unicode-compliant text to ASCII and then search for the appropriate strings? Not elegant, but might get around the 2-byte surrogate-esque issues?
– Mark_Anderson
Nov 19 '18 at 22:33
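The surrogate substitution described in these comments can be checked by hand. The following is a minimal sketch in Python 3 (not the asker's Python 2.7 setup); the helper name `to_surrogates` is made up for illustration. It applies the standard UTF-16 rule for codepoints above 0xFFFF to the two emoji from the question.

```python
def to_surrogates(codepoint):
    """Split a codepoint above 0xFFFF into its UTF-16 (high, low) surrogates."""
    if not 0x10000 <= codepoint <= 0x10FFFF:
        raise ValueError("codepoint is not outside the BMP")
    offset = codepoint - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits of the offset
    return high, low

woman = to_surrogates(0x1F469)     # 👩
relieved = to_surrogates(0x1F60C)  # 😌
print([hex(u) for u in woman])     # ['0xd83d', '0xdc69']
print([hex(u) for u in relieved])  # ['0xd83d', '0xde0c']
```

Both emoji start with `0xd83d`, which would explain why a character class built from 👩 alone also consumes the first half of 😌 and leaves the `\ude0c` that jwodder observed.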
3 Answers
In narrow builds of Python 2.7 (those where `sys.maxunicode` is 65535), Unicode codepoints above 0xFFFF are stored as a surrogate pair, meaning Python actually sees them as two characters. You can prove this to yourself with `len(u'\U0001F469')`, which returns 2 on such a build.
The best way to solve this is to move to a version of Python that treats those codepoints as a single entity rather than a surrogate pair. You can compile Python 2.7 for this, and Python 3.3 and later do it automatically.
To create a regular expression to use for the replace, simply join all the characters together with `|`. Since the characters in the list are already encoded as surrogate pairs, the result is the proper string.

    # Caveat: entries that contain regex metacharacters need re.escape() first.
    subs = u'|'.join(exclude_list)
    print re.sub(subs, u'', text)

answered Nov 19 '18 at 22:39 by Mark Ransom (edited Nov 19 '18 at 23:07)
I did not know that 2.7 could be compiled with 4-byte Unicode support, and I definitely did not know that Ubuntu was distributing python2.7 with that enabled. Learn something new every day.
– CJ59
Nov 19 '18 at 22:49
@CJ59 that's one of the best reasons to contribute to Stack Overflow; the opportunities to learn are endless. Hardly a day goes by when I don't learn something new myself.
– Mark Ransom
Nov 19 '18 at 22:52
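For comparison, here is a small sketch of the same substitution on Python 3 (3.3 or later), where each codepoint is a single character regardless of build. It reuses the question's sample text and does not depend on the `emoji` package; the variable names are illustrative only.

```python
import re

text = "some\U0001F469\U0001F60Cthing"  # some👩😌thing

# Each emoji is one character here, so the class holds exactly one member.
assert len("\U0001F469") == 1

only_woman_removed = re.sub("[\U0001F469]+", "", text)
both_removed = re.sub("[\U0001F469\U0001F60C]+", "", text)

print(only_woman_removed)  # some😌thing
print(both_removed)        # something
```

On such an interpreter the original character-class pattern from the question behaves as the asker expected, with no change beyond dropping the Python 2 `ur` prefix and `print` statement.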
The old 2.7 regex engine gets confused because:
- Narrow builds of Python 2.7 store Unicode as 16-bit code units, so codepoints above 0xFFFF are automatically replaced by surrogate pairs.
- Before the regex "sees" your Python string, Python has already split each large codepoint into two separate characters, each one a lone surrogate that on its own is only half of a character.
That means that `[\U0001f469]+` is really a character class of two code units, `\ud83d` and `\udc69`, and each is matched on its own. The class removes both halves of 👩, but it also removes the high surrogate `\ud83d` of 😌, which the two emoji share, and leaves a lone `\ude0c` behind. That leads to your badly formed output.
This fixes it:

    print re.sub(ur'(\U0001f469|\U0001F60C)+', u'', text) # something
    # Removing only "👩" now works too:
    print re.sub(ur'(\U0001f469)+', u'', text) # some😌thing

because now the regex engine sees the exact same sequence of characters, surrogate pairs or otherwise, that you are looking for.
If you want to remove all emoji in the `exclude_list`, you can explicitly loop over its contents and replace them one by one:

    exclude_list = UNICODE_EMOJI.keys()
    for bad in exclude_list: # or simply "for bad in UNICODE_EMOJI" if you gotta catch them all
        if bad in text:
            print 'Removing '+bad
            text = text.replace(bad, u'')
    print text

Output:

    Removing 👩
    Removing 😌
    something

(This also shows the intermediate results as proof it works; you only need the `replace` line in the loop.)
answered Nov 19 '18 at 22:39 by usr2564301 (edited Nov 19 '18 at 23:00)
Yes, creating a regular expression from a literal works great. The question expressed a desire to generate the expression from a list of characters, and it turns out that's much harder. I haven't cracked it yet.
– Mark Ransom
Nov 19 '18 at 22:48
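To see the effect described in this answer without a narrow Python 2.7 build, one can imitate its storage on Python 3 by splitting the text into UTF-16 code units. This is only an illustration under that assumption: `utf16_units` is a hypothetical helper, and the list filter stands in for what `[\U0001f469]+` does when the class really contains two separate code units.

```python
import struct

def utf16_units(s):
    """Return the UTF-16 code units of s as a list of integers."""
    data = s.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(data) // 2), data))

text = "some\U0001F469\U0001F60Cthing"  # some👩😌thing

# What the character class contains on a narrow build: two lone surrogates.
char_class = set(utf16_units("\U0001F469"))  # {0xD83D, 0xDC69}

# Dropping every unit that is in the class mimics re.sub with that class.
leftover = [u for u in utf16_units(text) if u not in char_class]
print([hex(u) for u in leftover if u > 0x7F])  # ['0xde0c']
```

The only non-ASCII unit left is the low surrogate of 😌, which matches the `u'some\ude0cthing'` result reported in the question's comments.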
To remove all emojis from the input string using the current approach, use

    # -*- coding: utf-8 -*-
    # The coding line above lets Python 2 read the emoji literal below.
    import re
    from emoji import UNICODE_EMOJI
    text = u'some👩😌thing'
    exclude_list = UNICODE_EMOJI.keys()
    # Escape every entry, then join them into one repeated alternation.
    rx = ur"(?:{})+".format("|".join(map(re.escape, exclude_list)))
    print re.sub(rx, u'', text)
    # => something

If you do not `re.escape` the emoji chars, you will get a `nothing to repeat` error, because literal characters in some entries (such as `*`) are read as regex operators and interfere with the alternation inside the group, so `map(re.escape, exclude_list)` is required.
Tested in Python 2.7.12 (default, Nov 12 2018, 14:36:49) [GCC 5.4.0 20160609] on linux2.
answered Nov 19 '18 at 23:01 by Wiktor Stribiżew
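The escaping point can be reproduced with a tiny hand-written list. The sketch below is Python 3 and does not import `emoji`; `sample_emoji` is a made-up stand-in for `exclude_list` that includes a keycap sequence beginning with a literal `*`, and `build_pattern` and `strip_emoji` are hypothetical helper names. Sorting longest first is an extra precaution of this sketch, not part of the answer, so that a multi-character sequence is tried before any shorter entry it starts with.

```python
import re

# Hypothetical stand-in for UNICODE_EMOJI.keys(); "*\u20e3" is a keycap asterisk.
sample_emoji = ["\U0001F469", "\U0001F60C", "*\u20e3"]

def build_pattern(emojis, escape=True):
    """Join the entries into one repeated alternation, longest entries first."""
    ordered = sorted(emojis, key=len, reverse=True)
    parts = [re.escape(e) if escape else e for e in ordered]
    return re.compile("(?:{})+".format("|".join(parts)))

def strip_emoji(text, emojis=sample_emoji):
    """Remove every listed emoji from text."""
    return build_pattern(emojis).sub("", text)

print(strip_emoji("some\U0001F469\U0001F60Cthing*\u20e3"))  # something

# Without escaping, the bare "*" is read as a quantifier and compilation fails.
try:
    build_pattern(sample_emoji, escape=False)
except re.error as exc:
    print("unescaped pattern failed:", exc)
```

Compiling the pattern once and reusing it is a reasonable choice when many strings have to be cleaned, since the alternation built from a full emoji table is large.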