Localized glyphs (locl) have unicode value U+FFFD (�)
Originally this seemed to me like a font issue, as there was no equivalent problem with other fonts, as indicated by this good MWE:
Copied text from good PDF
EBGaramond12-Italic.otf:
[ - ] бгдпт
[SRB] бгдпт
Good MWE:
\documentclass[11pt]{article}
\usepackage{fontspec}
\setmainfont{Latin Modern Mono}
\newcommand*{\chars}{бгдпт}
\newcommand*{\samplepars}[1]{{%
  \vspace{\normalbaselineskip}%
  #1:\par
  \fontspec{#1}[Script=Cyrillic]%
  {\normalfont [ - ]} {\chars}\par
  {\normalfont [SRB]} {\addfontfeature{Language=Serbian}\chars}\par
}}
\begin{document}
\samplepars{EBGaramond12-Italic.otf}%
\end{document}
The problem starts happening with the Source Han fonts (the ""/"K"/"SC"/"TC" fonts support all language glyphs via locl). For the MWE we will use Source Han Sans TC (16 MB).
(All of the currently available fonts can be downloaded quickly and easily with these commands.)
If a language other than the font file's default is used (e.g. any language besides Chinese Traditional with the ... TC fonts), the substituted glyphs have the Unicode value U+FFFD � REPLACEMENT CHARACTER. This makes accessibility tools, searching and copying in a PDF file impossible for those glyphs.
The expected behavior would be that the substituted glyphs have the same unicode value as the default glyph they are replacing.
Edit: For comparison, I was able to produce a good CJK PDF with LibreOffice Writer, from this ODT file. After uncompressing it with
mutool clean -d -a libreoffice.pdf libreoffice-uncompressed.pdf
and viewing the source, I found that it does not contain any BDC, EMC, or even zh, but only one /Lang (en-US). The fonts embedded in the file are converted from OTF to Type1 by LibreOffice, which could be the cause of the correct glyph Unicode values.
If the issue is indeed with TeX, how could it be fixed or worked around?
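One per-glyph workaround that may be worth trying while the engine-side problem exists is to attach the intended code point as ActualText, which ends up in the page stream as a BDC/EMC marked-content span. The sketch below assumes the accsupp package and its method=hex,unicode interface. \withunicode is a hypothetical helper name, 5203 is the UTF-16BE hex for 刃 (U+5203), and whether accsupp's driver selection and a given PDF viewer honour this under XeLaTeX has not been checked here.

```latex
% Sketch only: \withunicode is a hypothetical helper, not part of any package.
\documentclass[11pt]{article}
\usepackage{fontspec}
\usepackage{accsupp}
% #1 = intended code point(s) as UTF-16BE hex, #2 = text to typeset
\newcommand*{\withunicode}[2]{%
  \BeginAccSupp{method=hex,unicode,ActualText=#1}#2\EndAccSupp{}%
}
\begin{document}
\fontspec{SourceHanSansTC-Regular.otf}[Script=CJK,Language=Chinese Simplified]%
% The glyph is still substituted by locl; ActualText carries U+5203 for copying.
\withunicode{5203}{刃}
\end{document}
```

This does not scale to running text, since every affected character has to be wrapped by hand. It only illustrates that the copied text can be overridden independently of the glyph's own Unicode mapping.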
Copied text from bad PDF
SourceHanSansTC-Regular.otf:
[ZHS] ���⻣�����贈��⻆���函純賭難海�練�������
[ZHT] 刃令毒骨縣誤船述煙贈雪及角條低詩函純賭難海喝練灰起英次能窗化
Bad MWE:
\documentclass[11pt]{article}
\usepackage{fontspec}
\setmainfont{Latin Modern Mono}
\setlength{\parindent}{0mm}
\newcommand*{\chars}{%
  刃令毒骨縣誤船述煙贈雪及角條低詩函純賭難海喝練灰起英次能窗化}
\newcommand*{\samplepars}[1]{{%
  \vspace{\normalbaselineskip}%
  #1:\par
  \fontspec{#1}[Script=CJK]%
% {\normalfont [JAN]} {\addfontfeature{Language=Japanese}\chars}\par
% {\normalfont [KOR]} {\addfontfeature{Language=Korean}\chars}\par
%%{\normalfont [ZHH]} {\addfontfeature{Language=Chinese Hong Kong}\chars}\par
  {\normalfont [ZHS]} {\addfontfeature{Language=Chinese Simplified}\chars}\par
  {\normalfont [ZHT]} {\addfontfeature{Language=Chinese Traditional}\chars}\par
}}
\begin{document}
\newpage
% \samplepars{SourceHanSerif-Regular.otf}%
% \samplepars{SourceHanSerifK-Regular.otf}%
%%\samplepars{SourceHanSerifHC-Regular.otf}%
% \samplepars{SourceHanSerifSC-Regular.otf}%
% \samplepars{SourceHanSerifTC-Regular.otf}%
\newpage
% \samplepars{SourceHanSans-Regular.otf}%
% \samplepars{SourceHanSansK-Regular.otf}%
%%\samplepars{SourceHanSansHC-Regular.otf}%
% \samplepars{SourceHanSansSC-Regular.otf}%
\samplepars{SourceHanSansTC-Regular.otf}%
\end{document}
fontspec unicode cjk opentype locale
2
I took a look at the locl-test.pdf provided by Ken Lunde. Its page stream contains BDC/EMC operators which mark up the language, and Ken's comment sounds as if this is needed for copy & paste to work correctly. Adding these marks should be possible, e.g. with the tagpdf package, but that currently works only for lualatex and pdflatex. But I can't load your font with lualatex, so I can't run more tests.
– Ulrike Fischer
Sep 19 at 8:36
2
I can now run the fonts with luatex (they need a 64-bit luatex), but the reference file from Ken is a bit too complicated for real investigations. It would be better to have a working example involving only one font and two variants.
– Ulrike Fischer
Sep 19 at 16:57
1
Well, it is not tagging; the ToUnicode entries are missing. I asked on the ConTeXt list for a solution for luatex.
– Ulrike Fischer
Sep 26 at 21:17
1
The ConTeXt fontloader will be updated to get this working, and I can then import it into lualatex/luaotfload. But I have no idea which of the various libraries in xetex is relevant here. You should try asking on the xetex mailing list. The discussion on the ConTeXt list is here: mail-archive.com/ntg-context@ntg.nl/msg89086.html
– Ulrike Fischer
Sep 27 at 9:19
2
I have uploaded a new version of luaotfload to CTAN which I hope corrects the problem for lualatex.
– Ulrike Fischer
Oct 3 at 13:42
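The request in the comments for an example with only one font and two variants could be met with something like the following sketch. It assumes compilation with lualatex (a 64-bit LuaTeX, per the comments) and that SourceHanSansTC-Regular.otf can be found by file name. The five characters are an arbitrary subset of the original test string, all of which came out wrong in the copied [ZHS] line.

```latex
% Reduced test: one font, two language variants. Compile with lualatex.
\documentclass[11pt]{article}
\usepackage{fontspec}
\setlength{\parindent}{0mm}
% Subset of the original sample; every one of these was affected in [ZHS].
\newcommand*{\chars}{刃令毒骨縣}
\begin{document}
\fontspec{SourceHanSansTC-Regular.otf}[Script=CJK]%
{[ZHS]} {\addfontfeature{Language=Chinese Simplified}\chars}\par
{[ZHT]} {\addfontfeature{Language=Chinese Traditional}\chars}\par
\end{document}
```

Copying both lines from the resulting PDF and comparing them should show whether the installed luaotfload already includes the fix: with a working ToUnicode mapping, the [ZHS] and [ZHT] lines would paste as the same five characters.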
edited Nov 19 at 23:07
asked Sep 18 at 20:59
svenper