Localized glyphs (locl) have unicode value U+FFFD (�)

Originally this seemed to me like a font issue, as there was no equivalent problem with other fonts, as indicated by this good MWE:

Copied text from good PDF

EBGaramond12-Italic.otf:

[ - ] бгдпт

[SRB] бгдпт

Good MWE:

\documentclass[11pt]{article}
\usepackage{fontspec}
\setmainfont{Latin Modern Mono}

\newcommand*{\chars}{бгдпт}

\newcommand*{\samplepars}[1]{{%
  \vspace{\normalbaselineskip}%
  #1:\par
  \fontspec{#1}[Script=Cyrillic]%
  {\normalfont [ - ]} {\chars}\par
  {\normalfont [SRB]} {\addfontfeature{Language=Serbian}\chars}\par
}}

\begin{document}

\samplepars{EBGaramond12-Italic.otf}%

\end{document}

The problem appears with the Source Han fonts (each of the ""/"K"/"SC"/"TC" variants supports the glyphs of all languages via locl). For the MWE we will use Source Han Sans TC (16 MB).

(All of the currently available fonts can be downloaded quickly and easily with these commands.)

If a language other than the font file's default is selected (e.g. any language besides Chinese Traditional with the ... TC fonts), the substituted glyphs have the Unicode value U+FFFD � REPLACEMENT CHARACTER. This makes accessibility tools, searching, and copying in the PDF file impossible for those glyphs.

The expected behavior would be that each substituted glyph has the same Unicode value as the default glyph it replaces.

Edit: For comparison, I was able to produce a good CJK PDF with LibreOffice Writer, from this ODT file. After uncompressing it with mutool clean -d -a libreoffice.pdf libreoffice-uncompressed.pdf, the source contains no BDC, EMC, or even zh, and only a single /Lang (en-US). LibreOffice converts the embedded fonts from OTF to Type1, which could explain the correct glyph Unicode values.

If the issue is indeed with TeX, how could it be fixed or worked around?

Copied text from bad PDF

SourceHanSansTC-Regular.otf:

[ZHS] ���⻣�����贈��⻆���函純賭難海�練�������

[ZHT] 刃令毒骨縣誤船述煙贈雪及角條低詩函純賭難海喝練灰起英次能窗化

Bad MWE:

\documentclass[11pt]{article}
\usepackage{fontspec}
\setmainfont{Latin Modern Mono}
\setlength{\parindent}{0mm}

\newcommand*{\chars}{%
    刃令毒骨縣誤船述煙贈雪及角條低詩函純賭難海喝練灰起英次能窗化}

\newcommand*{\samplepars}[1]{{%
  \vspace{\normalbaselineskip}%
  #1:\par
  \fontspec{#1}[Script=CJK]%
% {\normalfont [JAN]} {\addfontfeature{Language=Japanese           }\chars}\par
% {\normalfont [KOR]} {\addfontfeature{Language=Korean             }\chars}\par
%%{\normalfont [ZHH]} {\addfontfeature{Language=Chinese Hong Kong  }\chars}\par
  {\normalfont [ZHS]} {\addfontfeature{Language=Chinese Simplified }\chars}\par
  {\normalfont [ZHT]} {\addfontfeature{Language=Chinese Traditional}\chars}\par
}}

\begin{document}

\newpage
% \samplepars{SourceHanSerif-Regular.otf}%
% \samplepars{SourceHanSerifK-Regular.otf}%
%%\samplepars{SourceHanSerifHC-Regular.otf}%
% \samplepars{SourceHanSerifSC-Regular.otf}%
% \samplepars{SourceHanSerifTC-Regular.otf}%

\newpage
% \samplepars{SourceHanSans-Regular.otf}%
% \samplepars{SourceHanSansK-Regular.otf}%
%%\samplepars{SourceHanSansHC-Regular.otf}%
% \samplepars{SourceHanSansSC-Regular.otf}%
  \samplepars{SourceHanSansTC-Regular.otf}%

\end{document}

asked Sep 18 at 20:59 by svenper, edited Nov 19 at 23:07

2

I took a look at the locl-test.pdf provided by Ken Lunde. Its page stream contains BDC/EMC operators that mark up the language, and Ken's comment sounds as if this is needed for copy and paste to work correctly. Adding these marks should be possible, e.g. with the tagpdf package, but it currently works only for lualatex and pdflatex. But I can't load your font with lualatex, so I can't run more tests.
– Ulrike Fischer
Sep 19 at 8:36
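
The BDC/EMC operators mentioned in the comment above are also the mechanism behind the ActualText property, which lets a PDF state the text that a run of glyphs stands for. As a possible stopgap, here is a minimal sketch using the accsupp package. The helper name \withactualtext is hypothetical, the code point has to be supplied by hand as UTF-16BE hex, and it is an untested assumption that accsupp's specials come out correctly under xelatex and that a given PDF viewer honours ActualText when copying or searching.

```latex
% Sketch only: attach the intended Unicode value to one substituted glyph.
% Assumes SourceHanSansTC-Regular.otf can be found by the engine.
\documentclass[11pt]{article}
\usepackage{fontspec}
\usepackage{accsupp}

% Hypothetical helper (not part of any package):
%   #1 = intended code point as UTF-16BE hex digits
%   #2 = text to typeset
\newcommand*{\withactualtext}[2]{%
  \BeginAccSupp{method=hex,unicode,ActualText=#1}#2\EndAccSupp{}%
}

\begin{document}
\fontspec{SourceHanSansTC-Regular.otf}[Script=CJK, Language=Chinese Simplified]
\withactualtext{5203}{刃}% U+5203
\end{document}
```

Wrapping every character by hand does not scale, so this is only useful for checking whether ActualText helps in a particular viewer, not as a general fix.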

2

I can now run the fonts with luatex (they need a 64-bit luatex), but the reference file from Ken is a bit too complicated for real investigations. It would be better to have a working example involving only one font and two variants.
– Ulrike Fischer
Sep 19 at 16:57
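
A reduced test along the lines requested in the comment above might look like the following sketch: one font file, two language variants, one character. The family names \tcfont and \scfont are made up for illustration. The character 刃 is chosen because it is the first one that comes back as U+FFFD in the [ZHS] line of the copied text in the question.

```latex
% Sketch only: one font, two language variants, one test character.
% Assumes SourceHanSansTC-Regular.otf can be found by the engine.
\documentclass[11pt]{article}
\usepackage{fontspec}

% Illustrative family names; both point at the same font file.
\newfontfamily\tcfont{SourceHanSansTC-Regular.otf}[Script=CJK, Language=Chinese Traditional]
\newfontfamily\scfont{SourceHanSansTC-Regular.otf}[Script=CJK, Language=Chinese Simplified]

\begin{document}
{\tcfont 刃}\par % font default: no locl substitution expected
{\scfont 刃}\par % non-default language: glyph substituted via locl
\end{document}
```

Copying both lines out of the resulting PDF shows whether the substituted glyph keeps its Unicode value.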

1

Well, it is not tagging, but ToUnicode entries are missing. I asked on the ConTeXt list for a solution for luatex.
– Ulrike Fischer
Sep 26 at 21:17

1

The ConTeXt fontloader will be updated to get this working, and I can then import it into lualatex/luaotfload. But I have no idea which of the various libraries in xetex is relevant here. You should try asking on the xetex mailing list. The discussion on the ConTeXt list is here: mail-archive.com/ntg-context@ntg.nl/msg89086.html
– Ulrike Fischer
Sep 27 at 9:19

2

I have uploaded a new version of luaotfload to CTAN, which I hope corrects the problem for lualatex.
– Ulrike Fischer
Oct 3 at 13:42


fontspec unicode cjk opentype locale
