In PDF_DecodeText() UTF-16, do surrogate fusion before language code stripping

No behavior change for valid PDFs, since language codes are ASCII.

Surrogate fusion is conceptually part of UTF-16 decoding, so let's do
complete UTF-16 conversion before doing the PDF-specific language code
stripping. Also makes this look more like the UTF-8 codepath.

It'd be nice if we could:
* WideString::FromUTF16BE() / FromUTF16LE() take uint8_t* instead of
  const unsigned short*
* Make it do surrogate fusion
* Call it here

But that's for another day. (And ideally, one day, we'll use a UTF-8
or UTF-16 string type for everything instead of WideString that is
UTF-16 on Win and UTF-32 elsewhere.)

WideString::FromUTF16LE/BE existing and not doing fusion looks like
a footgun, but at the moment it's mostly called from test code, so
it's likely not an active bug.

Change-Id: Idad54dfd9cdbafec5580b58a70b4337cfc1037f7
Reviewed-on: https://pdfium-review.googlesource.com/c/pdfium/+/113911
Reviewed-by: Lei Zhang <thestig@chromium.org>
Commit-Queue: Nico Weber <thakis@chromium.org>
Auto-Submit: Nico Weber <thakis@chromium.org>

commit: d1debc7735c5e84c1f750dd3b94edcaa031291ba
author: Nico Weber <thakis@chromium.org> Tue Nov 28 01:51:54 2023 +0000
committer: Pdfium LUCI CQ <pdfium-scoped@luci-project-accounts.iam.gserviceaccount.com> Tue Nov 28 01:51:54 2023 +0000
tree: 9030864bf6dcd267a1a7c70b8802caeed430c796
parent: b98c5b4c0c240082f78a34a7abb00a8e9409cb17
diff --git a/core/fpdfapi/parser/fpdf_parser_decode.cpp b/core/fpdfapi/parser/fpdf_parser_decode.cpp
index 459e4d9..6412a87 100644
--- a/core/fpdfapi/parser/fpdf_parser_decode.cpp
+++ b/core/fpdfapi/parser/fpdf_parser_decode.cpp

@@ -514,12 +514,11 @@
         dest_buf[dest_pos++] = span[i + 1] << 8 | span[i];
       }
     }
-
-    dest_pos = StripLanguageCodes(dest_buf, dest_pos);
-
 #if defined(WCHAR_T_IS_32_BIT)
     dest_pos = FuseSurrogates(dest_buf, dest_pos);
 #endif
+
+    dest_pos = StripLanguageCodes(dest_buf, dest_pos);
   } else if (span.size() >= 3 && span[0] == 0xef && span[1] == 0xbb &&
              span[2] == 0xbf) {
     result = FX_UTF8Decode(span.subspan(3));
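
To illustrate why the reordering is not expected to change behavior for valid input, here is a minimal standalone sketch of the two passes. It is not PDFium's implementation: FuseSurrogatesSketch() and StripLanguageCodesSketch() are hypothetical names, they operate on std::u32string rather than on the in-place wchar_t buffer used in the diff, and the stripping rule (drop everything between a pair of U+001B markers) is an assumption made for illustration.

```cpp
#include <cstddef>
#include <string>

// Hypothetical stand-in for surrogate fusion: combines each high/low
// surrogate pair into a single code point and copies everything else,
// including unpaired surrogates, unchanged.
std::u32string FuseSurrogatesSketch(const std::u32string& units) {
  std::u32string out;
  for (std::size_t i = 0; i < units.size(); ++i) {
    const char32_t c = units[i];
    const bool is_high = c >= 0xD800 && c <= 0xDBFF;
    if (is_high && i + 1 < units.size()) {
      const char32_t next = units[i + 1];
      if (next >= 0xDC00 && next <= 0xDFFF) {
        out.push_back(0x10000 + ((c - 0xD800) << 10) + (next - 0xDC00));
        ++i;  // The low surrogate has been consumed as well.
        continue;
      }
    }
    out.push_back(c);
  }
  return out;
}

// Hypothetical stand-in for language code stripping: drops each U+001B
// marker, the text after it, and the closing U+001B marker (assumed rule).
std::u32string StripLanguageCodesSketch(const std::u32string& text) {
  std::u32string out;
  for (std::size_t i = 0; i < text.size(); ++i) {
    if (text[i] == 0x001B) {
      ++i;
      while (i < text.size() && text[i] != 0x001B) {
        ++i;  // Skip the language code itself.
      }
      continue;  // The loop increment steps past the closing marker.
    }
    out.push_back(text[i]);
  }
  return out;
}
```

For the code units ESC 'e' 'n' ESC D83D DE00 'A', both orders produce U+1F600 followed by 'A', because the stripped region contains no surrogates. In this sketch the orders only diverge on malformed input where a language code sits between the two halves of a surrogate pair: stripping first lets the halves fuse, while fusing first leaves them unpaired.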