In PDF_DecodeText() UTF-16, do surrogate fusion before language code stripping

No behavior change for valid PDFs, since language codes are ASCII and
so never contain surrogate code units.

Surrogate fusion is conceptually part of UTF-16 decoding, so let's
do complete UTF-16 conversion before doing the PDF-specific language
code stripping. This also makes this path look more like the UTF-8
codepath.
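
A minimal sketch of why the order does not matter for valid input. The
names FuseSurrogatesSketch, StripLanguageCodesSketch and CodeUnits are
hypothetical stand-ins, not the helpers in fpdf_parser_decode.cpp, and
the sketch assumes language codes are regions delimited by U+001B:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in type; the real code works in place on a wchar_t buffer.
using CodeUnits = std::vector<char32_t>;

// Combines each high/low surrogate pair into a single code point.
// Unpaired surrogates are passed through unchanged.
CodeUnits FuseSurrogatesSketch(const CodeUnits& in) {
  CodeUnits out;
  for (size_t i = 0; i < in.size(); ++i) {
    char32_t c = in[i];
    bool is_high = c >= 0xD800 && c <= 0xDBFF;
    bool next_is_low =
        i + 1 < in.size() && in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF;
    if (is_high && next_is_low) {
      out.push_back(0x10000 + ((c - 0xD800) << 10) + (in[i + 1] - 0xDC00));
      ++i;  // Skip the low surrogate that was just consumed.
    } else {
      out.push_back(c);
    }
  }
  return out;
}

// Drops every region delimited by U+001B (assumed language-code escape),
// including the delimiters themselves.
CodeUnits StripLanguageCodesSketch(const CodeUnits& in) {
  CodeUnits out;
  for (size_t i = 0; i < in.size(); ++i) {
    if (in[i] == 0x001B) {
      // Advance to the terminating U+001B, or to the end of input.
      for (++i; i < in.size() && in[i] != 0x001B; ++i) {
      }
      continue;
    }
    out.push_back(in[i]);
  }
  return out;
}

void CheckOrderDoesNotMatter() {
  // ESC "en" ESC, then U+1F600 as the surrogate pair D83D DE00, then 'A'.
  CodeUnits units = {0x1B, 'e', 'n', 0x1B, 0xD83D, 0xDE00, 'A'};
  CodeUnits fuse_first = StripLanguageCodesSketch(FuseSurrogatesSketch(units));
  CodeUnits strip_first = FuseSurrogatesSketch(StripLanguageCodesSketch(units));
  assert(fuse_first == strip_first);
  assert(fuse_first == (CodeUnits{0x1F600, 'A'}));
}
```

In this sketch the two orders can only differ for malformed input, for
example when an escape region sits between the two halves of a
surrogate pair.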

It'd be nice if we could:

* Make WideString::FromUTF16BE() / FromUTF16LE() take uint8_t* instead
  of const unsigned short*
* Make them do surrogate fusion
* Call them here

But that's for another day. (And ideally, one day, we'll use a
UTF-8 or UTF-16 string type for everything instead of WideString,
which is UTF-16 on Windows and UTF-32 elsewhere.)

That WideString::FromUTF16LE/BE exist but don't do fusion looks like
a footgun, but at the moment they're mostly called from test code, so
it's likely not an active bug.

Change-Id: Idad54dfd9cdbafec5580b58a70b4337cfc1037f7
Reviewed-on: https://pdfium-review.googlesource.com/c/pdfium/+/113911
Reviewed-by: Lei Zhang <thestig@chromium.org>
Commit-Queue: Nico Weber <thakis@chromium.org>
Auto-Submit: Nico Weber <thakis@chromium.org>
diff --git a/core/fpdfapi/parser/fpdf_parser_decode.cpp b/core/fpdfapi/parser/fpdf_parser_decode.cpp
index 459e4d9..6412a87 100644
--- a/core/fpdfapi/parser/fpdf_parser_decode.cpp
+++ b/core/fpdfapi/parser/fpdf_parser_decode.cpp
@@ -514,12 +514,11 @@
dest_buf[dest_pos++] = span[i + 1] << 8 | span[i];
}
}
-
- dest_pos = StripLanguageCodes(dest_buf, dest_pos);
-
#if defined(WCHAR_T_IS_32_BIT)
dest_pos = FuseSurrogates(dest_buf, dest_pos);
#endif
+
+ dest_pos = StripLanguageCodes(dest_buf, dest_pos);
} else if (span.size() >= 3 && span[0] == 0xef && span[1] == 0xbb &&
span[2] == 0xbf) {
result = FX_UTF8Decode(span.subspan(3));