tree 2f5c96180737cd8388754f0a1b059c49ddf9c68b
parent 445b54a7346277d3aa270c3b3c4e049dd9cf14d8
author Nico Weber <thakis@chromium.org> 1700633475 +0000
committer Pdfium LUCI CQ <pdfium-scoped@luci-project-accounts.iam.gserviceaccount.com> 1700633475 +0000

Extract language code stripping from PDF_DecodeText() into function

For Unicode text, strings can contain 0x001b (for UTF-16) or 0x1b (for
UTF-8) followed by a 2-byte BCP 47 language code (two ASCII bytes),
optionally followed by a 2-byte ISO 3166 country code (another two ASCII
bytes), terminated by another 0x001b / 0x1b.

These can be used to put different translations of the same text into
the same string.  But we currently just strip out these language codes.

Since the language and country codes are ASCII and their lengths are a
multiple of two bytes, it's ok to strip them after doing UTF-16 / UTF-8
conversion.
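
As a rough illustration of stripping after conversion (a minimal sketch,
not the code moved in this CL): the hypothetical helper
StripLanguageCodesSketch() below assumes the text is already decoded into
a std::u16string, and it assumes that an escape sequence with no closing
marker swallows the rest of the string. It skips everything between a pair
of 0x001b markers without inspecting the code in between.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper for illustration only; the name, signature, and
// handling of unterminated sequences are assumptions, not PDFium's API.
std::u16string StripLanguageCodesSketch(const std::u16string& text) {
  constexpr char16_t kEscape = 0x001b;
  std::u16string result;
  result.reserve(text.size());
  for (std::size_t i = 0; i < text.size(); ++i) {
    if (text[i] != kEscape) {
      // Ordinary character: keep it.
      result.push_back(text[i]);
      continue;
    }
    // Opening marker: skip ahead to the closing marker, or to the end of
    // the string if the sequence is unterminated.
    ++i;
    while (i < text.size() && text[i] != kEscape) {
      ++i;
    }
    // The loop's ++i then steps past the closing marker.
  }
  return result;
}
```

Because the sketch only looks for the 0x001b markers, it does not depend
on how the bytes between them were decoded.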

So extract this code into a separate function.
(I think the function can be simplified a bit now, but in this CL I'd
like to try and do a pure code move.)

This modifies code added in
https://pdfium-review.googlesource.com/c/pdfium/+/41070

(After this it's hopefully easy to add support for UTF-8 text strings,
which is my actual goal.)

No intended behavior change.

Change-Id: I5e25a26a8f30308ee6ca377f17e82850f2d43274
Reviewed-on: https://pdfium-review.googlesource.com/c/pdfium/+/113790
Reviewed-by: Lei Zhang <thestig@chromium.org>
Commit-Queue: Lei Zhang <thestig@chromium.org>
Auto-Submit: Nico Weber <thakis@chromium.org>
