Add a utf8() mode that allows byte/UTF-8 strings as input & output. by FGasper · Pull Request #78 · makamaka/JSON-PP

FGasper · 2022-09-07T16:26:24Z

MAINTAINER: See what you think of this. I’ll add documentation updates if you’re amenable to the change itself.

JSON::PP already offers several options that accommodate applications’ nonstandard needs. For example, latin1() caters to applications that use Latin-1 encoding rather than UTF-8, even though that violates the JSON specification.

Some nontrivial Perl applications forgo character decoding. Their authors/maintainers may not know “perlunitut”’s recommended workflow, or the application may simply not care about Unicode. Either way, in such applications it’s ideal for a JSON encoder & decoder to forgo the usual UTF-8 decode/encode steps.

utf8(0) almost achieves this. It falls over, though, if the JSON document contains a Unicode character escape (e.g., "\u00e9"), which JSON::PP decodes as Perl "\xe9". This causes an inconsistency in the decode logic: a literal "é" stored as UTF-8 decodes to the two-character string "\xc3\xa9", whereas "\u00e9" decodes to the single character "\xe9".
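A minimal sketch of that inconsistency, assuming a stock JSON::PP with UTF-8 handling switched off via utf8(0). The variable names are illustrative only:

```perl
use strict;
use warnings;
use JSON::PP;

# utf8(0): input and output are treated as character strings.
# allow_nonref lets older JSON::PP versions decode a bare string.
my $json = JSON::PP->new->utf8(0)->allow_nonref;

# A literal "é" held as its two UTF-8 bytes, never decoded by the application.
my $from_literal = $json->decode(qq<"\xc3\xa9">);

# The same character written as a JSON Unicode escape.
my $from_escape = $json->decode(q<"\u00e9">);

# %vd prints the ordinal of each character, joined by dots.
printf "literal: %vd\n", $from_literal;   # literal: 195.169
printf "escape:  %vd\n", $from_escape;    # escape:  233
```

The two results differ in both length and content, so an application that never decodes its strings cannot treat them as the same text.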

Ordinarily it works to do encode_utf8( JSON::PP->new->utf8->decode(..) ), but that falls over if applications need to allow non-UTF-8 sequences in JSON inputs.
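A sketch of that workaround and of where it breaks, assuming Encode's encode_utf8 and a JSON::PP object in its regular utf8 mode. The sample inputs are made up for illustration:

```perl
use strict;
use warnings;
use JSON::PP;
use Encode qw(encode_utf8);

my $json = JSON::PP->new->utf8->allow_nonref;

# Valid UTF-8 input: decode to characters, then re-encode to bytes.
# The literal "é" and the \u00e9 escape now come out identical.
my $ok = encode_utf8( $json->decode(qq<"\xc3\xa9\\u00e9">) );
printf "%vX\n", $ok;    # C3.A9.C3.A9

# Input with a byte that is not valid UTF-8: decode() throws,
# so there is nothing left to re-encode.
my $bad = eval { $json->decode(qq<"\xff\\u00e9">) };
print "decode failed: $@" unless defined $bad;
```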

In short, a need exists for this Perl string:

qq<"\xff\xc3\xa9\xc3\xa9\\u00e9">

… to decode to "\xff\xc3\xa9\xc3\xa9\xc3\xa9".
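To make the requested semantics concrete, here is a rough emulation on top of an unmodified JSON::PP. It is not the code in this changeset: decode_json_bytes and _escape_to_bytes are hypothetical helpers invented for this sketch. It uses a shorter input than the one above, and it leaves surrogate-pair escapes untouched, so those would still decode to wide characters.

```perl
use strict;
use warnings;
use JSON::PP;

# Hypothetical helper: turn one \uXXXX escape into raw UTF-8 bytes.
# ASCII and surrogate escapes are returned unchanged.
sub _escape_to_bytes {
    my ($hex) = @_;
    my $cp = hex $hex;
    return "\\u$hex" if $cp < 0x80 || ($cp >= 0xD800 && $cp <= 0xDFFF);
    my $bytes = chr $cp;
    utf8::encode($bytes);
    return $bytes;
}

# Hypothetical helper: bytes in, bytes out. NOT part of JSON::PP.
sub decode_json_bytes {
    my ($json_bytes) = @_;

    # Visit every backslash escape in order, so that an escaped
    # backslash followed by "u" is not mistaken for a \u escape.
    $json_bytes =~ s{\\(?:u([0-9A-Fa-f]{4})|(.))}{
        defined $1 ? _escape_to_bytes($1) : "\\$2"
    }ges;

    # With utf8(0) each remaining byte passes through as one character.
    return JSON::PP->new->utf8(0)->allow_nonref->decode($json_bytes);
}

my $out = decode_json_bytes(qq<"\xff\xc3\xa9\\u00e9">);
printf "%vX\n", $out;    # FF.C3.A9.C3.A9
```

Rewriting the text before parsing is fragile compared with handling the escapes inside the decoder, which is the approach the pull request argues for.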

This changeset adds a solution to this problem by changing utf8() from a simple flag to an enum: the existing chars-in-chars-out (0) and bytes-in-chars-out (1) options, plus a new bytes-in-bytes-out option. Named constants are added to avoid “magic numbers”.

charsbar · 2022-09-07T22:29:58Z

I understand your point, but if the change breaks compatibility with JSON::XS, it's unacceptable, because JSON::PP is basically a fallback module for it. I am also reluctant to add a new mode if it's for JSON::PP only. Could you discuss this with the JSON::XS maintainer first?

FGasper force-pushed the utf8_bytes branch from 808adf9 to d848b81 on September 7, 2022, 17:47

FGasper wants to merge 1 commit into makamaka:master from FGasper:utf8_bytes
