Draft JEP: To use UTF-8 as the default charset for the Java virtual machine.
Robert Muir
rcmuir at gmail.com
Wed Feb 21 20:07:07 UTC 2018
On Wed, Feb 21, 2018 at 1:16 PM, Xueming Shen <xueming.shen at oracle.com> wrote:
>
> Hi Robert,
>
> Understood, a silent replacement might not be the desired behavior in
> some use scenarios. Any more details regarding what "most apps want"
> when there is malformed/unmappable input? It appears the best the
> underlying de/encoder can do here is to throw an IOException. Given
> that the caller of the Reader/Writer does not have access to the bytes
> of the underlying stream src (reader)/dst (writer), it is in theory
> impossible to do anything to recover and continue without risking data
> loss. The assumption here is that if you want fine-grained control of
> the de/encoding, you might want to work with the
> Input/OutputStream/Channel + CharsetDe/Encoder instead of Reader/Writer.
>
> No, I'm not saying we can't do
> Reader(CharsetDecoder)/Writer(CharsetEncoder), just wanted to know
> what's the real use scenario and what's the better/best choice here.
>
I think throwing an exception is the best default. This is the default
behavior of Python, for example, unless you specifically ask for
"replace" or "ignore":

>>> b'\xFFabc'.decode("utf-8")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte

It's also the default behavior of the 'iconv' command-line tool used for
converting charsets, unless you pass additional options:

$ iconv -f utf-8 -t utf-8 test2.mp4
ftypisomisomiso2avc1mp41e
iconv: test2.mp4:1:26: cannot convert

Unfortunately, in Java, the constructors that take a Charset or String
parameter silently replace bad input with \uFFFD, etc. It's necessary
to pass a CharsetDecoder to get an exception when something goes wrong.
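
To make the difference concrete, here is a minimal sketch that feeds the same bytes as the Python example through both kinds of constructor. The class and method names (DecodeDemo, lenient, strict, readAll) are made up for illustration; only the java.io and java.nio.charset calls are JDK APIs.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class DecodeDemo {
    // Same input as the Python example: an invalid start byte, then "abc".
    static final byte[] BAD = {(byte) 0xFF, 'a', 'b', 'c'};

    // Drains a Reader into a String (illustrative helper, not a JDK method).
    static String readAll(Reader reader) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (Reader r = reader) {
            int c;
            while ((c = r.read()) != -1) {
                sb.append((char) c);
            }
        }
        return sb.toString();
    }

    // InputStreamReader(InputStream, Charset): malformed input is replaced silently.
    static String lenient(byte[] bytes) throws IOException {
        return readAll(new InputStreamReader(
                new ByteArrayInputStream(bytes), StandardCharsets.UTF_8));
    }

    // InputStreamReader(InputStream, CharsetDecoder): the decoder's error actions apply.
    // REPORT is spelled out here to make the intent visible.
    static String strict(byte[] bytes) throws IOException {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        return readAll(new InputStreamReader(new ByteArrayInputStream(bytes), decoder));
    }

    public static void main(String[] args) throws IOException {
        // The bad byte comes back as U+FFFD and no error is raised.
        System.out.println(lenient(BAD));
        try {
            strict(BAD);
        } catch (CharacterCodingException e) {
            // MalformedInputException is a CharacterCodingException (an IOException).
            System.out.println("strict decode failed: " + e);
        }
    }
}
```

Nothing in the call to lenient() tells the reader that a byte was thrown away, which is the part that makes these bugs hard to track down.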
The current situation is especially confusing because nothing in the
javadocs indicates that the behavior of InputStreamReader(x, Charset)
and InputStreamReader(x, String) differs substantially from
InputStreamReader(x, CharsetDecoder)! I think the Charset and String
parameters should default to REPORT, so that the behavior of all
constructors is consistent. If you want to replace, you should have
to ask for it. I think replacement has use cases, but they are more
"expert", e.g. web crawling and so on. In general, wrong bytes
indicate a problem, and it can be very difficult to debug these issues
when Java hides them by default...
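
For what "you should have to ask for it" could look like with today's API, here is a rough sketch. Readers.newReader and Readers.decode are hypothetical helpers, not JDK methods and not a proposed signature; the point is only that the caller names the error policy explicitly instead of inheriting a silent one.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class Readers {
    // Hypothetical helper: the caller must state what happens on bad input.
    static Reader newReader(InputStream in, Charset cs, CodingErrorAction onError) {
        return new InputStreamReader(in, cs.newDecoder()
                .onMalformedInput(onError)
                .onUnmappableCharacter(onError));
    }

    // Decodes a byte array as UTF-8 under the given policy (illustration only).
    static String decode(byte[] bytes, CodingErrorAction onError) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (Reader r = newReader(new ByteArrayInputStream(bytes),
                StandardCharsets.UTF_8, onError)) {
            int c;
            while ((c = r.read()) != -1) {
                sb.append((char) c);
            }
        }
        return sb.toString();
    }
}
```

With this shape, a web crawler would pass CodingErrorAction.REPLACE or CodingErrorAction.IGNORE on purpose, and everyone else would pass CodingErrorAction.REPORT and get an IOException on wrong bytes.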