JEP 400 vs new Scanner(System.in)
Reinier Zwitserloot
reinier at zwitserloot.com
Wed Oct 19 23:05:04 UTC 2022
PREAMBLE: Because I wasn't sure where to post it, this was posted to
amber-dev before. I have since updated it to take into account Ron Pressler's
notes on System.console, and Brian Goetz's notes on steering clear of
shoving deadlines into debate posts like this one.
—
JDK 18 brought JEP 400, which changes the default charset to UTF-8.
This change, probably out of necessity, goes quite far, in that
Charset.defaultCharset() is now more or less a constant: it always returns
UTF-8. It is now quite difficult to retrieve the OS-configured encoding (the
'native' encoding).
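As far as I can tell, the most direct way to get at it now is a magic system
property. A minimal probe of mine, assuming a JDK 17+ runtime (where the
native.encoding property exists):

import java.nio.charset.Charset;

public class EncodingProbe {
    public static void main(String[] args) {
        // Post-JEP 400 this is (more or less) always UTF-8:
        System.out.println("defaultCharset : " + Charset.defaultCharset());
        // The OS-configured ('native') encoding is available as a system property (JDK 17+):
        System.out.println("native.encoding: " + System.getProperty("native.encoding"));
    }
}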
However, that does mean one of the most common lines in all of Java's
history is now necessarily buggy: new Scanner(System.in) is broken.
Always, unless your docs specifically state that you must feed the app
UTF-8 data. Linting tools ought to flag it as incorrect. It's
incorrect in a nasty way, too: initially it seems to work fine, but if
you're on an OS whose native encoding isn't UTF-8, it is subtly broken;
enter non-ASCII characters on the command line and the app doesn't handle
them appropriately. A bug that is utterly undiscoverable on Macs
and most Linux machines, even. How can you figure out your code is broken
if all the machines you test it on use UTF-8 as the OS default?
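To make the failure mode concrete, here is a small simulation of mine (not
something you'd ship): it feeds Scanner the bytes a windows-1252 console
would produce, which is effectively what new Scanner(System.in) sees on such
a machine when run on JDK 18+ with no encoding-related flags:

import java.io.ByteArrayInputStream;
import java.nio.charset.Charset;
import java.util.Scanner;

public class Jep400Breakage {
    public static void main(String[] args) {
        // A user types "héllo" on a console whose native encoding is windows-1252:
        byte[] typed = "héllo\n".getBytes(Charset.forName("windows-1252"));

        // new Scanner(InputStream) decodes with the default charset, which
        // post-JEP 400 is UTF-8, so the 0xE9 byte cannot be decoded:
        Scanner sc = new Scanner(new ByteArrayInputStream(typed));
        System.out.println(sc.nextLine()); // "h?llo" (replacement character), not "héllo"
    }
}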
This affects beginning Java programmers in particular, who tend to start out
writing command-line-interactive apps. Judging by Brian Goetz's post
"Paving the Onramp" (
https://openjdk.org/projects/amber/design-notes/on-ramp), the experience
for new users is evidently of some importance to the OpenJDK team, and the
current state of writing command-line-interactive Java apps is inconsistent
with that goal.
The right way to read system input in a way that works on both pre- and
post-JEP 400 JVMs appears to be, as far as I can tell:

Charset nativeCharset = Charset.forName(
    System.getProperty("native.encoding", Charset.defaultCharset().name()));
Scanner sc = new Scanner(System.in, nativeCharset);
I'll risk the hyperbole: that's... atrocious. Hopefully I'm missing
something!
JEP 400 breaks _thousands_ of blogs, tutorials, Stack Overflow answers, and
books in the process: everything that contains new Scanner(System.in). Even
System.in interaction that doesn't use Scanner is likely broken; the general
strategy there boils down to:

new InputStreamReader(System.in);

which suffers from the same problem.
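Presumably the reader path needs the same native-encoding treatment as the
Scanner snippet above; roughly:

Charset nativeCharset = Charset.forName(
    System.getProperty("native.encoding", Charset.defaultCharset().name()));
Reader in = new InputStreamReader(System.in, nativeCharset);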
I see a few directions for trying to address this; I’m not quite sure which
way would be most appropriate:
- Completely re-work keyboard input, in light of *Paving the on-ramp*.
Scanner has always been a problematic API when used for keyboard input, in
that the default delimiter isn't convenient. I think the single most common
beginner Java Stack Overflow question is about the bizarre interaction
between Scanner's nextLine() and Scanner's next(), and to make matters
considerably worse, the proper fix (which is to call .useDelimiter("\\R") on
the scanner first) is mentioned in less than 1% of answers; the vast majority
of tutorials and answers tell you to call .nextLine() after every .nextX()
call instead, which is a suboptimal suggestion (it means using a space to
delimit your input no longer works). A sketch of the quirk and the fix
follows this list of options. Scanner is now also quite inconsistent: the
constructor goes for the 'internet standard', using UTF-8 as a default even
if the OS does not, but the locale *does* go by platform default, which
affects double parsing amongst other things: scanner.nextDouble() will
require you to use a comma as the fraction separator if your OS is
configured to use the Dutch locale, for example. It's weird that Scanner
neither fully follows common platform-independent expectations (English
locale, UTF-8), nor local-platform expectations (OS-configured locale and
OS-configured charset). One way out is to make a new API for 'command line
apps' and take Paving the on-ramp's plans into account when designing it.
- Rewrite specifically the new Scanner(InputStream) constructor to default
to the native encoding even though everything else in Java now defaults to
UTF-8, because that constructor is used for System.in 99% of the time.
Scanner has its own File-based constructor, so new
Scanner(Files.newInputStream(..)) is quite rare.
- Define that constructor to act as follows: the charset used is the
platform default (i.e., from JDK 18 and up, UTF-8), *unless* arg ==
System.in is true, in which case the scanner uses the native encoding. This
is a bit bizarre to write in the spec, but it does the right thing in most
circumstances, unbreaks thousands of tutorials, blogs, and answer sites,
and is the most convenient to code against. That's usually the case with
voodoo magic (because this surely risks being 'too magical'): it's
convenient and does the right thing almost always, at the risk of being hard
to fathom and producing convoluted spec documentation.
- Attack the problem at its root: what's really broken isn't so much
Scanner, it's System.in itself. It is byte-based, of course, but now that
all Java methods default to UTF-8, almost all interactions with it (given
that most System.in interaction is char-based, not byte-based) are also
broken. Create a second field or method in System that gives you a Reader
instead of an InputStream, with the OS-native encoding applied to construct
it. This still leaves those thousands of tutorials broken, but at least the
proper code is then simply new Scanner(System.charIn()) or whatnot, instead
of the atrocious snippet above.
- Even less impactful: add a new method to Charset to get the native
encoding without having to delve into System.getProperty().
Charset.nativeEncoding() seems like a method that should exist.
Unfortunately this would be of no help in writing code that works both pre-
and post-JEP 400, but in time, having code that only works post-JEP 400 is
fine, I assume.
- Create a new concept, 'represents a stream that would use the platform's
native encoding if characters are read from or written to it', have
System.in return true for this, and have filter streams like
BufferedInputStream just pass the call through; then redefine relevant APIs
such as Scanner and PrintStream (i.e. anything that internalises conversion
between bytes and characters) to pick the charset (native vs. UTF-8) based
on that property. This is a more robust take on 'new Scanner(System.in)
should do the right thing'. Possibly the in/out/err streams that Process
gives you should also have this flag set.
- (Based on feedback from Ron Pressler in amber-dev) Try to move the
community away from treating System.in and System.out as the streams to
be used for 'command line apps', and towards using System.console()
instead, which is already char-based and is better positioned to take care
of picking the right charset for you. However, this is quite a big job,
given that virtually all tutorials, books, and Q&A sites like Stack Overflow
talk about System.in/out and not about Console. Even if somehow the message
gets out and these start using Console instead, the experience for Java
developers would be deplorable, given that *no IDE supports Console!* -
possibly because it is difficult for them to set it up properly? At
any rate, just like the JDBC group works together with DB vendors to ensure
JDBC actually is fit for purpose, something would have to be set up to
ensure tool developers like the Eclipse team or IntelliJ update their
templates and support Console for their run/debug-inside-IDE features. An
open question then comes up: how does the OpenJDK team move the community
in the direction that the OpenJDK wants it to move? "Build it and they
will come"? I highly doubt that would work here; System.in works well
enough for the base case at first glance. At the very least, a statement by
the OpenJDK that new Scanner(System.in) is a bad idea would help to
start the decades-long work of trying to break down established Stack
Overflow answers, mark tutorials as obsolete, etc. I have no idea if the
OpenJDK even wants to meddle with community interaction like this, but if
it does not, then "it's fine, Console exists, it's not our problem the
community doesn't use it" seems a bit hollow.
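As promised above, a sketch of the delimiter quirk from the first option,
using a fixed String as the source so it is reproducible (with System.in and
a user typing the same two lines, the behaviour is identical):

import java.util.Scanner;

public class ScannerQuirk {
    public static void main(String[] args) {
        String input = "42\nhello world\n"; // what the user "types"

        // Default delimiter: nextInt() consumes "42" but not the line
        // terminator, so the following nextLine() returns "".
        Scanner broken = new Scanner(input);
        int a = broken.nextInt();
        String s = broken.nextLine();
        System.out.println(a + " / [" + s + "]");   // 42 / []

        // The fix: delimit on line breaks only, and use next() rather than
        // nextLine(). Every nextX() call now consumes exactly one line.
        Scanner fixed = new Scanner(input).useDelimiter("\\R");
        int b = fixed.nextInt();
        String t = fixed.next();
        System.out.println(b + " / [" + t + "]");   // 42 / [hello world]
    }
}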
If it were up to me, I think a number of steps are warranted, each
relatively simple:
- Create Charset.nativeEncoding(), which would simply return
Charset.forName(System.getProperty("native.encoding")), but with the
advantages that it's shorter, doesn't require knowing a magic string, and
will fail at compile time if compiled against versions that predate the
existence of the native.encoding property, instead of failing at runtime.
- Create System.charIn(), which would just return an InputStreamReader
wrapped around System.in, but with the native encoding applied.
- Put the question of how Java apps should do basic command-line interaction
on the agenda as something that should probably be addressed in the next 5
years or so, maybe after the steps laid out in Paving the on-ramp are more
fleshed out.
- In order to avoid problems, re-spec new Scanner(System.in) to default
to the native encoding, specifically when the passed InputStream is
identical to System.in. Don't bother trying to introduce an abstracted
'prefers native encoding' flag system.
- Contact IntelliJ, Eclipse, and possibly Maven/Gradle (insofar as
Console doesn't work when using mvn run and the like) and ask them what
they need in order to add Console support, keeping in mind that encoding is
important, and possibly ask them to rewire their syso (Eclipse) and sysout
(IntelliJ) template shortcuts away from System.out.println and towards
System.console().printf instead; a sketch of Console-based I/O follows
below.
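For reference, roughly what Console-based interaction looks like today; note
the null check, which is exactly the part that trips up IDE run
configurations:

import java.io.Console;

public class ConsoleEcho {
    public static void main(String[] args) {
        Console con = System.console();
        if (con == null) {
            // Not attached to a terminal - the common case inside IDEs today.
            System.err.println("No console available");
            return;
        }
        String name = con.readLine("What is your name? ");
        con.printf("Hello, %s!%n", name);
    }
}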
--Reinier Zwitserloot