JEP400 vs new Scanner(System.in)

Tue Oct 18 08:35:22 UTC 2022

Hi!

Perhaps core-libs-dev is the more appropriate mailing list, but there are
hundreds of posts a month there. I'm not sure whether the severity of the
problem would require a more fundamental solution. So I will reply here as
well, because I see that Brian Goetz reviewed and endorsed JEP 400, and
I would like to see him endorse a rescue action as well, to keep my trust
in the Java platform.
If there is a priority higher than CRITICAL, this issue deserves it.
If left as it is, this JEP will completely ruin Java as a starting language,
because no simple beginner's example from the web will work anymore and any
new one will look extremely complicated. Reinier described it very
realistically.

A side note on java.io.Console: this class does not work in IDEs.
System.console() returns null in IDEs. It only works when Java is invoked
from the native OS console, so it is of little use for this purpose.
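
To illustrate the null case, here is a minimal sketch of a fallback. The
class KeyboardInput and its open() method are hypothetical names of mine,
not JDK API, and the sketch assumes the JDK versions discussed in this
thread, where System.console() returns null when no console is attached:

```java
import java.io.Console;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;

class KeyboardInput {
    // Hypothetical helper (not a JDK API): use the Console when one is
    // attached, otherwise fall back to System.in.
    static Reader open() {
        Console console = System.console();
        if (console != null) {
            return console.reader();
        }
        // Reached when System.console() is null, e.g. in many IDEs or when
        // input is redirected. The "native.encoding" property exists since
        // JDK 17; if it is absent, the default charset is used instead.
        String name = System.getProperty("native.encoding");
        Charset charset = (name != null) ? Charset.forName(name) : Charset.defaultCharset();
        return new InputStreamReader(System.in, charset);
    }
}
```

A beginner would still have to know to write something like this, which is
the point of the complaint.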

— Kamil Sevecek

On Thu, 13 Oct 2022 at 19:07, Ron Pressler <ron.pressler at oracle.com> wrote:

> Hi.
>
> The appropriate list is core-libs-dev, where this discussion should
> continue.
>
> System.in is the standard input, which may or may not be the keyboard. For
> keyboard input, take a look at the java.io.Console class [1], in particular
> its charset and reader methods.
>
> [1]:
> https://docs.oracle.com/en/java/javase/19/docs/api/java.base/java/io/Console.html
>
> — Ron
>
> On 13 Oct 2022, at 16:20, Reinier Zwitserloot <reinier at zwitserloot.com>
> wrote:
>
> PREAMBLE: I’m not entirely certain amber-dev is the appropriate venue. If
> not, where should this be discussed? It’s not quite a bug but nearly so,
> and not quite a simple feature request either.
>
> JDK18 brought JEP400 which changes the default charset encoding to UTF-8.
> This, probably out of necessity, goes quite far, in that
> Charset.defaultCharset() is now more or less a constant - always returns
> UTF_8. It’s now quite difficult to retrieve the OS-configured encoding (the
> ’native’ encoding).
>
> However, that does mean one of the most common lines in all of Java’s
> history is now necessarily buggy: new Scanner(System.in) is now broken.
> Always, unless your docs specifically state that you must feed the app
> UTF_8 data. Linting tools ought to flag it as incorrect. It’s
> incorrect in a nasty way too: initially it seems to work fine, but if
> you’re on an OS whose native encoding isn’t UTF-8, this is subtly broken;
> enter non-ASCII characters on the command line and the app doesn’t handle
> them appropriately. The bug is even utterly undiscoverable on Macs
> and most Linux computers. How can you figure out your code is broken
> if all the machines you test it on use UTF-8 as an OS default?
>
> This affects beginning java programmers particularly (who tend to be
> writing some command line-interactive apps at first). Brian Goetz’s post
> “Paving the Onramp” (
> https://openjdk.org/projects/amber/design-notes/on-ramp) shows that the
> experience for new users is of some importance to the OpenJDK team. The
> current state of writing command line interactive java apps is
> inconsistent with that goal.
>
> The right way to read system input in a way that works in both pre- and
> post-JEP400 JVM editions appears to be, as far as I can tell:
>
> Charset nativeCharset = Charset.forName(System.getProperty("native.encoding", Charset.defaultCharset().name()));
> Scanner sc = new Scanner(System.in, nativeCharset);
>
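
The failure mode described above can be reproduced without a non-UTF-8
machine. The following is only a sketch: the class name CharsetMismatchDemo
is made up, and it assumes, purely for illustration, a terminal that sends
ISO-8859-1 bytes. It decodes the same typed bytes once as UTF-8 (what new
Scanner(System.in) does from JDK 18 on) and once with the matching charset:

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Scanner;

class CharsetMismatchDemo {
    // Reads the first line of the given bytes, decoding with the given charset.
    // Scanner(InputStream, Charset) is available since Java 10.
    static String firstLine(byte[] bytes, Charset charset) {
        try (Scanner sc = new Scanner(new ByteArrayInputStream(bytes), charset)) {
            return sc.nextLine();
        }
    }

    public static void main(String[] args) {
        // Assumption for illustration: the user typed "café" + Enter on a
        // terminal whose native encoding is ISO-8859-1.
        byte[] typed = "caf\u00e9\n".getBytes(StandardCharsets.ISO_8859_1);

        String asUtf8 = firstLine(typed, StandardCharsets.UTF_8);
        String asNative = firstLine(typed, StandardCharsets.ISO_8859_1);

        System.out.println("decoded as UTF-8 matches input:  " + "caf\u00e9".equals(asUtf8));
        System.out.println("decoded as native matches input: " + "caf\u00e9".equals(asNative));
    }
}
```

Only the second decode gives back the text that was typed; the first one
silently mangles the accented character, and no exception is thrown.
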
>
> I’ll risk the hyperbole: That’s.. atrocious. Hopefully I’m missing
> something!
>
> This breaks _thousands_ of blogs, tutorials, Stack Overflow answers, and
> books: everything that contains new Scanner(System.in).
> Even sysin interaction that doesn’t use Scanner is likely broken; the
> general strategy then becomes:
>
> new InputStreamReader(System.in);
>
>
> which suffers from the same problem.
>
> I see a few directions for trying to address this; I’m not quite sure
> which way would be most appropriate:
>
>
>    - Completely re-work keyboard input, in light of *Paving the on-ramp*.
>    Scanner has always been a problematic API if used for keyboard input, in
>    that the default delimiter isn’t convenient. I think the single most common
>    beginner java stackoverflow question is the bizarre interaction between
>    scanner’s nextLine() and scanner’s next(), and to make matters
>    considerably worse, the proper fix (which is to call
>    .useDelimiter("\\R") on the scanner first) is mentioned in less than 1%
>    of answers; the vast majority of tutorials and answers tell you to call
>    .nextLine() after every .nextX() call. That is a suboptimal suggestion
>    (it now means using space to delimit your input is broken). Scanner is
>    now also quite inconsistent: the constructor goes for ‘internet
>    standard’, using UTF-8 as a default even if the OS does not, but the
>    locale *does* go by platform default, which affects double parsing
>    amongst other things: scanner.nextDouble() will require you to use a
>    comma as the decimal separator if your OS is configured to use the Dutch
>    locale, for example.
>    It’s weird that scanner neither fully follows common platform-independent
>    expectations (english locale, UTF-8), nor local-platform expectation
>    (OS-configured locale and OS-configured charset). One way out is to make a
>    new API for ‘command line apps’ and take into account Paving the on-ramp’s
>    plans when designing it.
>    - Rewrite specifically the new Scanner(InputStream) constructor as
>    defaulting to native encoding even when everything else in java defaults to
>    UTF-8 now, because that constructor is 99% used for System.in. Scanner
>    has its own File-based constructor, so new
>    Scanner(Files.newInputStream(..)) is quite rare.
>    - Define that constructor to act as follows: the charset used is the
>    platform default (i.e., from JDK18 and up, UTF-8), *unless* arg ==
>    System.in is true, in which case the scanner uses native encoding.
>    This is a bit bizarre to write in the spec but does the right thing in
>    most circumstances and unbreaks thousands of tutorials, blogs, and answer
>    sites, and is most convenient to code against. That’s usually the case with
>    voodoo magic (because this surely risks being ’too magical’): It’s
>    convenient and does the right thing almost always, at the risk of being
>    hard to fathom and producing convoluted spec documentation.
>    - Attack the problem that what’s really broken isn’t so much scanner,
>    it’s System.in itself: byte based, of course, but now that all java
>    methods default to UTF-8, almost all interactions with it (given that most
>    System.in interaction is char-based, not byte-based) are now also
>    broken. Create a second field or method in System that gives you a
>    Reader instead of an InputStream, with the OS-native encoding applied
>    to make it. This still leaves those thousands of tutorials broken, but at
>    least the proper code is now simply new Scanner(System.charIn()) or
>    whatnot, instead of the atrocious snippet above.
>    - Even less impactful, make a new method in Charset to get the native
>    encoding without having to delve into System.getProperty().
>    Charset.nativeEncoding() seems like a method that should exist.
>    Unfortunately this would be of no help to create code that works pre- and
>    post-JEP400, but in time, having code that only works post-JEP400 is fine,
>    I assume.
>    - Create a new concept ‘represents a stream that would use platform
>    native encoding if characters are read/written to it’, have System.in
>    return true for this, and have filterstreams like BufferedInputStream just
>    pass the call through, then redefine relevant APIs such as Scanner and
>    PrintStream (e.g. anything that internalises conversion from bytes to
>    characters) to pick charset encoding (native vs UTF8) based on that
>    property. This is a more robust take on ‘new Scanner(System.in) should
>    do the right thing’. Possibly the in/out/err streams that Process gives
>    you should also have this flag set.
>
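
The two Scanner quirks mentioned in the first bullet (the nextLine()
interaction and the locale-dependent number parsing) can be shown with a
short sketch. The class and method names below are hypothetical, chosen
only for this example, and the input strings stand in for what a user
would type:

```java
import java.util.Locale;
import java.util.Scanner;

class ScannerPitfalls {
    // Default delimiter: after nextInt(), nextLine() returns only the
    // (empty) remainder of the line the number was on.
    static String afterNextIntDefault(String input) {
        Scanner sc = new Scanner(input);
        sc.nextInt();
        return sc.nextLine();
    }

    // With "\\R" (any line break) as delimiter, each token is a whole line,
    // so next() after nextInt() returns the following line.
    static String afterNextIntLineDelimited(String input) {
        Scanner sc = new Scanner(input).useDelimiter("\\R");
        sc.nextInt();
        return sc.next();
    }

    // Number parsing follows the scanner's locale, not a fixed format.
    static double parseDouble(String input, Locale locale) {
        return new Scanner(input).useLocale(locale).nextDouble();
    }

    public static void main(String[] args) {
        String typed = "42\nAlice\n";
        System.out.println("[" + afterNextIntDefault(typed) + "]");
        System.out.println("[" + afterNextIntLineDelimited(typed) + "]");
        // A Dutch locale expects a comma as the decimal separator.
        System.out.println(parseDouble("3,14", Locale.forLanguageTag("nl-NL")));
    }
}
```

With the default delimiter the name is not returned by the call after
nextInt(); with the line-break delimiter it is. Note that with "\\R" as
delimiter, spaces no longer separate tokens.
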
>
>
> If it was up to me, I think a multitude of steps are warranted, each
> relatively simple.
>
>
>    - Create Charset.nativeEncoding(), which simply returns
>    Charset.forName(System.getProperty("native.encoding")), but with the
>    advantage that it’s shorter, doesn’t require knowing a magic string, and
>    will fail at compile time if compiled against versions that predate the
>    existence of the native.encoding property, instead of failing with an
>    exception at runtime.
>    - Create System.charIn(), which just returns an InputStreamReader
>    wrapped around System.in, but with native encoding applied.
>    - Put the job of how java apps do basic command line stuff on the
>    agenda as a thing that should probably be addressed in the next 5 years or
>    so, maybe after the steps laid out in Paving the on-ramp are more fleshed
>    out.
>    - In order to avoid problems, *before* the next LTS goes out, re-spec new
>    Scanner(System.in) to default to native encoding, specifically when
>    the passed inputstream is identical to System.in. Don’t bother with
>    trying to introduce an abstracted ‘prefers native encoding’ flag system.
>
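
For concreteness, the steps proposed above could look roughly like the
following in user code today. This is a sketch only: NativeIo and all of
its methods are hypothetical stand-ins for the proposed
Charset.nativeEncoding(), System.charIn() and the re-specified Scanner
constructor, none of which exist in the JDK. It also assumes that falling
back to the default charset is acceptable when the native.encoding
property is absent (JDKs before 17):

```java
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.util.Scanner;

class NativeIo {
    // Stand-in for the proposed Charset.nativeEncoding().
    static Charset nativeEncoding() {
        String name = System.getProperty("native.encoding");
        return (name != null) ? Charset.forName(name) : Charset.defaultCharset();
    }

    // Stand-in for the proposed System.charIn().
    static Reader charIn() {
        return new InputStreamReader(System.in, nativeEncoding());
    }

    // The "arg == System.in" rule: native encoding for System.in itself,
    // the default charset for every other stream.
    static Charset charsetFor(InputStream source) {
        return (source == System.in) ? nativeEncoding() : Charset.defaultCharset();
    }

    // What a re-specified new Scanner(InputStream) would effectively do.
    static Scanner scannerFor(InputStream source) {
        return new Scanner(source, charsetFor(source));
    }
}
```

With such helpers the beginner-facing line becomes
NativeIo.scannerFor(System.in) or new Scanner(NativeIo.charIn()). The
identity check is deliberately narrow: a BufferedInputStream wrapped
around System.in would not match it.
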
>
>  --Reinier Zwitserloot
>
>
>