JEP400 vs new Scanner(System.in)
Kamil Ševeček
kamil at sevecek.net
Tue Oct 18 08:35:22 UTC 2022
Hi!
Perhaps core-libs-dev is the more appropriate mailing list, but there are
hundreds of posts a month there. I'm not sure whether the severity of the
problem would require a more fundamental solution. So I will reply here as
well, because I see that Brian Goetz reviewed and endorsed JEP 400, and
I would like to see him endorse a rescue action as well, to keep my trust
in the Java platform.
If there is a higher priority than CRITICAL, then this is the case.
If left as it is, this JEP will completely ruin Java as a starting language,
because no simple beginner's example from the web will work anymore and any
new one will look extremely complicated. Reinier described it very
realistically.
Just a side note on java.io.Console: this class does not work in IDEs.
System.console() returns null in IDEs. It only works when Java is invoked
from the native OS console. Therefore it is of very little use to beginners.
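
A minimal sketch of the fallback this forces on every program, assuming
JDK 17 or later for the native.encoding system property. The names
KeyboardInput, nativeCharset and open are made up for illustration and are
not part of any JDK API:

```java
import java.io.BufferedReader;
import java.io.Console;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;

final class KeyboardInput {
    // Hypothetical helper: the OS-configured charset, or the default
    // charset when the native.encoding property is absent (before JDK 17).
    static Charset nativeCharset() {
        String name = System.getProperty("native.encoding");
        return (name != null) ? Charset.forName(name) : Charset.defaultCharset();
    }

    // Hypothetical helper: use the console when there is one, otherwise
    // decode System.in with the native charset (the IDE case).
    static Reader open() {
        Console console = System.console();
        if (console != null) {
            return console.reader();
        }
        return new InputStreamReader(System.in, nativeCharset());
    }

    public static void main(String[] args) throws IOException {
        BufferedReader in = new BufferedReader(open());
        System.out.print("Name: ");
        System.out.println("Hello, " + in.readLine());
    }
}
```

That is a lot of ceremony compared to the one-liner every tutorial shows.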
— Kamil Sevecek
On Thu, 13 Oct 2022 at 19:07, Ron Pressler <ron.pressler at oracle.com> wrote:
> Hi.
>
> The appropriate list is core-libs-dev, where this discussion should
> continue.
>
> System.in is the standard input, which may or may not be the keyboard. For
> keyboard input, take a look at the java.io.Console class [1], in particular
> its charset and reader methods.
>
> [1]:
> https://docs.oracle.com/en/java/javase/19/docs/api/java.base/java/io/Console.html
>
> — Ron
>
> On 13 Oct 2022, at 16:20, Reinier Zwitserloot <reinier at zwitserloot.com>
> wrote:
>
> PREAMBLE: I’m not entirely certain amber-dev is the appropriate venue. If
> not, where should this be discussed? It’s not quite a bug but nearly so,
> and not quite a simple feature request either.
>
> JDK18 brought JEP400 which changes the default charset encoding to UTF-8.
> This, probably out of necessity, goes quite far, in that
> Charset.defaultCharset() is now more or less a constant - always returns
> UTF_8. It’s now quite difficult to retrieve the OS-configured encoding (the
> ’native’ encoding).
>
> However, that does mean one of the most common lines in all of java’s
> history, is now necessarily buggy: new Scanner(System.in) is now broken.
> Always, unless your docs specifically state that you must feed the app
> UTF_8 data. Linting tools ought to flag it down as incorrect. It’s
> incorrect in a nasty way too: initially it seems to work fine, but if
> you’re on an OS whose native encoding isn’t UTF-8, this is subtly broken;
> enter non-ASCII characters on the command line and the app doesn’t handle
> them appropriately. A bug that is literally utterly undiscoverable on macs
> and most linux computers, even. How can you figure out your code is broken
> if all the machines you test it on use UTF-8 as an OS default?
>
> This affects beginning java programmers particularly (who tend to be
> writing some command line-interactive apps at first). In light of Brian
> Goetz’s post “Paving the Onramp” (
> https://openjdk.org/projects/amber/design-notes/on-ramp) - the experience
> for new users is evidently of some importance to the OpenJDK team. In light
> of that, the current state of writing command line interactive java apps is
> inconsistent with that goal.
>
> The right way to read system input in a way that works in both pre- and
> post-JEP400 JVM editions appears to be, as far as I can tell:
>
> Charset nativeCharset = Charset.forName(System.getProperty("native.encoding", Charset.defaultCharset().name()));
> Scanner sc = new Scanner(System.in, nativeCharset);
>
>
> I’ll risk the hyperbole: That’s.. atrocious. Hopefully I’m missing
> something!
>
> Breaking _thousands_ of blogs, tutorials, stack overflow answers, and
> books in the process, everything that contains new Scanner(System.in).
> Even sysin interaction that doesn’t use scanner is likely broken; the
> general strategy then becomes:
>
> new InputStreamReader(System.in);
>
>
> which suffers from the same problem.
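
A small sketch of that failure mode, assuming ISO-8859-1 stands in for a
non-UTF-8 native encoding and a byte array stands in for System.in. The
class name EncodingMismatchDemo and its readLine helper are illustrative
only:

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Scanner;

final class EncodingMismatchDemo {
    // Reads one line from the given bytes, decoding them with the given charset.
    static String readLine(byte[] stdinBytes, Charset charset) {
        try (Scanner sc = new Scanner(new ByteArrayInputStream(stdinBytes), charset)) {
            return sc.nextLine();
        }
    }

    public static void main(String[] args) {
        // The bytes a terminal configured for ISO-8859-1 would send for "café".
        byte[] typed = "caf\u00e9\n".getBytes(StandardCharsets.ISO_8859_1);

        // Decoding with UTF-8 (the JDK 18+ default) versus the charset
        // that actually produced the bytes.
        String asUtf8 = readLine(typed, StandardCharsets.UTF_8);
        String asNative = readLine(typed, StandardCharsets.ISO_8859_1);

        System.out.println("UTF-8 read matches input:  " + asUtf8.equals("caf\u00e9"));
        System.out.println("native read matches input: " + asNative.equals("caf\u00e9"));
    }
}
```

Pure ASCII input decodes identically under both charsets, which is why the
problem stays hidden until somebody types an accented character.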
>
> I see a few directions for trying to address this; I’m not quite sure
> which way would be most appropriate:
>
>
> - Completely re-work keyboard input, in light of *Paving the on-ramp*.
> Scanner has always been a problematic API if used for keyboard input, in
> that the default delimiter isn’t convenient. I think the single most common
> beginner java stackoverflow question is the bizarre interaction between
> scanner’s nextLine() and scanner’s next(), and to make matters
> considerably worse, the proper fix (which is to call
> .useDelimiter("\\R") on the scanner first) is mentioned in less than 1% of
> answers; the vast majority of tutorials and answers tell you to call
> .nextLine() after every .nextX() call. A suboptimal suggestion (it now
> means using space to delimit your input is broken). Scanner is now also
> quite inconsistent: The constructor goes for ‘internet standard’, using
> UTF-8 as a default even if the OS does not, but the locale *does* go
> by platform default, which affects double parsing amongst other things:
> scanner.nextDouble() will require you to use a comma as the decimal
> separator if your OS is configured to use the Dutch locale, for example.
> It’s weird that scanner neither fully follows common platform-independent
> expectations (english locale, UTF-8), nor local-platform expectation
> (OS-configured locale and OS-configured charset). One way out is to make a
> new API for ‘command line apps’ and take into account Paving the on-ramp’s
> plans when designing it.
> - Rewrite specifically the new Scanner(InputStream) constructor as
> defaulting to native encoding even when everything else in java defaults to
> UTF-8 now, because that constructor is 99% used for System.in. Scanner
> has its own File-based constructor, so new
> Scanner(Files.newInputStream(..)) is quite rare.
> - Define that constructor to act as follows: the charset used is the
> platform default (i.e., from JDK18 and up, UTF-8), *unless* arg ==
> System.in is true, in which case the scanner uses native encoding.
> This is a bit bizarre to write in the spec but does the right thing in
> most circumstances and unbreaks thousands of tutorials, blogs, and answer
> sites, and is most convenient to code against. That’s usually the case with
> voodoo magic (because this surely risks being ’too magical’): It’s
> convenient and does the right thing almost always, at the risk of being
> hard to fathom and producing convoluted spec documentation.
> - Attack the problem that what’s really broken isn’t so much scanner,
> it’s System.in itself: byte based, of course, but now that all java
> methods default to UTF-8, almost all interactions with it (given that most
> System.in interaction is char-based, not byte-based) are now also
> broken. Create a second field or method in System that gives you a
> Reader instead of an InputStream, with the OS-native encoding applied
> to make it. This still leaves those thousands of tutorials broken, but at
> least the proper code is now simply new Scanner(System.charIn()) or
> whatnot, instead of the atrocious snippet above.
> - Even less impactful, make a new method in Charset to get the native
> encoding without having to delve into System.getProperty().
> Charset.nativeEncoding() seems like a method that should exist.
> Unfortunately this would be of no help to create code that works pre- and
> post-JEP400, but in time, having code that only works post-JEP400 is fine,
> I assume.
> - Create a new concept ‘represents a stream that would use platform
> native encoding if characters are read/written to it’, have System.in
> return true for this, and have filterstreams like BufferedInputStream just
> pass the call through, then redefine relevant APIs such as Scanner and
> PrintStream (e.g. anything that internalises conversion from bytes to
> characters) to pick charset encoding (native vs UTF8) based on that
> property. This is a more robust take on ‘new Scanner(System.in) should
> do the right thing'. Possibly the in/out/err streams that Process gives
> you should also have this flag set.
>
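
The two Scanner quirks from the first item (the delimiter and the locale)
can be sketched as follows. This reads from a String rather than System.in
to keep it self-contained, and the names ScannerQuirks, readAgeThenName and
parseDouble are illustrative only:

```java
import java.util.Locale;
import java.util.Scanner;

final class ScannerQuirks {
    // With a line-break delimiter every token is a whole line, so nextInt()
    // followed by next() needs no extra nextLine() call in between.
    static String[] readAgeThenName(String input) {
        Scanner sc = new Scanner(input).useDelimiter("\\R").useLocale(Locale.ROOT);
        int age = sc.nextInt();
        String name = sc.next(); // keeps the space inside the name
        return new String[] { Integer.toString(age), name };
    }

    // Number parsing follows the scanner's locale, not a fixed format.
    static double parseDouble(String input, Locale locale) {
        return new Scanner(input).useLocale(locale).nextDouble();
    }

    public static void main(String[] args) {
        String[] answers = readAgeThenName("42\nAda Lovelace\n");
        System.out.println(answers[0] + " / " + answers[1]);

        // The same value needs different spellings under different locales.
        System.out.println(parseDouble("3,14", Locale.forLanguageTag("nl-NL")));
        System.out.println(parseDouble("3.14", Locale.US));
    }
}
```

Without an explicit useLocale() call the scanner picks up the platform
locale, so which spelling is accepted depends on the machine.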
>
>
> If it was up to me, I think a multitude of steps are warranted, each
> relatively simple.
>
>
> - Create Charset.nativeEncoding(), which simply returns
> Charset.forName(System.getProperty("native.encoding")), but with the
> advantage that it's shorter, doesn’t require knowing a magic string, and
> will fail at compile time if compiled against versions that predate the
> existence of the native.encoding property, instead of throwing at runtime.
> - Create System.charIn(), which just returns an InputStreamReader
> wrapped around System.in, but with native encoding applied.
> - Put the job of how java apps do basic command line stuff on the
> agenda as a thing that should probably be addressed in the next 5 years or
> so, maybe after the steps laid out in Paving the on-ramp are more fleshed
> out.
> - In order to avoid problems, *before* the next LTS goes out, re-spec new
> Scanner(System.in) to default to native encoding, specifically when
> the passed inputstream is identical to System.in. Don’t bother with
> trying to introduce an abstracted ‘prefers native encoding’ flag system.
>
>
> --Reinier Zwitserloot
>
>
>