Definition of 'character' in JLS

Jesse Silverman jessevsilverman at gmail.com
Wed Jun 15 13:48:22 UTC 2022


I have a question about the definition of the term 'character' in the JLS,
more specifically in section 3.10.4. (When I first raised this I cited
3.10.6; I was apparently either confused or looking at a JLS version where
the definitions were in 3.10.6 at the time.)

We had a long, rambling discussion of the meaning of numerous terms in Java
here:
https://coderanch.com/t/743492/java/Java-Terminology-Charset

In that discussion I thought I was reading contradictory definitions of
terms relating to 'character'.

In that thread I said:
I was confused about Java string literals because of this issue seen in
3.10.6 of the JLS:
UnicodeInputCharacter:
   UnicodeEscape
   RawInputCharacter
UnicodeEscape:
   \ UnicodeMarker HexDigit HexDigit HexDigit HexDigit
RawInputCharacter:
   any Unicode character

So, a "RawInputCharacter" can apparently be *any* valid Unicode character,
but I had already seen that a UnicodeEscape must be between \u0000 and
\uFFFF, requiring surrogate pair encoding by hand for anything outside the
BMP.  So string literals are sort of Unicode, but if you use escape
sequences to write anything outside the BMP, you must know that strings are
encoded as UTF-16 internally.  I kept hitting that.
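
To make the "surrogate pair by hand" point concrete, here is a minimal
sketch. The class and constant names are my own, and U+1F600 is just an
arbitrary non-BMP code point picked for illustration. It compares a string
written as two escapes against the same character built from its code point:

```java
class SurrogateEscapeDemo {
    // U+1F600 written by hand as two UTF-16 code units (a surrogate pair).
    static final String ESCAPED = "\uD83D\uDE00";

    // The same character built from its code point, with no escapes involved.
    static final String FROM_CODE_POINT = new String(Character.toChars(0x1F600));

    public static void main(String[] args) {
        System.out.println(ESCAPED.equals(FROM_CODE_POINT));              // true
        System.out.println(ESCAPED.length());                             // 2
        System.out.println(ESCAPED.codePointCount(0, ESCAPED.length()));  // 1
        System.out.println(Integer.toHexString(ESCAPED.codePointAt(0)));  // 1f600
    }
}
```

The string holds one Unicode character but two char values, which is the
UTF-16 detail that leaks through whenever escapes are used.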

I brought up similar points unsuccessfully a few more times, then wrote
this:
Specifically, the following definitions should all map neatly onto one
another, but they lose me at the end of the chain. The chain held under
UCS-2 but, as far as I can tell, unless I am confused, it has not held
since Java started supporting UTF-16:
A character literal can be specified as ' SingleCharacter ', where a
SingleCharacter is any InputCharacter except ' or \ (so far so good). An
InputCharacter is any UnicodeInputCharacter except CR or LF (I'm still
nervously with them), and a UnicodeInputCharacter is any UnicodeEscape or
RawInputCharacter. Here it gets a little shaky: a UnicodeEscape is limited
to either a BMP character or a single code unit of a UTF-16 surrogate pair,
I am not sure which, but let's keep going...
Lastly they say a RawInputCharacter is "any Unicode character" and I am
out, I fold, because we have just proved that 1 == 2.
You cannot shove a non-BMP Unicode character into a Java character
literal, as much as I'd like to.  It won't fit.  I am not sure whether it
is legal to put half of a surrogate pair into a character literal (I can
see times I think I'd need to), but I know I can't put both halves in
there; it is literally like assigning an int to a short...
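
For what it's worth, here is a small sketch of the half-a-surrogate-pair
case. The class and member names are made up for illustration. In my
experience javac accepts a character literal holding a single surrogate code
unit, while a literal containing both escapes is rejected at compile time:

```java
class CharLiteralDemo {
    // Each literal holds exactly one UTF-16 code unit; neither one is a
    // complete Unicode character on its own.
    static final char HIGH = '\uD83D';
    static final char LOW = '\uDE00';

    // Reassemble the supplementary code point from its two halves.
    static int combine(char high, char low) {
        return Character.toCodePoint(high, low);
    }

    public static void main(String[] args) {
        System.out.println(Character.isHighSurrogate(HIGH));          // true
        System.out.println(Character.isLowSurrogate(LOW));            // true
        System.out.println(Integer.toHexString(combine(HIGH, LOW)));  // 1f600
    }
}
```

So each half fits in a char, but the pair only becomes a character once the
two halves sit side by side in a String or a char array.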

I almost regretted bringing it up at all, but I wound up learning a LOT
from that thread.  My question remains: are the terms consistent and I am
missing something, or does the JLS really state that a character literal
can be 'any Unicode character'?  That would be untrue, as only BMP
characters can be directly represented in a Java character literal.

I do know the following for sure:
A character literal is always of type char.
No non-BMP Unicode characters can be represented by a single value of type
char.
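
Both facts can be checked against java.lang.Character. This is only a
sketch, with a class name and sample code point (U+1F600) of my own
choosing:

```java
class CodePointWidthDemo {
    // An arbitrary code point outside the BMP, chosen for illustration.
    static final int SUPPLEMENTARY = 0x1F600;

    // Number of char values (UTF-16 code units) needed for a code point.
    static int unitsNeeded(int codePoint) {
        return Character.charCount(codePoint);
    }

    // Narrowing an int code point to char keeps only the low 16 bits.
    static char truncate(int codePoint) {
        return (char) codePoint;
    }

    public static void main(String[] args) {
        System.out.println(Character.isBmpCodePoint('A'));            // true
        System.out.println(Character.isBmpCodePoint(SUPPLEMENTARY));  // false
        System.out.println(unitsNeeded(SUPPLEMENTARY));               // 2
        System.out.println(Integer.toHexString(truncate(SUPPLEMENTARY)));  // f600
    }
}
```

The cast compiles, but the result is a different, unrelated BMP code unit,
so nothing outside the BMP survives the trip through a single char.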

So how could a character literal ever represent arbitrary non-BMP Unicode
characters?  If it can't, then what is the meaning of "Any Unicode
Character" within the JLS?  I use the term to mean the union of all BMP and
all non-BMP Unicode characters.

Yours in Confusion,
Jesse V. Silverman