RFC for future/ enhancement of java.lang.String - was Fwd: [Yaml-core] encoding and content in YML files
Zenaan Harkness
zen at freedbms.net
Sat Dec 12 02:50:04 UTC 2015
java.lang.String is deficient. It does not provide a way to lay out an
arbitrary "string" of Unicode characters ("graphemes", or "characters as
the user perceives them"), e.g. to center or right-align such a
"string" of "characters" in a text console.
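To make the problem concrete, here is a minimal sketch of centering and
right-aligning by user-perceived characters rather than by
String.length(). It assumes the JDK's java.text.BreakIterator character
instance is an acceptable approximation of grapheme boundaries; the class
and method names (GraphemePad, graphemeCount, center, rightAlign) are
made up for illustration and are not part of any existing API.

```java
import java.text.BreakIterator;
import java.util.Locale;

// Hypothetical helper, for illustration only.
class GraphemePad {
    /** Counts user-perceived characters via the JDK character BreakIterator. */
    static int graphemeCount(String s) {
        BreakIterator it = BreakIterator.getCharacterInstance(Locale.ROOT);
        it.setText(s);
        int count = 0;
        while (it.next() != BreakIterator.DONE) {
            count++;
        }
        return count;
    }

    /** Builds a run of n spaces (avoids String.repeat, which needs Java 11+). */
    static String spaces(int n) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < n; i++) {
            sb.append(' ');
        }
        return sb.toString();
    }

    /** Centers s in a field of the given width, measured in graphemes. */
    static String center(String s, int width) {
        int pad = Math.max(0, width - graphemeCount(s));
        int left = pad / 2;
        return spaces(left) + s + spaces(pad - left);
    }

    /** Right-aligns s in a field of the given width, measured in graphemes. */
    static String rightAlign(String s, int width) {
        return spaces(Math.max(0, width - graphemeCount(s))) + s;
    }

    public static void main(String[] args) {
        // "resume" with two combining acute accents (U+0301).
        String s = "re\u0301sume\u0301";
        System.out.println(s.length());        // 8 UTF-16 code units
        System.out.println(graphemeCount(s));  // 6 user-perceived characters
        System.out.println("[" + center(s, 10) + "]");
    }
}
```

Note that this counts graphemes, not console columns: wide East Asian
characters and zero-width characters still need separate handling, and
how emoji sequences are segmented depends on the JDK version.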
In discovering this I embarked on a two-week journey of reading and a
little code exploration. See here for the current status, including
detailed Javadoc describing the issues we face:
https://zenaan.github.io/zen/javadoc/zen/lang/string.html
(There are some API thoughts/notes in the class file which don't
appear in the Javadoc header.)
java.lang.String is final, which precludes subclassing to enhance away
its deficiencies.
Notwithstanding, String's deficiencies ought to be fixed, at least over
the longer term (e.g. a wrapper class in the short term and, in the
longest-term vision, a fix that might depend on a not fully
backwards-compatible version of Java).
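As a rough idea of what the short-term wrapper could look like, here is
a sketch of a class that holds a String and indexes it by grapheme
instead of by UTF-16 code unit. The name GraphemeString and its methods
are hypothetical (this is not the API of the zen.lang.string project
linked above), and it again assumes java.text.BreakIterator boundaries
are good enough.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical wrapper, for illustration only. Composition is used
// because java.lang.String is final and cannot be subclassed.
final class GraphemeString {
    private final String value;
    private final List<String> clusters;

    GraphemeString(String value) {
        this.value = value;
        List<String> parts = new ArrayList<>();
        BreakIterator it = BreakIterator.getCharacterInstance(Locale.ROOT);
        it.setText(value);
        int start = it.first();
        // Walk boundary to boundary, keeping each cluster as a substring.
        for (int end = it.next(); end != BreakIterator.DONE; end = it.next()) {
            parts.add(value.substring(start, end));
            start = end;
        }
        this.clusters = parts;
    }

    /** Number of user-perceived characters, not UTF-16 code units. */
    int length() {
        return clusters.size();
    }

    /** The grapheme at the given index, returned as a String because it may span several chars. */
    String graphemeAt(int index) {
        return clusters.get(index);
    }

    @Override
    public String toString() {
        return value;
    }
}
```

For example, new GraphemeString("e\u0301x").length() is 2, whereas
"e\u0301x".length() is 3.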
Regards
Zenaan
---------- Forwarded message ----------
From: Zenaan Harkness <zen at freedbms.net>
Date: Thu, 15 Oct 2015 22:30:15 +0000
Subject: Re: [Yaml-core] encoding and content in YML files
To: yaml-core at lists.sourceforge.net
Cc: jose isaias cabrera <jicman at cinops.xerox.com>
On 10/15/15, Oren Ben-Kiki <oren at ben-kiki.org> wrote:
[Oren is responding to a new user's confusion around "the actual
characters vs. the Unicode code points" to be represented in a text
file]
> Doesn't work that way :-)
>
> "Code point" is a "platonic ideal". \u1234, UTF8, UTF16LE, UTF16BE,
> UTF32LE, UTF32BE, etc. are all different ways to encode it.
>
> Think of the number three. You can't put the number three into a file. You
> can put the byte 0b00000011 into a file, encoding it as a 1-byte integer
> (uint8_t). Or you can write the ascii string 't' 'h' 'r' 'e' 'e' into a
> file. Or you could put the ascii character '3' into a file. Or any of a
> zillion other ways. All these are _encodings_. The number 3 itself is
> none of them. It is a platonic ideal (the successor of the successor of
> the successor of the zero element, if you go by Peano's axioms - and the
> previous sentence is yet another "encoding").
>
> A "code point" is like that. You can only put an _encoding_ of a code point
> into a file.
Very well described. Thank you Oren.
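For those working in Java, the distinction can be seen directly. The
following sketch takes the single code point U+1234 from Oren's example
and prints the bytes that three different encodings produce for it. The
class and method names (CodePointEncodings, hex, encode) are made up for
illustration. UTF-32 is left out only because it is not among the
charsets that java.nio.charset.StandardCharsets guarantees.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, for illustration only.
class CodePointEncodings {
    /** Formats bytes as space-separated upper-case hex pairs. */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    /** Encodes one code point in the given charset and returns the bytes as hex. */
    static String encode(int codePoint, Charset cs) {
        String s = new String(Character.toChars(codePoint));
        return hex(s.getBytes(cs));
    }

    public static void main(String[] args) {
        int cp = 0x1234; // one code point, several encodings
        System.out.println(encode(cp, StandardCharsets.UTF_8));    // E1 88 B4
        System.out.println(encode(cp, StandardCharsets.UTF_16BE)); // 12 34
        System.out.println(encode(cp, StandardCharsets.UTF_16LE)); // 34 12
    }
}
```

The same code point yields three different byte sequences, and none of
them "is" the code point; each is just one encoding of it.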
Here is a little project, with some Javadoc describing Java's Unicode
limitations, for those working in Java:
http://zenaan.github.io/zen/javadoc/zen/lang/string.html
I haven't touched this since May, and may well not for another year or
two by the looks of it. Still, the Javadoc may be useful for those
trying to wrap their heads around all things Unicode, particularly
with respect to Java programming. It certainly did my head in a few
times.
Here was my original post to stringtemplate-discussion mailing list,
"a coder's lament on the paucity of java.lang.String functionality":
https://groups.google.com/forum/?_escaped_fragment_=msg/stringtemplate-discussion/jJ_gZrF8SKg/ir_cuPRx1JsJ#!msg/stringtemplate-discussion/jJ_gZrF8SKg/ir_cuPRx1JsJ
I also posted that email to a semi-private google group, and we
engaged in a substantial discussion, which may be useful to those who
are still struggling after reading the Javadoc posted above. If so, I
am willing to repost the key exchanges here (with emails/names
redacted); just ask.
Regards
Zenaan