CodePointCursor + CodePointParser (was Re: string indexing (was: Java needs an immutable byte array wrapper))
Zenaan Harkness
zen at freedbms.net
Thu Apr 13 13:26:51 UTC 2017
On Sun, Nov 13, 2016 at 11:21:48PM +1100, Zenaan Harkness wrote:
> https://zenaan.github.io/zen/javadoc/zen/lang/string.html
> (Note, this was pre-Java 9)
>
> Hopefully by Java 10, 11 or 12, we might see full grapheme support in
> Java (as is the case in Swift), now that String is implemented with byte
> array storage.
Further:
https://github.com/zenaan/zen
https://github.com/zenaan/zen/blob/master/src/java/zen/lang/CodePointCursor.java
https://github.com/zenaan/zen/blob/master/src/java/zen/lang/CodePointParser.java
A small step for anyone who prefers to work with Unicode code points
rather than Java String's "code units":
* CodePointCursor.java:
  - complete, straightforward-to-use Unicode code point cursor
    (or "iterator")
  - resettable, and optionally reversible when reset (or at
    construction)
  - bidirectional: can move forwards and backwards
  - supports hops of >1, in either direction
  - supports creation of a default "null" cursor (default constructor)
  - provides methods for both inclusive and exclusive indexes
  - provides methods for both code point indexes and underlying code
    unit indexes
  - supports the traditional Java hasNext() and next() idiom
  - also supports peek(), advance(), curr() and prev() idioms
  - outputs a useful string representation of itself
* CodePointParser.java:
  - limited parser exercising CodePointCursor
  - parses strings and unsigned longs (no overflow checking)
  - supports an optional escape char for literals
  - traditional parsing model
  - pretty step-wise output messages / parse analysis
As you can see, this parser is very simple, but its two functions
(parseString and parseULong) are well tested, fwiw.
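For anyone unfamiliar with the code point / code unit distinction, here
is a minimal sketch of walking a String by code point in both
directions. It uses only standard java.lang.String and Character
methods; it is not the CodePointCursor code, and the class and method
names (CodePointWalk, forward, backward) are made up for illustration.

```java
import java.util.Arrays;

final class CodePointWalk {
    /** Collects the code points of s from start to end. */
    static int[] forward(String s) {
        int[] out = new int[s.codePointCount(0, s.length())];
        int n = 0;
        for (int cu = 0; cu < s.length(); ) {
            int cp = s.codePointAt(cu);      // reads 1 or 2 code units
            out[n++] = cp;
            cu += Character.charCount(cp);   // step by 1 or 2 code units
        }
        return out;
    }

    /** Collects the code points of s from end to start. */
    static int[] backward(String s) {
        int[] out = new int[s.codePointCount(0, s.length())];
        int n = 0;
        for (int cu = s.length(); cu > 0; ) {
            int cp = s.codePointBefore(cu);  // code point ending just before cu
            out[n++] = cp;
            cu -= Character.charCount(cp);
        }
        return out;
    }

    public static void main(String[] args) {
        String s = "a\uD83D\uDE00b";         // 'a', U+1F600 (a surrogate pair), 'b'
        System.out.println(s.length());                       // 4 code units
        System.out.println(s.codePointCount(0, s.length()));  // 3 code points
        System.out.println(Arrays.toString(forward(s)));      // [97, 128512, 98]
        System.out.println(Arrays.toString(backward(s)));     // [98, 128512, 97]
    }
}
```

A cursor class mostly wraps this stepping logic and keeps track of both
the code point index and the code unit index as it moves.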
Now the next layer above code points is graphemes, "grapheme clusters"
as I think Swift calls them, or "code point clusters" or "codepoint
clusters" as they ought rightfully be called.
That looks like a big job, unlike a simple code point cursor and parser.
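As a rough illustration of what that layer means, the JDK's
java.text.BreakIterator can already split a string on user-perceived
character boundaries. The sketch below is only an assumption about how
one might start, not part of the zen library; the GraphemeSketch class
and clusters method are hypothetical names, and how well newer
sequences (e.g. some emoji) are handled depends on the JDK version.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class GraphemeSketch {
    /** Splits s at the character boundaries reported by BreakIterator. */
    static List<String> clusters(String s) {
        BreakIterator it = BreakIterator.getCharacterInstance(Locale.ROOT);
        it.setText(s);
        List<String> out = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            out.add(s.substring(start, end));  // one cluster, possibly several code points
        }
        return out;
    }

    public static void main(String[] args) {
        String s = "e\u0301x";  // 'e' + COMBINING ACUTE ACCENT, then 'x'
        System.out.println(s.codePointCount(0, s.length()));  // 3 code points
        System.out.println(clusters(s).size());               // 2 clusters
    }
}
```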
Now ideally we would be starting with, and building upon, UTF-8 strings,
not 16-bit "code unit strings", but that's for another day... or year :/
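To make the difference concrete, here is a small sketch that counts
the same text three ways: UTF-16 code units (what String.length()
reports), UTF-8 bytes, and code points. It uses only standard JDK
calls; the EncodingUnits class and its method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

final class EncodingUnits {
    /** Number of 16-bit code units, i.e. what String.length() reports. */
    static int utf16Units(String s) { return s.length(); }

    /** Number of bytes (8-bit code units) when the string is encoded as UTF-8. */
    static int utf8Units(String s) { return s.getBytes(StandardCharsets.UTF_8).length; }

    /** Number of Unicode code points. */
    static int codePoints(String s) { return s.codePointCount(0, s.length()); }

    public static void main(String[] args) {
        // 'a', U+00E9, U+20AC, U+1F600: 1, 2, 3 and 4 bytes in UTF-8
        String s = "a\u00E9\u20AC\uD83D\uDE00";
        System.out.println(utf16Units(s));  // 5
        System.out.println(utf8Units(s));   // 10
        System.out.println(codePoints(s));  // 4
    }
}
```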
Also, I have not yet checked out IBM's ICU4J...
Any feedback, positive or negative, appreciated.
Regards,
Zenaan
----------
# Method signatures (CP = code point, CU = code unit) :
public class CodePointCursor implements IDEBUG {
public CodePointCursor () {}
public CodePointCursor (String s) {this(s, false);}
public CodePointCursor (String s, boolean reverse)
public CodePointCursor reset (boolean reverse)
public void setDebug (boolean debug) {this._debug = debug;}
public int getCPLen ()
public int getCPIdxIn ()
public int getCPIdxEx ()
public int getCUIdxEx ()
public boolean hasNext () {return i != iend;} // | hasNext(1)
public boolean hasNext (int n)
public int peekIdx () throws IndexOutOfBoundsException
public int peekIdx (int n) throws IndexOutOfBoundsException
public int advance () throws IndexOutOfBoundsException
public int advance (int n) throws IndexOutOfBoundsException
public int peek ()
public int peek (int n)
public int next ()
public int next (int n)
public int curr ()
public int prev ()
public String toString () {
public class CodePointParser implements IDEBUG {
public CodePointParser (CodePointCursor i) {this.i = i;}
public class CPParserException extends RuntimeException {
public void setEscape (int escapecp) {
public boolean hasEscape () {return lescape != 0;}
public String toString () {
public int parseString (StringBuilder result, int end_char)
public int parseString (StringBuilder result, int end_char1, int end_char2)
public static final int CURSOR_END = -1;
public static final int ESCAPE_AT_END = -2;
public int parseString (StringBuilder result, int end_char1, int end_char2, Messages m)
public static class Messages {
public String noDigit = "<no digit> ";
public String endChar = "<end_char> ";
public String atEnd = "<at end> ";
public String escape = "<escape> ";
public String literal = "<literal> ";
public String pushBack = "<PUSH BACK> ";
public long parseULong (long defaultVal)
public long parseULong (long defaultVal, int base, Messages m)
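To give a feel for what an unsigned long parse over code points with
no overflow checking might look like, here is a minimal sketch using
only standard JDK calls. It is not the CodePointParser code: the
ULongSketch class and parseULongSketch method are hypothetical, and
returning defaultVal when no digit is found is an assumption based
only on the parameter name.

```java
final class ULongSketch {
    /**
     * Parses the leading digits of s in the given base, one code point at a time.
     * Assumption: returns defaultVal if s does not start with a digit.
     * No overflow checking: long input silently wraps around.
     */
    static long parseULongSketch(String s, long defaultVal, int base) {
        long value = 0;
        boolean sawDigit = false;
        for (int cu = 0; cu < s.length(); ) {
            int cp = s.codePointAt(cu);
            int d = Character.digit(cp, base);  // -1 if cp is not a digit in this base
            if (d < 0) break;                   // stop at the first non-digit
            value = value * base + d;
            sawDigit = true;
            cu += Character.charCount(cp);
        }
        return sawDigit ? value : defaultVal;
    }

    public static void main(String[] args) {
        System.out.println(parseULongSketch("123abc", -1, 10));  // 123
        System.out.println(parseULongSketch("ff", -1, 16));      // 255
        System.out.println(parseULongSketch("xyz", -1, 10));     // -1
    }
}
```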