Unicode script support in Regex and Character class

Tue Apr 27 14:35:58 UTC 2010

On 27.04.2010 06:25, Xueming Shen wrote:
> Ulf Zibis wrote:
>> On 24.04.2010 01:09, Xueming Shen wrote:
>>>
>>> I changed the data file "format" a bit, so now the overall 
>>> uniName.dat is less than 88k (last version is 122+k), but
>>> I can no longer use cpLen as the capacity for the hashmap. I'm now 
>>> using a hardcoded 20000 for 5.2.
>>
>> Again, is 88k the compressed or the uncompressed size ?
>
> Yes, it's the size of compressed data.

I'm wondering, as script.txt only has ~120k.

> Your smart "save one more byte" suggestion will save
> 400+byte, a tiny 0.5%, unfortunately:-)

I didn't mean the saving in total file footprint; I meant the number of 
byte-wise read() calls.
My code only needs to read 1 int per character block instead of 1 byte + 1 
int, which looks kind of ugly too.
Anyway, the theoretical maximum win would be < 20 %.

>
>>
>>>> -- Is it faster to first copy the whole data into a byte[] and then 
>>>> use ByteBuffer.getInt etc., compared to directly using 
>>>> DataInputStream methods?
> The current impl uses neither ByteBuffer nor DataInputStream now, so 
> there is nothing to compare here.
If JIT-compiled, bb.get() should be as fast as ba[cpOff++] & 0xff.
My comparison is about the manual byte-to-int assembling plus the triple 
buffering of the data (the stream from getResourceAsStream() is buffered, 
and I believe InflaterInputStream is too).

> Yes, using DataInputStream will definitely make the code look better (no 
> more of those "ugly"
> shifts), but it will also slow things down a little since it adds one 
> more layer. But speed
> may not really be a concern here.

On the other hand:
- the extra layer shouldn't matter if DIS is already JIT-compiled.
- readInt() might be faster than 4 times read() + manually assembling 
the int value. (if not, DataInputStream needs reengineering)
- readFully() might be better optimized than your hand-coded read loop 
(if not, let's do it ;-) )
-- a hand-coded loop might only make sense if Thread.sleep() is used after 
each chunk,
     so concurrent threads could continue their work while waiting for 
the hard disk to read.
- your code will surely run in interpreter mode, as the JIT wouldn't have 
time to compile it fast enough.
- there is some chance that DIS will already be JIT-compiled from its use 
in other parts of the program.
- and last but not least, use the given APIs as much as you can to reduce 
the byte code footprint. Set a good programming example, as newbies 
tend to use the API sources as the first template for their own code. 
Seeing API use cases helps to become familiar with the complexity of the 
Java API. (Same for Arrays.binarySearch())
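
For illustration, a minimal sketch of the three ways to assemble an int 
discussed above: the hand-coded shifts, ByteBuffer.getInt() and 
DataInputStream.readInt(). The class and method names are made up for this 
mail and are not taken from the patch. The sketch makes no claim about which 
variant is faster; it only shows that all three decode the same big-endian 
value.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;

// Hypothetical demo class; none of these names come from the patch under review.
final class ReadIntSketch {

    // Hand-coded variant: four array reads plus shifts (big-endian order).
    static int manual(byte[] ba, int off) {
        return ((ba[off]     & 0xff) << 24)
             | ((ba[off + 1] & 0xff) << 16)
             | ((ba[off + 2] & 0xff) <<  8)
             |  (ba[off + 3] & 0xff);
    }

    // ByteBuffer variant: absolute getInt(); a wrapped buffer is big-endian by default.
    static int viaByteBuffer(byte[] ba, int off) {
        return ByteBuffer.wrap(ba).getInt(off);
    }

    // DataInputStream variant: readInt() also reads four bytes in big-endian order.
    static int viaDataInputStream(byte[] ba, int off) {
        try (DataInputStream dis = new DataInputStream(
                new ByteArrayInputStream(ba, off, ba.length - off))) {
            return dis.readInt();
        } catch (IOException e) {
            // Includes EOFException when fewer than four bytes remain.
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] ba = {0x7f, 0x00, 0x01, (byte) 0xfe, 0x02, 0x03};
        // Decode the int starting at offset 1 with each variant.
        System.out.println(Integer.toHexString(manual(ba, 1)));
        System.out.println(Integer.toHexString(viaByteBuffer(ba, 1)));
        System.out.println(Integer.toHexString(viaDataInputStream(ba, 1)));
    }
}
```

Whether the API variants really beat the manual shifts would have to be 
measured on the target VM; the sketch only makes the readability difference 
visible.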

>
>>>> -- You could create a very long String with the whole data and then 
>>>> use subString for the individual strings which could share the same 
>>>> backing char[].
>>
> The disadvantage of using a big buffer String to hold everything and then 
> having the individual names substring-ed
> from it is that it might simply break the softreference logic here. 
> The big char[] will never be gc-ed as
> long as there is still one single name object (substring-ed from it) 
> walking around in the system somewhere.
> I don't think the vm/gc is that smart, is it?

Good point, I missed that.
But I'm still no friend of the SR usage here. It doesn't solve my main 
complaints:
- It is uneconomical to initialize the whole amount of data for likely 1 or 
a few invocations of getName(int cp), and repeatedly so if the SR was cleared.
- Don't pollute the GC more than necessary (it would have to handle each 
of the strings + char[]s separately), especially if memory comes close to 
its limit.
Additionally, if not interned, equal character name strings would be 
held in memory in as many copies as the SR gets cleared; if interned, they 
would never be GC'd.
You may argue that this code is rarely used, but if all corners of the Java 
API were coded in such a memory- and performance-wasting way, we ... I'd 
better not think about it.
We could add (Attention: CCC change) a cacheCharacterNames(boolean 
yesNo) method to serve users who need this functionality heavily.

>
> But this will definitely be faster, given the burden of creating a 
> String from bytes (we put in the optimization
> earlier, so this operation should be faster now compared to 6u).

+ saving the memory overhead + GC work for the cpNum char[]s.

Additionally:
- No need to check iis != null in the finally block; a possible NPE would 
have been thrown earlier.
- Move the SR logic to the get() method to avoid the possible remaining 
SR->NPE problem:
     public static String get(int cp) {
         HashMap<Integer, String> names;
         if (refNames == null || (names = refNames.get()) == null)
             refNames = new SoftReference<>(names = getNames());
         return names.get(cp);
     }
- then synchronize the entire getNames() method.
- skip the 2nd null check after sync: a failure would be far less likely 
than getName(int cp) being used at all, and it only risks a 2nd superfluous 
init.
- Is it a good idea to return null to the calling code in case of an I/O 
failure, instead of propagating the given exception or, better, throwing an 
error?
- use Integer.toHexString(cp) instead of Integer.toString(cp, 16);
- IMPORTANT (check if CCC is affected):
   Do I understand correctly that j.l.Ch.getName('5') would return:
       "Basic Latin 35"
   ... but j.l.Ch.getName('0') would return:
       "DIGIT ZERO..DIGIT NINE"
   I think both should return:
       "DIGIT ZERO..DIGIT NINE" (otherwise we don't have to cache that 
value ;-) )
   or at least:
       "Basic Latin U+0035"
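
To make the SR points above concrete, here is a minimal, self-contained 
sketch of a softly referenced name map that is (re)initialized lazily. It is 
not the code from the attachment: the class name, the loadNames() stand-in 
(the real loader would inflate and parse uniName.dat), the loads counter and 
the fallbackLabel() helper are all assumptions made for this example. For 
simplicity the whole accessor is synchronized, which is stricter than the 
variant proposed above.

```java
import java.lang.ref.SoftReference;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical stand-in for the name cache; not the code from the attachment.
final class NameCacheSketch {
    private static SoftReference<Map<Integer, String>> refNames;
    static int loads = 0;   // counts (re)initializations, for demonstration only

    // Assumed loader with two sample entries instead of the real data file.
    private static Map<Integer, String> loadNames() {
        loads++;
        Map<Integer, String> m = new HashMap<>();
        m.put(0x30, "DIGIT ZERO");
        m.put(0x41, "LATIN CAPITAL LETTER A");
        return m;
    }

    // The local 'names' is a strong reference, so the map cannot be
    // collected between the null check and the lookup.
    static synchronized String get(int cp) {
        Map<Integer, String> names = (refNames == null) ? null : refNames.get();
        if (names == null) {
            names = loadNames();
            refNames = new SoftReference<>(names);
        }
        return names.get(cp);
    }

    // Fallback label in the "U+0035" style, zero-padded to at least 4 hex digits.
    static String fallbackLabel(int cp) {
        String hex = Integer.toHexString(cp).toUpperCase(Locale.ROOT);
        StringBuilder sb = new StringBuilder("U+");
        for (int i = hex.length(); i < 4; i++) {
            sb.append('0');
        }
        return sb.append(hex).toString();
    }

    public static void main(String[] args) {
        System.out.println(get(0x41));
        System.out.println(get(0x30));
        System.out.println(fallbackLabel('5'));
    }
}
```

For non-negative code points Integer.toHexString(cp) and 
Integer.toString(cp, 16) yield the same digits, so that suggestion is only 
about readability.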

See new version in attachment.

-Ulf