Unicode script support in Regex and Character class

Tue Apr 27 15:35:34 UTC 2010

Oops, added attachment.

-Ulf

On 27.04.2010 16:35, Ulf Zibis wrote:
> On 27.04.2010 06:25, Xueming Shen wrote:
>> Ulf Zibis wrote:
>>> On 24.04.2010 01:09, Xueming Shen wrote:
>>>>
>>>> I changed the data file "format" a bit, so now the overall 
>>>> uniName.dat is less than 88k (last version is 122+k), but
>>>> I can no longer use cpLen as the capacity for the hashmap. I'm 
>>>> now using a hardcoded 20000 for 5.2.
>>>
>>> Again, is 88k the compressed or the uncompressed size?
>>
>> Yes, it's the size of compressed data.
>
> I'm surprised, as script.txt is only ~120k.
>
>> Your smart "save one more byte" suggestion will save
>> 400+ bytes, a tiny 0.5%, unfortunately :-)
>
> I didn't mean the saving in total file footprint; I meant the 
> number of byte-wise read() calls.
> My code only needs to read 1 int per character block, versus 1 byte + 
> 1 int, which looks kind of ugly too.
> Anyway, the theoretical max win would be < 20%.
>
>>
>>>
>>>>> -- Is it faster to first copy the whole data into a byte[] and 
>>>>> then use ByteBuffer.getInt etc., compared with directly using 
>>>>> DataInputStream methods?
>> The current impl uses neither ByteBuffer nor DataInputStream now, so 
>> there is nothing to compare here.
> If JIT-compiled, bb.get() should be as fast as ba[cpOff++] & 0xff.
> My comparison is about the manual byte-to-int assembly + triple 
> buffering of the data (getResourceAsStream() is a buffered stream, and I 
> believe InflaterInputStream is too)
>
>> Yes, using DataInputStream will definitely make the code look better 
>> (no more of those "ugly"
>> shifts), but it will also slow things down a little since it adds one 
>> more layer. But speed
>> may not really be a concern here.
>
> On the other hand:
> - the extra layer shouldn't matter if DIS is already JIT-compiled.
> - readInt() might be faster than 4 times read() + manually assembling 
> the int value. (if not, DataInputStream needs reengineering)
> - readFully() might be better optimized than your hand-coded read loop 
> (if not, let's do it ;-) )
> -- a hand-coded loop might only make sense if using Thread.sleep() 
> after each chunk,
>     so concurrent threads could continue their work while waiting for 
> the hard disk to read.
> - your code will surely run in interpreter mode, as the JIT wouldn't have 
> time to compile it fast enough.
> - there is some chance that DIS will already have been JIT-compiled 
> through usage by other parts of the program.
> - and last but not least, use the given APIs as much as you can to 
> reduce the byte code footprint. Set a good programming example, as newbies 
> tend to use API sources as the first template for their own code. Seeing 
> API use cases helps them become familiar with the complexity of the 
> Java API. (Same for Arrays.binarySearch())
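To make the readInt() point above concrete, here is a minimal sketch of the two variants being compared: four read() calls assembled with shifts, and DataInputStream.readInt(). The class and method names (ReadIntSketch, readIntManually, readIntWithApi) are made up for illustration and are not taken from the attached CharacterName code; the sketch only shows that both variants yield the same big-endian value and says nothing about which one is faster.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;

public class ReadIntSketch {

    // Hand-coded variant: four single-byte reads, assembled big-endian with shifts.
    public static int readIntManually(InputStream in) throws IOException {
        int b1 = in.read();
        int b2 = in.read();
        int b3 = in.read();
        int b4 = in.read();
        if ((b1 | b2 | b3 | b4) < 0) {
            // read() returns -1 at end of stream
            throw new EOFException();
        }
        return (b1 << 24) | (b2 << 16) | (b3 << 8) | b4;
    }

    // API variant: DataInputStream.readInt() reads the same big-endian layout.
    public static int readIntWithApi(InputStream in) throws IOException {
        return new DataInputStream(in).readInt();
    }

    public static void main(String[] args) throws IOException {
        byte[] data = {0x00, 0x01, (byte) 0xF4, 0x7A};
        int manual = readIntManually(new ByteArrayInputStream(data));
        int viaApi = readIntWithApi(new ByteArrayInputStream(data));
        System.out.println(manual == viaApi);
    }
}
```

Whether the extra DataInputStream layer costs anything measurable would have to be benchmarked; the sketch is only meant to show how much shorter the API variant reads.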
>
>
>>
>>>>> -- You could create a very long String with the whole data and 
>>>>> then use subString for the individual strings which could share 
>>>>> the same backing char[].
>>>
>> The disadvantage of using a big buffer String to hold everything and then 
>> having the individual names substring-ed
>> from it is that it might simply break the softreference logic here. 
>> The big char[] will never be gc-ed as
>> long as one single name object (substring-ed from it) 
>> is still walking around in the system somewhere.
>> I don't think the vm/gc is that smart, is it?
>
> Good point, I missed that.
> But I'm still no friend of the SR usage here. It doesn't solve my main 
> complaints:
> - Uneconomically initializing the whole amount of data for likely 1 
> or a few invocations of getName(int cp), and repeatedly, if the SR was 
> cleared.
> - Don't pollute the GC more than necessary (it would have to handle 
> each of the strings + char[]s separately), especially when memory 
> approaches its limit.
> Additionally, if not interned, equal character name strings would be 
> held in memory in as many copies as there are SR failures; if interned, 
> they would never be GC'd.
> You may argue that this code is rarely used, but if all corners of the 
> Java API were coded in such a memory/performance-wasting way, we ... I'd 
> better not think about it.
> We could add (Attention: CCC change) a cacheCharacterNames(boolean 
> yesNo) method to serve users who need this functionality heavily.
>
>
>>
>> But this will definitely be faster, given the burden of creating a 
>> String from bytes (we put in the optimization
>> earlier, so this operation should be faster now compared to 6u).
>
> + saving the memory overhead + GC work for the cpNum char[]s.
>
>
> Additionally:
> - No need to compare iis != null in the finally block; a possible NPE 
> would be thrown earlier.
> - Move the SR logic to the get() method to avoid the possible remaining 
> SR->NPE problem:
>     public static String get(int cp) {
>         HashMap<Integer, String> names;
>         if (refNames == null || (names = refNames.get()) == null)
>             refNames = new SoftReference<>(names = getNames());
>         return names.get(cp);
>     }
> - then synchronize the entire getNames() method.
> - skip the 2nd null-check after sync, as a failure would be even less 
> likely than any use of getName(int cp) at all, and it only risks a 2nd, 
> superfluous init.
> - Is it a good idea to return null to the calling code in case of an 
> I/O failure, instead of propagating the given exception or, better, 
> throwing an error?
> - use Integer.toHexString(cp) instead of Integer.toString(cp, 16);
> - IMPORTANT (check if CCC is affected):
>   Do I understand correctly that j.l.Ch.getName('5') would return:
>       "Basic Latin 35"
>   ... but j.l.Ch.getName('0') would return:
>       "DIGIT ZERO..DIGIT NINE"
>   I think both should return:
>       "DIGIT ZERO..DIGIT NINE" (otherwise we don't have to cache that 
> value ;-) )
>   or at least:
>       "Basic Latin U+0035"
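As a rough sketch of the SR handling discussed in the list above: the point is to copy the referent into a strong local variable before using it, so the map cannot be cleared between the null check and the lookup. The class name SoftNameCache and the loadNames() method are made up for illustration, and loadNames() is only a two-entry stand-in for parsing uniName.dat. Unlike the unsynchronized get() quoted above, this variant simply synchronizes the whole lookup, which trades a little contention for never initializing twice.

```java
import java.lang.ref.SoftReference;
import java.util.HashMap;
import java.util.Map;

public class SoftNameCache {

    // Softly reachable cache; the GC may clear it under memory pressure.
    private static SoftReference<Map<Integer, String>> refNames;

    // Hypothetical stand-in for reading and decoding the real data file.
    private static Map<Integer, String> loadNames() {
        Map<Integer, String> names = new HashMap<>();
        names.put(0x30, "DIGIT ZERO");
        names.put(0x41, "LATIN CAPITAL LETTER A");
        return names;
    }

    public static synchronized String get(int cp) {
        // Take a strong reference first; refNames.get() may return null
        // at any time once the referent has been collected.
        Map<Integer, String> names = (refNames == null) ? null : refNames.get();
        if (names == null) {
            names = loadNames();
            refNames = new SoftReference<>(names);
        }
        return names.get(cp);
    }

    public static void main(String[] args) {
        System.out.println(get(0x41));
        System.out.println(get(0x30));
    }
}
```

A code point that is missing from the map yields null here; what the real getName(int cp) should return in that case is exactly the open question raised in the list above.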
>
> See new version in attachment.
>
>
> -Ulf
>
>
>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: CharacterName1.java
Type: java/*
Size: 3408 bytes
Desc: not available
URL: <http://mail.openjdk.java.net/pipermail/core-libs-dev/attachments/20100427/e32aa5e7/CharacterName1.java>