Unicode script support in Regex and Character class
Ulf Zibis
Ulf.Zibis at gmx.de
Tue Apr 27 15:35:34 UTC 2010
Oops, added attachment.
-Ulf
On 27.04.2010 16:35, Ulf Zibis wrote:
> On 27.04.2010 06:25, Xueming Shen wrote:
>> Ulf Zibis wrote:
>>> On 24.04.2010 01:09, Xueming Shen wrote:
>>>>
>>>> I changed the data file "format" a bit, so now the overall
>>>> uniName.dat is less than 88k (last version is 122+k), but
>>>> I can no longer use cpLen as the capacity for the hashmap. I'm
>>>> now using a hardcoded 20000 for 5.2.
>>>
>>> Again, is 88k the compressed or the uncompressed size ?
>>
>> Yes, it's the size of compressed data.
>
> I'm wondering, as script.txt only has ~120k.
>
>> Your smart "save one more byte" suggestion will save
>> 400+byte, a tiny 0.5%, unfortunately:-)
>
> I didn't mean the saving in total file footprint, I meant the
> number of byte-wise read() calls.
> My code only needs to read 1 int per character block, versus 1 byte +
> 1 int, which looks kinda ugly too.
> Anyway, the theoretical max win would be < 20%.
>
>>
>>>
>>>>> -- Is it faster to first copy the whole data into a byte[] and
>>>>> then use ByteBuffer.getInt etc., compared with using
>>>>> DataInputStream methods directly?
>> The current impl uses neither ByteBuffer nor DataInputStream now, so
>> there is nothing to compare here.
> If JIT-compiled, bb.get() should be as fast as ba[cpOff++] & 0xff.
> My comparison is about the manual byte-to-int assembling plus the triple
> buffering of the data (getResourceAsStream() is a buffered stream, and I
> believe InflaterInputStream is too).
>
>> Yes, using DataInputStream will definitely make the code look better
>> (no more of those "ugly"
>> shifts), but it will also slow things down a little since it adds one
>> more layer. But speed
>> may not really be a concern here.
>
> On the other hand:
> - the extra layer shouldn't matter if DIS is already JIT-compiled.
> - readInt() might be faster than 4 times read() + manually assembling
> the int value. (if not, DataInputStream needs reengineering)
> - readFully() might be better optimized than your hand-coded read loop
> (if not, let's do it ;-) )
> -- a hand-coded loop might only make sense if using Thread.sleep()
> after each chunk,
> so concurrent threads could continue their work while waiting for
> the hard disk to read.
> - your code will surely run in interpreter mode, as the JIT wouldn't have
> time to compile it fast enough.
> - there is some chance that DIS will already be JIT-compiled from usage
> by other program parts before.
> - and last but not least, use the given APIs as much as you can to
> reduce the byte code footprint. Give a good programming example, as newbies
> tend to use API sources as the first template for their own code. Seeing
> API use cases helps them to become familiar with the complexity of the
> Java API. (Same for Arrays.binarySearch())
>
>
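To make the readInt() point above concrete, here is a minimal sketch (not the code from the attachment) that reads the same big-endian int once by hand-assembling four read() results and once through DataInputStream.readInt(). The class name ReadIntSketch and its helper methods are hypothetical and exist only for illustration. It shows that the two paths agree on the value; it says nothing about which one is faster.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

// Hypothetical illustration class, not part of the JDK or of the patch.
final class ReadIntSketch {

    // Hand-coded variant: four single-byte reads, assembled big-endian.
    static int readIntManually(InputStream in) throws IOException {
        int b1 = in.read();
        int b2 = in.read();
        int b3 = in.read();
        int b4 = in.read();
        if ((b1 | b2 | b3 | b4) < 0) {
            // read() returns -1 at end of stream
            throw new EOFException();
        }
        return (b1 << 24) | (b2 << 16) | (b3 << 8) | b4;
    }

    // API variant: DataInputStream.readInt() also reads four bytes big-endian.
    static int readIntViaApi(InputStream in) throws IOException {
        return new DataInputStream(in).readInt();
    }

    // Runs both variants on the same bytes and reports whether they agree.
    static boolean agree(byte[] data) {
        try {
            int manual = readIntManually(new ByteArrayInputStream(data));
            int viaApi = readIntViaApi(new ByteArrayInputStream(data));
            return manual == viaApi;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] data = {0x12, 0x34, 0x56, 0x78};
        System.out.println(agree(data));
    }
}
```

Whether the API call or the manual shifts win in practice depends on JIT warm-up, as argued above, so that would need measuring rather than guessing.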
>>
>>>>> -- You could create a very long String with the whole data and
>>>>> then use substring() for the individual strings, which could share
>>>>> the same backing char[].
>>>
>> The disadvantage of using a big buffer String to hold everything and then
>> having the individual names substring-ed
>> from it is that it might simply break the softreference logic here.
>> The big char[] will never be gc-ed as
>> long as there is still one single name object (substring-ed from it)
>> walking around in the system somewhere.
>> I don't think the vm/gc is that smart, is it?
>
> Good point, I missed that.
> But I'm still no friend of the SR usage here. It doesn't solve my main
> complaints:
> - Uneconomically initializing the whole amount of data for likely 1
> or a few invocations of getName(int cp), and repeatedly, if the SR was
> cleared.
> - Don't pollute the GC more than necessary (it would have to handle
> each of the strings + char[]s separately), especially if memory comes
> close to its limit.
> Additionally, if not interned, equal character name strings would be
> held in memory in as many copies as the SR fails; if interned, they
> would never be GC'd.
> You may argue that this code is rarely used, but if all corners of the
> Java API were coded in such a memory/performance-wasting way, we ... I'd
> rather not think about it.
> We could add (Attention: CCC change) a cacheCharacterNames(boolean
> yesNo) method to serve users who need this functionality heavily.
>
>
>>
>> But this will definitely be faster, given the burden of creating a
>> String from bytes (we put in the optimization
>> earlier, so this operation should be faster now compared to 6u).
>
> + saving the memory overhead + GC work for the cpNum char[]s.
>
>
> Additionally:
> - No need to compare iis != null in the finally block; a possible NPE would
> be thrown earlier.
> - Move the SR logic to the get() method to avoid the possible remaining SR->NPE
> problem:
> public static String get(int cp) {
>     HashMap<Integer, String> names;
>     if (refNames == null || (names = refNames.get()) == null)
>         refNames = new SoftReference<>(names = getNames());
>     return names.get(cp);
> }
> - then synchronize the entire getNames() method.
> - skip the 2nd null-check after sync, as a failure there would be far
> less likely than getName(int cp) being used at all, and it only risks a 2nd
> superfluous init.
> - Is it a good idea to return null to the calling code in case of an I/O
> failure, instead of propagating the given exception or, better, throwing an error?
> - use Integer.toHexString(cp) instead of Integer.toString(cp, 16);
> - IMPORTANT (check if CCC is affected):
> Do I understand right, that j.l.Ch.getName('5') would return:
> "Basic Latin 35"
> ... but j.l.Ch.getName('0') would return:
> "DIGIT ZERO..DIGIT NINE"
> I think both should return:
> "DIGIT ZERO..DIGIT NINE" (otherwise we don't have to cache that
> value ;-) )
> or at least:
> "Basic Latin U+0035"
>
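For readers following the SoftReference discussion, the idea of moving the SR logic into get() can be sketched as below. This is a simplified stand-in and not the attached CharacterName1.java: the class NameCacheSketch, the loadNames() helper with its two hard-coded entries, and the "U+XXXX" fallback are all assumptions made for illustration. Unlike the proposal above, which synchronizes only the loader, this sketch synchronizes get() itself to keep it short.

```java
import java.lang.ref.SoftReference;
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration class, not the JDK implementation.
final class NameCacheSketch {

    // The map is only softly reachable, so the GC may reclaim it under memory pressure.
    private static SoftReference<Map<Integer, String>> refNames;

    // Stand-in for the real loader, which would inflate and parse uniName.dat.
    private static Map<Integer, String> loadNames() {
        Map<Integer, String> m = new HashMap<>();
        m.put(0x35, "DIGIT FIVE");
        m.put(0x41, "LATIN CAPITAL LETTER A");
        return m;
    }

    static synchronized String get(int cp) {
        // Copy the referent into a strong local first; dereferencing
        // refNames.get() twice could see null the second time.
        Map<Integer, String> names = (refNames == null) ? null : refNames.get();
        if (names == null) {
            names = loadNames();
            refNames = new SoftReference<>(names);
        }
        String name = names.get(cp);
        // Illustrative fallback only: format unknown code points as U+XXXX.
        return (name != null) ? name : String.format("U+%04X", cp);
    }

    public static void main(String[] args) {
        System.out.println(get(0x35));
        System.out.println(get(0x1F600));
    }
}
```

Holding the map in the local variable names for the duration of the call is what removes the SR->NPE window: once a strong reference exists, the collector cannot clear the referent before names.get(cp) runs.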
> See new version in attachment.
>
>
> -Ulf
>
>
>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: CharacterName1.java
Type: java/*
Size: 3408 bytes
Desc: not available
URL: <http://mail.openjdk.java.net/pipermail/core-libs-dev/attachments/20100427/e32aa5e7/CharacterName1.java>