System properties

Andrew Thompson lordpixel at me.com
Sat May 26 07:36:37 PDT 2012


Apologies if this is already obvious to everyone reading, but let me state clearly why I think changing this may be a problem.

The JVM's default encoding is used whenever one converts bytes to text without explicitly specifying an encoding. To be honest, that's a bad practice, but sometimes unavoidable.
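
To make that concrete, here is a minimal sketch of the two styles. The class and method names are mine, purely for illustration. The first method silently depends on file.encoding, while the second does not. (StandardCharsets needs Java 7; on Java 6 you would write Charset.forName("UTF-8") instead.)

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class DefaultEncodingDemo {
    // Decodes with whatever the JVM default happens to be on this machine.
    static String decodeImplicit(byte[] bytes) {
        return new String(bytes);
    }

    // Decodes with an explicitly named charset, independent of the JVM default.
    static String decodeExplicit(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        byte[] utf8Bytes = "caf\u00e9".getBytes(StandardCharsets.UTF_8);
        System.out.println("default charset: " + Charset.defaultCharset());
        System.out.println("implicit: " + decodeImplicit(utf8Bytes));
        System.out.println("explicit: " + decodeExplicit(utf8Bytes, StandardCharsets.UTF_8));
    }
}
```

Whether the "implicit" line prints correctly depends entirely on the platform default, which is the whole problem.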

Consider this problem statement:

Given an arbitrary 'plain text file', correctly determine its encoding and read its contents into a Java String.

Now, in many domains there is information to help you, e.g.:

- if the file came from a web server, the web server may have set an HTTP header which specifies the encoding (maybe it is even the right one!)
- if you wrote the file originally, hopefully you know the encoding [but see below]
- if the file is in some known format, e.g. it's an XML file, then the file itself contains the encoding on the first line (again, maybe it is even correct!)

But in general the problem is a hard one. Given an _arbitrary_ file, there is no good way to know the right answer. The web browser vendors implemented some sophisticated heuristic sniffers (universal charset detectors) which work OK some percentage of the time but often get it wrong.
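
For what it's worth, the cheapest heuristic I know of is to try a strict UTF-8 decode first and only fall back to a legacy encoding if that fails. The sketch below is my own illustration of that idea, not what the browser sniffers do, and the class name and the choice of fallback are assumptions. It is still a guess: legacy bytes that happen to form valid UTF-8 will be misread.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class Utf8FirstDecoder {
    // Try strict UTF-8; if the bytes are not valid UTF-8, decode with the fallback.
    static String decode(byte[] bytes, Charset fallback) {
        try {
            return StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes))
                    .toString();
        } catch (CharacterCodingException e) {
            // Not valid UTF-8, so assume the caller's legacy encoding.
            return new String(bytes, fallback);
        }
    }

    public static void main(String[] args) {
        byte[] latin1 = {'c', 'a', 'f', (byte) 0xE9}; // "café" in ISO-8859-1
        System.out.println(decode(latin1, StandardCharsets.ISO_8859_1));
    }
}
```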

As I recall, Apple originally gave a rationale for using MacRoman as the default on US English systems... from 1984 through 2001, when OS X came out, pretty much every text file created on a US or UK English Macintosh computer would have been encoded in MacRoman. So, statistically, assuming users would mostly open more files from their own machine than not, MacRoman made sense and would continue to do so for a while, as many people took years to move from OS 9 to OS X and continued to use older software that created MacRoman files.

It's now 2012 and most text editors default to UTF-8 on OS X systems. Windows 7 is much more likely to use UTF-8, whereas previous versions used codepage 1252. One can argue the balance has shifted and UTF-8 is a universally good choice.

Here's the thing though... if I have a program that relies on the default encoding, which I use today on Java 6 on a US Mac system, and I have saved a bunch of files on my computer, then when I launch the same program on Java 7 on the same computer and open the same files, some of the characters are going to be corrupted. One can argue the author of the program is at fault for not specifying the encoding explicitly, and indeed for allowing the program to use an ancient obsolete encoding like MacRoman, but the user is stuck with a broken program either way.
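
A minimal sketch of that failure, simulating "save under the old default, reopen under the new one" in memory. The class and method names are hypothetical. "x-MacRoman" is the JDK's name for MacRoman, but it lives in the extended charsets, so the sketch falls back to ISO-8859-1 as a stand-in if the runtime lacks it.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class DefaultChangeDemo {
    // The old default, or a single-byte stand-in if MacRoman is not available.
    static Charset legacyCharset() {
        return Charset.isSupported("x-MacRoman")
                ? Charset.forName("x-MacRoman")
                : StandardCharsets.ISO_8859_1;
    }

    // Encode with the old default (the bytes "on disk"), decode with the new one.
    static String saveThenReopen(String text, Charset oldDefault, Charset newDefault) {
        byte[] onDisk = text.getBytes(oldDefault);
        return new String(onDisk, newDefault);
    }

    public static void main(String[] args) {
        String original = "caf\u00e9";
        Charset legacy = legacyCharset();
        System.out.println("same default:  " + saveThenReopen(original, legacy, legacy));
        System.out.println("now UTF-8:     " + saveThenReopen(original, legacy, StandardCharsets.UTF_8));
        System.out.println("now US-ASCII:  " + saveThenReopen(original, legacy, StandardCharsets.US_ASCII));
    }
}
```

Plain ASCII survives in every case; it is only the accented and other non-ASCII characters that turn into replacement characters, which is why this kind of breakage is easy to miss in testing.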

Of course, everything I've said probably applies to non-English systems too, just with some other old Mac text encoding. And it probably also applies to programs that read text over the network and are sloppy about encodings there too.

Honestly, I think we probably are at a point where this change makes sense. More than a decade has passed. But at the very least it should *never* *ever* be US-ASCII in any shipping build (which would be going backwards from MacRoman) and it really *should* be in the release notes, because this is not a small change, though it is a subtle one. It's fair to make an explicit choice to do this; it's not a good idea to do it by accident.
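
If anyone wants to check what their own launch environment actually produces, something like the following should do. This is only a sketch: the class name is made up, and which environment variables matter for a given launcher is an assumption on my part (I print LANG and LC_ALL since the C/POSIX locale question came up below).

```java
import java.nio.charset.Charset;

public class EncodingReport {
    // True if the given charset is plain US-ASCII.
    static boolean isAsciiOnly(Charset cs) {
        return "US-ASCII".equals(cs.name());
    }

    // One-line summary of the encoding-related settings this JVM sees.
    static String describe() {
        return "file.encoding=" + System.getProperty("file.encoding")
                + " defaultCharset=" + Charset.defaultCharset().name()
                + " LANG=" + System.getenv("LANG")
                + " LC_ALL=" + System.getenv("LC_ALL");
    }

    public static void main(String[] args) {
        System.out.println(describe());
        if (isAsciiOnly(Charset.defaultCharset())) {
            System.out.println("warning: default charset is US-ASCII; non-ASCII text will be lost");
        }
    }
}
```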


On May 21, 2012, at 10:29 PM, Michael Hall wrote:

> 
> On May 21, 2012, at 8:59 PM, Xueming Shen wrote:
> 
>> My apology, I mean if run jvm in C or POSIX locale, the file.encoding is set to US-ASCII by Oracle 7u4/7u6, which
>> is expected.
> 
> OK, I'm finding some information on this. I will have to do some more follow up reading up on this to determine what locale applies to my running of the jvm unless someone happens to know?
> I guess as you say it still comes down to whether or not this should be in effect. That is the jvm is run with the C locale, then no bug report for file.encoding US-ASCII.
> If not, my locale should not be C/POSIX, then there should be a bug report as I should be UTF-8. 
> Again, MacRoman no longer applies, period, which was what I originally questioned. But the rationale for the change pertaining to locale is interesting.

AndyT (lordpixel - the cat who walks through walls)
A little bigger on the inside

	(see you later space cowboy, you can't take the sky from me)





More information about the macosx-port-dev mailing list