RFR: 8305734: BitSet.get(int, int) always returns the empty BitSet when the Integer.MAX_VALUE bit is set
Stuart Marks
smarks at openjdk.org
Sat Jul 22 04:46:51 UTC 2023
On Fri, 7 Apr 2023 12:22:03 GMT, Andy-Tatman <duke at openjdk.org> wrote:
> See https://bugs.java.com/bugdatabase/view_bug?bug_id=8305734 and https://bugs.java.com/bugdatabase/view_bug?bug_id=JDK-8311905
Hi, thanks for the background and the discussion of the alternatives. I'm not sure that drafting the CSR is the right next step; there are a number of foibles about editing the specifications that come into play when preparing the CSR. However, what you've written above is a good design discussion about the different alternatives and how they affect the specification and the implementation.
One thing that we need to consider is compatibility. From an excessively pedantic standpoint, any change in behavior is a potential incompatibility (somebody might be relying on that bug!) but I think that fixing obviously incorrect behavior is reasonable. As much as possible I'd like to make sure that things that worked, at least partially, continue to work.
With this in mind I'm leaning toward allowing a BitSet to contain bits including the Integer.MAX_VALUE bit, and adjusting various specs accordingly. I think the best way to specify this is to say that length() returns an int in the range [0, 2^31] as an unsigned value. In practice of course this means it can return any Java `int` value in the range [0, MAX_VALUE] and also Integer.MIN_VALUE. A sufficiently careful programmer can use such values correctly, as long as they avoid doing certain things such as comparing against zero. (We might want to have a note in the spec to that effect.)
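To illustrate the "sufficiently careful programmer" point, here is a small sketch (not part of the actual proposal) showing how a length of 2^31, which as a signed int is Integer.MIN_VALUE, can still be handled correctly by widening to long or using the unsigned comparison helpers, while a naive signed comparison against zero goes wrong:

```java
// Sketch: treating a hypothetical length() result as an unsigned 32-bit value.
// A BitSet containing the Integer.MAX_VALUE bit would have a logical length
// of 2^31, which as a signed int is Integer.MIN_VALUE.
public class UnsignedLengthDemo {
    public static void main(String[] args) {
        int length = Integer.MIN_VALUE; // hypothetical length() meaning 2^31

        // Wrong: a signed comparison treats the value as negative.
        System.out.println(length > 0);                              // false

        // Right: widen to long, or compare as unsigned.
        System.out.println(Integer.toUnsignedLong(length));          // 2147483648
        System.out.println(Integer.compareUnsigned(length, 0) > 0);  // true
    }
}
```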
It's good that you analyzed the `valueOf` methods. It looks to me like the implementation will store an actual array potentially containing bits beyond the MAX_VALUE bit, and this will affect things like length() and size() and such bits won't be accessible via get() or set(). So, on the one hand, this behavior seems clearly broken and ought to be fixed by limiting the input array along the lines suggested by your three options.
On the other hand, it seems that from looking at the code, it's possible to create an "over-size" BitSet with valueOf(), and perform bulk bit manipulations on it and other BitSets using methods like and(), or(), and xor(). It also appears possible to read out the bits successfully using toByteArray() or toLongArray(). Thus an application might be able to manipulate bit arrays of up to about 2^37 bits long by using BitSet with long arrays or LongBuffers. Restricting the input of valueOf() to 2^31 bits would break such applications.
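At a small scale, the valueOf()/bulk-operation/toLongArray() round trip looks like the following; the same pattern, applied to much larger long arrays, is what an over-size application would rely on (the small arrays here are purely illustrative):

```java
import java.util.Arrays;
import java.util.BitSet;

public class BulkOpsDemo {
    public static void main(String[] args) {
        long[] a = { 0b1010L, -1L };
        long[] b = { 0b0110L,  0L };

        // Build BitSets directly over long arrays (little-endian word order).
        BitSet x = BitSet.valueOf(a);
        BitSet y = BitSet.valueOf(b);

        // Bulk manipulation: x ^= y, operating word by word under the hood.
        x.xor(y);

        // Read the result back out without going through get(int).
        long[] result = x.toLongArray();
        System.out.println(Arrays.toString(result)); // [12, -1]
    }
}
```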
Also note that the specification says the bits are numbered by nonnegative integers (that is, zero to 2^31 - 1), which would seem to preclude longer bit arrays. However, if somebody does have an application that works with longer bit arrays, it seems unreasonable for us to break that application, unless we have good justification for doing so. I think it's a bit unlikely that somebody is actually relying on the feature of over-size arrays, but it seems possible, and I don't see any evidence otherwise (but I haven't looked too hard yet).
If we were to restrict BitSet to 2^31 bits, I'd rule out (2) from the three options. It doesn't seem sensible to check for and throw an exception only if certain bits are set. If the system did that, code might sometimes work and sometimes fail depending on the _values_ contained in the array or buffer, which would make failures hard to diagnose. Of the remaining options, I'd prefer (1), rejecting long arrays, over (3), ignoring the extra values in the array. Option (1) is fail-fast but arguably not very friendly; however, option (3) could result in silent data loss, which definitely seems like a bad idea.
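A fail-fast check along the lines of option (1) might look like the following sketch; the helper name, its placement, and the choice of IllegalArgumentException are all assumptions for illustration, not the actual proposal:

```java
// Sketch of option (1): reject input arrays that could encode bit indexes
// at or beyond 2^31, regardless of whether those bits are actually set.
// Hypothetical helper; not part of java.util.BitSet.
public class ValueOfCheck {
    // 2^31 bits fit in exactly 2^25 64-bit words.
    static final int MAX_WORDS = 1 << 25;

    static void checkWordCount(int numWords) {
        if (numWords > MAX_WORDS) {
            throw new IllegalArgumentException(
                "array encodes " + (64L * numWords) + " bits; max is 2^31");
        }
    }
}
```

Because the check depends only on the array length, it fails deterministically for a given input shape, avoiding the value-dependent failures that made option (2) unattractive.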
However, I'm still uncomfortable with restricting BitSet to 2^31 bits and thus changing the behavior of the various valueOf() methods. I'd welcome further discussion or evidence regarding whether applications might or might not be relying on the over-size behavior. I did a quick survey of open source code but haven't found anything conclusive yet.
Another approach would be to adjust the specifications to allow for over-size arrays. Methods that operate on bit indexes would for the most part not be able to read or write the areas of the array beyond bit 2^31, but the bulk operations (and/or/xor) would work. Methods such as cardinality(), length(), nextSetBit(), and similar would be more difficult. We can accommodate values up to 2^31 using the "unsigned" carve-out mentioned previously, but that doesn't work for larger values.
-------------
PR Comment: https://git.openjdk.org/jdk/pull/13388#issuecomment-1646454706
More information about the core-libs-dev mailing list