JEP-198 - Let's start talking about JSON

Thu Dec 15 23:18:21 UTC 2022

I'll have to read the whole thing, but are pure JSON parsers really the
go-to for most people? I'm a big advocate of also providing something
similar to XPath/XQuery, and IMHO that's JSONiq (90% XQuery). I might be
biased, of course, as I'm working on Brackit[1] in my spare time (a query
compiler intended to be used, with proven optimizations, by document
stores / JSON stores, but which can also be used as an in-memory query
engine).

kind regards
Johannes

[1] https://github.com/sirixdb/brackit

On Thu, 15 Dec 2022 at 23:03, Reinier Zwitserloot <
reinier at zwitserloot.com> wrote:

> A recent Advent-of-Code puzzle also made me double check the support of
> JSON in the java core libs and it is indeed a curious situation that the
> java core libs don’t cater to it particularly well.
>
> However, I’m not seeing an easy way forward to try to close this hole in
> the core library offerings.
>
> If you need to stream huge swaths of JSON, generally there’s a clear unit
> size that you can just databind. Something like:
>
> String jsonStr = """ { "version": 5, "data": [
>   -- 1 million relatively small records in this list --
>   ] } """;
>
>
> The usual swath of JSON parsers tend to support this (giving you a stream
> of java instances created by databinding those small records one by one),
> or if not, the best move forward is presumably to file a pull request with
> those projects; the java.util.logging experiment shows that trying to
> ‘core-librarize’ needs that the community at large already fulfills with
> third party deps isn’t a good move, especially if the core library variant
> tries to oversimplify to avoid the trap of being too opinionated (which
> core libs shouldn’t be). In other words, the need for ‘stream this JSON for
> me’ style APIs is even more exotic than Ethan is suggesting.
>
> I see a fundamental problem here:
>
>
>    - The 95%+ use case for working with JSON for your average java coder
>    is best done with data binding.
>    - The core libs don’t want to provide it, partly because it’s got a
>    large design space, partly because the field’s already covered by GSON and
>    Jackson; java.util.logging shows that competing with established
>    third-party libraries doesn’t work. At least, I gather that’s what Ethan
>    thinks and I agree with this assessment.
>    - A language that claims to be “batteries included” that doesn’t ship
>    with a JSON parser in this era is dubious, to say the least.
>
>
> I’m not sure how to square this circle. Hence it feels like core-libs
> needs to hold some more fundamental debates first:
>
>
>    - Maybe it’s time to state in a more or less official decree that
>    well-established, large design space jobs will remain the purview of
>    dependencies no matter how popular they are, unless being part of the
>    core-libs adds something more fundamental that third party deps cannot
>    bring to the table (such as language integration), or the community
>    standardizes on a single library (JSR310’s story, more or less). JSON
>    parsing would qualify as ‘well-established’ (GSON and Jackson) and ‘large
>    design space’, as Ethan pointed out.
>    - Given that 99% of java projects, even really simple ones, start with
>    maven/gradle and a list of deps, is that really a problem?
>
>
> I’m honestly not sure what the right answer is. On one hand, the npm
> ecosystem seems to be doing very well even though their ‘batteries
> included’ situation is an utter shambles. Then again, the tendency of your
> average nodejs project to include 10x+ more dependencies than projects in
> other languages is likely a significant part of the security clown fiesta
> going on over there as far as 3rd party deps are concerned, so by no means
> should java just blindly emulate their solutions.
>
> I don’t like the idea of shipping a non-data-binding JSON API in the core
> libs. The root issue with JSON is that you just can’t tell how to interpret
> any given JSON token, because that’s not how JSON is used in practice. What
> does 5 mean? Could be that I’m to take that as an int, or as a double, or
> perhaps even as a j.t.Instant (epoch-millis). Defaulting behaviour
> (similar to j.u.Map’s .getOrDefault) is also *very* convenient for parsing
> most JSON out there in the real world - omitting k/v pairs whose value is
> still on default is very common. That’s what makes those databind libraries so
> enticing: Instead of trying to pattern match my way into this behaviour:
>
>
>    - If the element isn’t there at all or null, give me a list-of-longs
>    with a single 0 in it.
>    - If the element is a number, make me a list-of-longs with 1 value in
>    it, that is that number, as long.
>    - If the element is a string, parse it into a long, then get me a list
>    with this one long value (because IEEE double rules mean sometimes you have
>    to put these things in string form or they get mangled by javascript-
>    eval style parsers).
>
>
> And yet the above is quite common, and can easily be done by a databinder,
> which sees you want a List<Long> for a field whose default value is
> List.of(0L), and, armed with that knowledge, can translate the JSON into
> java in that way.
>
> You don’t *need* databinding to cater to this idea: You could for example
> have a jsonNode.asLong(123) method that would parse a string if need be,
> even. But this has nothing to do with pattern matching either.
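
A minimal sketch of that kind of lenient accessor, assuming the JSON value has already been parsed into plain Java objects (null, Number or String). The class and method names (LenientLongs, asLong, asLongList) are hypothetical and not taken from Jackson or Gson:

```java
import java.util.List;

final class LenientLongs {
    // Coerce an already-parsed JSON value to a long. Absent values and JSON
    // null fall back to the default; strings are parsed as decimal longs.
    static long asLong(Object node, long defaultValue) {
        if (node == null) return defaultValue;
        if (node instanceof Number n) return n.longValue();
        if (node instanceof String s) return Long.parseLong(s.trim());
        throw new IllegalArgumentException("cannot read as long: " + node);
    }

    // The three rules from the list above:
    // missing/null -> [0], number -> [n], string -> [parsed value].
    static List<Long> asLongList(Object node) {
        return List.of(asLong(node, 0L));
    }

    public static void main(String[] args) {
        System.out.println(asLongList(null));
        System.out.println(asLongList(5));
        // 2^53 + 1 survives as a string, but not as an IEEE double.
        System.out.println(asLongList("9007199254740993"));
    }
}
```

No pattern matching over a JSON model is involved; the caller states the type it wants and the default, and the accessor does the coercion.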
>
>  --Reinier Zwitserloot
>
>
> On 15 Dec 2022 at 21:30:17, Ethan McCue <ethan at mccue.dev> wrote:
>
>> I'm writing this to drive some forward motion and to nerd-snipe those who
>> know better than I do into putting their thoughts into words.
>>
>> There are three ways to process JSON[1]
>> - Streaming (Push or Pull)
>> - Traversing a Tree (Realized or Lazy)
>> - Declarative Databind (N ways)
>>
>> Of these, JEP-198 explicitly ruled out providing "JAXB style type safe
>> data binding."
>>
>> No justification is given, but if I had to insert my own: mapping the
>> Json model to/from the Java/JVM object model is a cursed combo of
>> - Huge possible design space
>> - Unpalatably large surface for backwards compatibility
>> - Serialization! Boo![2]
>>
>> So for an artifact like the JDK, it probably doesn't make sense to
>> include. That tracks.
>> It won't make everyone happy, people like databind APIs, but it tracks.
>>
>> So for the "read flow" these are the things to figure out.
>>
>>                 | Should Provide? | Intended User(s) |
>> ----------------+-----------------+------------------+
>>  Streaming Push |                 |                  |
>> ----------------+-----------------+------------------+
>>  Streaming Pull |                 |                  |
>> ----------------+-----------------+------------------+
>>  Realized Tree  |                 |                  |
>> ----------------+-----------------+------------------+
>>  Lazy Tree      |                 |                  |
>> ----------------+-----------------+------------------+
>>
>> At which point, we should talk about what "meets needs of Java developers
>> using JSON" implies.
>>
>> JSON is ubiquitous. Most kinds of software us schmucks write could have a
>> reason to interact with it.
>> The full set of "user personas" therefore isn't practical for me to talk
>> about.[3]
>>
>> JSON documents, however, are not so varied.
>>
>> - There are small ones (1-10kb)
>> - There are medium ones (10-1000kb)
>> - There are big ones (1000kb-???)
>>
>> - There are shallow ones
>> - There are deep ones
>>
>> So that feels like an easier direction to talk about it from.
>>
>>
>> This repo[4] has some convenient toy examples of how some of those APIs
>> look in libraries
>> in the ecosystem. Specifically the Streaming Pull and Realized Tree
>> models.
>>
>>         User r = new User();
>>         while (true) {
>>             JsonToken token = reader.peek();
>>             switch (token) {
>>                 case BEGIN_OBJECT:
>>                     reader.beginObject();
>>                     break;
>>                 case END_OBJECT:
>>                     reader.endObject();
>>                     return r;
>>                 case NAME:
>>                     String fieldname = reader.nextName();
>>                     switch (fieldname) {
>>                         case "id":
>>                             r.setId(reader.nextString());
>>                             break;
>>                         case "index":
>>                             r.setIndex(reader.nextInt());
>>                             break;
>>                         ...
>>                         case "friends":
>>                             r.setFriends(new ArrayList<>());
>>                             Friend f = null;
>>                             carryOn = true;
>>                             while (carryOn) {
>>                                 token = reader.peek();
>>                                 switch (token) {
>>                                     case BEGIN_ARRAY:
>>                                         reader.beginArray();
>>                                         break;
>>                                     case END_ARRAY:
>>                                         reader.endArray();
>>                                         carryOn = false;
>>                                         break;
>>                                     case BEGIN_OBJECT:
>>                                         reader.beginObject();
>>                                         f = new Friend();
>>                                         break;
>>                                     case END_OBJECT:
>>                                         reader.endObject();
>>                                         r.getFriends().add(f);
>>                                         break;
>>                                     case NAME:
>>                                         String fn = reader.nextName();
>>                                         switch (fn) {
>>                                             case "id":
>>                                                 f.setId(reader.nextString());
>>                                                 break;
>>                                             case "name":
>>                                                 f.setName(reader.nextString());
>>                                                 break;
>>                                         }
>>                                         break;
>>                                 }
>>                             }
>>                             break;
>>                     }
>>             }
>>
>> I think it's not hard to argue that the streaming APIs are brutalist. The
>> above is Gson, but Jackson, Moshi, etc. seem at least morally equivalent.
>>
>> It's hard to write, hard to write *correctly*, and there is a curious
>> propensity towards pairing it with anemic, mutable models.
>>
>> That being said, it handles big documents and deep documents really well.
>> It also performs
>> pretty darn well and is good enough as a "fallback" when the intended
>> user experience
>> is through something like databind.
>>
>> So what could we do meaningfully better with the language we have
>> today/will have tomorrow?
>>
>> - Sealed interfaces + Pattern matching could give a nicer model for tokens
>>
>>         sealed interface JsonToken {
>>             record Field(String name) implements JsonToken {}
>>             record BeginArray() implements JsonToken {}
>>             record EndArray() implements JsonToken {}
>>             record BeginObject() implements JsonToken {}
>>             record EndObject() implements JsonToken {}
>>             // ...
>>         }
>>
>>         // ...
>>
>>         User r = new User();
>>         while (true) {
>>             JsonToken token = reader.peek();
>>             switch (token) {
>>                 case BeginObject __:
>>                     reader.beginObject();
>>                     break;
>>                 case EndObject __:
>>                     reader.endObject();
>>                     return r;
>>                 case Field("id"):
>>                     r.setId(reader.nextString());
>>                     break;
>>                 case Field("index"):
>>                     r.setIndex(reader.nextInt());
>>                     break;
>>
>>                 // ...
>>
>>                 case Field("friends"):
>>                     r.setFriends(new ArrayList<>());
>>                     Friend f = null;
>>                     carryOn = true;
>>                     while (carryOn) {
>>                         token = reader.peek();
>>                         switch (token) {
>>                 // ...
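
For comparison, a sketch that compiles on Java 17 today, assuming a hypothetical token stream handed over as an Iterator. It uses instanceof type patterns rather than the `case Field("id")` form above, which is not valid Java syntax at the time of writing. The Str and Num tokens and the immutable User record are illustrative additions:

```java
import java.util.Iterator;
import java.util.List;

final class TokenSketch {
    // Sealed token hierarchy; the permitted subtypes are the nested records.
    sealed interface JsonToken {
        record BeginObject() implements JsonToken {}
        record EndObject() implements JsonToken {}
        record Field(String name) implements JsonToken {}
        record Str(String value) implements JsonToken {}
        record Num(long value) implements JsonToken {}
    }

    record User(String id, long index) {}

    // Reads one flat object (no nesting) and ignores fields it does not know.
    static User readUser(Iterator<JsonToken> tokens) {
        String id = null;
        long index = 0;
        while (tokens.hasNext()) {
            JsonToken token = tokens.next();
            if (token instanceof JsonToken.Field field) {
                JsonToken value = tokens.next(); // a field is followed by its value
                if (field.name().equals("id") && value instanceof JsonToken.Str s) {
                    id = s.value();
                } else if (field.name().equals("index") && value instanceof JsonToken.Num n) {
                    index = n.value();
                }
            } else if (token instanceof JsonToken.EndObject) {
                return new User(id, index);
            }
        }
        throw new IllegalStateException("unterminated object");
    }

    public static void main(String[] args) {
        List<JsonToken> tokens = List.of(
                new JsonToken.BeginObject(),
                new JsonToken.Field("id"), new JsonToken.Str("abc"),
                new JsonToken.Field("index"), new JsonToken.Num(7),
                new JsonToken.EndObject());
        System.out.println(readUser(tokens.iterator()));
    }
}
```

On a JDK with pattern matching for switch, the if/else chain could become a switch over the sealed type, with the compiler checking exhaustiveness.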
>>
>> - Value classes can make it all more efficient
>>
>>         sealed interface JsonToken {
>>             value record Field(String name) implements JsonToken {}
>>             value record BeginArray() implements JsonToken {}
>>             value record EndArray() implements JsonToken {}
>>             value record BeginObject() implements JsonToken {}
>>             value record EndObject() implements JsonToken {}
>>             // ...
>>         }
>>
>> - (Fun One) We can transform a simpler-to-write push parser into a pull
>> parser with Coroutines
>>
>>     This is just a toy we could play with while making something in the
>>     JDK. I'm pretty sure we could make a parser which feeds into
>>     something like
>>
>>         interface Listener {
>>             void onObjectStart();
>>             void onObjectEnd();
>>             void onArrayStart();
>>             void onArrayEnd();
>>             void onField(String name);
>>             // ...
>>         }
>>
>>     and invert a loop like
>>
>>         while (true) {
>>             char c = next();
>>             switch (c) {
>>                 case '{':
>>                     listener.onObjectStart();
>>                     // ...
>>                 // ...
>>             }
>>         }
>>
>>     by putting a Coroutine.yield in the callback.
>>
>>     That might be a meaningful simplification in code structure, I don't
>> know enough to say.
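
Java has no Coroutine.yield in the standard library, so the sketch below swaps in a helper thread and a SynchronousQueue as a stand-in: the push parser blocks inside each callback until the consumer pulls the next event. The toy parser only reports brackets (it does not even skip string contents), and the event names are assumptions:

```java
import java.util.Iterator;
import java.util.NoSuchElementException;
import java.util.concurrent.SynchronousQueue;

final class PushToPull {
    interface Listener {
        void onObjectStart();
        void onObjectEnd();
        void onArrayStart();
        void onArrayEnd();
    }

    // Toy push parser: reports brackets and ignores every other character.
    static void pushParse(String src, Listener listener) {
        for (char c : src.toCharArray()) {
            switch (c) {
                case '{': listener.onObjectStart(); break;
                case '}': listener.onObjectEnd(); break;
                case '[': listener.onArrayStart(); break;
                case ']': listener.onArrayEnd(); break;
                default: break;
            }
        }
    }

    // End-of-input sentinel, compared by identity.
    private static final String END = new String("END");

    // Pull view over the push parser: each callback hands one event to the
    // consumer and blocks until the consumer takes it.
    static Iterator<String> pull(String src) {
        SynchronousQueue<String> handoff = new SynchronousQueue<>();
        Thread producer = new Thread(() -> {
            pushParse(src, new Listener() {
                public void onObjectStart() { put(handoff, "BEGIN_OBJECT"); }
                public void onObjectEnd() { put(handoff, "END_OBJECT"); }
                public void onArrayStart() { put(handoff, "BEGIN_ARRAY"); }
                public void onArrayEnd() { put(handoff, "END_ARRAY"); }
            });
            put(handoff, END);
        });
        producer.setDaemon(true); // do not keep the JVM alive if abandoned
        producer.start();
        return new Iterator<>() {
            private String upcoming = take(handoff);
            public boolean hasNext() { return upcoming != END; }
            public String next() {
                if (upcoming == END) throw new NoSuchElementException();
                String current = upcoming;
                upcoming = take(handoff);
                return current;
            }
        };
    }

    private static void put(SynchronousQueue<String> queue, String event) {
        try {
            queue.put(event);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    private static String take(SynchronousQueue<String> queue) {
        try {
            return queue.take();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        pull("{\"a\": [1, 2]}").forEachRemaining(System.out::println);
    }
}
```

The parser loop stays a plain push loop; only the listener knows it is being driven one event at a time.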
>>
>> But, I think there are some hard questions like
>>
>> - Is the intent[5] to be the backing parser for ecosystem databind APIs?
>> - Is the intent that users who want to handle big/deep documents fall
>> back to this?
>> - Are those new language features / conveniences enough to offset the
>> cost of committing to a new api?
>> - To whom exactly does a low level api provide value?
>> - What benefit is standardization in the JDK?
>>
>> and just generally - who would be the consumer(s) of this?
>>
>> The other kind of API still on the table is a Tree. There are two ways to
>> handle this
>>
>> 1. Load it into `Object`. Use a bunch of instanceof checks/casts to
>> confirm what it actually is.
>>
>>         Object v;
>>         User u = new User();
>>
>>         if ((v = jso.get("id")) != null) {
>>             u.setId((String) v);
>>         }
>>         if ((v = jso.get("index")) != null) {
>>             u.setIndex(((Long) v).intValue());
>>         }
>>         if ((v = jso.get("guid")) != null) {
>>             u.setGuid((String) v);
>>         }
>>         if ((v = jso.get("isActive")) != null) {
>>             u.setIsActive(((Boolean) v));
>>         }
>>         if ((v = jso.get("balance")) != null) {
>>             u.setBalance((String) v);
>>         }
>>         // ...
>>         if ((v = jso.get("latitude")) != null) {
>>             u.setLatitude(v instanceof BigDecimal
>>                     ? ((BigDecimal) v).doubleValue() : (Double) v);
>>         }
>>         if ((v = jso.get("longitude")) != null) {
>>             u.setLongitude(v instanceof BigDecimal
>>                     ? ((BigDecimal) v).doubleValue() : (Double) v);
>>         }
>>         if ((v = jso.get("greeting")) != null) {
>>             u.setGreeting((String) v);
>>         }
>>         if ((v = jso.get("favoriteFruit")) != null) {
>>             u.setFavoriteFruit((String) v);
>>         }
>>         if ((v = jso.get("tags")) != null) {
>>             List<Object> jsonarr = (List<Object>) v;
>>             u.setTags(new ArrayList<>());
>>             for (Object vi : jsonarr) {
>>                 u.getTags().add((String) vi);
>>             }
>>         }
>>         if ((v = jso.get("friends")) != null) {
>>             List<Object> jsonarr = (List<Object>) v;
>>             u.setFriends(new ArrayList<>());
>>             for (Object vi : jsonarr) {
>>                 Map<String, Object> jso0 = (Map<String, Object>) vi;
>>                 Friend f = new Friend();
>>                 f.setId((String) jso0.get("id"));
>>                 f.setName((String) jso0.get("name"));
>>                 u.getFriends().add(f);
>>             }
>>         }
>>
>> 2. Have an explicit model for Json, and helper methods that do said
>> casts[6]
>>
>>
>>         this.setSiteSetting(readFromJson(jsonObject.getJsonObject("site")));
>>         JsonArray groups = jsonObject.getJsonArray("group");
>>         if (groups != null) {
>>             int len = groups.size();
>>             for (int i = 0; i < len; i++) {
>>                 JsonObject grp = groups.getJsonObject(i);
>>                 SNMPSetting grpSetting = readFromJson(grp);
>>                 String grpName = grp.getString("dbgroup", null);
>>                 if (grpName != null && grpSetting != null)
>>                     this.groupSettings.put(grpName, grpSetting);
>>             }
>>         }
>>         JsonArray hosts = jsonObject.getJsonArray("host");
>>         if (hosts != null) {
>>             int len = hosts.size();
>>             for (int i = 0; i < len; i++) {
>>                 JsonObject host = hosts.getJsonObject(i);
>>                 SNMPSetting hostSetting = readFromJson(host);
>>                 String hostName = host.getString("dbhost", null);
>>                 if (hostName != null && hostSetting != null)
>>                     this.hostSettings.put(hostName, hostSetting);
>>             }
>>         }
>>
>> I think what has become easier to represent in the language nowadays is
>> that explicit model for Json.
>> It's the 101 lesson of sealed interfaces.[7] It feels nice and clean.
>>
>>         sealed interface Json {
>>             final class Null implements Json {}
>>             final class True implements Json {}
>>             final class False implements Json {}
>>             final class Array implements Json {}
>>             final class Object implements Json {}
>>             final class String implements Json {}
>>             final class Number implements Json {}
>>         }
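
A slightly fuller sketch of the same model, assuming each variant is a record that carries its value and that True/False are merged into one Bool. The shorter names (Null, Bool, Num, Str, Arr, Obj) are a hypothetical choice that avoids shadowing java.lang.Object and java.lang.String inside the interface; stringField is an illustrative helper:

```java
import java.util.List;
import java.util.Map;
import java.util.Optional;

final class JsonModel {
    sealed interface Json {
        record Null() implements Json {}
        record Bool(boolean value) implements Json {}
        record Num(double value) implements Json {}
        record Str(String value) implements Json {}
        record Arr(List<Json> items) implements Json {}
        record Obj(Map<String, Json> fields) implements Json {}
    }

    // Typed accessor: present only if json is an object whose field holds a string.
    static Optional<String> stringField(Json json, String name) {
        if (json instanceof Json.Obj obj && obj.fields().get(name) instanceof Json.Str s) {
            return Optional.of(s.value());
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        Json user = new Json.Obj(Map.of(
                "id", new Json.Str("abc"),
                "tags", new Json.Arr(List.of(new Json.Str("x"), new Json.Null()))));
        System.out.println(stringField(user, "id"));
        System.out.println(stringField(user, "tags"));
    }
}
```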
>>
>> And the cast-and-check approach is now more viable on account of pattern
>> matching.
>>
>>         if (jso.get("id") instanceof String v) {
>>             u.setId(v);
>>         }
>>         if (jso.get("index") instanceof Long v) {
>>             u.setIndex(v.intValue());
>>         }
>>         if (jso.get("guid") instanceof String v) {
>>             u.setGuid(v);
>>         }
>>
>>         // or
>>
>>         if (jso.get("id") instanceof String id &&
>>                 jso.get("index") instanceof Long index &&
>>                 jso.get("guid") instanceof String guid) {
>>             return new User(id, index, guid, ...); // look ma, no setters!
>>         }
>>
>>
>> And on the horizon, again, are value types.
>>
>> But there are problems with this approach beyond the performance
>> implications of loading into
>> a tree.
>>
>> For one, all the code samples above have different behaviors around null
>> keys and missing keys
>> that are not obvious from first glance.
>>
>> This won't accept any null or missing fields
>>
>>         if (jso.get("id") instanceof String id &&
>>                 jso.get("index") instanceof Long index &&
>>                 jso.get("guid") instanceof String guid) {
>>             return new User(id, index, guid, ...);
>>         }
>>
>> This will accept individual null or missing fields, but also will
>> silently ignore
>> fields with incorrect types
>>
>>         if (jso.get("id") instanceof String v) {
>>             u.setId(v);
>>         }
>>         if (jso.get("index") instanceof Long v) {
>>             u.setIndex(v.intValue());
>>         }
>>         if (jso.get("guid") instanceof String v) {
>>             u.setGuid(v);
>>         }
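
To make the difference concrete, a small sketch over a Map<String, Object> tree, assuming a hypothetical immutable User record with a nullable Long index. Given an object whose "index" arrived as a string, the all-or-nothing version rejects the whole thing while the field-by-field version quietly drops the value:

```java
import java.util.Map;
import java.util.Optional;

final class NullHandling {
    record User(String id, Long index, String guid) {}

    // All-or-nothing: any missing, null or mistyped field rejects the object.
    static Optional<User> strict(Map<String, Object> jso) {
        if (jso.get("id") instanceof String id
                && jso.get("index") instanceof Long index
                && jso.get("guid") instanceof String guid) {
            return Optional.of(new User(id, index, guid));
        }
        return Optional.empty();
    }

    // Field-by-field: missing, null and mistyped values all silently become null.
    static User lenient(Map<String, Object> jso) {
        String id = jso.get("id") instanceof String v ? v : null;
        Long index = jso.get("index") instanceof Long v ? v : null;
        String guid = jso.get("guid") instanceof String v ? v : null;
        return new User(id, index, guid);
    }

    public static void main(String[] args) {
        // "index" is a string here, not a number.
        Map<String, Object> jso = Map.of("id", "a", "index", "5", "guid", "g");
        System.out.println(strict(jso));
        System.out.println(lenient(jso));
    }
}
```

Neither version says which field was the problem, which is the gap the error-reporting discussion below is about.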
>>
>> And, compared to databind, where there is information about the expected
>> structure of the document and it's the job of the framework to assert
>> that, I posit that the errors that would be encountered when writing code
>> against this would be more like
>>
>>     "something wrong with user"
>>
>> than
>>
>>     "problem at users[5].name, expected string or null. got 5"
>>
>> Which feels unideal.
>>
>>
>> One approach I find promising is something close to what Elm does with
>> its decoders[8]. Not just combining assertion
>> and binding like what pattern matching with records allows, but including
>> a scheme for bubbling/nesting errors.
>>
>>     static String string(Json json) throws JsonDecodingException {
>>         if (!(json instanceof Json.String jsonString)) {
>>             throw JsonDecodingException.of(
>>                     "expected a string",
>>                     json
>>             );
>>         } else {
>>             return jsonString.value();
>>         }
>>     }
>>
>>     static <T> T field(Json json, String fieldName,
>>                        Decoder<? extends T> valueDecoder) throws JsonDecodingException {
>>         var jsonObject = object(json);
>>         var value = jsonObject.get(fieldName);
>>         if (value == null) {
>>             throw JsonDecodingException.atField(
>>                     fieldName,
>>                     JsonDecodingException.of(
>>                             "no value for field",
>>                             json
>>                     )
>>             );
>>         }
>>         else {
>>             try {
>>                 return valueDecoder.decode(value);
>>             } catch (JsonDecodingException e) {
>>                 throw JsonDecodingException.atField(
>>                         fieldName,
>>                         e
>>                 );
>>             } catch (Exception e) {
>>                 throw JsonDecodingException.atField(
>>                         fieldName,
>>                         JsonDecodingException.of(e, value)
>>                 );
>>             }
>>         }
>>     }
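
The snippet above leans on types that are not shown (Json, Decoder, JsonDecodingException). A self-contained sketch of how they might fit together, with every definition here being an assumption: the exception is unchecked for brevity, and atField builds a dotted path so that nested failures report where they happened:

```java
import java.util.Map;

final class DecoderSketch {
    sealed interface Json {
        record Str(String value) implements Json {}
        record Num(long value) implements Json {}
        record Obj(Map<String, Json> fields) implements Json {}
    }

    // Carries the path to the failing value separately from the problem text.
    static final class JsonDecodingException extends RuntimeException {
        final String path;
        final String problem;

        private JsonDecodingException(String path, String problem) {
            super(path.isEmpty() ? problem : "problem at " + path + ": " + problem);
            this.path = path;
            this.problem = problem;
        }

        static JsonDecodingException of(String problem, Json got) {
            return new JsonDecodingException("", problem + ", got " + got);
        }

        // Prefixes the field name, so errors nest as they bubble outwards.
        static JsonDecodingException atField(String name, JsonDecodingException cause) {
            String path = cause.path.isEmpty() ? name : name + "." + cause.path;
            return new JsonDecodingException(path, cause.problem);
        }
    }

    @FunctionalInterface
    interface Decoder<T> {
        T decode(Json json);
    }

    static String string(Json json) {
        if (json instanceof Json.Str s) return s.value();
        throw JsonDecodingException.of("expected a string", json);
    }

    static <T> T field(Json json, String name, Decoder<? extends T> decoder) {
        if (!(json instanceof Json.Obj obj)) {
            throw JsonDecodingException.of("expected an object", json);
        }
        Json value = obj.fields().get(name);
        if (value == null) {
            throw JsonDecodingException.atField(name,
                    JsonDecodingException.of("no value for field", json));
        }
        try {
            return decoder.decode(value);
        } catch (JsonDecodingException e) {
            throw JsonDecodingException.atField(name, e);
        }
    }

    public static void main(String[] args) {
        Json user = new Json.Obj(Map.of(
                "friend", new Json.Obj(Map.of("name", new Json.Num(5)))));
        Decoder<String> friendName = j -> field(j, "name", DecoderSketch::string);
        try {
            field(user, "friend", friendName);
        } catch (JsonDecodingException e) {
            System.out.println(e.getMessage()); // message starts with the path friend.name
        }
    }
}
```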
>>
>> Which I think has some benefits over the ways I've seen of working with
>> trees.
>>
>>
>>
>> - It is declarative enough that folks who prefer databind might be happy
>> enough.
>>
>>         static User fromJson(Json json) {
>>             return new User(
>>                 Decoder.field(json, "id", Decoder::string),
>>                 Decoder.field(json, "index", Decoder::long_),
>>                 Decoder.field(json, "guid", Decoder::string)
>>             );
>>         }
>>
>>         // ...
>>
>>         List<User> users = Decoder.array(json, User::fromJson);
>>
>> - Handling null and optional fields could be less easily conflated
>>
>>     Decoder.field(json, "id", Decoder::string);
>>
>>     Decoder.nullableField(json, "id", Decoder::string);
>>
>>     Decoder.optionalField(json, "id", Decoder::string);
>>
>>     Decoder.optionalNullableField(json, "id", Decoder::string);
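
One possible reading of those four variants, as a sketch. The semantics are assumptions (in particular that optionalNullableField collapses both "absent" and "null" into Optional.empty()), the helpers take a Json.Obj directly for brevity, and errors are plain IllegalArgumentExceptions:

```java
import java.util.Map;
import java.util.Optional;
import java.util.function.Function;

final class FieldKinds {
    sealed interface Json {
        record Null() implements Json {}
        record Str(String value) implements Json {}
        record Obj(Map<String, Json> fields) implements Json {}
    }

    static String string(Json json) {
        if (json instanceof Json.Str s) return s.value();
        throw new IllegalArgumentException("expected a string, got " + json);
    }

    private static Json required(Json.Obj obj, String name) {
        Json value = obj.fields().get(name);
        if (value == null) throw new IllegalArgumentException("missing field " + name);
        return value;
    }

    // Key must be present and non-null (the decoder itself rejects Null).
    static <T> T field(Json.Obj obj, String name, Function<Json, T> decoder) {
        return decoder.apply(required(obj, name));
    }

    // Key must be present; JSON null is accepted and becomes Java null.
    static <T> T nullableField(Json.Obj obj, String name, Function<Json, T> decoder) {
        Json value = required(obj, name);
        return value instanceof Json.Null ? null : decoder.apply(value);
    }

    // Key may be absent; if present, JSON null is still an error.
    static <T> Optional<T> optionalField(Json.Obj obj, String name, Function<Json, T> decoder) {
        Json value = obj.fields().get(name);
        return value == null ? Optional.empty() : Optional.of(decoder.apply(value));
    }

    // Key may be absent or null; both collapse to Optional.empty().
    static <T> Optional<T> optionalNullableField(Json.Obj obj, String name, Function<Json, T> decoder) {
        Json value = obj.fields().get(name);
        return (value == null || value instanceof Json.Null)
                ? Optional.empty()
                : Optional.of(decoder.apply(value));
    }

    public static void main(String[] args) {
        Json.Obj withNull = new Json.Obj(Map.of("id", new Json.Null()));
        System.out.println(nullableField(withNull, "id", FieldKinds::string));
        System.out.println(optionalNullableField(withNull, "id", FieldKinds::string));
    }
}
```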
>>
>>
>> - It composes well with user defined classes
>>
>>     record Guid(String value) {
>>         Guid {
>>             // some assertions on the structure of value
>>         }
>>     }
>>
>>     Decoder.field(json, "guid", guid -> new Guid(Decoder.string(guid)));
>>
>>     // or even
>>
>>     record Guid(String value) {
>>         Guid {
>>             // some assertions on the structure of value
>>         }
>>
>>         static Guid fromJson(Json json) {
>>             return new Guid(Decoder.string(json));
>>         }
>>     }
>>
>>     Decoder.field(json, "guid", Guid::fromJson);
>>
>>
>> - When something goes wrong, the API can handle the fiddliness of
>> capturing information for feedback.
>>
>>     In the code I've sketched out, it's just which field/index things went
>>     wrong at. Capturing metadata like row/col numbers of the source would
>>     potentially be sensible too.
>>
>>     It's just not reasonable to expect devs to do extra work to get that,
>>     and it's really nice to give it.
>>
>> There are also some downsides like
>>
>> -  I do not know how compatible it would be with lazy trees.
>>
>>      Lazy trees being the only way that a tree API could handle big or
>>      deep documents. The general concept as applied in libraries like
>>      json-tree[9] is to navigate without doing any work, and that clashes
>>      with wanting to instanceof check the info at the current path.
>>
>> - It *almost* gives enough information to be a general schema approach
>>
>>     If one field fails, the model throws an exception immediately. If an
>>     API should return "errors": [...], that is inconvenient to construct.
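
For contrast, a sketch of what accumulating errors instead of failing fast could look like over a Map<String, Object> tree. All names here (ErrorAccumulation, DecodeFailed, decodeUser) are hypothetical, and this is a different shape from the throw-on-first-failure decoders above rather than an extension of them:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

final class ErrorAccumulation {
    record User(String id, long index) {}

    // Thrown once, after every field has been checked.
    static final class DecodeFailed extends RuntimeException {
        final List<String> errors;

        DecodeFailed(List<String> errors) {
            super(String.join("; ", errors));
            this.errors = List.copyOf(errors);
        }
    }

    static User decodeUser(Map<String, Object> jso) {
        List<String> errors = new ArrayList<>();
        String id = null;
        long index = 0;

        if (jso.get("id") instanceof String s) {
            id = s;
        } else {
            errors.add("id: expected a string, got " + jso.get("id"));
        }
        if (jso.get("index") instanceof Long n) {
            index = n;
        } else {
            errors.add("index: expected a number, got " + jso.get("index"));
        }

        if (!errors.isEmpty()) throw new DecodeFailed(errors);
        return new User(id, index);
    }

    public static void main(String[] args) {
        try {
            decodeUser(Map.of("id", 5L, "index", "x"));
        } catch (DecodeFailed e) {
            e.errors.forEach(System.out::println); // one line per failing field
        }
    }
}
```

The cost is visible: each field needs a placeholder value and an explicit else branch, which is the inconvenience mentioned above.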
>>
>> - None of the existing popular libraries are doing this
>>
>>      The only mechanism strictly required to provide this sort of API is
>>      lambdas. Those have been out for the better part of a decade. Yes,
>>      sealed interfaces make the data model prettier, but in concept you
>>      can build the same thing on top of anything.
>>
>>      I could argue that this is because of the "cultural momentum" of
>>      databind or some other reason, but the fact remains that it isn't a
>>      proven-out approach.
>>
>>      Writing Json libraries is a todo list[10]. There are a lot of bad
>>      ideas and this might be one of them.
>>
>> - Performance impact of so many instanceof checks
>>
>>     I've gotten a 4.2% slowdown compared to the "regular" tree code
>>     without the repeated casts.
>>
>>     But that was with a parser that is 5x slower than Jackson's (using
>>     the same benchmark project as for the snippets).
>>     I think there could be reason to believe that the JIT does well
>>     enough with repeated instanceof checks to consider it.
>>
>>
>> My current thinking is that - despite not solving for large or deep
>> documents - starting with a really "dumb" realized tree api
>> might be the right place to start for the read side of a potential
>> incubator module.
>>
>> But regardless - this feels like a good time to start more concrete
>> conversations. I feel I should cap this email since I've reached the point
>> of decoherence and haven't even mentioned the write side of things.
>>
>>
>>
>>
>> [1]: http://www.cowtowncoder.com/blog/archives/2009/01/entry_131.html
>> [2]: https://security.snyk.io/vuln/maven?search=jackson-databind
>> [3]: I only know like 8 people
>> [4]:
>> https://github.com/fabienrenaud/java-json-benchmark/blob/master/src/main/java/com/github/fabienrenaud/jjb/stream/UsersStreamDeserializer.java
>> [5]: When I say "intent", I do so knowing full well no one has been
>> actively thinking of this for an entire Game of Thrones
>> [6]:
>> https://github.com/yahoo/mysql_perf_analyzer/blob/master/myperf/src/main/java/com/yahoo/dba/perf/myperf/common/SNMPSettings.java
>> [7]: https://www.infoq.com/articles/data-oriented-programming-java/
>> [8]: https://package.elm-lang.org/packages/elm/json/latest/Json-Decode
>> [9]: https://github.com/jbee/json-tree
>> [10]: https://stackoverflow.com/a/14442630/2948173
>> [11]: In 30 days JEP-198 will be recognizably PI days old for the 2nd
>> time in its history.
>> [12]: To me, the fact that it is still an open JEP is more a social
>> convenience than anything. I could just as easily be writing this exact
>> same email about TOML.
>>
>