JEP-198 - Let's start talking about JSON
Ethan McCue
ethan at mccue.dev
Thu Dec 15 20:30:17 UTC 2022
I'm writing this to drive some forward motion and to nerd-snipe those who
know better than I do into putting their thoughts into words.
There are three ways to process JSON[1]
- Streaming (Push or Pull)
- Traversing a Tree (Realized or Lazy)
- Declarative Databind (N ways)
Of these, JEP-198 explicitly ruled out providing "JAXB style type safe data
binding."
No justification is given, but if I had to insert my own: mapping the Json
model to/from the Java/JVM object model is a cursed combo of
- Huge possible design space
- Unpalatably large surface for backwards compatibility
- Serialization! Boo![2]
So for an artifact like the JDK, it probably doesn't make sense to include.
That tracks.
It won't make everyone happy, people like databind APIs, but it tracks.
So for the "read flow" these are the things to figure out.
                | Should Provide? | Intended User(s) |
----------------+-----------------+------------------+
Streaming Push  |                 |                  |
----------------+-----------------+------------------+
Streaming Pull  |                 |                  |
----------------+-----------------+------------------+
Realized Tree   |                 |                  |
----------------+-----------------+------------------+
Lazy Tree       |                 |                  |
----------------+-----------------+------------------+
At which point, we should talk about what "meets needs of Java developers
using JSON" implies.
JSON is ubiquitous. Most kinds of software us schmucks write could have a
reason to interact with it.
The full set of "user personas" therefore isn't practical for me to talk
about.[3]
JSON documents, however, are not so varied.
- There are small ones (1-10kb)
- There are medium ones (10-1000kb)
- There are big ones (1000kb-???)
- There are shallow ones
- There are deep ones
So that feels like an easier direction to talk about it from.
This repo[4] has some convenient toy examples of how some of those APIs
look in libraries in the ecosystem, specifically the Streaming Pull and
Realized Tree models.
    User r = new User();
    while (true) {
        JsonToken token = reader.peek();
        switch (token) {
            case BEGIN_OBJECT:
                reader.beginObject();
                break;
            case END_OBJECT:
                reader.endObject();
                return r;
            case NAME:
                String fieldname = reader.nextName();
                switch (fieldname) {
                    case "id":
                        r.setId(reader.nextString());
                        break;
                    case "index":
                        r.setIndex(reader.nextInt());
                        break;
                    // ...
                    case "friends":
                        r.setFriends(new ArrayList<>());
                        Friend f = null;
                        boolean carryOn = true;
                        while (carryOn) {
                            token = reader.peek();
                            switch (token) {
                                case BEGIN_ARRAY:
                                    reader.beginArray();
                                    break;
                                case END_ARRAY:
                                    reader.endArray();
                                    carryOn = false;
                                    break;
                                case BEGIN_OBJECT:
                                    reader.beginObject();
                                    f = new Friend();
                                    break;
                                case END_OBJECT:
                                    reader.endObject();
                                    r.getFriends().add(f);
                                    break;
                                case NAME:
                                    String fn = reader.nextName();
                                    switch (fn) {
                                        case "id":
                                            f.setId(reader.nextString());
                                            break;
                                        case "name":
                                            f.setName(reader.nextString());
                                            break;
                                    }
                                    break;
                            }
                        }
                        break;
                }
        }
    }
I think it's not hard to argue that the streaming APIs are brutalist. The
above is Gson, but Jackson, Moshi, etc. seem at least morally equivalent.
It's hard to write, hard to write *correctly*, and there is a curious
propensity towards pairing it with anemic, mutable models.
That being said, it handles big documents and deep documents really well.
It also performs pretty darn well and is good enough as a "fallback" when
the intended user experience is through something like databind.
So what could we do meaningfully better with the language we have
today/will have tomorrow?
- Sealed interfaces + Pattern matching could give a nicer model for tokens
    sealed interface JsonToken {
        record Field(String name) implements JsonToken {}
        record BeginArray() implements JsonToken {}
        record EndArray() implements JsonToken {}
        record BeginObject() implements JsonToken {}
        record EndObject() implements JsonToken {}
        // ...
    }

    // ...

    User r = new User();
    while (true) {
        JsonToken token = reader.peek();
        switch (token) {
            case BeginObject __:
                reader.beginObject();
                break;
            case EndObject __:
                reader.endObject();
                return r;
            case Field("id"):
                r.setId(reader.nextString());
                break;
            case Field("index"):
                r.setIndex(reader.nextInt());
                break;
            // ...
            case Field("friends"):
                r.setFriends(new ArrayList<>());
                Friend f = null;
                boolean carryOn = true;
                while (carryOn) {
                    token = reader.peek();
                    switch (token) {
                        // ...
- Value classes can make it all more efficient
    sealed interface JsonToken {
        value record Field(String name) implements JsonToken {}
        value record BeginArray() implements JsonToken {}
        value record EndArray() implements JsonToken {}
        value record BeginObject() implements JsonToken {}
        value record EndObject() implements JsonToken {}
        // ...
    }
- (Fun One) We can transform a simpler-to-write push parser into a pull
  parser with Coroutines
  This is just a toy we could play with while making something in the JDK.
  I'm pretty sure we could make a parser which feeds into something like
    interface Listener {
        void onObjectStart();
        void onObjectEnd();
        void onArrayStart();
        void onArrayEnd();
        void onField(String name);
        // ...
    }
  and invert a loop like
    while (true) {
        char c = next();
        switch (c) {
            case '{':
                listener.onObjectStart();
                // ...
            // ...
        }
    }
  by putting a Coroutine.yield in the callback.
  That might be a meaningful simplification in code structure, I don't
  know enough to say.
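
The sealed-token idea from the first bullet can be approximated with what
shipped in Java 21 (record patterns plus `when` guards), since constant
patterns like `Field("id")` are not valid Java. This is only a sketch under
assumptions: the token records, `TokenSketch`, and `readUser` are made-up
names for illustration, and the token stream is a plain `Iterator` rather
than a real parser.

```java
import java.util.Iterator;
import java.util.List;

// Hypothetical token model for illustration; not an existing library API.
sealed interface JsonToken {
    record BeginObject() implements JsonToken {}
    record EndObject() implements JsonToken {}
    record Field(String name) implements JsonToken {}
    record StringValue(String value) implements JsonToken {}
    record NumberValue(long value) implements JsonToken {}
}

final class TokenSketch {
    record User(String id, long index) {}

    // Reads one flat object such as {"id": "abc", "index": 7} from a token stream.
    static User readUser(Iterator<JsonToken> tokens) {
        String id = null;
        long index = 0;
        while (tokens.hasNext()) {
            // No default branch: the compiler checks that every token kind is handled.
            switch (tokens.next()) {
                case JsonToken.BeginObject begin -> { /* nothing to record */ }
                case JsonToken.EndObject end -> { return new User(id, index); }
                case JsonToken.Field(String name) when name.equals("id") -> {
                    if (!(tokens.next() instanceof JsonToken.StringValue(String v))) {
                        throw new IllegalStateException("expected a string for id");
                    }
                    id = v;
                }
                case JsonToken.Field(String name) when name.equals("index") -> {
                    if (!(tokens.next() instanceof JsonToken.NumberValue(long v))) {
                        throw new IllegalStateException("expected a number for index");
                    }
                    index = v;
                }
                // Unknown field: skip its (scalar) value.
                case JsonToken.Field(String name) -> tokens.next();
                case JsonToken.StringValue s ->
                    throw new IllegalStateException("value without a field name");
                case JsonToken.NumberValue n ->
                    throw new IllegalStateException("value without a field name");
            }
        }
        throw new IllegalStateException("unexpected end of input");
    }

    public static void main(String[] args) {
        List<JsonToken> tokens = List.of(
            new JsonToken.BeginObject(),
            new JsonToken.Field("id"), new JsonToken.StringValue("abc"),
            new JsonToken.Field("index"), new JsonToken.NumberValue(7),
            new JsonToken.EndObject());
        System.out.println(readUser(tokens.iterator())); // prints User[id=abc, index=7]
    }
}
```

Compared to the enum-based loop, the field name travels inside the token,
so there is no separate `nextName()` call to forget.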
But, I think there are some hard questions like
- Is the intent[5] to be a backing parser for ecosystem databind APIs?
- Is the intent that users who want to handle big/deep documents fall back
to this?
- Are those new language features / conveniences enough to offset the cost
of committing to a new api?
- To whom exactly does a low level api provide value?
- What benefit is standardization in the JDK?
and just generally - who would be the consumer(s) of this?
The other kind of API still on the table is a Tree. There are two ways to
handle this
1. Load it into `Object`. Use a bunch of instanceof checks/casts to confirm
   what it actually is.
    Object v;
    User u = new User();
    if ((v = jso.get("id")) != null) {
        u.setId((String) v);
    }
    if ((v = jso.get("index")) != null) {
        u.setIndex(((Long) v).intValue());
    }
    if ((v = jso.get("guid")) != null) {
        u.setGuid((String) v);
    }
    if ((v = jso.get("isActive")) != null) {
        u.setIsActive(((Boolean) v));
    }
    if ((v = jso.get("balance")) != null) {
        u.setBalance((String) v);
    }
    // ...
    if ((v = jso.get("latitude")) != null) {
        u.setLatitude(v instanceof BigDecimal
                ? ((BigDecimal) v).doubleValue() : (Double) v);
    }
    if ((v = jso.get("longitude")) != null) {
        u.setLongitude(v instanceof BigDecimal
                ? ((BigDecimal) v).doubleValue() : (Double) v);
    }
    if ((v = jso.get("greeting")) != null) {
        u.setGreeting((String) v);
    }
    if ((v = jso.get("favoriteFruit")) != null) {
        u.setFavoriteFruit((String) v);
    }
    if ((v = jso.get("tags")) != null) {
        List<Object> jsonarr = (List<Object>) v;
        u.setTags(new ArrayList<>());
        for (Object vi : jsonarr) {
            u.getTags().add((String) vi);
        }
    }
    if ((v = jso.get("friends")) != null) {
        List<Object> jsonarr = (List<Object>) v;
        u.setFriends(new ArrayList<>());
        for (Object vi : jsonarr) {
            Map<String, Object> jso0 = (Map<String, Object>) vi;
            Friend f = new Friend();
            f.setId((String) jso0.get("id"));
            f.setName((String) jso0.get("name"));
            u.getFriends().add(f);
        }
    }
2. Have an explicit model for Json, and helper methods that do said casts[6]
    this.setSiteSetting(readFromJson(jsonObject.getJsonObject("site")));
    JsonArray groups = jsonObject.getJsonArray("group");
    if(groups != null)
    {
        int len = groups.size();
        for(int i=0; i<len; i++)
        {
            JsonObject grp = groups.getJsonObject(i);
            SNMPSetting grpSetting = readFromJson(grp);
            String grpName = grp.getString("dbgroup", null);
            if(grpName != null && grpSetting != null)
                this.groupSettings.put(grpName, grpSetting);
        }
    }
    JsonArray hosts = jsonObject.getJsonArray("host");
    if(hosts != null)
    {
        int len = hosts.size();
        for(int i=0; i<len; i++)
        {
            JsonObject host = hosts.getJsonObject(i);
            SNMPSetting hostSetting = readFromJson(host);
            String hostName = host.getString("dbhost", null);
            if(hostName != null && hostSetting != null)
                this.hostSettings.put(hostName, hostSetting);
        }
    }
I think what has become easier to represent in the language nowadays is
that explicit model for Json.
It's the 101 lesson of sealed interfaces.[7] It feels nice and clean.
    sealed interface Json {
        final class Null implements Json {}
        final class True implements Json {}
        final class False implements Json {}
        final class Array implements Json {}
        final class Object implements Json {}
        final class String implements Json {}
        final class Number implements Json {}
    }
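
As a rough illustration of why the sealed model feels clean, here is a
sketch where each variant carries its payload and a `switch` over the
model needs no `default` branch. The variant names (`Str`, `Num`, `Arr`,
`Obj`, `Bool`) are assumptions chosen to avoid shadowing `java.lang.String`
and `java.lang.Object`, `depth` is a made-up helper, and the exhaustive
pattern `switch` assumes Java 21 or later.

```java
import java.util.List;
import java.util.Map;

// Hypothetical realized-tree model; names are placeholders, not a proposed API.
sealed interface Json {
    record Null() implements Json {}
    record Bool(boolean value) implements Json {}
    record Num(double value) implements Json {}
    record Str(String value) implements Json {}
    record Arr(List<Json> items) implements Json {}
    record Obj(Map<String, Json> fields) implements Json {}
}

final class JsonModelSketch {
    // Nesting depth: scalars are 0, each enclosing array or object adds one level.
    static int depth(Json json) {
        return switch (json) {
            case Json.Arr arr ->
                1 + arr.items().stream().mapToInt(JsonModelSketch::depth).max().orElse(0);
            case Json.Obj obj ->
                1 + obj.fields().values().stream().mapToInt(JsonModelSketch::depth).max().orElse(0);
            case Json.Null nul -> 0;
            case Json.Bool bool -> 0;
            case Json.Num num -> 0;
            case Json.Str str -> 0;
        };
    }

    public static void main(String[] args) {
        Json doc = new Json.Obj(Map.of(
            "friends", new Json.Arr(List.of(
                new Json.Obj(Map.of("name", new Json.Str("a")))))));
        System.out.println(depth(doc)); // prints 3
    }
}
```

If a new variant were added to `Json`, `depth` would stop compiling until
it handled the new case, which is the property the enum-plus-cast style
cannot offer.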
And the cast-and-check approach is now more viable on account of pattern
matching.
    if (jso.get("id") instanceof String v) {
        u.setId(v);
    }
    if (jso.get("index") instanceof Long v) {
        u.setIndex(v.intValue());
    }
    if (jso.get("guid") instanceof String v) {
        u.setGuid(v);
    }

    // or

    if (jso.get("id") instanceof String id &&
            jso.get("index") instanceof Long index &&
            jso.get("guid") instanceof String guid) {
        return new User(id, index, guid, ...); // look ma, no setters!
    }
And on the horizon, again, are value types.
But there are problems with this approach beyond the performance
implications of loading into a tree.
For one, all the code samples above have different behaviors around null
keys and missing keys that are not obvious at first glance.
This won't accept any null or missing fields
    if (jso.get("id") instanceof String id &&
            jso.get("index") instanceof Long index &&
            jso.get("guid") instanceof String guid) {
        return new User(id, index, guid, ...);
    }
This will accept individual null or missing fields, but will also silently
ignore fields with incorrect types
    if (jso.get("id") instanceof String v) {
        u.setId(v);
    }
    if (jso.get("index") instanceof Long v) {
        u.setIndex(v.intValue());
    }
    if (jso.get("guid") instanceof String v) {
        u.setGuid(v);
    }
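
To make the difference concrete, here is a small runnable sketch of the two
styles against a plain `Map<String, Object>`. The `User` record and the
`strict`/`lenient` method names are assumptions for illustration. The point
it shows: `instanceof` is false for a missing key, an explicit null, and a
wrong type alike, so neither style can tell the three apart.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

final class NullVersusMissing {
    record User(String id, Long index) {}

    // All-or-nothing: a missing key, an explicit null, or a wrong type all yield empty.
    static Optional<User> strict(Map<String, Object> jso) {
        if (jso.get("id") instanceof String id &&
                jso.get("index") instanceof Long index) {
            return Optional.of(new User(id, index));
        }
        return Optional.empty();
    }

    // Field-by-field: anything that fails the check is skipped without complaint.
    static User lenient(Map<String, Object> jso) {
        String id = null;
        Long index = null;
        if (jso.get("id") instanceof String v) {
            id = v;
        }
        if (jso.get("index") instanceof Long v) {
            index = v;
        }
        return new User(id, index);
    }

    public static void main(String[] args) {
        Map<String, Object> wrongType = new HashMap<>();
        wrongType.put("id", "abc");
        wrongType.put("index", "7"); // a string where a number was expected

        System.out.println(strict(wrongType));  // prints Optional.empty
        System.out.println(lenient(wrongType)); // prints User[id=abc, index=null]
    }
}
```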
And, compared to databind, where there is information about the expected
structure of the document and it's the job of the framework to assert that,
I posit that the errors that would be encountered when writing code against
this would be more like
    "something wrong with user"
than
    "problem at users[5].name, expected string or null. got 5"
Which feels unideal.
One approach I find promising is something close to what Elm does with its
decoders[8]: not just combining assertion and binding like what pattern
matching with records allows, but including a scheme for bubbling/nesting
errors.
    static String string(Json json) throws JsonDecodingException {
        if (!(json instanceof Json.String jsonString)) {
            throw JsonDecodingException.of(
                "expected a string",
                json
            );
        } else {
            return jsonString.value();
        }
    }

    static <T> T field(Json json, String fieldName,
                       Decoder<? extends T> valueDecoder)
            throws JsonDecodingException {
        var jsonObject = object(json);
        var value = jsonObject.get(fieldName);
        if (value == null) {
            throw JsonDecodingException.atField(
                fieldName,
                JsonDecodingException.of(
                    "no value for field",
                    json
                )
            );
        }
        else {
            try {
                return valueDecoder.decode(value);
            } catch (JsonDecodingException e) {
                throw JsonDecodingException.atField(
                    fieldName,
                    e
                );
            } catch (Exception e) {
                throw JsonDecodingException.atField(fieldName,
                    JsonDecodingException.of(e, value));
            }
        }
    }
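
The snippet above leaves out the pieces that make the errors nest. A
self-contained sketch of one way they could fit together follows. Everything
here is an assumption for illustration: the trimmed-down `Json` model, the
shape of `JsonDecodingException` (unchecked here for brevity, where the
version above is checked), and the `userNames` example. It only needs
sealed types and `instanceof` patterns, so Java 17 should be enough.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical minimal model; the names are placeholders, not an existing API.
sealed interface Json {
    record Str(String value) implements Json {}
    record Num(long value) implements Json {}
    record Arr(List<Json> items) implements Json {}
    record Obj(Map<String, Json> fields) implements Json {}
}

// Unchecked for brevity. The path grows outward as the exception bubbles up.
final class JsonDecodingException extends RuntimeException {
    final String path;    // e.g. "" -> ".name" -> "[1].name" -> ".users[1].name"
    final String problem;

    JsonDecodingException(String path, String problem) {
        super("problem at " + (path.isEmpty() ? "<root>" : path.replaceFirst("^\\.", ""))
                + ": " + problem);
        this.path = path;
        this.problem = problem;
    }

    JsonDecodingException atField(String name) {
        return new JsonDecodingException("." + name + path, problem);
    }

    JsonDecodingException atIndex(int index) {
        return new JsonDecodingException("[" + index + "]" + path, problem);
    }
}

interface Decoder<T> {
    T decode(Json json);

    static String string(Json json) {
        if (json instanceof Json.Str s) return s.value();
        throw new JsonDecodingException("", "expected a string, got " + json);
    }

    static <T> T field(Json json, String name, Decoder<? extends T> valueDecoder) {
        if (!(json instanceof Json.Obj obj)) {
            throw new JsonDecodingException("", "expected an object, got " + json);
        }
        Json value = obj.fields().get(name);
        if (value == null) {
            throw new JsonDecodingException("." + name, "no value for field");
        }
        try {
            return valueDecoder.decode(value);
        } catch (JsonDecodingException e) {
            throw e.atField(name); // prepend this field to the inner path
        }
    }

    static <T> List<T> array(Json json, Decoder<? extends T> itemDecoder) {
        if (!(json instanceof Json.Arr arr)) {
            throw new JsonDecodingException("", "expected an array, got " + json);
        }
        List<T> out = new ArrayList<>();
        for (int i = 0; i < arr.items().size(); i++) {
            try {
                out.add(itemDecoder.decode(arr.items().get(i)));
            } catch (JsonDecodingException e) {
                throw e.atIndex(i); // prepend this index to the inner path
            }
        }
        return out;
    }
}

final class DecoderSketch {
    // Decodes {"users": [{"name": "..."}, ...]} into the list of names.
    static List<String> userNames(Json doc) {
        Decoder<String> name = user -> Decoder.field(user, "name", Decoder::string);
        Decoder<List<String>> names = users -> Decoder.array(users, name);
        return Decoder.field(doc, "users", names);
    }

    public static void main(String[] args) {
        Json bad = new Json.Obj(Map.of("users", new Json.Arr(List.of(
            new Json.Obj(Map.of("name", new Json.Str("ann"))),
            new Json.Obj(Map.of("name", new Json.Num(5)))))));
        try {
            userNames(bad);
        } catch (JsonDecodingException e) {
            System.out.println(e.getMessage());
            // prints: problem at users[1].name: expected a string, got Num[value=5]
        }
    }
}
```

The calling code never mentions paths. Each combinator adds its own segment
on the way out, which is how the message ends up pointing at `users[1].name`
rather than at "something wrong with user".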
Which I think has some benefits over the ways I've seen of working with
trees.
- It is declarative enough that folks who prefer databind might be happy
  enough.
    static User fromJson(Json json) {
        return new User(
            Decoder.field(json, "id", Decoder::string),
            Decoder.field(json, "index", Decoder::long_),
            Decoder.field(json, "guid", Decoder::string)
        );
    }

    // ...

    List<User> users = Decoder.array(json, User::fromJson);
- Handling null and optional fields could be less easily conflated
    Decoder.field(json, "id", Decoder::string);
    Decoder.nullableField(json, "id", Decoder::string);
    Decoder.optionalField(json, "id", Decoder::string);
    Decoder.optionalNullableField(json, "id", Decoder::string);
- It composes well with user defined classes
    record Guid(String value) {
        Guid {
            // some assertions on the structure of value
        }
    }

    Decoder.field(json, "guid", guid -> new Guid(Decoder.string(guid)));

    // or even

    record Guid(String value) {
        Guid {
            // some assertions on the structure of value
        }

        static Guid fromJson(Json json) {
            return new Guid(Decoder.string(json));
        }
    }

    Decoder.field(json, "guid", Guid::fromJson);
- When something goes wrong, the API can handle the fiddliness of capturing
  information for feedback.
  In the code I've sketched out it's just what field/index things went
  wrong at. Potentially capturing metadata like row/col numbers of the
  source would be sensible too.
  It's just not reasonable to expect devs to do extra work to get that and
  it's really nice to give it.
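
One possible reading of the null/optional variants listed above, as a
runnable sketch. The semantics are an assumption, since the email only
names the methods: `field` requires a present, non-null value;
`nullableField` requires the key but maps JSON null to Java null;
`optionalField` tolerates a missing key but not an explicit null. The
fourth variant is left out because its return shape is not specified.
`FieldVariants` and the trimmed `Json` model are made-up names.

```java
import java.util.Map;
import java.util.Optional;
import java.util.function.Function;

// Hypothetical minimal model with an explicit Null variant.
sealed interface Json {
    record Null() implements Json {}
    record Str(String value) implements Json {}
    record Obj(Map<String, Json> fields) implements Json {}
}

final class FieldVariants {
    static String string(Json json) {
        if (json instanceof Json.Str s) return s.value();
        throw new IllegalArgumentException("expected a string, got " + json);
    }

    // Key must be present; a JSON null reaches the decoder, which rejects it.
    static <T> T field(Json.Obj obj, String name, Function<Json, T> decoder) {
        Json value = obj.fields().get(name);
        if (value == null) {
            throw new IllegalArgumentException("no value for field " + name);
        }
        return decoder.apply(value);
    }

    // Key must be present; a JSON null becomes a Java null.
    static <T> T nullableField(Json.Obj obj, String name, Function<Json, T> decoder) {
        Json value = obj.fields().get(name);
        if (value == null) {
            throw new IllegalArgumentException("no value for field " + name);
        }
        return value instanceof Json.Null ? null : decoder.apply(value);
    }

    // Key may be absent; if present, a JSON null is still rejected by the decoder.
    static <T> Optional<T> optionalField(Json.Obj obj, String name, Function<Json, T> decoder) {
        Json value = obj.fields().get(name);
        return value == null ? Optional.empty() : Optional.of(decoder.apply(value));
    }

    public static void main(String[] args) {
        Json.Obj obj = new Json.Obj(Map.of(
            "id", new Json.Str("abc"),
            "nickname", new Json.Null()));
        System.out.println(field(obj, "id", FieldVariants::string));               // prints abc
        System.out.println(nullableField(obj, "nickname", FieldVariants::string)); // prints null
        System.out.println(optionalField(obj, "age", FieldVariants::string));      // prints Optional.empty
    }
}
```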
There are also some downsides like
- I do not know how compatible it would be with lazy trees.
  Lazy trees being the only way that a tree API could handle big or deep
  documents.
  The general concept as applied in libraries like json-tree[9] is to
  navigate without doing any work, and that clashes with wanting to
  instanceof check the info at the current path.
- It *almost* gives enough information to be a general schema approach
  If one field fails, the model throws an exception immediately. If an
  API should return "errors": [...], that is inconvenient to construct.
- None of the existing popular libraries are doing this
  The only mechanic strictly required to give this sort of API is lambdas.
  Those have been out for a decade. Yes, sealed interfaces make the data
  model prettier, but in concept you can build the same thing on top of
  anything.
  I could argue that this is because of the "cultural momentum" of
  databind or some other reason, but the fact remains that it isn't a
  proven-out approach.
  Writing Json libraries is a todo list[10]. There are a lot of bad ideas
  and this might be one of them.
- Performance impact of so many instanceof checks
  I've gotten a 4.2% slowdown compared to the "regular" tree code without
  the repeated casts.
  But that was with a parser that is 5x slower than Jackson's (using the
  same benchmark project as for the snippets).
  I think there could be reason to believe that the JIT does well enough
  with repeated instanceof checks to consider it.
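
On the fail-fast downside above: one workaround is to catch per field and
collect messages instead of letting the first exception escape. The sketch
below shows the idea and also why it is inconvenient, since every field
needs a nullable placeholder until all of them have been tried.
`ErrorCollector`, `tryField`, and the tiny decoders are made-up names, and
the input is a plain `Map<String, Object>` to keep it short.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

final class ErrorCollector {
    private final List<String> errors = new ArrayList<>();

    // Runs one field decoder; on failure records "<field>: <message>" and returns null.
    <T> T tryField(Map<String, Object> jso, String name, Function<Object, T> decoder) {
        try {
            return decoder.apply(jso.get(name));
        } catch (RuntimeException e) {
            errors.add(name + ": " + e.getMessage());
            return null;
        }
    }

    List<String> errors() {
        return List.copyOf(errors);
    }

    static String string(Object v) {
        if (v instanceof String s) return s;
        throw new IllegalArgumentException("expected a string, got " + v);
    }

    static Long integer(Object v) {
        if (v instanceof Long n) return n;
        throw new IllegalArgumentException("expected an integer, got " + v);
    }

    public static void main(String[] args) {
        Map<String, Object> jso = Map.of("id", 5, "index", "x");
        ErrorCollector collector = new ErrorCollector();
        String id = collector.tryField(jso, "id", ErrorCollector::string);
        Long index = collector.tryField(jso, "index", ErrorCollector::integer);
        // Only build the real object if nothing went wrong; otherwise report everything.
        System.out.println(collector.errors());
        // prints [id: expected a string, got 5, index: expected an integer, got x]
    }
}
```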
My current thinking is that - despite not solving for large or deep
documents - starting with a really "dumb" realized tree API might be the
right place to start for the read side of a potential incubator module.
But regardless - this feels like a good time to start more concrete
conversations. I feel I should cap this email since I've reached the point
of decoherence and haven't even mentioned the write side of things.
[1]: http://www.cowtowncoder.com/blog/archives/2009/01/entry_131.html
[2]: https://security.snyk.io/vuln/maven?search=jackson-databind
[3]: I only know like 8 people
[4]: https://github.com/fabienrenaud/java-json-benchmark/blob/master/src/main/java/com/github/fabienrenaud/jjb/stream/UsersStreamDeserializer.java
[5]: When I say "intent", I do so knowing full well no one has been
actively thinking of this for an entire Game of Thrones
[6]: https://github.com/yahoo/mysql_perf_analyzer/blob/master/myperf/src/main/java/com/yahoo/dba/perf/myperf/common/SNMPSettings.java
[7]: https://www.infoq.com/articles/data-oriented-programming-java/
[8]: https://package.elm-lang.org/packages/elm/json/latest/Json-Decode
[9]: https://github.com/jbee/json-tree
[10]: https://stackoverflow.com/a/14442630/2948173
[11]: In 30 days JEP-198 will be recognizably PI days old for the 2nd
time in its history.
[12]: To me, the fact that it is still an open JEP is more a social
convenience than anything. I could just as easily be writing this exact
same email about TOML.