Proposal: UPSTREAM.md -- better tracking of upstream code in the JDK

Thu Apr 21 23:40:52 UTC 2022

On 4/21/2022 4:24 PM, Magnus Ihse Bursie wrote:
> On 2022-04-22 00:12, Kevin Rushforth wrote:
>>
>>
>> On 4/21/2022 2:05 PM, Magnus Ihse Bursie wrote:
>>> On 2022-04-21 22:19, Philip Race wrote:
>>>>
>>>> A "marker" file indicating something is 3rd party that may be 
>>>> updated from time to time seems fine
>>>> but upgrading 3rd party libraries is already a pain so I'm not sure 
>>>> how prescriptive I'd want to
>>>> be about required content beyond simple basics.
>>>>
>>>> In the client area we've started to add files called UPDATING.txt 
>>>> where we put the information
>>>> related to tasks when updating. Whilst some library might want to 
>>>> put that in an UPSTREAM.md
>>>> I'd want to have the option to just have one line saying "See 
>>>> UPDATING.txt for ..."
>>>
>>> As I said to Kevin, I'm basically viewing UPSTREAM.md as an 
>>> evolution of UPDATING.txt, so in effect you would update UPSTREAM.md 
>>> instead of UPDATING.txt, but in exactly the same way (and with 
>>> almost the very same content!). Of course, you can put "See 
>>> UPDATING.txt for ..." in UPSTREAM.md, but then you'd get two files 
>>> where one would do.
>>
>> Combining the instructions to update the various third-party 
>> libraries into a single monolithic file seems like the wrong approach 
>> to me. I think it makes much more sense to have the instructions live 
>> in the module in question with the third-party code being updated.
>
> I think we're just talking past each other here. I am not suggesting 
> that we have a *single* UPSTREAM.md file. I am suggesting that we have 
> one UPSTREAM.md file per third party library, placed exactly as you 
> say with the third party code.

Yes, I definitely thought you were talking about a single file to 
aggregate them all. Sorry for the misunderstanding!

>>>> I'm not sure we really need to include the current version in there.
>>>> Then we'd perhaps be able to avoid updating this file every time.
>>> If we want to keep track of the URL where we downloaded the actual 
>>> source release from, we will need to update the file anyway.
>>>
>>> If the cost of updating the file is too high, we can do with a more 
>>> "static" file that just serves as a marker for external code. That 
>>> would indeed solve the problems I was running into, that triggered 
>>> my thinking about this. And it is probably possible in most cases to 
>>> trace what version was included by finding the latest changed 
>>> files in the directory, and looking up the corresponding issue on JBS.
>>>
>>> But I still can't help thinking it would be good to have it stored 
>>> in the source code repo what version we actually included. I think 
>>> the cost of maintaining this would be low (compared to the other 
>>> work required when upgrading, updating two lines in a text file is 
>>> not really a big thing), and it would mean that the version 
>>> information will be "co-located" with the source code. You can check 
>>> out any commit whatsoever, and find out what versions of external 
>>> source code were included.
>>>
>>> As I said to Kevin, I think it would be a missed opportunity not to 
>>> track versions systematically.
>>
>> But we do track the version systematically -- in the xxx.md file for 
>> each third-party software. Updating that legal/xxx.md file is a 
>> requirement which doesn't go away if you store it in a second 
>> location. It just leads to duplication.
>
> Well, it's kind of semi-systematically, if you ask me. Here are some 
> excerpts:
>
> java.base/share/legal/icu.md:## International Components for Unicode 
> (ICU4J) v70.1
> java.base/share/legal/public_suffix.md:## Mozilla Public Suffix List
> java.base/share/legal/unicode.md:## The Unicode Standard, Unicode 
> Character Database, Version 14.0.0
>
> But sure, I get your point that this is already stored here. Let's 
> drop that part of my proposal. (And maybe we can try to be more 
> rigorous in the future on how we describe project name and version in 
> the legal .md files.)

OK.

-- Kevin

>
> /Magnus
>
>>
>> -- Kevin
>>
>>>> BTW the true "upstream location" is more usually a site to download 
>>>> foo-1.2.3.tar.gz .. not some  repo tag.
>>> I agree, a curl:able link to the source tar ball is probably better.
>>>
>>>> We even have some open source 3rd party code for which you won't 
>>>> find a repo anywhere.
>>>>
>>>> And I don't think it fair to call the locations of the upstream 
>>>> libraries "haphazard".
>>> That was not really directed at you. :-) The client native libraries 
>>> are very well organized, thank you very much!
>>>
>>>> They are in the places they need to be, in many cases partly 
>>>> determined by the build team,
>>>> within the necessities of the modular JDK.
>>>>
>>>> I'm curious what "possible for the build system to automatically 
>>>> disable warnings-as-errors for such code"
>>>> means in practice.
>>>
>>> I have no prototype code to show you, but it would not be too hard 
>>> to look for such a file, and to treat all files residing in 
>>> directories below an UPSTREAM.md file differently. For instance, 
>>> disable warnings-as-errors. Or disabling a broader set of warnings.
>>>
>>> For client native libraries in particular, it means that we could 
>>> set a high bar for warnings on code we write ourselves, but add 
>>> exceptions that disable warnings just for imported code. Even if we 
>>> mix "own" code with imported, in the same lib. And we would be able 
>>> to separate these files into two sets (imported and "our"), 
>>> automatically.
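To make the idea above concrete, here is a minimal sketch of how tooling might separate imported files from "our" files. It is written in Python purely for illustration; the JDK build system is make-based and no such code exists there. The function name `split_sources` and the suffix list are hypothetical. The only rule taken from the thread is that every file in or below a directory containing an UPSTREAM.md counts as imported.

```python
import os
from pathlib import Path

MARKER = "UPSTREAM.md"  # marker file name from the proposal

# Hypothetical default: which file suffixes count as source files.
DEFAULT_SUFFIXES = (".c", ".cc", ".cpp", ".h", ".hh", ".java")


def split_sources(root, suffixes=DEFAULT_SUFFIXES):
    """Return (imported, own) lists of source files under root.

    A file is "imported" if its directory, or any ancestor directory
    below root, contains a MARKER file. Everything else is "own".
    """
    imported, own = [], []
    marked_dirs = set()
    # os.walk is top-down by default, so a parent directory is always
    # visited before its children and the marking can be inherited.
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # deterministic traversal order
        here = Path(dirpath)
        if MARKER in filenames or here.parent in marked_dirs:
            marked_dirs.add(here)
        bucket = imported if here in marked_dirs else own
        for name in sorted(filenames):
            if name.endswith(suffixes):
                bucket.append(here / name)
    return imported, own
```

With two such lists, a build could apply a strict warning level to `own` and a relaxed one (or no warnings-as-errors) to `imported`, even when both kinds of files end up in the same library.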
>>>
>>>> Note that there are some cases where JDK "glue" code is co-mingled 
>>>> in the same directory,
>>>> so you'd have to refactor that if this were applied universally and 
>>>> always. 
>>>
>>> Yeah, I know. Many client libraries have glue code like that. But 
>>> most of them are already refactored to have imported code in a 
>>> separate directory. I can help with refactoring the remaining.
>>>
>>>> And perhaps we'd prefer to know about those warnings rather than 
>>>> just have them re-accumulate ..
>>> If we can separate this automatically, we can choose warning levels 
>>> for "our" code and imported code separately. So we could have like 
>>> "enable-warnings-for-imported-code", which can be on -- or off -- by 
>>> default. Or whatever. I think we have plenty of opportunity, as long 
>>> as there is a programmatic way to distinguish imported source code.
>>>
>>> /Magnus
>>>
>>>>
>>>> -phil.
>>>>
>>>> On 4/21/22 11:58 AM, Magnus Ihse Bursie wrote:
>>>>> The JDK project depends on many different open source projects. 
>>>>> Some of them are linked to as libraries at runtime, but others 
>>>>> have their source code directly incorporated into our source tree, 
>>>>> known as "3rd party code".
>>>>>
>>>>> Unfortunately, the haphazard way this code is sprinkled throughout 
>>>>> our code base makes it very hard to tell at a glance if some code 
>>>>> originated with the JDK project, or is imported from elsewhere 
>>>>> ("upstream"). Many times, you need to be well acquainted with 
>>>>> these parts of the code to know whether a file is 3rd party code 
>>>>> or not. If you do not know, you will need to rely on heuristics 
>>>>> such as looking at the path name, checking for unusual copyright 
>>>>> headers, or looking at the git history for commits that indicate a 
>>>>> refresh from upstream.
>>>>>
>>>>> I propose we do something about this situation.
>>>>>
>>>>> My suggestion is that we add a file, UPSTREAM.md, in the top 
>>>>> directory of the imported 3rd party code. These files will follow 
>>>>> a pattern, with a set of formalized headers on the top, a blank 
>>>>> line of separation, and then a free-form markdown text, with e.g. 
>>>>> relevant notes about the project, important information about the 
>>>>> latest update, or instructions or hints on how to update the 
>>>>> source to a newer version.
>>>>>
>>>>> Here are two examples of how this might look. (Note that the 
>>>>> free-form text here is just some offhand examples I invented. In 
>>>>> real life I assume they would be more detailed.)
>>>>>
>>>>> Example 1: src/java.xml.crypto/share/classes/com/sun/UPSTREAM.md:
>>>>> ===
>>>>> Name: Apache Santuario
>>>>> Homepage: https://santuario.apache.org/
>>>>> License: src/java.xml.crypto/share/legal/santuario.md
>>>>> Version: 2.2.1
>>>>> Upstream-release-URL: 
>>>>> https://github.com/apache/santuario-xml-security-java/releases/tag/xmlsec-2.2.1
>>>>>
>>>>> # Upgrade instructions
>>>>>
>>>>> To upgrade the package, copy the source code from 
>>>>> `src/main/java/org/apache` in the upstream git repo into 
>>>>> `src/java.xml.crypto/share/classes/com/sun/org/apache`. Then 
>>>>> update the package name space by running `find 
>>>>> src/java.xml.crypto/share/classes/com/sun/org/apache -type f 
>>>>> -name '*.java' | xargs sed -i -e 
>>>>> 's/^package org\.apache/package com.sun.org.apache/'`.
>>>>> ===
>>>>>
>>>>> Example 2: src/java.desktop/share/native/libharfbuzz/UPSTREAM.md:
>>>>> ===
>>>>> Name: Harfbuzz
>>>>> Homepage: https://harfbuzz.github.io/
>>>>> License: src/java.desktop/share/legal/harfbuzz.md
>>>>> Version: 2.8.0
>>>>> Upstream-release-URL: 
>>>>> https://github.com/harfbuzz/harfbuzz/releases/tag/2.8.0
>>>>>
>>>>> # How to update
>>>>>
>>>>> To update to a new version of Harfbuzz, copy all `.cc`, `.hh` and 
>>>>> `.h` files from `src` into 
>>>>> `src/java.desktop/share/native/libharfbuzz`. Check if the build 
>>>>> scripts in upstream have changed since the last version, and update 
>>>>> our makefiles accordingly.
>>>>> ===
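As an illustration of the layout described above (formalized headers, a blank line, then free-form markdown), a reader for such a file could be sketched as follows. This is an assumption-laden sketch, not a proposed format definition: the function name `parse_upstream` is hypothetical, and treating a line that does not look like `Key: value` as a continuation of the previous header is my own assumption, made only because the URL in the examples appears on its own line.

```python
import re

# A header line is "Key:" optionally followed by whitespace and a value.
# A bare URL such as "https://..." does not match, because the colon is
# not followed by whitespace or end of line.
HEADER = re.compile(r"^([A-Za-z][A-Za-z0-9-]*):(?:\s+(.*))?$")


def parse_upstream(text):
    """Split UPSTREAM.md text into (headers dict, markdown body)."""
    # Everything before the first blank line is the header block.
    head, _, body = text.partition("\n\n")
    headers = {}
    last = None
    for line in head.splitlines():
        line = line.strip()
        match = HEADER.match(line)
        if match:
            last = match.group(1)
            headers[last] = (match.group(2) or "").strip()
        elif last is not None and line:
            # Assumed rule: a non-header line continues the previous value.
            headers[last] = (headers[last] + " " + line).strip()
    return headers, body.strip()
```

Applied to the Harfbuzz example, this would yield a dictionary with the keys `Name`, `Homepage`, `License`, `Version` and `Upstream-release-URL`, plus the "How to update" text as the body.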
>>>>>
>>>>>
>>>>> These files will serve many purposes:
>>>>>
>>>>> 1) They will be a strong signal to developers coming to an 
>>>>> unfamiliar part of the code base that the files here originated 
>>>>> upstream.
>>>>>
>>>>> 2) It will be possible for tooling to understand that code in 
>>>>> these directories might not live up to normal JDK standards. It 
>>>>> would e.g. be possible for the build system to automatically 
>>>>> disable warnings-as-errors for such code, or for upcoming tools 
>>>>> that support code quality efforts such as blessed modifier order 
>>>>> or spell checks to skip those parts of the code.
>>>>>
>>>>> 3) It will be possible to get an at-a-glance overview of what 
>>>>> versions of 3rd party code are included in a build of the JDK, for 
>>>>> all included projects -- not just as of right now, but at any 
>>>>> point in history (since these files get updated when upstream 
>>>>> code is updated in the JDK). The build system could, for instance, 
>>>>> collect such information and provide it with the built JDK, just 
>>>>> as it now collects the licenses from the src/$MODULE/legal 
>>>>> directories.
>>>>>
>>>>> 4) The git history for these files will clearly show when the code 
>>>>> was last refreshed from upstream, and by whom.
>>>>>
>>>>> 5) And finally, the free-text part gives a well-defined place to 
>>>>> store important information about how to upgrade, common mistakes, 
>>>>> etc -- knowledge that right now sometimes is put down into README 
>>>>> files, but most often just resides in the head of the developer 
>>>>> who last did a refresh.
>>>>>
>>>>> Thoughts?
>>>>>
>>>>> /Magnus
>>>>
>>>
>>
>