Proposal: UPSTREAM.md -- better tracking of upstream code in the JDK

Thu Apr 21 22:12:16 UTC 2022

On 4/21/2022 2:05 PM, Magnus Ihse Bursie wrote:
> On 2022-04-21 22:19, Philip Race wrote:
>>
>> A "marker" file indicating something is 3rd party that may be updated 
>> from time to time seems fine
>> but upgrading 3rd party libraries is already a pain so I'm not sure 
>> how prescriptive I'd want to
>> be about required content beyond simple basics.
>>
>> In the client area we've started to add files called UPDATING.txt 
>> where we put the information
>> related to tasks when updating. Whilst some library might want to put 
>> that in an UPSTREAM.md
>> I'd want to have the option to just have one line saying "See 
>> UPDATING.txt for ..."
>
> As I said to Kevin, I'm basically viewing UPSTREAM.md as an evolution 
> of UPDATING.txt, so in effect you would update UPSTREAM.md instead of 
> UPDATING.txt, but in exactly the same way (and with almost the very 
> same content!). Of course, you can put "See UPDATING.txt for ..." in 
> UPSTREAM.md, but then you'd get two files where one would do.

Combining the instructions to update the various third-party libraries 
into a single monolithic file seems like the wrong approach to me. I 
think it makes much more sense to have the instructions live in the 
module that contains the third-party code being updated.

>> I'm not sure we really need to include the current version in there.
>> Then we'd perhaps be able to avoid updating this file every time.
> If we want to keep track of the URL where we downloaded the actual 
> source release from, we will need to update the file anyway.
>
> If the cost of updating the file is too high, we can do with a more 
> "static" file that just serves as a marker for external code. That 
> would indeed solve the problems I was running into, that triggered my 
> thinking about this. And it is probably possible in most cases to 
> trace what version was included by finding the latest changed files 
> in the directory, and looking up the corresponding issue on JBS.
>
> But I still can't help thinking it would be good to have it stored in 
> the source code repo what version we actually included. I think the 
> cost of maintaining this would be low (compared to the other work 
> required when upgrading, updating two lines in a text file is not 
> really a big thing), and it would mean that the version information 
> will be "co-located" with the source code. You can check out any 
> commit whatsoever, and find out what versions of external source code 
> were included.
>
> As I said to Kevin, I think it would be a missed opportunity not to 
> track versions systematically.

But we do track the version systematically -- in the legal/xxx.md file for 
each piece of third-party software. Updating that legal/xxx.md file is a 
requirement that doesn't go away if you also store the version in a second 
location. It just leads to duplication.

-- Kevin

>> BTW the true "upstream location" is more usually a site to download 
>> foo-1.2.3.tar.gz .. not some repo tag.
> I agree, a curl-able link to the source tarball is probably better.
>
>> We even have some open source 3rd party code for which you won't find 
>> a repo anywhere.
>>
>> And I don't think it fair to call the locations of the upstream 
>> libraries "haphazard".
> That was not really directed at you. :-) The client native libraries 
> are very well organized, thank you very much!
>
>> They are in the places they need to be, in many cases partly 
>> determined by the build team,
>> within the necessities of the modular JDK.
>>
>> I'm curious what "possible for the build system to automatically 
>> disable warnings-as-errors for such code"
>> means in practice.
>
> I have no prototype code to show you, but it would not be too hard to 
> look for such a file, and to treat all files residing in directories 
> below an UPSTREAM.md file differently. For instance, we could disable 
> warnings-as-errors, or disable a broader set of warnings.
>
> For client native libraries in particular, it means that we could set 
> a high bar for warnings on code we write ourselves, but add exceptions 
> that disable warnings just for imported code. Even if we mix "own" 
> code with imported, in the same lib. And we would be able to separate 
> these files into two sets (imported and "our"), automatically.
>
>> Note that there are some cases where JDK "glue" code is co-mingled in 
>> the same directory,
>> so you'd have to refactor that if this were applied universally and 
>> always. 
>
> Yeah, I know. Many client libraries have glue code like that. But most 
> of them are already refactored to have imported code in a separate 
> directory. I can help with refactoring the remaining ones.
>
>> And perhaps we'd prefer to know about those warnings rather than just 
>> have them re-accumulate ..
> If we can separate this automatically, we can choose warning levels for 
> "our" code and imported code separately. So we could have like 
> "enable-warnings-for-imported-code", which can be on -- or off -- by 
> default. Or whatever. I think we have plenty of opportunity, as long 
> as there is a programmatic way to distinguish imported source code.
>
> /Magnus
>
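A minimal sketch of the programmatic separation discussed above, assuming the only rule is "a file is imported if its directory, or any parent directory, contains an UPSTREAM.md". The function names are hypothetical and nothing like this exists in the JDK build system today; it only illustrates that the two sets can be derived mechanically.

```python
import os
from pathlib import Path

MARKER = "UPSTREAM.md"  # marker file name taken from the proposal


def find_upstream_roots(src_root):
    """Return every directory under src_root that contains a marker file."""
    return {Path(d) for d, _dirs, files in os.walk(src_root) if MARKER in files}


def partition_sources(src_root):
    """Split all files under src_root into (imported, own) lists."""
    roots = find_upstream_roots(src_root)
    imported, own = [], []
    for dirpath, _dirs, files in os.walk(src_root):
        directory = Path(dirpath)
        # A file is imported if its directory, or any ancestor, holds a marker.
        is_imported = directory in roots or any(p in roots for p in directory.parents)
        for name in files:
            if name == MARKER:
                continue  # the marker itself is neither kind of source
            (imported if is_imported else own).append(directory / name)
    return sorted(imported), sorted(own)
```

A build step could then apply one set of warning flags to the first list and a stricter set to the second. How that would map onto the actual makefiles is left open here.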
>>
>> -phil.
>>
>> On 4/21/22 11:58 AM, Magnus Ihse Bursie wrote:
>>> The JDK project depends on many different open source projects. Some 
>>> of them are linked to as libraries at runtime, but others have their 
>>> source code directly incorporated into our source tree, known as 
>>> "3rd party code".
>>>
>>> Unfortunately, the haphazard way this code is sprinkled throughout 
>>> our code base makes it very hard to tell at a glance if some code 
>>> originated with the JDK project, or is imported from elsewhere 
>>> ("upstream"). Many times, you need to be well acquainted with these 
>>> parts of the code to know whether a file is 3rd party code or not. 
>>> If you do not know, you will need to rely on heuristics such as 
>>> looking at the path name, checking for unusual copyright headers, or 
>>> looking at the git history for commits that indicate a refresh from 
>>> upstream.
>>>
>>> I propose we do something about this situation.
>>>
>>> My suggestion is that we add a file, UPSTREAM.md, in the top 
>>> directory of the imported 3rd party code. These files will follow a 
>>> pattern, with a set of formalized headers on the top, a blank line 
>>> of separation, and then a free-form markdown text, with e.g. 
>>> relevant notes about the project, important information about the 
>>> latest update, or instructions or hints on how to update the source 
>>> to a newer version.
>>>
>>> Here are two examples of how this might look. (Note that the 
>>> free-form text here is just some offhand examples I invented. In 
>>> real life I assume they would be more detailed.)
>>>
>>> Example 1: src/java.xml.crypto/share/classes/com/sun/UPSTREAM.md:
>>> ===
>>> Name: Apache Santuario
>>> Homepage: https://santuario.apache.org/
>>> License: src/java.xml.crypto/share/legal/santuario.md
>>> Version: 2.2.1
>>> Upstream-release-URL: https://github.com/apache/santuario-xml-security-java/releases/tag/xmlsec-2.2.1
>>>
>>> # Upgrade instructions
>>>
>>> To upgrade the package, copy the source code from 
>>> `src/main/java/org/apache` in the upstream git repo into 
>>> `src/java.xml.crypto/share/classes/com/sun/org/apache`. Then update 
>>> the package name space by running `find 
>>> src/java.xml.crypto/share/classes/com/sun/org/apache -type f | xargs 
>>> sed -i -e 's/^package org\.apache/package com.sun.org.apache/'`.
>>> ===
>>>
>>> Example 2: src/java.desktop/share/native/libharfbuzz/UPSTREAM.md:
>>> ===
>>> Name: Harfbuzz
>>> Homepage: https://harfbuzz.github.io/
>>> License: src/java.desktop/share/legal/harfbuzz.md
>>> Version: 2.8.0
>>> Upstream-release-URL: https://github.com/harfbuzz/harfbuzz/releases/tag/2.8.0
>>>
>>> # How to update
>>>
>>> To update to a new version of Harfbuzz, copy all `.cc`, `.hh` and 
>>> `.h` files from `src` into 
>>> `src/java.desktop/share/native/libharfbuzz`. Check if the build 
>>> scripts in upstream have changed since the last version, and update 
>>> our makefiles accordingly.
>>> ===
>>>
>>>
>>> These files will serve many purposes:
>>>
>>> 1) They will be a strong signal to developers coming to an 
>>> unfamiliar part of the code base that the files here originated 
>>> upstream.
>>>
>>> 2) It will be possible for tooling to understand that code in these 
>>> directories might not live up to normal JDK standards. It would e.g. 
>>> be possible for the build system to automatically disable 
>>> warnings-as-errors for such code, or for upcoming tools that support 
>>> code quality efforts such as blessed modifier order or spell checks 
>>> to skip those parts of the code.
>>>
>>> 3) It will be possible to get an at-a-glance overview of what 
>>> versions of 3rd party code are included in a build of the JDK, for 
>>> all included projects -- not just as of right now, but at any point 
>>> in history (since these files get updated when upstream code is 
>>> updated in the JDK). The build system could, for instance, collect 
>>> such information and provide it with the built JDK, just as it now 
>>> collects the licenses from the src/$MODULE/legal directories.
>>>
>>> 4) The git history for these files will clearly show when the code 
>>> was last refreshed from upstream, and by whom.
>>>
>>> 5) And finally, the free-text part gives a well-defined place to 
>>> store important information about how to upgrade, common mistakes, 
>>> etc -- knowledge that right now sometimes is put down into README 
>>> files, but most often just resides in the head of the developer who 
>>> last did a refresh.
>>>
>>> Thoughts?
>>>
>>> /Magnus
>>
>
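
For reference, the proposed file layout (formalized headers, a blank line, then free-form markdown) is simple enough to read mechanically. The following is a rough sketch only, assuming headers have the form `Key: value`, and that a line which does not look like a header continues the previous value (as with a URL wrapped onto its own line). The function name `parse_upstream` is hypothetical and not part of any existing tool.

```python
import re

# A header line is "Key:" followed by whitespace and a value, or by nothing.
HEADER = re.compile(r"^([A-Za-z][A-Za-z0-9-]*):(?:\s+(.*))?$")


def parse_upstream(text):
    """Split UPSTREAM.md text into (headers dict, free-form markdown body)."""
    head, _sep, body = text.partition("\n\n")  # first blank line ends the headers
    headers, last_key = {}, None
    for line in head.splitlines():
        match = HEADER.match(line.strip())
        if match:
            last_key = match.group(1)
            headers[last_key] = (match.group(2) or "").strip()
        elif last_key is not None and line.strip():
            # Assumption: a non-header line continues the previous value.
            headers[last_key] = (headers[last_key] + " " + line.strip()).strip()
    return headers, body
```

With something like this, collecting the `Name` and `Version` of every imported project for a given commit would amount to walking the tree and calling the parser on each UPSTREAM.md found.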