RFR: 8313693: Introduce an internal utility for the Damerau–Levenshtein distance calculation

Pavel Rappo prappo at openjdk.org
Fri Aug 4 15:57:54 UTC 2023


On Fri, 4 Aug 2023 13:29:45 GMT, Pavel Rappo <prappo at openjdk.org> wrote:

> Please review this PR to introduce the Damerau–Levenshtein distance calculation for later use in REPL and CLI.
> 
> The Damerau–Levenshtein distance (DL) is a string similarity metric, variants of which have been successfully used in REPL and CLI:
> 
>     $ git hlpe
>     git: 'hlpe' is not a git command. See 'git --help'.
>     
>     The most similar commands are
> 	    grep
> 	    help
> 
> or
> 
>     $ hg clonmit
>     hg: unknown command 'clonmit'
>     (did you mean one of clone, commit, config?)
> 
> Originally, DL was requested by https://bugs.openjdk.org/browse/JDK-8288660 to provide helpful error messages for mistyped javadoc tags. However, given its wider applicability (e.g. mistyped command-line options), it was suggested that DL is put somewhere more internally accessible, for example, in `com.sun.tools.javac.util`, which is available in `jdk.compiler` and already exported to some other potential clients:
> 
>     exports com.sun.tools.javac.util to
>             jdk.jdeps,
>             jdk.javadoc,
>             jdk.jshell;
> 
> ---
> 
> 
> The Levenshtein edit distance is concerned with three types of edits: insertions, deletions, and substitutions. The Damerau-Levenshtein distance adds the fourth type: transpositions of two adjacent characters. This considerably complicates the algorithm, which is currently known in two main forms: restricted and unrestricted.
> 
> The difference between the two forms is that the simpler form, restricted, operates under the assumption that no substring is edited more than once. This results in different distances between some strings, for example, "CA" and "ABC". Restricted DL gives 3, whereas unrestricted DL gives 2.
> 
> Git seems to have [chosen](https://git.kernel.org/pub/scm/git/git.git/commit/?id=8af84dadb142f7321ff0ce8690385e99da8ede2f) restricted DL. This PR proposes **un**restricted DL. From my nonexpert perspective, while unrestricted DL does not necessarily guarantee a better end result, it feels a better place to start. Additionally, unrestricted DL is a proper metric. That property could probably be used to quickly estimate the upper bound of DL in some applications.
> 
> Like restricted DL, unrestricted DL runs in `O(s1.length() * s2.length())` in time. However, unrestricted DL requires more space: `O(s1.length() * s2.length())` plus the space for the alphabet, which for the purposes stated in this PR should be relatively small: `[a-zA-Z0-9]` plus maybe a dozen of additional symbols, such as `_` and `-`.
> 
> The algorithm that this PR proposes is my port of ...

I was asked elsewhere to provide demonstration of intended use, for example in jdk.javadoc. Okay:

        var knownTags = """
                author
                code
                deprecated
                docRoot
                exception
                hidden
                index
                inheritDoc
                link
                linkplain
                literal
                param
                provides
                return
                see
                serial
                serialData
                serialField
                since
                snippet
                summary
                systemProperty
                throws
                uses
                value
                version"""
                .lines().toList();

        record Pair(String tag, int distance) { }

        for (var misspelled : """
                serial-filed
                inherotDoc
                inheritdoc
                param
                parm
                returns
                throw
                depreciated
                """.lines().toList()) {
            knownTags.stream()
                    .map(tag -> new Pair(tag, DamerauLevenshteinDistance.of(misspelled, tag)))
                    .filter(i -> Double.compare(1 / 3.0, ((double) i.distance()) / i.tag().length()) >= 0)
                    .sorted(Comparator.comparingDouble(Pair::distance))
                    .limit(3)
                    .forEach(p -> System.out.println(misspelled + "\n\tdid you mean " + p.tag() + "?"));
        }

results in this

serial-filed
	did you mean serialField?
inherotDoc
	did you mean inheritDoc?
inheritdoc
	did you mean inheritDoc?
param
	did you mean param?
parm
	did you mean param?
returns
	did you mean return?
throw
	did you mean throws?
depreciated
	did you mean deprecated?

-------------

PR Comment: https://git.openjdk.org/jdk/pull/15157#issuecomment-1665830370


More information about the compiler-dev mailing list