String.indexOf optimization

Martin Buchholz martinrb at google.com
Fri Apr 4 17:13:14 UTC 2014


Summary:

Many people (myself included) have looked at this problem.  It's unlikely
that String.indexOf will change.  It's hard to beat the naive
implementation in the typical case.  The 64k size of the character set will
make Boyer-Moore harder to implement efficiently.
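
For context, the core of the naive approach is just a scan for the first
pattern character followed by an in-place check of the rest. A simplified
sketch (not the actual JDK source, which works on the backing char array
with offsets) looks roughly like this:

// Simplified sketch of a naive substring search in the same spirit as
// String.indexOf: no preprocessing, no extra allocation, just a scan for
// the first pattern character and an inner loop that verifies the rest.
final class NaiveSearch {
    static int indexOf(String text, String pattern, int fromIndex) {
        int n = text.length();
        int m = pattern.length();
        if (m == 0) {
            return Math.min(Math.max(fromIndex, 0), n);
        }
        char first = pattern.charAt(0);
        for (int i = Math.max(fromIndex, 0); i <= n - m; i++) {
            if (text.charAt(i) != first) {
                continue;                     // cheap skip in the common case
            }
            int j = 1;
            while (j < m && text.charAt(i + j) == pattern.charAt(j)) {
                j++;                          // verify the remaining characters
            }
            if (j == m) {
                return i;                     // full match found at position i
            }
        }
        return -1;
    }
}

In the typical case the inner loop rarely runs more than a step or two, so
once you count preprocessing costs this simple version is hard to beat.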

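The character-set point is about preprocessing cost: a classic
Boyer-Moore (or Boyer-Moore-Horspool) bad-character table indexed by
Java's 16-bit char needs 65,536 entries and has to be built per pattern.
A sketch of just that preprocessing step (the class and method names here
are illustrative, not from the JDK):

// Sketch of the Boyer-Moore-Horspool bad-character table for Java strings.
// With 16-bit chars the table has 65,536 entries, so every search would
// pay for allocating and filling 64K ints before scanning a single
// character, which is hard to amortize for the short patterns indexOf
// typically sees.
final class BadCharTable {
    static int[] build(String pattern) {
        int m = pattern.length();
        int[] shift = new int[Character.MAX_VALUE + 1]; // 65,536 entries
        java.util.Arrays.fill(shift, m);                 // default: shift by the whole pattern
        for (int j = 0; j < m - 1; j++) {
            shift[pattern.charAt(j)] = m - 1 - j;        // smaller shifts for pattern chars
        }
        return shift;
    }
}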

On Fri, Apr 4, 2014 at 7:49 AM, Zoltan Sziladi <kissziszi at gmail.com> wrote:

> Hi,
>
> I am new to this mailing list so please forgive me if this has been
> discussed before.
>
> I was looking at the implementation of String.indexOf and I see that
> it uses the naive O(n^2) implementation. I have been trying to find
> out why it does not use some kind of modern, sophisticated O(n)
> algorithm, but I have found no clear answer so far.
>
> My guess is that the average case should be quite good for this
> algorithm because in practice partial matches are quite rare, so it
> should usually work well. Also, I saw that this code was last touched
> about 6 years ago, so maybe it was just left like this?
> My concern is actually the worst case scenario. If we compared two
> long strings with lots of partial matches, then this would perform
> quite poorly. Wouldn't it be worth having an O(n) implementation here
> then? Modern O(n) pattern matching algorithms don't use much extra
> space either.
> The Collections.sort method also uses an algorithm that prepares for
> the worst case. Maybe a highly optimized quicksort could outperform the
> current mergesort implementation, but I suppose one of the design
> principles behind that was also to prepare for the worst case. (Even
> though a nicely implemented quicksort has an expected average-case
> runtime of O(n log n) regardless of the input.)
>
> If anyone knows why it is implemented this way, or whether it would be
> possible to change the implementation, I'd be happy to hear your opinion.
> Thanks!
>
> Regards,
> Zoltan
>
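
On the worst-case concern above: such degenerate inputs are easy to
construct. A small sketch (the sizes are arbitrary, just to show the shape
of the problem):

// Hypothetical pathological input for the naive scan: a long run of 'a's
// searched for a pattern that is all 'a's except the last character.
// Nearly every alignment matches until the final character, so the inner
// loop does ~1000 comparisons at ~100,000 positions, roughly 10^8 steps.
public final class WorstCaseDemo {
    public static void main(String[] args) {
        char[] t = new char[100_000];
        java.util.Arrays.fill(t, 'a');
        String text = new String(t);

        char[] p = new char[1_001];
        java.util.Arrays.fill(p, 'a');
        p[1_000] = 'b';
        String pattern = new String(p);

        long start = System.nanoTime();
        int idx = text.indexOf(pattern);   // -1, but only after ~10^8 char comparisons
        long ms = (System.nanoTime() - start) / 1_000_000;
        System.out.println("indexOf returned " + idx + " after " + ms + " ms");
    }
}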

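The linear-time alternative the question alludes to would look something
like Knuth-Morris-Pratt, which needs only an O(m) failure table. A sketch
for comparison (again not proposed JDK code; the per-call table allocation
is exactly the overhead the naive implementation avoids):

// Sketch of Knuth-Morris-Pratt: O(n + m) time, O(m) extra space for the
// failure table, regardless of how adversarial the input is.
final class KmpSearch {
    static int indexOf(String text, String pattern) {
        int n = text.length();
        int m = pattern.length();
        if (m == 0) {
            return 0;
        }
        // fail[j] = length of the longest proper prefix of pattern[0..j]
        // that is also a suffix of it.
        int[] fail = new int[m];
        for (int j = 1, k = 0; j < m; j++) {
            while (k > 0 && pattern.charAt(j) != pattern.charAt(k)) {
                k = fail[k - 1];
            }
            if (pattern.charAt(j) == pattern.charAt(k)) {
                k++;
            }
            fail[j] = k;
        }
        // Scan the text without ever moving the text index backwards.
        for (int i = 0, k = 0; i < n; i++) {
            while (k > 0 && text.charAt(i) != pattern.charAt(k)) {
                k = fail[k - 1];
            }
            if (text.charAt(i) == pattern.charAt(k)) {
                k++;
            }
            if (k == m) {
                return i - m + 1;   // match ends at position i
            }
        }
        return -1;
    }
}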

