Question on String#indexOf(String)

Jim Andreou jim.andreou at gmail.com
Tue Apr 28 18:41:54 UTC 2009


Many thanks Mike! All this information is very interesting. Yes, I'm sure
that even a small degradation in performance in such a common method would
be considered a performance bug (I wouldn't like it either), at least it's
good to know that the alternatives have been analyzed.
For some people out there it must be surprising that substring search can
sometimes be more fast via java.util.regex. That's a point to take home, to
also keep life sufficiently complicated.

Indeed, as you say, the cases I made regex win were when it could jump 15-20
chars at a time, while indexOf was trying to match (unsuccessfully down the
road) quite more characters of the pattern than just the first. Although the
latter should be quite good in search strings with few repeated characters.

In specific cases though with small alphabets (i.e. aligning genes in DNA),
I think both would be bad. But perhaps such rare needs are better left to be
developed in user space (and this looks promising:
http://www.cs.utexas.edu/users/moore/publications/sustik-moore.pdf
) <http://www.cs.utexas.edu/users/moore/publications/sustik-moore.pdf>.

Regards,
Dimitris

2009/4/28 madbot <madbot at gmail.com>

> The overhead of setting up one of the faster algorithms was only worth it
> when the skip ahead is large and/or the data to search is large. So you need
> heuristics to determine when the overhead is worth it. Those heuristics
> would also have to be fast.
>
> Of course you could also leave that decision to the client invoking the
> search. They may know, for instance, if they will be repeatedly invoking the
> search. If you're searching for a string with a good jump ahead in a lot of
> data and/or you will be doing it repeatedly, it's a good idea to use
> java.util.regex where you can get the boyer moore speed.
>
> Remember, also, that once a large number of programmers get used to certain
> performance tradeoffs it is painful to pull the rug out from underneath them
> and change those tradeoffs.
>
> Anyway, all that said, I don't think there's anything inherently wrong with
> trying to apply some smarter algorithms. Just be prepared for a lot of
> backlash because you made searching for fred in fredrick take longer.
>
> madbot
>
>
>
> On Tue, Apr 28, 2009 at 8:53 AM, Joshua Bloch <jjb at google.com> wrote:
>
>> I vaguely recall that madbot (Mike McCloskey) did some performance work on
>> this method, and came to the conclusion that more sophisticated algorithms
>> didn't actually pay for themselves in the common cases.  I'm copying him so
>> he can confirm or deny.
>>           Josh
>>
>> On Tue, Apr 28, 2009 at 3:50 AM, Jim Andreou <jim.andreou at gmail.com>wrote:
>>
>>> Hi,
>>> I wonder why String#indexOf(String) is implemented as it is. Apparently,
>>> when a character mismatch with the searched pattern is found, the pattern is
>>> only shifted by one character, but there are faster algorithms, for example
>>> see
>>> http://www.cs.utexas.edu/users/moore/best-ideas/string-searching/index.html.
>>> Was anything smarter tried out but had significant disadvantages for general
>>> use? What advantages does the current implementation have? It looks very
>>> pessimistic.
>>>
>>> Regards,
>>> Dimitris Andreou
>>>
>>
>>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.openjdk.java.net/pipermail/core-libs-dev/attachments/20090428/d5769322/attachment.html>


More information about the core-libs-dev mailing list