Re: [PATCH] 4837946: Implement Karatsuba multiplication algorithm in BigInteger
Anonymous member wrote:
Out of curiosity, how does the Java implementation compare to GMP in terms of speed?
(Note: These numbers apply to the *revised* patch that I'll be posting in a few minutes.)

It depends on the size of the numbers. When I run the regression tests that I wrote, my updated Java version runs in about 2 minutes. When I run the same regression test in Kaffe using GMP, it takes about 2 *hours*. (Some of the same tests using Java 1.6 take many, many hours without my optimizations for the pow() function.) But those are a lot of small numbers. There is a lot of overhead in converting from Java to C types and back, and in Kaffe's relative slowness. And currently, only multiply() has been improved in the Java patches that I've submitted. Kaffe is about 25 times slower on the average program anyway. For very large numbers, though, Kaffe/GMP starts being very much faster. OpenJDK still can't compete for numbers that are on the cutting edge of number theory, but we'll be much better than we were before.

If I were to write the same code in pure C using GMP, then GMP would be much faster. But I haven't done that, so it's hard to compare GMP to Java exactly. Still, here are some numbers. For multiplying the numbers 3^1000000 * 3^1000001 (with my fixes to do the exponentiation hundreds of thousands of times faster factored out; without these, JDK 1.6 would be thousands of times slower), the times for 20 iterations are:

   JDK 1.6      OpenJDK 1.7 with my patches   Kaffe w/GMP
   292987 ms    28650 ms                      866 ms

Thus, for numbers of this size, my algorithms are more than 10 times faster than JDK 1.6, but 33 times slower than Kaffe/GMP.

For multiplying the somewhat special form 2^1000000 * 2^1000001 (each argument in binary is a 1 followed by a million zeroes):

   JDK 1.6      OpenJDK 1.7 with my patches   Kaffe w/GMP
   117298 ms    386 ms                        474 ms

So, my algorithm is 303 times faster than JDK 1.6, and just slightly faster than Kaffe/GMP.
For multiplying numbers 3^14000000 * 3^14000001, the time for 1 iteration is:

   JDK 1.6      OpenJDK 1.7 with my patches   Kaffe w/GMP
   3499115 ms   89505 ms                      910 ms

Thus, for numbers of this size, my patches make it 38.9 times faster, but still 99 times slower than GMP. Sigh. Well, that's to be expected: if you work with large numbers, GMP becomes ridiculously faster, and Kaffe with GMP becomes only slightly slower than pure GMP as the portion of time spent in Java gets small. GMP includes 3-way Toom-Cook multiplication; the much faster FFT-based multiplication (nearly linear time, roughly O(n log n log log n)) for very large numbers; and hand-crafted assembly language that uses 64-bit instructions and limbs on my 64-bit architecture, as opposed to the 32-bit ints used in BigInteger. (Hopefully some day in the future, BigInteger will be rewritten to use longs internally instead of ints. Multiplying two 64-bit longs does 4 times as much work as multiplying two 32-bit ints, and will thus likely be significantly faster, especially on 64-bit architectures.) So GMP is still very much faster for very large numbers.

GMP is also *much* faster in turning numbers into strings, and in exponentiation. The algorithms in BigInteger are horrible for pow() and toString(). See the following bugs:

   4641897: BigInteger.toString() algorithm slow for large numbers
   http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4641897

   4646474: BigInteger.pow() algorithm slow in 1.4.0
   http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4646474

I have several fixes for both of these. My JVM-based programming language Frink ( http://futureboy.us/frinkdocs/ ) implements many of these improvements, and is much faster for a variety of arguments. I just need to finish improving these for a variety of different arguments, and to consider the threading and memory-size-vs-speed tradeoffs in implementing toString() efficiently.

--
 Alan Eliasen           |  "Furious activity is no substitute
 eliasen@mindspring.com |   for understanding."
 http://futureboy.us/   |     --H.H. Williams
A few weeks ago, I posted some performance numbers for the improvements I'm making to the BigInteger class. Those performance numbers were for the Karatsuba multiplication algorithm that I implemented. I've now implemented 3-way Toom-Cook multiplication (and Toom-Cook squaring), which has better asymptotic performance than Karatsuba multiplication. (Karatsuba is O(n^1.585); 3-way Toom-Cook is O(n^1.465). Compare this to Java 1.6, which used only O(n^2) algorithms.) One of the three algorithms is then chosen according to the size of the numbers. There's also a lot of room for doing asymmetrical Toom-Cook multiplication, with, say, one side being broken into 4 parts and the other into 2 parts, but that's lower on my priority list.

Since I haven't heard of any progress on including my previous patch, I'll be submitting a revised patch with Toom-Cook multiplication included soon. Below, interspersed among my previous statements, are some updated performance numbers with the Toom-Cook algorithm:

Alan Eliasen wrote:
But some numbers. For multiplying the numbers 3^1000000 * 3^1000001, (with my fixes to do the exponentiation hundreds of thousands of times faster factored out; without these, JDK 1.6 would be thousands of times slower,) the times for 20 iterations are:
   JDK 1.6      OpenJDK 1.7 with my patches   Kaffe w/GMP
   292987 ms    28650 ms                      866 ms
With Toom-Cook: 18054 ms
Performance improvement over Java 1.6: 16.2x
For multiplying numbers 3^14000000 * 3^14000001, the time for 1 iteration is:
   JDK 1.6      OpenJDK 1.7 with my patches   Kaffe w/GMP
   3499115 ms   89505 ms                      910 ms
With Toom-Cook: 43636 ms
Performance improvement over Java 1.6: 80.18x

--
 Alan Eliasen           |  "Furious activity is no substitute
 eliasen@mindspring.com |   for understanding."
 http://futureboy.us/   |     --H.H. Williams
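For readers curious what the Karatsuba algorithm discussed in this thread actually does, here is a minimal sketch built on BigInteger's public API. This is not the patch's code (which works directly on the internal int[] magnitude arrays); the class name and the 1600-bit cutoff are invented for illustration, and the sketch assumes non-negative operands.

```java
import java.math.BigInteger;

public class KaratsubaDemo {
    static final int THRESHOLD_BITS = 1600; // hypothetical cutoff for this sketch

    /** Karatsuba multiply of non-negative x and y: three half-size
     *  multiplies instead of four, giving O(n^1.585) overall. */
    public static BigInteger karatsuba(BigInteger x, BigInteger y) {
        int n = Math.max(x.bitLength(), y.bitLength());
        if (n <= THRESHOLD_BITS) {
            return x.multiply(y);                 // fall back to grade-school
        }
        int half = n / 2;
        BigInteger mask = BigInteger.ONE.shiftLeft(half).subtract(BigInteger.ONE);
        BigInteger xh = x.shiftRight(half), xl = x.and(mask);  // x = xh*2^half + xl
        BigInteger yh = y.shiftRight(half), yl = y.and(mask);
        BigInteger p1 = karatsuba(xh, yh);                     // high * high
        BigInteger p2 = karatsuba(xl, yl);                     // low * low
        BigInteger p3 = karatsuba(xh.add(xl), yh.add(yl));     // (xh+xl)*(yh+yl)
        // x*y = p1*2^(2*half) + (p3 - p1 - p2)*2^half + p2
        return p1.shiftLeft(2 * half)
                 .add(p3.subtract(p1).subtract(p2).shiftLeft(half))
                 .add(p2);
    }

    public static void main(String[] args) {
        BigInteger a = BigInteger.valueOf(3).pow(2000);
        BigInteger b = BigInteger.valueOf(7).pow(1999);
        System.out.println(karatsuba(a, b).equals(a.multiply(b))); // prints true
    }
}
```

The recursive call on (xh+xl)*(yh+yl) is the whole trick: it replaces the two cross-products of the schoolbook method with a single multiply plus cheap additions and shifts.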
Alan Eliasen wrote: [snip]
Since I haven't heard of any progress on including my previous patch,
Sorry, still saturated with OpenJDK 6, -Joe
16 months ago, I posted the first of several patches to vastly improve the performance of the BigInteger class. These included implementations of Karatsuba and 3-way Toom-Cook multiplication, and of Karatsuba and Toom-Cook squaring. The particular algorithm used for multiplication or squaring is chosen according to the size of the numbers, and these algorithms are asymptotically faster than the O(n^2) algorithms in previous versions of Java. (Karatsuba is O(n^1.585); 3-way Toom-Cook is O(n^1.465).)

Unfortunately, none of these patches has yet been applied or reviewed in any way by Sun, although they have been reviewed by some of the top researchers in the field of high-performance multiplication. Nor have I received answers to my questions about how long Sun wants regression tests to run, the size of their output, etc., nor have I been provided with the promised existing regression tests for BigInteger to see if they need improvement. (If I can find these somewhere, please let me know. They're apparently not in my checkout.) I'm still hoping to get these changes into Java 1.7, because the performance increase is so significant, and because this is necessary to make Java a feasible platform for number theory and numerics.

Recently, Xiaobin Lu submitted patches to improve the performance of BigDecimal, and also improved BigInteger's performance in the process. Great job! Luckily, the overlap between our patches was negligible. Xiaobin may now be the person most suited to look over my patches, as he has worked intimately in the class recently. My fixes are further improved by his fixes to give some truly spectacular improvements in performance for big numbers, which should also help large BigDecimals. I was disappointed, though, that Sun knew about my larger performance improvements for a long time and didn't use this opportunity to evaluate and apply them, nor communicate with me, despite my repeated attempts and my frankly hard work on this.
Below, interspersed among my previous statements, are some updated performance numbers with my patches and Xiaobin's.

For multiplying the numbers 3^1000000 * 3^1000001 (with my fixes to do the exponentiation hundreds of thousands of times faster factored out; without these, JDK 1.6 would be thousands of times slower), the times for 20 iterations are:

   JDK 1.6                       292987 ms
   OpenJDK 1.7 + my Karatsuba     28650 ms
   OpenJDK 1.7 + my Toom-Cook     18054 ms
   Plus Xiaobin's improvements    12880 ms

*** Performance improvement over Java 1.6: 22.7x

For multiplying numbers 3^14000000 * 3^14000001, the time for 1 iteration is:

   JDK 1.6                      3499115 ms
   OpenJDK 1.7 + my Karatsuba     89505 ms
   OpenJDK 1.7 + my Toom-Cook     43636 ms
   Plus Xiaobin's improvements    29813 ms

*** Performance improvement over Java 1.6: 117.3x

You can see that operations that weren't even feasible to consider in JDK 1.6 are now possible in Java. Operations that used to take almost an hour can be done in less than 30 seconds. This encompasses a lot of different bug reports:

   4228681: Some BigInteger operations are slow with very large numbers
   http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4228681

(This was closed but never fixed.)

   4837946: Implement Karatsuba multiplication algorithm in BigInteger
   http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4837946

This is done, along with Toom-Cook multiplication. My implementation is intended to be easy to read, understand, and check. It significantly improves multiplication performance for large numbers.

   4646474: BigInteger.pow() algorithm slow in 1.4.0
   http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4646474

This is improved in many ways:

   * Rewrote pow() to use Karatsuba and Toom-Cook multiplication
   * Implemented Karatsuba squaring
   * Implemented 3-way Toom-Cook squaring
   * Found good thresholds for Karatsuba and Toom-Cook squaring
   * Added an optimization to use left-shifting for multiples of 2 in
     the base.
This improved speed by thousands of times for things like Mersenne numbers, and may be one of the most significant improvements for practical programs.

   4641897: BigInteger.toString() algorithm slow for large numbers
   http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4641897

The current toString() uses a very inefficient algorithm for large numbers. I plan to replace it with a recursive divide-and-conquer algorithm devised by Schoenhage and Strassen. I have developed and tested this in my own software. It runs hundreds or thousands of times faster than the current version for large numbers: what takes a day in Java 1.6, I can do in 5 minutes. It will also benefit from faster multiplication and exponentiation. This is likely to be the most controversial algorithm, because it has potential threading, synchronization, caching, and memory-vs-speed tradeoffs, so I'm going to submit it separately, if I can get Sun to review the multiply and pow patches first.

My patches are designed to be as readable and simple as possible. They all build on existing functions, and eschew lots of low-level bit-fiddling, as those types of changes are harder to understand and debug. I think it's best to get working algorithms with better asymptotic efficiency, as those will vastly improve performance for large numbers, and tune them by doing more low-level bit-fiddling later if necessary. Even without being tuned to the nth degree, the new algorithms are vastly faster for large numbers, and identical for small numbers.

I've been asked by Sun to submit my patches in small chunks, so I plan to submit just the multiplication and squaring patch, and leave the patches for pow() for later unless I hear otherwise. I'd rather they had it *all* and got it tested and integrated as soon as possible, after all this waiting, and the work it takes to separate out working code to make a smaller patch. I will be submitting a patch in the next day or two.
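To illustrate the divide-and-conquer toString() idea, here is a simplified sketch, not the patch itself: real implementations cache the powers of 10 and tune the recursion cutoff, and all names here are invented. The number is split around a power of 10 of roughly half its digit count, and each half is converted recursively, so the expensive divisions happen on ever-smaller numbers.

```java
import java.math.BigInteger;

public class FastToString {
    /** Recursive divide-and-conquer decimal conversion (sketch). */
    public static String toDecimal(BigInteger n) {
        if (n.signum() < 0) return "-" + toDecimal(n.negate());
        StringBuilder sb = new StringBuilder();
        recurse(sb, n, 0);
        return sb.length() == 0 ? "0" : sb.toString();
    }

    // minDigits: zero-pad this chunk to that width (0 for the leading chunk).
    private static void recurse(StringBuilder sb, BigInteger n, int minDigits) {
        if (n.bitLength() <= 62) {                    // fits in a long: convert directly
            String s = Long.toString(n.longValueExact());
            for (int i = s.length(); i < minDigits; i++) sb.append('0');
            if (!(sb.length() == 0 && s.equals("0"))) sb.append(s);
            return;
        }
        // Split around 10^m with m about half the digit count (log10(2) ~ 0.30103).
        int m = (int) (n.bitLength() * 0.30102999566398 / 2);
        BigInteger[] qr = n.divideAndRemainder(BigInteger.TEN.pow(m));
        recurse(sb, qr[0], minDigits > m ? minDigits - m : 0);  // high digits
        recurse(sb, qr[1], m);                                  // low digits, zero-padded
    }

    public static void main(String[] args) {
        BigInteger n = BigInteger.valueOf(7).pow(500);
        System.out.println(toDecimal(n).equals(n.toString())); // prints true
    }
}
```

Each level does one big division and halves the digit count, so the conversion inherits the speed of whatever fast multiplication and division the library provides.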
I'm running my *huge* regression tests, which produce about 250 GB of output and take at least a day to run. (And you have to run them twice, against a known good platform, to have something to compare to... luckily I've done that.)

For those who'd just like to replace their file with my improved version (includes Xiaobin's patches), it's available at:

   http://futureboy.us/temp/BigInteger.java

From the queries I get, this is important to a lot of people. The performance of BigInteger can be improved by tens or hundreds or thousands of times (or even more in the case of certain arguments of pow()), and should be done to make Java a more viable platform for numerics. This work has been in Sun's hands for a long time, and really needs to get into 1.7.

--
 Alan Eliasen
 eliasen@mindspring.com
 http://futureboy.us/
Alan Eliasen wrote:
From the queries I get, this is important to a lot of people. The performance of BigInteger can be improved by tens or hundreds or thousands of times (or even more in the case of certain arguments of pow()), and should be done to make Java a more viable platform for numerics.
This work has been in Sun's hands for a long time, and really needs to get into 1.7.
You give examples of the speedup for very large bignums, but you don't say the size of numbers at which your approach becomes faster than the current code. Of course any asymptotic improvement helps with numbers that are half a million decimal digits long, but where's the crossover?

Andrew.
Andrew Haley wrote:
You give examples of the speedup for very large bignums, but you don't say the size of numbers at which your approach becomes faster than the current code. Of course any asymptotic improvement helps with numbers that are half a million decimal digits long, but where's the crossover?
Andrew,

Good question. Like most Karatsuba and Toom-Cook implementations, the optimal crossover point is vague, as it can depend on the size of both operands, and the curves don't separate fast at that point. If you look at the updated file (again, at http://futureboy.us/temp/BigInteger.java ) you'll see the crossover points:

   /**
    * The threshold value for using Karatsuba multiplication. If the
    * number of ints in both mag arrays is greater than this number,
    * then Karatsuba multiplication will be used. This value is found
    * experimentally to work well.
    */
   private static final int KARATSUBA_THRESHOLD = 50;

   /**
    * The threshold value for using 3-way Toom-Cook multiplication. If
    * the number of ints in both mag arrays is greater than this
    * number, then Toom-Cook multiplication will be used. This value
    * is found experimentally to work well.
    */
   private static final int TOOM_COOK_THRESHOLD = 75;

   /**
    * The threshold value for using Karatsuba squaring. If the number
    * of ints in the number is larger than this value, Karatsuba
    * squaring will be used. This value is found experimentally to
    * work well.
    */
   private static final int KARATSUBA_SQUARE_THRESHOLD = 90;

   /**
    * The threshold value for using Toom-Cook squaring. If the number
    * of ints in the number is larger than this value, Toom-Cook
    * squaring will be used. This value is found experimentally to
    * work well.
    */
   private static final int TOOM_COOK_SQUARE_THRESHOLD = 140;

Thus, the first crossover, for Karatsuba multiplication, is at 50*32 bits (1600 bits), or about 482 decimal digits; Toom-Cook kicks in at 2400 bits. (Hint: divide the number of bits by 3.32 to get the approximate number of decimal digits.)

These crossovers are determined experimentally, of course, and constants are chosen that work well for a variety of number forms and operand combinations. These numbers were found to work well after a lot of timing runs with a lot of different number sizes and forms.
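The selection logic these thresholds drive can be sketched as a small pure function. The names are hypothetical (the real code compares the mag array lengths inline rather than returning a label), but the threshold values are the ones quoted above.

```java
public class MultiplyDispatch {
    // Threshold values quoted from the patch, in 32-bit ints of magnitude.
    static final int KARATSUBA_THRESHOLD = 50;
    static final int TOOM_COOK_THRESHOLD = 75;

    /** Which multiplication algorithm the thresholds select for
     *  operands of xlen and ylen ints: both operands must clear a
     *  threshold before the fancier algorithm is used. */
    public static String chooseAlgorithm(int xlen, int ylen) {
        if (xlen < KARATSUBA_THRESHOLD || ylen < KARATSUBA_THRESHOLD)
            return "grade-school";   // O(n^2), fastest for small operands
        if (xlen < TOOM_COOK_THRESHOLD || ylen < TOOM_COOK_THRESHOLD)
            return "karatsuba";      // O(n^1.585)
        return "toom-cook-3";        // O(n^1.465)
    }

    public static void main(String[] args) {
        System.out.println(chooseAlgorithm(10, 200));   // prints grade-school
        System.out.println(chooseAlgorithm(60, 60));    // prints karatsuba
        System.out.println(chooseAlgorithm(100, 100));  // prints toom-cook-3
    }
}
```

Note that a very unbalanced pair like (10, 200) falls back to grade-school: the asymptotic algorithms only pay off when both operands are large, which is also why the unbalanced Toom-Cook variants mentioned below are an interesting future direction.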
Of course, it's possible that different thresholds will work better on different architectures and VM settings. I'm not proposing we tune for each architecture, but rather choose conservative crossover thresholds which work well, which I believe these are. It generally isn't damaging if you're just "in the ballpark" when setting your thresholds; the algorithms will still work fine, and performance will still be very similar for most operands. The curves don't diverge too quickly around the crossover point, there's a lot of random variation for different size operands, and all of your threshold points interact with each other, so it's a multi-dimensional optimization problem.

In fact, the effects of choosing different thresholds are hard to predict. There may be little local minima and maxima in performance whether you choose, say, 45 or 48 or 50 or 52 for the crossover, and operands of different lengths may behave quite differently with different crossovers. You won't determine a good crossover point by asymptotic analysis. You also have to see if a certain combination of crossovers exhibits anomalously bad behavior for certain combinations of operand sizes and avoid those, which is why it's as much black art as science.

The crossover point also depends on how efficient your "grade school" multiplication vs. your Karatsuba multiplication vs. your Toom-Cook implementation is (and there are *lots* of different ways to split and combine numbers in the Toom-Cook method). My method for Toom-Cook is the "optimal" algorithm found by Marco Bodrato (see citations in the source code), who is perhaps the leading researcher into Toom-Cook multiplication. He also analyzed this patch and corrected my original Toom-Cook to a slightly different, more optimal variation (and this moved the Toom-Cook threshold downwards somewhat, as it was a bit faster).
There are also possible optimizations in the future for "unbalanced" operand sizes, where one operand is significantly larger than the other. There are unbalanced Toom-Cook multiplies that can achieve better performance than the straight 3-way Toom-Cook. And of course there are higher-order Toom-Cook multiplications possible, for even larger numbers, but the marginal benefits of higher orders diminish, and the code is more complex.

Again, these crossovers were found experimentally to work well. They still work well with tests of Xiaobin's improvements to multiplication, but I may run yet another set of exhaustive tuning tests to see if the thresholds can be improved again after my massive regression test finishes, hopefully in the next 24 hours.

Note that my optimizations for the pow() function give vastly better performance at even small bit sizes for many operands, as they factor out powers of 2 in the exponent and perform these very rapidly as bit-shifts.

--
 Alan Eliasen
 eliasen@mindspring.com
 http://futureboy.us/
Alan Eliasen wrote:
Note that my optimizations for the pow() function give vastly better performance at even small bit sizes for many operands, as they factor out powers of 2 in the exponent and perform these very rapidly as bit-shifts.
Oops. I mean powers of 2 in the *base*, of course. The exponentiation of these can be done extremely rapidly as left-shifts, and the remaining portion of the number can be exponentiated more rapidly because it's smaller. Otherwise, exponentiation is done by repeated squaring as before, but with much better algorithms that take advantage of the new implementations of Karatsuba and Toom-Cook squaring for larger operands.

--
 Alan Eliasen
 eliasen@mindspring.com
 http://futureboy.us/
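The base power-of-two factoring described above can be sketched in a few lines on BigInteger's public API (a simplified illustration, not the patch's code; class and method names are invented, and it assumes a non-negative base and exponent). Writing base = oddPart * 2^k gives base^e = oddPart^e << (k*e): the shift is nearly free, and the odd part is smaller, so it exponentiates faster.

```java
import java.math.BigInteger;

public class PowDemo {
    /** pow() with the power-of-two factoring optimization (sketch). */
    public static BigInteger pow(BigInteger base, int e) {
        if (base.signum() == 0)
            return e == 0 ? BigInteger.ONE : BigInteger.ZERO;
        int k = base.getLowestSetBit();       // factors of 2 in the base
        BigInteger odd = base.shiftRight(k);  // remaining odd part
        return odd.pow(e).shiftLeft(k * e);   // odd^e * 2^(k*e)
    }

    public static void main(String[] args) {
        // For a pure power of two the odd part is 1, so the entire
        // result is a single shift -- the case (e.g. Mersenne-number
        // work) that sped up by thousands of times.
        BigInteger b = BigInteger.valueOf(2).pow(20);
        System.out.println(pow(b, 5).equals(b.pow(5))); // prints true
    }
}
```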
* Alan Eliasen:
Alan Eliasen wrote:
Note that my optimizations for the pow() function give vastly better performance at even small bit sizes for many operands, as they factor out powers of 2 in the exponent and perform these very rapidly as bit-shifts.
Oops. I mean powers of 2 in the *base*, of course. The exponentiation of these can be done extremely rapidly as left-shifts, and the remaining portion of the number can be exponentiated more rapidly because it's smaller.
To provide some more background, most of us probably worry about BigInteger performance in the 512 to 2048 bit range, because that's the range used for RSA cryptography (assuming that Java uses the Chinese Remainder Theorem optimization for private key operations).
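For context, the CRT optimization mentioned here replaces one full-size modular exponentiation mod n = p*q with two half-size exponentiations mod p and mod q. A toy sketch (the names are invented, and the tiny textbook primes are for illustration only, never for real keys):

```java
import java.math.BigInteger;

public class CrtRsaDemo {
    /** Computes c^d mod (p*q) via the Chinese Remainder Theorem:
     *  two half-size modPow calls plus Garner's recombination. */
    public static BigInteger crtPow(BigInteger c, BigInteger d,
                                    BigInteger p, BigInteger q) {
        BigInteger one = BigInteger.ONE;
        BigInteger mp = c.mod(p).modPow(d.mod(p.subtract(one)), p); // c^d mod p
        BigInteger mq = c.mod(q).modPow(d.mod(q.subtract(one)), q); // c^d mod q
        BigInteger qInv = q.modInverse(p);
        BigInteger h = qInv.multiply(mp.subtract(mq)).mod(p);
        return mq.add(h.multiply(q));            // recombine: result mod p*q
    }

    public static void main(String[] args) {
        // Classic textbook RSA example: p=61, q=53, n=3233, e=17.
        BigInteger p = BigInteger.valueOf(61), q = BigInteger.valueOf(53);
        BigInteger n = p.multiply(q);
        BigInteger phi = p.subtract(BigInteger.ONE)
                          .multiply(q.subtract(BigInteger.ONE));
        BigInteger e = BigInteger.valueOf(17);
        BigInteger d = e.modInverse(phi);
        BigInteger c = BigInteger.valueOf(42).modPow(e, n); // encrypt 42
        System.out.println(crtPow(c, d, p, q).equals(c.modPow(d, n))); // prints true
    }
}
```

Because modular exponentiation cost grows faster than linearly in the operand size, the two half-size operations are substantially cheaper than one full-size one, which is why the 512-2048 bit range matters so much in practice.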
Alan Eliasen wrote:
Andrew Haley wrote:
You give examples of the speedup for very large bignums, but you don't say the size of numbers at which your approach becomes faster than the current code. Of course any asymptotic improvement helps with numbers that are half a million decimal digits long, but where's the crossover?
Good question. Like most Karatsuba and Toom-Cook implementations, the optimal crossover point is vague, as it can depend on the size of both operands, and the curves don't separate fast at that point. If you look at the updated file (again, at http://futureboy.us/temp/BigInteger.java ) you'll see the crossover points:
Thus, the first crossover for Karatsuba multiplication is at 50*32 bits (1600 bits), or about 482 decimal digits. Toom-Cook at 2400 bits.
OK, that's a little more encouraging. The measurements you posted before were for numbers about half a million decimal digits long, which is well into FFT multiplication territory. It's good to see that we get some improvement for numbers of roughly 500 digits.
These crossovers are determined experimentally, of course, and constants are chosen that work well for a variety of number forms and operand combinations. These numbers were found to work well after a lot of timing runs with a lot of different number sizes and forms. Of course, it's possible that different thresholds can work better on different architectures and VM settings. I'm not proposing we tune for each architecture, but rather choose conservative crossover thresholds which work well, which I believe these are.
It generally isn't damaging if you're just "in the ballpark" when setting your thresholds; the algorithms will still work fine, and performance will still be very similar for most operands.
Sure, that makes sense.

Andrew.
participants (4)

- Alan Eliasen
- Andrew Haley
- Florian Weimer
- Joseph D. Darcy