2012-07-04, 16:17  #45 
"Ben"
Feb 2007
2·1,789 Posts 
Outstanding, axn!
All together, your optimizations sped it up nearly 20%! Code:
3.889000 milliseconds for big sieve
5761455 big primes (< 100000000) found
47.001999 milliseconds for big sieve
50847534 big primes (< 1000000000) found
741.184021 milliseconds for big sieve
455052511 big primes (< 10000000000) found

And it made me think of one more thing to try... hopefully I'll get a chance later today. 
2012-07-05, 14:16  #46 
"Ben"
Feb 2007
2×1,789 Posts 
+ "Continue" can be "break" ()
+ added three more primes to the small prime presieve + a few more code cleanups now 15% faster: Code:
3.314000 milliseconds for big sieve 5761455 big primes (< 100000000) found 39.870998 milliseconds for big sieve 50847534 big primes (< 1000000000) found 654.596985 milliseconds for big sieve 455052511 big primes (< 10000000000) found 
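For reference, a small-prime presieve like the one mentioned above is usually done by precomputing the composite pattern of the first few odd primes over the product of their periods and OR-ing that pattern into each sieve segment, so those primes never enter the main sieving loop at all. A minimal CPU sketch, assuming a flags-per-odd-number layout (the primes chosen and all names here are illustrative, not the actual code from this thread):

```c
#include <stdint.h>
#include <string.h>

/* Presieve for p = 3, 5, 7 over odd numbers: the composite pattern
 * repeats every 3*5*7 = 105 odd-number positions (index i <-> value 2i+1). */
#define PRESIEVE_PERIOD 105

static uint8_t presieve_pat[PRESIEVE_PERIOD];

void build_presieve(void)
{
    static const int small[] = { 3, 5, 7 };
    memset(presieve_pat, 0, sizeof presieve_pat);
    for (int k = 0; k < 3; k++) {
        int p = small[k];
        /* odd multiples of p sit at index (p-1)/2, then every p indices */
        for (int i = (p - 1) / 2; i < PRESIEVE_PERIOD; i += p)
            presieve_pat[i] = 1;
    }
}

/* OR the pattern into a segment of odd-number flags whose first entry is
 * odd-index lo; 3, 5 and 7 then never appear in the main sieving loop.
 * (The primes 3, 5, 7 themselves get flagged, so a real sieve starts its
 * segments above them or fixes those entries up afterward.) */
void presieve_segment(uint8_t *flags, uint64_t lo, int len)
{
    for (int i = 0; i < len; i++)
        flags[i] |= presieve_pat[(lo + i) % PRESIEVE_PERIOD];
}
```

Adding more primes to the presieve grows the pattern multiplicatively (including 11 makes the period 1155), which is the usual trade-off between pattern size and primes removed from the main loop.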
2012-07-05, 15:34  #47 
If I May
"Chris Halsall"
Sep 2002
Barbados
37·269 Posts 
Great work bsquared and axn!
Now I'm going to ask a question which will probably reveal my naivety on such matters... At what point can such GPU processing be incorporated into mfakt* to eliminate the current need for CPU presieving? I know that rcv did some work on this, but I don't know if it actually went anywhere. As it currently stands, for higher-end GPUs it can take many CPU cores just to saturate a single GPU. Insight from those who actually understand how all this works would be greatly appreciated. 
2012-07-05, 16:10  #48  
"Ben"
Feb 2007
2·1,789 Posts 
Quote:
I've been talking to rcv about this a little. His code was incorporated into Bdot's GitHub repository, but I don't know how far anyone has taken it toward merging it into the master (it is currently a branch). I also don't know how much work, if any, has been done to figure out how much of a hit TF performance will take if the GPU is also doing sieving, whether that performance hit is a net win given that the freed CPU cycles can be used for other work (it depends on what that other work is, of course), or to figure out optimal sieve/TF crossover points, etc. There is likely a lot of work to do there.

I don't have any plans at the moment to try to port this code into mfakt*. My only ulterior motive (hope? notion?) is to someday use the sieving tricks I've learned to make a fast tinyqs engine for factoring composite residues in multi-large-prime variants of SIQS or NFS. 

2012-07-05, 16:14  #49 
"Ben"
Feb 2007
2×1,789 Posts 
+ special main loop for larger primes
Only really helps if many larger primes are needed, of course: Code:
3.324000 milliseconds for big sieve
5761455 big primes (< 100000000) found
39.411999 milliseconds for big sieve
50847534 big primes (< 1000000000) found
587.780029 milliseconds for big sieve
455052511 big primes (< 10000000000) found 
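A special main loop for larger primes pays off because once a prime exceeds the segment length it can strike a given segment at most once, so the per-prime inner stride loop collapses to a single test. A hedged CPU sketch of that split, assuming a segmented sieve where each prime's next multiple is carried between segments (all names are illustrative):

```c
#include <stdint.h>

/* Sieve one segment [lo, lo+len) of integer flags. nextmult[i] holds the
 * next multiple of primes[i] that is >= lo, and is advanced for the next
 * segment as a side effect. */
void sieve_segment(uint8_t *flags, uint64_t lo, uint64_t len,
                   const uint32_t *primes, uint64_t *nextmult, int nprimes)
{
    for (int i = 0; i < nprimes; i++) {
        uint64_t p = primes[i];
        uint64_t m = nextmult[i];
        if (p < len) {
            /* small prime: strikes the segment many times */
            for (; m < lo + len; m += p)
                flags[m - lo] = 1;
        } else {
            /* large prime: multiples are >= len apart, so at most one
             * hit per segment -- no inner loop needed */
            if (m < lo + len) {
                flags[m - lo] = 1;
                m += p;
            }
        }
        nextmult[i] = m;
    }
}
```

As the post notes, the large-prime branch only matters when the prime bound is far above the segment size, i.e. when most primes fall into the "at most one hit" case.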
2012-07-05, 16:34  #50  
If I May
"Chris Halsall"
Sep 2002
Barbados
23341_{8} Posts 
Quote:
I would argue that while GPU TFing will of course take a "hit" by not having the candidates presieved by a CPU, it would be better to have the GPUs do all the work required for TFing. This would mean that only a single instance of mfakt* would need to be run per card, and the CPU(s) would be free to do other work. As the musical davieddy once said, "a CPU shouldn't get out of bed to do trial factoring." 

2012-07-06, 03:18  #51 
Jun 2003
5149_{10} Posts 
I would think that the logical thing would be to start with rcv's code, and see if you can incorporate some of the concepts there.

2012-07-15, 16:26  #52 
P90 years forever!
Aug 2002
Yeehaw, FL
1DD0_{16} Posts 
More ideas for you:
pinfo is an array of structures. Structure of arrays usually works better (threads coalesce global memory references better). That is, turn pinfo into two half-sized arrays (primes and pinv).

Division is bad, oh so bad. Eliminate your div-by-3s and div-by-6s by storing p/6 in pinfo instead of p.

Won't you also have to store p mod 6, increasing global memory access costs?

You could, or you could have two pinfo arrays, one for 1 mod 6 primes and another for 5 mod 6 primes, and simply double the amount of kernel code that processes primes. 
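The structure-of-arrays point in a nutshell: with an array of {prime, pinv} structs, adjacent threads reading only the prime field touch memory with an 8-byte stride, so half of every transaction is wasted; two flat arrays let a warp's loads coalesce into full transactions. A schematic of the layout change (the field names prime and pinv come from the post; everything else is illustrative):

```c
#include <stdint.h>

/* Array of structures: thread t reading in[t].prime skips over in[t].pinv,
 * so each 32-byte memory transaction is only half useful. */
struct pinfo_aos {
    uint32_t prime;
    uint32_t pinv;
};

/* Structure of arrays: thread t reading primes[t] loads consecutive 4-byte
 * words, so a warp's loads coalesce. pinv[i] pairs with primes[i]. */

/* Split an existing AoS buffer into the two half-sized arrays. */
void split_aos(const struct pinfo_aos *in, int n,
               uint32_t *primes, uint32_t *pinv)
{
    for (int i = 0; i < n; i++) {
        primes[i] = in[i].prime;
        pinv[i]   = in[i].pinv;
    }
}
```

On the GPU side the change is purely in indexing: the kernel takes two pointers instead of one and reads primes[tid] and pinv[tid] separately.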
2012-07-16, 03:28  #53  
Jun 2003
141D_{16} Posts 


2012-07-16, 13:36  #54 
"Ben"
Feb 2007
2·1,789 Posts 
Thanks for the suggestions George.
As axn said, though, the divisions by constants get optimized away, so anything I try to do to get rid of them only makes things slower. Using two half-sized arrays did help a bit; thanks again! Code:
3.307000 milliseconds for big sieve
5761455 big primes (< 100000000) found
38.532001 milliseconds for big sieve
50847534 big primes (< 1000000000) found
556.763000 milliseconds for big sieve
455052511 big primes (< 10000000000) found 
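For the curious, "divisions by constants get optimized away" refers to the standard reciprocal-multiplication transform: the compiler replaces x / 6 with a widening multiply by a precomputed magic constant followed by a shift, so there is no hardware divide left to eliminate by hand. A small demonstration for the divisor 6, where the magic constant is ceil(2^34 / 6) = 0xAAAAAAAB:

```c
#include <stdint.h>

/* Unsigned division by the constant 6, the way a compiler typically emits
 * it: one widening multiply and one shift, no divide instruction. The
 * constant M = ceil(2^34 / 6) satisfies the error bound needed for this
 * to be exact for every 32-bit x. */
static inline uint32_t div6(uint32_t x)
{
    return (uint32_t)(((uint64_t)x * 0xAAAAAAABULL) >> 34);
}
```

This is why hand-written replacements tend to lose: they add memory traffic or instructions to beat a divide that was never actually executed.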
2012-07-16, 15:17  #55 
P90 years forever!
Aug 2002
Yeehaw, FL
2^{4}·3^{2}·53 Posts 
Glad to help! Something else to try, to reduce memory usage: change the primes array from uint32 to uint16. Instead of the prime number, store the difference between the new prime to test and the last prime tested by the thread. In other words, it will be the distance to the 256th-previous prime. This halves the memory accesses at the cost of one addition.
I haven't thought it through, but you should be able to do the same thing to pinv once the primes get large enough.

P.S. You may have to read 32 bits at a time from memory and split them into uint16s yourself (there is a PTX instruction to do that) to get the best performance. I'm still learning CUDA myself, so let me know what works and what doesn't. Thanks! 
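The delta idea in sketch form: if each thread walks every 256th entry of the prime table (a hypothetical 256-thread stride), then the gap between consecutive entries seen by one thread is the sum of 256 prime gaps, which at these magnitudes still fits comfortably in 16 bits, so the table can store uint16 differences and each thread reconstructs its primes with a running sum. A hedged CPU illustration; the stride and every name here are assumptions:

```c
#include <stdint.h>

#define STRIDE 256  /* assumed thread count: thread t handles primes t, t+256, ... */

/* Encode: keep the first STRIDE primes as absolute 32-bit bases; for every
 * later prime, store its 16-bit distance to the prime STRIDE entries
 * earlier (the previous prime handled by the same thread). */
void encode_deltas(const uint32_t *primes, int n,
                   uint32_t *base, uint16_t *delta)
{
    for (int i = 0; i < STRIDE && i < n; i++)
        base[i] = primes[i];
    for (int i = STRIDE; i < n; i++)
        delta[i - STRIDE] = (uint16_t)(primes[i] - primes[i - STRIDE]);
}

/* Decode, as one thread would: start from the thread's base prime and
 * accumulate its own deltas -- one addition per prime, half the memory
 * traffic of reading uint32 primes. */
uint32_t kth_prime_of_thread(const uint32_t *base, const uint16_t *delta,
                             int thread, int k)
{
    uint32_t p = base[thread];
    for (int j = 0; j < k; j++)
        p += delta[j * STRIDE + thread];
    return p;
}
```

In the real kernel the running sum lives in a register and advances one delta per iteration, so the loop shown here never materializes; it is only for checking the encoding.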