y-cruncher - A Multi-Threaded Pi-Program

From a high-school project that went a little too far...

By Alexander J. Yee

(Last updated: July 2, 2018)

 


 

The first scalable multi-threaded Pi-benchmark for multi-core systems...

 

How fast can your computer compute Pi?

 

y-cruncher is a program that can compute Pi and other constants to trillions of digits.

It is the first of its kind that is multi-threaded and scalable to multi-core systems. Ever since its launch in 2009, it has become a common benchmarking and stress-testing application for overclockers and hardware enthusiasts.

 

y-cruncher has been used to set several world records for the most digits of Pi ever computed.

 

Current Release:

Windows: Version 0.7.5 Build 9481 (Released: February 24, 2018)

Linux      : Version 0.7.5 Build 9481 (Released: February 24, 2018)

 

Official HWBOT thread.

Official XtremeSystems Forums thread.

 

News:

 

AVX512 Scalability - One Year In (July 2, 2018)

 

When AVX512 (for consumer hardware) launched a year ago with Skylake X, few applications supported it. y-cruncher was one of those few, but the performance was very disappointing due to all sorts of hardware issues and unexpected performance bottlenecks.

 

Over the past year, work was done to target those bottlenecks. While that work failed to eliminate them, it still improved performance by a lot. So now that there's been enough time to properly optimize and tune for AVX512, we can finally do a fair evaluation of this instruction set for bignum crunching.

 

The table below shows scalability across two dimensions: instruction set (AVX2 -> AVX512) and parallelism. This processor has 10 cores hyperthreaded to 20 threads. The clock speed is normalized to 3.6 GHz for all workloads. This is admittedly unrealistic since AVX512 will usually run at a lower frequency. But it does provide a perfect clock-for-clock comparison between AVX2 and AVX512.

1 Billion Digits of Pi - (Times in Seconds)

Core i9 7900X @ 3.6 GHz - 4 x DDR4 @ 3000 MT/s

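| | v0.7.3 (July 2017): 14-BDW (AVX2) | v0.7.3 (July 2017): 17-SKX (AVX512) | v0.7.3 (July 2017): Speedup | v0.7.6 (ETA 2018): 14-BDW (AVX2) | v0.7.6 (ETA 2018): 17-SKX (AVX512) | v0.7.6 (ETA 2018): Speedup |
|---|---|---|---|---|---|---|
| 1 thread | 473.317 | 331.888 | 1.43 x | 439.967 | 272.042 | 1.62 x |
| 20 threads | 48.658 | 39.862 | 1.22 x | 41.833 | 30.645 | 1.37 x |
| Speedup | 9.73 x | 8.33 x | | 10.50 x | 8.88 x | |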

The single-threaded benchmarks show the raw speed of AVX2 vs. AVX512. While everything got faster from v0.7.3 to v0.7.6, the AVX512 binary improved more. This is because v0.7.6 has all the new optimizations that can only be done with real AVX512 hardware. In contrast, v0.7.3 was released only 2 weeks after Skylake X was launched. Much of the AVX512 code in v0.7.3 was written years earlier using only emulators, before the hardware ever existed in silicon.

 

The multi-threaded runs show smaller speedups from AVX2 to AVX512. This is due to memory bandwidth becoming a factor. Even with 4 channels of high-clocked memory, Skylake X with more than a few cores does not have enough memory bandwidth to feed all the cores. Many of the optimizations between v0.7.3 and v0.7.6 were targeted at alleviating this bottleneck.
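To put the bandwidth constraint in rough numbers, here is a minimal sketch that estimates the theoretical peak DRAM bandwidth of the test system above (4 channels of DDR4 at 3000 MT/s, 10 cores). It assumes the standard 64-bit (8-byte) data bus per DDR4 channel. The function name is made up for illustration, and real sustained bandwidth will be lower than this upper bound.

```python
def peak_bandwidth_gb_s(channels, mt_per_s, bus_bytes=8):
    """Theoretical peak DRAM bandwidth in GB/s (decimal gigabytes).

    Assumes each channel moves `bus_bytes` bytes per transfer
    (8 bytes for a 64-bit DDR4 channel).
    """
    return channels * mt_per_s * 1e6 * bus_bytes / 1e9


# Configuration taken from the benchmark above: 4 x DDR4 @ 3000 MT/s, 10 cores.
total = peak_bandwidth_gb_s(channels=4, mt_per_s=3000)
per_core = total / 10

print(f"{total:.1f} GB/s total, {per_core:.1f} GB/s per core")
```

Even at the theoretical peak, each of the 10 cores gets less than a tenth of what a single thread would have to itself, which is consistent with the smaller multi-threaded speedups in the table.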

 

 

So as of 2018, the raw AVX2 -> AVX512 speedup is 62% on Skylake X with dual 512-bit FMAs. Given the amount of code that remains unvectorized, 62% is reasonable due to Amdahl's Law. Furthermore, AVX512 only doubles up the floating-point capability. Many integer operations fall well short of that.
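As a rough illustration of the Amdahl's Law argument, the sketch below inverts the formula to estimate what fraction of the single-threaded run time would have to benefit from AVX512 to produce a 1.62x overall speedup. It assumes the affected code speeds up by exactly 2x and everything else is unchanged. That is a simplification (as noted, many integer operations gain less than 2x), so treat the result as a ballpark figure only. The function names are made up for this example.

```python
def amdahl_speedup(p, s):
    """Overall speedup when a fraction p of the run time is sped up by factor s."""
    return 1.0 / ((1.0 - p) + p / s)


def fraction_from_speedup(overall, s):
    """Invert Amdahl's Law: the fraction p that yields `overall` speedup.

    From 1/overall = (1 - p) + p/s, it follows that
    p = (1 - 1/overall) / (1 - 1/s).
    """
    return (1.0 - 1.0 / overall) / (1.0 - 1.0 / s)


# 1.62x is the single-threaded v0.7.6 speedup from the table above.
# The 2x factor is the assumed ideal gain from doubling the vector width.
p = fraction_from_speedup(1.62, 2.0)
print(f"{p:.3f}")  # prints 0.765
```

Under these assumptions, roughly three quarters of the run time would be in code that benefits fully from the wider vectors.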

 

But once everything else is factored in (AVX512 clock speed throttle and memory bandwidth with multi-threading), the real speedup of AVX512 on a large Skylake X processor with a lot of cores dwindles to around 20 - 30%. Disappointing? Yes. But not surprising for new technology. Perhaps things will be better when DDR5 becomes a thing.

 

 

Cannon Lake?

 

Maybe it's too early to start talking about Cannon Lake since it will still be a long time before these processors hit the market in volume.

250 Million Digits of Pi - (Times in Seconds)

Core i3-8121U @ unknown fixed clock speed*

| y-cruncher v0.7.6 (ETA 2018) | 14-BDW (AVX2) | 17-SKX (AVX512-DQ) | 18-CNL (AVX512-VBMI) |
|---|---|---|---|
| 1 thread | 100.057 | 87.840 | 66.510 |

The Core i3-8121U is a 10nm Cannon Lake processor with only one 512-bit FMA. Having only one FMA severely limits the performance of the baseline AVX512 binary (17-SKX). But the new 18-CNL binary seems to care less. It's way too early to draw any conclusions yet.

 

*Credit to Jzw for testing this.

 

 

 

Older News

 

Records Set by y-cruncher:

y-cruncher has been used to perform a number of world-record-sized computations.

 

Blue: Current World Record

Green: Former World Record

Red: Unverified computation. Does not qualify as a world record until verified using an alternate formula.

Date Announced Date Completed: Source: Who: Constant: Decimal Digits: Time: Computer:
August 24, 2017 August 23, 2017   Ron Watkins Euler-Mascheroni Constant 477,511,832,674

Compute:  34.4 days

Verify:  141 days

4 x Xeon E5-4660 v3 @ 2.1 GHz - 1 TB
2 x Xeon X5690 @ 3.47 GHz - 128 GB
August 14, 2017 August 13, 2017   Ron Watkins Zeta(3) - Apery's Constant 500,000,000,000

Compute:  19.7 days

Verify:  29.8 days

8 x Xeon 6550 @ 2.0 GHz - 512 GB

2 x Xeon X5690 @ 3.46 GHz - 142 GB

November 15, 2016 November 11, 2016 Blog
Sponsor
Peter Trueb Pi 22,459,157,718,361 Compute:  105 days

Verify:  28 hours

Validation File

4 x Xeon E7-8890 v3 @ 2.50 GHz
1.25 TB DDR4
20 x 6 TB 7200 RPM Seagate
September 3, 2016 August 29, 2016   Ron Watkins e 5,000,000,000,000

Compute:  48.6 days

Verify:  48.7 days

2 x Xeon X5690 @ 3.47 GHz
141 GB
July 11, 2016 July 5, 2016   "yoyo" Golden Ratio 10,000,000,000,000

Compute:  6.2 days

Not Verified

2 x Intel Xeon E5-2696 v4 @ 2.2 GHz
768 GB
June 28, 2016 June 19, 2016   Ron Watkins Square Root of 2 10,000,000,000,000

Compute:  18.8 days

Verify:  25.2 days

2 x Xeon X5690 @ 3.47 GHz
141 GB
June 4, 2016 May 29, 2016   Ron Watkins Lemniscate 250,000,000,000

Compute:  91.7 hours

Verify:  270 hours

4 x Xeon E5-4660 v3 @ 2.1 GHz - 1TB
4 x Xeon X6550 @ 2 GHz - 512 GB
June 4, 2016 June 2, 2016   "yoyo" Golden Ratio 5,000,000,000,000

Compute:  67.9 hours

Not Verified

2 x Intel Xeon E5-2696 v4 @ 2.2 GHz
768 GB
April 24, 2016 April 18, 2016   Ron Watkins Log(2) 500,000,000,000

Compute:  12.8 days

Verify:  14.4 days

4 x Xeon X5690 @ 3.47 GHz - 141 GB
April 17, 2016 April 12, 2016   Ron Watkins Catalan's Constant 250,000,000,000

Compute:  204 hours

Verify:  207 hours

4 x Xeon E5-4660 v3 @ 2.1 GHz
1 TB
April 9, 2016 April 3, 2016   Ron Watkins Log(10) 500,000,000,000

Compute:  14.4 days

Verify:  15.2 days

2 x Xeon X5690 @ 3.47 GHz
141 GB
February 8, 2016 February 6, 2016   Mike A Catalan's Constant 500,000,000,000

Compute:  26.1 days

Not Verified

2 x Intel Xeon E5-2697 v3 @ 2.6 GHz
128 GB
July 24, 2015 July 22, 2015
July 23, 2015
Source Ron Watkins
Dustin Kirkland
Golden Ratio 2,000,000,000,000

Compute:  77.3 hours

Verify:  76.33 hours

Compute:  79.3 hours

Verify:  80.8 hours

4 x Xeon X6550 @ 2 GHz - 512 GB
Xeon E5-2676 v3 @ 2.4 GHz - 64 GB
October 8, 2014 October 7, 2014  

Sandon Van Ness

(houkouonchi)

Pi 13,300,000,000,000

Compute:  208 days

Verify:  182 hours

Validation File

2 x Xeon E5-4650L @ 2.6 GHz
192 GB DDR3 @ 1333 MHz
24 x 4 TB + 30 x 3 TB
December 28, 2013 December 28, 2013 Source Shigeru Kondo Pi 12,100,000,000,050

Compute: 94 days

Verify: 46 hours

2 x Xeon E5-2690 @ 2.9 GHz
128 GB DDR3 @ 1600 MHz
24 x 3 TB

See the complete list including other notably large computations.

 

If you wish to set a record, you must:

  1. Run the computation twice using different algorithms.
  2. If using y-cruncher v0.7.5 or later, both computations must be done with "Verify Output" enabled.
  3. The digits from both computations need to match.
  4. Then send me the validation files, but do not make any attempt to modify* them.

*The validation files are protected with a checksum to prevent tampering/cheating. Yes, people have tried to cheat before.
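For step 3, a quick way to check that two digit outputs match before submitting is to compare a hash of each file. The sketch below is only a convenience check on your own output files. It is not the checksum used inside the validation files, and it does not replace sending them. The function names and the file names in the usage comment are hypothetical.

```python
import hashlib


def file_sha256(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks.

    Reading in chunks keeps memory use small even for very large digit files.
    """
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def digits_match(path_a, path_b):
    """True if the two digit files are byte-for-byte identical."""
    return file_sha256(path_a) == file_sha256(path_b)


# Example usage (hypothetical file names):
#   digits_match("compute/digits.txt", "verify/digits.txt")
```

This only works as-is if both runs write the digits in the same format. If the formats differ, the digits would need to be normalized first.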

 

An exception to the "two computations rule" can be made for Pi since it can be verified using BBP formulas.
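The reason this works is that BBP-type formulas can produce hexadecimal digits of Pi at a chosen position without computing all the digits before it. Here is a minimal sketch of the classic Bailey-Borwein-Plouffe digit extraction. It uses ordinary double-precision floats, so it is only reliable for small positions, and it is not the code used for actual record verification. The function names are made up for this example.

```python
def _series(j, n):
    """Fractional part of sum over k >= 0 of 16^(n-k) / (8k + j)."""
    s = 0.0
    # Terms with k <= n: use modular exponentiation to keep numbers small.
    for k in range(n + 1):
        d = 8 * k + j
        s = (s + pow(16, n - k, d) / d) % 1.0
    # Terms with k > n: these shrink by a factor of 16 each step.
    k = n + 1
    while True:
        term = 16.0 ** (n - k) / (8 * k + j)
        if term < 1e-17:
            break
        s += term
        k += 1
    return s % 1.0


def pi_hex_digit(n):
    """Hex digit of Pi at offset n after the point (n = 0 is the first digit)."""
    x = (4 * _series(1, n) - 2 * _series(4, n)
         - _series(5, n) - _series(6, n)) % 1.0
    return "0123456789abcdef"[int(x * 16)]


# Pi in hexadecimal begins 3.243f6a88...
print("".join(pi_hex_digit(i) for i in range(8)))  # prints 243f6a88
```

A real verification would use higher-precision arithmetic and check a run of digits near the end of the computed result rather than at the start.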

 

Note for anyone attempting to set a Pi world record: should the attempt succeed, I kindly ask that you make yourself sufficiently available for external requests to access or download the digits in their entirety (at least until the record is broken again by someone else). Pi is popular enough that people do actually want to see the digits.

 

Features:

 

The main computational features of y-cruncher are:

 

Download:

Sample Screenshot: 100 billion digits of Pi

Core i7 5960X @ 4.0 GHz - 128GB DDR4 @ 2666 MHz - 16 HDs

 

Latest Releases: (February 24, 2018)

| OS | Download Link | Size |
|---|---|---|
| Windows | y-cruncher v0.7.5.9481.zip | 33.4 MB |
| Linux (Static) | y-cruncher v0.7.5.9481-static.tar.gz | 31.8 MB |
| Linux (Dynamic) | y-cruncher v0.7.5.9481-dynamic.tar.gz | 24.7 MB |

The Linux version comes in both statically and dynamically linked versions. The static version should work on most Linux distributions, but lacks Cilk Plus and NUMA binding. The dynamic version supports all features, but is less portable due to the DLL dependency hell.

 

The Windows download comes bundled with the HWBOT submitter which allows benchmarks to be submitted to HWBOT.

 

System Requirements:

Windows:

Linux:

All Systems:

Very old systems that don't meet these requirements may be able to run older versions of y-cruncher. Support goes all the way back to even before Windows XP.

 

Version History:

 

Other Downloads (for C++ programmers):

 

Advanced Documentation:

 

 

Benchmarks:

Comparison Chart: (Last updated: January 20, 2018)

 

Computations of Pi to various sizes. All times in seconds. All computations done entirely in RAM.

The timings include the time needed to convert the digits to decimal representation, but not the time needed to write out the digits to disk.

 

 

Laptops + Low-Power:

| Processor(s): | Core i7 3630QM | VIA C4650 | Pentium N4200 | Xeon E3-1535M v5 | Core i7 6820HK |
|---|---|---|---|---|---|
| Generation: | Intel Ivy Bridge | VIA Isaiah | Intel Apollo Lake | Intel Skylake | Intel Skylake |
| Cores/Threads: | 4/8 | 4/4 | 4/4 | 4/8 | 4/8 |
| Processor Speed: | 3.2 GHz | 2.0 GHz | 1.1 - 2.5 GHz | 2.9 GHz | 3.2 GHz |
| Memory: | 8 GB - 1600 MT/s | 16 GB | 4 GB | 16 GB | 48 GB - 2133 MT/s |
| Version: | v0.7.2 ~ Hina | v0.7.2 ~ Hina | v0.7.2 ~ Ushio | v0.7.1 ~ Kurumi | v0.7.5 ~ Kurumi |
| Instruction Set: | x64 AVX | x64 AVX | x64 SSE4.1 | x64 AVX2 + ADX | x64 AVX2 + ADX |
| 25,000,000 | 3.767 | 17.207 | 11.739 | 1.865 | 1.695 |
| 50,000,000 | 8.496 | 39.049 | 26.289 | 4.102 | 3.721 |
| 100,000,000 | 19.056 | 87.626 | 65.147 | 9.007 | 8.033 |
| 250,000,000 | 55.089 | 277.711 | 192.473 | 25.444 | 22.330 |
| 500,000,000 | 128.311 | 587.516 | 493.551 | 56.566 | 49.150 |
| 1,000,000,000 | 299.217 | 1,350.868 | | 130.055 | 109.197 |
| 2,500,000,000 | | 3,884.838 | | | 308.908 |
| 5,000,000,000 | | | | | 687.168 |
| 10,000,000,000 | | | | | 1,539.122 |
| Credit: | | Tralalak | Kaupo Karuse | | |

 

 

Mainstream Desktops:

| Processor(s): | Core 2 Quad Q6600 | Core i7 920 | FX-8350 | Core i7 4770K | Core i7 5775C | Core i7 7700K | Ryzen 7 1800X |
|---|---|---|---|---|---|---|---|
| Generation: | Intel Core | Intel Nehalem | AMD Piledriver | Intel Haswell | Intel Broadwell | Intel Kaby Lake | AMD Zen |
| Cores/Threads: | 4/4 | 4/8 | 8/8 | 4/8 | 4/8 | 4/8 | 8/16 |
| Processor Speed: | 2.4 GHz | 3.5 GHz (OC) | 4.0 GHz | 4.0 GHz (OC) | 3.8 GHz (OC) | 4.8 GHz (OC) | 3.8 GHz |
| Memory: | 6 GB - 800 MT/s | 12 GB - 1333 MT/s | 32 GB - 1600 MT/s | 32 GB - 2133 MT/s | 16 GB - 2400 MT/s | 64 GB - 3000 MT/s | 64 GB - 2666 MT/s |
| Program Version: | v0.7.2 ~ Kasumi | v0.7.5 ~ Ushio | v0.7.5 ~ Miyu | v0.7.5 ~ Airi | v0.7.1 ~ Kurumi | v0.7.1 ~ Kurumi | v0.7.5 ~ Yukina |
| Instruction Set: | x64 SSE3 | x64 SSE4.1 | x64 AVX + XOP | x64 AVX2 | x64 AVX2 + ADX | x64 AVX2 + ADX | x64 AVX2 + ADX |
| 25,000,000 | 10.591 | 5.046 | 3.419 | 1.565 | 1.730 | 1.271 | 1.319 |
| 50,000,000 | 23.698 | 11.117 | 7.567 | 3.435 | 3.940 | 2.817 | 2.759 |
| 100,000,000 | 53.502 | 24.855 | 16.506 | 7.530 | 8.739 | 6.198 | 5.889 |
| 250,000,000 | 157.269 | 73.794 | 46.288 | 21.232 | 25.073 | 17.384 | 16.175 |
| 500,000,000 | 351.470 | 164.814 | 102.536 | 46.666 | 56.343 | 38.176 | 35.612 |
| 1,000,000,000 | 801.731 | 375.974 | 226.424 | 103.687 | 125.967 | 84.432 | 78.956 |
| 2,500,000,000 | | 1,066.704 | 658.832 | 292.495 | 369.738 | 238.194 | 223.325 |
| 5,000,000,000 | | | 1,458.813 | 642.066 | | 527.186 | 494.441 |
| 10,000,000,000 | | | | | | 1,151.396 | 1,076.301 |
| Credit: | | | | | André Bachmann | Oliver Kruse | |

 

 

High-End Desktops:

Processor(s): Core i7 5820K Core i7 5960X Threadripper 1950X Core i9 7900X Core i9 7940X
Generation: Intel Haswell Intel Haswell AMD Threadripper Intel Skylake X Intel Skylake X
Cores/Threads: 6/12 8/16 16/32 10/20 14/28
Processor Speed: 4.5 GHz (OC) 4.0 GHz (OC) 4.0 GHz (OC)

4.3/4.0/3.6 GHz*

4.7/4.0/3.7 GHz*
3.0 GHz cache 2.8 GHz cache
Memory: 32 GB - 2400 MT/s 64 GB - 2400 MT/s 128 GB - 2800-3200 MT/s 128 GB - 3200 MT/s 128 GB - 3400 MT/s
Program Version: v0.7.3 ~ Airi v0.7.4 ~ Airi v0.7.3 ~ Yukina v0.7.3 ~ Kotori v0.7.5 ~ Kotori v0.7.5 ~ Kotori
Instruction Set: x64 AVX2 x64 AVX2 x64 AVX2 + ADX x64 AVX512-DQ x64 AVX512-DQ
25,000,000 1.287 0.881 0.975 0.746 0.563 0.480
50,000,000 2.499 2.038 1.997 1.445 1.198 1.093
100,000,000 5.401 4.209 3.697 3.054 2.507 2.403
250,000,000 14.732 11.461 9.602 8.182 6.535 5.784
500,000,000 32.294 25.153 20.710 17.740 13.776 11.690
1,000,000,000 71.225 55.194 45.496 38.293 29.723 24.807
2,500,000,000 200.323 154.758 127.040 107.432 82.166 68.032
5,000,000,000 443.543 342.364 279.979 238.768 179.539 147.917
10,000,000,000   745.234 612.269 524.572 392.243 322.117
25,000,000,000     1,910.832 1,560.887 1,109.199 916.517
Credit: Sean Heneghan   Oliver Kruse      

*All-core non-AVX/AVX/AVX512 CPU frequency.

 

 

Multi-Processor Workstation/Servers:

 

Due to high core counts and the effects of NUMA (Non-Uniform Memory Access), performance on multi-processor systems is extremely sensitive to various settings. Therefore, these benchmarks may not be entirely representative of what the hardware is capable of.

Processor(s): Xeon E5-2683 v3 Xeon E5-2687W v4 Xeon E5-2696 v4 Xeon E7-8880 v3 Epyc 7601 Xeon Gold 6130F
Generation: Intel Haswell Intel Broadwell Intel Broadwell Intel Haswell AMD Naples Intel Skylake Purley
Sockets/Cores/Threads: 2/28/56 2/24/48 2/44/88 4/64/128 2/64/128 2/32/64
Processor Speed: 2.03 GHz 3.0 GHz 2.2 GHz 2.3 GHz 2.2 GHz 2.1 GHz
Memory: 128 GB - ??? 64 GB 768 GB - ??? 2 TB - ??? 256 GB - ?? 256 GB - ??
Program Version: v0.6.9 ~ Airi v0.7.4 ~ Kurumi v0.7.1 ~ Kurumi v0.7.1 ~ Airi v0.7.3 ~ Yukina v0.7.3 ~ Kotori
Instruction Set: x64 AVX2 x64 AVX2 + ADX x64 AVX2 + ADX x64 AVX2 x64 AVX2 + ADX x64 AVX512-DQ
25,000,000 0.907 0.705 0.715 1.176 2.459 1.150
50,000,000 1.745 1.372 1.344 2.321 4.347 1.883
100,000,000 3.317 2.726 2.673 4.217 6.996 3.341
250,000,000 8.339 6.947 6.853 8.781 14.258 7.731
500,000,000 17.708 14.454 14.538 15.879 24.930 15.346
1,000,000,000 37.311 30.816 31.260 32.078 47.837 31.301
2,500,000,000 102.131 84.631 84.271 78.251 111.139 82.871
5,000,000,000 218.917 185.02 192.889 164.157 228.252 179.488
10,000,000,000 471.802 396.895 417.322 346.307 482.777 387.530
25,000,000,000 1,511.852 1,126.769 1,186.881 957.966 1,184.144 1,063.850
50,000,000,000   2,478.332 2,601.476 2,096.169    
100,000,000,000     6,037.704 4,442.742    
250,000,000,000       17,428.450    
Credit: Shigeru Kondo Cameron Giesbrecht "yoyo" Jacob Coleman Dave Graham
Processor(s): Xeon X5482 Xeon E5-2690
Generation: Intel Penryn Intel Sandy Bridge
Sockets/Cores/Threads: 2/8/8 2/16/32
Processor Speed: 3.2 GHz 3.5 GHz
Memory: 64 GB - 800 MT/s 256 GB - ???
Program Version: v0.7.2 ~ Ushio v0.7.5 ~ Nagisa v0.6.2/3 ~ Hina
Instruction Set: x64 SSE4.1 x64 AVX
25,000,000 4.548 4.248 2.283
50,000,000 9.779 9.148 4.295
100,000,000 20.834 19.580 8.167
250,000,000 60.049 56.226 20.765
500,000,000 134.978 126.448 42.394
1,000,000,000 308.679 286.903 89.920
2,500,000,000 874.588 824.820 239.154
5,000,000,000 1,946.683 1,836.808 520.977
10,000,000,000 4,317.677 4,000.065 1,131.809
25,000,000,000     3,341.281
50,000,000,000     7,355.076
Credit:     Shigeru Kondo

 

 

Fastest Times:

The full chart of rankings for each size can be found here:

These fastest times may include unreleased betas.


Got a faster time? Let me know: a-yee@u.northwestern.edu

Note that I usually don't respond to these emails. I simply put them into the charts which I update periodically.

 

 

Performance Tips:

 

Decimal Digits of Pi - Times in Seconds

Core i9 7940X @ 3.7 GHz AVX512

| Memory: | 2666 MT/s | 3466 MT/s |
|---|---|---|
| 25,000,000 | 0.839 | 0.758 |
| 50,000,000 | 1.424 | 1.338 |
| 100,000,000 | 2.701 | 2.425 |
| 250,000,000 | 6.489 | 5.877 |
| 500,000,000 | 13.307 | 11.917 |
| 1,000,000,000 | 27.913 | 24.915 |
| 2,500,000,000 | 76.837 | 68.322 |
| 5,000,000,000 | 168.058 | 148.737 |
| 10,000,000,000 | 365.047 | 322.115 |
| 25,000,000,000 | 1,037.527 | 916.039 |

High core count Skylake X processors are known to be heavily bottlenecked by memory bandwidth.

Memory Bandwidth:

 

Because of the memory-intensive nature of computing Pi and other constants, y-cruncher needs a lot of memory bandwidth to perform well. In fact, the program has been noticeably memory bound on nearly all high-end desktops since 2012 as well as the majority of multi-socket systems since at least 2006.

 

Recommendations:

Don't be surprised if y-cruncher exposes instabilities that other applications and stress-tests do not. y-cruncher is unusual in that it simultaneously places a heavy load on both the CPU and the entire memory subsystem.

 

 

 

Parallel Performance:

 

y-cruncher has a lot of settings for tuning parallel performance. By default, it makes a best effort to analyze the hardware and pick the best settings. But because of the virtually unlimited combinations of processor topologies, it's difficult for y-cruncher to pick the best settings for everything. So sometimes the best performance can only be achieved with manual settings.

*These are advanced settings that cannot be changed if you're using the benchmark option in the console UI. To change them, you will need to either run benchmark mode from the command line or use the custom compute menu.

 

Load imbalance is a fairly common problem in y-cruncher. The usual causes are:

  1. The number of logical cores is not a power-of-two.
  2. The cores are not homogeneous. Common reasons include:
    • The cores are clocked at different speeds.
    • The cores have access to different amounts of memory bandwidth due to an imbalanced NUMA topology.
    • The cores are different generation cores hidden behind a virtual machine.
  3. CPU-intensive background processes are interfering with y-cruncher's ability to use all the hardware. This applies to all forms of system jitter.
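To see why a non-power-of-two core count can hurt, here is a toy model. It assumes the work is split by repeated halving into equal-sized tasks, so the task count is a power of two, and that tasks run in rounds with one task per core. This is only an illustrative model with made-up names. It is not a description of how y-cruncher actually schedules work, and the default settings may well do better than this.

```python
import math


def efficiency(tasks, cores):
    """Parallel efficiency when `tasks` equal-sized tasks run on `cores` cores.

    Tasks run in rounds of at most `cores` tasks each, so some cores sit
    idle in the last round whenever `tasks` is not a multiple of `cores`.
    """
    rounds = math.ceil(tasks / cores)
    return tasks / (rounds * cores)


for cores in (4, 6, 8, 12, 16):
    # Smallest power of two that is >= cores (the assumed task count).
    tasks = 1 << (cores - 1).bit_length()
    print(cores, tasks, f"{efficiency(tasks, cores):.0%}")
```

In this model, 6 cores given 8 equal tasks need two rounds and spend a third of the available core time idle, while 4, 8, or 16 cores stay fully busy.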

 

Swap Mode:

 

This is probably one of the most complicated features in y-cruncher.

 

 

Known Issues:

 

Everything in this section is in the process of being re-verified and moved to: https://github.com/Mysticial/y-cruncher/issues

 

 

Performance Issues:


Algorithms and Developments:

 

FAQ:

 

Pi and other Constants:

 

Hardware and Overclocking:

 

Academia:

 

Programming:

 

Program Usage:

 

Other:

 

Links:

Here's some interesting sites dedicated to the computation of Pi and other constants:

 

Questions or Comments

Contact me via e-mail. I'm pretty good with responding unless it gets caught in my school's junk mail filter.