It Would Be Nice to Kill the Myth That GPUs Are 100x Faster, ver.2013

1. It Would Be Nice to Kill the 100x Myth • Kanto GPGPU Study Group (関東GPGPU勉強会) #2 • 山田てるみ

2. About me • 山田てるみ • @telmin_orca • A GPU programmer of sorts

3. 2012∼2013

4. 2012∼2013 • Fermi -> Kepler • Image sources: http://en.wikipedia.org/wiki/File:Nvidia_logo.svg and http://www.nvidia.co.jp/page/home.html

5. 2012∼2013• GCN Architecture

6. 2012∼2013 • Mali • OpenCL

7. 2010

8. Debunking the 100X GPU vs. CPU Myth • Killing the myth that GPUs are 100x faster

9. 3 years later...

10. NVIDIA • Kepler Architecture • GK110 • SM -> SMX • CUDA cores per SM: 32 (48) -> 192!! • 1.03 TFLOPS -> 3.5 TFLOPS!!

11. That which [image] has made

12. MIC Attack

13. Xeon Phi • MIC Architecture • 60 cores, 4 threads per core • 32 KB L1 cache per core • 512 KB L2 cache per core • 512-bit vector unit!

14. Debunking the 100X GPU vs. CPU Myth? ver.2013

15. Important notes • The measurements below do not include host <-> device data transfer time • This follows the original paper • Haswell data has also been added

16. SAXPY • y = a*x + y • Multiply, then add • So little arithmetic per element that it is memory-bound

17. SAXPY • Experimental conditions • Number of elements: 10,000,000
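The slides do not show how the elapsed times were taken. The following is a minimal sketch of one way to time a host-side kernel, assuming `std::chrono::steady_clock` and a best-of-N policy. The names `saxpyRef`, `timeBestMs`, `timeSaxpyMs`, and `repeats` are hypothetical and do not come from the slides.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <limits>
#include <vector>

// Scalar reference kernel with the same shape as simpleSaxpy on the slides.
void saxpyRef(const double* x, double* y, const double a, const std::size_t num)
{
    for (std::size_t i = 0; i < num; ++i) {
        y[i] = a * x[i] + y[i];
    }
}

// Hypothetical helper: runs fn `repeats` times and returns the fastest
// wall-clock time in milliseconds.
template <typename Fn>
double timeBestMs(Fn fn, const int repeats)
{
    assert(repeats > 0);
    double best = std::numeric_limits<double>::max();
    for (int r = 0; r < repeats; ++r) {
        const auto t0 = std::chrono::steady_clock::now();
        fn();
        const auto t1 = std::chrono::steady_clock::now();
        const double ms =
            std::chrono::duration<double, std::milli>(t1 - t0).count();
        if (ms < best) {
            best = ms;
        }
    }
    return best;
}

// Times SAXPY over num elements. Buffers are allocated outside the timed
// region, which mirrors the slides' choice to exclude data transfer time.
double timeSaxpyMs(const std::size_t num, const int repeats)
{
    std::vector<double> x(num, 1.0), y(num, 2.0);
    return timeBestMs([&] { saxpyRef(x.data(), y.data(), 3.0, num); }, repeats);
}
```

Calling `timeSaxpyMs(10000000, 10)` would time the element count used in the experiment. Taking the minimum rather than the mean is a design choice that suppresses warm-up and scheduling noise.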

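18. SAXPY

    #include <cstddef>

    // Scalar reference. The data is double precision, so strictly this is DAXPY.
    void simpleSaxpy(const double* x, double* y, const double a, const size_t num)
    {
        for(size_t i = 0; i < num; ++i) {
            y[i] = a * x[i] + y[i];
        }
    }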

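19. SAXPY OpenMP + AVX

    #include <immintrin.h>
    #include <cstddef>

    void avxSaxpy(const double* x, double* y, const double a, const size_t num)
    {
        const __m256d v_a = _mm256_set1_pd(a);
        // Number of full 4-wide vectors, as int to match the loop index.
        const int num_vec = static_cast<int>(num / 4);

        #pragma omp parallel for
        for(int i = 0; i < num_vec; ++i) {
            __m256d v_x0 = _mm256_loadu_pd(&x[i * 4]);
            __m256d v_y0 = _mm256_loadu_pd(&y[i * 4]);
            __m256d v01 = _mm256_mul_pd(v_a, v_x0);   // a * x
            __m256d v02 = _mm256_add_pd(v01, v_y0);   // a * x + y
            _mm256_storeu_pd(&y[i * 4], v02);
        }

        // Scalar tail for the last num % 4 elements.
        for(size_t i = static_cast<size_t>(num_vec) * 4; i < num; ++i) {
            y[i] = a * x[i] + y[i];
        }
    }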

20. SAXPY CUDA

    __global__ void
    cudaSaxpyKernel(const double* x, double* y, const double a, const int num_elements)
    {
        // One thread per element.
        const int id = blockDim.x * blockIdx.x + threadIdx.x;
        if(id < num_elements) {
            y[id] = a * x[id] + y[id];
        }
    }

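21. SAXPY MIC

    #include <immintrin.h>
    #include <cstddef>

    // x and y must be 64-byte aligned; num is assumed to be a multiple of 8.
    void micSaxpy(const double* x, double* y, const double a, const size_t num)
    {
        const __m512d v_a = _mm512_set1_pd(a);
        const int num_vec = static_cast<int>(num / 8);

        #pragma omp parallel for
        for(int i = 0; i < num_vec; ++i) {
            __m512d v_x = _mm512_load_pd(&x[i * 8]);
            __m512d v_y = _mm512_load_pd(&y[i * 8]);
            __m512d res = _mm512_fmadd_pd(v_x, v_a, v_y);  // x * a + y
            _mm512_storenr_pd(&y[i * 8], res);             // store with no-read hint
        }
    }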

22. SAXPY

                     msec      GFlops     GB/s
    Core i7 2600K    14.077    1.42071    17.0486
    Core i7 4770K    12.448    1.606      19.279
    Titan            0.134     141.461    848.763
    Xeon Phi         1.98      10.095     121.15
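The slides do not state how GFlops and GB/s are derived from the elapsed time. A minimal sketch, assuming 2 floating-point operations per element (one multiply, one add) and 24 bytes of traffic per element (load `x`, load `y`, store `y`, 8 bytes each). Under that assumption the arithmetic intensity is 2/24 = 1/12 flop per byte, which is why the kernel is memory-bound. The accounting reproduces the Core i7 and Xeon Phi rows; the Titan row has a GB/s-to-GFlops ratio of 6 rather than 12, so it appears to use a different accounting. `SaxpyRates` and `saxpyRates` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>

struct SaxpyRates {
    double gflops;          // 10^9 floating-point operations per second
    double gbytes_per_sec;  // 10^9 bytes per second
};

// Assumption: 2 flops and 3 * sizeof(double) = 24 bytes per element.
SaxpyRates saxpyRates(const std::size_t num, const double msec)
{
    assert(msec > 0.0);
    const double seconds = msec * 1e-3;
    const double flops = 2.0 * static_cast<double>(num);
    const double bytes = 3.0 * sizeof(double) * static_cast<double>(num);
    return { flops / seconds * 1e-9, bytes / seconds * 1e-9 };
}
```

For example, `saxpyRates(10000000, 14.077)` gives about 1.42 GFlops and 17.05 GB/s, in line with the Core i7 2600K row.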

23. Histogram

24. Histogram

25. Histogram • Experimental conditions • 1920x1080 image • Bins: 256

26. Histogram

    #include <cstddef>
    #include <vector>

    // dst must hold 256 bins, zero-initialized by the caller.
    void simpleHistogram(const unsigned char* src, std::vector<int>& dst,
                         const size_t width, const size_t height)
    {
        // grayscale
        for(size_t y = 0; y < height; ++y) {
            for(size_t x = 0; x < width; ++x) {
                unsigned char val = src[y * width + x];
                dst[val]++;
            }
        }
    }

27. Histogram OpenMP

    #include <cstddef>
    #include <vector>

    void openMPHistogram(const unsigned char* src, std::vector<int>& dst,
                         const size_t width, const size_t height)
    {
        #pragma omp parallel
        {
            // Per-thread private histogram, so the hot loop needs no locking.
            std::vector<int> local_dst(256);

            #pragma omp for
            for(size_t y = 0; y < height; ++y) {
                for(size_t x = 0; x < width; ++x) {
                    unsigned char val = src[y * width + x];
                    local_dst[val]++;
                }
            }

            // Merge the private histograms one thread at a time.
            #pragma omp critical
            {
                for(size_t i = 0; i < 256; ++i) {
                    dst[i] += local_dst[i];
                }
            }
        }
    }

28. Histogram CUDA

    __global__ void
    histogram_cuda_kernel(const unsigned char* src, int* dst,
                          const unsigned int width, const unsigned int height,
                          const unsigned int num_elements)
    {
        // One thread per pixel.
        int idx = blockDim.x * blockIdx.x + threadIdx.x;
        int x = idx % width;
        int y = idx / width;
        if(idx < num_elements) {
            unsigned char val = src[y * width + x];
            // All threads update the same 256 bins in global memory.
            atomicAdd(&dst[val], 1);
        }
    }

29. Histogram

                     msec      MPixel/s
    Core i7 2600K    0.734     2823.8
    Core i7 4770K    0.273     3370.66
    Titan            0.0816    25381.3
    Xeon Phi         117.5     17.6463

30. Histogram MIC

    #include <omp.h>
    #include <cstddef>
    #include <cstring>
    #include <vector>

    // Hard-coded for height = 1080 and 240 threads: 1080 / 240 = 4.5, so
    // even-numbered threads take 5 rows and odd-numbered threads take 4.
    void micHistogram_240(const unsigned char* src, int* dst,
                          const size_t width, const size_t height)
    {
        #pragma omp parallel num_threads(240)
        {
            const size_t thread_id = omp_get_thread_num();
            const size_t num_threads = omp_get_num_threads();

            size_t local_height = height / num_threads;
            local_height += (thread_id % 2)? 0 : 1;
            // First row of this thread's strip.
            const size_t offset = 5 * thread_id - (thread_id / 2);

            std::vector<int> local_dst(256);
            // Copy the strip into thread-local memory before counting.
            std::vector<unsigned char> local_src(local_height * width);
            memcpy(&local_src[0], &src[offset * width],
                   sizeof(unsigned char) * local_height * width);

            for(size_t y = 0; y < local_height; ++y) {
                for(size_t x = 0; x < width; ++x) {
                    size_t val = local_src[y * width + x];
                    local_dst[val]++;
                }
            }

            #pragma omp critical
            {
                for(size_t i = 0; i < 256; ++i) {
                    dst[i] += local_dst[i];
                }
            }
        }
    }
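The strip offset on slide 30, `5 * thread_id - (thread_id / 2)`, only works for 1080 rows split over 240 threads. A minimal sketch of a general split for any height and thread count follows. It is not the author's method: it gives the extra rows to the lowest thread IDs instead of alternating 5 and 4 rows, and `RowRange` and `rowsForThread` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>

struct RowRange {
    std::size_t begin;  // first row owned by the thread
    std::size_t count;  // number of rows owned by the thread
};

// Splits `height` rows over `num_threads` threads so that the row counts
// differ by at most one and the strips are contiguous and non-overlapping.
RowRange rowsForThread(const std::size_t height,
                       const std::size_t num_threads,
                       const std::size_t thread_id)
{
    assert(num_threads > 0 && thread_id < num_threads);
    const std::size_t base = height / num_threads;   // rows every thread gets
    const std::size_t extra = height % num_threads;  // leftover rows
    // The first `extra` threads take one additional row each.
    const std::size_t count = base + (thread_id < extra ? 1 : 0);
    const std::size_t begin =
        thread_id * base + (thread_id < extra ? thread_id : extra);
    return { begin, count };
}
```

Inside the parallel region, `rowsForThread(height, num_threads, thread_id)` could supply the values that `offset` and `local_height` hold on the slide.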

31. Histogram

                     msec      MPixel/s
    Core i7 2600K    0.734     2823.8
    Core i7 4770K    0.273     3370.66
    Titan            0.0816    25381.3
    Xeon Phi         2.074     999.806

32. NL-means • Non-local Algorithm • A non-local algorithm for image denoising • http://bengal.missouri.edu/~kes25c/nl2.pdf • A relative of the bilateral filter

33. NL-means • An edge-preserving filter • Removes noise while resisting blur! • Plugins exist for tools such as AviUtl

34. NL-means • http://opencv.jp/opencv2-x-samples/non-local-means-filter by fukushima1981.

35. NL-means

36. NL-means • Experimental conditions • 1920x1080 image • Window size: 7x7 • Template size: 3x3
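The slides give no NL-means code, so the following is a minimal scalar sketch of the filter with the sizes listed above. It assumes that "window" means the 7x7 search window and "template" the 3x3 patch, that the image is single-channel float, that borders are clamped, and that weights are `exp(-d / h^2)` with `d` the mean squared patch difference. The function names and the strength parameter `h` are hypothetical; this is not the benchmarked implementation.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Reads a pixel with coordinates clamped to the image (assumed border policy).
static float pixelClamped(const std::vector<float>& img,
                          const int width, const int height, int x, int y)
{
    x = std::min(std::max(x, 0), width - 1);
    y = std::min(std::max(y, 0), height - 1);
    return img[static_cast<std::size_t>(y) * width + x];
}

std::vector<float> nlMeans(const std::vector<float>& src,
                           const int width, const int height, const float h)
{
    assert(width > 0 && height > 0 && h > 0.0f);
    assert(src.size() == static_cast<std::size_t>(width) * height);
    const int search_radius = 3;    // 7x7 search window
    const int template_radius = 1;  // 3x3 template
    const float template_area = 9.0f;
    std::vector<float> dst(src.size());

    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            float weight_sum = 0.0f;
            float value_sum = 0.0f;
            for (int wy = -search_radius; wy <= search_radius; ++wy) {
                for (int wx = -search_radius; wx <= search_radius; ++wx) {
                    // Mean squared difference between the template around
                    // (x, y) and the template around the candidate pixel.
                    float dist = 0.0f;
                    for (int ty = -template_radius; ty <= template_radius; ++ty) {
                        for (int tx = -template_radius; tx <= template_radius; ++tx) {
                            const float p = pixelClamped(src, width, height,
                                                         x + tx, y + ty);
                            const float q = pixelClamped(src, width, height,
                                                         x + wx + tx, y + wy + ty);
                            dist += (p - q) * (p - q);
                        }
                    }
                    dist /= template_area;
                    // Similar patches get weights near 1, dissimilar near 0.
                    const float w = std::exp(-dist / (h * h));
                    weight_sum += w;
                    value_sum += w * pixelClamped(src, width, height,
                                                  x + wx, y + wy);
                }
            }
            // weight_sum >= 1 because the center candidate has weight 1.
            dst[static_cast<std::size_t>(y) * width + x] = value_sum / weight_sum;
        }
    }
    return dst;
}
```

With these sizes each output pixel costs 49 x 9 = 441 squared differences, so a 1920x1080 frame needs roughly 0.9 billion of them. Every output pixel is independent, which makes the filter a natural fit for wide parallel hardware.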

37. NL-means

                     sec        FPS
    Core i7 2600K    2.086      0.479
    Core i7 4770K    2.29       0.436
    Titan            0.05826    17.16
    Xeon Phi         1.217      0.822

38. Aobench • Ambient Occlusion • Also covered last time • Intel also uses it as a sample: http://software.intel.com/en-us/articles/data-and-thread-parallelism/

39. Aobench • Experimental conditions • 512x512 image • NSUBSAMPLE: 2 • NTHETA: 16 • NPHI: 16
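To give a sense of scale for these parameters, here is a rough ray count. It is a sketch under an assumption about the benchmark's structure that the slides do not state: each pixel is sampled NSUBSAMPLE x NSUBSAMPLE times, and each sample that hits the scene casts NTHETA x NPHI occlusion rays. The result is therefore an upper bound, and `aoRaysUpperBound` is a hypothetical name.

```cpp
#include <cassert>

// Upper bound on ambient-occlusion rays per frame, assuming every primary
// sample hits geometry and then casts ntheta * nphi occlusion rays.
unsigned long long aoRaysUpperBound(const unsigned long long width,
                                    const unsigned long long height,
                                    const unsigned long long nsubsample,
                                    const unsigned long long ntheta,
                                    const unsigned long long nphi)
{
    assert(width > 0 && height > 0 && nsubsample > 0);
    const unsigned long long samples_per_pixel = nsubsample * nsubsample;
    const unsigned long long rays_per_sample = ntheta * nphi;
    return width * height * samples_per_pixel * rays_per_sample;
}
```

With the conditions above, `aoRaysUpperBound(512, 512, 2, 16, 16)` is 268,435,456, that is, at most 1024 occlusion rays per pixel.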

40. Aobench

                     sec       FPS
    Core i7 2600K    1.556     0.642
    Core i7 4770K    1.448     0.6905
    Titan            0.0162    61.71
    Xeon Phi         0.9       1.11

41. [Chart: relative performance in percent, axis from 0% to 12000%, for SAXPY, Histogram, NL-means, and Aobench; series: Core i7 2600K, Core i7 4770K, Titan, Xeon Phi]
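The relative figures can be recomputed from the timing tables on the earlier slides. A minimal sketch, assuming the baseline is the Core i7 2600K at 100% (the transcript does not state the baseline); `speedup` and `relativePercent` are hypothetical names.

```cpp
#include <cassert>

// Speedup of a device over a baseline, given elapsed times in the same unit.
double speedup(const double baseline_time, const double device_time)
{
    assert(baseline_time > 0.0 && device_time > 0.0);
    return baseline_time / device_time;
}

// The same ratio as a percentage, the form used on a relative-performance chart.
double relativePercent(const double baseline_time, const double device_time)
{
    return 100.0 * speedup(baseline_time, device_time);
}
```

Using the table values for Titan against the Core i7 2600K, the ratios are about 105x for SAXPY (14.077 / 0.134), 9.0x for Histogram (0.734 / 0.0816), 35.8x for NL-means (2.086 / 0.05826), and 96x for Aobench (1.556 / 0.0162).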

42. Conclusion • A future in which the CPU beats the GPU is a story for a little further down the road...

Ryo Sakamoto
