Take Control of Your GPU | A Minimal Introductory Tutorial on CUDA Programming
 
 2018-8-20 
 
Editor's note:
This article comes from CSDN. It covers the basics of the CUDA programming model, a vector addition example, a matrix multiplication example, and more.

ǰÑÔ

In 2006 NVIDIA released CUDA (http://docs.nvidia.com/cuda/), a general-purpose parallel computing platform and programming model built on NVIDIA GPUs. Programming with CUDA lets you use the GPU's parallel compute engine to solve complex computational problems much more efficiently. In recent years one of the most successful GPU applications has been deep learning, where GPU-based parallel computing has become standard equipment for training models. At the time of writing, the latest CUDA release is CUDA 9.

The GPU is not a standalone computing platform; it has to cooperate with the CPU and can be regarded as the CPU's coprocessor. So when we talk about GPU parallel computing, we really mean a heterogeneous CPU+GPU computing architecture, in which the GPU and the CPU are connected by the PCIe bus and work together. The CPU side is called the host and the GPU side is called the device, as shown in the figure below.

Heterogeneous computing based on CPU+GPU. Source: Professional CUDA C Programming

As the figure shows, the GPU has far more compute cores and is particularly well suited to data-parallel, compute-intensive tasks such as large matrix operations, whereas the CPU has fewer cores but handles complex control logic well, making it better suited to control-intensive tasks. In addition, CPU threads are heavyweight and context switches are expensive, while GPU threads are lightweight because the GPU has so many cores. A heterogeneous CPU+GPU platform therefore lets the two complement each other: the CPU handles serial code with complex logic, and the GPU focuses on data-intensive parallel computation, so each does what it does best.

Execution logic of a CPU+GPU heterogeneous computing application. Source: Professional CUDA C Programming

CUDA is the GPU programming model developed by NVIDIA. It provides a simple interface for programming GPUs, on top of which GPU-accelerated applications can be built. CUDA supports several programming languages, such as C/C++, Python, and Fortran; here we use the CUDA C/C++ interface to explain CUDA programming. The development platform is Windows 10 + VS 2013; a CUDA installation guide for Windows is available at http://docs.nvidia.com/cuda/cuda-installation-guide-microsoft-windows/index.html

1 CUDA Programming Model Basics

Before presenting any CUDA code, this section briefly introduces some concepts and fundamentals of the CUDA programming model. The CUDA programming model is a heterogeneous model in which the CPU and the GPU cooperate. host and device are two key concepts in CUDA: host refers to the CPU and its memory, while device refers to the GPU and its memory. A CUDA program contains both host code and device code, which run on the CPU and the GPU respectively, and host and device can communicate so that data can be copied between them. A typical CUDA program executes as follows:

1. Allocate host memory and initialize the data;

2. Allocate device memory and copy the data from host to device;

3. Launch CUDA kernels on the device to perform the desired computation;

4. Copy the results from the device back to the host;

5. Free the memory allocated on both the device and the host.

The most important step in the flow above is launching a CUDA kernel to perform the parallel computation. A kernel (http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#kernels) is a function that executes in parallel across many threads on the device. A kernel is declared with the __global__ qualifier, and its launch must specify the number of threads with <<<grid, block>>>. Every one of those threads executes the kernel, and each thread is assigned a unique thread ID, which can be read inside the kernel through the built-in variable threadIdx.
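As a minimal hedged sketch (the kernel name my_kernel and the launch numbers are illustrative, not from the article), a complete kernel definition and launch looks like this:

#include <cstdio>

// Every launched thread executes this function on the device
__global__ void my_kernel()
{
    // threadIdx.x is this thread's index within its block
    printf("Hello from thread %d\n", threadIdx.x);
}

int main()
{
    // Launch 1 block of 8 threads; all 8 threads run my_kernel in parallel
    my_kernel<<<1, 8>>>();
    // Kernel launches are asynchronous, so wait for completion before exiting
    cudaDeviceSynchronize();
    return 0;
}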

Because the GPU is part of a heterogeneous model, host code and device code must be distinguished. In CUDA this is done with function type qualifiers; the three main qualifiers are:

__global__: executes on the device and is called from the host (on some specific GPUs it can also be called from the device). The return type must be void, variadic arguments are not supported, and it cannot be a class member function. Note that a kernel defined with __global__ is launched asynchronously, meaning the host does not wait for the kernel to finish before moving on.

__device__: executes on the device and can only be called from device code; it cannot be combined with __global__.

__host__: executes on the host and can only be called from host code. It is usually omitted. It cannot be combined with __global__, but it can be combined with __device__, in which case the function is compiled for both the device and the host.
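For illustration (the function names below are hypothetical, not from the article), the three qualifiers are written like this; combining __host__ and __device__ compiles the function for both sides:

// Runs on the device, launched from host code; must return void
__global__ void kernelFunc(float *data);

// Runs on the device, callable only from device code (e.g. from a kernel)
__device__ float deviceHelper(float x);

// Compiled for both host and device, callable from either side
__host__ __device__ float square(float x) { return x * x; }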

To understand kernels deeply, you need a clear picture of their thread hierarchy. A kernel is executed by a large number of lightweight threads on the GPU. All the threads launched by one kernel form a grid; threads in the same grid share the same global memory space. The grid is the first level of the hierarchy. A grid is further divided into thread blocks, and each block contains many threads; this is the second level. The figure below shows a two-level thread organization in which both the grid and the blocks are 2-dim. grid and block are both variables of type dim3, which can be viewed as a struct with three unsigned integer members (x, y, z), each defaulting to 1, so grids and blocks can be flexibly defined as 1-dim, 2-dim, or 3-dim structures. For the structure in the figure (with the horizontal direction as the x axis), the grid and block are defined as shown below. When a kernel is launched, the execution configuration (http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#execution-configuration) <<<grid, block>>> must specify the number and structure of the threads the kernel will use.

dim3 grid(3, 2);
dim3 block(4, 3);
kernel_fun<<<grid, block>>>(params...);

So a thread is uniquely identified by two built-in coordinate variables (blockIdx, threadIdx), both of type dim3: blockIdx gives the position of the thread's block within the grid, and threadIdx gives the thread's position within its block. Thread (1, 1) in the figure therefore satisfies:

threadIdx.x = 1
threadIdx.y = 1
blockIdx.x = 1
blockIdx.y = 1

Ò»¸öÏ߳̿éÉϵÄÏß³ÌÊÇ·ÅÔÚͬһ¸öÁ÷ʽ¶à´¦ÀíÆ÷£¨SM)Éϵ쬵«Êǵ¥¸öSMµÄ×ÊÔ´ÓÐÏÞ£¬Õâµ¼ÖÂÏ߳̿éÖеÄÏß³ÌÊýÊÇÓÐÏÞÖÆµÄ£¬ÏÖ´úGPUsµÄÏ߳̿é¿ÉÖ§³ÖµÄÏß³ÌÊý¿É´ï1024¸ö¡£ÓÐʱºò£¬ÎÒÃÇÒªÖªµÀÒ»¸öÏß³ÌÔÚblcokÖеÄÈ«¾ÖID£¬´Ëʱ¾Í±ØÐ뻹Ҫ֪µÀblockµÄ×éÖ¯½á¹¹£¬ÕâÊÇͨ¹ýÏ̵߳ÄÄÚÖñäÁ¿blockDimÀ´»ñµÃ¡£Ëü»ñÈ¡Ï߳̿é¸÷¸öά¶ÈµÄ´óС¡£¶ÔÓÚÒ»¸ö2-dimµÄblock640?wx_fmt=png£¬Ị̈߳¨x,y)µÄIDֵΪ640?wx_fmt=png,Èç¹ûÊÇ3-dimµÄblock640?wx_fmt=png,Ïß³Ì(x,y,z)µÄIDֵΪ640?wx_fmt=png¡£ÁíÍâÏ̻߳¹ÓÐÄÚÖñäÁ¿gridDim£¬ÓÃÓÚ»ñµÃÍø¸ñ¿é¸÷¸öά¶ÈµÄ´óС¡£

This thread organization is a natural fit for vector and matrix operations. For example, using the 2-dim structure in the figure above we can implement the addition of two matrices, with each thread adding the pair of elements at its position; the code is shown below. The thread block size is (16, 16), and the N*N matrix is divided evenly among the thread blocks that perform the addition.

// Kernel definition
__global__ void MatAdd(float A[N][N], float B[N][N], float C[N][N])
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    if (i < N && j < N)
        C[i][j] = A[i][j] + B[i][j];
}

int main()
{
    ...
    // Kernel thread configuration
    dim3 threadsPerBlock(16, 16);
    dim3 numBlocks(N / threadsPerBlock.x, N / threadsPerBlock.y);
    // Kernel invocation
    MatAdd<<<numBlocks, threadsPerBlock>>>(A, B, C);
    ...
}

In addition, here is a brief look at the CUDA memory model, shown in the figure below. Each thread has its own private local memory, and each thread block has shared memory that is visible to all threads in the block and whose lifetime matches that of the block. Beyond that, all threads can access global memory, as well as some read-only memory spaces: constant memory and texture memory. The memory hierarchy matters for program optimization, but we will not go deeper into it here.

The CUDA memory model
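As a small hedged sketch (not part of the article's examples), this is how the shared memory mentioned above can be declared and used inside a kernel; the block size of 256 and the reversal logic are assumptions made just for this illustration:

// Illustrative only: each block stages its slice of the input in shared memory.
// Assumes blockDim.x == 256 and the array length is a multiple of 256.
__global__ void sharedDemo(const float *in, float *out)
{
    __shared__ float tile[256];                    // visible to the whole block
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = in[i];
    __syncthreads();                               // wait until the tile is filled
    // Example use: write the block's elements out in reversed order
    out[i] = tile[blockDim.x - 1 - threadIdx.x];
}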

One more important point: you need a basic understanding of how the GPU hardware carries all this out. As described above, a kernel is organized into a hierarchy of threads; it actually launches a very large number of threads, and these threads are logically parallel but not necessarily all physically parallel. This is similar to CPU multithreading: without multiple cores, multiple CPU threads cannot physically run in parallel either. Fortunately a GPU contains a great many CUDA cores, and making full use of them is how you unlock the GPU's parallel computing power.

A core building block of the GPU hardware is the SM, the Streaming Multiprocessor. An SM contains CUDA cores, shared memory, registers, and so on, and can concurrently execute hundreds of threads; its concurrency is limited by the resources it has. When a kernel is executed, the thread blocks of its grid are distributed across SMs; a given thread block is scheduled on exactly one SM, while one SM can generally host several thread blocks, depending on its capacity. The thread blocks of one kernel may therefore be spread over multiple SMs, so the grid is only a logical layer, and the SM is the physical layer where execution happens.

The SM uses a SIMT architecture (Single-Instruction, Multiple-Thread, http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#simt-architecture). Its basic execution unit is the warp, a group of 32 threads that execute the same instruction at the same time; each thread nevertheless has its own instruction address counter and register state, and hence its own execution path. So although the threads of a warp start from the same program address, they can behave differently: at a branch, some threads may take the branch while the others do not and simply have to wait, because the GPU requires all threads of a warp to execute the same instruction in a given cycle. This warp divergence degrades performance. When a thread block is assigned to an SM it is further partitioned into warps, since the warp is the SM's basic execution unit, but the number of warps an SM can run concurrently is limited: the SM must allocate shared memory to each block and private registers to every thread of every warp, so the SM's resources constrain how many blocks and warps it can keep in flight.

In short, grids and blocks are only a logical partition; the threads of a kernel are not necessarily all physically concurrent, so different grid and block configurations for the same kernel can yield different performance, which is something to watch out for. Also, because the SM's basic execution unit is the 32-thread warp, the block size should generally be set to a multiple of 32.

The logical layer and the physical layer of CUDA programming
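As a hedged sketch of what the warp divergence described above looks like in code (both kernels are illustrative, not from the article, and assume a 1-dim block whose size is a multiple of 32):

// Kernel A: even and odd threads within the same warp take different branches,
// so each warp has to execute both paths one after the other.
__global__ void divergentKernel(float *data)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i % 2 == 0)
        data[i] = data[i] * 2.0f;   // half of each warp
    else
        data[i] = data[i] + 1.0f;   // the other half waits, then runs
}

// Kernel B: the branch granularity is a whole warp (32 threads),
// so no single warp ever has to execute both paths.
__global__ void uniformKernel(float *data)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if ((i / 32) % 2 == 0)
        data[i] = data[i] * 2.0f;
    else
        data[i] = data[i] + 1.0f;
}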

Before writing CUDA programs it is worth checking your GPU's hardware configuration so that you know what you are working with. The following program queries the device properties:

int dev = 0;
cudaDeviceProp devProp;
// CHECK is an error-checking wrapper around CUDA calls (see the sketch below)
CHECK(cudaGetDeviceProperties(&devProp, dev));
std::cout << "Using GPU device " << dev << ": " << devProp.name << std::endl;
std::cout << "Number of SMs: " << devProp.multiProcessorCount << std::endl;
std::cout << "Shared memory per block: " << devProp.sharedMemPerBlock / 1024.0 << " KB" << std::endl;
std::cout << "Max threads per block: " << devProp.maxThreadsPerBlock << std::endl;
std::cout << "Max threads per SM: " << devProp.maxThreadsPerMultiProcessor << std::endl;
std::cout << "Max warps per SM: " << devProp.maxThreadsPerMultiProcessor / 32 << std::endl;
// Output:
Using GPU device 0: GeForce GT 730
Number of SMs: 2
Shared memory per block: 48 KB
Max threads per block: 1024
Max threads per SM: 2048
Max warps per SM: 64

Admittedly the GT 730 is a rather weak card, with only 2 SMs...
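The CHECK used above is not defined in the article; a minimal error-checking macro along these lines (an assumption on my part, not the article's code) would do:

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Minimal error-checking macro: abort with file/line info if a CUDA call fails
#define CHECK(call)                                                     \
do {                                                                    \
    cudaError_t err = (call);                                           \
    if (err != cudaSuccess) {                                           \
        fprintf(stderr, "CUDA error %s at %s:%d\n",                     \
                cudaGetErrorString(err), __FILE__, __LINE__);           \
        exit(EXIT_FAILURE);                                             \
    }                                                                   \
} while (0)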

2 Vector Addition Example

Now that we know the CUDA programming basics, let's try a simple hands-on exercise: implementing the addition of two vectors with CUDA. Before the implementation, here is a brief look at the memory-management API in CUDA. First, the cudaMalloc function allocates memory on the device:

cudaError_t cudaMalloc(void** devPtr, size_t size);

This function is similar to malloc in C, except that it allocates the requested number of bytes of device memory; devPtr is a pointer to the allocated memory. The allocated memory is released with the cudaFree function, the counterpart of free in C. Another important function is cudaMemcpy, which handles data transfer between host and device:

cudaError_t cudaMemcpy(void* dst, const void* src, size_t count, cudaMemcpyKind kind);

Here src points to the source data and dst to the destination, count is the number of bytes to copy, and kind controls the direction of the copy: cudaMemcpyHostToHost, cudaMemcpyHostToDevice, cudaMemcpyDeviceToHost, and cudaMemcpyDeviceToDevice. For example, cudaMemcpyHostToDevice copies data from host to device.
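As a quick hedged illustration of these calls used together (separate from the full example that follows; the buffer size is arbitrary):

// Allocate device memory, copy data over, and copy results back
float h_buf[1024];                                 // host data, assumed already initialized
float *d_buf = NULL;
cudaMalloc((void**)&d_buf, 1024 * sizeof(float));
cudaMemcpy(d_buf, h_buf, 1024 * sizeof(float), cudaMemcpyHostToDevice);
// ... launch kernels that read and write d_buf ...
cudaMemcpy(h_buf, d_buf, 1024 * sizeof(float), cudaMemcpyDeviceToHost);
cudaFree(d_buf);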

Now let's implement a vector addition example with both the grid and the block designed as 1-dim. First we define the kernel:

// Vector addition kernel; grid and block are both one-dimensional
__global__ void add(float* x, float* y, float* z, int n)
{
    // Global index of this thread
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    // Stride: total number of threads in the grid
    int stride = blockDim.x * gridDim.x;
    for (int i = index; i < n; i += stride)
    {
        z[i] = x[i] + y[i];
    }
}

Here stride is the total number of threads in the grid. When a vector has more elements than there are threads, each thread can handle several elements (total elements / total threads), as if several grids were used; this is the grid-stride loop pattern (https://devblogs.nvidia.com/cuda-pro-tip-write-flexible-kernels-grid-stride-loops/). In the example below, however, each thread processes exactly one element, so the loop in the kernel runs only a single iteration per thread (a launch that actually exercises the loop is sketched after the listing). Now let's implement the vector addition in full:

int main()
{
    int N = 1 << 20;
    int nBytes = N * sizeof(float);
    // Allocate host memory
    float *x, *y, *z;
    x = (float*)malloc(nBytes);
    y = (float*)malloc(nBytes);
    z = (float*)malloc(nBytes);
    // Initialize the data
    for (int i = 0; i < N; ++i)
    {
        x[i] = 10.0;
        y[i] = 20.0;
    }
    // Allocate device memory
    float *d_x, *d_y, *d_z;
    cudaMalloc((void**)&d_x, nBytes);
    cudaMalloc((void**)&d_y, nBytes);
    cudaMalloc((void**)&d_z, nBytes);
    // Copy host data to the device
    cudaMemcpy((void*)d_x, (void*)x, nBytes, cudaMemcpyHostToDevice);
    cudaMemcpy((void*)d_y, (void*)y, nBytes, cudaMemcpyHostToDevice);
    // Define the kernel's execution configuration
    dim3 blockSize(256);
    dim3 gridSize((N + blockSize.x - 1) / blockSize.x);
    // Launch the kernel
    add<<<gridSize, blockSize>>>(d_x, d_y, d_z, N);
    // Copy the result from the device back to the host
    cudaMemcpy((void*)z, (void*)d_z, nBytes, cudaMemcpyDeviceToHost);
    // Check the result
    float maxError = 0.0;
    for (int i = 0; i < N; i++)
        maxError = fmax(maxError, fabs(z[i] - 30.0));
    std::cout << "Max error: " << maxError << std::endl;
    // Free device memory
    cudaFree(d_x);
    cudaFree(d_y);
    cudaFree(d_z);
    // Free host memory
    free(x);
    free(y);
    free(z);
    return 0;
}
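As promised above, here is a hedged sketch of a launch configuration under which the grid-stride loop does real work; the numbers are chosen purely for illustration and reuse the device pointers from the listing above:

// Launch only 32 blocks of 256 threads (8192 threads total) for N = 1 << 20
// elements; each thread then walks the loop N / 8192 = 128 times.
dim3 blockSize(256);
dim3 gridSize(32);
add<<<gridSize, blockSize>>>(d_x, d_y, d_z, N);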

Here the vector size is 1 << 20 and the block size is 256, so the grid size is 4096. The kernel's thread hierarchy is shown in the figure below:

The kernel's thread hierarchy. Source: https://devblogs.nvidia.com/even-easier-introduction-cuda/

The nvprof tool can be used to profile the kernel; the result is shown below, where the kernel takes roughly 1.5 ms.

nvprof cuda9.exe
==7244== NVPROF is profiling process 7244, command: cuda9.exe
Max error: 4.31602e+008
==7244== Profiling application: cuda9.exe
==7244== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 67.57% 3.2256ms 2 1.6128ms 1.6017ms 1.6239ms [CUDA memcpy HtoD]
32.43% 1.5478ms 1 1.5478ms 1.5478ms 1.5478ms add(float*, float*, float*, int)

You can adjust the block size and compare how the kernel runs under different configurations. In my tests the kernel takes about 1.6 ms with a block size of 128, about 1.7 ms with 512, and about 2.3 ms with 64. So a larger block is not always better; the size has to be chosen appropriately.

In the implementation above we had to allocate memory separately on the host and the device and copy the data back and forth, which is error-prone. Fortunately, CUDA 6.0 introduced Unified Memory (http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#um-unified-memory-programming-hd) to avoid this hassle. In short, unified memory uses a single managed memory space shared by host and device, with data transfers between them handled automatically. Managed memory is allocated with the cudaMallocManaged function:

cudaError_t cudaMallocManaged(void **devPtr, size_t size, unsigned int flag=0);

With unified memory, the program above can be simplified as follows:

int main()
{
    int N = 1 << 20;
    int nBytes = N * sizeof(float);
    // Allocate managed memory
    float *x, *y, *z;
    cudaMallocManaged((void**)&x, nBytes);
    cudaMallocManaged((void**)&y, nBytes);
    cudaMallocManaged((void**)&z, nBytes);
    // Initialize the data
    for (int i = 0; i < N; ++i)
    {
        x[i] = 10.0;
        y[i] = 20.0;
    }
    // Define the kernel's execution configuration
    dim3 blockSize(256);
    dim3 gridSize((N + blockSize.x - 1) / blockSize.x);
    // Launch the kernel
    add<<<gridSize, blockSize>>>(x, y, z, N);
    // Synchronize with the device so the result can be accessed safely
    cudaDeviceSynchronize();
    // Check the result
    float maxError = 0.0;
    for (int i = 0; i < N; i++)
        maxError = fmax(maxError, fabs(z[i] - 30.0));
    std::cout << "Max error: " << maxError << std::endl;
    // Free the memory
    cudaFree(x);
    cudaFree(y);
    cudaFree(z);
    return 0;
}

Compared with the previous code, the unified-memory version is more concise. Note that kernel execution is asynchronous with respect to the host and the managed memory is transferred automatically, so cudaDeviceSynchronize() is needed here to synchronize device and host; only after that can the kernel's results be accessed correctly.

3 Matrix Multiplication Example

Finally, let's implement a slightly more involved example: the multiplication of two matrices. Given input matrices A and B, we want C = A * B. The idea is that each thread computes one element of C, namely C(i, j) = Σ_k A(i, k) * B(k, j); for a matrix operation like this, 2-dim grids and blocks are the natural choice. First we define a struct for the matrix:

// Matrix type, row-major: M(row, col) = *(M.elements + row * M.width + col)
struct Matrix
{
    int width;
    int height;
    float *elements;
    Matrix(int w, int h, float* e = NULL)
    {
        width = w;
        height = h;
        elements = e;
    }
};

The matrix multiplication scheme

Then we implement the matrix multiplication kernel, together with two helper __device__ functions for reading and writing matrix elements:

// Read element (row, col) of matrix A
__device__ float getElement(const Matrix A, int row, int col)
{
    return A.elements[row * A.width + col];
}

// Write element (row, col) of matrix A
__device__ void setElement(Matrix A, int row, int col, float value)
{
    A.elements[row * A.width + col] = value;
}

// Matrix multiplication kernel, 2-D; each thread computes one element of C
// (no bounds check: width and height are assumed to be multiples of the block dimensions)
__global__ void matMulKernel(const Matrix A, const Matrix B, Matrix C)
{
    float Cvalue = 0.0;
    int row = threadIdx.y + blockIdx.y * blockDim.y;
    int col = threadIdx.x + blockIdx.x * blockDim.x;
    for (int i = 0; i < A.width; ++i)
    {
        Cvalue += getElement(A, row, i) * getElement(B, i, col);
    }
    setElement(C, row, col, Cvalue);
}

Finally, we use unified memory to write a test program for the matrix multiplication:

int main()
{
    int width = 1 << 10;
    int height = 1 << 10;
    Matrix A(width, height, NULL);
    Matrix B(width, height, NULL);
    Matrix C(width, height, NULL);
    int nBytes = width * height * sizeof(float);
    // Allocate managed memory
    cudaMallocManaged((void**)&A.elements, nBytes);
    cudaMallocManaged((void**)&B.elements, nBytes);
    cudaMallocManaged((void**)&C.elements, nBytes);
    // Initialize the data
    for (int i = 0; i < width * height; ++i)
    {
        A.elements[i] = 1.0;
        B.elements[i] = 2.0;
    }
    // Define the kernel's execution configuration
    dim3 blockSize(32, 32);
    dim3 gridSize((width + blockSize.x - 1) / blockSize.x,
                  (height + blockSize.y - 1) / blockSize.y);
    // Launch the kernel
    matMulKernel<<<gridSize, blockSize>>>(A, B, C);
    // Synchronize with the device so the result can be accessed safely
    cudaDeviceSynchronize();
    // Check the result
    float maxError = 0.0;
    for (int i = 0; i < width * height; i++)
        maxError = fmax(maxError, fabs(C.elements[i] - 2 * width));
    std::cout << C.elements[0] << std::endl;
    std::cout << "Max error: " << maxError << std::endl;
    return 0;
}

Here the matrices are 1024*1024 and the block size is (32, 32), so the grid size is (32, 32). The final test result is:

nvprof cuda9.exe
==2456== NVPROF is profiling process 2456, command: cuda9.exe
Max error: 0
==2456== Profiling application: cuda9.exe
==2456== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 100.00% 2.67533s 1 2.67533s 2.67533s 2.67533s matMulKernel(Matrix, Matrix, Matrix)
API calls: 92.22% 2.67547s 1 2.67547s 2.67547s 2.67547s cudaDeviceSynchronize
6.06% 175.92ms 3 58.640ms 2.3933ms 170.97ms cudaMallocManaged
1.65% 47.845ms 1 47.845ms 47.845ms 47.845ms cudaLaunch
0.05% 1.4405ms 94 15.324us 0ns 938.54us cuDeviceGetAttribute
0.01% 371.49us 1 371.49us 371.49us 371.49us cuDeviceGetName
0.00% 13.474us 1 13.474us 13.474us 13.474us cuDeviceTotalMem
0.00% 6.9300us 1 6.9300us 6.9300us 6.9300us cudaConfigureCall
0.00% 3.8500us 3 1.2830us 385ns 1.9250us cuDeviceGetCount
0.00% 3.4650us 3 1.1550us 0ns 2.3100us cudaSetupArgument
0.00% 2.3100us 2 1.1550us 385ns 1.9250us cuDeviceGet
==2456== Unified Memory profiling result:
Device "GeForce GT 730 (0)"
Count Avg Size Min Size Max Size Total Size Total Time Name
2048 4.0000KB 4.0000KB 4.0000KB 8.000000MB 22.70431ms Host To Device
266 46.195KB 32.000KB 1.0000MB 12.00000MB 7.213048ms Device To Host

Of course, this is not the most efficient implementation; further optimizations are possible...
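One common next step, given here only as a hedged sketch (my code, not the article's), is to tile the computation with shared memory so that each block loads sub-tiles of A and B once and reuses them for many multiply-adds:

#define TILE 32

// Tiled matrix multiplication sketch: each block computes a TILE x TILE tile
// of C, staging tiles of A and B in shared memory to cut global memory traffic.
// Assumes width and height are multiples of TILE, as in the example above.
__global__ void matMulTiledKernel(const Matrix A, const Matrix B, Matrix C)
{
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];
    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float Cvalue = 0.0f;
    // Walk over the tiles of A (along its width) and B (along its height)
    for (int t = 0; t < A.width / TILE; ++t)
    {
        // Each thread loads one element of the current A tile and B tile
        As[threadIdx.y][threadIdx.x] = getElement(A, row, t * TILE + threadIdx.x);
        Bs[threadIdx.y][threadIdx.x] = getElement(B, t * TILE + threadIdx.y, col);
        __syncthreads();              // wait until the whole tile is loaded
        for (int k = 0; k < TILE; ++k)
            Cvalue += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();              // wait before the tiles are overwritten
    }
    setElement(C, row, col, Cvalue);
}

The launch configuration stays the same as in the example above, and with the 48 KB of shared memory per block reported earlier, the two 32x32 float tiles (8 KB in total) fit comfortably.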

Summary

To close with a single sentence: getting started with CUDA is easy, but going deep is hard! Hopefully this will not turn into "from getting started to giving up".

   