Editor's note:
This article, originally published on CSDN, introduces the pros and cons of GPGPU, the CUDA architecture, installation of the CUDA Toolkit, and how to perform computation with CUDA.
CUDA is NVIDIA's GPGPU model. It is based on the C language: you can write programs that run on the graphics chip directly in C, which most people already know, without having to learn chip-specific instructions or special structures.
Modern graphics chips are highly programmable. Because they usually have very high memory bandwidth and a large number of execution units, the idea arose of using them to help with general computation work — that is, GPGPU. CUDA is NVIDIA's model for GPGPU.
NVIDIA's newer graphics chips, including the GeForce 8 series and later, all support CUDA. NVIDIA provides the CUDA development tools (for both Windows and Linux), sample programs, documentation, and so on free of charge; they can be downloaded from the CUDA Zone.
Pros and cons of GPGPU

Compared with a CPU, doing computation on a graphics chip has several advantages:

1. Graphics chips usually have much higher memory bandwidth. For example, NVIDIA's GeForce 8800GTX has over 50 GB/s of memory bandwidth, while current high-end CPUs have around 10 GB/s.
2. Graphics chips have many more execution units. The GeForce 8800GTX, for example, has 128 "stream processors" running at 1.35 GHz. CPUs usually run at higher clock rates, but have far fewer execution units.
3. Compared with a high-end CPU, a graphics card is relatively cheap. A GeForce 8800GT with 512 MB of memory, for instance, costs about the same as a 2.4 GHz quad-core CPU.

Of course, using a graphics chip also has some drawbacks:

1. A graphics chip has a very large number of execution units, so it offers little benefit for work that cannot be highly parallelized.
2. Graphics chips currently support only 32-bit floating point, and most do not fully implement the IEEE 754 standard, so some computations may have lower precision. Many current chips also lack separate integer execution units, so integer arithmetic is relatively inefficient.
3. Graphics chips usually lack complex flow-control hardware such as branch prediction, so heavily branched programs run poorly on them.
4. Current GPGPU programming models are immature, and there is no accepted standard; NVIDIA and AMD/ATI, for example, each have their own model.

Overall, a graphics chip behaves like a stream processor: it is suited to doing a large amount of identical work at once. A CPU is more flexible and can handle more varied work at the same time.
The CUDA architecture

CUDA is NVIDIA's GPGPU model. It is based on C: you can write programs that run on the graphics chip directly in familiar C, without learning chip-specific instructions or special structures.
Under the CUDA architecture, a program is divided into two parts: the host side and the device side. The host side runs on the CPU; the device side runs on the graphics chip. A device-side program is also called a "kernel". Typically the host program prepares the data, copies it into the graphics card's memory, has the graphics chip run the device program, and finally copies the results back from the card's memory.
Because the CPU can only reach the card's memory through the PCI Express interface, such transfers are relatively slow (PCI Express x16 has a theoretical bandwidth of 4 GB/s in each direction), so they should not be done too often, or efficiency suffers.
In the CUDA architecture, the smallest unit of execution on the graphics chip is the thread. Several threads form a block. Threads in the same block can access a common piece of shared memory and can synchronize with each other quickly. The number of threads a block can contain is limited. However, blocks running the same program can be grouped into a grid. Threads in different blocks cannot access the same shared memory, so they cannot communicate or synchronize directly; threads in different blocks can therefore cooperate only to a limited degree. On the other hand, this model lets a program ignore how many threads the graphics chip can actually run at once. For example, a chip with very few execution units might run each block's threads sequentially rather than simultaneously. Different grids can run different programs (kernels).
The relationship between grids, blocks, and threads is shown in the figure below:
Each thread has its own registers and its own local memory space. Every thread in a block shares one piece of shared memory. In addition, all threads (including threads in different blocks) share one set of global memory, constant memory, and texture memory. Different grids have their own global, constant, and texture memory. The differences between these kinds of memory will be discussed later.
Execution model

Because a graphics chip computes with massive parallelism, it approaches some problems differently from an ordinary CPU. The main differences:

1. Memory access latency: a CPU usually uses caches to reduce the number of main-memory accesses, so that memory latency does not hurt execution efficiency. A graphics chip mostly has no cache (or only a very small one) and instead hides memory latency through parallel execution: when the first thread has to wait for a memory read, it starts executing the second thread, and so on.
2. Branch instructions: a CPU usually relies on branch prediction and similar techniques to reduce the pipeline bubbles caused by branches. A graphics chip mostly uses an approach similar to how it handles memory latency; even so, graphics chips usually handle branches rather poorly.

The problems best suited to CUDA are therefore those that can be massively parallelized; only then can memory latency be hidden effectively and the chip's many execution units be put to good use. With CUDA, it is perfectly normal to have thousands of threads running at once. A problem that cannot be massively parallelized will not reach its best efficiency with CUDA.
Installing the CUDA Toolkit

The CUDA Toolkit that NVIDIA currently provides (downloadable from its site) supports Windows (32-bit and 64-bit) and many different Linux distributions.
The CUDA Toolkit needs to work with a C/C++ compiler. On Windows, only Visual Studio 7.x and Visual Studio 8 (including the free Visual C++ 2005 Express) are currently supported; Visual Studio 6 and gcc are not supported on Windows. On Linux, only gcc is supported.
Here is a brief introduction to setting up and using CUDA on Windows.

Download and installation

On Windows, both the CUDA Toolkit and the CUDA SDK are distributed as installers. The CUDA Toolkit contains the basic CUDA tools, while the CUDA SDK contains many sample programs and libraries. To write CUDA programs you basically only need to install the CUDA Toolkit, but the CUDA SDK is still worth installing, because many of its samples and libraries are quite useful.
After installation, the CUDA Toolkit resides in C:\CUDA by default. It contains several directories:

bin – tools and dynamic link libraries
doc – documentation
include – header files
lib – library files
open64 – the Open64-based CUDA compiler
src – some source code

The installer also sets some environment variables, including:

CUDA_BIN_PATH – directory of the tools, default C:\CUDA\bin
CUDA_INC_PATH – directory of the header files, default C:\CUDA\include
CUDA_LIB_PATH – directory of the libraries, default C:\CUDA\lib
Using CUDA in Visual Studio

CUDA's main tool is nvcc, which runs the programs needed to compile CUDA source code into an executable (or object file). Under Visual Studio, we set up a custom build tool so that Visual Studio runs nvcc automatically.
Using Visual Studio 2005 as an example:

1. First, create a Win32 Console project (remember to check Empty project in Application Settings) and add a file, for example main.cu.
2. Right-click main.cu and choose Properties. Under General, make sure Tool is set to Custom Build Tool.
3. Select Custom Build Step and use the following settings for Command Line:

Release mode:
"$(CUDA_BIN_PATH)\nvcc.exe" -ccbin "$(VCInstallDir)bin" -c -DWIN32 -D_CONSOLE -D_MBCS -Xcompiler /EHsc,/W3,/nologo,/Wp64,/O2,/Zi,/MT -I"$(CUDA_INC_PATH)" -o $(ConfigurationName)\$(InputName).obj $(InputFileName)

Debug mode:
"$(CUDA_BIN_PATH)\nvcc.exe" -ccbin "$(VCInstallDir)bin" -c -D_DEBUG -DWIN32 -D_CONSOLE -D_MBCS -Xcompiler /EHsc,/W3,/nologo,/Wp64,/Od,/Zi,/RTC1,/MTd -I"$(CUDA_INC_PATH)" -o $(ConfigurationName)\$(InputName).obj $(InputFileName)

4. To use the software emulation mode, add two extra configurations:

EmuRelease mode:
"$(CUDA_BIN_PATH)\nvcc.exe" -ccbin "$(VCInstallDir)bin" -deviceemu -c -DWIN32 -D_CONSOLE -D_MBCS -Xcompiler /EHsc,/W3,/nologo,/Wp64,/O2,/Zi,/MT -I"$(CUDA_INC_PATH)" -o $(ConfigurationName)\$(InputName).obj $(InputFileName)

EmuDebug mode:
"$(CUDA_BIN_PATH)\nvcc.exe" -ccbin "$(VCInstallDir)bin" -deviceemu -c -D_DEBUG -DWIN32 -D_CONSOLE -D_MBCS -Xcompiler /EHsc,/W3,/nologo,/Wp64,/Od,/Zi,/RTC1,/MTd -I"$(CUDA_INC_PATH)" -o $(ConfigurationName)\$(InputName).obj $(InputFileName)

5. For all configurations, add $(ConfigurationName)\$(InputName).obj to Outputs under Custom Build Step.
6. Select the project, right-click and choose Properties, then Linker. For all configurations, change the following settings:

General/Enable Incremental Linking: No
General/Additional Library Directories: $(CUDA_LIB_PATH)
Input/Additional Dependencies: cudart.lib

With this, you should be able to edit a CUDA program inside the Visual Studio IDE and then build and run it directly.
Setting up CUDA with the Visual C++ 2005 IDE is rather involved; OpenHero has contributed solutions:
CUDA VS2005 Wizard: http://blog.csdn.NET/OpenHero/archive/2008/04/18/2305856.aspx
Visual Assist support for .cu files: http://blog.csdn.Net/OpenHero/archive/2008/04/24/2324711.aspx
Syntax highlighting: http://blog.csdn.net/OpenHero/archive/2008/04/17/2301617.aspx
The first CUDA program

CUDA currently offers two different APIs: the Runtime API and the Driver API, each with its own applicable scope. Because the Runtime API is easier to use, we will start with it.

Initializing CUDA

First, create a file first_cuda.cu. If you are using Visual Studio, set up the project as described above.
To use the Runtime API, you need to include cuda_runtime.h. So, at the very top of the program, add:

#include <stdio.h>
#include <cuda_runtime.h>
Next is an InitCUDA function, which calls the Runtime API functions that initialize CUDA:

bool InitCUDA()
{
    int count;

    cudaGetDeviceCount(&count);
    if(count == 0) {
        fprintf(stderr, "There is no device.\n");
        return false;
    }

    int i;
    for(i = 0; i < count; i++) {
        cudaDeviceProp prop;
        if(cudaGetDeviceProperties(&prop, i) == cudaSuccess) {
            if(prop.major >= 1) {
                break;
            }
        }
    }

    if(i == count) {
        fprintf(stderr, "There is no device supporting CUDA 1.x.\n");
        return false;
    }

    cudaSetDevice(i);

    return true;
}
This function first calls cudaGetDeviceCount to get the number of CUDA-capable devices. If the system has no device that supports CUDA, it still returns 1, and device 0 will be an emulation device that does not support CUDA 1.0 or later. So, to determine whether the system really has a CUDA-capable device, you must call cudaGetDeviceProperties on each device to obtain its data and check which CUDA version it supports (prop.major and prop.minor hold the supported version number; for version 1.0, prop.major is 1 and prop.minor is 0).
cudaGetDeviceProperties returns a lot of information besides the supported CUDA version: the device's name, memory size, maximum thread count, execution-unit clock rate, and so on. See NVIDIA's CUDA Programming Guide for details.
Once a device supporting CUDA 1.0 or later has been found, cudaSetDevice can be called to make it the current device.
Finally comes the main function. In main we simply call the InitCUDA function above and print an appropriate message:

int main()
{
    if(!InitCUDA()) {
        return 0;
    }

    printf("CUDA initialized.\n");

    return 0;
}
Now the program can be compiled with nvcc. With Visual Studio, if you followed the earlier setup, you can simply Build Project and run.
nvcc is CUDA's compiler driver. It splits a .cu file into the part that runs on the GPU and the part that runs on the host, and invokes the appropriate programs to compile each. The GPU part is compiled into intermediate code by a compiler provided by NVIDIA, while the host part is compiled by the system's C++ compiler (Visual C++ on Windows, gcc on Linux).
When the compiled program runs, it should print the message "CUDA initialized." if the system has a CUDA-capable device, or a relevant error message otherwise.
Computing with CUDA

So far our program does nothing useful. So let's add a simple task: take a big pile of numbers and compute the sum of their squares.
First, change the includes at the top of the program to:

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

#define DATA_SIZE 1048576

int data[DATA_SIZE];
and add a new function GenerateNumbers:

void GenerateNumbers(int *number, int size)
{
    for(int i = 0; i < size; i++) {
        number[i] = rand() % 10;
    }
}

This function fills the array with random numbers between 0 and 9.
ÒªÀûÓà CUDA ½øÐмÆËã֮ǰ£¬ÒªÏȰÑÊý¾Ý¸´ÖƵ½ÏÔ¿¨ÄÚ´æÖУ¬²ÅÄÜÈÃÏÔʾоƬʹÓá£Òò´Ë£¬ÐèҪȡµÃÒ»¿éÊʵ±´óСµÄÏÔ¿¨Äڴ棬ÔٰѲúÉúºÃµÄÊý¾Ý¸´ÖƽøÈ¥¡£ÔÚ
main º¯Ê½ÖмÓÈ룺
GenerateNumbers(data,
DATA_SIZE);
int* gpudata, *result;
cudaMalloc ((void**) &gpudata, sizeof(int)
* DATA_SIZE);
cudaMalloc((void**) &result, sizeof(int));
cudaMemcpy (gpudata, data, sizeof(int) * DATA_SIZE,
cudaMemcpyHostToDevice); |
ÉÏÃæÕâ¶Î³ÌÐò»áÏȺô½Ð GenerateNumbers ²úÉúËæ»úÊý£¬²¢ºô½Ð cudaMalloc È¡µÃÒ»¿éÏÔ¿¨Äڴ棨result
ÔòÊÇÓÃÀ´´æÈ¡¼ÆËã½á¹û£¬ÔÚÉÔºó»áÓõ½£©£¬²¢Í¸¹ý cudaMemcpy ½«²úÉúµÄËæ»úÊý¸´ÖƵ½ÏÔ¿¨ÄÚ´æÖС£cudaMalloc
ºÍ cudaMemcpy µÄÓ÷¨ºÍÒ»°ãµÄ malloc ¼° memcpy ÀàËÆ£¬²»¹ý cudaMemcpy
Ôò¶à³öÒ»¸ö²ÎÊý£¬Ö¸Ê¾¸´ÖÆÄÚ´æµÄ·½Ïò¡£ÔÚÕâÀïÒòΪÊÇ´ÓÖ÷ÄÚ´æ¸´ÖÆµ½ÏÔ¿¨Äڴ棬ËùÒÔʹÓà cudaMemcpyHostToDevice¡£Èç¹ûÊÇ´ÓÏÔ¿¨ÄÚ´æµ½Ö÷Äڴ棬ÔòʹÓÃ
cudaMemcpyDeviceToHost¡£ÕâÔÚÖ®ºó»áÓõ½¡£
Next comes the program that runs on the graphics chip. In CUDA, prefixing a function with __global__ marks it as one that executes on the graphics chip. So, add the following function:

__global__ static void sumOfSquares(int *num, int* result)
{
    int sum = 0;
    int i;
    for(i = 0; i < DATA_SIZE; i++) {
        sum += num[i] * num[i];
    }

    *result = sum;
}

Programs that run on the graphics chip have some restrictions; for example, they cannot have a return value. Other restrictions will be mentioned later.
The next step is to have CUDA execute this function. In CUDA, a function is launched with the following syntax:

    function name<<<number of blocks, number of threads, shared memory size>>>(arguments...);
After the call, the result still has to be copied from the graphics chip back to main memory. Add the following to main:

    sumOfSquares<<<1, 1, 0>>>(gpudata, result);

    int sum;
    cudaMemcpy(&sum, result, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(gpudata);
    cudaFree(result);

    printf("sum: %d\n", sum);

Because this program uses only one thread, both the block count and the thread count are 1. We also use no shared memory, so that is set to 0. Compile and run; you should see the result of the computation.
To make sure the result is correct, we can add a piece of code that runs on the CPU to verify it:

    sum = 0;
    for(int i = 0; i < DATA_SIZE; i++) {
        sum += data[i] * data[i];
    }
    printf("sum (CPU): %d\n", sum);

Compile and run, and confirm the two results match.
Measuring the running time

CUDA provides a clock function that returns the current timestamp, which is well suited to timing how long a stretch of code takes to execute (the unit is GPU execution-unit clock cycles). This is also quite useful for optimizing a program. To record time in our program, change the sumOfSquares function to:

__global__ static void sumOfSquares(int *num, int* result,
    clock_t* time)
{
    int sum = 0;
    int i;
    clock_t start = clock();
    for(i = 0; i < DATA_SIZE; i++) {
        sum += num[i] * num[i];
    }

    *result = sum;
    *time = clock() - start;
}
Change the middle part of main to:

    int* gpudata, *result;
    clock_t* time;
    cudaMalloc((void**) &gpudata, sizeof(int) * DATA_SIZE);
    cudaMalloc((void**) &result, sizeof(int));
    cudaMalloc((void**) &time, sizeof(clock_t));
    cudaMemcpy(gpudata, data, sizeof(int) * DATA_SIZE,
        cudaMemcpyHostToDevice);

    sumOfSquares<<<1, 1, 0>>>(gpudata, result, time);

    int sum;
    clock_t time_used;
    cudaMemcpy(&sum, result, sizeof(int), cudaMemcpyDeviceToHost);
    cudaMemcpy(&time_used, time, sizeof(clock_t),
        cudaMemcpyDeviceToHost);
    cudaFree(gpudata);
    cudaFree(result);

    printf("sum: %d time: %d\n", sum, time_used);
Compile and run, and you can see how long the execution took.
If you convert this into actual running time, you may notice that the efficiency is poor. That is because our program does not exploit CUDA's main advantage, parallel execution. The next section discusses how to optimize it.
In the previous section we wrote a program that computes the sum of squares of a pile of numbers, but noted that its efficiency is far from ideal. Of course, in practice, if all you want is the sum of squares, doing it on the CPU is much faster than on the GPU: the computation needs very little arithmetic, so it is almost entirely limited by memory bandwidth, and merely copying the data into card memory may already take about as long as computing directly on the CPU.
However, if computing the sum of squares is just one part of a more complicated computation, doing it on the GPU still has its benefits. And if the data already resides in card memory (for example, produced there by some algorithm on the GPU), then doing this computation on the GPU will indeed be faster.
As just mentioned, the main bottleneck of this computation is memory bandwidth, and in theory a graphics card's memory bandwidth is quite large. So let's look at how much of that bandwidth our first program actually uses.
Parallelizing the program

Our first program does not use any parallelism at all; the whole program has only one thread. On a GeForce 8800GT, the part that runs on the GPU (the "kernel") takes about 640M clock cycles. The GeForce 8800GT's execution units run at 1.5 GHz, so this amounts to about 0.43 seconds. 1M 32-bit numbers are 4 MB of data, so this program effectively uses only about 9.3 MB/s of memory bandwidth — a terrible showing.
Why is the performance so bad? It comes from the GPU's architecture. In CUDA, the part of card memory that general data is copied into is called global memory. Global memory is not cached, and the time an access takes (its latency) is very long, usually several hundred cycles. Since our program has only one thread, every time it reads from global memory it must wait until the data is actually read and accumulated into sum before it can take the next step. That is why its performance is so poor.
Because global memory has no cache, the way to sidestep its huge latency is to use a large number of threads. If many threads are running at the same time, then when one thread reads memory and starts waiting for the result, the GPU can immediately switch to the next thread and read the next memory location. So, ideally, when there are enough threads, the huge latency of global memory can be hidden completely.
How do we parallelize the sum-of-squares program? The simplest way seems to be to split the numbers into groups, compute each group's sum of squares separately, and finally add up the per-group sums. To start, we can let the CPU do the final addition.
First, in first_cuda.cu, add another #define after #define DATA_SIZE to set the number of threads:

#define DATA_SIZE 1048576
#define THREAD_NUM 256
Next, change the kernel to:

__global__ static void sumOfSquares(int *num, int* result,
    clock_t* time)
{
    const int tid = threadIdx.x;
    const int size = DATA_SIZE / THREAD_NUM;
    int sum = 0;
    int i;
    clock_t start;
    if(tid == 0) start = clock();
    for(i = tid * size; i < (tid + 1) * size; i++) {
        sum += num[i] * num[i];
    }

    result[tid] = sum;
    if(tid == 0) *time = clock() - start;
}
threadIdx in the program is a CUDA built-in variable giving the index of the current thread (counting from 0). In our example there will be 256 threads, so 256 instances of sumOfSquares run at the same time, each with a threadIdx.x value from 0 to 255. Using this variable, each instance of the function computes the sum of squares over a different part of the data. We also restrict the timing so that it is done only by thread 0 (i.e., when threadIdx.x is 0).
Likewise, since there will now be 256 results, the memory that stores result must be enlarged. Change the middle part of main to:

    int* gpudata, *result;
    clock_t* time;
    cudaMalloc((void**) &gpudata, sizeof(int) * DATA_SIZE);
    cudaMalloc((void**) &result, sizeof(int) * THREAD_NUM);
    cudaMalloc((void**) &time, sizeof(clock_t));
    cudaMemcpy(gpudata, data, sizeof(int) * DATA_SIZE,
        cudaMemcpyHostToDevice);

    sumOfSquares<<<1, THREAD_NUM, 0>>>(gpudata, result, time);

    int sum[THREAD_NUM];
    clock_t time_used;
    cudaMemcpy(&sum, result, sizeof(int) * THREAD_NUM,
        cudaMemcpyDeviceToHost);
    cudaMemcpy(&time_used, time, sizeof(clock_t),
        cudaMemcpyDeviceToHost);
    cudaFree(gpudata);
    cudaFree(result);
    cudaFree(time);
Note that when launching sumOfSquares we now specify THREAD_NUM as the thread count. Finally, on the CPU side, add up the per-group sums of squares:

    int final_sum = 0;
    for(int i = 0; i < THREAD_NUM; i++) {
        final_sum += sum[i];
    }

    printf("sum: %d time: %d\n", final_sum, time_used);

    final_sum = 0;
    for(int i = 0; i < DATA_SIZE; i++) {
        final_sum += data[i] * data[i];
    }
    printf("sum (CPU): %d\n", final_sum);
Compile and run, and confirm the result matches the previous version.
This version of the program, running on a GeForce 8800GT, takes only about 8.3M cycles — 77 times faster than the previous version! That is the effect of hiding latency with a large number of threads.
However, if you work out the memory bandwidth it uses, it is still not great: only about 723 MB/s, a far cry from what the GeForce 8800GT offers. Why is that?
Memory access patterns

The memory on a graphics card is DRAM, so the most efficient way to access it is sequentially. The program above looks like it accesses memory sequentially (each thread computes the sum of squares of one contiguous chunk of numbers), but we have to consider how the threads actually execute. As mentioned earlier, when one thread is waiting on memory data, the GPU switches to the next thread; that is, execution actually proceeds roughly in the order thread 0 -> thread 1 -> thread 2 -> ...
Therefore, memory accesses that are sequential within one thread are, in actual execution, not sequential at all. To make the actual accesses sequential, we should have thread 0 read the first number, thread 1 read the second number, and so on. So we can change the kernel as follows:
__global__ static void sumOfSquares(int *num, int* result,
    clock_t* time)
{
    const int tid = threadIdx.x;
    int sum = 0;
    int i;
    clock_t start;
    if(tid == 0) start = clock();
    for(i = tid; i < DATA_SIZE; i += THREAD_NUM) {
        sum += num[i] * num[i];
    }

    result[tid] = sum;
    if(tid == 0) *time = clock() - start;
}
Compile and run, and confirm the result is the same.
Merely this simple change makes a big difference in actual efficiency. On the GeForce 8800GT, this version of the program takes 2.6M cycles, another 3x speedup over the previous version. Even so, that is still only about 2.3 GB/s of bandwidth.
The reason is that we are still not using enough threads. In principle, 256 threads can hide at most 256 cycles of latency, but the latency of a GPU access to global memory can exceed 500 cycles. Increasing the thread count gives better efficiency; for example, changing THREAD_NUM to 512 reduces the running time on the GeForce 8800GT to 1.95M cycles. Some improvement, but still not enough. Unfortunately, on the GeForce 8800GT a block can currently contain at most 512 threads, so we cannot increase it further; besides, the more threads there are, the more final summation work the CPU has to do.
More parallelism

Blocks were mentioned earlier, and the "number of blocks" parameter appeared when kernel launches were introduced; so far we have used only one block. So what exactly is a block?
In CUDA, threads can be grouped, and such a group is a block. The threads of one block share a piece of shared memory and can synchronize with each other; threads in different blocks cannot. Our program does not really need thread synchronization, so we can simply use multiple blocks to increase the thread count further.
First, change the #define section to:

#define DATA_SIZE 1048576
#define BLOCK_NUM 32
#define THREAD_NUM 256

This means we will create 32 blocks of 256 threads each, for a total of 32*256 = 8192 threads.
Next, change the kernel to:

__global__ static void sumOfSquares(int *num, int* result,
    clock_t* time)
{
    const int tid = threadIdx.x;
    const int bid = blockIdx.x;
    int sum = 0;
    int i;
    if(tid == 0) time[bid] = clock();
    for(i = bid * THREAD_NUM + tid; i < DATA_SIZE;
        i += BLOCK_NUM * THREAD_NUM) {
        sum += num[i] * num[i];
    }

    result[bid * THREAD_NUM + tid] = sum;
    if(tid == 0) time[bid + BLOCK_NUM] = clock();
}
Like threadIdx.x, blockIdx.x is a CUDA built-in variable; it gives the index of the current block. Also note that the timing has been changed so that each block records its own start and end times.
Modify the main function as follows:

    int* gpudata, *result;
    clock_t* time;
    cudaMalloc((void**) &gpudata, sizeof(int) * DATA_SIZE);
    cudaMalloc((void**) &result,
        sizeof(int) * THREAD_NUM * BLOCK_NUM);
    cudaMalloc((void**) &time, sizeof(clock_t) * BLOCK_NUM * 2);
    cudaMemcpy(gpudata, data, sizeof(int) * DATA_SIZE,
        cudaMemcpyHostToDevice);

    sumOfSquares<<<BLOCK_NUM, THREAD_NUM, 0>>>(gpudata, result,
        time);

    int sum[THREAD_NUM * BLOCK_NUM];
    clock_t time_used[BLOCK_NUM * 2];
    cudaMemcpy(&sum, result, sizeof(int) * THREAD_NUM * BLOCK_NUM,
        cudaMemcpyDeviceToHost);
    cudaMemcpy(&time_used, time, sizeof(clock_t) * BLOCK_NUM * 2,
        cudaMemcpyDeviceToHost);
    cudaFree(gpudata);
    cudaFree(result);
    cudaFree(time);

    int final_sum = 0;
    for(int i = 0; i < THREAD_NUM * BLOCK_NUM; i++) {
        final_sum += sum[i];
    }

    clock_t min_start, max_end;
    min_start = time_used[0];
    max_end = time_used[BLOCK_NUM];
    for(int i = 1; i < BLOCK_NUM; i++) {
        if(min_start > time_used[i])
            min_start = time_used[i];
        if(max_end < time_used[i + BLOCK_NUM])
            max_end = time_used[i + BLOCK_NUM];
    }

    printf("sum: %d time: %d\n", final_sum, max_end - min_start);
Essentially we only enlarged result and changed how the time is computed: subtracting the earliest start time of any block from the latest end time of any block gives the total running time.
This version of the program runs much faster: it takes only about 150K cycles on the GeForce 8800GT, equivalent to roughly 40 GB/s of bandwidth. However, the part executed on the CPU now takes longer, since the CPU has to add up 8192 numbers. To avoid this problem, we can have each block first add up the results of its own threads.
Thread synchronization

As mentioned earlier, the threads of one block can share memory and can synchronize. We can use this to have all the threads in each block add their results together. Change the kernel as follows:

__global__ static void sumOfSquares(int *num, int* result,
    clock_t* time)
{
    extern __shared__ int shared[];
    const int tid = threadIdx.x;
    const int bid = blockIdx.x;
    int i;
    if(tid == 0) time[bid] = clock();
    shared[tid] = 0;
    for(i = bid * THREAD_NUM + tid; i < DATA_SIZE;
        i += BLOCK_NUM * THREAD_NUM) {
        shared[tid] += num[i] * num[i];
    }

    __syncthreads();
    if(tid == 0) {
        for(i = 1; i < THREAD_NUM; i++) {
            shared[0] += shared[i];
        }
        result[bid] = shared[0];
    }

    if(tid == 0) time[bid + BLOCK_NUM] = clock();
}
ÀûÓà __shared__ ÉùÃ÷µÄ±äÁ¿±íʾÕâÊÇ shared memory£¬ÊÇÒ»¸ö block ÖÐÿ¸ö
thread ¶¼¹²ÏíµÄÄÚ´æ¡£Ëü»áʹÓÃÔÚ GPU ÉϵÄÄڴ棬ËùÒÔ´æÈ¡µÄËÙ¶ÈÏ൱¿ì£¬²»ÐèÒªµ£ÐÄ latency
µÄÎÊÌâ¡£
__syncthreads() ÊÇÒ»¸ö CUDA µÄÄÚ²¿º¯Êý£¬±íʾ block ÖÐËùÓÐµÄ thread
¶¼ÒªÍ¬²½µ½Õâ¸öµã£¬²ÅÄܼÌÐøÖ´ÐС£ÔÚÎÒÃǵÄÀý×ÓÖУ¬ÓÉÓÚÖ®ºóÒª°ÑËùÓÐ thread ¼ÆËãµÄ½á¹û½øÐмÓ×Ü£¬ËùÒÔÎÒÃÇÐèҪȷ¶¨Ã¿¸ö
thread ¶¼ÒѾ°Ñ½á¹ûдµ½ shared[tid] ÀïÃæÁË¡£
Next, change part of main to:

    int* gpudata, *result;
    clock_t* time;
    cudaMalloc((void**) &gpudata, sizeof(int) * DATA_SIZE);
    cudaMalloc((void**) &result, sizeof(int) * BLOCK_NUM);
    cudaMalloc((void**) &time, sizeof(clock_t) * BLOCK_NUM * 2);
    cudaMemcpy(gpudata, data, sizeof(int) * DATA_SIZE,
        cudaMemcpyHostToDevice);

    sumOfSquares<<<BLOCK_NUM, THREAD_NUM,
        THREAD_NUM * sizeof(int)>>>(gpudata, result, time);

    int sum[BLOCK_NUM];
    clock_t time_used[BLOCK_NUM * 2];
    cudaMemcpy(&sum, result, sizeof(int) * BLOCK_NUM,
        cudaMemcpyDeviceToHost);
    cudaMemcpy(&time_used, time, sizeof(clock_t) * BLOCK_NUM * 2,
        cudaMemcpyDeviceToHost);
    cudaFree(gpudata);
    cudaFree(result);
    cudaFree(time);

    int final_sum = 0;
    for(int i = 0; i < BLOCK_NUM; i++) {
        final_sum += sum[i];
    }
Note that the CPU now only has to add up BLOCK_NUM — that is, 32 — numbers.
However, because this program does some extra work on the GPU, it is slightly slower: it takes about 164K cycles on the GeForce 8800GT.
Of course, one reason it got slower is that in this version the final addition is done by thread 0 of each block alone, which is not the most efficient method. In principle, adding 256 numbers can itself be parallelized; the most common approach is a tree-style addition. Change the kernel as follows:
__global__ static void sumOfSquares(int *num, int* result,
    clock_t* time)
{
    extern __shared__ int shared[];
    const int tid = threadIdx.x;
    const int bid = blockIdx.x;
    int i;
    int offset = 1, mask = 1;
    if(tid == 0) time[bid] = clock();
    shared[tid] = 0;
    for(i = bid * THREAD_NUM + tid; i < DATA_SIZE;
        i += BLOCK_NUM * THREAD_NUM) {
        shared[tid] += num[i] * num[i];
    }

    __syncthreads();
    while(offset < THREAD_NUM) {
        if((tid & mask) == 0) {
            shared[tid] += shared[tid + offset];
        }
        offset += offset;
        mask = offset + mask;
        __syncthreads();
    }

    if(tid == 0) {
        result[bid] = shared[0];
        time[bid + BLOCK_NUM] = clock();
    }
}
The while loop at the end performs the tree addition. The main function needs no changes.
On the GeForce 8800GT, this version of the program takes about 140K cycles (equivalent to roughly 43 GB/s) — faster even than the version that did no summation on the GPU at all! That is because in the version without GPU summation, the amount of data written to global memory was large (8192 numbers), which also hurts efficiency. So this version not only reduces the CPU's workload, it also runs faster on the GPU.
Further improvement

The tree addition in the previous version is the usual way to write it, but when it runs on the GPU it suffers from shared memory bank conflicts (details will come up later when the GPU architecture is introduced). They can be avoided with the following approach:
    offset = THREAD_NUM / 2;
    while(offset > 0) {
        if(tid < offset) {
            shared[tid] += shared[tid + offset];
        }
        offset >>= 1;
        __syncthreads();
    }
This also does away with the mask variable, so this version is again a bit more efficient: it takes about 137K cycles on the GeForce 8800GT. By now the differences are quite small. To push efficiency a little further, the tree addition can be fully unrolled:
    if(tid < 128) { shared[tid] += shared[tid + 128]; }
    __syncthreads();
    if(tid < 64) { shared[tid] += shared[tid + 64]; }
    __syncthreads();
    if(tid < 32) { shared[tid] += shared[tid + 32]; }
    __syncthreads();
    if(tid < 16) { shared[tid] += shared[tid + 16]; }
    __syncthreads();
    if(tid < 8) { shared[tid] += shared[tid + 8]; }
    __syncthreads();
    if(tid < 4) { shared[tid] += shared[tid + 4]; }
    __syncthreads();
    if(tid < 2) { shared[tid] += shared[tid + 2]; }
    __syncthreads();
    if(tid < 1) { shared[tid] += shared[tid + 1]; }
    __syncthreads();
Of course, this only works when THREAD_NUM is 256. It saves roughly another 1000 cycles (about 44 GB/s). The complete program file can be downloaded from here.
The sum-of-squares program introduced above does not seem to have much practical value. So our second CUDA program will do something with at least some practical value: matrix multiplication. And this time, we will use floating point numbers.
Matrix multiplication may be an old chestnut, but it is quite simple, and it can be used to illustrate some interesting properties of CUDA.
Matrix multiplication

For simplicity, we use square matrices as the example here. Basically, given two matrices A and B, AB = C is computed as follows:

    for(i = 0; i < n; i++) {
        for(j = 0; j < n; j++) {
            C[i][j] = 0;
            for(k = 0; k < n; k++) {
                C[i][j] += A[i][k] * B[k][j];
            }
        }
    }
To begin, we take care of generating the data, setting up CUDA, and so on:

int main()
{
    float *a, *b, *c, *d;
    int n = 1000;

    if(!InitCUDA()) return 0;

    a = (float*) malloc(sizeof(float) * n * n);
    b = (float*) malloc(sizeof(float) * n * n);
    c = (float*) malloc(sizeof(float) * n * n);
    d = (float*) malloc(sizeof(float) * n * n);

    srand(0);

    matgen(a, n, n);
    matgen(b, n, n);

    clock_t time = matmultCUDA(a, n, b, n, c, n, n);

    matmult(a, n, b, n, d, n, n);
    compare_mat(c, n, d, n, n);

    double sec = (double) time / CLOCKS_PER_SEC;
    printf("Time used: %.2f (%.2lf GFLOPS)\n", sec,
        2.0 * n * n * n / (sec * 1E9));

    return 0;
}
The InitCUDA function is the same as in the first CUDA program; refer to the earlier article. Below are the other functions used above.
Generating a matrix:

void matgen(float* a, int lda, int n)
{
    int i, j;

    for(i = 0; i < n; i++) {
        for(j = 0; j < n; j++) {
            a[i * lda + j] = (float) rand() / RAND_MAX +
                (float) rand() / (RAND_MAX * RAND_MAX);
        }
    }
}

This function just uses the random number generator to fill the matrix with numbers between 0 and 1. Note in particular that because C cannot declare a two-dimensional array of variable size, we address elements as i * lda + j.
½øÐоØÕó³Ë·¨£º
void matmult(const
float* a, int lda, const float* b, int ldb,
float* c, int ldc, int n)
{
int i, j, k;
for(i = 0; i < n; i++) {
for(j = 0; j < n; j++) {
double t = 0;
for(k = 0; k < n; k++) {
t += a[i * lda + k] * b[k * ldb + j];
}
c[i * ldc + j] = t;
}
}
} |
ÕâÊÇÒÔ CPU ½øÐоØÕó³Ë·¨¡¢ÓÃÀ´½øÐÐÑéÖ¤´ð°¸ÕýÈ·Óë·ñµÄ³ÌÐò¡£Ìرð×¢Òâµ½ËüÓà double À´´¢´æÔÝʱµÄ¼ÆËã½á¹û£¬ÒÔÌá¸ß¾«È·¶È¡£
Verifying the result:

void compare_mat(const float* a, int lda,
    const float* b, int ldb, int n)
{
    float max_err = 0;
    float average_err = 0;
    int i, j;

    for(i = 0; i < n; i++) {
        for(j = 0; j < n; j++) {
            if(b[i * ldb + j] != 0) {
                float err = fabs((a[i * lda + j] -
                    b[i * ldb + j]) / b[i * ldb + j]);
                if(max_err < err) max_err = err;
                average_err += err;
            }
        }
    }

    printf("Max error: %g Average error: %g\n",
        max_err, average_err / (n * n));
}

This function computes the maximum and average relative errors between two matrices and prints the results.
Finally, the CUDA part of the matrix multiplication:

#define NUM_THREADS 256

clock_t matmultCUDA(const float* a, int lda,
    const float* b, int ldb, float* c, int ldc, int n)
{
    float *ac, *bc, *cc;
    clock_t start, end;

    start = clock();
    cudaMalloc((void**) &ac, sizeof(float) * n * n);
    cudaMalloc((void**) &bc, sizeof(float) * n * n);
    cudaMalloc((void**) &cc, sizeof(float) * n * n);

    cudaMemcpy2D(ac, sizeof(float) * n, a, sizeof(float) * lda,
        sizeof(float) * n, n, cudaMemcpyHostToDevice);
    cudaMemcpy2D(bc, sizeof(float) * n, b, sizeof(float) * ldb,
        sizeof(float) * n, n, cudaMemcpyHostToDevice);

    int blocks = (n + NUM_THREADS - 1) / NUM_THREADS;
    matMultCUDA<<<blocks * n, NUM_THREADS>>>
        (ac, n, bc, n, cc, n, n);

    cudaMemcpy2D(c, sizeof(float) * ldc, cc, sizeof(float) * n,
        sizeof(float) * n, n, cudaMemcpyDeviceToHost);

    cudaFree(ac);
    cudaFree(bc);
    cudaFree(cc);

    end = clock();

    return end - start;
}
This function is quite straightforward: it allocates memory for the matrices in card memory and then copies the matrix data from main memory to the card. However, because our matrix multiplication functions can take a pitch (i.e., lda, ldb, and ldc), copying with the ordinary cudaMemcpy function would require copying each row separately, which would mean calling cudaMemcpy many times and ruin efficiency. So here we use a new function, cudaMemcpy2D, which copies two-dimensional arrays and lets you specify each array's pitch; a single function call then suffices.
½øÐмÆËãµÄ kernel ÈçÏ£º
__global__ static void matMultCUDA(const
float* a, size_t lda,
const float* b, size_t ldb, float* c, size_t ldc,
int n)
{
const int tid = threadIdx.x;
const int bid = blockIdx.x;
const int idx = bid * blockDim.x + tid;
const int row = idx / n;
const int column = idx % n;
int i;
if(row < n && column < n) {
float t = 0;
for(i = 0; i < n; i++) {
t += a[row * lda + i] * b[i * ldb + column];
}
c[row * ldc + column] = t;
}
} |
The function first computes, from bid and tid, the row and column this thread should handle; after checking that row and column are in range, it simply does the computation and writes the result into matrix c — a very simple function.
The actual results on a GeForce 8800GT are:

Max error: 2.01484e-006 Average error: 3.36637e-007
Time used: 1.1560 (1.73 GFLOPS)
Two problems stand out:

1. Obviously, the efficiency is quite poor.
2. The maximum relative error is on the high side; ideally it should be below 1e-6.

The error in the computed result is high because when computing on the CPU we accumulated with double (i.e., 64-bit floating point), while on the GPU we can only use float (32-bit floating point). When accumulating a large number of values, the running total quickly becomes large, so the later terms easily lose too many significant bits.
Since CUDA's floating-point addition, subtraction, and multiplication conform to IEEE 754 precision, we can use Kahan's summation formula to improve accuracy. Change the program to:
    if(row < n && column < n) {
        float t = 0;
        float y = 0;
        for(i = 0; i < n; i++) {
            float r;
            y -= a[row * lda + i] * b[i * ldb + column];
            r = t - y;
            y = (r - t) + y;
            t = r;
        }
        c[row * ldc + column] = t;
    }
Ð޸ĺóµÄ³ÌÐòµÄÖ´Ðнá¹ûÊÇ£º
Max error: 1.19209e-007
Average error: 4.22751e-008
Time used: 1.1560 (1.73 GFLOPS) |
¿ÉÒÔ¿´µ½Ïà¶ÔÎó²îÓкܴóµÄ¸ÄÉÆ£¬Ð§ÂÊÔòûʲô±ä»¯¡£
Since Kahan's summation formula increases the amount of arithmetic, yet the efficiency does not change, we can conclude that this kernel's main bottleneck is memory access. A great deal of the memory reading is redundant: for example, a row of matrix a is read in again for every computation it participates in, which is quite wasteful. Computed this way, the program reads memory 2*n^3 times in total; if each row were read only once, that would drop to n^3 + n^2 reads.
The first improvement

As in our first CUDA program, we can use shared memory to hold each row's data. However, because only the threads of the same block can share shared memory, a row must now be computed by the threads of a single block, and we also need enough shared memory to hold an entire row. So first change the kernel launch to:

    matMultCUDA<<<n, NUM_THREADS, sizeof(float) * n>>>
        (ac, n, bc, n, cc, n, n);
and change the kernel to:

__global__ static void matMultCUDA(const float* a, size_t lda,
    const float* b, size_t ldb, float* c, size_t ldc, int n)
{
    extern __shared__ float data[];
    const int tid = threadIdx.x;
    const int row = blockIdx.x;
    int i, j;

    for(i = tid; i < n; i += blockDim.x) {
        data[i] = a[row * lda + i];
    }

    __syncthreads();

    for(j = tid; j < n; j += blockDim.x) {
        float t = 0;
        float y = 0;
        for(i = 0; i < n; i++) {
            float r;
            y -= data[i] * b[i * ldb + j];
            r = t - y;
            y = (r - t) + y;
            t = r;
        }
        c[row * ldc + j] = t;
    }
}
The first part reads an entire row into shared memory, and the second part does the computation, largely unchanged. The main difference is that a row is now computed by a single block.
On the GeForce 8800GT, the results are:

Max error: 1.19209e-007 Average error: 4.22751e-008
Time used: 0.4220 (4.74 GFLOPS)
ºÜÃ÷ÏԵ쬼ÆËãµÄ½á¹û²¢Ã»Óиı䣬²»¹ýËÙ¶ÈÔòÌá¸ßÁ˳¬¹ýÒ»±¶¡£ËäÈ»Èç´Ë£¬µ«ÊÇÕâÑùµÄЧÂÊÈÔ²»¾¡ÀíÏ룬ÒòΪÀíÂÛÉÏ
GeForce 8800GT Óг¬¹ý 300GFLOPS µÄÔËËãÐÔÄÜ¡£¼´Ê¹ÊÇ°Ñ Kahan¡¯s Summation
Formula ËùÐèÒªµÄ¶îÍâÔËË㿼ÂǽøÈ¥£¬ÕâÑùµÄЧÂÊÈÔÈ»Á¬ÀíÂÛ×î´óÖµµÄÊ®·ÖÖ®Ò»¶¼²»µ½¡£
The cause is the same as before: too many memory accesses. Although the rows of matrix A no longer need to be re-read, the columns of matrix B are still read over and over.
Another problem is less obvious: the reads of matrix B may look non-contiguous, but they are in fact contiguous. Different threads read different columns, so at any given moment the columns read by all the threads together form one contiguous block of memory. Why, then, is the efficiency still poor? The reason is that the GPU's memory controller reaches peak efficiency only when reads start at an address that is a suitable multiple (for example, a multiple of 16 bytes). Since the matrix size is not a multiple of 16 (a 1000x1000 matrix is used here), efficiency suffers.
To solve this, we can tweak the allocation so that the row width becomes a suitable multiple. But what is the right multiple? Fortunately, we do not need to know. CUDA provides the function cudaMallocPitch, which automatically allocates memory with the optimal pitch. We can therefore change the cudaMalloc part to:
size_t pitch_a, pitch_b, pitch_c;
cudaMallocPitch((void**) &ac, &pitch_a, sizeof(float) * n, n);
cudaMallocPitch((void**) &bc, &pitch_b, sizeof(float) * n, n);
cudaMallocPitch((void**) &cc, &pitch_c, sizeof(float) * n, n);
cudaMallocPitch allocates memory with a suitable pitch and returns the allocated width. When copying the matrices into device memory, that returned pitch must be used:
cudaMemcpy2D(ac, pitch_a, a, sizeof(float) * lda,
    sizeof(float) * n, n, cudaMemcpyHostToDevice);
cudaMemcpy2D(bc, pitch_b, b, sizeof(float) * ldb,
    sizeof(float) * n, n, cudaMemcpyHostToDevice);
The kernel invocation also needs to change:
matMultCUDA<<<n, NUM_THREADS, sizeof(float) * n>>>
    (ac, pitch_a / sizeof(float), bc, pitch_b / sizeof(float),
     cc, pitch_c / sizeof(float), n);
Likewise, the returned pitch must be used when copying the result back to main memory:
cudaMemcpy2D(c, sizeof(float) * ldc, cc, pitch_c,
    sizeof(float) * n, n, cudaMemcpyDeviceToHost);
That completes the change. The kernel itself needs no modification.
How much does this change help? On a GeForce 8800GT, the result is:
Max error: 1.19209e-007 Average error: 4.22751e-008
Time used: 0.1250 (16.00 GFLOPS)
The execution speed has jumped again, by more than a factor of three, simply from a small change in how the memory is allocated.
Although the speed improved greatly, there is still a considerable gap to the theoretical figure mentioned earlier. As noted above, this approach performs n^3 + n^2 memory reads and n^2 memory writes. With n = 1000 and 32 bits per number, the total memory traffic is about 4GB. Divided by the measured execution time of 0.125 seconds, this gives roughly 32GB/s of bandwidth, which is already close to the memory bandwidth of the GeForce 8800GT. Since our timing also includes memory allocation and the data copies, the time actually spent in the kernel is shorter still (about 0.09 seconds). It is therefore clear that this program's performance is limited by memory bandwidth.
Further improvement
The previous section concluded that the matrix multiplication program is limited by memory bandwidth. Is there any way to reduce the number of memory accesses? Of course there is, or this section would not exist :)
To further reduce bandwidth usage, notice that in the previous section's method, the accesses to matrix A were cut to the minimum, but the accesses to matrix B were not reduced at all. This is because we only loaded the rows of A into shared memory, yet the columns of B are also reused; ideally their reloading should be avoided as well. However, a column of B is needed at different times than a row of A, so this cannot be done directly.
The solution is "blocking": splitting the whole matrix multiplication into many multiplications of small matrices. For example, to compute the values (0, 0) ~ (15, 15) of matrix C, we can think of it as:
A(0~15, 0~15) * B(0~15, 0~15) + A(0~15, 16~31) * B(16~31, 0~15)
    + A(0~15, 32~47) * B(32~47, 0~15) + ...
This way, we can load two small matrices into shared memory, and the small-matrix multiplication itself no longer needs to touch any external memory at all. If the small matrices have size k, the number of memory accesses actually needed becomes about 2k^2 * (n/k)^3 = 2n^3/k.
Since CUDA currently allows at most 512 threads per block, k = 16 seems a fairly ideal number (256 threads in total). For an n = 1000 matrix, this cuts the memory traffic to about 500MB, one eighth of the previous section's figure. In theory, this should improve performance eightfold (assuming no other bottleneck is hit).
To make the block computation convenient, we give each block 16x16 threads and create (n/16)x(n/16) blocks. Change the kernel invocation to:
int bx = (n + BLOCK_SIZE - 1) / BLOCK_SIZE;
dim3 blocks(bx, bx);
dim3 threads(BLOCK_SIZE, BLOCK_SIZE);
matMultCUDA<<<blocks, threads>>>(ac, pitch_a / sizeof(float),
    bc, pitch_b / sizeof(float), cc, pitch_c / sizeof(float), n);
BLOCK_SIZE ÔòÊǶ¨Òå³É 16¡£dim3 ÊÇ CUDA µÄÒ»ÖÖÊý¾ÝÐÍ̬£¬±íʾһ¸ö 3D µÄÏòÁ¿¡£ÔÚÕâÀÎÒÃÇ͸¹ý
dim3 À´½¨Á¢ 16x16 ¸ö threads µÄ block£¬ºÍ (n/16)x(n/16) ¸ö
blocks¡£
Kernel ³ÌÐòµÄ²¿·Ý£¬Ôò¸Ä³É£º
__global__ static void matMultCUDA(const float* a, size_t lda,
    const float* b, size_t ldb, float* c, size_t ldc, int n)
{
    __shared__ float matA[BLOCK_SIZE][BLOCK_SIZE];
    __shared__ float matB[BLOCK_SIZE][BLOCK_SIZE];
    const int tidc = threadIdx.x;
    const int tidr = threadIdx.y;
    const int bidc = blockIdx.x * BLOCK_SIZE;
    const int bidr = blockIdx.y * BLOCK_SIZE;
    int i, j;
    float results = 0;
    float comp = 0;

    for(j = 0; j < n; j += BLOCK_SIZE) {
        if(tidr + bidr < n && tidc + j < n) {
            matA[tidr][tidc] = a[(tidr + bidr) * lda + tidc + j];
        }
        else {
            matA[tidr][tidc] = 0;
        }
        if(tidr + j < n && tidc + bidc < n) {
            matB[tidr][tidc] = b[(tidr + j) * ldb + tidc + bidc];
        }
        else {
            matB[tidr][tidc] = 0;
        }

        __syncthreads();

        for(i = 0; i < BLOCK_SIZE; i++) {
            float t;
            comp -= matA[tidr][i] * matB[i][tidc];
            t = results - comp;
            comp = (t - results) + comp;
            results = t;
        }

        __syncthreads();
    }

    if(tidr + bidr < n && tidc + bidc < n) {
        c[(tidr + bidr) * ldc + tidc + bidc] = results;
    }
}
Note that because we now use 16x16 threads, threadIdx provides both threadIdx.x and threadIdx.y, each ranging over 0 ~ 15. The blockIdx.x and blockIdx.y variables work the same way, each ranging over 0 ~ n/16.
In the code, because the matrix size is not necessarily a multiple of 16, if statements are needed to check whether an index has run past the edge of the matrix.
On a GeForce 8800GT, this version gives:
Max error: 1.19209e-007 Average error: 4.22751e-008
Time used: 0.0780 (25.64 GFLOPS)
The speed improved, but apparently not by the expected factor of eight. As mentioned earlier, our timing includes the memory-copy and allocation work, whose duration does not shrink. The kernel itself actually runs for about 0.053 seconds (roughly 38 GFLOPS), nearly twice as fast as the previous section's version.
If this version is no longer bound by memory bandwidth, why does it fall short of the expected efficiency? The reason is that it is now bound by instruction throughput. Besides the extra arithmetic required by Kahan's summation formula, the code also contains many multiplications for computing matrix addresses, all of which consume compute resources. The if statements used to guard against running past the matrix edge also have some impact.
One way to remove those if statements is to round the allocation up to a multiple of 16, and clear the memory to 0 before copying the matrices into device memory, as follows:
int newn = ((n + BLOCK_SIZE - 1) / BLOCK_SIZE) * BLOCK_SIZE;
cudaMallocPitch((void**) &ac, &pitch_a, sizeof(float) * newn, newn);
cudaMallocPitch((void**) &bc, &pitch_b, sizeof(float) * newn, newn);
cudaMallocPitch((void**) &cc, &pitch_c, sizeof(float) * newn, newn);
cudaMemset(ac, 0, pitch_a * newn);
cudaMemset(bc, 0, pitch_b * newn);
With that, all the if statements can be removed from the kernel:
__global__ static void matMultCUDA(const float* a, size_t lda,
    const float* b, size_t ldb, float* c, size_t ldc, int n)
{
    __shared__ float matA[BLOCK_SIZE][BLOCK_SIZE];
    __shared__ float matB[BLOCK_SIZE][BLOCK_SIZE];
    const int tidc = threadIdx.x;
    const int tidr = threadIdx.y;
    const int bidc = blockIdx.x * BLOCK_SIZE;
    const int bidr = blockIdx.y * BLOCK_SIZE;
    int i, j;
    float results = 0;
    float comp = 0;

    for(j = 0; j < n; j += BLOCK_SIZE) {
        matA[tidr][tidc] = a[(tidr + bidr) * lda + tidc + j];
        matB[tidr][tidc] = b[(tidr + j) * ldb + tidc + bidc];

        __syncthreads();

        for(i = 0; i < BLOCK_SIZE; i++) {
            float t;
            comp -= matA[tidr][i] * matB[i][tidc];
            t = results - comp;
            comp = (t - results) + comp;
            results = t;
        }

        __syncthreads();
    }

    c[(tidr + bidr) * ldc + tidc + bidc] = results;
}
This version gives:
Max error: 1.19209e-007 Average error: 4.22751e-008
Time used: 0.0780 (25.64 GFLOPS)
Apparently no improvement; in fact, though, the kernel's own running time has dropped to 0.042 seconds (roughly 48 GFLOPS).
Conclusion
Some readers may wonder whether making the block even larger (say 32x32) would help. Since the final program is no longer bandwidth-bound (accessing 500MB of data within 0.042 seconds corresponds to about 12GB/s of bandwidth), enlarging the block further would not help. Moreover, since a block can contain at most 512 threads, making the block larger adds considerable overhead, and the size of shared memory is also limited (16384 bytes on the GeForce 8800GT), so the block size cannot be increased arbitrarily anyway.
The complete source of the final version can be downloaded here.
The GPU's hardware architecture
Here we briefly describe the architecture of NVIDIA's current CUDA-capable GPUs, specifically the part that executes CUDA programs (essentially the shader units). The information here combines what NVIDIA has published with material NVIDIA has presented at various conferences and university courses, so it may contain inaccuracies. The main sources include NVIDIA's CUDA Programming Guide 1.1, NVIDIA's session introducing CUDA at Supercomputing '07, and the CUDA course at UIUC.
A basic introduction to the GPU
Among the GPUs NVIDIA currently ships, it is the G80 family that supports CUDA. The G80 chip supports CUDA version 1.0, while G84, G86, G92, G94, and G96 support CUDA 1.1. Basically, apart from the earliest GeForce 8800 Ultra/GTX, the 320MB/640MB versions of the GeForce 8800GTS, and the Tesla cards, which are CUDA 1.0, all other GeForce 8 and 9 series cards support CUDA 1.1. For details, see Appendix A of the CUDA Programming Guide 1.1.
In all current CUDA-capable NVIDIA GPUs, the shader part is composed of multiple multiprocessors. Each multiprocessor contains eight stream processors, arranged in two groups of four, so it can effectively be viewed as two 4-wide SIMD processors. In addition, each multiprocessor has 8192 registers, 16KB of shared memory, plus a texture cache and a constant cache, roughly as illustrated in the figure below:
Detailed multiprocessor information can be obtained through CUDA's cudaGetDeviceProperties() or cuDeviceGetProperties() functions. At present, however, there is no way to query directly how many multiprocessors a GPU contains.
In CUDA, most of the basic arithmetic operations can be carried out by the stream processors. Each stream processor contains an FMA (fused multiply-add) unit and can perform one multiplication and one addition. More complex operations take longer to execute.
The execution process
When a CUDA program executes, each stream processor corresponds to one thread, and each multiprocessor corresponds to one block. From the earlier articles you may have noticed that a block often contains many threads (say 256), far more than the number of stream processors in a multiprocessor. What is going on?
In fact, although a multiprocessor has only eight stream processors, every operation a stream processor performs has latency, to say nothing of memory-access latency, so CUDA executes programs in units of a warp. On current CUDA devices, a warp contains 32 threads, split into two half-warps of 16 threads each. Since a stream processor's operations have a latency of at least 4 cycles, a 4-wide group of stream processors must execute at least 16 threads at a time (one half-warp) to effectively hide the latency of the various operations.
Because a multiprocessor has little other memory, the state of every thread is kept directly in the multiprocessor's registers. So the more threads a multiprocessor has in flight, the more register space it needs. For example, if a block contains 256 threads and each thread uses 20 registers, then 256 x 20 = 5,120 registers are needed to hold the state of every thread.
Current CUDA devices have 8,192 registers per multiprocessor, so if each thread uses 16 registers, a multiprocessor can sustain at most 512 threads at once. If more threads run concurrently than that, some of their data has to be spilled to device memory, reducing execution efficiency.
Editor's note: in NVIDIA's GT200 the register file doubled in size; 16K registers are usable in FP32 mode and 8K in FP64 mode.
Shared memory
On current CUDA devices, each multiprocessor has 16KB of shared memory, divided into 16 banks. If every thread accesses a different bank at the same time, there is no problem at all, and accessing shared memory is as fast as accessing a register. However, if two (or more) threads access data in the same bank at the same time, a bank conflict occurs: those threads must access the data sequentially and can no longer access shared memory simultaneously.
Shared memory is divided into banks in units of 4 bytes. So given the following data:
__shared__ int data[128];
data[0] is in bank 0, data[1] in bank 1, data[2] in bank 2, ..., data[15] in bank 15, and data[16] wraps back around to bank 0. Because a warp executes as two half-warps, threads belonging to different half-warps cannot cause bank conflicts with each other.
Therefore, if a program accesses shared memory in the following way:
int number = data[base + tid];
there are no bank conflicts at all, and peak efficiency is achieved. But with the following pattern:
int number = data[base + 4 * tid];
thread 0 and thread 4 access the same bank, as do thread 1 and thread 5, and so on, causing bank conflicts. In this example, within one half-warp of 16 threads, four threads access each bank, so shared-memory access drops to 1/4 of full speed.
One important exception: when multiple threads access the same shared-memory address, shared memory can broadcast the 32-bit data at that address to all the reading threads, so no bank conflict occurs. For example:
int number = data[3];
causes no bank conflict, because every thread reads the data at the same address.
Shared-memory bank conflicts can often be resolved by changing how the data is laid out. For example, the following code:
data[tid] = global_data[tid];
...
int number = data[16 * tid];
causes severe bank conflicts. To avoid the problem, the data layout can be changed slightly, turning the access pattern into:
int row = tid / 16;
int column = tid % 16;
data[row * 17 + column] = global_data[tid];
...
int number = data[17 * tid];
With this padding, the bank conflicts disappear.
Editor's note: shared memory goes by other names in NVIDIA's documentation, such as PDC (Parallel Data Cache) and PBSM (per-block shared memory).
Global memory
Because the multiprocessors do not cache global memory (if each multiprocessor had its own global-memory cache, a cache coherence protocol would be needed, greatly increasing the cache's complexity), the latency of global-memory accesses is very long. Beyond that, as mentioned in the earlier articles, global-memory accesses should be as contiguous as possible; this is a consequence of how DRAM access works.
More precisely, global-memory accesses need to be "coalesced". Coalesced means that, besides being contiguous, the starting address must be a multiple of 16 times the size each thread accesses. For example, if every thread reads 32 bits of data, then the address read by the first thread must be a multiple of 16 * 4 = 64 bytes.
Threads that skip the memory access do not prevent the remaining threads from performing a coalesced access. For example:
if(tid != 3) {
    int number = data[tid];
}
Although thread 3 reads no data, the other threads still satisfy the coalescing conditions (assuming the address of data is a multiple of 64 bytes), so this memory read remains coalesced.
On current CUDA 1.1 devices, the amount of data each thread reads at once can be 32 bits, 64 bits, or 128 bits. 32-bit accesses are the most efficient; 64-bit accesses are slightly worse, and reading 128 bits at a time is noticeably slower than reading 32 bits (but still better than non-coalesced access).
If the data each thread accesses at once is not 32, 64, or 128 bits, the access cannot be coalesced. For example, the following code:
struct vec3d { float x, y, z; };
...
__global__ void func(struct vec3d* data, float* output)
{
    output[tid] = data[tid].x * data[tid].x +
        data[tid].y * data[tid].y +
        data[tid].z * data[tid].z;
}
is not a coalesced read, because the size of vec3d is 12 bytes rather than 4, 8, or 16 bytes. One way to solve the problem is to use the __align__(n) specifier, for example:
struct __align__(16) vec3d { float x, y, z; };
Õâ»áÈà compiler ÔÚ vec3d ºóÃæ¼ÓÉÏÒ»¸ö¿ÕµÄ 4 bytes£¬ÒÔ²¹Æë 16 bytes¡£ÁíÒ»¸ö·½·¨£¬ÊǰÑÊý¾Ý½á¹¹×ª»»³ÉÈý¸öÁ¬ÐøµÄÊý×飬ÀýÈ磺
__global__ void func(float* x, float* y, float* z, float* output)
{
    output[tid] = x[tid] * x[tid] + y[tid] * y[tid] +
        z[tid] * z[tid];
}
If for other reasons the data structure cannot be rearranged this way, you can also consider using shared memory to rearrange the structure on the GPU. For example:
__global__ void func(struct vec3d* data, float* output)
{
    __shared__ float temp[THREAD_NUM * 3];
    const float* fdata = (float*) data;
    temp[tid] = fdata[tid];
    temp[tid + THREAD_NUM] = fdata[tid + THREAD_NUM];
    temp[tid + THREAD_NUM*2] = fdata[tid + THREAD_NUM*2];
    __syncthreads();
    output[tid] = temp[tid*3] * temp[tid*3] +
        temp[tid*3+1] * temp[tid*3+1] +
        temp[tid*3+2] * temp[tid*3+2];
}
In the example above, we first read the data from global memory into shared memory in a contiguous pattern. Since shared memory does not care about access order (though watch out for bank conflicts, as covered in the previous section), this avoids the non-coalesced read problem.
Texture
CUDA supports textures. In a CUDA kernel, the GPU's texture units can be used to read texture data. The biggest differences between a texture and global memory are that a texture is read-only and that the GPU has a texture cache of a certain size. Texture reads therefore do not need to follow the coalescing rules to achieve decent efficiency. In addition, texture reads can exploit the GPU's texture filtering capabilities (for example, bilinear filtering) and can convert data types on the fly, for example turning 32-bit RGBA data directly into four 32-bit floating-point numbers.
ÏÔʾоƬÉ쵀 texture cache ÊÇÕë¶ÔÒ»°ã»æÍ¼Ó¦ÓÃËùÉè¼Æ£¬Òò´ËËüÈÔ×îÊʺÏÓÐÇø¿éÐÔÖʵĴæÈ¡¶¯×÷£¬¶ø·ÇËæ»úµÄ´æÈ¡¡£Òò´Ë£¬Í¬Ò»¸ö
warp Öеĸ÷¸ö thread ×îºÃÊǶÁÈ¡µØÖ·Ïà½üµÄÊý¾Ý£¬²ÅÄÜ´ïµ½×î¸ßµÄЧÂÊ¡£
¶ÔÓÚÒѾÄÜ·ûºÏ coalesced ¹æÔòµÄÊý¾Ý£¬Ê¹Óà global memory ͨ³£»á±ÈʹÓà texture
ÒªÀ´µÃ¿ì¡£
Arithmetic units
The arithmetic unit in a stream processor is basically a floating-point fused multiply-add unit; that is, it can perform one multiplication and one addition, as in:
a = b * c + d;
The compiler automatically combines suitable addition and multiplication operations into a single fmad instruction.
Besides floating-point addition and multiplication, integer addition, bitwise operations, comparisons, minimum, maximum, and type conversions (float to integer and integer to float) all run at full speed. Integer multiplication cannot run at full speed, but 24-bit multiplication can; CUDA provides the built-in __mul24 and __umul24 functions for 24-bit integer multiplication.
Floating-point division is computed as a reciprocal followed by a multiplication, so its accuracy falls short of the IEEE 754 specification (maximum error 2 ulp). The built-in __fdividef(x, y) provides an even faster division with the same accuracy as ordinary division, but it produces incorrect results when 2^126 < y < 2^128.
CUDA also provides a number of lower-precision intrinsic functions, including __expf, __logf, __sinf, __cosf, __powf, and so on. These functions are faster but less accurate than the standard ones. For details, see Appendix B of the CUDA Programming Guide 1.1.
Data transfer to and from host memory
In CUDA, the GPU cannot access host memory directly; it can only access the memory on the graphics card. Data therefore has to be copied from host memory into device memory, and after the computation, the results are copied from device memory back to host memory. These copies are limited by the speed of PCI Express. With a PCI Express x16 slot, PCI Express 1.0 provides 4GB/s of bandwidth in each direction, and PCI Express 2.0 provides 8GB/s; these are of course theoretical figures.
When copying data from ordinary memory to device memory, because ordinary memory can be moved around by the operating system at any time, CUDA first copies the data into an internal buffer before using DMA to transfer it to device memory. To avoid this extra copy, you can use the cudaMallocHost function to obtain a block of page-locked memory in host memory. However, requesting too much page-locked memory interferes with the operating system's memory management and may reduce overall system performance.