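Editor's recommendation: |
This article describes the current state of AI workloads on the public cloud, the corresponding technology choices, and the problems they face. It closes by analyzing trends in the open-source community and in industry, and shares an outlook on fully elastic AI infrastructure.
This article comes from CSDN and was edited and recommended by Linda of 火龙果软件 (Huolongguo Software). |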
|
Background and current state

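Deep learning has reached the point where new model architectures appear constantly. Since GPT-1 and BERT were released in 2018, model parameter counts have grown exponentially. Transformer-style architectures now shine not only in natural language processing but are also spreading like wildfire through computer vision and other fields. The demand for compute and GPU memory will therefore only grow stronger. Yet the hardware offered by vendors such as Nvidia has not kept pace. The accompanying figure shows the gap between the two: the red line is the trend in model parameter count, which is currently growing about 120x per year, while the green line, GPU memory capacity, grows only about 2x per year.
As a result, distributed training has become the mainstream way to train, whether in computer vision and natural language processing or in search, advertising, and recommendation, which are widely deployed across the internet industry.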
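
To make the gap concrete, the following minimal sketch compounds the two yearly growth factors quoted above. It assumes both factors stay constant, which is a simplification, and the function name `capability_gap` is only illustrative.

```python
def capability_gap(years: float, model_growth: float = 120.0,
                   memory_growth: float = 2.0) -> float:
    """Factor by which model size outgrows GPU memory after `years`.

    Assumes constant yearly growth factors (a simplification).
    """
    return (model_growth / memory_growth) ** years


for y in (1, 2, 3):
    print(f"after {y} year(s): models outgrow memory by {capability_gap(y):,.0f}x")
```

Under these assumptions the gap widens by a factor of 60 every year, which is why a single device quickly stops being enough.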

Deep learning frameworks are flourishing in step. Established frameworks such as TensorFlow, PyTorch, and Keras remain very popular, while newer ones such as Microsoft's DeepSpeed and Baidu's Paddle are gradually emerging.

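In summary, AI is now widely deployed across industry. That goes without saying for traditional search, advertising, and recommendation. In vision and natural language processing, deep-learning-based methods have become the state of the art, and in games, robotics, and similar fields, reinforcement learning is slowly moving into production. New hardware and frameworks keep appearing to meet the demand for complex models. There is also one very clear trend: many AI workloads are moving to the public cloud, hoping to use its elastic compute to lower compute cost and improve efficiency.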
AI adoption on the public cloud

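Next, we share some observations about cloud native AI from serving customers on the public cloud.
Cloud native AI on the public cloud is gradually being adopted, covering both sparse workloads such as search, advertising, and recommendation and dense workloads such as computer vision. Recommendation scenarios in the internet industry account for relatively more of the deployments. But precisely because search, advertising, and recommendation scenarios are complex and demand low end-to-end latency, adapting them is relatively expensive, so most workloads, offline training in particular, still cannot make good use of the cloud's elasticity.
From the framework perspective, the vast majority of workloads still use TensorFlow. This is related to the previous observation: TensorFlow still dominates search, advertising, and recommendation. PyTorch usage is growing, however, especially in computer vision and natural language processing.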
Tencent Cloud Native AI service
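Building on these observations and our own practice, the Tencent Cloud Native team has built a productized cloud native AI solution for the Tencent Cloud container service around Kubeflow. It is now in free closed beta. You are welcome to contact us to try it, and any suggestions will be valuable motivation for us. The Tencent Cloud Native AI service offers rapid delivery and management of AI environments, an elastic Jupyter service, and distributed model serving. Product features such as model management are being built out step by step.
In addition, to address bandwidth bottlenecks, we are working on the storage side with the Tencent COS team, using the GooseFS cache engine for optimization, and on the compute side with Tencent Cloud's Youtu Lab (优图实验室), drawing on its years of experience in training and inference, to prepare a highly optimized deep learning framework. We will make full use of cloud native AI as a unified entry point, co-build the platform with multiple Tencent Cloud teams, provide out-of-the-box product capabilities, and give back to customers and the community.
More best practices for cloud native AI will appear in our upcoming series 《云原生AI标准指南》 (Cloud Native AI Standard Guide) and 《云原生AI前沿观察》 (Cloud Native AI Frontier Observations).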
Deployment in practice

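Having covered how cloud native AI is being adopted on the public cloud, we now share the typical technology choices for running AI workloads there, starting with the training stack. At the bottom, on the cloud server side, are virtual machines or bare-metal machines provided by the cloud vendor. Most workloads today use a Kubernetes container service, so the servers are generally organized into a Kubernetes cluster for resource management and scheduling. On top of that, object storage, file storage, or block storage holds the training samples and models. Object storage is used in most scenarios where read/write pressure is moderate. Compared with the alternatives, it supports tiering, compression, and archiving, and it is cost-effective. Where read/write pressure is high, file storage and block storage are more common.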
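To maximize data throughput, a compute-side cache is sometimes used for acceleration. Options include Alluxio and GooseFS, Tencent Cloud's cache acceleration product for object storage. Caching remote data inside the compute cluster avoids the cost of pulling it from the remote end and can significantly speed up training in some scenarios.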
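
The effect of a compute-side cache can be illustrated with a small read-through LRU cache. This is only a sketch of the general idea, not how Alluxio or GooseFS are implemented; `ReadThroughCache` and `fake_object_store` are hypothetical names, and the "remote store" is a local function standing in for object storage.

```python
from collections import OrderedDict
from typing import Callable


class ReadThroughCache:
    """LRU cache on the compute side, in front of a slow remote store."""

    def __init__(self, fetch_remote: Callable[[str], bytes], capacity: int):
        self._fetch_remote = fetch_remote
        self._capacity = capacity
        self._entries: "OrderedDict[str, bytes]" = OrderedDict()
        self.remote_reads = 0  # how often we had to go to the remote store

    def read(self, key: str) -> bytes:
        if key in self._entries:
            self._entries.move_to_end(key)  # mark as most recently used
            return self._entries[key]
        data = self._fetch_remote(key)      # cache miss: pull from remote
        self.remote_reads += 1
        self._entries[key] = data
        if len(self._entries) > self._capacity:
            self._entries.popitem(last=False)  # evict least recently used
        return data


def fake_object_store(key: str) -> bytes:
    """Hypothetical stand-in for an object storage GET."""
    return f"sample:{key}".encode()


cache = ReadThroughCache(fake_object_store, capacity=2)
for epoch in range(3):
    for shard in ("shard-0", "shard-1"):
        cache.read(shard)
print(cache.remote_reads)  # only the first epoch touches the remote store
```

When the working set fits in the cache, every epoch after the first is served locally, which is the case where such caches help most.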
Built on top of the servers and storage is the distributed training infrastructure. Kubeflow is currently the most widely used. With Kubeflow, users can easily create distributed training jobs for TensorFlow, PyTorch, Horovod, and other frameworks. Kubeflow also works well with Kubernetes features and supports schedulers such as Volcano.
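Although Kubeflow already supports model training and evaluation, using it directly still has problems. Different data dependencies may live in different data systems, so the data processing logic can become very complex. To simplify the workflow for algorithm engineers and improve the user experience, a pipeline system is usually built on top to compose and connect the stages of the machine learning process, together with a convenient programmable environment that helps algorithm engineers implement business logic faster. Options at this layer typically include Jupyter, Argo Workflow, Airflow, and Kubeflow. From the user's point of view, algorithm engineers only need to care about the experiment environment and pipeline system at the top. The infrastructure layers beneath are provided by the infrastructure team and the public cloud. This layering reduces the cognitive load on engineers in different roles and improves efficiency.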
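
The core idea of a pipeline system, composing stages by their dependencies, can be sketched in a few lines. This is a generic toy and does not use the Argo Workflow, Airflow, or Kubeflow APIs; the stage names and the "model" are made up for illustration. It relies on `graphlib`, which is in the standard library from Python 3.9.

```python
from graphlib import TopologicalSorter


def run_pipeline(steps, deps):
    """Run steps in dependency order.

    steps: name -> callable(context) returning that step's output
    deps:  name -> set of upstream step names
    """
    context, order = {}, []
    for name in TopologicalSorter(deps).static_order():
        context[name] = steps[name](context)
        order.append(name)
    return order, context


# Hypothetical stages: the "model" is just the mean of the scaled data.
steps = {
    "extract": lambda ctx: [1.0, 2.0, 3.0, 4.0],
    "preprocess": lambda ctx: [x / 4.0 for x in ctx["extract"]],
    "train": lambda ctx: sum(ctx["preprocess"]) / len(ctx["preprocess"]),
    "evaluate": lambda ctx: abs(ctx["train"] - 0.5),
}
deps = {
    "extract": set(),
    "preprocess": {"extract"},
    "train": {"preprocess"},
    "evaluate": {"train"},
}
order, results = run_pipeline(steps, deps)
print(order, results["evaluate"])
```

Real pipeline systems add what this sketch leaves out: each stage runs in its own container, outputs are persisted between stages, and failed stages can be retried.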

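Next, we take distributed training as an example to describe problems that may arise with these choices and how to solve them. Distributed training falls into two modes according to how parameters are updated: the Parameter Server (PS) and Worker mode, and the AllReduce mode. In PS mode, two roles take part in training, PS and Worker. Workers do the main computation and send the computed gradients to the corresponding PS. The PS updates the corresponding parameters and sends them back to the Workers. In AllReduce mode, every Worker holds the full model, different Workers receive different data, and they exchange gradients with one another to update and synchronize.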
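
The difference between the two modes can be shown with a toy, single-process simulation. It assumes synchronous updates, a one-parameter model y = w * x, and plain gradient averaging; real systems shard parameters across several PS nodes and run the all-reduce over the network.

```python
def local_gradient(w, batch):
    """Gradient of mean squared error for y = w * x on one worker's batch."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


def ps_step(w, batches, lr):
    # Workers push gradients to the parameter server, which owns the parameter.
    grads = [local_gradient(w, b) for b in batches]
    # The server applies the update; workers then pull the new value.
    return w - lr * sum(grads) / len(grads)


def allreduce_step(replicas, batches, lr):
    # Every worker holds a full replica and computes a gradient on its own data.
    grads = [local_gradient(w, b) for w, b in zip(replicas, batches)]
    mean_grad = sum(grads) / len(grads)  # what the all-reduce leaves on every worker
    return [w - lr * mean_grad for w in replicas]


# Two workers, each with its own shard of data generated from y = 2x.
batches = [[(1.0, 2.0), (2.0, 4.0)], [(3.0, 6.0), (4.0, 8.0)]]
w_ps, replicas = 0.0, [0.0, 0.0]
for _ in range(50):
    w_ps = ps_step(w_ps, batches, lr=0.05)
    replicas = allreduce_step(replicas, batches, lr=0.05)
print(w_ps, replicas)
```

With synchronous updates the two modes compute the same parameters. They differ in where the parameters live and in the communication pattern: many-to-one traffic towards the PS versus peer-to-peer exchange between Workers.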

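Both training modes have problems. First, when the model has many parameters, exchanging gradients or parameters requires a great deal of network bandwidth, and the network becomes the bottleneck of training. This is especially pronounced when training dense models. Second, a cluster that runs deep learning jobs usually runs many of them at once. Every job needs to access storage, so storage bandwidth can become a bottleneck as well. In short, bandwidth can run short on both the network and the storage side.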
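
A back-of-the-envelope estimate shows why the network becomes the bottleneck. The sketch below assumes a ring all-reduce, in which each worker sends roughly 2(N-1)/N times the gradient size per synchronization, and ignores latency and overlap with computation. The model size, worker count, and bandwidth are illustrative values, not measurements.

```python
def ring_allreduce_seconds(num_params, num_workers, bandwidth_gbps, bytes_per_param=4):
    """Rough lower bound on per-step gradient transfer time for one worker.

    Assumes a ring all-reduce: each worker sends 2 * (N - 1) / N times the
    gradient payload. Latency and compute/communication overlap are ignored.
    """
    payload_bytes = num_params * bytes_per_param
    sent_bytes = 2 * (num_workers - 1) / num_workers * payload_bytes
    return sent_bytes * 8 / (bandwidth_gbps * 1e9)


# Illustrative numbers: 1 billion fp32 parameters, 8 workers, 25 Gbps links.
t = ring_allreduce_seconds(num_params=1_000_000_000, num_workers=8, bandwidth_gbps=25)
print(f"{t:.2f} s per synchronization")
```

Under these assumptions each synchronization costs more than two seconds of pure transfer time, which can easily exceed the compute time of a training step.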

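On the public cloud, cloud servers usually do not provide RDMA NICs, and intranet bandwidth is typically around 20-50 Gbps. In such an environment, optimizations such as gradient compression are generally needed to reduce the bandwidth pressure of gradient synchronization. Gradient compression reduces the size of the gradients sent in each synchronization. The AllReduce implementation can also be replaced with one that is friendlier to low-bandwidth environments, such as 2DReduce. Tencent Cloud's Ti-Horovod implements both, and it performs better than stock Horovod when bandwidth is low.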
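
One common form of gradient compression is top-k sparsification with error feedback: only the k largest-magnitude entries are sent, and the rest are accumulated locally and added to the next step's gradient. The sketch below illustrates that general technique in plain Python. It is not a description of how Ti-Horovod compresses gradients, which the text above does not specify.

```python
def topk_compress(grad, residual, k):
    """Keep the k largest-magnitude entries; carry the rest to the next step."""
    corrected = [g + r for g, r in zip(grad, residual)]
    keep = sorted(range(len(corrected)), key=lambda i: abs(corrected[i]), reverse=True)[:k]
    sparse = {i: corrected[i] for i in keep}  # index -> value, what is actually sent
    new_residual = [0.0 if i in sparse else v for i, v in enumerate(corrected)]
    return sparse, new_residual


def decompress(sparse, size):
    """Rebuild a dense gradient with zeros where nothing was sent."""
    dense = [0.0] * size
    for i, v in sparse.items():
        dense[i] = v
    return dense


grad = [0.1, -2.0, 0.05, 1.5, -0.2, 0.0, 0.3, -0.01]
sparse, residual = topk_compress(grad, [0.0] * len(grad), k=2)
print(sparse)    # the two entries that go over the network
print(residual)  # what is kept locally for the next step
```

Here 2 of 8 values are transmitted, and no information is discarded outright: the unsent part stays in `residual` and is folded into the next step's gradient.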

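When training on bare-metal servers, RDMA NICs can be used to accelerate gradient exchange. In such a training environment each machine has a VPC NIC for interacting with cloud products such as object storage, a RoCE NIC, and a GPU. Some adaptation is therefore needed so that training samples are pulled through the VPC NIC while gradient synchronization goes through the RDMA NIC.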
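
If the collective communication library is NCCL (an assumption; the text above does not say which library is used), one way to steer gradient traffic onto the RDMA NIC is through NCCL's environment variables, while storage traffic keeps following the default route over the VPC NIC. The interface name `eth0`, the device name `mlx5_0`, and the GID index are placeholders that depend on the actual machine.

```python
import os


def nccl_env(bootstrap_ifname: str, rdma_hca: str, gid_index: int) -> dict:
    """Environment variables that point NCCL at a specific RDMA device."""
    return {
        "NCCL_SOCKET_IFNAME": bootstrap_ifname,  # IP interface for bootstrap/socket traffic
        "NCCL_IB_DISABLE": "0",                  # keep the InfiniBand/RoCE transport enabled
        "NCCL_IB_HCA": rdma_hca,                 # RDMA device used for collectives
        "NCCL_IB_GID_INDEX": str(gid_index),     # GID table entry, often needed for RoCE
    }


# Placeholder values; check the real device names with tools such as `ibv_devinfo`.
env = nccl_env(bootstrap_ifname="eth0", rdma_hca="mlx5_0", gid_index=3)
os.environ.update(env)  # must be set before the training framework initializes NCCL
print(env)
```

In a Kubernetes deployment the same variables would typically be set in the Pod spec rather than in the training script.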

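This setup is quite likely to run into the storage bandwidth problem mentioned earlier: once gradient synchronization is accelerated by high-bandwidth RDMA, storage becomes the more likely bottleneck. To address this, a compute-side cache product on the public cloud, such as Tencent Cloud's GooseFS or the open-source Alluxio, can cache data inside the cluster so that training does not pull data from object storage online, which avoids the storage bottleneck.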

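In inference scenarios the architecture is comparatively simple. The bottom layer is still a Kubernetes cluster made up of cloud servers, models are generally stored in object storage, and the model service is exposed through TFServing, Triton Inference Server, or an in-house serving framework.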

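Some workloads have a relatively complex end-to-end flow with elaborate pre-processing and post-processing stages. Implementing these with TFServing or Triton Inference Server makes the logic especially complicated. Model serving is also coupled to internal infrastructure and has to integrate with internal services such as gateways. Demand for in-house serving frameworks is therefore fairly strong. Although TFServing and Triton Inference Server attract a lot of attention in the open-source world, a considerable share of workloads still runs on in-house serving frameworks.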
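
The kind of logic that pushes teams towards an in-house framework can be sketched as a thin service that wraps the model call with pre- and post-processing. Everything here is hypothetical: the featurization, the labels, and `toy_model`, which stands in for a real model or a call to a model server.

```python
from typing import Callable, List


class ModelService:
    """Wraps a model with business-specific pre- and post-processing."""

    def __init__(self, model: Callable[[List[float]], List[float]], labels: List[str]):
        self._model = model
        self._labels = labels

    def preprocess(self, text: str) -> List[float]:
        # Hypothetical featurization: text length and number of digits.
        return [float(len(text)), float(sum(c.isdigit() for c in text))]

    def postprocess(self, scores: List[float]) -> dict:
        best = max(range(len(scores)), key=scores.__getitem__)
        return {"label": self._labels[best], "score": scores[best]}

    def predict(self, text: str) -> dict:
        return self.postprocess(self._model(self.preprocess(text)))


def toy_model(features: List[float]) -> List[float]:
    """Stand-in for a real model: scores by the share of digits in the input."""
    length, digits = features
    ratio = digits / length if length else 0.0
    return [1.0 - ratio, ratio]


service = ModelService(toy_model, labels=["text", "numeric"])
result = service.predict("12345a")
print(result)
```

In production the `predict` path would also have to handle authentication, gateway integration, batching, and monitoring, which is where the coupling to internal infrastructure comes from.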
Outlook
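AI workloads face all kinds of problems when moving to the public cloud. The bandwidth bottlenecks in communication and storage go without saying. Beyond that, deep learning often depends on many low-level Nvidia libraries and on assorted Python dependencies, and in integrated environments the GPU memory and compute held by Jupyter are poorly utilized.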
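Infrastructure will certainly evolve towards solving these problems. We believe that future AI infrastructure will be fully elastic. In training, the traditional approach requires the configuration of every participating role to be fixed in advance. For example, a distributed training job with 5 Workers must have exactly 5 Workers for its entire run. Resources can therefore only be specified statically, and the number of participating Workers cannot be adjusted dynamically when the cluster's resource situation changes.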
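More and more deep learning frameworks are now adding support for elastic training. Horovod, for example, introduces the concept of a Driver that manages the lifecycle of the Workers. When any Worker runs into a problem, the Driver catches the exception and re-forms the ring according to the configuration so that training can continue. Training is not interrupted in the process. This lets a training job scale out when cluster load is low and GPUs are idle, and scale in when cluster load is high. Such an architecture can be combined with public cloud capabilities such as elastic instances to improve fault tolerance while lowering the cost of training.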
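
The driver idea can be illustrated with a toy, single-process simulation. This is not Horovod's elastic API. `ElasticDriver` and its methods are hypothetical, and a real driver also handles worker discovery, state broadcast, and rollback to the last committed state.

```python
class ElasticDriver:
    """Toy driver: tracks live workers and rebuilds the ring on membership change."""

    def __init__(self, workers, min_workers=1):
        self.workers = list(workers)
        self.min_workers = min_workers
        self.committed_step = 0  # last step whose state all workers agreed on

    def ring(self):
        # Each worker sends to its successor; the last wraps around to the first.
        n = len(self.workers)
        return [(self.workers[i], self.workers[(i + 1) % n]) for i in range(n)]

    def remove(self, worker):
        self.workers.remove(worker)
        if len(self.workers) < self.min_workers:
            raise RuntimeError("not enough workers to continue")

    def add(self, worker):
        self.workers.append(worker)

    def train(self, total_steps, events):
        """events: step -> ("add" | "remove", worker). Returns ring size per step."""
        history = []
        while self.committed_step < total_steps:
            action = events.get(self.committed_step)
            if action:
                getattr(self, action[0])(action[1])  # membership changes here
            history.append(len(self.ring()))         # ring is rebuilt from live workers
            self.committed_step += 1                 # commit; the job is never restarted
        return history


driver = ElasticDriver(["w0", "w1", "w2"])
history = driver.train(5, {1: ("remove", "w1"), 3: ("add", "w3")})
print(history, driver.ring())
```

The job finishes all five steps even though one Worker disappears and another joins along the way, which is the property that makes cheap, preemptible elastic instances usable for training.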

Elastic Jupyter is a similar capability. In Jupyter's original design, every Kernel runs together with its Notebook, which means it has to hold a whole GPU for a long time, and this likewise keeps GPU utilization from improving. If Jupyter could request and use GPUs on demand, cluster resource utilization would improve further, cutting cost and raising efficiency.
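
On-demand use can be pictured as leasing a GPU only for the duration of a cell's execution. The sketch below is a toy model of that idea with a hypothetical `GpuPool`. It does not reflect how Jupyter kernels are actually scheduled, and real code would be executed by a remote kernel rather than formatted into a string.

```python
from contextlib import contextmanager


class GpuPool:
    """Toy pool of GPUs shared by many notebooks."""

    def __init__(self, gpu_ids):
        self.free = list(gpu_ids)

    @contextmanager
    def lease(self):
        if not self.free:
            raise RuntimeError("no free GPU; the cell would have to wait")
        gpu = self.free.pop()
        try:
            yield gpu
        finally:
            self.free.append(gpu)  # released as soon as the cell finishes


def run_cell(pool, code):
    """Hold a GPU only while the cell runs, not for the notebook's lifetime."""
    with pool.lease() as gpu:
        return f"ran {code!r} on {gpu}"


pool = GpuPool(["gpu-0"])
out_a = run_cell(pool, "train()")     # notebook A
out_b = run_cell(pool, "evaluate()")  # notebook B reuses the same card afterwards
print(out_a, out_b, pool.free)
```

Two notebooks share a single card here because neither holds it between cell executions.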
Summary

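Finally, we summarize the main points of this talk. First, intranet bandwidth on the public cloud remains a major constraint on moving AI workloads to the cloud. Different scenarios call for different mitigations, and RDMA options, including bare metal, are available. We believe this will stop being a problem as public cloud network bandwidth gradually increases.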
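Second, industry still lacks a de facto standard for AI infrastructure. There are many open-source AI infrastructure projects. Among them Kubeflow has the most deployments, and its deep integration with Kubernetes lets it work well with a company's existing infrastructure, which gives it a certain advantage. Overall, however, the systems in this field differ greatly from one another. This is one of the field's biggest problems: every company's AI infrastructure has its own character, so it is hard for the community to join forces and move the industry forward together the way it did around Kubernetes in cluster scheduling.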
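Lastly, we see a fully elastic architecture as the next step in this evolution. AI workloads still cannot make good use of elasticity, even though elasticity is the biggest benefit cloud computing offers. Only on a truly elastic architecture can applications be born on the cloud, grow on the cloud, and serve the ultimate goal of lowering enterprise cost and raising efficiency.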
|