Editor's summary: this article covers the Transformer's network architecture, the various tensors/vectors in the model, how self-attention is computed with vectors, and how that computation is implemented with matrices.
ǰÑÔ
TransformerÔÚGooleµÄһƪÂÛÎÄAttention is All You Need±»Ìá³ö£¬ÎªÁË·½±ãʵÏÖµ÷ÓÃTransformer
Google»¹¿ªÔ´ÁËÒ»¸öµÚÈý·½¿â£¬»ùÓÚTensorFlowµÄTensor2Tensor£¬Ò»¸öNLPµÄÉçÇøÑо¿Õß¹±Ï×ÁËÒ»¸öTorch°æ±¾µÄÖ§³Ö£ºguide
annotating the paper with PyTorch implementation¡£ÕâÀÎÒÏëÓÃһЩ·½±ãÀí½âµÄ·½Ê½À´Ò»²½Ò»²½½âÊÍTransformerµÄѵÁ·¹ý³Ì£¬ÕâÑù¼´±ãÄãûÓкÜÉîµÄÉî¶Èѧϰ֪ʶÄãÒ²ÄÜ´ó¸ÅÃ÷°×ÆäÖеÄÔÀí¡£
A High-Level Look
ÎÒÃÇÏȰÑTransformerÏëÏó³ÉÒ»¸öºÚÏ»×Ó£¬ÔÚ»úÆ÷·ÒëµÄÁìÓòÖУ¬Õâ¸öºÚÏ»×ӵŦÄܾÍÊÇÊäÈëÒ»ÖÖÓïÑÔÈ»ºó½«Ëü·Òë³ÉÆäËûÓïÑÔ¡£ÈçÏÂͼ£º

Lifting the hood of The Transformer, we see that this black box consists of 2 parts: an encoding stack (Encoders) and a decoding stack (Decoders).

Dissecting further, we find that the encoding stack is made up of 6 Encoders (that is how the paper configures it), and the decoding stack is likewise made up of 6 Decoders.

All the Encoders in the stack have an identical structure, but they do not share weights. Each Encoder layer consists of 2 parts, as shown below:

ÿ¸öEncoderµÄÊäÈëÊ×ÏÈ»áͨ¹ýÒ»¸öself-attention²ã£¬Í¨¹ýself-attention²ã°ïÖúEndcoderÔÚ±àÂëµ¥´ÊµÄ¹ý³ÌÖв鿴ÊäÈëÐòÁÐÖÐµÄÆäËûµ¥´Ê¡£Èç¹ûÄã²»Çå³þÕâÀïÔÚ˵ʲô£¬²»ÓÃ׿±£¬Ö®ºóÎÒÃÇ»áÏêϸ½éÉÜself-attentionµÄ¡£
Self-attentionµÄÊä³ö»á±»´«ÈëÒ»¸öÈ«Á¬½ÓµÄǰÀ¡Éñ¾ÍøÂ磬ÿ¸öencoderµÄǰÀ¡Éñ¾ÍøÂç²ÎÊý¸öÊý¶¼ÊÇÏàͬµÄ£¬µ«ÊÇËûÃǵÄ×÷ÓÃÊǶÀÁ¢µÄ¡£
ÿ¸öDecoderҲͬÑù¾ßÓÐÕâÑùµÄ²ã¼¶½á¹¹£¬µ«ÊÇÔÚÕâÖ®¼äÓÐÒ»¸öAttention²ã£¬°ïÖúDecoderרעÓÚÓëÊäÈë¾ä×ÓÖжÔÓ¦µÄÄǸöµ¥´Ê£¨ÀàËÆÓëseq2seq
modelsµÄ½á¹¹£©

Bringing The Tensors Into The Picture
In the previous section we introduced the Transformer's network architecture. Now let's use figures to study the various tensors/vectors in the Transformer model and watch how the data flows through the components on its way from input to output.
As usual in NLP, we start with a word embedding: what is a word embedding of text?

ÎÒÃǽ«Ã¿¸öµ¥´Ê±àÂëΪһ¸ö512ά¶ÈµÄÏòÁ¿£¬ÎÒÃÇÓÃÉÏÃæÕâÕżò¶ÌµÄͼÐÎÀ´±íʾÕâЩÏòÁ¿¡£´ÊǶÈëµÄ¹ý³ÌÖ»·¢ÉúÔÚ×îµ×²ãµÄEncoder¡£µ«ÊǶÔÓÚËùÓеÄEncoderÀ´Ëµ£¬Äã¶¼¿ÉÒÔ°´ÏÂͼÀ´Àí½â¡£ÊäÈ루һ¸öÏòÁ¿µÄÁÐ±í£¬Ã¿¸öÏòÁ¿µÄά¶ÈΪ512ά£¬ÔÚ×îµ×²ãEncoder×÷ÓÃÊÇ´ÊǶÈ룬ÆäËû²ã¾ÍÊÇÆäǰһ²ãµÄoutput£©¡£ÁíÍâÕâ¸öÁбíµÄ´óСºÍ´ÊÏòÁ¿Î¬¶ÈµÄ´óС¶¼ÊÇ¿ÉÒÔÉèÖõij¬²ÎÊý¡£Ò»°ãÇé¿öÏ£¬ËüÊÇÎÒÃÇѵÁ·Êý¾Ý¼¯ÖÐ×µÄ¾ä×ӵij¤¶È¡£

ÉÏͼÆäʵ½éÉܵ½ÁËÒ»¸öTransformerµÄ¹Ø¼üµã¡£Äã×¢Òâ¹Û²ì£¬ÔÚÿ¸öµ¥´Ê½øÈëSelf-Attention²ãºó¶¼»áÓÐÒ»¸ö¶ÔÓ¦µÄÊä³ö¡£Self-Attention²ãÖеÄÊäÈëºÍÊä³öÊÇ´æÔÚÒÀÀµ¹ØÏµµÄ£¬¶øÇ°À¡²ãÔòûÓÐÒÀÀµ£¬ËùÒÔÔÚǰÀ¡²ã£¬ÎÒÃÇ¿ÉÒÔÓõ½²¢Ðл¯À´ÌáÉýËÙÂÊ¡£
ÏÂÃæÎÒÓÃÒ»¸ö¼ò¶ÌµÄ¾ä×Ó×÷ΪÀý×Ó£¬À´Ò»²½Ò»²½ÍƵ¼transformerÿ¸ö×Ó²ãµÄÊý¾ÝÁ÷¶¯¹ý³Ì¡£
Now We're Encoding!
ÕýÈç֮ǰËù˵£¬TransformerÖеÄÿ¸öEncoder½ÓÊÕÒ»¸ö512ά¶ÈµÄÏòÁ¿µÄÁбí×÷ΪÊäÈ룬Ȼºó½«ÕâЩÏòÁ¿´«µÝµ½¡®self-attention¡¯²ã£¬self-attention²ã²úÉúÒ»¸öµÈÁ¿512άÏòÁ¿ÁÐ±í£¬È»ºó½øÈëǰÀ¡Éñ¾ÍøÂ磬ǰÀ¡Éñ¾ÍøÂçµÄÊä³öҲΪһ¸ö512ά¶ÈµÄÁÐ±í£¬È»ºó½«Êä³öÏòÉÏ´«µÝµ½ÏÂÒ»¸öencoder¡£

ÈçÉÏͼËùʾ£¬Ã¿¸öλÖõĵ¥´ÊÊ×ÏȻᾹýÒ»¸öself attention²ã£¬È»ºóÿ¸öµ¥´Ê¶¼Í¨¹ýÒ»¸ö¶ÀÁ¢µÄǰÀ¡Éñ¾ÍøÂ磨ÕâЩÉñ¾ÍøÂç½á¹¹ÍêÈ«Ïàͬ£©¡£
Self-Attention at a High Level
"Self attention" may sound like a term everyone already understands, but it is actually a fairly new concept in the field; you can read Attention is All You Need to understand the principle behind self attention.
Suppose the following sentence is the input we want to translate:
"The animal didn't cross the street because it was too tired"
What does "it" refer to in this sentence? Does it refer to "animal" or to "street"? For a person this is a trivial question, but for an algorithm it is not easy at all. Self attention was introduced to solve exactly this problem: through self attention, we can link "it" to "animal".
As the model processes each word, the self attention layer allows it to look at other words in the input sequence for clues that lead to a better encoding of the current word.
If you are familiar with RNNs, recall how an RNN relates previously seen words (vectors) to the current word (vector): through the way it computes its hidden state. Self-attention is the Transformer's way of using context to understand the current word. You will easily notice that, compared with RNNs, the transformer parallelizes much better.

ÈçÉÏͼ£¬ÊÇÎÒÃǵÚÎå²ãEncoderÕë¶Ôµ¥´Ê'it'µÄͼʾ£¬¿ÉÒÔ·¢ÏÖ£¬ÎÒÃǵÄEncoderÔÚ±àÂëµ¥´Ê¡®it¡¯Ê±£¬²¿·Ö×¢ÒâÁ¦»úÖÆ¼¯ÖÐÔÚÁË¡®animl¡¯ÉÏ£¬Õⲿ·ÖµÄ×¢ÒâÁ¦»áͨ¹ýȨֵ´«µÝµÄ·½Ê½Ó°Ïìµ½'it'µÄ±àÂë¡£
Self-Attention in Detail
In this section we first describe how to compute self attention with vectors, and then look at how it is actually implemented with matrices.
The first step in computing self attention is to create 3 vectors from each Encoder's input vectors (in this case, the embedding of each word). So for each word we create a Query vector, a Key vector, and a Value vector. These vectors are produced by multiplying the embedding by three matrices learned during training.
Note that these new vectors are smaller in dimension than the embedding vector. The embedding dimension is 512, while the new vectors are only 64-dimensional. They don't have to be smaller; this is an architectural choice that keeps the computation of Multi-Headed Attention (mostly) constant.

ÎÒÃǽ«X1³ËÒÔWQµÄÈ¨ÖØ¾ØÕóµÃµ½ÐÂÏòÁ¿q1£¬q1¼ÈÊÇ¡°query¡±µÄÏòÁ¿¡£Í¬Àí£¬×îÖÕÎÒÃÇ¿ÉÒÔ¶ÔÊäÈë¾ä×ÓµÄÿ¸öµ¥´Ê´´½¨¡°query¡±£¬
¡°key¡±£¬¡°value¡±µÄÐÂÏòÁ¿±íʾÐÎʽ¡£
¶ÔÁË..¡°query¡±£¬¡°key¡±£¬¡°value¡±ÊÇʲôÏòÁ¿ÄØ£¿ÓÐʲôÓÃÄØ£¿
ÕâЩÏòÁ¿µÄ¸ÅÄîÊǺܳéÏ󣬵«ÊÇËüȷʵÓÐÖúÓÚ¼ÆËã×¢ÒâÁ¦¡£²»¹ýÏȲ»ÓþÀ½áÈ¥Àí½âËü£¬ºóÃæµÄµÄÄÚÈÝ£¬»á°ïÖúÄãÀí½âµÄ¡£
The second step in computing self attention is to compute a score. Using the figure above as an example, suppose we are computing the self attention for the first word, "thinking". We need to score each word of the input sentence against this word. When we encode a word at some position, the score determines how much attention to place on the other words of the input sentence.
The score is computed as the dot product of the query vector with the key vector of the respective word. So if we are processing the self attention for the word in the first position, the first score is the dot product of q1 and k1, and the second score is the dot product of q1 and k2, as shown below.

The third and fourth steps are to divide the scores from step two by 8, the square root of the dimension of the key vectors used in the paper (√64 = 8); this leads to more stable gradients during training. This value isn't the only possible one, just the empirical default. We then pass the results through a softmax function to normalize them so that they sum to 1.

This softmax score determines how much each word of the sentence is expressed at this position. Naturally, the word at this position itself gets the highest softmax score, but sometimes the attention mechanism also attends to other, relevant words, which is useful.
The fifth step is to multiply each Value vector by its softmax score. The intuition is to keep the values of the words we want to focus on intact while drowning out irrelevant words.
The sixth step is to sum up the weighted Value vectors. This produces the output of the self-attention layer at this position (for the first word).
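Here is a minimal NumPy sketch of steps 1 through 6 for a single position, using made-up small dimensions (d_model = 4, d_k = 3) and random stand-in weight matrices so the shapes stay readable:

    import numpy as np

    def softmax(z):
        e = np.exp(z - z.max())
        return e / e.sum()

    np.random.seed(0)
    x1, x2 = np.random.randn(4), np.random.randn(4)         # embeddings of 2 words
    WQ, WK, WV = (np.random.randn(4, 3) for _ in range(3))  # stand-in trained matrices

    q1 = x1 @ WQ                            # step 1: query vector for word 1
    k1, k2 = x1 @ WK, x2 @ WK               # ...key vectors
    v1, v2 = x1 @ WV, x2 @ WV               # ...value vectors

    scores = np.array([q1 @ k1, q1 @ k2])   # step 2: score word 1 against each word
    scores = scores / np.sqrt(3)            # steps 3-4: divide by sqrt(d_k) ...
    weights = softmax(scores)               # ... and normalize with softmax

    z1 = weights[0] * v1 + weights[1] * v2  # steps 5-6: weight and sum the values
    print(z1)                               # self-attention output for word 1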

×ܽáself-attentionµÄ¼ÆËã¹ý³Ì£¬£¨µ¥´Ê¼¶±ð£©¾ÍÊǵõ½Ò»¸öÎÒÃÇ¿ÉÒԷŵ½Ç°À¡Éñ¾ÍøÂçµÄʸÁ¿¡£ È»¶øÔÚʵ¼ÊµÄʵÏÖ¹ý³ÌÖУ¬¸Ã¼ÆËã»áÒÔ¾ØÕóµÄÐÎʽÍê³É£¬ÒÔ±ã¸ü¿ìµØ´¦Àí¡£ÏÂÃæÎÒÃÇÀ´¿´¿´Self-AttentionµÄ¾ØÕó¼ÆË㷽ʽ¡£
Matrix Calculation of Self-Attention
The first step is to compute the Query, Key, and Value matrices. We pack the word embeddings into a matrix X and multiply it by the weight matrices we trained (WQ, WK, WV).

Each row of the matrix X corresponds to a word in the input sentence. The number of boxes per row in the figure is the embedding dimension, shown smaller than in the paper: X has 4 boxes per row in the figure versus 512 in the paper, and the q / k / v vectors have 3 boxes versus 64 in the paper.
Finally, since we are dealing with matrices, we can condense steps 2 through 6 into a single formula to compute the output of the self attention layer.
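That condensed formula, as given in the paper, is Attention(Q, K, V) = softmax(QKᵀ / √d_k)V. A minimal sketch of it, with illustrative dimensions and random stand-in weights:

    import numpy as np

    def softmax(z, axis=-1):
        e = np.exp(z - z.max(axis=axis, keepdims=True))
        return e / e.sum(axis=axis, keepdims=True)

    # X stacks the word embeddings row by row; dimensions are illustrative.
    seq_len, d_model, d_k = 3, 512, 64
    X = np.random.randn(seq_len, d_model)
    WQ, WK, WV = (np.random.randn(d_model, d_k) for _ in range(3))

    Q, K, V = X @ WQ, X @ WK, X @ WV               # step 1 in matrix form
    Z = softmax(Q @ K.T / np.sqrt(d_k)) @ V        # steps 2-6 in one line
    print(Z.shape)  # (3, 64): one output row per input position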

The Beast With Many Heads
±¾ÎÄͨ¹ýʹÓá°Multi-headed¡±µÄ»úÖÆÀ´½øÒ»²½ÍêÉÆself attention²ã¡£¡°Multi-headed¡±Ö÷Ҫͨ¹ýÏÂÃæ2Öз½Ê½¸ÄÉÆÁËattention²ãµÄÐÔÄÜ£º
1. It expands the model's ability to focus on different positions. In the example above, "The animal didn't cross the street because it was too tired", the attention mechanism works out that "it" refers to "animal", which is very useful for language understanding.
2.ËüΪattention²ãÌṩÁ˶à¸ö¡°representation subspaces¡±¡£ÓÉÏÂͼ¿ÉÒÔ¿´µ½£¬ÔÚself
attentionÖУ¬ÎÒÃÇÓжà¸ö¸öQuery / Key / ValueÈ¨ÖØ¾ØÕó£¨TransformerʹÓÃ8¸öattention
heads£©¡£ÕâЩ¼¯ºÏÖеÄÿ¸ö¾ØÕó¶¼ÊÇËæ»ú³õʼ»¯Éú³ÉµÄ¡£È»ºóͨ¹ýѵÁ·£¬ÓÃÓÚ½«´ÊǶÈ루»òÕßÀ´×ԽϵÍEncoder/DecoderµÄʸÁ¿£©Í¶Ó°µ½²»Í¬µÄ¡°representation
subspaces£¨±íʾ×ӿռ䣩¡±ÖС£

ͨ¹ýmulti-headed attention£¬ÎÒÃÇΪÿ¸ö¡°header¡±¶¼¶ÀÁ¢Î¬»¤Ò»Ì×Q/K/VµÄȨֵ¾ØÕó¡£È»ºóÎÒÃÇ»¹ÊÇÈç֮ǰµ¥´Ê¼¶±ðµÄ¼ÆËã¹ý³ÌÒ»Ñù´¦ÀíÕâЩÊý¾Ý¡£
Èç¹û¶ÔÉÏÃæµÄÀý×Ó×öͬÑùµÄself attention¼ÆË㣬¶øÒòΪÎÒÃÇÓÐ8Í·attention£¬ËùÒÔÎÒÃÇ»áÔڰ˸öʱ¼äµãÈ¥¼ÆËãÕâЩ²»Í¬µÄȨֵ¾ØÕ󣬵«×îºó½áÊøÊ±£¬ÎÒÃÇ»áµÃµ½8¸ö²»Í¬µÄ¾ØÕó¡£ÈçÏÂͼ£º

See the challenge this creates for the next stage?
The feed-forward network that follows self-attention expects a single matrix (one vector per position), not 8 matrices. So we need a way to condense these 8 matrices down into one.
How do we do that?
We concatenate the 8 matrices and then multiply the result by an additional weight matrix. The steps are shown below:

That is pretty much all there is to multi-headed self attention. So far it has been shown piecewise in figures; now let's connect the whole process in one overall diagram to reinforce the picture.

Now that we have touched on attention heads, let's revisit our earlier example and see where the different attention heads focus as we encode the word "it" in our example sentence.

Èçͼ£ºµ±ÎÒÃǶԡ°it¡±Õâ¸ö´Ê½øÐбàÂëʱ£¬Ò»¸ö×¢ÒâÁ¦µÄ½¹µãÖ÷Òª¼¯ÖÐÔÚ¡°animal¡±ÉÏ£¬¶øÁíÒ»¸ö×¢ÒâÁ¦¼¯ÖÐÔÚ¡°tired¡±
µ«ÊÇ£¬Èç¹ûÎÒÃǽ«ËùÓÐ×¢ÒâÁ¦Ìí¼Óµ½Í¼Æ¬ÖУ¬ÄÇôÊÂÇé¿ÉÄܸüÄÑÀí½â£º

Representing The Order of The Sequence Using Positional Encoding
One important thing we have not yet discussed is how the model accounts for the order of the words in the input sequence.
To address this, the transformer adds a new vector, a positional encoding vector, to each input word's embedding. These positional encoding vectors are generated in a fixed way, so they are easy to obtain, and the information they carry is genuinely useful: they capture the position of each word, or the distance between different words in the sequence. Once they are added to the embeddings and the Q/K/V vectors are formed and dot products taken, the resulting attention incorporates distance information as well.

ΪÁËÈÃÄ£ÐͲ¶×½µ½µ¥´ÊµÄ˳ÐòÐÅÏ¢£¬ÎÒÃÇÌí¼ÓλÖñàÂëÏòÁ¿ÐÅÏ¢£¨POSITIONAL ENCODING£©-λÖñàÂëÏòÁ¿²»ÐèҪѵÁ·£¬ËüÓÐÒ»¸ö¹æÔòµÄ²úÉú·½Ê½¡£
Èç¹ûÎÒÃǵÄǶÈëά¶ÈΪ4£¬ÄÇôʵ¼ÊÉϵÄλÖñàÂë¾ÍÈçÏÂͼËùʾ£º

ÄÇôÉú³ÉλÖÃÏòÁ¿ÐèÒª×ñÑÔõÑùµÄ¹æÔòÄØ£¿
¹Û²ìÏÂÃæµÄͼÐΣ¬Ã¿Ò»Ðж¼´ú±í×ŶÔÒ»¸öʸÁ¿µÄλÖñàÂë¡£Òò´ËµÚÒ»ÐоÍÊÇÎÒÃÇÊäÈëÐòÁÐÖеÚÒ»¸ö×ÖµÄǶÈëÏòÁ¿£¬Ã¿Ðж¼°üº¬512¸öÖµ£¬Ã¿¸öÖµ½éÓÚ1ºÍ-1Ö®¼ä¡£ÎÒÃÇÓÃÑÕÉ«À´±íʾ1£¬-1Ö®¼äµÄÖµ£¬ÕâÑù·½±ã¿ÉÊÓ»¯µÄ·½Ê½±íÏÖ³öÀ´£º

This is a real example of positional encodings for 20 words (rows) with 512 columns. You can see that it appears split in half down the center: the values of the left half are generated by a sine function, and the right half by another function (cosine). The two halves are then concatenated to form each positional encoding vector.
The formula for positional encoding is described in the paper (section 3.5). You can also see the code used to generate them, get_timing_signal_1d(), in Tensor2Tensor. This is not the only possible method for positional encoding; however, it has the advantage of scaling to unseen sequence lengths (for example, if our trained model is asked to translate a sentence longer than any in our training set).
The Residuals
In this section I want to cover one more detail of the encoder: each sub-layer (self-attention and feed-forward) in each encoder has a residual connection around it, followed by a layer-normalization step, as shown below:

Exploring its internal computation further, we can visualize the layer as in the figure below:

The Decoder's sub-layers work the same way. If we picture a Transformer that stacks 2 Encoders and 2 Decoders, it would be visualized as shown below:

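A minimal sketch of the residual-plus-layer-normalization wrapping described above (the trainable gain and bias of layer normalization are omitted, and the sub-layers are passed in as stand-in functions):

    import numpy as np

    def layer_norm(x, eps=1e-6):
        # Normalize each position's vector to zero mean and unit variance.
        mean = x.mean(axis=-1, keepdims=True)
        std = x.std(axis=-1, keepdims=True)
        return (x - mean) / (std + eps)

    def encoder_layer(x, self_attention, feed_forward):
        x = layer_norm(x + self_attention(x))   # sub-layer 1 with residual
        x = layer_norm(x + feed_forward(x))     # sub-layer 2 with residual
        return x

    x = np.random.randn(3, 512)
    out = encoder_layer(x, lambda t: t, lambda t: t)  # identity stubs, just to run
    print(out.shape)  # (3, 512)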
The Decoder Side
ÎÒÃÇÒѾ»ù±¾½éÉÜÍêÁËEncoderµÄ´ó¶àÊý¸ÅÄÎÒÃÇ»ù±¾ÉÏÒ²¿ÉÒÔÔ¤ÖªDecoderÊÇÔõô¹¤×÷µÄ¡£ÏÖÔÚÎÒÃÇÀ´×Ðϸ̽ÌÖÏÂDecoderµÄÊý¾Ý¼ÆËãÔÀí£¬
µ±ÐòÁÐÊäÈëʱ£¬Encoder¿ªÊ¼¹¤×÷£¬×îºóÔÚÆä¶¥²ãµÄEncoderÊä³öʸÁ¿×é³ÉµÄÁÐ±í£¬È»ºóÎÒÃǽ«Æäת»¯ÎªÒ»×éattentionµÄ¼¯ºÏ£¨K,V£©¡££¨K,V£©½«´øÈëÿ¸öDecoderµÄ¡°encoder-decoder
attention¡±²ãÖÐÈ¥¼ÆË㣨ÕâÑùÓÐÖúÓÚdecoder²¶»ñÊäÈëÐòÁеÄλÖÃÐÅÏ¢£©

After the encoding phase comes the decoding phase, in which each step outputs one element of the output sequence (here, the English translation of the sentence).
What is described above is really the inference stage; what does the training stage look like?
We proceed step by step as in the figure below, until a special symbol <end of sentence> is output, indicating completion. The output of each step is fed to the bottom decoder in the next time step, and the decoders bubble up their decoding results just like the encoders did. For the Decoders, just as for the Encoders, we embed the input to each Decoder and add positional encodings to indicate the position of each word.

DecoderÖеÄself attentionÓëEncoderµÄself attentionÂÔÓв»Í¬£º
ÔÚDecoderÖУ¬self attentionÖ»¹Ø×¢Êä³öÐòÁÐÖеĽÏÔçµÄλÖá£ÕâÊÇÔÚself
attention¼ÆËãÖеÄsoftmax²½Öè֮ǰÆÁ±ÎÁËÌØÕ÷λÖã¨ÉèÖÃΪ -inf£©À´Íê³ÉµÄ¡£
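A minimal sketch of this masking, applied to a raw score matrix before the softmax:

    import numpy as np

    def causal_mask(scores):
        # Positions j > i get -inf, so their softmax weights become 0 and
        # each word can only attend to itself and earlier words.
        seq_len = scores.shape[-1]
        mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
        return np.where(mask, -np.inf, scores)

    scores = np.random.randn(4, 4)   # raw q.k scores for 4 positions
    print(causal_mask(scores))       # upper triangle is now -inf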
The "Encoder-Decoder Attention" layer works just like multi-headed self-attention, except that it creates its Query matrix from the layer below it and takes the Key and Value matrices from the output of the Encoder stack.
The Final Linear and Softmax Layer
The Decoder outputs a list of vectors of floats. How do we turn these into a word? That is the job of the final Linear layer followed by a softmax layer.
The Linear layer is a simple fully connected neural network that projects the vector produced by the Decoder stack into a much, much larger vector called a logits vector.
Suppose our model has learned 10,000 distinct English words from the training set (its "Output Vocabulary"). The logits vector then has 10,000 entries, each one the score of a unique word. The softmax layer that follows turns those scores into probabilities; we pick the index with the highest probability and look up the corresponding word as the output.

ÉÏͼÊÇ´ÓDecoderµÄÊä³ö¿ªÊ¼µ½×îÖÕsoftmaxµÄÊä³ö¡£Ò»²½Ò»²½µÄͼ½â¡£
Recap Of Training
Now that we have walked through the transformer's entire forward process, let's recap training.
During training, an untrained model goes through exactly the flow above. And because we train on a labeled dataset (machine translation data can be seen as a bilingual parallel corpus), we can compare the model's output with the true correct answer and backpropagate the error.
ΪÁ˸üºÃµÄÀí½âÕⲿ·ÖÄÚÈÝ£¬ÎÒÃǼÙÉèÎÒÃÇÊä³öµÄ´Ê»ãÖ»ÓУ¨¡°a¡±£¬¡°am¡±£¬¡°i¡±£¬¡°thanks¡±£¬
¡°student¡±ºÍ¡°<eos>¡±£¨¡°¾äÄ©¡±µÄËõд£©£©

Our model's output vocabulary is created in the preprocessing phase, before training begins.
Once the output vocabulary is defined, we can use a vector of the same width to represent each word in it; this is known as one-hot encoding. For example, we can represent the word "am" with the following vector:

ʾÀý£ºÎÒÃǵÄÊä³ö´Ê»ã±íµÄone-hot±àÂë
In the next section we discuss the model's loss function, the metric we optimize to arrive at a well-trained and astonishingly accurate model.
The Loss Function
Say we are training our model to translate "merci" into "thanks". This means we want the model's output to indicate the word "thanks". But since this model has not been trained yet, that is unlikely to happen just now.

This is because the model's parameters (weights) are randomly initialized, so the (untrained) model produces an essentially arbitrary probability distribution for each word. We can compare that output with the actual expected output, then use backpropagation to adjust all the model's weights so the output moves closer to the desired output.
So how do we compare the predicted distribution with the true expected one?
We can simply subtract one from the other; look into cross-entropy and Kullback-Leibler divergence to learn more about how such differences are measured.
Note, though, that this is an oversimplified demo. Realistically, we need to output a longer sentence, for example input "je suis étudiant" with expected output "I am a student". For such inputs, the model must output probability distributions one after another, where:
Each probability distribution is represented by a vector of width vocab_size (6 in our toy example, but more realistically 3,000 or 10,000).
The first probability distribution has the highest probability in the cell associated with the word "i".
The second probability distribution has the highest probability in the cell associated with the word "am".
And so on, until the fifth output distribution indicates the '<end of sentence>' symbol, meaning the prediction is complete.

ÉÏͼΪ£ºÊäÈ룺¡°je suis ¨¦tudiant¡±ºÍÔ¤ÆÚÊä³ö£º¡°I am a student¡±µÄÆÚÍûÔ¤²â¸ÅÂÊ·Ö²¼Çé¿ö¡£
ÔÚË㷨ģÐÍÖУ¬ËäÈ»²»ÄÜ´ïµ½ÆÚÍûµÄÇé¿ö£¬µ«ÊÇÎÒÃÇÐèÒªÔÚѵÁ·ÁË×ã¹»³¤Ê±¼äÖ®ºó£¬ÎÒÃǵÄË㷨ģÐÍÄܹ»ÓÐÈçÏÂͼËùʾµÄ¸ÅÂÊ·Ö²¼Çé¿ö£º

Now, because the model produces one output at a time, we can assume that the model selects the word with the highest probability from that probability distribution (the softmax vector) and throws away the rest.
There are 2 ways to do this: greedy decoding, and beam search. Beam search is an optimization technique worth reading about on your own; we won't explain it further here.