Transformer Learning Summary: Principles
 
Author: 忆臻DR

6200 views | 2019-11-7
 
Editor's recommendation:
This article reviews the theory behind the Transformer; we hope it helps your study.
The article comes from CSDN and was edited and recommended by Delores of 火龙果软件.

1. Overall Structure

First, let's look at the Transformer's structure as a whole:

As the figure shows, the Transformer consists of four parts overall:

Inputs: Inputs = WordEmbedding(Inputs) + PositionalEmbedding

Outputs: Outputs = WordEmbedding(Outputs) + PositionalEmbedding

Encoder stack: composed of six identical Encoder layers. The first Encoder layer takes Inputs as its input; every other Encoder layer takes the previous Encoder layer's output.

Decoder stack: composed of six identical Decoder layers. The first Decoder layer takes Outputs and the last Encoder layer's output as its input; every other Decoder layer takes the previous Decoder layer's output together with the last Encoder layer's output.

As shown in the figure below, the relationship between the inputs and outputs of the Encoder and Decoder layers is easier to grasp at this higher level.

The internal composition of the Encoder and Decoder layers differs as shown in the next figure. Each Encoder layer contains a Self-Attention sublayer and a Feed Forward sublayer. Each Decoder layer contains a Self-Attention sublayer, an Encoder-Decoder Attention sublayer, and a Feed Forward sublayer. The only difference between the two is the extra Encoder-Decoder Attention sublayer in the Decoder; the other two sublayers have identical structure in both.

2. Self-Attention

2.1 ΪʲôѡÔñSelf-Attention

Let's start with a simple example that shows where the Self-Attention mechanism improves on traditional sequence models. Suppose the sentence to be translated is "The animal didn't cross the street because it was too tired". When translating "it", what does it refer to? To determine that, we clearly need to attend to all the words in its context at once; the key words in this sentence are animal, street, and tired. Common sense then tells us that only an animal can be tired, so "it" must refer to animal. If we change tired to narrow, "it" obviously refers to street instead, because only a street can be described as narrow.

When the Self-Attention mechanism encodes a word, it considers all the words in that word's context and the contribution each makes to the final encoding, and then encodes the current word from that information. This guarantees that when translating "it", the words animal, street, and tired in its context are all taken into account, so that "it" is correctly resolved to animal.

What happens if we instead translate with a traditional sequence model such as an LSTM? Because an LSTM runs in a single direction (forward or backward), it obviously cannot take in both sides of "it"'s context at once, which can lead to translation errors. Taking a forward LSTM as an example: when translating "it", the only information available is "The animal didn't cross the street because"; "was too tired" cannot be considered, so the model cannot tell whether "it" refers to street or animal. We can of course use a bidirectional LSTM structure, but it is not truly bidirectional the way Self-Attention is; it works by concatenating the outputs of a forward LSTM and a backward LSTM, which makes the model far more complex than Self-Attention.

ÏÂͼÊÇÄ£Ð͵Ä×îÉÏÒ»²ã(ϱê0ÊǵÚÒ»²ã£¬5ÊǵÚÁù²ã)EncoderµÄAttention¿ÉÊÓ»¯Í¼¡£ÕâÊÇtensor2tensorÕâ¸ö¹¤¾ßÊä³öµÄÄÚÈÝ¡£ÎÒÃÇ¿ÉÒÔ¿´µ½£¬ÔÚ±àÂëitµÄʱºòÓÐÒ»¸öAttention Head(ºóÃæ»á½²µ½)×¢Òâµ½ÁËAnimal£¬Òò´Ë±àÂëºóµÄitÓÐAnimalµÄÓïÒå¡£

The strength of Self-Attention is not only that it can take all the information in a word's context into account while encoding the word, but also that it allows the training process to be parallelized, which shortens training time dramatically compared with traditional sequence models. In a traditional sequence model the state at time t is affected by the state at time t-1, so training cannot be parallelized and must run serially. In the Self-Attention model, the whole operation can easily be parallelized through matrix computations.

2.2 The Structure of Self-Attention

To make the structure of Self-Attention easier to understand, we first present its vector form and then generalize that to matrix form.

For each input vector of the model (in the first layer the input is the Embedding vector of a word; in any later layer it is the previous layer's output vector), we first generate three new vectors from the input vector: Q (Query), K (Key), and V (Value). The Query vector represents what the current word needs to attend to among the other words (including the current word itself) in order to be encoded; the Key vector holds the key information against which the current word is retrieved; and the Value vector is the actual content. All three are obtained by passing the current word's Embedding vector through three different linear layers.

Let's understand the Self-Attention mechanism through a concrete example. Suppose the input words are "thinking" and "machines". First we apply Word Embedding to obtain their word vectors x_1 and x_2, then pass each word vector through three different matrices, giving the vectors q_1, k_1, v_1 and q_2, k_2, v_2. For the Query and Key vectors to form inner products, the model requires W^Q and W^K to have the same size; there is no size requirement on W^V.
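
This projection step can be sketched in a few lines of NumPy. The sizes (d_model, d_k, d_v) and the random matrices standing in for learned weights are illustrative assumptions, not values from the figure:

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, d_k, d_v = 4, 3, 3          # toy sizes, chosen for illustration

# Embeddings for "thinking" and "machines" (random stand-ins).
x1 = rng.normal(size=d_model)
x2 = rng.normal(size=d_model)

# Three different linear maps. W_Q and W_K must map to the same size d_k
# so that q . k is defined; W_V may use a different size d_v.
W_Q = rng.normal(size=(d_model, d_k))
W_K = rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_v))

q1, k1, v1 = x1 @ W_Q, x1 @ W_K, x1 @ W_V
q2, k2, v2 = x2 @ W_Q, x2 @ W_K, x2 @ W_V

print(q1.shape, k1.shape, v1.shape)  # (3,) (3,) (3,)
```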

The figure visualizes the process above. Once the vectors q_i, k_i, v_i have been obtained for every input, the Self-Attention vectors can be computed. As shown below, to compute the attention vector for "thinking", we first take the dot product of q_1 with every input's k_i, obtaining a separate Score for each:

As shown below, a scale operation is then applied to the Score values: dividing by 8 (that is, √d_k) shrinks the scores, which makes them smoother and gradient descent more stable, helping training. A Softmax is then applied to the new Score values, and the probability distribution it produces is used to take a weighted average over all the v_i, giving z_1, the final representation of the current word. Encoding "machines" follows exactly the same procedure.

If we fed each word's Embedding vector through the vector form above in a loop to get its final encoding, the computation would still be serial. If we instead turn the vector computations above into matrix operations, the final encodings of all the words can be computed in a single pass. Such matrix operations make full use of the computer's hardware and software resources and let the program run much more efficiently.

ÏÂͼËùʾΪ¾ØÕóÔËËãµÄÐÎʽ¡£ÆäÖÐX XXΪÊäÈë¶ÔÓ¦µÄ´ÊÏòÁ¿¾ØÕó£¬WQ¡¢WK¡¢WV W^{Q}¡¢W^{K}¡¢W^{V}W Q ¡¢W K ¡¢W V ΪÏàÓ¦µÄÏßÐԱ任¾ØÕó£¬Q¡¢K¡¢V Q¡¢K¡¢VQ¡¢K¡¢VΪX XX¾­¹ýÏßÐԱ任µÃµ½µÄQuery QueryQueryÏòÁ¿¾ØÕó¡¢Key KeyKeyÏòÁ¿¾ØÕóºÍValue ValueValueÏòÁ¿¾ØÕó

2.3 Scaled Dot-Product Attention

__Scaled Dot-Product Attention__ is the attention of the previous section with two extra operations, __scale__ and __mask__; the structure is shown in the figure below. For the __scale__ operation, the scaling factor is introduced because the values in the product QK can be very large; dividing by a scaling factor makes them smaller, so the model is more stable during gradient descent. The __mask__ operation screens out the parts of the input that carry no meaning (the padding mask) and the parts that must be hidden for a particular task (the sequence mask), reducing their influence on the final result. The two mask methods are explained in detail later.

The figure above shows the attention operation in different heads. As the figure indicates, each head has its own set of matrices W_i^Q, W_i^K, W_i^V; multiplying the input by them yields a corresponding set (Q_i, K_i, V_i), from which the head's output z_i is obtained. The attention operation inside a single head is exactly the one described in the previous section. The results produced by the multiple heads are shown in the figure below.

But the output we ultimately need is a single matrix, not several, so the outputs of all the heads are concatenated along the last dimension of the matrix, and the result is multiplied by a matrix W^O. The purpose of this final linear transformation is to compress the concatenated matrix into the desired output matrix.

The figure below shows the complete Multi-Head Attention process.
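
The whole multi-head process can be sketched in NumPy. The head count, sizes, and random weights are illustrative assumptions; setting the per-head size to d_model / n_heads is a common convention, not a requirement stated here:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    d_k = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d_k)) @ V

rng = np.random.default_rng(0)
seq_len, d_model, n_heads = 4, 8, 2
d_k = d_model // n_heads             # per-head size

X = rng.normal(size=(seq_len, d_model))

# One (W_Q, W_K, W_V) triple per head, plus the output projection W_O.
heads = [tuple(rng.normal(size=(d_model, d_k)) for _ in range(3))
         for _ in range(n_heads)]
W_O = rng.normal(size=(n_heads * d_k, d_model))

# Each head attends independently ...
zs = [attention(X @ W_Q, X @ W_K, X @ W_V) for W_Q, W_K, W_V in heads]
# ... then the head outputs are concatenated along the last dimension
# and compressed back to d_model by W_O.
Z = np.concatenate(zs, axis=-1) @ W_O

print(Z.shape)                       # (4, 8)
```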

How should we interpret what the individual heads attend to? The two figures below give examples. The first shows what two heads attend to when translating "it": quite clearly, the first head attends to animal and the second to tired, which is exactly what a correct translation requires.

The second figure shows what all the heads attend to; with every head displayed, whether attention really picks out the information that most needs to be captured becomes much less intuitive.

3. The Residual Connection

ÔÚTransformerÖУ¬Ã¿¸öMulti-Head Attention²ãºÍFeed Forward²ã¶¼»áÓÐÒ»¸ö²Ð²îÁ¬½Ó£¬È»ºóÔÙ½ÓÒ»¸öLayer Norm²ã¡£²Ð²îÁ¬½ÓÔÚEncoderºÍDecoderÖж¼´æÔÚ£¬ÇҽṹÍêÈ«Ïàͬ¡£ÈçÏÂͼËùʾ£¬Ò»¸öEncoderÖÐSelf-Attention²ãµÄÊä³öz1,z2 ºÍÊäÈë(x1,x2) Ïà¼Ó£¬×÷ΪLayerNorm²ãµÄÊäÈë¡£²Ð²îÁ¬½Ó±¾ÉíÓкܶàºÃ´¦£¬µ«²¢²»ÊÇTransformer½á¹¹µÄÖØµã£¬ÕâÀï²»×öÏêÊö¡£

4. Positional Encoding

We use Self-Attention in the Transformer to replace the RNN. An RNN can only attend to past information, whereas Self-Attention, through matrix operations, attends to all the information in the current context at once, which lets it match or even exceed an RNN. But the RNN, as a serial sequence model, has another very important property: it takes the order (position) of the words into account. Within one sentence, even if all the words are the same, a change in word order can change the meaning completely. Compare "a plane ticket from Beijing to Shanghai" with "a plane ticket from Shanghai to Beijing": their meanings differ greatly. The Self-Attention structure, however, ignores word order. Without positional information, the word "Beijing" in the two sentences above would be encoded into identical vectors, whereas we actually want the two encodings to differ: in the first sentence "Beijing" should encode the departure city, in the second the destination. In other words, without positional information, Self-Attention is just a bag-of-words model with a more elaborate structure. Introducing positional information into the word-vector encodings is therefore necessary.

To solve this problem, we introduce a positional encoding. Besides the Embedding (which is position-independent), the input at time t also includes a vector that depends on t, and we add the Embedding and the positional encoding vector together as the model's input. Then if the same word appears at two different positions, its Embedding is the same, but because the positional encodings differ, the final vectors differ as well.

λÖñàÂëÓкܶ෽·¨£¬ÆäÖÐÐèÒª¿¼ÂǵÄÒ»¸öÖØÒªÒòËØ¾ÍÊÇÐèÒªËü±àÂëµÄÊÇÏà¶ÔλÖõĹØÏµ¡£±ÈÈçÁ½¸ö¾ä×Ó£º¡±±±¾©µ½ÉϺ£µÄ»úƱ¡±ºÍ¡±ÄãºÃ£¬ÎÒÃÇÒªÒ»Õű±¾©µ½ÉϺ£µÄ»úƱ¡±¡£ÏÔÈ»¼ÓÈëλÖñàÂëÖ®ºó£¬Á½¸ö±±¾©µÄÏòÁ¿ÊDz»Í¬µÄÁË£¬Á½¸öÉϺ£µÄÏòÁ¿Ò²ÊDz»Í¬µÄÁË£¬µ«ÊÇÎÒÃÇÆÚÍûQuery(±±¾©1)*Key(ÉϺ£1) Query(±±¾©1)*Key(ÉϺ£1)Query(±±¾©1)*Key(ÉϺ£1)È´ÊǵÈÓÚQuery(±±¾©2)*Key(ÉϺ£2) Query(±±¾©2)*Key(ÉϺ£2)Query(±±¾©2)*Key(ÉϺ£2)µÄ¡£

As the figure above shows, the positional encoding is added to the word's Embedding vector, and the sum is the final output. This guarantees that when the same word appears at different positions in a sentence, its word-vector representation differs.

Concretely, the first thing to note is that the positional encoding matrix in the Transformer is fixed: when each input sample has size maxlen × d_model, the positional encoding matrix we need has that same size, maxlen × d_model. This matrix is added to every input sample, encoding positional information into the sample's word-vector representation. Next, let's see how the positional encoding matrix is obtained.

When the required positional encoding matrix PE has size maxlen × d_model, we first determine an angle for each entry from its row index pos and pairwise column index i by the formula below:

angle(pos, i) = pos / 10000^(2i / d_model)

Then sine encoding is used at the even column indices and cosine encoding at the odd ones:

PE(pos, 2i) = sin(angle(pos, i))
PE(pos, 2i+1) = cos(angle(pos, i))

Why does this encoding introduce positional information? If we encoded purely from the index values as in the first formula, we would clearly be encoding each word's absolute position, i.e. __absolute positional encoding__. But the relative positions of words also matter a great deal, and this is why the Transformer introduces sinusoidal functions for __relative positional encoding__. Sinusoids can express relative positional information mainly thanks to the following two identities:

sin(k + p) = sin(k)cos(p) + cos(k)sin(p)
cos(k + p) = cos(k)cos(p) - sin(k)sin(p)

These show that the encoding at position k + p can be expressed as a linear transformation of the feature vector at position k, which makes it much easier for the model to capture the relative positional relationships between words.
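
A minimal NumPy sketch of building this fixed positional encoding matrix (sine at even columns, cosine at odd columns); maxlen and d_model are arbitrary example values, and an even d_model is assumed:

```python
import numpy as np

def positional_encoding(maxlen, d_model):
    """Fixed maxlen x d_model matrix: sin at even columns, cos at odd."""
    pos = np.arange(maxlen)[:, None]          # positions 0 .. maxlen-1
    i = np.arange(0, d_model, 2)[None, :]     # even column indices
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((maxlen, d_model))
    pe[:, 0::2] = np.sin(angles)              # even columns: sine
    pe[:, 1::2] = np.cos(angles)              # odd columns: cosine
    return pe

pe = positional_encoding(maxlen=50, d_model=16)
print(pe.shape)      # (50, 16)
# Row 0 is position 0: sin(0)=0 in even columns, cos(0)=1 in odd columns.
print(pe[0, :4])     # [0. 1. 0. 1.]
```

The matrix is then simply added to each sample's embedding matrix, as described above.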

5. Layer Norm

¼ÙÉèÎÒÃǵÄÊäÈëÊÇÒ»¸öÏòÁ¿£¬Õâ¸öÏòÁ¿ÖеÄÿ¸öÔªËØ¶¼´ú±íÁËÊäÈëµÄÒ»¸ö²»Í¬ÌØÕ÷£¬¶øLayerNormÒª×öµÄ¾ÍÊǶÔÒ»¸öÑù±¾ÏòÁ¿µÄËùÓÐÌØÕ÷½øÐÐNormalization£¬ÕâÒ²±íÃ÷LayNormµÄÊäÈë¿ÉÒÔÖ»ÓÐÒ»¸öÑù±¾¡£

Given a sample vector X = {x_1, x_2, …, x_n}, Layer Normalization proceeds as follows: first compute the mean and variance over the features, then use them to normalize each of the sample's feature values:

mu = (1/n) * sum_i x_i
sigma^2 = (1/n) * sum_i (x_i - mu)^2
x_hat_i = (x_i - mu) / sqrt(sigma^2 + eps)

where eps is a small constant added for numerical stability.
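
This procedure is only a few lines of NumPy; the sample values are arbitrary, and eps is the usual small stability constant:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize one sample vector across its features."""
    mu = x.mean()                    # mean over the features
    var = x.var()                    # variance over the features
    return (x - mu) / np.sqrt(var + eps)

x = np.array([1.0, 2.0, 3.0, 4.0])
y = layer_norm(x)
print(y.mean().round(6), y.std().round(3))   # ~0.0 and ~1.0
```

After normalization the features of the sample have (approximately) zero mean and unit standard deviation, regardless of their original scale.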

The figure below gives an example of applying Layer Normalization to several different samples.

Layer Normalization is best understood by contrast with Batch Normalization; since Batch Normalization is not part of the Transformer architecture, it is not covered in detail here.

6. Mask

A mask, as the name suggests, covers something up: certain feature values in an input vector or matrix are masked so that they have no effect. The masked values may be meaningless in themselves (for example the '0's padded in for alignment), or they may be masked deliberately as special handling for the task at hand.

The Transformer uses two mask methods, the __padding mask__ and the __sequence mask__, and they play different roles. The __padding mask__ is used in both the Encoder and the Decoder, while the __sequence mask__ is used only in the Decoder.

6.1 padding mask

In natural language processing tasks the input samples are usually sentences, and the number of words varies greatly from sentence to sentence, while machine learning models generally require inputs of a fixed size. The usual solution is to __align__ the input word sequences to a maximum length, i.e. to pad inputs shorter than the maximum length with '0'. For example, when maxlen = 20 and our input matrix has size 12 × d_model, aligning the input means appending a zero matrix of size 8 × d_model, so that the input becomes maxlen × d_model. But these padded '0's obviously carry no meaning; their only role is alignment. When computing attention, so that the attention vector does not place weight on these meaningless values, we apply a __padding mask__ to them. Concretely, the padding mask sets the values at the meaningless positions to a very small number, so that after the softmax the probabilities at those positions are very close to 0 and their effect on the final result is reduced to a minimum.
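
A sketch of the padding mask under toy assumptions (uniform scores, maxlen = 5, a real sequence length of 3; -1e9 stands in for "a very small number"):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

maxlen, real_len = 5, 3
scores = np.ones((maxlen, maxlen))        # toy attention scores

# Positions real_len .. maxlen-1 are padding: push their scores to a
# very small number so softmax assigns them near-zero probability.
pad = np.arange(maxlen) >= real_len       # True at padded key positions
weights = softmax(np.where(pad[None, :], -1e9, scores))

print(weights[0])   # weight ~1/3 on each real position, ~0 on padding
```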

6.2 sequence mask

As noted earlier, in the Transformer the __sequence mask__ is used only in the Decoder. Its purpose is to prevent the Decoder from seeing information beyond the current time step while decoding: for an input sequence, when decoding time step t we may only consider the information from time steps (1, 2, …, t-1), not from the later time steps (t+1, …, n).

Concretely, we build a lower triangular matrix: the values in its upper triangle are all 0, while the lower triangle keeps the values of the input matrix at the corresponding positions. This masks out future information at every time step.
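
In many implementations the equivalent effect is achieved by setting the upper-triangle positions of the score matrix to a very small number before the softmax, so that each time step's attention covers only itself and earlier positions. A sketch with toy scores:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

seq_len = 4
scores = np.ones((seq_len, seq_len))      # toy decoder attention scores

# Keep the lower triangle (past and present); blank out the upper
# triangle (the future) before the softmax.
causal = np.tril(np.ones((seq_len, seq_len), dtype=bool))
weights = softmax(np.where(causal, scores, -1e9))

# Row t now distributes its attention only over positions 0 .. t.
print(weights.round(2))
```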

7. Encoder and Decoder stacks

The previous sections introduced the Transformer's main components; building on them, this section looks at the Transformer's structure as a whole once more.

As the figure above shows, the Transformer consists of 6 Encoder layers and 6 Decoder layers. The Encoder layers are all structurally identical, and so are the Decoder layers. A Decoder layer differs from an Encoder layer in having an extra Encoder-Decoder Attention sublayer with its own Add & Normalize sublayer. The inputs to this sublayer are the output of the preceding sublayer in the Decoder layer and the final output of the Encoder stack: the Encoder's final output serves as K and V, while the preceding Decoder sublayer's output serves as Q.
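
The wiring of this Encoder-Decoder Attention sublayer can be sketched as follows; the sizes and the random stand-ins for the learned matrices and the two inputs are illustrative assumptions:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
src_len, tgt_len, d = 5, 3, 8       # toy sizes

enc_out = rng.normal(size=(src_len, d))   # final Encoder stack output
dec_sub = rng.normal(size=(tgt_len, d))   # preceding Decoder sublayer output

W_Q = rng.normal(size=(d, d))
W_K = rng.normal(size=(d, d))
W_V = rng.normal(size=(d, d))

Q = dec_sub @ W_Q                 # queries come from the Decoder
K = enc_out @ W_K                 # keys come from the Encoder output
V = enc_out @ W_V                 # values come from the Encoder output

Z = softmax(Q @ K.T / np.sqrt(d)) @ V
print(Z.shape)                    # (3, 8): one vector per target position
```

Each target position thus queries the whole encoded source sequence, which is how information flows from the Encoder stack into every Decoder layer.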

 

   