Machine Learning in Plain Terms: Reinforcement Learning
 
 2018-11-21
 
Editor's note:

This article comes from Jianshu. It uses a worked example to demonstrate a machine learning algorithm and introduces reinforcement learning by way of matrices.

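Reinforcement learning is the learning, by an intelligent system, of a mapping from environment to behavior such that the value of the reward signal (reinforcement signal) function is maximized. If one of the Agent's behavior policies leads to a positive reward (reinforcement signal) from the environment, the Agent's tendency to produce that behavior policy later on is strengthened. (From the Baike encyclopedia entry.)

Put simply: place a lab mouse in a maze. If it takes a correct step it receives positive feedback (sugar); otherwise it receives negative feedback (an electric shock). Once it has walked every route, it can find the best route from wherever you put it, using what it has learned.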

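Let's go straight to an example.

Suppose we have five rooms, some of which are connected by doors. We label the rooms 0 to 4, and 5 stands for the exit (the outside).

[Figure: floor plan of the five rooms and the exit]

We can represent the layout as a graph:

[Figure: graph with one node per room and one edge per door]

In this example the goal is to get out of the building, that is, to reach 5. To steer toward that goal, we attach a reward to every door. A door that leads directly to 5 gets a reward of 100; every door that does not lead to 5 gets no reward, i.e. a weight of 0.

[Figure: the graph annotated with rewards]

Because 5 also leads back to itself, that move is rewarded with 100 as well, just like every other move that ends in 5. In Q-learning the goal is to maximize the accumulated reward, so once the agent reaches 5 it stays there.

Imagine a virtual robot that knows nothing about its environment but has to learn on its own how to get outside, that is, how to reach 5.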

Now we can introduce the two basic concepts of Q-learning: "state" and "action". Each room is a state, and moving from one room to another is an action. In the graph, a state is a node and an action is an arrow.

Suppose we are in state 2. From state 2 we can go to state 3, but not to states 0, 1 or 4, because 2 has no direct door to them. From state 3 we can go to 1, 4 or 2; from 4 we can go to 0, 3 or 5; and so on for the others.

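So we can put all of this into a matrix:

          0    1    2    3    4    5
    0    -1   -1   -1   -1    0   -1
    1    -1   -1   -1    0   -1  100
    2    -1   -1   -1    0   -1   -1
    3    -1    0    0   -1    0   -1
    4     0   -1   -1    0   -1  100
    5    -1    0   -1   -1    0  100

This is the reward matrix R. Each row label is the current state and each column label is the next state. For example, the third row has label 2; its fourth column, labeled 3, holds the entry (2, 3), which says that the reward for moving 2->3 is 0. An entry of -1 means there is no direct way from one state to the other.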
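As a minimal sketch, the reward matrix can be generated from a list of doors rather than typed by hand. The door list below is read off the transitions described in the text (0-4, 1-3, 1-5, 2-3, 3-4, 4-5); the names `DOORS`, `GOAL` and `build_reward_matrix` are illustrative and not part of the original article.

```python
# Doors between rooms, read off the transitions described in the text.
# State 5 stands for the exit.
DOORS = [(0, 4), (1, 3), (1, 5), (2, 3), (3, 4), (4, 5)]
GOAL = 5
N_STATES = 6


def build_reward_matrix(doors, goal, n_states):
    """Return R where R[s][a] is -1 (no door), 0 (door) or 100 (door into the goal)."""
    R = [[-1] * n_states for _ in range(n_states)]
    for a, b in doors:
        # Every door can be used in both directions.
        R[a][b] = 100 if b == goal else 0
        R[b][a] = 100 if a == goal else 0
    # Staying at the goal is rewarded too, so the agent remains there.
    R[goal][goal] = 100
    return R


R = build_reward_matrix(DOORS, GOAL, N_STATES)
for row in R:
    print(row)
```

Row 2 of the result contains a single 0 in column 3, matching the statement that state 2 only leads to state 3.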

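The Q matrix is initialized to all zeros. All states are known in advance here, so we know there are 6 states in total. If the number of states is not known, start with a single state and add a new row and a new column to the matrix whenever a new state is discovered.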
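Growing the matrix on demand could look like the sketch below. The helper name `add_state` and the nested-list representation are assumptions made for illustration only.

```python
def add_state(Q):
    """Grow a square Q matrix by one state: a new column, then a new row."""
    for row in Q:
        row.append(0.0)                 # column: actions leading into the new state
    Q.append([0.0] * (len(Q) + 1))      # row: actions leaving the new state
    return Q


Q = [[0.0]]      # start with a single known state
add_state(Q)     # a second state is discovered
add_state(Q)     # and a third
print(len(Q), [len(row) for row in Q])  # 3 [3, 3, 3]
```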

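This gives us the following formula:

Q(state, action) = R(state, action) + Gamma * Max[Q(next state, all actions)]

According to this formula, an entry of the Q matrix equals the current value from R plus Gamma (a coefficient) times the largest Q value among all actions available in the next state. (Don't worry if this is not clear yet; an example follows.)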

ÎÒÃǵÄÐéÄâ»úÆ÷È˽«Í¨¹ý»·¾³À´Ñ§Ï°£¬»úÆ÷ÈË»á´ÓÒ»¸ö×´Ì¬Ìø×ªµ½ÁíÒ»¸ö״̬£¬Ö±µ½ÎÒÃǵ½´ï×îÖÕ״̬¡£ÎÒÃǰѴӿªÊ¼×´Ì¬¿ªÊ¼Ò»Ö±´ïµ½×îÖÕ״̬µÄÕâ¸ö¹ý³Ì³ÆÖ®ÎªÒ»¸ö³¡¾°£¬»úÆ÷ÈË»á´ÓÒ»¸öËæ»úµÄ¿ªÊ¼³¡¾°³ö·¢£¬Ö±µ½µ½´ï×îÖÕ״̬Íê³ÉÒ»¸ö³¡¾°£¬È»ºóÁ¢¼´ÖØÐ³õʼ»¯µ½Ò»¸ö¿ªÊ¼×´Ì¬£¬´Ó¶ø½øÈëÏÂÒ»¸ö³¡¾°¡£

We can therefore summarize the algorithm as follows.

The Q-learning algorithm:

1 Set the gamma parameter and the reward matrix R

2 Initialize the Q matrix to all zeros

3 For each episode:

    Select a random initial state

    Do While the goal state has not been reached

        Select the most promising action (how the action is chosen is covered later)

        Take this action to reach the next state

        Compute Q for this state and action using Q(state, action) = R(state, action) + Gamma * Max[Q(next state, all actions)]

        Set the current state to the state just reached

    End Do

End For
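The steps above can be sketched in Python roughly as follows. This is an illustration under stated assumptions, not the author's implementation: the reward matrix is the one implied by the room layout in the text, actions are picked uniformly at random among the reachable rooms (the text defers action selection to a later section), and the episode count and random seed are arbitrary.

```python
import random

# Reward matrix implied by the room layout: -1 = no door, 0 = door, 100 = into the goal.
R = [
    [-1, -1, -1, -1,  0,  -1],
    [-1, -1, -1,  0, -1, 100],
    [-1, -1, -1,  0, -1,  -1],
    [-1,  0,  0, -1,  0,  -1],
    [ 0, -1, -1,  0, -1, 100],
    [-1,  0, -1, -1,  0, 100],
]
GOAL = 5


def valid_actions(R, state):
    """Rooms reachable from `state` (entries of R that are not -1)."""
    return [a for a, r in enumerate(R[state]) if r >= 0]


def train(R, goal, gamma=0.8, episodes=1000, rng=None):
    rng = rng or random.Random(0)               # fixed seed: an arbitrary choice
    n = len(R)
    Q = [[0.0] * n for _ in range(n)]           # step 2: Q starts at all zeros
    for _ in range(episodes):                   # step 3: one pass per episode
        state = rng.randrange(n)                # random initial state
        while state != goal:                    # run until the goal is reached
            # Assumption: explore by picking any reachable room at random.
            action = rng.choice(valid_actions(R, state))
            next_state = action                 # taking a door lands in that room
            best_next = max(Q[next_state][a] for a in valid_actions(R, next_state))
            Q[state][action] = R[state][action] + gamma * best_next
            state = next_state
    return Q


Q = train(R, GOAL)
for row in Q:
    print([round(v, 1) for v in row])
```

Because the loop condition is checked before the first step, an episode that happens to start in state 5 ends immediately, so row 5 of Q stays at zero in this sketch.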

Gamma takes a value between 0 and 1 (0 <= Gamma < 1). If Gamma is close to 0, the agent weighs immediate rewards more heavily. If it is close to 1, the agent gives more weight to future rewards.

ÒÔÉϾÍÊÇÕû¸öËã·¨ÁË£¬²¢²»ÊǺÜÄѵģ¬ÏÂÃæÀ´¿´¸öÒ»¶ÎÈËÈâËã·¨²Ù×÷£¬ÈÃÄã³¹µ×Ã÷°×Õâ¸öËã·¨¡£

ÈËÈâËã·¨²½Öè

Ê×ÏȽ«Q³õʼ»¯Ò»¸öȫΪ0µÄ¾ØÕó£¬QÊÇÎÒÃÇÄ¿±ê¾ØÕó£¬ÎÒÃÇÏ£ÍûÄܹ»°ÑÕâ¸ö¾ØÕóÌîÂúÈ»

ºó³õʼ»¯ÎÒÃǵÄR¾ØÕ󣬼ÙÉèÕâ¸öÖµÎÒÃǶ¼ÊÇÖªµÀµÄ£¬ÈçÏÂͼËùʾ

Now suppose our initial position is state 1. First we look at the R matrix and find that state 1 leads to two places: state 3 and state 5. We pick a direction at random, say from 1 to 5, and use the formula with Gamma = 0.8:

Q(state, action) = R(state, action) + Gamma * Max[Q(next state, all actions)]

Q(1, 5) = R(1, 5) + 0.8 * Max[Q(5, 1), Q(5, 4), Q(5, 5)] = 100 + 0.8 * 0 = 100

Because the Q matrix was initialized to zero, Q(5, 1), Q(5, 4) and Q(5, 5) are all 0, so Q(1, 5) is 100. State 5 is now the current state, and since 5 is the final state, this episode is over. The Q matrix now holds 100 at entry (1, 5) and 0 everywhere else.

Next we again pick a state at random, say state 3, as the initial state. The R matrix shows three possible moves: to 1, 2 or 4. We pick 1 at random and apply the formula again:

Q(state, action) = R(state, action) + Gamma * Max[Q(next state, all actions)]

Q(3, 1) = R(3, 1) + 0.8 * Max[Q(1, 3), Q(1, 5)] = 0 + 0.8 * Max(0, 100) = 80

Then we update the matrix: entry (3, 1) is now 80.

ÎÒÃǵĵ±Ç°×´Ì¬±ä³ÉÁË1£¬1²¢²»ÊÇ×îÖÕ״̬£¬ËùÒÔËã·¨»¹ÊÇÒªÍùÏÂÖ´ÐУ¬´Ëʱ£¬¹Û²ìR¾ØÕó£¬1ÓÐ1->3, 1->5Á½¸öÑ¡Ôñ£¬×ÓÕâÀïÎÒÃÇÑ¡Ôñ 1->5Õâ¸öactionÓÐ׎ϸ߻ر¨£¬ËùÒÔÎÒÃÇÑ¡ÔñÁË1->5, ÖØÐ¼ÆËãQ(1,5)µÄÖµ

Q(state, action) = R(state, action) + Gamma * Max[Q(next state, all actions)]

Q(1, 5) = R(1, 5) + 0.8 * Max[Q(1, 2), Q(1, 5)]= 0 + 0.8 * Max(0, 100) = 80

ÎªÊ²Ã´ÒªÖØÐ¼ÆËãÄØ£¿ÒòΪijЩֵ¿ÉÄܻᷢÉú±ä»¯£¬¼ÆËãÍêºó¸üоØÕó

ÒòΪ5ÒѾ­ÊÇ×îÖÕ״̬ÁË£¬ËùÒÔ½áÊøÎÒÃDZ¾´Î³¡¾°µü´ú¡£
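The hand calculation can be replayed in a few lines of Python. This is a sketch assuming the reward matrix implied by the room layout and Gamma = 0.8; the helper name `update` is illustrative.

```python
# Reward matrix implied by the room layout: -1 = no door, 0 = door, 100 = into the goal.
R = [
    [-1, -1, -1, -1,  0,  -1],
    [-1, -1, -1,  0, -1, 100],
    [-1, -1, -1,  0, -1,  -1],
    [-1,  0,  0, -1,  0,  -1],
    [ 0, -1, -1,  0, -1, 100],
    [-1,  0, -1, -1,  0, 100],
]
GAMMA = 0.8
Q = [[0.0] * 6 for _ in range(6)]


def update(Q, R, state, action, gamma=GAMMA):
    """Q(s, a) = R(s, a) + gamma * max over the actions available in the next state."""
    next_state = action
    reachable = [a for a, r in enumerate(R[next_state]) if r >= 0]
    Q[state][action] = R[state][action] + gamma * max(Q[next_state][a] for a in reachable)
    return Q[state][action]


# Episode 1: start in state 1 and move to 5.
print(update(Q, R, 1, 5))   # 100.0
# Episode 2: start in state 3, move to 1, then on to 5.
print(update(Q, R, 3, 1))   # 80.0
print(update(Q, R, 1, 5))   # 100.0
```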

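After iterating over many episodes, we arrive at the final result:

[Figure: final Q matrix]

After normalization, the matrix finally looks like this:

[Figure: normalized Q matrix]

That completes the reinforcement learning. Our robot has learned the optimal routes on its own: from any state, simply follow the path with the largest reward values.

[Figure: the optimal path from each node to the exit, drawn in red]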
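A sketch of the last two steps, normalizing Q and reading off a route, is shown below. The sample values are not copied from the missing figures: they are the fixed point of the update rule for Gamma = 0.8 with the reward matrix implied by the text, assuming state 5 is updated as well. Normalization is assumed to mean scaling so that the largest entry becomes 100. The function names are illustrative.

```python
# Sample Q values (see the assumptions stated above).
Q = [
    [  0,   0,   0,   0, 400,   0],
    [  0,   0,   0, 320,   0, 500],
    [  0,   0,   0, 320,   0,   0],
    [  0, 400, 256,   0, 400,   0],
    [320,   0,   0, 320,   0, 500],
    [  0, 400,   0,   0, 400, 500],
]
GOAL = 5


def normalize(Q):
    """Scale every entry so that the largest one becomes 100."""
    top = max(max(row) for row in Q)
    return [[round(v * 100 / top) for v in row] for row in Q]


def greedy_path(Q, start, goal):
    """Follow the largest Q value from each state until the goal is reached."""
    path = [start]
    while path[-1] != goal and len(path) <= len(Q):   # guard against cycles
        row = Q[path[-1]]
        path.append(max(range(len(row)), key=row.__getitem__))
    return path


print(normalize(Q)[3])           # [0, 80, 51, 0, 80, 0]
print(greedy_path(Q, 2, GOAL))   # [2, 3, 1, 5]
```

From state 3 the moves to 1 and to 4 are tied; `max` keeps the first, so the route goes through 1, and 2 -> 3 -> 4 -> 5 is equally good.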

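This is a bare-bones algorithm that hides many details. It is enough to impress people in conversation, but a practical implementation still raises plenty of issues.

The implementation details follow; read on if you are interested in them.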

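We said earlier that the action is chosen by "selecting the most promising action". So how should that action be chosen?

We pick the one with the largest payoff. For example, in the R matrix we always pick the largest value.

The algorithm can be expressed directly in code.

Is there a problem with this? Of course there is: what if several actions share the maximum value? In that case we simply pick one of them at random.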
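The original code listing did not survive extraction. A minimal sketch of picking the largest value with a random tie-break might look like this; the function name `greedy_action` is an assumption.

```python
import random


def greedy_action(scores, rng=random):
    """Return the index of the largest score, choosing at random among ties."""
    best = max(scores)
    candidates = [a for a, s in enumerate(scores) if s == best]
    return rng.choice(candidates)


# Row 1 of R: the move to 5 has the largest reward.
print(greedy_action([-1, -1, -1, 0, -1, 100]))   # 5
# Row 3 of R: moves to 1, 2 and 4 are tied, so any of them may be returned.
print(greedy_action([-1, 0, 0, -1, 0, -1]))
```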

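Is that enough? Think about it: the payoff of the current action may be small, while the state it leads to may offer much larger payoffs afterwards. So we cannot simply take the largest immediate payoff; we need an additional technique for exploration. Here we use epsilon. First we generate a random value. If it is smaller than epsilon, the next action is a random one; otherwise we take the action with the maximum value.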
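The original listing is missing here as well. The rule just described is commonly called epsilon-greedy selection, and a sketch of it could look like this; the function name, the `valid` parameter (the list of reachable actions) and the sample values are assumptions.

```python
import random


def epsilon_greedy(scores, valid, epsilon, rng=random):
    """With probability epsilon pick a random valid action, otherwise the best one."""
    if rng.random() < epsilon:
        return rng.choice(valid)                       # explore
    best = max(scores[a] for a in valid)
    return rng.choice([a for a in valid if scores[a] == best])   # exploit


q_row = [0, 0, 0, 64, 0, 100]    # hypothetical Q values for state 1
print(epsilon_greedy(q_row, [3, 5], epsilon=0.0))   # 5: never explores
print(epsilon_greedy(q_row, [3, 5], epsilon=0.2))   # usually 5, sometimes 3
```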

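In practice this approach still has a problem: even after learning has finished and the optimal solution is known, the agent keeps taking random actions whenever it selects an action. There are many ways to overcome this. A well-known one is called "mouse learns": reduce the value of epsilon on every iteration, so that random actions are triggered less and less often as learning progresses and randomness has less and less influence on the system. There are several common ways to reduce it, and you can pick one to suit your situation.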
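The author's list of reduction methods did not survive extraction. The three schedules below are common choices offered as an illustration; they are not necessarily the ones the article had in mind, and all names and default values are assumptions.

```python
def linear_decay(eps0, episode, total, eps_min=0.0):
    """Shrink epsilon in equal steps from eps0 down to eps_min over `total` episodes."""
    return max(eps_min, eps0 * (1 - episode / total))


def exponential_decay(eps0, episode, rate=0.99, eps_min=0.0):
    """Multiply epsilon by a constant factor after every episode."""
    return max(eps_min, eps0 * rate ** episode)


def inverse_decay(eps0, episode):
    """Let epsilon fall off as 1 / (1 + episode)."""
    return eps0 / (1 + episode)


for episode in (0, 10, 100):
    print(episode,
          round(linear_decay(1.0, episode, total=100), 3),
          round(exponential_decay(1.0, episode), 3),
          round(inverse_decay(1.0, episode), 3))
```

A nonzero `eps_min` keeps a little exploration alive for the whole run, which is one possible design choice when the environment may change.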

 

   