Editor's recommendation:
This article comes from Jianshu. It explains a machine learning algorithm mainly through a worked example, introducing reinforcement learning by way of matrices.
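Reinforcement learning is the learning, by an intelligent system, of a mapping from environment to behavior such that the value of the reward signal (reinforcement signal) function is maximized. If some behavior policy of the Agent leads to a positive reward (reinforcement signal) from the environment, the Agent's tendency to produce that behavior policy later on is strengthened.
- Baike (encyclopedia entry)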
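Put simply: you place a white mouse in a maze. If it takes a correct step it gets positive feedback (sugar); otherwise it gets negative feedback (an electric shock). Once it has walked all the paths, it can find the best route from what it has learned, no matter where you put it.
Let's go straight to an example:
Suppose we have 5 rooms, as shown in the figure below, some of which are connected to each other. We label the rooms 0-4, and 5 stands for the exit.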

We can represent this as a graph, like the one below:

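In this example the goal is to get out of the rooms, that is, to reach position 5. To steer toward this goal we assign a reward to every door. A door that leads directly to 5 gets a reward of 100; every door that does not lead to 5 gets no reward, i.e. a weight of 0, as shown in the figure below: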

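Because 5 can also lead to itself, that move is given a reward of 100 as well, and every other edge pointing into 5 also carries a reward of 100.
In Q-learning the goal is to maximize the accumulated reward, so once the agent reaches 5 it stays there.
Imagine a virtual robot that knows nothing about the environment but has to learn by itself how to get outside, that is, how to reach position 5.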

Now we can introduce the two core concepts of Q-learning: "state" and "action". We treat each room as a state, and moving from one room to another is an action. A state is a node in the graph, and an action is drawn as an arrow.

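Suppose we are in state 2. From state 2 we can go to state 3, but not to states 0, 1 or 4, because 2 has no door leading directly to them. From state 3 we can go to 1, 4 or 2; from 4 we can go to 0, 3 or 5; and so on for the others.
So we can represent all of this with a matrix: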

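This matrix is the reward matrix R. Its row labels give the current state and its column labels give the next state. For example, the third row has the row label 2; its fourth column (column label 3) is the entry (2, 3), which says that the reward for moving 2->3 is 0. An entry of -1 means there is no way to move directly from the one state to the other.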
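Since the figure with the matrix is not reproduced here, the following is a minimal Python sketch that rebuilds it from the transitions listed in the text. It assumes that every door works in both directions and that room 0 connects only to room 4 (inferred from "4 can go to 0, 3 or 5"); the names `DOORS`, `GOAL` and `build_reward_matrix` are made up for this illustration.

```python
# Doors between rooms, read off the transitions described in the text.
# Assumption: each door can be used in both directions.
DOORS = [(0, 4), (1, 3), (1, 5), (2, 3), (3, 4), (4, 5)]
GOAL = 5
N_STATES = 6


def build_reward_matrix(doors=DOORS, goal=GOAL, n_states=N_STATES):
    """Return R, where R[s][a] is the reward for moving from s to a.

    -1 marks a move that is not possible, 0 an ordinary door,
    and 100 a move that lands directly in the goal state.
    """
    R = [[-1] * n_states for _ in range(n_states)]
    for a, b in doors:
        R[a][b] = 100 if b == goal else 0
        R[b][a] = 100 if a == goal else 0
    R[goal][goal] = 100  # the goal can also "move" to itself
    return R


R = build_reward_matrix()
for state, row in enumerate(R):
    print(state, row)
```

Row `s` of the result lists the rewards of every move out of state `s`, so `R[2][3]` is 0 while `R[2][4]` is -1.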
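The Q matrix is initialized to all zeros. Here we already know all the states, so we know there are 6 in total. If we did not know how many states there are, we would start from a single state and add a new row and a new column to the matrix whenever a new state is discovered.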
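This gives us the following formula:
Q(state, action) = R(state, action) + Gamma * Max[Q(next state, all actions)]
According to this formula, an entry of Q = the current value of R + Gamma (a coefficient) * the largest Q value over all actions in the next state. (Don't worry if this is unclear; an example follows.)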
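For readers who prefer standard notation, the same update can be written as below. The symbols are notation introduced here, not the article's: s is the current state, a the action taken, s' the state reached by taking a, a' ranges over the actions available in s', and γ is Gamma.

```latex
Q(s, a) = R(s, a) + \gamma \, \max_{a'} Q(s', a'), \qquad 0 \le \gamma < 1
```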
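Our virtual robot learns through the environment. It jumps from one state to another until it reaches the final state. We call the whole process, from a start state to the final state, an episode. The robot starts each episode from a random state, and the episode is complete when it reaches the final state; the robot is then immediately re-initialized to a new start state and enters the next episode.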
The algorithm can therefore be summarized as follows.
The Q-learning algorithm:
1 Set the gamma coefficient and the reward matrix R
2 Initialize the Q matrix to all zeros
3 For each episode:
    Select a random initial state
    Do While the goal has not been reached
        Select the most promising action (a separate algorithm makes this choice; it is covered later)
        Take this action to reach the next state
        Compute the Q value with the formula: Q(state, action) = R(state, action) + Gamma * Max[Q(next state, all actions)]
        Set the state just reached as the current state
    End Do
End For
Gamma lies between 0 and 1 (0 <= Gamma < 1). If Gamma is close to 0, immediate rewards count for more. If it is close to 1, the system gives more weight to future rewards.
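The pseudocode above can be sketched in Python roughly as follows. This is an illustration under stated assumptions, not the article's own code: the matrix `R` is reconstructed from the transitions listed earlier, the action is picked uniformly at random among the valid moves as a placeholder for the selection rule discussed later, and the loop always takes at least one step so that an episode starting in the goal state still updates that row. The names `train` and `valid_actions` are hypothetical.

```python
import random

# Reward matrix for the 6-state example; -1 marks an impossible move.
# Assumption: reconstructed from the transitions listed in the text.
R = [
    [-1, -1, -1, -1, 0, -1],
    [-1, -1, -1, 0, -1, 100],
    [-1, -1, -1, 0, -1, -1],
    [-1, 0, 0, -1, 0, -1],
    [0, -1, -1, 0, -1, 100],
    [-1, 0, -1, -1, 0, 100],
]
GOAL = 5


def valid_actions(R, state):
    """Actions (next states) reachable from `state`."""
    return [a for a, r in enumerate(R[state]) if r >= 0]


def train(R, goal, gamma=0.8, episodes=2000, seed=0):
    rng = random.Random(seed)
    n = len(R)
    Q = [[0.0] * n for _ in range(n)]            # step 2: all-zero Q
    for _ in range(episodes):                    # step 3: one pass per episode
        state = rng.randrange(n)                 # random initial state
        while True:                              # do-while: at least one step
            action = rng.choice(valid_actions(R, state))  # placeholder policy
            next_state = action                  # an action is "go to that room"
            best_next = max(Q[next_state][a]
                            for a in valid_actions(R, next_state))
            Q[state][action] = R[state][action] + gamma * best_next
            state = next_state
            if state == goal:
                break
    return Q


Q = train(R, GOAL)
for row in Q:
    print([round(q) for q in row])
```

Because the update only ever reads Q and R, the random placeholder policy is enough for the values to settle; the action-selection refinements only change how quickly useful entries get visited.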
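That is the whole algorithm, and it is not very hard. Next, let's step through it by hand so that it becomes completely clear.
Stepping through the algorithm by hand
First we initialize Q as an all-zero matrix. Q is our target matrix, and we want to fill it in.

Then we initialize the R matrix. We assume all of its values are known, as shown in the figure below: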

Now suppose our initial position is state 1. First we check the R matrix and find that from state 1 we can go to 2 positions: state 3 and state 5. We pick a direction at random, say from 1 to 5, and apply the formula (with Gamma = 0.8)
Q(state, action) = R(state, action) + Gamma * Max[Q(next state, all actions)]
Q(1, 5) = R(1, 5) + 0.8 * Max[Q(5, 1), Q(5, 4), Q(5, 5)] = 100 + 0.8 * 0 = 100
to compute Q(1, 5). Because the Q matrix was initialized to 0, Q(5, 1), Q(5, 4) and Q(5, 5) are all 0, so Q(1, 5) is 100. State 5 now becomes the current state, and since 5 is the final state, this episode ends. The Q matrix becomes:

Next we pick another state at random; say state 3 is our initial state this time. Looking at the R matrix, there are 3 possible actions: 1, 2 and 4. We pick 1 at random and apply the formula again:
Q(state, action) = R(state, action) + Gamma * Max[Q(next state, all actions)]
Q(3, 1) = R(3, 1) + 0.8 * Max[Q(1, 3), Q(1, 5)] = 0 + 0.8 * Max(0, 100) = 80
Then we update the matrix, which now looks like this:

ÎÒÃǵĵ±Ç°×´Ì¬±ä³ÉÁË1£¬1²¢²»ÊÇ×îÖÕ״̬£¬ËùÒÔËã·¨»¹ÊÇÒªÍùÏÂÖ´ÐУ¬´Ëʱ£¬¹Û²ìR¾ØÕó£¬1ÓÐ1->3,
1->5Á½¸öÑ¡Ôñ£¬×ÓÕâÀïÎÒÃÇÑ¡Ôñ 1->5Õâ¸öactionÓÐ׎ϸ߻ر¨£¬ËùÒÔÎÒÃÇÑ¡ÔñÁË1->5,
ÖØÐ¼ÆËãQ(1,5)µÄÖµ
Q(state, action) = R(state, action) + Gamma * Max[Q(next
state, all actions)]
Q(1, 5) = R(1, 5) + 0.8 * Max[Q(1, 2), Q(1, 5)]= 0
+ 0.8 * Max(0, 100) = 80
ÎªÊ²Ã´ÒªÖØÐ¼ÆËãÄØ£¿ÒòΪijЩֵ¿ÉÄܻᷢÉú±ä»¯£¬¼ÆËãÍêºó¸üоØÕó

Since 5 is the final state, this episode ends.
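The two hand-computed episodes can be replayed in a few lines of Python as a sanity check. This is a sketch, assuming the reward matrix reconstructed from the transitions in the text and Gamma = 0.8; `q_update` is a hypothetical helper name. Note that recomputing Q(1, 5) gives 100 again, because R(1, 5) = 100 and row 5 of Q is still all zeros.

```python
# Reward matrix for the example; -1 marks an impossible move.
# Assumption: reconstructed from the transitions listed in the text.
R = [
    [-1, -1, -1, -1, 0, -1],
    [-1, -1, -1, 0, -1, 100],
    [-1, -1, -1, 0, -1, -1],
    [-1, 0, 0, -1, 0, -1],
    [0, -1, -1, 0, -1, 100],
    [-1, 0, -1, -1, 0, 100],
]
GAMMA = 0.8


def q_update(Q, state, action):
    """Q(state, action) = R(state, action) + Gamma * Max[Q(next state, all actions)]."""
    next_state = action  # taking action `a` means moving to room `a`
    best_next = max(Q[next_state][a]
                    for a, r in enumerate(R[next_state]) if r >= 0)
    Q[state][action] = R[state][action] + GAMMA * best_next
    return Q[state][action]


Q = [[0.0] * 6 for _ in range(6)]

# Episode 1: start in state 1, move to 5 (goal reached, episode ends).
first = q_update(Q, 1, 5)

# Episode 2: start in state 3, move to 1, then from 1 move to 5.
second = q_update(Q, 3, 1)
third = q_update(Q, 1, 5)

print(first, second, third)
```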
After many more iterations we arrive at the final result, which looks like this:

After normalization, the matrix ends up looking like this:
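The figures with the final matrices are not reproduced here, so the following sketch shows one way to obtain them. It assumes that "normalization" means scaling every entry so that the largest becomes 100, and it replaces random episodes with repeated sweeps over all valid (state, action) pairs, which is a substitution made here for determinism; both choices and the names `converge` and `normalize` are assumptions.

```python
# Reward matrix for the example; -1 marks an impossible move.
# Assumption: reconstructed from the transitions listed in the text.
R = [
    [-1, -1, -1, -1, 0, -1],
    [-1, -1, -1, 0, -1, 100],
    [-1, -1, -1, 0, -1, -1],
    [-1, 0, 0, -1, 0, -1],
    [0, -1, -1, 0, -1, 100],
    [-1, 0, -1, -1, 0, 100],
]


def converge(R, gamma=0.8, sweeps=200):
    """Apply the update rule to every valid (state, action) pair, many times."""
    n = len(R)
    Q = [[0.0] * n for _ in range(n)]
    for _ in range(sweeps):
        for s in range(n):
            for a in range(n):
                if R[s][a] >= 0:
                    best_next = max(Q[a][b] for b in range(n) if R[a][b] >= 0)
                    Q[s][a] = R[s][a] + gamma * best_next
    return Q


def normalize(Q):
    """Scale Q so that its largest entry becomes 100, rounded to integers."""
    top = max(max(row) for row in Q)
    return [[round(100 * q / top) for q in row] for row in Q]


Q = converge(R)
for row in normalize(Q):
    print(row)
```

Normalizing does not change which action is best in any state; it only makes the matrix easier to read.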

That completes the reinforcement learning. Our robot has automatically learned the optimal paths: from any state, simply follow the moves with the largest value.

The red lines in the figure show the optimal path from each node to the goal.
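Reading the optimal path off a learned Q matrix can be sketched as below. The Q values used here are an assumption: they are the values the update rule settles on for this example with Gamma = 0.8, not numbers copied from the article's figure. `greedy_path` is a hypothetical helper, and ties are broken by taking the lowest-numbered state.

```python
# Converged Q values for the example (assumption: fixed point of the
# update rule with Gamma = 0.8, not copied from the article's figure).
Q = [
    [0, 0, 0, 0, 400, 0],
    [0, 0, 0, 320, 0, 500],
    [0, 0, 0, 320, 0, 0],
    [0, 400, 256, 0, 400, 0],
    [320, 0, 0, 320, 0, 500],
    [0, 400, 0, 0, 400, 500],
]
GOAL = 5


def greedy_path(Q, start, goal, max_steps=20):
    """Follow the largest Q value from `start` until `goal` is reached."""
    path = [start]
    state = start
    while state != goal and len(path) <= max_steps:
        # max() returns the first maximum, so ties go to the lowest index.
        state = max(range(len(Q)), key=lambda a: Q[state][a])
        path.append(state)
    return path


for start in range(6):
    print(start, greedy_path(Q, start, GOAL))
```

From state 3 there are two equally good routes (via 1 or via 4); this sketch reports only the first one.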
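This is a minimal algorithm that hides many details. It is enough to show off with, but a practical implementation still runs into quite a few problems.
The implementation details follow; read on if you are interested in how to implement it.
We said earlier that the action is chosen by "selecting the most promising action". So how is this action actually chosen?
We pick the one with the largest payoff; for example, in the R matrix we always pick the largest value.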

In code, this selection looks like this:
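The original listing is not preserved here, so the following is only a minimal Python sketch of the idea: scan one row of the matrix and return the index of its largest valid entry. The function name `greedy_action` and the convention that -1 marks an impossible move are assumptions taken from the example above.

```python
def greedy_action(row):
    """Return the action with the largest value in `row`.

    Entries below 0 mark impossible moves and are skipped.
    If several entries share the maximum, the first one wins.
    """
    valid = [a for a, v in enumerate(row) if v >= 0]
    return max(valid, key=lambda a: row[a])


# Rows 1 and 3 of the reward matrix from the example.
print(greedy_action([-1, -1, -1, 0, -1, 100]))
print(greedy_action([-1, 0, 0, -1, 0, -1]))
```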

Is there a problem with this? Of course: what if there are several maxima? In that case we simply pick one of them at random.
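A sketch of the random tie-break, again with hypothetical names: collect every action that reaches the maximum and draw one of them. The `rng` parameter is an addition for this illustration so that the draw can be seeded.

```python
import random


def greedy_action_random_ties(row, rng=random):
    """Return one of the best valid actions in `row`, chosen at random on ties."""
    valid = [a for a, v in enumerate(row) if v >= 0]
    best = max(row[a] for a in valid)
    candidates = [a for a in valid if row[a] == best]
    return rng.choice(candidates)


rng = random.Random(0)
# Row 3 of the reward matrix has three equally good moves: 1, 2 and 4.
print([greedy_action_random_ties([-1, 0, 0, -1, 0, -1], rng) for _ in range(10)])
```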

Is that enough? Consider that the current action may have a small payoff while the state it leads to offers larger payoffs later. So we cannot always take the action with the largest immediate payoff; we need a technique for exploration. Here we use epsilon: we first draw a random value, and if it is less than epsilon the next action is a random one; otherwise we take the maximum. The code is as follows:
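The original listing is missing here as well. A minimal sketch of the epsilon rule just described, assuming the same row convention (-1 means "not possible") and hypothetical names, might look like this:

```python
import random


def epsilon_greedy(row, epsilon, rng=random):
    """With probability `epsilon` explore, otherwise take a best action."""
    valid = [a for a, v in enumerate(row) if v >= 0]
    if rng.random() < epsilon:
        return rng.choice(valid)          # explore: any valid action
    best = max(row[a] for a in valid)
    # exploit: one of the best actions, ties broken at random
    return rng.choice([a for a in valid if row[a] == best])


rng = random.Random(0)
row = [0, -1, -1, 0, -1, 100]  # row 4 of the reward matrix
print([epsilon_greedy(row, 0.3, rng) for _ in range(10)])
```

With `epsilon=0` the function always exploits, and with `epsilon=1` it always explores, which makes the two extremes easy to check.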

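In practice this approach still has a problem: even after learning has finished and the optimal solution is known, the agent keeps taking random actions whenever it selects an action. There are many ways to overcome this. A well-known one is called mouse learns: decrease epsilon on every iteration, so that random actions are triggered less and less often as learning progresses, which reduces the influence of randomness on the system. Several ways of decreasing it are in common use, and you can choose one to suit the situation: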
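The article's own list of schedules is not preserved, so the three below are only commonly seen examples, offered as an assumption rather than as the author's choices. All function names, parameters and default values are illustrative.

```python
def linear_decay(step, start=1.0, end=0.05, total_steps=1000):
    """Move epsilon from `start` to `end` in a straight line, then hold it."""
    frac = min(step / total_steps, 1.0)
    return start + (end - start) * frac


def exponential_decay(step, start=1.0, end=0.05, rate=0.995):
    """Multiply epsilon by `rate` every step, never going below `end`."""
    return max(end, start * rate ** step)


def inverse_time_decay(step, start=1.0, k=0.01):
    """Shrink epsilon in proportion to 1 / (1 + k * step)."""
    return start / (1.0 + k * step)


for step in (0, 100, 500, 1000):
    print(step,
          round(linear_decay(step), 3),
          round(exponential_decay(step), 3),
          round(inverse_time_decay(step), 3))
```

Whichever schedule is used, the value it returns would be passed as the epsilon of the selection rule at each iteration.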
