The following content comes from an internal team sharing session aimed mainly at AI beginners; it covers CNNs, Deep Q Network, the TensorFlow platform, and related topics. Since the author is not a researcher in deep-learning algorithms, the system is described below mostly from an application point of view, without detailed formula derivations.
This article explains how to get an AI to play the Flappy Bird game, in four parts:
1. Flappy Bird game demo
2. Model: convolutional neural network
3. Algorithm: Deep Q Network
4. Code: TensorFlow implementation
I. Flappy Bird Game Demo
Before introducing the model and the algorithm, let's look at the result first. The first animation shows the very beginning of training, when the bird flutters around aimlessly like a headless fly; the second shows the situation after more than 10 hours of training on my machine (configuration given below), i.e. more than 2,000,000 training steps. Its best score already exceeds 200, which a human player can hardly beat.

ѵÁ·ÊýСÓÚ10000²½£¨¸Õ¿ªÊ¼ÑµÁ·£©

ѵÁ·²½Êý´óÓÚ2000000²½£¨10Сʱºó£©
ÓÉÓÚ±¾»úÅäÖÃÁËCUDAÒÔ¼°cuDNN£¬²ÉÓÃÁËNVIDIAµÄÏÔ¿¨½øÐв¢ÐмÆË㣬ËùÒÔÕâÀïÌáǰÌùÒ»ÏÂÔËÐÐʱµÄÈÕÖ¾Êä³ö¡£
¹ØÓÚCUDAÒÔ¼°cuDNNµÄÅäÖã¬ÆäÖÐÓÐһЩ¿Ó°üÀ¨£º°²×°CUDAÖ®ºóÑ»·µÇ¼£¬ÆÁÄ»·Ö±æÂÊÎÞ·¨Õý³£µ÷½ÚµÈµÈ£¬¶¼
ÊÇÓÉÓÚNVIDIAÇý¶¯°²×°µÄÎÊÌ⣬Õâ²»ÊDZ¾ÎÄÒªÌÖÂÛµÄÖ÷ÒªÄÚÈÝ£¬¶ÁÕß¿É×ÔÐÐGoogle¡£
Loading the CUDA libraries

TensorFlow running device: /gpu:0

/gpu:0 is TensorFlow's default device setting and refers to the first graphics card in the system.
Hardware and software configuration of this machine:
OS: Ubuntu 16.04
GPU: NVIDIA GeForce GTX 745 4G
Version: TensorFlow 1.0
Packages: OpenCV 3.2.0, Pygame, Numpy, ...
Attentive readers may notice that the author's graphics card is not a high-end one: a GeForce GTX 745 with 3.94 GB of memory, of which 3.77 GB is usable (the desktop takes part of it). This is entry-level at best; for people doing deep-learning work professionally it is certainly not enough. There are posts on Zhihu about how to set up a more professional card; interested readers can look there.
II. Model: Convolutional Neural Network
A neural network consists of many neurons connected by adjustable connection weights. It features large-scale parallel processing, distributed information storage, and good self-organizing and self-learning abilities. An artificial neuron is structurally similar to a biological neuron; the comparison is shown in the figure below.


The inputs of an artificial neuron (x1, x2 ... xm) are analogous to the dendrites of a biological neuron. The inputs are multiplied by different weights (wk1, wk2, ... wkn), a bias is added, and the result passes through an activation function to produce the output, which is then passed on to the next layer of neurons for further processing.

The activation function introduces non-linearity into the network, which is why neural networks have stronger fitting power than algorithms such as regression. Commonly used activation functions include sigmoid and tanh; their expressions are as follows:
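The formula images are not reproduced here; for reference, the two activations in their standard form (a reconstruction, not copied from the original figures) are:

$$\sigma(x) = \frac{1}{1 + e^{-x}}, \qquad \tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$$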


From these expressions we can see that the range of the sigmoid function is (0, 1) and the range of tanh is (-1, 1).
Convolutional neural networks originate from the visual system of animals. The main techniques involved are:
1. Local receptive fields (sparse connectivity);
2. Parameter sharing;
3. Multiple convolution kernels;
4. Pooling.
1. Local receptive fields (sparse connectivity)
The problems with a fully connected network are:
There are too many parameters to train, which easily leads to non-convergence (vanishing gradients) and makes training extremely hard;
In practice a given local neuron is most sensitive to inputs within a small range; in other words, inputs that are far away have very low correlation with it, so their weights end up being very small.
The human visual system makes us observe the world from the local to the global.
For example, when we see a beautiful woman, the first thing we notice is probably certain parts of her (draw on your own experience).
Therefore, a convolutional neural network, like human vision, uses local perception: low-level neurons only perceive local information, and as signals propagate forward, higher-level neurons combine the local information into global information.

Fully connected vs. locally connected (image from the Internet)
As the figure shows, local connectivity reduces the number of trainable parameters by orders of magnitude. For example, fully connecting a 1000x1000 image to 10^6 hidden units takes 10^12 weights, whereas 10x10 local receptive fields need only about 10^8.
2. Parameter sharing
Although local receptive fields reduce the number of trainable parameters by orders of magnitude, the network still has many parameters to train.
Parameter sharing means setting multiple parameters that have the same statistical characteristics to the same value; the rationale is that the statistics of one part of an image are the same as those of the other parts. It is implemented by convolving the image (which is where the name "convolutional neural network" comes from).
You can think of it as extracting some feature from one local patch of the image (the size of the convolution kernel), then using that feature as a detector and applying it to the whole image, sliding the convolution over the entire image to obtain different features.

The convolution process (image from the Internet)
Each convolution is one way of extracting features; like a sieve, it filters out the parts of the image that match the condition (the larger the activation, the better the match). This kind of convolution further reduces the number of trainable parameters.
3. Multiple convolution kernels
As above, each convolution is one way of extracting features, so for the whole image the features extracted by a single kernel are certainly not enough. Applying multiple kernels to the same image yields multiple feature maps.

Different kernels extract different features (image from the Internet)
The multiple feature maps can be regarded as different channels of the same image, a concept that will come in handy later in the code.
4. Pooling
With the feature maps in hand, we could use the extracted features to train a classifier, but we would still face too many feature dimensions, which are hard to compute and prone to overfitting. From the standpoint of image recognition, an image may be shifted or rotated while its subject stays the same; that is, different feature vectors may correspond to the same result. Pooling is what solves this problem.

The pooling process (image from the Internet)
Pooling replaces the values within the pooling window (for example a 2*2 region) with their average (average pooling) or their maximum (max pooling).
Finally it is time to show the model. The picture below was drawn by hand (drawing it on a computer takes too long, so bear with it); it shows the convolutional neural network used in this article to train the game.

The convolutional neural network model

ͼÏñµÄ´¦Àí¹ý³Ì
1.³õʼÊäÈëËÄ·ùͼÏñ80¡Á80¡Á4£¨4´ú±íÊäÈëͨµÀ£¬³õʼʱËÄ·ùͼÏñÊÇÍêȫһֵ쩣¬¾¹ý¾í»ýºË8¡Á8¡Á4¡Á32£¨ÊäÈëͨµÀ4£¬Êä³öͨµÀ32£©£¬²½¾àΪ4£¨Ã¿²½¾í»ý×ß4¸öÏñËØµã£©£¬µÃµ½32·ùÌØÕ÷ͼ£¨feature
map£©£¬´óСΪ20¡Á20£»
2.½«20¡Á20µÄͼÏñ½øÐгػ¯£¬³Ø»¯ºËΪ2¡Á2£¬µÃµ½Í¼Ïñ´óСΪ10¡Á10£»
3.Ôٴξí»ý£¬¾í»ýºËΪ4¡Á4¡Á32¡Á64£¬²½¾àΪ2£¬µÃµ½Í¼Ïñ5¡Á5¡Á64£»
4.Ôٴξí»ý£¬¾í»ýºËΪ3¡Á3¡Á64*64£¬²½¾àΪ2£¬µÃµ½Í¼Ïñ5¡Á5¡Á64£¬ËäÈ»ÓëÉÏÒ»²½µÃµ½µÄͼÏñ¹æÄ£Ò»Ö£¬µ«Ôٴξí»ýÖ®ºóµÄͼÏñÐÅÏ¢¸üΪ³éÏó£¬Ò²¸ü½Ó½üÈ«¾ÖÐÅÏ¢£»
5.Reshape£¬¼´½«¶àÎ¬ÌØÕ÷ͼת»»ÎªÌØÕ÷ÏòÁ¿£¬µÃµ½1600άµÄÌØÕ÷ÏòÁ¿£»
6.¾¹ýÈ«Á¬½Ó1600¡Á512£¬µÃµ½512Î¬ÌØÕ÷ÏòÁ¿£»
7.ÔÙ´ÎÈ«Á¬½Ó512¡Á2£¬µÃµ½×îÖÕµÄ2άÏòÁ¿[0,1]ºÍ[1,0]£¬·Ö±ð´ú±íÓÎÏ·ÆÁÄ»ÉϵÄÊÇ·ñµã»÷ʼþ¡£
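As a quick sanity check of the shapes in the list above (a sketch assuming SAME padding, which the code below uses; with SAME padding the output size depends only on the stride):

def same_out(size, stride):
    # output edge length of a SAME-padded convolution or pooling layer: ceil(size / stride)
    return (size + stride - 1) // stride

print(same_out(80, 4))   # conv 8x8, stride 4        -> 20
print(same_out(20, 2))   # max pooling 2x2, stride 2 -> 10
print(same_out(10, 2))   # conv 4x4, stride 2        -> 5
print(same_out(5, 1))    # conv 3x3, stride 1        -> 5
print(5 * 5 * 64)        # flattened feature vector  -> 1600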
As you can see, the model learns end to end: the input is the screenshot of the game screen (processed with OpenCV in the code), and the output is the game action, i.e. whether to tap the screen. The power of deep learning lies in its ability to fit data: instead of the complicated feature-engineering step of traditional machine learning, the model itself discovers the relationships inside the data.
This, however, brings a problem of its own: deep learning is highly dependent on large amounts of labelled data, which are very expensive to obtain.
III. Algorithm: Deep Q Network
Now that we have a convolutional neural network model, how do we train it so that it converges and can guide the game actions? Machine learning is divided into supervised learning, unsupervised learning, and reinforcement learning; the Q Network introduced here belongs to reinforcement learning. Before formally introducing the Q Network, let me briefly mention its glorious history.
You may have heard the story of Google acquiring DeepMind for 400 million dollars in 2014. So how did DeepMind catch Google's eye? Ultimately it comes down to this paper:
Playing Atari with Deep Reinforcement Learning
Using reinforcement learning, the DeepMind team mastered more than 20 games, learning end to end. The algorithm they used is the Q Network. In 2015, the team published an upgraded version in Nature:
Human-level control through deep reinforcement learning
Since then, humans have been unable to beat machines in this class of games. Later came AlphaGo and Master, but that is another story. This article, too, falls within the scope of the papers above; it simply implements them on the TensorFlow platform and adds some of the author's own understanding.
Back to the main topic: the Q Network belongs to reinforcement learning, so let us first introduce reinforcement learning.

The reinforcement learning model
This figure is copied from the UCL course; the course link (YouTube) is:
https://www.youtube.com/watch?v=2pWv7GOvuf0
A reinforcement learning process has two components:
1. The agent (the learning system)
2. The environment
As shown in the figure, in each iteration the agent (learning system) first receives the state st of the environment, then produces an action at that acts on the environment. The environment receives the action at, evaluates it, and feeds back rt to the agent. Repeating this loop produces a sequence of state/action/feedback: (s1, a1, r1, s2, a2, r2, ..., sn, an, rn), and this sequence naturally reminds us of the:
Markov decision process

MDP: Markov decision process
What the Markov decision process has in common with the well-known HMM (hidden Markov model) is that both have the Markov property. So what is the Markov property? Simply put, the future state depends only on the current state and is independent of past states.
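Written as a formula (the standard definition, added here for reference):

$$P(s_{t+1} \mid s_t, s_{t-1}, \dots, s_1) = P(s_{t+1} \mid s_t)$$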
HMMs (hidden Markov models) are widely used in machine-learning areas such as speech recognition and activity recognition, while conditional random fields (Conditional Random Field) are used in natural language processing. These two models are cornerstones of the speech-recognition and NLP fields.
ÉÏͼ¿ÉÒÔÓÃÒ»¸öºÜÐÎÏóµÄÀý×ÓÀ´ËµÃ÷¡£±ÈÈçÄã±ÏÒµ½øÈëÁËÒ»¸ö¹«Ë¾£¬ÄãµÄ³õʼְ¼¶ÊÇT1£¨¶ÔӦͼÖÐµÄ s1£©£¬ÄãÔÚ¹¤×÷ÉϿ̿àŬÁ¦£¬×·ÇóÉϽø£¨¶ÔӦͼÖеÄa1£©£¬È»ºóÁìµ¼¾õµÃÄã²»´í£¬×¼±¸¸øÄãÉýÖ°£¨¶ÔӦͼÖеÄr1£©£¬ÓÚÊÇ£¬ÄãÉýµ½ÁËT2£»Äã¼ÌÐø¿Ì¿àŬÁ¦£¬×·ÇóÉϽø¡¡²»¶ÏµÄŬÁ¦£¬²»¶ÏµÄÉýÖ°£¬×îºóÉýµ½ÁËsn¡£µ±È»£¬ÄãÒ²ÓпÉÄܲ»Å¬Á¦ÉϽø£¬ÕâÒ²ÊÇÒ»ÖÖ¶¯×÷£¬»»¾ä»°Ëµ£¬¸Ã¶¯×÷aÒ²ÊôÓÚ¶¯×÷¼¯ºÏA£¬È»ºóµÃµ½µÄ·´À¡r¾ÍÊÇûÓÐÉýÖ°¼ÓнµÄ»ú»á¡£
ÕâÀï×¢ÒâÏ£¬ÎÒÃǵ±È»Ï£Íû»ñÈ¡×î¶àµÄÉýÖ°£¬ÄÇôÎÊÌâת»»Îª£ºÈçºÎ¸ù¾Ýµ±Ç°×´Ì¬s£¨sÊôÓÚ״̬¼¯S£©£¬´ÓAÖÐѡȡ¶¯×÷aÖ´ÐÐÓÚ»·¾³£¬´Ó¶ø»ñÈ¡×î¶àµÄr£¬¼´r1
+ r2 ¡¡+rnµÄºÍ×î´ó £¿ÕâÀï±ØÐëÒªÒýÈëÒ»¸öÊýѧ¹«Ê½£º×´Ì¬Öµº¯Êý¡£

״ֵ̬º¯ÊýÄ£ÐÍ
The formula contains a discount factor γ whose value lies in [0, 1]. When it is 0, only the effect of the current action on the present is considered, not its effect on later steps; when it is 1, the current action has an equal influence on every later step. In practice the current action usually does influence later rewards, but the influence decreases as the number of steps grows.
The formula also shows that the state value function can be solved iteratively. The goal of reinforcement learning is to find the optimal policy of the Markov decision process (MDP).
A policy is the rule by which an action is chosen given the environment. Policies can be stable or unstable: a stable policy always gives the same action in the same environment, an unstable one does not. Here we mainly discuss stable policies.
Solving the state value function above requires dynamic programming, and when it comes to the concrete formula, we have to mention:
The Bellman equation

ÆäÖУ¬¦Ð´ú±íÉÏÊöÌáµ½µÄ²ßÂÔ£¬Q ¦Ð (s, a)Ïà±ÈÓÚV ¦Ð (s)£¬ÒýÈëÁ˶¯×÷£¬±»³Æ×÷¶¯×÷Öµº¯Êý¡£¶Ô±´¶ûÂü·½³ÌÇó×îÓŽ⣬¾ÍµÃµ½Á˱´¶ûÂü×îÓÅÐÔ·½³Ì¡£


There are two ways to solve this equation: policy iteration and value iteration.
Policy iteration
Policy iteration has two steps, policy evaluation and policy improvement: first evaluate the policy to obtain the state value function, then improve the policy; if the new policy is better than the old one, it replaces it.

Value iteration
As we saw above, the policy iteration algorithm contains a policy evaluation step, and policy evaluation has to sweep all states several times; this huge amount of computation directly limits the efficiency of policy iteration. Value iteration, by contrast, sweeps only once per update, and the update goes as follows:
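In its standard form (a reconstruction) the value-iteration update is:

$$V_{k+1}(s) \leftarrow \max_{a}\, \mathbb{E}\left[ r + \gamma\, V_{k}(s') \mid s, a \right]$$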

That is, at the (k+1)-th iteration of value iteration, the largest attainable value of Vπ(s) is assigned directly to Vk+1.
Q-Learning
Q-Learning learns along the lines of value iteration. In this algorithm, the Q value is updated as follows:
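In its standard form (a reconstruction; the article writes the discount factor as λ here) the update is:

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \lambda \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \right]$$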

Although the target Q value is computed by value iteration, it is not assigned directly to the new Q (it is only an estimate). Instead, similar to gradient descent, Q moves a small step towards the target, with the step size controlled by α; this reduces the impact of estimation errors. Much like stochastic gradient descent, it eventually converges to the optimal Q value. The full algorithm is as follows:

If you have never been exposed to dynamic programming, the formulas above may make your head spin; the table below walks through how the Q values are updated, and it should all become clear.

The Q-Learning algorithm is essentially a procedure for storing Q values. In the table above, the rows are the states s and the columns are the actions a; together s and a determine a Q value in the table.
Step 1: initialize, setting all Q values in the table to 0.
Step 2: according to the policy and the state s, pick an action a and execute it. Suppose the current state is s1; since all initial values are 0, we pick an action arbitrarily, say a2, obtain a reward of 1, and end up in state s3. Using the Q-value update formula:

we update the Q value. Here we assume α is 1 and λ is also 1, i.e. each time the target Q value is assigned to Q directly. The formula then becomes:

So in this case it is

ÄÇô¶ÔÓ¦µÄs3״̬£¬×î´óÖµÊÇ0£¬ËùÒÔ

and the Q table becomes:

Then the current state s is set to s3.
Step 3: continue the loop with the next action. The current state is s3; suppose we pick action a3, obtain a reward of 2, and the state becomes s1. We update in the same way:

so the Q table becomes:

µÚËIJ½£º ¼ÌÐøÑ»·£¬QÖµÔÚÊÔÑéµÄͬʱ·´¸´¸üУ¬Ö±µ½ÊÕÁ²¡£
The table above demonstrates a system with 4 states and 4 actions. In a real application, however, such as the Flappy Bird game in this article, the screen is 80*80 pixels and every pixel has 256 possible values, so the actual number of states is 256 to the power of 80*80. That is an enormous number, which makes the tabular approach completely infeasible.
Therefore, to reduce the dimensionality, a value-function-approximation method is introduced: the value function is approximated by a function:
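The approximation shown in the original image can be written roughly as (a reconstruction consistent with the parameters ω and b mentioned next, with x standing for the feature representation of the state):

$$Q(s, a) \approx f(s, a; \omega, b) = \omega\, x + b$$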

ÆäÖУ¬¦Ø Óë b ·Ö±ðΪ²ÎÊý¡£¿´µ½ÕâÀÖÕÓÚ¿ÉÒÔÁªÏµµ½Ç°ÃæÌáµ½µÄÉñ¾ÍøÂçÁË£¬ÉÏÃæµÄ±í´ïʽ²»¾ÍÊÇÉñ¾ÔªµÄº¯ÊýÂð£¿
Q-network
ÏÂÃæÕâÕÅͼÀ´×ÔÂÛÎÄ¡¶Human-level Control through
Deep Reinforcement Learning¡·£¬ÆäÖÐÏêϸ½éÉÜÁËÉÏÊö½«QÖµÉñ¾ÍøÂ绯µÄ¹ý³Ì¡££¨¸ÐÐËȤµÄ¿ÉÒÔµã֮ǰµÄÁ´½ÓÁ˽âÔÎÄ¡«£©

In this article, the input is 4 consecutive preprocessed 80x80 images; it passes through three convolutional layers, one pooling layer, and two fully connected layers, and the output is a vector containing the Q value of each action.
We have now turned Q-learning into a Q-network. The next question is how to train the network. Training a neural network is really an optimization problem: define the system's loss function, then minimize it.
Training follows the DQN algorithm mentioned above, using the target Q value as the label, so the loss function can be defined as:

In the formula above, s' and a' are the next state and action. With the loss function determined and the way of obtaining samples determined, the whole DQN algorithm takes shape!

Note the D here: Experience Replay, the experience pool; it is the question of how samples are stored and sampled.
Because playing Flappy Bird produces samples that form a time series, consecutive samples are strongly correlated. If we updated the Q value every time a sample arrived, the sample distribution would hurt the result. A very direct idea, then, is to store the samples first and then sample from them at random. That is exactly the idea of Experience Replay.
In the implementation, the game is played repeatedly and the experience data are stored in D; once enough data have accumulated, data are drawn from D at random and gradient descent is run on the loss function.
ËÄ¡¢´úÂ룺TensorFlowʵÏÖ
ÖÕÓÚµ½ÁË¿´´úÂëµÄʱºò¡£Ê×ÏÈÉêÃ÷Ï£¬µ±±ÊÕß´ÓDeep MindµÄÂÛÎÄÈëÊÖ£¬ÊÔͼÓÃTensorFlowʵÏÖ¶ÔFlappy
BirdÓÎÏ·½øÐÐʵÏÖʱ£¬·¢ÏÖgithubÒÑÓдóÉñÍê³Édemo¡£Ë¼Â·Ïàͬ£¬ËùÒÔÖ±½ÓÒÔ¹«¿ª´úÂëΪÀý½øÐзÖÎö˵Ã÷ÁË¡£
ÈçÓÐÔ´ÂëÐèÒª£¬ÇëÒÆ²½github£ºUsing Deep Q-Network to Learn How
To Play Flappy Bird¡£
´úÂë´Ó½á¹¹ÉÏÀ´½²£¬Ö÷Òª·ÖΪÒÔϼ¸²¿·Ö£º
1. The GameState game class, whose frame_step method drives the game
2. Building the CNN model
3. Image preprocessing with OpenCV-Python
4. The model training loop
1. The GameState class and the frame_step method
Implementing the game in Python naturally uses the pygame library, which provides a clock, basic display control, game widgets, event handling, and so on; if you are interested, look into pygame in detail. The argument of frame_step is an ndarray of shape (2,) whose value is either [1,0]: do nothing, or [0,1]: flap the bird upward. Here is the implementation:
if input_actions[1] == 1:          # [0,1] means "flap"
    if self.playery > -2 * PLAYER_HEIGHT:
        self.playerVelY = self.playerFlapAcc
        self.playerFlapped = True
        # SOUNDS['wing'].play()
The subsequent operations include checking the score, drawing the screen, checking for collisions, and so on; they are not expanded here.
The return value of frame_step is:
return image_data, reward, terminal
These are, respectively, the screen image data, the score, and whether the game is over. Mapped onto the reinforcement learning model above, the screen image data is the environment state s and the score is the feedback r that the environment gives the learning system.
2. Building the CNN model
The demo contains three convolutional layers, one pooling layer, and two fully connected layers, and finally outputs a vector containing the Q value of each action. So first define the weight, bias, convolution, and pooling helpers:
# weights
def weight_variable(shape):
    initial = tf.truncated_normal(shape, stddev=0.01)
    return tf.Variable(initial)

# biases
def bias_variable(shape):
    initial = tf.constant(0.01, shape=shape)
    return tf.Variable(initial)

# convolution
def conv2d(x, W, stride):
    return tf.nn.conv2d(x, W, strides=[1, stride, stride, 1], padding="SAME")

# pooling
def max_pool_2x2(x):
    return tf.nn.max_pool(x, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1], padding="SAME")
Then build the convolutional neural network model with these helpers (if any of the parameters are unclear, scroll back up to the hand-drawn figure).
def createNetwork():
    # first convolutional layer
    W_conv1 = weight_variable([8, 8, 4, 32])
    b_conv1 = bias_variable([32])
    # second convolutional layer
    W_conv2 = weight_variable([4, 4, 32, 64])
    b_conv2 = bias_variable([64])
    # third convolutional layer
    W_conv3 = weight_variable([3, 3, 64, 64])
    b_conv3 = bias_variable([64])
    # first fully connected layer
    W_fc1 = weight_variable([1600, 512])
    b_fc1 = bias_variable([512])
    # second fully connected layer
    W_fc2 = weight_variable([512, ACTIONS])
    b_fc2 = bias_variable([ACTIONS])
    # input layer
    s = tf.placeholder("float", [None, 80, 80, 4])
    # first hidden layer + pooling layer
    h_conv1 = tf.nn.relu(conv2d(s, W_conv1, 4) + b_conv1)
    h_pool1 = max_pool_2x2(h_conv1)
    # second hidden layer (only one pooling layer is used)
    h_conv2 = tf.nn.relu(conv2d(h_pool1, W_conv2, 2) + b_conv2)
    # h_pool2 = max_pool_2x2(h_conv2)
    # third hidden layer
    h_conv3 = tf.nn.relu(conv2d(h_conv2, W_conv3, 1) + b_conv3)
    # h_pool3 = max_pool_2x2(h_conv3)
    # reshape
    # h_pool3_flat = tf.reshape(h_pool3, [-1, 256])
    h_conv3_flat = tf.reshape(h_conv3, [-1, 1600])
    # fully connected layer
    h_fc1 = tf.nn.relu(tf.matmul(h_conv3_flat, W_fc1) + b_fc1)
    # output (readout) layer
    readout = tf.matmul(h_fc1, W_fc2) + b_fc2
    return s, readout, h_fc1
3. Image preprocessing with OpenCV-Python
Installing OpenCV on Ubuntu is fairly tedious; the author also hit quite a few pitfalls at the time and solved them by Googling. Installing opencv3 is recommended.
Õⲿ·ÖÖ÷Òª¶Ôframe_step·½·¨·µ»ØµÄÊý¾Ý½øÐÐÁ˻ҶȻ¯ºÍ¶þÖµ»¯£¬Ò²¾ÍÊÇ×î»ù±¾µÄͼÏñÔ¤´¦Àí·½·¨¡£
x_t, r_0, terminal = game_state.frame_step(do_nothing)
# first resize the image to 80*80, then convert it to grayscale
x_t = cv2.cvtColor(cv2.resize(x_t, (80, 80)), cv2.COLOR_BGR2GRAY)
# binarize the grayscale image
ret, x_t = cv2.threshold(x_t, 1, 255, cv2.THRESH_BINARY)
# four-channel input image
s_t = np.stack((x_t, x_t, x_t, x_t), axis=2)
|
4. The DQN training loop
This is the key part of the code, and it is the Q-learning algorithm described above turned into code.
i. Before training starts, first create some variables:
# define the cost function
a = tf.placeholder("float", [None, ACTIONS])
y = tf.placeholder("float", [None])
readout_action = tf.reduce_sum(tf.multiply(readout, a), axis=1)
cost = tf.reduce_mean(tf.square(y - readout_action))
train_step = tf.train.AdamOptimizer(1e-6).minimize(cost)
# open up a game state to communicate with emulator
game_state = game.GameState()
# store the previous observations in replay memory
D = deque()
In TensorFlow there are generally three ways of supplying data: Feeding, Reading from files, and Preloaded data. Feeding is the most commonly used approach: before the model (Graph) is built, placeholders reserve the slots; no training data exist at that point, and the data are supplied at training time through feed_dict.
Here a is the output action, i.e. the Action of the reinforcement learning model; y is the label value; readout_action is the model output multiplied by a and summed along axis 1; the cost squares the difference between the label and the output; and train_step optimizes the cost with Adam.
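To make the multiply-and-sum step concrete, here is a tiny numpy illustration (not part of the demo): multiplying the Q-value output by the one-hot action and summing along axis 1 just picks out the Q value of the action that was taken.

import numpy as np

readout = np.array([[0.7, 0.2],   # Q values for a batch of 2 states (2 actions each)
                    [0.1, 0.9]])
a = np.array([[1.0, 0.0],         # one-hot encoding of the actions actually taken
              [0.0, 1.0]])
readout_action = np.sum(readout * a, axis=1)
print(readout_action)             # [0.7 0.9] -> the chosen action's Q value for each sample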
The values are supplied like this:
# perform gradient step
train_step.run(feed_dict={
    y: y_batch,
    a: a_batch,
    s: s_j_batch}
)
|
ii. Create the game and the experience pool D
# open up a game state to communicate with emulator
game_state = game.GameState()
# store the previous observations in replay memory
D = deque()
The experience pool D uses a queue data structure (Python's collections.deque from the standard library); items are appended with append() and removed from the front with popleft(). D stores the data generated while playing, and the training loop later draws a random batch from it.
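A minimal sketch of how such a replay memory behaves (illustrative values only; the cap passed to maxlen is an assumption, not taken from the demo):

from collections import deque
import random

D = deque(maxlen=50000)   # replay memory; when full, the oldest transition is silently dropped

# store a few (state, action, reward, next_state, terminal) transitions with dummy values
for i in range(5):
    D.append(('s%d' % i, 'a%d' % i, i, 's%d' % (i + 1), False))

minibatch = random.sample(D, 3)   # later: draw a random minibatch for one gradient step
print(len(D), minibatch[0])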
Once the variables have been created, tf.global_variables_initializer() must be called to add an op that initializes them; it is run after the model has been built, right after the Session is created. For example:
# Create two variables.
weights = tf.Variable(tf.random_normal([784, 200], stddev=0.35),
                      name="weights")
biases = tf.Variable(tf.zeros([200]), name="biases")
...
# Add an op to initialize the variables.
init_op = tf.global_variables_initializer()
# Later, when launching the model
with tf.Session() as sess:
    # Run the init operation.
    sess.run(init_op)
    ...
    # Use the model
    ...
iii. Saving and loading parameters
When training a model with TensorFlow, the learned parameters need to be saved; otherwise one power-off and you are back to square one. TensorFlow uses a Saver for this. The Saver instance is usually obtained with tf.train.Saver() before the Session() is created.
Variables are restored with the saver's restore method:
# Create some variables.
v1 = tf.Variable(..., name="v1")
v2 = tf.Variable(..., name="v2")
...
# Add ops to save and restore all the variables.
saver = tf.train.Saver()
# Later, launch the model, use the saver to restore
# variables from disk, and do some work with the model.
with tf.Session() as sess:
    # Restore variables from disk.
    saver.restore(sess, "/tmp/model.ckpt")
    print("Model restored.")
    # Do some work with the model
    ...
The demo also uses a Saver to save the parameters during training.
# saving and loading networks
saver = tf.train.Saver()
checkpoint = tf.train.get_checkpoint_state("saved_networks")
if checkpoint and checkpoint.model_checkpoint_path:
    saver.restore(sess, checkpoint.model_checkpoint_path)
    print("Successfully loaded:", checkpoint.model_checkpoint_path)
else:
    print("Could not find old network weights")
|
Ê×ÏȼÓÔØCheckPointStateÎļþ£¬È»ºó²ÉÓÃsaver.restore¶ÔÒÑ´æÔÚ²ÎÊý½øÐлָ´¡£
ÔÚ¸ÃDemoÖУ¬Ã¿¸ô10000²½£¬¾Í¶Ô²ÎÊý½øÐб£´æ£º
# save progress every 10000 iterations
if t % 10000 == 0:
    saver.save(sess, 'saved_networks/' + GAME + '-dqn', global_step=t)
|
iv. Playing and storing samples
First, an Action is chosen according to the probability ε (ε-greedy).
# choose an action epsilon greedily
readout_t = readout.eval(feed_dict={s: [s_t]})[0]
a_t = np.zeros([ACTIONS])
action_index = 0
if t % FRAME_PER_ACTION == 0:
    if random.random() <= epsilon:
        print("----------Random Action----------")
        action_index = random.randrange(ACTIONS)
        a_t[action_index] = 1   # mark the randomly chosen action
    else:
        action_index = np.argmax(readout_t)
        a_t[action_index] = 1
else:
    a_t[0] = 1  # do nothing
|
Here readout_t is the model's output for the four-channel image mentioned earlier, and a_t is the Action chosen with probability ε.
Next, the chosen action is executed, and the returned state and score are saved.
# run the selected action and observe next state and reward
x_t1_colored, r_t, terminal = game_state.frame_step(a_t)
x_t1 = cv2.cvtColor(cv2.resize(x_t1_colored, (80, 80)), cv2.COLOR_BGR2GRAY)
ret, x_t1 = cv2.threshold(x_t1, 1, 255, cv2.THRESH_BINARY)
x_t1 = np.reshape(x_t1, (80, 80, 1))
# s_t1 = np.append(x_t1, s_t[:,:,1:], axis = 2)
s_t1 = np.append(x_t1, s_t[:, :, :3], axis=2)
# store the transition in D
D.append((s_t, a_t, r_t, s_t1, terminal))
The experience pool D stores a Markov sequence. (s_t, a_t, r_t, s_t1, terminal) are the state s_t at time t, the action a_t taken, the feedback r_t received, the next state s_t1, and the flag terminal indicating whether the game ended.
Before the next step of the loop, the current state and the step counter are updated:
# update the old values
s_t = s_t1
t += 1
|
Repeating this process gives the repeated play and sample storage.
v. Training the model by gradient descent
After playing for a while, once the experience pool D holds some samples, we can sample from it at random and train the model. Training only begins after OBSERVE = 100000 observation steps, and each training step randomly samples BATCH = 32 transitions.
if t > OBSERVE:
    # sample a minibatch to train on
    minibatch = random.sample(D, BATCH)
    # get the batch variables
    s_j_batch = [d[0] for d in minibatch]
    a_batch = [d[1] for d in minibatch]
    r_batch = [d[2] for d in minibatch]
    s_j1_batch = [d[3] for d in minibatch]
    y_batch = []
    readout_j1_batch = readout.eval(feed_dict={s: s_j1_batch})
    for i in range(0, len(minibatch)):
        terminal = minibatch[i][4]
        # if terminal, only equals reward
        if terminal:
            y_batch.append(r_batch[i])
        else:
            y_batch.append(r_batch[i] + GAMMA * np.max(readout_j1_batch[i]))
    # perform gradient step
    train_step.run(feed_dict={
        y: y_batch,
        a: a_batch,
        s: s_j_batch}
    )
s_j_batch, a_batch, r_batch, and s_j1_batch are the Markov transitions drawn from the experience pool D (Java folks will envy Python's list comprehensions), and y_batch holds the labels: if the game ended, there is no Q value for the next state (recall the Q-value update), so r_batch is appended directly; otherwise the discount factor (0.99) times the maximum Q value of the next state is added and appended to y_batch.
Finally, the gradient-descent training step train_step is run with s_j_batch, a_batch, and y_batch as inputs. After roughly 2,000,000 steps of training (about 10 hours on this machine), you get the result shown in the animation at the beginning of this article.