Using the Scrapy Framework: Crawling Sina Weibo with Scrapy
 
  3166 views      27
 2019-7-3
 
Editor's note:
This article comes from tencent. It uses Sina Weibo, a site with fairly strong anti-crawling measures, as an example of large-scale crawling with Scrapy. We hope you find it helpful.

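1. Goals of This Section

The target of this crawl is the public basic information of Sina Weibo users, such as the nickname, the avatar, the lists of accounts they follow and of their fans, and the posts they have published. Once scraped, this information is saved to MongoDB.

2. Preparation

Make sure the proxy pool and Cookies pool described in the earlier sections are implemented and running normally, and that the Scrapy and PyMongo libraries are installed.

3. Crawling Strategy

First we need to crawl users at scale. The approach is to start from a few Weibo "big V" accounts (verified, high-profile users), crawl each one's fan and follow lists, then fetch the fan and follow lists of every user in those lists, and so on. Continuing this way gives a recursive crawl. Any user connected to other users in the social graph will eventually be reached, so in principle we can cover every user linked, directly or indirectly, to the seed accounts. This gives us each user's unique ID, and with the ID we can fetch the posts that user has published.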
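
The recursive strategy above can be sketched as a breadth-first traversal of the social graph. The snippet below is only an illustration under simplifying assumptions: the graph is a small in-memory dictionary (GRAPH and the user names are made up), and a visited set stands in for the deduplication a real crawler needs so that mutual follows do not loop forever.

```python
from collections import deque

# Hypothetical toy graph: user -> follow and fan lists
GRAPH = {
    'big_v': {'follows': ['alice'], 'fans': ['bob', 'carol']},
    'alice': {'follows': ['big_v'], 'fans': ['dave']},
    'bob': {'follows': ['big_v', 'carol'], 'fans': []},
    'carol': {'follows': [], 'fans': ['bob']},
    'dave': {'follows': ['alice'], 'fans': []},
    'isolated': {'follows': [], 'fans': []},
}


def crawl(seeds, graph):
    """Return every user reachable from the seeds via follow or fan links."""
    visited = set(seeds)
    queue = deque(seeds)
    order = []
    while queue:
        uid = queue.popleft()
        order.append(uid)
        relations = graph.get(uid, {})
        for neighbour in relations.get('follows', []) + relations.get('fans', []):
            if neighbour not in visited:
                visited.add(neighbour)
                queue.append(neighbour)
    return order


reached = crawl(['big_v'], GRAPH)
```

A user with no link to the seeds, such as isolated here, is never reached, which is why the choice of seed accounts matters.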

4. Crawl Analysis

The site we crawl is https://m.weibo.cn, the mobile version of Weibo. Opening it redirects to a login page because the home page is restricted to logged-in users. We can bypass that restriction by opening a user's detail page directly. For example, Zhou Dongyu's Weibo page is at https://m.weibo.cn/u/1916655407, which takes us straight to her profile page, as shown in the figure below.

Open the browser's developer tools, switch to the XHR filter, and keep scrolling down the follow list. Many Ajax requests appear below. These are the requests that fetch Zhou Dongyu's follow list, as shown in the figure below.

Open the first Ajax request. Its URL is https://m.weibo.cn/api/container/getIndex?containerid=231051_-_followers_-_1916655407&luicode=10000011&lfid=1005051916655407&featurecode=20000320&type=uid&value=1916655407&page=2, and its details are shown in the figure below.

The request is a GET request and the response is JSON. Expanding it shows the basic information of the users she follows. Next we only need to construct the parameters of this request. The URL has seven parameters in total, as shown in the figure below.

The most important parameters are containerid and page. With just these two we still get the same result, so the endpoint can be reduced to https://m.weibo.cn/api/container/getIndex?containerid=231051_-_followers_-_1916655407&page=2. The first half of containerid is fixed and the second half is the user's ID. The parameters can therefore be constructed: changing only the ID at the end of containerid and the page parameter is enough to fetch the follow list page by page.
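
To make the parameter analysis concrete, the following sketch uses Python's standard urllib.parse module to list the seven query parameters of the captured URL and to rebuild the reduced URL from containerid and page alone. The helper name follow_list_url is illustrative and not part of the project.

```python
from urllib.parse import urlsplit, parse_qs, urlencode

captured = ('https://m.weibo.cn/api/container/getIndex'
            '?containerid=231051_-_followers_-_1916655407&luicode=10000011'
            '&lfid=1005051916655407&featurecode=20000320'
            '&type=uid&value=1916655407&page=2')

# Split the query string into a {name: [values]} dictionary
params = parse_qs(urlsplit(captured).query)


def follow_list_url(uid, page):
    """Build the reduced follow-list URL (illustrative helper)."""
    query = urlencode({
        'containerid': '231051_-_followers_-_{}'.format(uid),
        'page': page,
    })
    return 'https://m.weibo.cn/api/container/getIndex?' + query


reduced = follow_list_url('1916655407', 2)
```

Only the user ID and the page number vary between calls, which is what lets the Spider treat the URL as a template.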

Using the same method we can also work out the Ajax URLs for user details and for a user's post list, as follows:

# User detail API
user_url = 'https://m.weibo.cn/api/container/getIndex?uid={uid}&type=uid&value={uid}&containerid=100505{uid}'
# Follow list API
follow_url = 'https://m.weibo.cn/api/container/getIndex?containerid=231051_-_followers_-_{uid}&page={page}'
# Fan list API
fan_url = 'https://m.weibo.cn/api/container/getIndex?containerid=231051_-_fans_-_{uid}&page={page}'
# Post list API
weibo_url = 'https://m.weibo.cn/api/container/getIndex?uid={uid}&type=uid&page={page}&containerid=107603{uid}'

Here {uid} and {page} stand for the user ID and the page number respectively.
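
As a quick check of how the placeholders are filled, the sketch below applies str.format to two of the templates. It assumes the templates carry {uid} and {page} placeholders, which is what the format(uid=..., page=...) calls in the Spider imply, and it reuses the example user ID 1916655407 from earlier in this section.

```python
follow_url = 'https://m.weibo.cn/api/container/getIndex?containerid=231051_-_followers_-_{uid}&page={page}'
weibo_url = 'https://m.weibo.cn/api/container/getIndex?uid={uid}&type=uid&page={page}&containerid=107603{uid}'

uid = '1916655407'
# A placeholder that appears twice in a template is filled in both places
first_follow_page = follow_url.format(uid=uid, page=1)
third_weibo_page = weibo_url.format(uid=uid, page=3)
```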

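Note that these APIs may change over time or when Weibo is redesigned, so verify them against the live site.

We start from a few big-V accounts and scrape their fans, follow lists, and posts, then recursively scrape the fans, follow lists, and posts of the users in those lists. In the end we save each Weibo user's basic information, follow and fan lists, and published posts.

We choose MongoDB as the storage database because it makes storing users' fan and follow lists more convenient.

5. Creating the Project

Next we implement the crawl with Scrapy. First create a project with the following command:

scrapy startproject weibo

Enter the project and create a new Spider named weibocn with the following command:

scrapy genspider weibocn m.weibo.cn

We first modify the Spider: configure the Ajax URLs, pick a few big-V accounts and put their IDs into a list, and implement the start_requests() method, which requests each big V's profile details in turn and parses them with parse_user(), as shown below: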

from scrapy import Request, Spider


class WeiboSpider(Spider):
    name = 'weibocn'
    allowed_domains = ['m.weibo.cn']
    # API templates: {uid} is the user ID and {page} is the page number
    user_url = 'https://m.weibo.cn/api/container/getIndex?uid={uid}&type=uid&value={uid}&containerid=100505{uid}'
    follow_url = 'https://m.weibo.cn/api/container/getIndex?containerid=231051_-_followers_-_{uid}&page={page}'
    fan_url = 'https://m.weibo.cn/api/container/getIndex?containerid=231051_-_fans_-_{uid}&page={page}'
    weibo_url = 'https://m.weibo.cn/api/container/getIndex?uid={uid}&type=uid&page={page}&containerid=107603{uid}'
    # IDs of the big-V accounts used as starting points
    start_users = ['3217179555', '1742566624', '2282991915', '1288739185', '3952070245', '5878659096']

    def start_requests(self):
        # Request the detail API of every seed user
        for uid in self.start_users:
            yield Request(self.user_url.format(uid=uid), callback=self.parse_user)

    def parse_user(self, response):
        self.logger.debug(response)

6. Creating the Items

Next we parse the user's basic information and generate Items. We first define several Items for users, user relations, and posts, as shown below:

from scrapy import Item, Field


class UserItem(Item):
    collection = 'users'
    id = Field()
    name = Field()
    avatar = Field()
    cover = Field()
    gender = Field()
    description = Field()
    fans_count = Field()
    follows_count = Field()
    weibos_count = Field()
    verified = Field()
    verified_reason = Field()
    verified_type = Field()
    follows = Field()
    fans = Field()
    crawled_at = Field()


class UserRelationItem(Item):
    collection = 'users'
    id = Field()
    follows = Field()
    fans = Field()


class WeiboItem(Item):
    collection = 'weibos'
    id = Field()
    attitudes_count = Field()
    comments_count = Field()
    reposts_count = Field()
    picture = Field()
    pictures = Field()
    source = Field()
    text = Field()
    raw_text = Field()
    thumbnail = Field()
    user = Field()
    created_at = Field()
    crawled_at = Field()

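Each Item defines a collection attribute that names the MongoDB Collection it is saved to. The user's follow and fan lists are defined as a separate UserRelationItem, in which id is the user's ID, follows is the list of accounts the user follows, and fans is the list of fans. This does not mean the follow and fan lists are stored in a separate Collection. Later a Pipeline processes each Item and merges the relation data into the users Collection, so Items and Collections do not necessarily correspond one to one.

7. Extracting the Data

We start by parsing the user's basic information in the parse_user() method, as shown below: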

# Requires `import json` and the Item classes from the project's items module
# at the top of the spider file.
def parse_user(self, response):
    """
    Parse user information.

    :param response: Response object
    """
    result = json.loads(response.text)
    if result.get('data').get('userInfo'):
        user_info = result.get('data').get('userInfo')
        user_item = UserItem()
        # Item field name -> field name in the JSON response
        field_map = {
            'id': 'id', 'name': 'screen_name', 'avatar': 'profile_image_url', 'cover': 'cover_image_phone',
            'gender': 'gender', 'description': 'description', 'fans_count': 'followers_count',
            'follows_count': 'follow_count', 'weibos_count': 'statuses_count', 'verified': 'verified',
            'verified_reason': 'verified_reason', 'verified_type': 'verified_type'
        }
        for field, attr in field_map.items():
            user_item[field] = user_info.get(attr)
        yield user_item
        # Follow list
        uid = user_info.get('id')
        yield Request(self.follow_url.format(uid=uid, page=1), callback=self.parse_follows,
                      meta={'page': 1, 'uid': uid})
        # Fans
        yield Request(self.fan_url.format(uid=uid, page=1), callback=self.parse_fans,
                      meta={'page': 1, 'uid': uid})
        # Posts
        yield Request(self.weibo_url.format(uid=uid, page=1), callback=self.parse_weibos,
                      meta={'page': 1, 'uid': uid})

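Two things happen here.

First, we parse the JSON, extract the user information, and yield a UserItem. Rather than assigning fields one by one, we define a field mapping. The field names we chose may differ from the user field names in the JSON, so we define the mapping as a dictionary and iterate over it to assign each field.

Second, we build the URLs for the first page of the user's follow list, fan list, and posts, and generate a Request for each. The only parameter needed is the user's ID, and the initial page number is simply set to 1.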
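
The field-mapping idea can be tried on its own with plain dictionaries. In the sketch below, user_info is a made-up sample rather than a real API response, and a dict comprehension plays the role of the loop that fills the UserItem.

```python
# Sample of the JSON fields returned for a user (values are made up)
user_info = {
    'id': 1,
    'screen_name': 'example_user',
    'followers_count': 120,
    'follow_count': 30,
    'unused_field': 'ignored',
}

# Item field name -> JSON field name
field_map = {
    'id': 'id',
    'name': 'screen_name',
    'fans_count': 'followers_count',
    'follows_count': 'follow_count',
    'description': 'description',
}

# dict.get returns None for a key the JSON does not contain
user = {field: user_info.get(attr) for field, attr in field_map.items()}
```

Fields missing from the response end up as None instead of raising an error, and fields not listed in the mapping are dropped.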

Next we need to save the user's follow and fan lists. Taking the follow list as an example, it is parsed by parse_follows(), implemented as follows:

def parse_follows(self, response):
    """
    Parse the user's follow list.

    :param response: Response object
    """
    result = json.loads(response.text)
    if result.get('ok') and result.get('data').get('cards') and len(result.get('data').get('cards')) and \
            result.get('data').get('cards')[-1].get('card_group'):
        # Request the details of every followed user
        follows = result.get('data').get('cards')[-1].get('card_group')
        for follow in follows:
            if follow.get('user'):
                uid = follow.get('user').get('id')
                yield Request(self.user_url.format(uid=uid), callback=self.parse_user)
        # Follow list of the current user
        uid = response.meta.get('uid')
        user_relation_item = UserRelationItem()
        # Skip cards without a user object, as the loop above does
        follows = [{'id': follow.get('user').get('id'), 'name': follow.get('user').get('screen_name')}
                   for follow in follows if follow.get('user')]
        user_relation_item['id'] = uid
        user_relation_item['follows'] = follows
        user_relation_item['fans'] = []
        yield user_relation_item
        # Next page of the follow list
        page = response.meta.get('page') + 1
        yield Request(self.follow_url.format(uid=uid, page=page),
                      callback=self.parse_follows, meta={'page': page, 'uid': uid})

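This method does three things.

First, it parses each user in the follow list and issues a new request for that user. We parse the follow list to get each user's ID, then use user_url to build a Request for that user's details, with the parse_user() method defined earlier as the callback.

Second, it extracts the key information from the follow list and generates a UserRelationItem. The id field is set to the ID of the user whose list is being crawled. The user objects in the returned JSON carry many redundant fields, so we keep only the ID and user name of each followed user and assign them to the follows field, while the fans field is set to an empty list. This gives us a UserRelationItem holding a user ID and part of that user's follow list. Later, the follow and fan lists of UserRelationItems with the same ID are merged and saved.

Third, it requests the next page of the follow list by adding 1 to the page number of the current request. The page number is passed along through the Request's meta attribute and read back from the Response's meta. We then build and yield the Request for the next page of the follow list.

Crawling the fan list works the same way as crawling the follow list, so it is not repeated here.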
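
Since the fan-list parser is not shown, the following sketch illustrates the extraction step that both parsers share, written as a plain function so it can be tried without Scrapy. It assumes the fan-list response has the same shape as the follow-list response handled by parse_follows() (a cards list whose last entry holds a card_group of user entries). The function name extract_users and the sample payload are made up for illustration.

```python
def extract_users(result):
    """Return [{'id': ..., 'name': ...}] from a follow or fan list response."""
    cards = (result.get('data') or {}).get('cards') or []
    if not result.get('ok') or not cards:
        return []
    group = cards[-1].get('card_group') or []
    return [
        {'id': entry['user'].get('id'), 'name': entry['user'].get('screen_name')}
        for entry in group
        if entry.get('user')  # skip cards that carry no user object
    ]


# Made-up payload mimicking the structure described above
sample = {
    'ok': 1,
    'data': {'cards': [
        {'card_group': []},
        {'card_group': [
            {'user': {'id': 1, 'screen_name': 'a', 'extra': 'dropped'}},
            {'desc': 'a card without a user'},
            {'user': {'id': 2, 'screen_name': 'b'}},
        ]},
    ]},
}

users = extract_users(sample)
empty = extract_users({'ok': 0})
```

In a parse_fans() callback the returned list would go into the fans field of a UserRelationItem, with follows left as an empty list.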

One method remains to be implemented: parse_weibos(), which scrapes a user's posts. It is implemented as follows:

def parse_weibos(self, response):
    """
    Parse the post list.

    :param response: Response object
    """
    result = json.loads(response.text)
    if result.get('ok') and result.get('data').get('cards'):
        weibos = result.get('data').get('cards')
        for weibo in weibos:
            mblog = weibo.get('mblog')
            if mblog:
                weibo_item = WeiboItem()
                # Item field name -> field name in the JSON response
                field_map = {
                    'id': 'id', 'attitudes_count': 'attitudes_count', 'comments_count': 'comments_count',
                    'created_at': 'created_at', 'reposts_count': 'reposts_count', 'picture': 'original_pic',
                    'pictures': 'pics', 'source': 'source', 'text': 'text', 'raw_text': 'raw_text',
                    'thumbnail': 'thumbnail_pic'
                }
                for field, attr in field_map.items():
                    weibo_item[field] = mblog.get(attr)
                weibo_item['user'] = response.meta.get('uid')
                yield weibo_item
        # Next page of posts
        uid = response.meta.get('uid')
        page = response.meta.get('page') + 1
        yield Request(self.weibo_url.format(uid=uid, page=page), callback=self.parse_weibos,
                      meta={'uid': uid, 'page': page})

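Here the parse_weibos() method does two things.

First, it extracts the user's posts and generates a WeiboItem for each. A field mapping is again used to assign the fields in bulk.

Second, it requests the next page of the post list, again passing the user ID and the page number.

At this point the Weibo Spider is complete. We still need to clean and store the data, and to hook up the proxy pool and Cookies pool to cope with anti-crawling measures.

8. Data Cleaning

Some post timestamps are not in a standard format. They may appear as 刚刚 (just now), as a number of minutes ago (for example 5分钟前), as a number of hours ago (for example 2小时前), as 昨天 (yesterday) followed by a time, and so on. We need to normalize these, so we implement a parse_time() method, as shown below: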

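# Requires `import re` and `import time` at the top of the module.
def parse_time(self, date):
    # "刚刚" means "just now"
    if re.match('刚刚', date):
        date = time.strftime('%Y-%m-%d %H:%M', time.localtime(time.time()))
    # "N分钟前" means "N minutes ago"
    if re.match(r'\d+分钟前', date):
        minute = re.match(r'(\d+)', date).group(1)
        date = time.strftime('%Y-%m-%d %H:%M', time.localtime(time.time() - float(minute) * 60))
    # "N小时前" means "N hours ago"
    if re.match(r'\d+小时前', date):
        hour = re.match(r'(\d+)', date).group(1)
        date = time.strftime('%Y-%m-%d %H:%M', time.localtime(time.time() - float(hour) * 60 * 60))
    # "昨天 HH:MM" means "yesterday HH:MM"
    if re.match('昨天.*', date):
        date = re.match('昨天(.*)', date).group(1).strip()
        # Shift the timestamp back one day before converting it to local time
        date = time.strftime('%Y-%m-%d', time.localtime(time.time() - 24 * 60 * 60)) + ' ' + date
    # A bare month-day such as 07-11 refers to the current year
    if re.match(r'\d{2}-\d{2}', date):
        date = time.strftime('%Y-', time.localtime()) + date + ' 00:00'
    return date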

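We use regular expressions to extract the key numbers and the time library to convert them into standard times.

Take the "X minutes ago" case as an example. The scraped time is stored in the created_at field. We first match it against the regular expression \d+分钟前. If the value matches, we extract the number in it, which gives the number of minutes. We then call the strftime() method of the time module: the first argument is the target time format, and the second is the time to format, obtained by passing a timestamp through time.localtime(). Subtracting the number of minutes multiplied by 60 from the current timestamp gives the timestamp of the post, from which we get the correctly formatted time.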
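
The same conversion can be tried outside the pipeline. The sketch below is a standalone variant, not the project's parse_time(): it uses datetime instead of time and takes the current time as a parameter (now, an assumption made here so that the result is reproducible).

```python
import re
from datetime import datetime, timedelta


def normalize_time(text, now):
    """Convert Weibo's relative time strings to 'YYYY-MM-DD HH:MM'."""
    fmt = '%Y-%m-%d %H:%M'
    if text == '刚刚':  # "just now"
        return now.strftime(fmt)
    match = re.match(r'(\d+)分钟前', text)  # "N minutes ago"
    if match:
        return (now - timedelta(minutes=int(match.group(1)))).strftime(fmt)
    match = re.match(r'(\d+)小时前', text)  # "N hours ago"
    if match:
        return (now - timedelta(hours=int(match.group(1)))).strftime(fmt)
    match = re.match(r'昨天\s*(\d{2}:\d{2})', text)  # "yesterday HH:MM"
    if match:
        return (now - timedelta(days=1)).strftime('%Y-%m-%d') + ' ' + match.group(1)
    if re.fullmatch(r'\d{2}-\d{2}', text):  # "MM-DD" in the current year
        return '{}-{} 00:00'.format(now.year, text)
    return text  # already a full timestamp, or an unknown format


now = datetime(2017, 7, 11, 17, 27)
five_minutes_ago = normalize_time('5分钟前', now)
yesterday = normalize_time('昨天 08:30', now)
```

Passing now explicitly also makes the function easy to unit test, since the expected output no longer depends on when the test runs.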

The Pipeline can then process items as follows:

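class WeiboPipeline():
    # parse_time() shown above is assumed to be defined as a method of this class
    def process_item(self, item, spider):
        if isinstance(item, WeiboItem):
            if item.get('created_at'):
                item['created_at'] = item['created_at'].strip()
                item['created_at'] = self.parse_time(item.get('created_at'))
        # Return the item so that later pipelines receive it
        return item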

In the Spider we did not assign the crawled_at field, which records the crawl time. We can set it uniformly to the current time, as shown below:

class TimePipeline():
    def process_item(self, item, spider):
        if isinstance(item, UserItem) or isinstance(item, WeiboItem):
            now = time.strftime('%Y-%m-%d %H:%M', time.localtime())
            item['crawled_at'] = now
        return item

Here we check whether the Item is a UserItem or a WeiboItem and, if so, set its crawled_at field to the current time.

With these two Pipelines the data cleaning is done. It consists mainly of time conversion.
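
Scrapy passes each item through the enabled pipelines in order, handing the return value of one process_item() to the next. The sketch below imitates that hand-off with plain classes (StripPipeline, StampPipeline and run_pipelines are hypothetical stand-ins, not Scrapy APIs) to show why each pipeline should end with return item.

```python
class StripPipeline:
    def process_item(self, item, spider):
        if item.get('created_at'):
            item['created_at'] = item['created_at'].strip()
        return item  # hand the item on to the next pipeline


class StampPipeline:
    def process_item(self, item, spider):
        item['crawled_at'] = '2017-07-11 17:27'  # fixed value for the demo
        return item


def run_pipelines(item, pipelines, spider=None):
    """Feed the item through each pipeline in turn."""
    for pipeline in pipelines:
        item = pipeline.process_item(item, spider)
    return item


result = run_pipelines({'created_at': ' 07-11 '}, [StripPipeline(), StampPipeline()])
```

If StripPipeline omitted its return statement, StampPipeline would receive None in this model and fail when assigning the field.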

9. Data Storage

Once the data is cleaned, we save it to MongoDB. We implement a MongoPipeline class for this, as shown below:

import pymongo


class MongoPipeline(object):
    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DATABASE')
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]
        # Index the id field of both collections to speed up lookups during updates
        self.db[UserItem.collection].create_index([('id', pymongo.ASCENDING)])
        self.db[WeiboItem.collection].create_index([('id', pymongo.ASCENDING)])

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        # Collection.update() is the legacy PyMongo call and was removed in PyMongo 4,
        # where update_one(filter, update, upsert=True) takes its place.
        if isinstance(item, UserItem) or isinstance(item, WeiboItem):
            # Update the matching document, or insert it if it does not exist
            self.db[item.collection].update({'id': item.get('id')}, {'$set': item}, True)
        if isinstance(item, UserRelationItem):
            # Append follows and fans without creating duplicates
            self.db[item.collection].update(
                {'id': item.get('id')},
                {'$addToSet':
                    {
                        'follows': {'$each': item['follows']},
                        'fans': {'$each': item['fans']}
                    }
                }, True)
        return item

This MongoPipeline differs from the ones we wrote earlier in the following ways.

First, the open_spider() method adds indexes to the Collections. The Collections of both Items get an index on the id field. Because this is a large-scale crawl that involves updating existing records, indexing each Collection greatly improves lookup efficiency.

Second, process_item() stores data with the update() method. The first argument is the query condition and the second is the scraped Item. We use the $set operator so that duplicate data updates the existing record without deleting fields that already exist. Without $set, the stored document would be replaced by the item outright, which could wipe existing fields such as the follow and fan lists. The third argument is set to True, meaning the data is inserted if it does not exist. The result is update-if-present, insert-if-absent, which gives us deduplication.

Third, for the user's follow and fan lists we use a new operator, $addToSet, which inserts data into a list-type field while removing duplicates. Its value names the fields to operate on. The $each operator iterates over the list to be inserted so that each follow or fan entry is added to the specified field one at a time. For more on this operator, see the official MongoDB documentation: https://docs.mongodb.com/manual/reference/operator/update/addToSet/.
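
To see what these operators do to a stored document, the sketch below imitates them on an in-memory dictionary. It is a simplified model for illustration only, not how MongoDB or PyMongo is implemented, and the function names upsert_set and add_to_set_each are made up. With PyMongo 3 or later, the corresponding real call would be update_one(filter, update, upsert=True).

```python
store = {}  # stands in for the users collection, keyed by id


def upsert_set(store, doc):
    """Model of {'$set': doc} with upsert: merge fields, keep the others."""
    store.setdefault(doc['id'], {}).update(doc)


def add_to_set_each(store, uid, field, values):
    """Model of {'$addToSet': {field: {'$each': values}}} with upsert."""
    current = store.setdefault(uid, {'id': uid}).setdefault(field, [])
    for value in values:
        if value not in current:  # append only values not already present
            current.append(value)


upsert_set(store, {'id': 1, 'name': 'old name'})
add_to_set_each(store, 1, 'follows', [{'id': 2, 'name': 'a'}])
add_to_set_each(store, 1, 'follows', [{'id': 2, 'name': 'a'}, {'id': 3, 'name': 'b'}])
# A later profile update must not wipe the follows list
upsert_set(store, {'id': 1, 'name': 'new name'})
```

After these calls the document for user 1 has the new name and a follows list with two entries, which mirrors how the Pipeline merges UserItem and UserRelationItem data into one record.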

10. Connecting the Cookies Pool

Sina Weibo's anti-crawling measures are very strong, so we need some countermeasures to complete the crawl successfully.

Requesting Weibo's API endpoints without logging in very easily results in a 403 status code, as we mentioned in the Cookies pool section. So here we implement a Middleware that attaches random Cookies to every Request.

First start the Cookies pool and make sure its API module is running. For example, if it runs locally on port 5000, visiting http://localhost:5000/weibo/random returns a random set of Cookies. The Cookies pool can also be deployed to a remote server, in which case only the URL needs to change.

With the Cookies pool running locally, we implement a Middleware as shown below:

import json
import logging

import requests


class CookiesMiddleware():
    def __init__(self, cookies_url):
        self.logger = logging.getLogger(__name__)
        self.cookies_url = cookies_url

    def get_random_cookies(self):
        # Ask the Cookies pool for a random set of Cookies
        try:
            response = requests.get(self.cookies_url)
            if response.status_code == 200:
                cookies = json.loads(response.text)
                return cookies
        except requests.ConnectionError:
            return False

    def process_request(self, request, spider):
        self.logger.debug('Fetching Cookies')
        cookies = self.get_random_cookies()
        if cookies:
            request.cookies = cookies
            self.logger.debug('Using Cookies ' + json.dumps(cookies))

    @classmethod
    def from_crawler(cls, crawler):
        settings = crawler.settings
        return cls(
            cookies_url=settings.get('COOKIES_URL')
        )

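We first use the from_crawler() method to read the COOKIES_URL variable, which is defined in settings.py and is the endpoint just mentioned. Next comes the get_random_cookies() method, which requests the Cookies pool endpoint and returns the random Cookies it provides. On success it returns the Cookies. If the connection fails it returns False.

Then, in the process_request() method, we assign the random Cookies to the cookies attribute of the request object. Every request is thereby given Cookies.

With this Middleware enabled, every request carries random Cookies. The requests then look like those of a logged-in user, and the 403 status code should largely disappear.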
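
The fetch-and-fallback logic of get_random_cookies() can be exercised without a running Cookies pool by injecting the HTTP call. In the sketch below, fetch is a hypothetical callable returning a (status_code, text) pair, the built-in ConnectionError stands in for requests.ConnectionError, and the function returns None in every failure case so that the caller needs only one truthiness check. This is a variant for illustration, not the project's implementation.

```python
import json


def get_random_cookies(fetch, url):
    """Return a cookies dict from the pool, or None if unavailable."""
    try:
        status, text = fetch(url)
    except ConnectionError:
        return None
    if status != 200:
        return None
    try:
        return json.loads(text)
    except ValueError:  # the pool returned something that is not JSON
        return None


def fake_ok(url):
    return 200, '{"SUB": "abc", "SUHB": "def"}'


def fake_error(url):
    return 500, 'server error'


def fake_down(url):
    raise ConnectionError('pool not running')


url = 'http://localhost:5000/weibo/random'
cookies = get_random_cookies(fake_ok, url)
```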

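11. Connecting the Proxy Pool

Weibo has another anti-crawling measure: when it detects too many requests from the same IP, it returns a 414 status code. When that happens we can switch proxies. For example, if the proxy pool runs locally on port 5555, the address for getting a random usable proxy is http://localhost:5555/random, and requesting it returns one. Next we implement another Middleware, as shown below: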

# Requires `import logging` and `import requests` at the top of middlewares.py.
class ProxyMiddleware():
    def __init__(self, proxy_url):
        self.logger = logging.getLogger(__name__)
        self.proxy_url = proxy_url

    def get_random_proxy(self):
        # Ask the proxy pool for a random usable proxy
        try:
            response = requests.get(self.proxy_url)
            if response.status_code == 200:
                proxy = response.text
                return proxy
        except requests.ConnectionError:
            return False

    def process_request(self, request, spider):
        # Only retried requests go through a proxy
        if request.meta.get('retry_times'):
            proxy = self.get_random_proxy()
            if proxy:
                uri = 'https://{proxy}'.format(proxy=proxy)
                self.logger.debug('Using proxy ' + proxy)
                request.meta['proxy'] = uri

    @classmethod
    def from_crawler(cls, crawler):
        settings = crawler.settings
        return cls(
            proxy_url=settings.get('PROXY_URL')
        )

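In the same way, we implement a get_random_proxy() method that requests the proxy pool endpoint for a random proxy. On success it returns that proxy. If the connection fails it returns False. In the process_request() method we set a proxy field in the meta attribute of the request object, and its value is the proxy.

Note that the proxy is assigned only when retry_times is set, that is, a proxy is used only after the first request has failed, because requests through a proxy are slower. Requests therefore go out directly unless they are retries, which keeps the crawl fast as long as we have not been blocked.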
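
The retry-only rule can be isolated as a small function over the request's meta dictionary. The sketch below assumes, as the Middleware does, that a retried request carries a retry_times entry in its meta (Scrapy's RetryMiddleware sets it on retries). The names apply_proxy and fake_pool, and the proxy address, are made up for illustration.

```python
def apply_proxy(meta, get_proxy):
    """Set meta['proxy'] only for retried requests and return the meta dict."""
    if meta.get('retry_times'):
        proxy = get_proxy()
        if proxy:
            meta['proxy'] = 'https://{proxy}'.format(proxy=proxy)
    return meta


def fake_pool():
    return '127.0.0.1:8888'  # made-up proxy address


first_try = apply_proxy({}, fake_pool)
retry = apply_proxy({'retry_times': 1}, fake_pool)
pool_down = apply_proxy({'retry_times': 2}, lambda: False)
```

A first attempt is left untouched, a retry gets a proxy, and a retry made while the pool is unreachable simply goes out directly again.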

12. Enabling the Middleware

Next we enable the two Middlewares in the configuration file by modifying settings.py as follows:

DOWNLOADER_MIDDLEWARES = {
    'weibo.middlewares.CookiesMiddleware': 554,
    'weibo.middlewares.ProxyMiddleware': 555,
}

Pay attention to the priorities here. As mentioned earlier, Scrapy's default Downloader Middleware settings are as follows:

{
    'scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware': 100,
    'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware': 300,
    'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware': 350,
    'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware': 400,
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': 500,
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': 550,
    'scrapy.downloadermiddlewares.ajaxcrawl.AjaxCrawlMiddleware': 560,
    'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware': 580,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 590,
    'scrapy.downloadermiddlewares.redirect.RedirectMiddleware': 600,
    'scrapy.downloadermiddlewares.cookies.CookiesMiddleware': 700,
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 750,
    'scrapy.downloadermiddlewares.stats.DownloaderStats': 850,
    'scrapy.downloadermiddlewares.httpcache.HttpCacheMiddleware': 900,
}

For our custom CookiesMiddleware to take effect, it must be called before the built-in CookiesMiddleware. The built-in CookiesMiddleware has priority 700, so we just need a number smaller than 700.

For our custom ProxyMiddleware to take effect, it must be called before the built-in HttpProxyMiddleware. The built-in HttpProxyMiddleware has priority 750, so we just need a number smaller than 750.
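
The ordering rule can be checked with a few lines of plain Python. The sketch below merges an excerpt of the default priorities with the two custom entries and sorts them the way Scrapy orders process_request() calls, lowest number first. It only models the ordering and does not involve Scrapy itself.

```python
# Excerpt of the built-in priorities listed above
defaults = {
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': 550,
    'scrapy.downloadermiddlewares.cookies.CookiesMiddleware': 700,
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 750,
}
custom = {
    'weibo.middlewares.CookiesMiddleware': 554,
    'weibo.middlewares.ProxyMiddleware': 555,
}

# process_request() is called in ascending order of priority
merged = {**defaults, **custom}
call_order = sorted(merged, key=merged.get)
```

With 554 and 555, both custom Middlewares handle the request before the built-in Cookies and proxy Middlewares do.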

13. Running

The whole Weibo crawler is now complete. Run the following command to start it:

scrapy crawl weibocn

The output looks like this:

2017-07-11 17:27:34 [urllib3.connectionpool] DEBUG: http://localhost:5000 "GET /weibo/random HTTP/1.1" 200 339
2017-07-11 17:27:34 [weibo.middlewares] DEBUG: Using Cookies {"SCF": "AhzwTr_DxIGjgri_dt46_DoPzUqq-PSupu545JdozdHYJ7HyEb4pD3pe05VpbIpVyY1ciKRRWwUgojiO3jYwlBE.", "_T_WM": "8fe0bc1dad068d09b888d8177f1c1218", "SSOLoginState": "1501496388", "M_WEIBOCN_PARAMS": "uicode%3D20000174", "SUHB": "0tKqV4asxqYl4J", "SUB": "_2A250e3QUDeRhGeBM6VYX8y7NwjiIHXVXhBxcrDV6PUJbkdBeLXjckW2fUT8MWloekO4FCWVlIYJGJdGLnA.."}
2017-07-11 17:27:34 [weibocn] DEBUG:
2017-07-11 17:27:34 [scrapy.core.scraper] DEBUG: Scraped from
{'avatar': 'https://tva4.sinaimg.cn/crop.0.0.180.180.180/67dd74e0jw1e8qgp5bmzyj2050050aa8.jpg',
 'cover': 'https://tva3.sinaimg.cn/crop.0.0.640.640.640/6ce2240djw1e9oaqhwllzj20hs0hsdir.jpg',
 'crawled_at': '2017-07-11 17:27',
 'description': '成长，就是一个不断觉得以前的自己是个傻逼的过程',
 'fans_count': 19202906,
 'follows_count': 1599,
 'gender': 'm',
 'id': 1742566624,
 'name': '思想聚焦',
 'verified': True,
 'verified_reason': '微博知名博主，校导网编辑',
 'verified_type': 0,
 'weibos_count': 58393}

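After the crawler has run for a while, we can look at the data in MongoDB. The scraped data is shown in the figure below.

For user information, we scraped not only the basic profile but also added the follow and fan lists to the follows and fans fields, with duplicates removed. For posts, the time conversion worked as intended, and each post's list of pictures was saved as well.

14. Code for This Section

The code for this section is at https://github.com/Python3WebSpider/Weibo.

15. Conclusion

This section implemented crawling of Sina Weibo users, their fan and follow lists, and their posts, and connected the Cookies pool and proxy pool to handle anti-crawling measures. The crawl currently runs on a single machine. Later we will turn this project into a distributed crawler to further improve crawling efficiency.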

 
   