Editor's note:
This article comes from SegmentFault. It explains what the Scrapy framework is, walks through Scrapy's architecture and data flow, and shows how to install and use the framework. We hope you find it helpful.
Scrapy is an application framework, implemented in Python, for crawling web sites and extracting structured data.
I. Introduction to the Scrapy Framework
Scrapy is an application framework written for crawling web sites and extracting structured data. It can be used in a wide range of programs, including data mining, information processing, and archiving historical data.
It was originally designed for page scraping (more precisely, web scraping), but it can also be used to fetch data returned by APIs (such as Amazon Associates Web Services) or as a general-purpose web crawler.
II. Architecture and Data Flow
The following diagram shows Scrapy's architecture, including its components and an overview of the data flow inside the system (indicated by the green arrows). Each component is briefly introduced below, and the data flow is described in its own section.

1. Components
Scrapy Engine
The engine controls the data flow between all components of the system and triggers events when certain actions occur. See the Data Flow section below for details.
Scheduler
The scheduler receives requests from the engine and enqueues them, so that it can feed them back to the engine later when the engine asks for them.
Downloader
The downloader fetches page data and passes it to the engine, which in turn hands it to the spiders.
Spiders
Spiders are classes written by Scrapy users to parse responses and extract items (i.e. the scraped items) or additional URLs to follow. Each spider handles one specific site (or a few sites). See Spiders for more details.
Item Pipeline
The Item Pipeline processes the items extracted by the spiders. Typical tasks are cleaning, validation, and persistence (for example, storing the item in a database). See Item Pipeline for more details.
Downloader middlewares
Downloader middlewares are specific hooks between the engine and the downloader that process the responses the downloader passes to the engine. They provide a simple mechanism for extending Scrapy by plugging in custom code. See Downloader Middleware for more details.
Spider middlewares
Spider middlewares are specific hooks between the engine and the spiders that process the spiders' input (responses) and output (items and requests). They provide a simple mechanism for extending Scrapy by plugging in custom code. See Spider Middleware for more details.
2. Data flow
The data flow in Scrapy is controlled by the execution engine and goes like this:
1. The engine opens a domain, locates the spider that handles that site, and asks the spider for the first URL(s) to crawl.
2. The engine gets the first URL to crawl from the spider and schedules it in the Scheduler as a Request.
3. The engine asks the scheduler for the next URL to crawl.
4. The scheduler returns the next URL to crawl, and the engine forwards it to the Downloader through the downloader middlewares (request direction).
5. Once the page finishes downloading, the downloader generates a Response for that page and sends it back to the engine through the downloader middlewares (response direction).
6. The engine receives the Response from the downloader and sends it to the spider for processing through the spider middlewares (input direction).
7. The spider processes the Response and returns the scraped Items plus any new Requests (to follow) to the engine.
8. The engine passes the scraped Items (returned by the spider) to the Item Pipeline, and the Requests (returned by the spider) to the scheduler.
9. The process repeats from step 2 until there are no more requests in the scheduler, and the engine then closes the domain.
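As a concrete anchor for steps 7 and 8, here is a minimal, hypothetical spider sketch (the domain, XPath expressions, and field name are made up for illustration); everything yielded from parse() goes back to the engine, which routes items to the Item Pipeline and Requests to the scheduler:
import scrapy

class MinimalSpider(scrapy.Spider):
    # hypothetical spider used only to illustrate the data flow above
    name = "minimal"
    start_urls = ["http://example.com/"]  # steps 1-2: the first URL(s) handed to the engine

    def parse(self, response):  # step 6: the engine delivers the Response here
        # step 7: a scraped item goes back to the engine, then on to the Item Pipeline
        yield {"title": response.xpath("//title/text()").extract_first()}
        # steps 7-8: a follow-up Request goes back to the engine, then to the scheduler
        next_page = response.xpath("//a[@rel='next']/@href").extract_first()
        if next_page:
            yield scrapy.Request(response.urljoin(next_page), callback=self.parse)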
3. Event-driven networking
Scrapy is built on Twisted, an event-driven networking framework, so for the sake of concurrency it is implemented in a non-blocking (i.e. asynchronous) way.
For more about asynchronous programming and Twisted, see the Twisted documentation.
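As a rough illustration of what non-blocking means here (this is plain Twisted, not Scrapy code; the one-second delay and the callback are arbitrary):
from twisted.internet import reactor, task

def done(result):
    print result  # Python 2 print, matching the rest of this article
    reactor.stop()

# deferLater returns a Deferred that fires after 1 second; instead of sleeping,
# the reactor keeps servicing other events while it waits
d = task.deferLater(reactor, 1.0, lambda: "non-blocking result")
d.addCallback(done)
reactor.run()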
III. Four Steps to Build a Spider
Create a project (scrapy startproject xxx): start a new spider project.
Define the targets (edit items.py): specify exactly what you want to scrape.
Write the spider (spiders/xxspider.py): write the spider and start crawling pages.
Store the content (pipelines.py): design a pipeline to store the scraped content.
IV. Installing the Framework
Here we install it with conda:
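A typical command (assuming the conda-forge channel) is:
conda install -c conda-forge scrapy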
Or install it with pip:
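For example:
pip install Scrapy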
Check the installation:
spider scrapy -h
Scrapy 1.4.0 - no active project

Usage:
  scrapy <command> [options] [args]

Available commands:
  bench         Run quick benchmark test
  fetch         Fetch a URL using the Scrapy downloader
  genspider     Generate new spider using pre-defined templates
  runspider     Run a self-contained spider (without creating a project)
  settings      Get settings values
  shell         Interactive scraping console
  startproject  Create new project
  version       Print Scrapy version
  view          Open URL in browser, as seen by Scrapy

  [ more ]      More commands available when run from project directory

Use "scrapy <command> -h" to see more info about a command
1. Create a project
spider scrapy startproject SF
New Scrapy project 'SF', using template directory
'/Users/kaiyiwang/anaconda2/lib/python2.7/site-packages/scrapy/templates/project', created in:
    /Users/kaiyiwang/Code/python/spider/SF

You can start your first spider with:
    cd SF
    scrapy genspider example example.com
spider
You can view the project structure with the tree command:
SF tree
.
├── SF
│   ├── __init__.py
│   ├── items.py
│   ├── middlewares.py
│   ├── pipelines.py
│   ├── settings.py
│   └── spiders
│       └── __init__.py
└── scrapy.cfg

2. Create a spider template in the spiders directory
spiders scrapy genspider sf "https://segmentfault.com"
Created spider 'sf' using template 'basic' in module:
  SF.spiders.sf
spiders
This generates the spider file sf.py:
# -*- coding: utf-8 -*-
import scrapy
from SF.items import SfItem


class SfSpider(scrapy.Spider):
    name = 'sf'
    allowed_domains = ['segmentfault.com']  # domain only, no scheme
    start_urls = ['https://segmentfault.com/']

    def parse(self, response):
        # print response.body
        # pass
        node_list = response.xpath("//h2[@class='title']")
        # items = []  # could be used to collect all the items in one list
        for node in node_list:
            # create an item object to hold the extracted data
            item = SfItem()
            # .extract() converts the XPath result into a list of Unicode strings
            title = node.xpath("./a/text()").extract()
            item['title'] = title[0]
            # yield the scraped item to the pipeline; execution then resumes here
            yield item
            # return item
            # return scrapy.Request(url)
            # items.append(item)
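The spider imports SfItem from SF.items, which the article does not show; a minimal SF/items.py matching the item['title'] assignment above would look roughly like this (the single title field is the only assumption):
# -*- coding: utf-8 -*-
import scrapy

class SfItem(scrapy.Item):
    # the only field used by SfSpider above
    title = scrapy.Field()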
Commands:
# check that the spider is well-formed; sf is the spider's name
scrapy check sf

# run the spider
scrapy crawl sf
3. Item Pipeline
After an item has been collected by a spider, it is passed to the Item Pipeline, and the Item Pipeline components process it in the order they are defined.
Each Item Pipeline is a Python class implementing a few simple methods, deciding for instance whether an item is dropped or stored. Some typical uses of an item pipeline are:
Validating the scraped data (checking that an item contains certain fields, e.g. a name field)
Checking for duplicates and dropping them (see the sketch after the PricePipeline example below)
Saving the scraped result to a file or a database (persistence)
Writing an item pipeline
Writing an item pipeline is simple: an item pipeline component is a standalone Python class in which the process_item() method must be implemented.
from scrapy.exceptions import DropItem


class PricePipeline(object):
    vat_factor = 1.15

    def process_item(self, item, spider):
        if item['price']:
            if item['price_excludes_vat']:
                item['price'] = item['price'] * self.vat_factor
            return item
        else:
            raise DropItem("Missing price in %s" % item)
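The duplicate check mentioned earlier works the same way; the sketch below is a hypothetical pipeline (the positionLink key is only an example field) that remembers what it has seen and drops repeats:
from scrapy.exceptions import DropItem

class DuplicatesPipeline(object):
    def __init__(self):
        self.seen = set()

    def process_item(self, item, spider):
        # use one field as the deduplication key; positionLink is just an example
        key = item['positionLink']
        if key in self.seen:
            raise DropItem("Duplicate item found: %s" % item)
        self.seen.add(key)
        return item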
4. Selectors
When scraping web pages, the most common task is extracting data from the HTML source.
Selector has four basic methods, of which xpath() is the most commonly used:
xpath(): takes an XPath expression and returns a selector list of all nodes matching the expression.
extract(): serializes the matched nodes to Unicode strings and returns them as a list.
css(): takes a CSS expression and returns a selector list of all matching nodes, using the same CSS selector syntax as BeautifulSoup4.
re(): extracts data using a regular expression and returns a list of Unicode strings.
Scrapy has its own mechanism for extracting data. These are called selectors, because they "select" parts of an HTML document specified by XPath or CSS expressions.
XPath is a language for selecting nodes in XML documents, and it can also be used with HTML. CSS is a language for styling HTML documents; its selectors are tied to the styles of specific HTML elements.
Scrapy selectors are built on top of the lxml library, which means they are very close to lxml in speed and parsing accuracy.
Examples of XPath expressions:
/html/head/title: selects the <title> element inside the <head> tag of the <html> document
/html/head/title/text(): selects the text of the <title> element mentioned above
//td: selects all <td> elements
//div[@class="mine"]: selects all div elements that have the attribute class="mine"
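A quick way to try these methods is the scrapy shell, or constructing a Selector directly; the snippet below is a small sketch using a made-up HTML string:
from scrapy.selector import Selector

html = ("<html><head><title>Hello</title></head>"
        "<body><div class='mine'><table><tr><td>42</td></tr></table></div></body></html>")
sel = Selector(text=html)

print sel.xpath('/html/head/title/text()').extract()  # [u'Hello']
print sel.css('div.mine td::text').extract()          # [u'42']
print sel.xpath('//td/text()').re(r'\d+')             # [u'42']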
V. Crawling Job Listings
1. Crawling Tencent job postings
URL to crawl: http://hr.tencent.com/positio...
1.1 Create the project
> scrapy startproject Tencent
You can start your first spider with:
    cd Tencent
    scrapy genspider example example.com

Elements of the page we need to scrape:

We want to extract the following information:
Position name: positionName
Position link: positionLink
Position type: positionType
Number of openings: positionNumber
Work location: workLocation
Publish date: publishTime
Define the fields to scrape in items.py:
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy


# define the item fields
class TencentItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    # position name
    positionName = scrapy.Field()
    # position link
    positionLink = scrapy.Field()
    # position type
    positionType = scrapy.Field()
    # number of openings
    positionNumber = scrapy.Field()
    # work location
    workLocation = scrapy.Field()
    # publish date
    publishTime = scrapy.Field()
1.2 Write the spider
Create it with the genspider command:
Tencent scrapy genspider tencent "tencent.com"
Created spider 'tencent' using template 'basic' in module:
  Tencent.spiders.tencent
The generated spider is at spiders/tencent.py in the current directory:
Tencent tree
.
├── __init__.py
├── __init__.pyc
├── items.py
├── middlewares.py
├── pipelines.py
├── settings.py
├── settings.pyc
└── spiders
    ├── __init__.py
    ├── __init__.pyc
    └── tencent.py
Let's look at the generated initial file tencent.py:
# -*- coding: utf-8 -*-
import scrapy


class TencentSpider(scrapy.Spider):
    name = 'tencent'
    allowed_domains = ['tencent.com']
    start_urls = ['http://tencent.com/']

    def parse(self, response):
        pass
Modify the initial tencent.py:
# -*- coding: utf-8 -*-
import scrapy
from Tencent.items import TencentItem


class TencentSpider(scrapy.Spider):
    name = 'tencent'
    allowed_domains = ['tencent.com']
    baseURL = "http://hr.tencent.com/position.php?&start="
    offset = 0  # paging offset
    start_urls = [baseURL + str(offset)]

    def parse(self, response):
        # extract the rows from the response
        # node_list = response.xpath("//tr[@class='even'] or //tr[@class='odd']")
        node_list = response.xpath("//tr[@class='even'] | //tr[@class='odd']")
        for node in node_list:
            item = TencentItem()  # item object that holds the fields
            # text content: take the first element [0] of the extracted list and
            # convert the Unicode string to utf-8
            item['positionName'] = node.xpath("./td[1]/a/text()").extract()[0].encode("utf-8")
            # link attribute
            item['positionLink'] = node.xpath("./td[1]/a/@href").extract()[0].encode("utf-8")
            item['positionType'] = node.xpath("./td[2]/text()").extract()[0].encode("utf-8")
            item['positionNumber'] = node.xpath("./td[3]/text()").extract()[0].encode("utf-8")
            item['workLocation'] = node.xpath("./td[4]/text()").extract()[0].encode("utf-8")
            item['publishTime'] = node.xpath("./td[5]/text()").extract()[0].encode("utf-8")
            # hand the item to the pipeline
            yield item

        # crawl the first 2000 records for now (the offset must be converted to a string)
        if self.offset < 2000:
            self.offset += 10
            url = self.baseURL + str(self.offset)
            yield scrapy.Request(url, callback=self.parse)
        # pass
Write the pipeline file pipelines.py:
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html

import json


class TencentPipeline(object):
    def __init__(self):
        self.f = open("tencent.json", "w")

    # every item goes through this same pipeline
    def process_item(self, item, spider):
        content = json.dumps(dict(item), ensure_ascii=False) + ",\n"
        self.f.write(content)
        return item

    def close_spider(self, spider):
        self.f.close()
Once the pipeline is written, enable it in settings.py (the integer is the pipeline's order; pipelines with lower values run first):
# Configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'Tencent.pipelines.TencentPipeline': 300,
}
ÔËÐУº
> scrapy crawl tencent
  File "/Users/kaiyiwang/Code/python/spider/Tencent/Tencent/spiders/tencent.py", line 21, in parse
    item['positionName'] = node.xpath("./td[1]/a/text()").extract()[0].encode("utf-8")
IndexError: list index out of range
The XPath in the response handling was written incorrectly; it should be written like this:
# extract the rows from the response
# node_list = response.xpath("//tr[@class='even'] or //tr[@class='odd']")
node_list = response.xpath("//tr[@class='even'] | //tr[@class='odd']")
Then run the command again.
The resulting tencent.json file:
{"positionName":
"23673-²Æ¾ÔËÓªÖÐÐÄÈȵãÔËÓª×é±à¼", "publishTime":
"2017-12-02", "positionLink":
"position_detail.php?id=32718&keywords=&tid=0&lid=0",
"positionType": "ÄÚÈݱà¼Àà", "workLocation":
"±±¾©", "positionNumber": "1"},
{"positionName": "MIG03-ÌÚѶµØÍ¼¸ß¼¶Ëã·¨ÆÀ²â¹¤³Ìʦ£¨±±¾©£©",
"publishTime": "2017-12-02",
"positionLink": "position_detail.php?id=30276&keywords=&tid=0&lid=0",
"positionType": "¼¼ÊõÀà", "workLocation":
"±±¾©", "positionNumber": "1"},
{"positionName": "MIG10-΢»ØÊÕÇþµÀ²úÆ·ÔËÓª¾Àí£¨ÉîÛÚ£©",
"publishTime": "2017-12-02",
"positionLink": "position_detail.php?id=32720&keywords=&tid=0&lid=0",
"positionType": "²úÆ·/ÏîÄ¿Àà",
"workLocation": "ÉîÛÚ", "positionNumber":
"1"},
{"positionName": "MIG03-iOS²âÊÔ¿ª·¢¹¤³Ìʦ£¨±±¾©£©",
"publishTime": "2017-12-02",
"positionLink": "position_detail.php?id=32715&keywords=&tid=0&lid=0",
"positionType": "¼¼ÊõÀà", "workLocation":
"±±¾©", "positionNumber": "1"},
{"positionName": "19332-¸ß¼¶PHP¿ª·¢¹¤³Ìʦ£¨ÉϺ££©",
"publishTime": "2017-12-02",
"positionLink": "position_detail.php?id=31967&keywords=&tid=0&lid=0",
"positionType": "¼¼ÊõÀà", "workLocation":
"ÉϺ£", "positionNumber": "2"} |
1.3 Crawling by following the next page
Above we paged through the data using a fixed total, but that ignores the fact that the data changes every day, so the total number of pages must not be hard-coded. How do we know when we have crawled everything? It is actually simple: we follow the "next page" link, and once there is no next page, the data has all been crawled.

Looking at the "next page" control on the last page:

On the last page the "next page" button is gray and its link carries the class='noactive' attribute, so we can use that to decide whether we have reached the last page.
# hard-coded total: crawl only the first 100 records
"""
if self.offset < 100:
    self.offset += 10
    url = self.baseURL + str(self.offset)
    yield scrapy.Request(url, callback=self.parse)
"""

# crawl by following the next page
if len(response.xpath("//a[@class='noactive' and @id='next']")) == 0:
    url = response.xpath("//a[@id='next']/@href").extract()[0]
    yield scrapy.Request("http://hr.tencent.com/" + url, callback=self.parse)
The modified tencent.py:
# -*- coding: utf-8 -*-
import scrapy
from Tencent.items import TencentItem


class TencentSpider(scrapy.Spider):
    # spider name
    name = 'tencent'
    # domains the spider is allowed to crawl
    allowed_domains = ['tencent.com']
    # 1. base URL to build the page URLs from
    baseURL = "http://hr.tencent.com/position.php?&start="
    # offset appended to the base URL
    offset = 0
    # list of URLs fetched when the spider starts
    start_urls = [baseURL + str(offset)]

    # handle each response
    def parse(self, response):
        # extract the rows of the current response
        node_list = response.xpath("//tr[@class='even'] | //tr[@class='odd']")
        for node in node_list:
            # build an item object to hold the data
            item = TencentItem()
            # text content: take the first element [0] of the extracted list and
            # convert the Unicode string to utf-8
            print node.xpath("./td[1]/a/text()").extract()
            item['positionName'] = node.xpath("./td[1]/a/text()").extract()[0].encode("utf-8")
            # link attribute
            item['positionLink'] = node.xpath("./td[1]/a/@href").extract()[0].encode("utf-8")
            # some cells may be empty, so check first
            if len(node.xpath("./td[2]/text()")):
                item['positionType'] = node.xpath("./td[2]/text()").extract()[0].encode("utf-8")
            else:
                item['positionType'] = ""
            item['positionNumber'] = node.xpath("./td[3]/text()").extract()[0].encode("utf-8")
            item['workLocation'] = node.xpath("./td[4]/text()").extract()[0].encode("utf-8")
            item['publishTime'] = node.xpath("./td[5]/text()").extract()[0].encode("utf-8")
            # yield matters here: it hands the item to the pipeline and then resumes
            # execution; a return would exit the whole function
            yield item

        # Approach 1: build the URL by hand. Useful when the page has no clickable
        # "next" link and the URL has to be constructed.
        """
        if self.offset < 100:
            self.offset += 10
            url = self.baseURL + str(self.offset)
            yield scrapy.Request(url, callback=self.parse)
        """

        # Approach 2: take the next-page link straight from the response and keep
        # requesting until there is no next page left.
        if len(response.xpath("//a[@class='noactive' and @id='next']")) == 0:
            url = response.xpath("//a[@id='next']/@href").extract()[0]
            yield scrapy.Request("http://hr.tencent.com/" + url, callback=self.parse)
        # pass
OK, by following the next page we have successfully crawled all of the job listing data.
1.4 Summary
Spider workflow:
1. Create the project: scrapy startproject XXX
2. Generate the spider: scrapy genspider xxx "http://www.xxx.com"
3. Write items.py and define the data you want to extract.
4. Write spiders/xxx.py: handle requests and responses and extract the data (yield item).
5. Write pipelines.py: process the item data returned by the spider, e.g. persist it locally to a file or a database table.
6. Edit settings.py: enable the pipeline via ITEM_PIPELINES and adjust any other related settings.
7. Run the spider: scrapy crawl xxx
Sometimes the site being crawled imposes many restrictions, so we can add request headers. Scrapy provides a convenient place to configure them in settings.py, where we can enable:
# Crawl responsibly by identifying yourself (and your website) on the user-agent
# USER_AGENT = 'Tencent (+http://www.yourdomain.com)'
USER_AGENT = ("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) "
              "AppleWebKit/537.36 (KHTML, like Gecko) "
              "Chrome/62.0.3202.94 Safari/537.36")

# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
}
Scrapy's sweet spot is crawling static pages, where its performance is excellent; but if all you need is dynamic JSON data, it is usually unnecessary.
|