Learning the Python Scrapy Crawler Framework
 
 2019-7-11
 
±à¼­ÍƼö:
±¾ÎÄÀ´×ÔÓÚsegmentfault,ÎÄÕÂÖ÷Òª½éÉÜÁËScrapy¿ò¼ÜÊÇʲô£¬ScrapyµÄ¼Ü¹¹Á÷³Ìͼ,ÒÔ¼°Scrapy¿ò¼ÜµÄ°²×°£¬Ï£Íû¶ÔÄúÄÜÓÐËù°ïÖú¡£

Scrapy is an application framework, written in Python, for crawling websites and extracting structured data.

I. Scrapy Framework Overview

Scrapy is an application framework written for crawling websites and extracting structured data. It can be used in a wide range of programs, including data mining, information processing, and archiving historical data.

It was originally designed for page scraping (more precisely, web scraping), but it can also be used to retrieve data returned by APIs (such as Amazon Associates Web Services) or to build general-purpose web crawlers.

II. Architecture and Data Flow

The official architecture diagram shows Scrapy's components and the data flow among them (indicated by green arrows). Each component is briefly introduced below, with pointers to more detailed documentation; the data flow is described afterwards.

1. Components

Scrapy Engine

The engine controls the flow of data between all components of the system and triggers events when certain actions occur. See the Data Flow section below for details.

Scheduler

The scheduler receives requests from the engine and enqueues them, so that it can feed them back to the engine later when the engine asks for them.

Downloader

The downloader fetches page data and hands it to the engine, which in turn passes it on to the spiders.

Spiders

Spiders are classes written by Scrapy users to parse responses and extract items (the scraped data) or additional URLs to follow. Each spider handles one specific site (or a few). See Spiders for more.

Item Pipeline

The Item Pipeline processes the items extracted by the spiders. Typical tasks are cleansing, validation, and persistence (such as storing the item in a database). See Item Pipeline for more.

ÏÂÔØÆ÷Öмä¼þ(Downloader middlewares)

ÏÂÔØÆ÷Öмä¼þÊÇÔÚÒýÇæ¼°ÏÂÔØÆ÷Ö®¼äµÄÌØ¶¨¹³×Ó(specific hook)£¬´¦ÀíDownloader´«µÝ¸øÒýÇæµÄresponse¡£ ÆäÌṩÁËÒ»¸ö¼ò±ãµÄ»úÖÆ£¬Í¨¹ý²åÈë×Ô¶¨Òå´úÂëÀ´À©Õ¹Scrapy¹¦ÄÜ¡£¸ü¶àÄÚÈÝÇë¿´ ÏÂÔØÆ÷Öмä¼þ(Downloader Middleware) ¡£

SpiderÖмä¼þ(Spider middlewares)

SpiderÖмä¼þÊÇÔÚÒýÇæ¼°SpiderÖ®¼äµÄÌØ¶¨¹³×Ó(specific hook)£¬´¦ÀíspiderµÄÊäÈë(response)ºÍÊä³ö(items¼°requests)¡£ ÆäÌṩÁËÒ»¸ö¼ò±ãµÄ»úÖÆ£¬Í¨¹ý²åÈë×Ô¶¨Òå´úÂëÀ´À©Õ¹Scrapy¹¦ÄÜ¡£¸ü¶àÄÚÈÝÇë¿´ SpiderÖмä¼þ(Middleware) ¡£

2. Data flow

ScrapyÖеÄÊý¾ÝÁ÷ÓÉÖ´ÐÐÒýÇæ¿ØÖÆ£¬Æä¹ý³ÌÈçÏÂ:

The engine opens a website (open a domain), locates the Spider that handles it, and asks that spider for the first URL(s) to crawl.

The engine obtains the first URL to crawl from the Spider and schedules it as a Request with the Scheduler.

The engine asks the scheduler for the next URL to crawl.

The scheduler returns the next URL to the engine, and the engine forwards it to the Downloader through the downloader middlewares (request direction).

Once the page finishes downloading, the downloader generates a Response for it and sends it back to the engine through the downloader middlewares (response direction).

The engine receives the Response from the downloader and sends it to the Spider for processing through the spider middlewares (input direction).

The Spider processes the Response and returns scraped Items, plus new Requests (to follow), to the engine.

The engine passes the scraped Items (returned by the Spider) to the Item Pipeline and the Requests (returned by the Spider) to the scheduler.

The process repeats (from step 2) until there are no more requests in the scheduler, at which point the engine closes the website.
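
This cycle maps directly onto spider code. A minimal sketch (the site and XPath expressions are placeholders), where each yield corresponds to a step in the flow:

import scrapy

class DemoSpider(scrapy.Spider):
    name = 'demo'
    # Step 1: the engine asks this spider for the first URL(s) to crawl.
    start_urls = ['http://example.com/page/1']

    def parse(self, response):
        # Steps 4-6: the downloaded Response is delivered here via the engine.
        for title in response.xpath('//h2/text()').extract():
            # Scraped items are routed to the Item Pipeline (step 8).
            yield {'title': title}
        next_page = response.xpath("//a[@rel='next']/@href").extract_first()
        if next_page:
            # New Requests go back to the scheduler, repeating the cycle.
            yield scrapy.Request(response.urljoin(next_page), callback=self.parse)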

3. Event-driven networking

Scrapy is written on top of Twisted, an event-driven networking framework, so for concurrency it is implemented in a non-blocking (i.e. asynchronous) way.

For more about asynchronous programming and Twisted, see the Twisted documentation.
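
To give a flavor of this non-blocking style, here is a tiny self-contained Twisted sketch; fake_fetch stands in for real network I/O and is purely illustrative:

from twisted.internet import defer, reactor

def fake_fetch(url):
    # Return immediately with a Deferred; the "result" arrives later
    # without blocking the event loop (simulated here with a timer).
    d = defer.Deferred()
    reactor.callLater(1.0, d.callback, 'body of %s' % url)
    return d

def on_body(body):
    print(body)
    reactor.stop()

fake_fetch('http://example.com').addCallback(on_body)
reactor.run()  # the reactor (event loop) drives all callbacks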

III. Building a Crawler in 4 Steps

н¨ÏîÄ¿£¨scrapy startproject xxx£©:н¨Ò»¸öеÄÅÀ³æÏîÄ¿

Define the targets (edit items.py): specify the data you want to scrape.

Write the spider (spiders/xxspider.py): write the spider code and start crawling pages.

Store the content (pipelines.py): design pipelines to store the scraped content.

IV. Installing the Framework

Here we install with conda:

conda install scrapy

»òÕßʹÓà pip ½øÐа²×°£º

pip install scrapy

Verify the installation:

$ scrapy -h
Scrapy 1.4.0 - no active project

Usage:
  scrapy <command> [options] [args]

Available commands:
  bench         Run quick benchmark test
  fetch         Fetch a URL using the Scrapy downloader
  genspider     Generate new spider using pre-defined templates
  runspider     Run a self-contained spider (without creating a project)
  settings      Get settings values
  shell         Interactive scraping console
  startproject  Create new project
  version       Print Scrapy version
  view          Open URL in browser, as seen by Scrapy

  [ more ]      More commands available when run from project directory

Use "scrapy <command> -h" to see more info about a command

1. Create the project

$ scrapy startproject SF
New Scrapy project 'SF', using template directory '/Users/kaiyiwang/anaconda2/lib/python2.7/site-packages/scrapy/templates/project', created in:
    /Users/kaiyiwang/Code/python/spider/SF

You can start your first spider with:
    cd SF
    scrapy genspider example example.com

ʹÓà tree ÃüÁî¿ÉÒԲ鿴ÏîÄ¿½á¹¹£º

$ tree
.
├── SF
│   ├── __init__.py
│   ├── items.py
│   ├── middlewares.py
│   ├── pipelines.py
│   ├── settings.py
│   └── spiders
│       └── __init__.py
└── scrapy.cfg

2. Create a spider from a template in the spiders directory

$ scrapy genspider sf "https://segmentfault.com"
Created spider 'sf' using template 'basic' in module:
  SF.spiders.sf

This generates a spider file sf.py; after filling in the parse() logic, it looks like this:

# -*- coding: utf-8 -*-
import scrapy
from SF.items import SfItem


class SfSpider(scrapy.Spider):
    name = 'sf'
    allowed_domains = ['segmentfault.com']
    start_urls = ['https://segmentfault.com/']

    def parse(self, response):
        # print response.body
        node_list = response.xpath("//h2[@class='title']")

        # items = []  # could instead collect all items in a list
        for node in node_list:
            # Create an item object to hold the extracted fields
            item = SfItem()
            # .extract() converts the xpath selection to a list of Unicode strings
            title = node.xpath("./a/text()").extract()

            item['title'] = title[0]

            # Yield the scraped item to the pipeline for processing,
            # then resume execution here to continue the loop
            yield item
            # return item
            # return scrapy.Request(url)
            # items.append(item)

Commands:

# Check that the spider is well-formed; sf is the spider's name
scrapy check sf

# Run the spider
scrapy crawl sf
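
Before running the full crawl, it is often handy to test XPath expressions interactively with the scrapy shell (output abbreviated):

$ scrapy shell "https://segmentfault.com"
>>> response.xpath("//h2[@class='title']/a/text()").extract()
[u'...', u'...']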

3. Item pipeline

After an item is collected in a Spider, it is passed to the Item Pipeline, and the pipeline components process it in the order in which they are defined.

ÿ¸ö Item Pipeline ¶¼ÊÇʵÏÖÁ˼òµ¥·½·¨µÄPython À࣬±ÈÈç¾ö¶¨´ËItemÊǶªÆú»ò´æ´¢£¬ÒÔÏÂÊÇ item pipeline µÄһЩµäÐÍÓ¦Óãº

Validating scraped data (checking that an item contains certain fields, such as a name field)

Checking for duplicates (and dropping them; see the sketch after this list)

Saving scraped results to a file or a database (data persistence)
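
For example, a duplicate filter might look like the following minimal sketch, adapted from the Scrapy documentation; it assumes each item carries a unique 'id' field:

from scrapy.exceptions import DropItem

class DuplicatesPipeline(object):

    def __init__(self):
        self.ids_seen = set()

    def process_item(self, item, spider):
        # Drop the item if its id was already seen; otherwise remember it
        if item['id'] in self.ids_seen:
            raise DropItem("Duplicate item found: %s" % item)
        self.ids_seen.add(item['id'])
        return item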

±àд item pipeline

±àд item pipeline ºÜ¼òµ¥£¬item pipeline ×é¼þÊÇÒ»¸ö¶ÀÁ¢µÄPythonÀ࣬ÆäÖÐ process_item()·½·¨±ØÐëʵÏÖ¡£

from scrapy.exceptions import DropItem

class PricePipeline(object):

    vat_factor = 1.15

    def process_item(self, item, spider):
        if item['price']:
            if item['price_excludes_vat']:
                item['price'] = item['price'] * self.vat_factor
            return item
        else:
            raise DropItem("Missing price in %s" % item)

4. Selectors

When scraping web pages, the most common task is extracting data from the HTML source.

A Selector has four basic methods; the most commonly used is xpath():

xpath(): takes an XPath expression and returns a SelectorList of all the nodes that match the expression.

extract(): serializes the matched nodes and returns them as a list of Unicode strings.

css(): takes a CSS expression and returns a SelectorList of all matching nodes, using standard CSS selector syntax (the same kind of selectors BeautifulSoup4 understands).

re(): extracts data with a regular expression and returns a list of Unicode strings.

Scrapy has its own mechanism for extracting data. These are called selectors, because they "select" certain parts of an HTML document specified by XPath or CSS expressions.

XPath is a language for selecting nodes in XML documents, and it also works on HTML. CSS is a language for styling HTML documents; its selectors associate styles with specific HTML elements.

Scrapy's selectors are built on top of the lxml library, which means they are very similar to lxml in speed and parsing accuracy.

Examples of XPath expressions:

/html/head/title: selects the <title> element inside the <head> of the HTML document
/html/head/title/text(): selects the text of the <title> element above
//td: selects all <td> elements
//div[@class="mine"]: selects all div elements that have the attribute class="mine"

 

V. Crawling Job Postings

1. Crawling Tencent job postings

The address to crawl: http://hr.tencent.com/positio...

1.1 Create the project

> scrapy startproject Tencent

You can start your first spider with:
    cd Tencent
    scrapy genspider example example.com

From each row of the job-listing page, we need to scrape the following information:

ְλÃû£ºpositionName

ְλÁ´½Ó£ºpositionLink

ְλÀàÐÍ£ºpositionType

ְλÈËÊý£ºpositionNumber

¹¤×÷µØµã£ºworkLocation

·¢²¼Ê±µã£ºpublishTime

Define the fields to scrape in the items.py file:

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy


# Field definitions
class TencentItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()

    # position name
    positionName = scrapy.Field()

    # position link
    positionLink = scrapy.Field()

    # position type
    positionType = scrapy.Field()

    # number of openings
    positionNumber = scrapy.Field()

    # work location
    workLocation = scrapy.Field()

    # publish date
    publishTime = scrapy.Field()

1.2 дspiderÅÀ³æ

ʹÓÃÃüÁî´´½¨

$ scrapy genspider tencent "tencent.com"
Created spider 'tencent' using template 'basic' in module:
  Tencent.spiders.tencent

The generated spider lives at spiders/tencent.py under the current directory:

$ tree
.
├── __init__.py
├── __init__.pyc
├── items.py
├── middlewares.py
├── pipelines.py
├── settings.py
├── settings.pyc
└── spiders
    ├── __init__.py
    ├── __init__.pyc
    └── tencent.py

Let's look at the generated starter file tencent.py:

# -*- coding: utf-8 -*-
import scrapy


class TencentSpider(scrapy.Spider):
    name = 'tencent'
    allowed_domains = ['tencent.com']
    start_urls = ['http://tencent.com/']

    def parse(self, response):
        pass

¶Ô³õʶÎļþtencent.py½øÐÐÐ޸ģº

# -*- coding: utf-8 -*-
import scrapy
from Tencent.items import TencentItem


class TencentSpider(scrapy.Spider):
    name = 'tencent'
    allowed_domains = ['tencent.com']
    baseURL = "http://hr.tencent.com/position.php?&start="
    offset = 0  # paging offset
    start_urls = [baseURL + str(offset)]

    def parse(self, response):

        # select the rows of the listing table
        # node_list = response.xpath("//tr[@class='even'] or //tr[@class='odd']")
        node_list = response.xpath("//tr[@class='even'] | //tr[@class='odd']")

        for node in node_list:
            item = TencentItem()  # the item class defined in items.py

            # Text content: take the first element of the list [0]
            # and convert the extracted Unicode string to utf-8
            item['positionName'] = node.xpath("./td[1]/a/text()").extract()[0].encode("utf-8")
            item['positionLink'] = node.xpath("./td[1]/a/@href").extract()[0].encode("utf-8")  # link attribute
            item['positionType'] = node.xpath("./td[2]/text()").extract()[0].encode("utf-8")
            item['positionNumber'] = node.xpath("./td[3]/text()").extract()[0].encode("utf-8")
            item['workLocation'] = node.xpath("./td[4]/text()").extract()[0].encode("utf-8")
            item['publishTime'] = node.xpath("./td[5]/text()").extract()[0].encode("utf-8")

            # hand the item to the pipeline
            yield item

        # hard-code the range for now: crawl offsets up to 2000
        if self.offset < 2000:
            self.offset += 10
            url = self.baseURL + str(self.offset)
            yield scrapy.Request(url, callback=self.parse)

д¹ÜµÀÎļþ pipelines.py£º

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html

import json


class TencentPipeline(object):
    def __init__(self):
        self.f = open("tencent.json", "w")

    # all items share this pipeline
    def process_item(self, item, spider):
        content = json.dumps(dict(item), ensure_ascii=False) + ",\n"
        self.f.write(content)
        return item

    def close_spider(self, spider):
        self.f.close()

¹ÜµÀдºÃÖ®ºó£¬ÔÚ settings.py ÖÐÆôÓùܵÀ

# Configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'Tencent.pipelines.TencentPipeline': 300,
}

ÔËÐУº

> scrapy crawl tencent

File "/Users/kaiyiwang/Code/python/spider/Tencent/Tencent/spiders/tencent.py", line 21, in parse
  item['positionName'] = node.xpath("./td[1]/a/text()").extract()[0].encode("utf-8")
IndexError: list index out of range

The row-selection XPath here was written incorrectly; it should look like this:

# select the rows of the listing table
# node_list = response.xpath("//tr[@class='even'] or //tr[@class='odd']")
node_list = response.xpath("//tr[@class='even'] | //tr[@class='odd']")

Then run the command again:

> scrapy crawl tencent

Ö´Ðнá¹ûÎļþ tencent.json £º

{"positionName": "23673-²Æ¾­ÔËÓªÖÐÐÄÈȵãÔËÓª×é±à¼­", "publishTime": "2017-12-02", "positionLink": "position_detail.php?id=32718&keywords=&tid=0&lid=0", "positionType": "ÄÚÈݱ༭Àà", "workLocation": "±±¾©", "positionNumber": "1"},
{"positionName": "MIG03-ÌÚѶµØÍ¼¸ß¼¶Ëã·¨ÆÀ²â¹¤³Ìʦ£¨±±¾©£©", "publishTime": "2017-12-02", "positionLink": "position_detail.php?id=30276&keywords=&tid=0&lid=0", "positionType": "¼¼ÊõÀà", "workLocation": "±±¾©", "positionNumber": "1"},
{"positionName": "MIG10-΢»ØÊÕÇþµÀ²úÆ·ÔËÓª¾­Àí£¨ÉîÛÚ£©", "publishTime": "2017-12-02", "positionLink": "position_detail.php?id=32720&keywords=&tid=0&lid=0", "positionType": "²úÆ·/ÏîÄ¿Àà", "workLocation": "ÉîÛÚ", "positionNumber": "1"},
{"positionName": "MIG03-iOS²âÊÔ¿ª·¢¹¤³Ìʦ£¨±±¾©£©", "publishTime": "2017-12-02", "positionLink": "position_detail.php?id=32715&keywords=&tid=0&lid=0", "positionType": "¼¼ÊõÀà", "workLocation": "±±¾©", "positionNumber": "1"},
{"positionName": "19332-¸ß¼¶PHP¿ª·¢¹¤³Ìʦ£¨ÉϺ££©", "publishTime": "2017-12-02", "positionLink": "position_detail.php?id=31967&keywords=&tid=0&lid=0", "positionType": "¼¼ÊõÀà", "workLocation": "ÉϺ£", "positionNumber": "2"}

1.3 ͨ¹ýÏÂÒ»Ò³ÅÀÈ¡

ÎÒÃÇÉϱßÊÇͨ¹ý×ܵÄÒ³ÊýÀ´×¥È¡Ã¿Ò³Êý¾ÝµÄ£¬µ«ÊÇûÓп¼Âǵ½Ã¿ÌìµÄÊý¾ÝÊDZ仯µÄ£¬ËùÒÔ£¬ÐèÒªÅÀÈ¡µÄ×ÜÒ³Êý²»ÄÜдËÀ£¬ÄǸÃÔõôÅжÏÊÇ·ñÅÀÍêÁËÊý¾ÝÄØ£¿ÆäʵºÜ¼òµ¥£¬ÎÒÃÇ¿ÉÒÔ¸ù¾ÝÏÂÒ»Ò³À´ÅÀÈ¡£¬Ö»ÒªÏÂһҳûÓÐÊý¾ÝÁË£¬¾Í˵Ã÷Êý¾ÝÒѾ­ÅÀÍêÁË¡£

ÎÒÃÇͨ¹ý ÏÂÒ»Ò³ ¿´ÏÂ×îºóÒ»Ò³µÄÌØÕ÷£º

On the last page, the next-page button is grayed out and its link carries the class='noactive' attribute, so we can use this to detect whether we have reached the last page.

# дËÀ×ÜÒ³Êý£¬ÏÈÅÀ 100 Ò³Êý¾Ý
"""

if self.offset < 100:
self.offset += 10
url = self.baseURL + str(self.offset)
yield scrapy.Request(url, callback = self.parse)
"""

# ʹÓÃÏÂÒ»Ò³ÅÀÈ¡Êý¾Ý
if len(response.xpath("//a[@class='noactive' and @id='next']")) == 0:
url = response.xpath("//a[@id='next']/@href").extract()[0]
yield scrapy.Request("http://hr.tencent.com/" + url, callback = self.parse)

Ð޸ĺóµÄtencent.pyÎļþ£º

# -*- coding: utf-8 -*-
import scrapy
from Tencent.items import TencentItem


class TencentSpider(scrapy.Spider):
    # spider name
    name = 'tencent'
    # domains the spider is allowed to crawl
    allowed_domains = ['tencent.com']
    # 1. the base URL to build page URLs from
    baseURL = "http://hr.tencent.com/position.php?&start="
    # the offset appended to the base URL
    offset = 0

    # URLs fetched when the spider starts
    start_urls = [baseURL + str(offset)]

    # handles each response
    def parse(self, response):

        # extract the rows from each response
        node_list = response.xpath("//tr[@class='even'] | //tr[@class='odd']")

        for node in node_list:

            # build an item object to hold the data
            item = TencentItem()

            # Text content: take the first element of the list [0]
            # and convert the extracted Unicode string to utf-8
            print node.xpath("./td[1]/a/text()").extract()

            item['positionName'] = node.xpath("./td[1]/a/text()").extract()[0].encode("utf-8")
            item['positionLink'] = node.xpath("./td[1]/a/@href").extract()[0].encode("utf-8")  # link attribute

            # guard against empty cells
            if len(node.xpath("./td[2]/text()")):
                item['positionType'] = node.xpath("./td[2]/text()").extract()[0].encode("utf-8")
            else:
                item['positionType'] = ""

            item['positionNumber'] = node.xpath("./td[3]/text()").extract()[0].encode("utf-8")
            item['workLocation'] = node.xpath("./td[4]/text()").extract()[0].encode("utf-8")
            item['publishTime'] = node.xpath("./td[5]/text()").extract()[0].encode("utf-8")

            # yield matters here: it hands the item to the pipeline and then
            # resumes execution; a return would exit the whole function
            yield item

        # Approach 1: build the URL by concatenation. Use this when the page
        # has no clickable next link and URLs must be constructed by hand.
        """
        if self.offset < 100:
            self.offset += 10
            url = self.baseURL + str(self.offset)
            yield scrapy.Request(url, callback=self.parse)
        """

        # Approach 2: take the next link straight from the response and keep
        # requesting until all links are exhausted (crawl via the next page)
        if len(response.xpath("//a[@class='noactive' and @id='next']")) == 0:
            url = response.xpath("//a[@id='next']/@href").extract()[0]
            yield scrapy.Request("http://hr.tencent.com/" + url, callback=self.parse)

OK: by following the next-page link, we successfully crawled all of the job-posting data.

1.4 С½á

Crawler steps:

1. Create a project: scrapy startproject XXX

2. Generate a spider: scrapy genspider xxx "http://www.xxx.com"

3.±àд items.py, Ã÷È·ÐèÒªÌáÈ¡µÄÊý¾Ý

4.±àд spiders/xxx.py, ±àдÅÀ³æÎļþ£¬´¦ÀíÇëÇóºÍÏìÓ¦£¬ÒÔ¼°ÌáÈ¡Êý¾Ý£¨yield item£©

5.±àд pipelines.py, ±àд¹ÜµÀÎļþ£¬´¦Àíspider·µ»ØitemÊý¾Ý,±ÈÈç±¾µØÊý¾Ý³Ö¾Ã»¯£¬Ð´Îļþ»ò´æµ½±íÖС£

6.±àд settings.py£¬Æô¶¯¹ÜµÀ×é¼þITEM_PIPELINES£¬ÒÔ¼°ÆäËûÏà¹ØÉèÖÃ

7. Run the crawler: scrapy crawl xxx

ÓÐʱºò±»ÅÀÈ¡µÄÍøÕ¾¿ÉÄÜ×öÁ˺ܶàÏÞÖÆ£¬ËùÒÔ£¬ÎÒÃÇÇëÇóʱ¿ÉÒÔÌí¼ÓÇëÇó±¨Í·£¬scrapy ¸øÎÒÃÇÌṩÁËÒ»¸öºÜ·½±ãµÄ±¨Í·ÅäÖõĵط½£¬settings.py ÖУ¬ÎÒÃÇ¿ÉÒÔ¿ªÆô:

# Crawl responsibly by identifying yourself (and your website) on the user-agent
# USER_AGENT = 'Tencent (+http://www.yourdomain.com)'
USER_AGENT = ("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) "
              "AppleWebKit/537.36 (KHTML, like Gecko) "
              "Chrome/62.0.3202.94 Safari/537.36")

# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
}

Scrapy's sweet spot is crawling static pages, where its performance is formidable; but if all you need is dynamic JSON data from an API, a full Scrapy project is unnecessary.
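
In that case a plain HTTP client is enough. A minimal sketch with the requests library (the endpoint URL is a placeholder):

# -*- coding: utf-8 -*-
import requests

# placeholder endpoint: substitute the real JSON API you want to query
resp = requests.get("http://example.com/api/positions?start=0")
for position in resp.json():
    print(position)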

 
   