
A Competent Data Analyst Shares a Few Notes on Python Web Crawling (Scrapy Automatic Crawler)

Continued from the previous article, "A Competent Data Analyst Shares a Few Notes on Python Web Crawling (Comprehensive Hands-On Cases)".

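V. Comprehensive Hands-On Cases

3. Crawling with the Scrapy framework

(1) Getting to know Scrapy

Scrapy uses the Twisted asynchronous networking library to handle network communication. The overall architecture is roughly as follows:

[Figure: Scrapy architecture diagram]

For details on using Scrapy, refer to its official documentation.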

(2) Scrapy automatic crawler

In the previous hands-on cases we crawled data by building URLs in a loop. There is another approach: set an initial URL, extract the new links found on the current page, and keep crawling from those links until the crawled pages contain no new links.
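The link-following idea can be sketched without any network access. The block below is a minimal illustration, not Scrapy's actual scheduler: `FAKE_SITE` and `crawl` are hypothetical names, and the "site" is an in-memory dictionary that maps each URL to the links found on that page.

```python
from collections import deque

# Hypothetical in-memory "site": each URL maps to the links found on that page.
FAKE_SITE = {
    "/": ["/article/1", "/article/2"],
    "/article/1": ["/article/2", "/article/3"],
    "/article/2": ["/"],
    "/article/3": [],
}


def crawl(start_url, get_links):
    """Visit pages breadth-first, following only links not seen before."""
    seen = {start_url}
    queue = deque([start_url])
    order = []
    while queue:
        url = queue.popleft()
        order.append(url)
        for link in get_links(url):
            if link not in seen:  # skip pages that are already queued or visited
                seen.add(link)
                queue.append(link)
    return order


visited = crawl("/", lambda url: FAKE_SITE.get(url, []))
print(visited)
```

The crawl stops on its own once every reachable page has been visited, which is the "until no new links remain" condition described above.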

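(a) Requirements

Use an automatic crawler to crawl the links and content of Qiushibaike (糗事百科) articles, and store the beginning of each article's content together with its link in a MySQL database.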

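(b) Analysis

A. How do we extract article links from the home page?

Open the home page, view the source, and search for the text of any article on the page. You will find a link such as "/article/118123230"; clicking through shows that this is the article content we want. The automatic crawler therefore needs to be configured to follow only links that contain "article".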
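As a rough illustration of this filter, the sketch below applies a regular expression to a handful of hrefs using only the standard library. Apart from "/article/118123230", the sample hrefs are made up, and `is_article_link` is a hypothetical helper; in Scrapy the same pattern would be passed to the link extractor's `allow` argument.

```python
import re

# Same pattern that is later given to the link extractor.
ARTICLE_PATTERN = re.compile(r"article/")


def is_article_link(href):
    """Return True when the href contains the 'article/' path segment."""
    return ARTICLE_PATTERN.search(href) is not None


# Only the first href comes from the page source; the others are invented.
links = ["/article/118123230", "/hot/", "/users/123/", "/article/118123231?s=1"]
article_links = [href for href in links if is_article_link(href)]
print(article_links)
```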

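B. How do we extract the article content and link from the detail page?

Content

Open a detail page and inspect the article content, as shown below:

[Figure: article content on the detail page]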

The analysis shows that the div tag whose class attribute has the value content identifies the article content. The expression is:

  "//div[@class='content']/text()"

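Link

Open any detail page, copy its URL, view the page source, and search for that URL:

[Figure: detail page source containing the page URL]

The following XPath expression extracts the article link.

  "//link[@rel='canonical']/@href"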
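To try both selections offline, here is a small sketch using the standard library's `xml.etree.ElementTree` on a hand-written page. `SAMPLE_PAGE` is invented for illustration, and the URL in it is assembled from the domain and path mentioned in this article. ElementTree supports only a subset of XPath (no `text()` or `@href` steps), so the text and the attribute are read in Python instead; Scrapy's `response.xpath` accepts the full expressions.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for a detail page.
SAMPLE_PAGE = """
<html>
  <head>
    <link rel="stylesheet" href="/static/site.css" />
    <link rel="canonical" href="http://www.qiushibaike.com/article/118123230" />
  </head>
  <body>
    <div class="author">someone</div>
    <div class="content">A short joke goes here.</div>
  </body>
</html>
"""

root = ET.fromstring(SAMPLE_PAGE.strip())

# Counterpart of //div[@class='content']/text()
content = [div.text for div in root.findall(".//div[@class='content']")]

# Counterpart of //link[@rel='canonical']/@href
link = [el.get("href") for el in root.findall(".//link[@rel='canonical']")]

print(content)
print(link)
```

Both results are lists, which mirrors what `.extract()` returns in the spider and explains why the pipeline indexes them with `[0]`.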

(3) Project source code

A. Create the crawler project

Open CMD, change to the directory where crawler projects are stored, and enter:

  scrapy startproject qsbkauto

B. Project structure

spiders/qsbkspd.py: the spider file
items.py: the project's item definitions, i.e. containers for the content to extract, such as the title and comment count of a Dangdang product
pipelines.py: the item pipeline, mainly used for post-processing data, such as writing it to Excel or a database
settings.py: project settings; for example, by default pipelines are disabled and the robots protocol is obeyed
scrapy.cfg: project configuration

C. Create the spider

Enter the newly created project directory and run the following command, where the last argument is the domain:

  scrapy genspider -t crawl qsbkspd qiushibaike.com

D. Define items

  import scrapy


  class QsbkautoItem(scrapy.Item):
      # define the fields for your item here like:
      # name = scrapy.Field()
      # Field names must match the keys the spider assigns to.
      link = scrapy.Field()     # article link
      content = scrapy.Field()  # article content

E. Write the spider

spiders/qsbkspd.py

  # -*- coding: utf-8 -*-
  import scrapy
  from scrapy.linkextractors import LinkExtractor
  from scrapy.spiders import CrawlSpider, Rule
  from scrapy.http import Request

  from qsbkauto.items import QsbkautoItem


  class QsbkspdSpider(CrawlSpider):
      name = 'qsbkspd'
      allowed_domains = ['qiushibaike.com']
      # start_urls = ['http://qiushibaike.com/']

      # Follow every link containing "article/" and parse it with parse_item.
      rules = (
          Rule(LinkExtractor(allow=r'article/'), callback='parse_item', follow=True),
      )

      def start_requests(self):
          # Send a browser-like User-Agent with the initial request.
          i_headers = {"User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.22 Safari/537.36 SE 2.X MetaSr 1.0"}
          yield Request('http://www.qiushibaike.com/', headers=i_headers)

      def parse_item(self, response):
          i = QsbkautoItem()
          # Both values are lists of matched strings.
          i["content"] = response.xpath("//div[@class='content']/text()").extract()
          i["link"] = response.xpath("//link[@rel='canonical']/@href").extract()
          return i

pipelines.py

  import time

  import MySQLdb


  class QsbkautoPipeline(object):
      def exeSQL(self, sql, params=None):
          '''
          Purpose: connect to the MySQL database and execute one SQL statement
          @sql: the SQL statement, with %s placeholders
          @params: the values bound to the placeholders
          '''
          con = MySQLdb.connect(
              host='localhost',  # host
              user='root',       # user name
              passwd='xxxx',     # password
              db='spdRet',       # database name
              charset='utf8',
              local_infile=1
          )
          try:
              cur = con.cursor()
              cur.execute(sql, params)
              con.commit()
          finally:
              con.close()

      def process_item(self, item, spider):
          links = item.get('link')
          contents = item.get('content')
          if links and contents:  # skip items with a missing link or content
              link_url = links[0]
              # Keep the first 10 characters of the content, prefixed with today's date.
              curr_date = time.strftime('%Y-%m-%d', time.localtime(time.time()))
              content_header = curr_date + '__' + contents[0][0:10]
              try:
                  sql = "insert into qiushi(content,link) values(%s,%s)"
                  self.exeSQL(sql, (content_header, link_url))
              except Exception as er:
                  print("Insert failed with the following error:")
                  print(er)
          return item
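The pipeline's two steps, building a dated header from the first 10 characters of the content and inserting it with the link into the qiushi table, can be tried without a MySQL server. The sketch below swaps in the standard library's `sqlite3` as a stand-in, so it is not the article's setup: the column types, the sample text, the fixed date, and the `build_header` helper are all assumptions made for illustration.

```python
import sqlite3
import time


def build_header(content, today=None):
    """Prefix the first 10 characters of the content with a date."""
    today = today or time.strftime("%Y-%m-%d")
    return today + "__" + content[0:10]


# In-memory database standing in for MySQL; column types are assumed.
con = sqlite3.connect(":memory:")
con.execute("create table qiushi (content text, link text)")

header = build_header("An example article body that is long", today="2017-01-01")

# sqlite3 uses ? placeholders; MySQLdb uses %s for the same purpose.
con.execute(
    "insert into qiushi(content, link) values (?, ?)",
    (header, "http://www.qiushibaike.com/article/118123230"),
)
con.commit()

rows = con.execute("select content, link from qiushi").fetchall()
con.close()
print(rows)
```

Binding the values as parameters, rather than concatenating them into the SQL string, keeps the insert from breaking when the article text contains quote characters.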

settings.py

Disable ROBOTSTXT_OBEY
Set USER_AGENT
Enable ITEM_PIPELINES
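A settings.py fragment covering these three changes might look like the following. This is a sketch under assumptions: the User-Agent string is simply copied from the spider, the pipeline path follows the project and class names used in this article, and the priority 300 is an arbitrary illustrative value.

```python
# settings.py (fragment)

# Do not consult robots.txt before crawling.
ROBOTSTXT_OBEY = False

# Browser-like User-Agent sent with requests (same string as in the spider).
USER_AGENT = (
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/49.0.2623.22 Safari/537.36 SE 2.X MetaSr 1.0"
)

# Enable the pipeline; the number is an assumed priority (lower runs first).
ITEM_PIPELINES = {
    "qsbkauto.pipelines.QsbkautoPipeline": 300,
}
```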

F. Run the spider

  scrapy crawl qsbkspd --nolog

G. Results

[This article is an original work by the column contributor 豈安科技 (bigsec). To reprint it, please contact the original author through the WeChat official account (bigsec).]
