
Web Crawlers and Related Tools (2019)

Web Crawlers

A web crawler, formerly often called a web spider, is a bot program (or script) that automatically browses the World Wide Web and collects information according to a set of rules. Crawlers have long been widely used by internet search engines. Anyone who has used the internet and a browser knows that, besides the text meant for readers, web pages also contain hyperlinks. A crawler system discovers other pages on the web by following the hyperlinks it finds in each page. The process of collecting web data therefore resembles a crawler or spider roaming across a web, which is where the names "web crawler" and "web spider" come from.
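To make the idea of "following hyperlinks" concrete, here is a minimal sketch that pulls the links out of a piece of HTML using only the Python standard library. The names `LinkCollector` and `extract_links`, the sample HTML, and the example.com URLs are assumptions made for illustration; they do not come from the original article.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        for name, value in attrs:
            if name == 'href' and value:
                # Turn relative links into absolute URLs
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    """Return all hyperlink targets found in an HTML string (hypothetical helper)."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.links


# Made-up page content, used only to show the behaviour
sample_html = ('<p>See <a href="/news/1.html">news</a> and '
               '<a href="http://example.org/">example</a>.</p>')
links = extract_links(sample_html, 'http://example.com/index.html')
print(links)
```

A crawler would feed each extracted link back into its queue of pages to visit, which is how it moves from one page to the next.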

Application Areas of Crawlers

In an ideal world, every ICP (Internet Content Provider) would offer an API through which it shares the data it allows other programs to obtain, and crawlers would not be necessary. Well-known e-commerce platforms in China (such as Taobao and JD.com) and social platforms (such as Tencent Weibo) do provide their own Open APIs, but these APIs usually restrict both which data can be fetched and how often. For most companies, obtaining industry data promptly is an important part of staying in business, yet most of them lack such data from the outset. Using crawlers sensibly to collect data and extract commercially valuable information from it is therefore crucial. Crawlers have many other important application areas as well; some of them are listed below:

Search engines
News aggregation
Social applications
Public opinion monitoring
Industry data

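Legality and Background Research

A Discussion of Crawler Legality

Web crawling is still a pioneering field. The internet world has established some ethical norms through its own rules (the Robots protocol, formally the "Robots Exclusion Standard"), but the legal side is still being built and refined, so for now the field remains a gray area.

"What the law does not prohibit is permitted." If a crawler, like a browser, fetches only the data displayed on the front end (public information on web pages) rather than private, sensitive information in the site's back end, there is less reason to worry about legal constraints, because the big-data industry is currently developing far faster than the law.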

When crawling a site, make sure your crawler obeys the Robots protocol and control the rate at which it fetches data. When using the data, you must respect the site's intellectual property. (Since the Web 2.0 era, much of the data on the web has been contributed by users, but the platforms bear the operating costs, and when users register and publish content the platform has usually already obtained ownership, usage, and distribution rights over that data.) If you violate these rules, your odds of losing a lawsuit are quite high.
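One simple way to control the fetch rate is to enforce a minimum delay between requests to the same host. The sketch below is one possible approach, not something prescribed by the original article; the class name `Throttle` and its parameters are hypothetical.

```python
import time
from urllib.parse import urlparse


class Throttle:
    """Enforce a minimum delay between requests to the same host."""

    def __init__(self, delay_seconds, clock=time.monotonic, sleep=time.sleep):
        self.delay_seconds = delay_seconds
        # clock and sleep are injectable so the class can be tested without waiting
        self._clock = clock
        self._sleep = sleep
        self._last_request = {}  # host -> time of the previous request

    def wait(self, url):
        """Block until at least delay_seconds have passed since the last hit on this host."""
        host = urlparse(url).netloc
        last = self._last_request.get(host)
        if last is not None:
            remaining = self.delay_seconds - (self._clock() - last)
            if remaining > 0:
                self._sleep(remaining)
        self._last_request[host] = self._clock()
```

A crawler would create one instance, for example `throttle = Throttle(2.0)`, and call `throttle.wait(url)` immediately before each download. Requests to different hosts do not delay each other.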

The robots.txt File

Most websites define a robots.txt file. Taking Taobao's robots.txt as an example, let's see what restrictions the site places on crawlers.

    User-agent: Baiduspider
    Allow: /article
    Allow: /oshtml
    Disallow: /product/
    Disallow: /

    User-Agent: Googlebot
    Allow: /article
    Allow: /oshtml
    Allow: /product
    Allow: /spu
    Allow: /dianpu
    Allow: /oversea
    Allow: /list
    Disallow: /

    User-agent: Bingbot
    Allow: /article
    Allow: /oshtml
    Allow: /product
    Allow: /spu
    Allow: /dianpu
    Allow: /oversea
    Allow: /list
    Disallow: /

    User-Agent: 360Spider
    Allow: /article
    Allow: /oshtml
    Disallow: /

    User-Agent: Yisouspider
    Allow: /article
    Allow: /oshtml
    Disallow: /

    User-Agent: Sogouspider
    Allow: /article
    Allow: /oshtml
    Allow: /product
    Disallow: /

    User-Agent: Yahoo! Slurp
    Allow: /product
    Allow: /spu
    Allow: /dianpu
    Allow: /oversea
    Allow: /list
    Disallow: /

    User-Agent: *
    Disallow: /

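Note the last line of the first group in the robots.txt above: "Disallow: /" forbids Baidu's crawler from accessing any page other than those listed under "Allow". That is why, when you search for "Taobao" on Baidu, the result carries the notice: "Because this site's robots.txt file contains restrictive directives (limiting search engine crawling), the system cannot provide a description of this page's content." As a search engine, Baidu at least outwardly complies with Taobao's robots.txt, so users cannot find Taobao's internal product information through Baidu.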
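The effect of these rules can be checked offline with the standard library's `urllib.robotparser`. The sketch below feeds it only the Baiduspider group and the catch-all group from the listing; the helper name `allowed` and the agent name `SomeOtherBot` are made up for illustration.

```python
from urllib import robotparser

# The Baiduspider group and the catch-all group from the listing above
robots_lines = """\
User-agent: Baiduspider
Allow: /article
Allow: /oshtml
Disallow: /product/
Disallow: /

User-Agent: *
Disallow: /
""".splitlines()

parser = robotparser.RobotFileParser()
parser.parse(robots_lines)  # parse in-memory lines; no network access


def allowed(agent, path):
    """Hypothetical helper: may this user agent fetch this path?"""
    return parser.can_fetch(agent, 'https://www.taobao.com' + path)


print(allowed('Baiduspider', '/article'))    # True: matches "Allow: /article"
print(allowed('Baiduspider', '/product/'))   # False: matches "Disallow: /product/"
print(allowed('Baiduspider', '/list'))       # False: falls through to "Disallow: /"
print(allowed('SomeOtherBot', '/article'))   # False: only the "*" group applies
```

Python's parser applies the first rule in the group whose path is a prefix of the requested path, which is why the order of the Allow and Disallow lines matters here.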

Introduction to Related Tools

The HTTP Protocol

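Before we start on crawlers, let's briefly review HTTP (the Hypertext Transfer Protocol), because what we see on a web page is usually the result of the browser rendering HTML, and HTTP is the protocol that carries that HTML. Like many other application-layer protocols, HTTP (up to and including HTTP/2) is built on top of TCP (the Transmission Control Protocol) and relies on TCP's reliable transport service to exchange data in web applications. According to Wikipedia, HTTP was originally designed to provide a way to publish and receive HTML pages; in other words, the protocol is the carrier for the data exchanged between browsers and web servers. For details about the protocol and its current state of development, see Ruan Yifeng's "Introduction to the HTTP Protocol", his "Introduction to Internet Protocols" series, and "HTTPS Illustrated". The figures below show the HTTP request and response messages (protocol data) for a visit to the Baidu home page, which I captured while working at the Sichuan Provincial Key Laboratory of Network Communication Technology using the open-source protocol analyzer Ethereal (the predecessor of the packet capture tool Wireshark). Because Ethereal captures the data passing through the network adapter, the protocol data from the link layer all the way up to the application layer is clearly visible.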

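HTTP request (request line + request headers + blank line + [message body]):

[Screenshot: Ethereal capture of the HTTP request]

HTTP response (status line + response headers + blank line + message body):

[Screenshot: Ethereal capture of the HTTP response]

Note: hopefully these two screenshots, which look rather like yellowed old photographs, give you a rough idea of what kind of protocol HTTP is.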
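The message layout described above (start line, headers, blank line, optional body) can also be shown in a few lines of code. This is a simplified sketch for HTTP/1.x only; the function names and the sample response are invented for illustration, and real clients should rely on a library such as `urllib` or `http.client` instead.

```python
def build_http_request(method, path, host, headers=None, body=b''):
    """Assemble request line + headers + blank line + optional body."""
    lines = [f'{method} {path} HTTP/1.1', f'Host: {host}']
    for name, value in (headers or {}).items():
        lines.append(f'{name}: {value}')
    head = '\r\n'.join(lines) + '\r\n\r\n'  # the empty line ends the headers
    return head.encode('ascii') + body


def parse_http_response(raw):
    """Split a raw response into version, status code, reason, headers and body."""
    head, _, body = raw.partition(b'\r\n\r\n')
    lines = head.decode('iso-8859-1').split('\r\n')
    version, status_code, reason = lines[0].split(' ', 2)
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(':')
        headers[name.strip().lower()] = value.strip()
    return version, int(status_code), reason, headers, body


# A made-up response used only for illustration
raw_response = (
    b'HTTP/1.1 200 OK\r\n'
    b'Content-Type: text/html\r\n'
    b'Content-Length: 14\r\n'
    b'\r\n'
    b'<h1>Hello</h1>'
)

request = build_http_request('GET', '/', 'www.baidu.com')
version, status, reason, headers, body = parse_http_response(raw_response)
print(request)
print(status, reason, headers['content-type'], body)
```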

Related Tools

1. Chrome Developer Tools: the developer tools built into the Google Chrome browser.

2. Postman: a powerful tool for debugging web pages and sending RESTful requests.

4. HTTPie: a command-line HTTP client.

    $ http --headers http://www.scu.edu.cn
    HTTP/1.1 200 OK
    Accept-Ranges: bytes
    Cache-Control: private, max-age=600
    Connection: Keep-Alive
    Content-Encoding: gzip
    Content-Language: zh-CN
    Content-Length: 14403
    Content-Type: text/html
    Date: Sun, 27 May 2018 15:38:25 GMT
    ETag: "e6ec-56d3032d70a32-gzip"
    Expires: Sun, 27 May 2018 15:48:25 GMT
    Keep-Alive: timeout=5, max=100
    Last-Modified: Sun, 27 May 2018 13:44:22 GMT
    Server: VWebServer
    Vary: User-Agent,Accept-Encoding
    X-Frame-Options: SAMEORIGIN

5. BuiltWith: a tool for identifying the technologies a website uses. The second call below first turns off HTTPS certificate verification for the whole process, which is acceptable only for quick experiments.

    >>> import builtwith
    >>> builtwith.parse('http://www.bootcss.com/')
    {'web-servers': ['Nginx'], 'font-scripts': ['Font Awesome'], 'javascript-frameworks': ['Lo-dash', 'Underscore.js', 'Vue.js', 'Zepto', 'jQuery'], 'web-frameworks': ['Twitter Bootstrap']}
    >>> import ssl
    >>> ssl._create_default_https_context = ssl._create_unverified_context
    >>> builtwith.parse('https://www.jianshu.com/')
    {'web-servers': ['Tengine'], 'web-frameworks': ['Twitter Bootstrap', 'Ruby on Rails'], 'programming-languages': ['Ruby']}

6. python-whois: a tool for looking up who owns a website.

    >>> import whois
    >>> whois.whois('baidu.com')
    {'domain_name': ['BAIDU.COM', 'baidu.com'], 'registrar': 'MarkMonitor, Inc.', 'whois_server': 'whois.markmonitor.com', 'referral_url': None, 'updated_date': [datetime.datetime(2017, 7, 28, 2, 36, 28), datetime.datetime(2017, 7, 27, 19, 36, 28)], 'creation_date': [datetime.datetime(1999, 10, 11, 11, 5, 17), datetime.datetime(1999, 10, 11, 4, 5, 17)], 'expiration_date': [datetime.datetime(2026, 10, 11, 11, 5, 17), datetime.datetime(2026, 10, 11, 0, 0)], 'name_servers': ['DNS.BAIDU.COM', 'NS2.BAIDU.COM', 'NS3.BAIDU.COM', 'NS4.BAIDU.COM', 'NS7.BAIDU.COM', 'dns.baidu.com', 'ns4.baidu.com', 'ns3.baidu.com', 'ns7.baidu.com', 'ns2.baidu.com'], 'status': ['clientDeleteProhibited https://icann.org/epp#clientDeleteProhibited', 'clientTransferProhibited https://icann.org/epp#clientTransferProhibited', 'clientUpdateProhibited https://icann.org/epp#clientUpdateProhibited', 'serverDeleteProhibited https://icann.org/epp#serverDeleteProhibited', 'serverTransferProhibited https://icann.org/epp#serverTransferProhibited', 'serverUpdateProhibited https://icann.org/epp#serverUpdateProhibited', 'clientUpdateProhibited (https://www.icann.org/epp#clientUpdateProhibited)', 'clientTransferProhibited (https://www.icann.org/epp#clientTransferProhibited)', 'clientDeleteProhibited (https://www.icann.org/epp#clientDeleteProhibited)', 'serverUpdateProhibited (https://www.icann.org/epp#serverUpdateProhibited)', 'serverTransferProhibited (https://www.icann.org/epp#serverTransferProhibited)', 'serverDeleteProhibited (https://www.icann.org/epp#serverDeleteProhibited)'], 'emails': ['abusecomplaints@markmonitor.com', 'whoisrelay@markmonitor.com'], 'dnssec': 'unsigned', 'name': None, 'org': 'Beijing Baidu Netcom Science Technology Co., Ltd.', 'address': None, 'city': None, 'state': 'Beijing', 'zipcode': None, 'country': 'CN'}

7. robotparser: a tool for parsing robots.txt.

    >>> from urllib import robotparser
    >>> parser = robotparser.RobotFileParser()
    >>> parser.set_url('https://www.taobao.com/robots.txt')
    >>> parser.read()
    >>> parser.can_fetch('Hellokitty', 'http://www.taobao.com/article')
    False
    >>> parser.can_fetch('Baiduspider', 'http://www.taobao.com/article')
    True
    >>> parser.can_fetch('Baiduspider', 'http://www.taobao.com/product')
    False

A Simple Crawler

A basic crawler usually consists of three parts: data collection (downloading pages), data processing (parsing pages), and data storage (persisting the useful information). More advanced crawlers use concurrent programming or distributed techniques for collection and processing, which brings in a scheduler (to assign tasks to threads or processes) and a back-end management program (to monitor the crawler's working state and check the results of data fetching).

In general, a crawler's workflow includes the following steps:

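1. Set the crawl target (the seed page or start page) and fetch the page.
2. When the server cannot be reached, retry the download up to a specified number of times.
3. When necessary, set a user agent or hide the real IP address; otherwise the page may be inaccessible.
4. Decode the fetched page as needed, then extract the required information.
5. Extract the links in the fetched page in some way (for example, with regular expressions).
6. Process the links further (fetch each page and repeat the steps above).
7. Persist the useful information for later processing.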
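The core loop behind these steps (fetch, extract links, queue the new ones, stop at a depth limit) can be sketched without any network access. In the sketch below, `FAKE_PAGES`, `LINK_PATTERN`, and `crawl` are hypothetical names, and the in-memory dictionary stands in for real downloads; retries, user agents, decoding, and persistence are deliberately left out.

```python
import re
from collections import deque

# A tiny in-memory "website" standing in for real pages (hypothetical data)
FAKE_PAGES = {
    '/index': '<a href="/a">A</a> <a href="/b">B</a>',
    '/a': '<a href="/b">B</a> <a href="/c">C</a>',
    '/b': '<a href="/index">home</a>',
    '/c': 'no links here',
}

LINK_PATTERN = re.compile(r'<a[^>]+href=["\'](.*?)["\']', re.I)


def crawl(seed_url, fetch, max_depth):
    """Breadth-first crawl; returns {url: depth} for every page discovered."""
    depths = {seed_url: 0}      # doubles as the "already seen" set
    queue = deque([seed_url])
    while queue:
        url = queue.popleft()
        if depths[url] >= max_depth:
            continue            # do not expand pages at the depth limit
        html = fetch(url)
        if html is None:
            continue            # download failed; skip this page
        for link in LINK_PATTERN.findall(html):
            if link not in depths:
                depths[link] = depths[url] + 1
                queue.append(link)
    return depths


result = crawl('/index', FAKE_PAGES.get, max_depth=2)
print(result)
```

Passing the download function in as `fetch` keeps the traversal logic separate from networking, so the same loop can be exercised against test data or wired to a real downloader.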

The example below is a crawler that fetches NBA news headlines and links from Sohu Sports.

    import logging
    import re
    import ssl
    from urllib.error import URLError
    from urllib.request import urlopen

    import pymysql
    from pymysql import Error


    # Decode the page with the given character sets (not every site uses utf-8)
    def decode_page(page_bytes, charsets=('utf-8',)):
        page_html = None
        for charset in charsets:
            try:
                page_html = page_bytes.decode(charset)
                break
            except UnicodeDecodeError:
                # Wrong character set; try the next one
                pass
        return page_html


    # Fetch the HTML of a page (retrying a fixed number of times via recursion)
    def get_page_html(seed_url, *, retry_times=3, charsets=('utf-8',)):
        page_html = None
        try:
            page_html = decode_page(urlopen(seed_url).read(), charsets)
        except URLError as error:
            logging.error('URL: %s', error)
            if retry_times > 0:
                return get_page_html(seed_url, retry_times=retry_times - 1,
                                     charsets=charsets)
        return page_html


    # Extract the wanted parts of a page (usually links) with a regular expression
    def get_matched_parts(page_html, pattern_str, pattern_ignore_case=re.I):
        pattern_regex = re.compile(pattern_str, pattern_ignore_case)
        return pattern_regex.findall(page_html) if page_html else []


    # Run the crawler and persist the selected data
    def start_crawl(seed_url, match_pattern, *, max_depth=-1):
        conn = pymysql.connect(host='localhost', port=3306,
                               database='crawler', user='root',
                               password='123456', charset='utf8')
        try:
            with conn.cursor() as cursor:
                url_list = [seed_url]
                # This dict prevents repeated fetches and tracks crawl depth
                visited_url_list = {seed_url: 0}
                while url_list:
                    current_url = url_list.pop(0)
                    depth = visited_url_list[current_url]
                    if depth != max_depth:
                        # Try utf-8, gbk and gb2312 in turn when decoding
                        page_html = get_page_html(current_url, charsets=('utf-8', 'gbk', 'gb2312'))
                        links_list = get_matched_parts(page_html, match_pattern)
                        param_list = []
                        for link in links_list:
                            if link not in visited_url_list:
                                visited_url_list[link] = depth + 1
                                # Queue the link so it is crawled in turn, up to max_depth
                                url_list.append(link)
                                page_html = get_page_html(link, charsets=('utf-8', 'gbk', 'gb2312'))
                                # Assumed pattern (the source listing is damaged here):
                                # the text between <h1> and the following <span>
                                headings = get_matched_parts(page_html, r'<h1>(.*)<span')
                                if headings:
                                    param_list.append((headings[0], link))
                        cursor.executemany('insert into tb_result values (default, %s, %s)',
                                           param_list)
                        conn.commit()
        except Error as error:
            logging.error('SQL: %s', error)
        finally:
            conn.close()


    def main():
        # Disables HTTPS certificate verification; acceptable only for experiments
        ssl._create_default_https_context = ssl._create_unverified_context
        start_crawl('http://sports.sohu.com/nba_a.shtml',
                    r'<a[^>]+test=a\s[^>]*href=["\'](.*?)["\']',
                    max_depth=2)


    if __name__ == '__main__':
        main()

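Because the program uses MySQL for persistence, start the MySQL server before running it, and make sure the crawler database and the tb_result table referenced in the code already exist.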
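To try the persistence step without a running database server, the same idea can be sketched with `sqlite3` from the standard library. This swaps SQLite in for the MySQL setup used above and is not the article's method. The table name `tb_result` comes from the code above, while the column names, the `save_results` helper, and the sample rows are assumptions.

```python
import sqlite3


def save_results(conn, rows):
    """Insert (title, url) pairs, skipping URLs that are already stored."""
    conn.execute(
        'create table if not exists tb_result ('
        'id integer primary key autoincrement, '
        'title text not null, '
        'url text not null unique)'
    )
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            'insert or ignore into tb_result (title, url) values (?, ?)',
            rows,
        )


conn = sqlite3.connect(':memory:')  # in-memory database, nothing touches disk
save_results(conn, [('Sample headline', 'http://example.com/1.html')])
save_results(conn, [('Sample headline', 'http://example.com/1.html'),
                    ('Another headline', 'http://example.com/2.html')])
stored = conn.execute('select title, url from tb_result order by id').fetchall()
print(stored)
```

The unique constraint on `url` means that running the crawler twice does not store the same article twice.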

