How Does Tomcat Handle Search Engine Crawler Requests?

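Every site on the internet needs to be indexed by search engines and shown in their results at the right moment, so that its information reaches users and readers. But how does a search engine come to index our site?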

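This is where the search engine's crawler comes in: it crawls the site's content. Only content that has been crawled and indexed has a chance of appearing in the results for a given query.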

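The tools that search engines use to gather content are known as crawlers, spiders, web crawlers, and so on. On one hand we welcome their visits, since that is how our content gets indexed; on the other hand, their effect on normal service is a headache. A spider consumes server resources too, and when spiders take up too much, too often, the handling of normal user requests suffers. Some sites therefore serve search engines from a dedicated service and route ordinary user requests to other servers.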

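It is worth noting here that a spider request is identified by the User-Agent field in the HTTP request headers; each search engine has its own identifier. The same field also lets administrators see in the access log which content the search engines have crawled.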

In addition, robots.txt, the crawl declaration file addressed to search engines, contains similar User-agent entries. For example, here is taobao's robots.txt:

  User-agent: Baiduspider
  Allow: /article
  Allow: /oshtml
  Disallow: /product/
  Disallow: /

  User-Agent: Googlebot
  Allow: /article
  Allow: /oshtml
  Allow: /product
  Allow: /spu
  Allow: /dianpu
  Allow: /oversea
  Allow: /list
  Disallow: /

  User-agent: Bingbot
  Allow: /article
  Allow: /oshtml
  Allow: /product
  Allow: /spu
  Allow: /dianpu
  Allow: /oversea
  Allow: /list
  Disallow: /

  User-Agent: 360Spider
  Allow: /article
  Allow: /oshtml
  Disallow: /

  User-Agent: Yisouspider
  Allow: /article
  Allow: /oshtml
  Disallow: /

  User-Agent: Sogouspider
  Allow: /article
  Allow: /oshtml
  Allow: /product
  Disallow: /

  User-Agent: Yahoo! Slurp
  Allow: /product
  Allow: /spu
  Allow: /dianpu
  Allow: /oversea
  Allow: /list
  Disallow: /
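
To make it easier to read a group such as the Baiduspider one, here is a minimal sketch of how a crawler might evaluate it. It assumes the longest-matching-prefix rule described in RFC 9309 (Allow wins a tie, and no match means allowed) and ignores wildcards. `RobotsRuleSketch` and everything inside it are hypothetical names for illustration only.

```java
import java.util.Arrays;
import java.util.List;

final class RobotsRuleSketch {
    /** One Allow/Disallow line: whether it permits crawling, and its path prefix. */
    static final class Rule {
        final boolean allow;
        final String prefix;

        Rule(boolean allow, String prefix) {
            this.allow = allow;
            this.prefix = prefix;
        }
    }

    /** The Baiduspider group from the robots.txt quoted in the article. */
    static final List<Rule> BAIDUSPIDER = Arrays.asList(
            new Rule(true, "/article"),
            new Rule(true, "/oshtml"),
            new Rule(false, "/product/"),
            new Rule(false, "/"));

    /** Longest matching prefix wins; Allow wins a tie; no match means allowed. */
    static boolean isAllowed(List<Rule> rules, String path) {
        Rule best = null;
        for (Rule rule : rules) {
            if (!path.startsWith(rule.prefix)) {
                continue;
            }
            boolean longer = best == null
                    || rule.prefix.length() > best.prefix.length();
            boolean tieForAllow = best != null
                    && rule.prefix.length() == best.prefix.length()
                    && rule.allow;
            if (longer || tieForAllow) {
                best = rule;
            }
        }
        return best == null || best.allow;
    }

    public static void main(String[] args) {
        for (String path : new String[] {"/article/1", "/product/42", "/cart"}) {
            System.out.println(path + " -> " + isAllowed(BAIDUSPIDER, path));
        }
    }
}
```

Under these assumptions, Baiduspider may fetch `/article/1`, while `/product/42` and `/cart` are both refused because the only rules that match them are Disallow lines.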

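So what special handling does Tomcat apply to search engine requests?

It comes down to sessions. A session is how the server identifies a particular user. When spiders send a large volume of requests at high frequency, a huge number of sessions may have to be created, consuming a lot of memory that would otherwise be available for serving normal users.

For this, Tomcat provides a Valve that applies special handling to spider requests.

It first identifies a spider request, then lets that request continue through the rest of the pipeline with a shared session ID, which avoids creating a large amount of session data.

Note that if a spider explicitly sends a session ID that does not correspond to a valid session, that ID is discarded and the lookup is done by client IP instead, so the same spider is given only one session. A request that already carries a valid session ID is left untouched, as the first check in the code shows.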

Let's look at the code:

  // Declared at the top of the Valve's invoke() method
  boolean isBot = false;
  String sessionId = null;
  String clientIp = null;

  // If the incoming request has a valid session ID, no action is required
  if (request.getSession(false) == null) {

      // Is this a crawler - check the UA headers
      Enumeration<String> uaHeaders = request.getHeaders("user-agent");
      String uaHeader = null;
      if (uaHeaders.hasMoreElements()) {
          uaHeader = uaHeaders.nextElement();
      }

      // If more than one UA header - assume not a bot
      if (uaHeader != null && !uaHeaders.hasMoreElements()) {
          if (uaPattern.matcher(uaHeader).matches()) {
              isBot = true;
              if (log.isDebugEnabled()) {
                  log.debug(request.hashCode() +
                          ": Bot found. UserAgent=" + uaHeader);
              }
          }
      }

      // If this is a bot, is the session ID known?
      if (isBot) {
          clientIp = request.getRemoteAddr();
          sessionId = clientIpSessionId.get(clientIp);
          if (sessionId != null) {
              request.setRequestedSessionId(sessionId); // reuse the existing session
          }
      }
  }

  getNext().invoke(request, response);

  if (isBot) {
      if (sessionId == null) {
          // Has bot just created a session, if so make a note of it
          HttpSession s = request.getSession(false);
          if (s != null) {
              clientIpSessionId.put(clientIp, s.getId()); // remember the session created for this spider
              sessionIdClientIp.put(s.getId(), clientIp);
              // #valueUnbound() will be called on session expiration
              s.setAttribute(this.getClass().getName(), this);
              s.setMaxInactiveInterval(sessionInactiveInterval);

              if (log.isDebugEnabled()) {
                  log.debug(request.hashCode() +
                          ": New bot session. SessionID=" + s.getId());
              }
          }
      } else {
          if (log.isDebugEnabled()) {
              log.debug(request.hashCode() +
                      ": Bot session accessed. SessionID=" + sessionId);
          }
      }
  }

Spiders are identified with a regular expression:

  private String crawlerUserAgents =
      ".*[bB]ot.*|.*Yahoo! Slurp.*|.*Feedfetcher-Google.*";

  // Compiled when the Valve is initialized
  uaPattern = Pattern.compile(crawlerUserAgents);

This way, when a spider arrives it can be recognized by its User-Agent and handled specially, which reduces its impact.
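
To see what the default expression actually catches, the check can be tried outside Tomcat. The following is a minimal sketch: the pattern string is the one quoted in the article, while the class `DefaultCrawlerPatternDemo`, the helper `isCrawler`, and the sample User-Agent strings are illustrative assumptions.

```java
import java.util.regex.Pattern;

final class DefaultCrawlerPatternDemo {
    // Default expression as quoted in the article.
    static final Pattern UA_PATTERN = Pattern.compile(
            ".*[bB]ot.*|.*Yahoo! Slurp.*|.*Feedfetcher-Google.*");

    // Like the valve, require the whole header value to match the pattern.
    static boolean isCrawler(String userAgent) {
        return userAgent != null && UA_PATTERN.matcher(userAgent).matches();
    }

    public static void main(String[] args) {
        // Sample header values, chosen for illustration only.
        String[] samples = {
            "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
            "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)",
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/120.0 Safari/537.36"
        };
        for (String ua : samples) {
            System.out.println(isCrawler(ua) + "  " + ua);
        }
    }
}
```

With these samples, only the Googlebot string is classified as a crawler. The Baiduspider string contains neither "bot" nor "Bot", so the default expression treats it like an ordinary browser.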

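The Valve is called CrawlerSessionManagerValve, a good name whose purpose is clear at a glance.

Are there any remaining problems? Let's look: session sharing is decided by client IP.

Tomcat recently shipped a bug fix for this. When the Valve is configured at the Engine level and shared by multiple Hosts, identifying clients by IP alone means it only takes effect for one Host. After the fix, the Host and Context are taken into account in addition to the client IP; together these elements form the client identifier, so sessions can be shared correctly in more configurations.

The change is as follows: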
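
The idea behind the fix can be illustrated with a composite lookup key. This is a minimal sketch, not Tomcat's actual patch: the class `CrawlerSessionKeySketch`, its methods, and the key format are all hypothetical and only show why adding the Host and Context to the key keeps sessions apart across Hosts.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

final class CrawlerSessionKeySketch {
    // Client identifier -> session ID shared by that crawler.
    private final Map<String, String> clientIdSessionId = new ConcurrentHashMap<>();

    // Hypothetical key format: IP, host name and context path joined together.
    static String clientIdentifier(String clientIp, String hostName, String contextPath) {
        return clientIp + "-" + hostName + contextPath;
    }

    // Records the session a crawler obtained on a given host and context.
    void remember(String clientIp, String hostName, String contextPath, String sessionId) {
        clientIdSessionId.put(clientIdentifier(clientIp, hostName, contextPath), sessionId);
    }

    // Returns the remembered session ID, or null if this client has none yet.
    String sessionFor(String clientIp, String hostName, String contextPath) {
        return clientIdSessionId.get(clientIdentifier(clientIp, hostName, contextPath));
    }

    public static void main(String[] args) {
        CrawlerSessionKeySketch sketch = new CrawlerSessionKeySketch();
        // The same crawler IP visits the same context path on two virtual hosts.
        sketch.remember("203.0.113.7", "a.example", "/shop", "SESSION-A");
        sketch.remember("203.0.113.7", "b.example", "/shop", "SESSION-B");
        System.out.println(sketch.sessionFor("203.0.113.7", "a.example", "/shop"));
        System.out.println(sketch.sessionFor("203.0.113.7", "b.example", "/shop"));
    }
}
```

With a key made of the IP alone, the second `remember` call would overwrite the first, and the crawler would be handed a session ID that is only valid on one of the two Hosts.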

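To summarize: once the Valve has identified a spider request, it assigns the spider a fixed session, which avoids the resource consumption caused by creating a large number of sessions.

The Valve is not enabled by default; it has to be turned on by adding configuration to server.xml. Also, if you compare the regex pattern shown above with taobao's robots.txt, you will find that it does not cover the Chinese search engines listed there. What can be done about that?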

Simply pass your own expression in when configuring the Valve; the property has a public setter:

  public void setCrawlerUserAgents(String crawlerUserAgents) {
      this.crawlerUserAgents = crawlerUserAgents;
      if (crawlerUserAgents == null || crawlerUserAgents.length() == 0) {
          uaPattern = null;
      } else {
          uaPattern = Pattern.compile(crawlerUserAgents);
      }
  }
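
Before putting a custom expression into the configuration, it can be checked in isolation. The sketch below extends the default expression with an alternative that matches "spider" or "Spider". That extension is an assumption based on the User-agent names in the robots.txt above (Baiduspider, 360Spider, Yisouspider, Sogouspider); the class and method names are hypothetical, and the sample header values are illustrative.

```java
import java.util.regex.Pattern;

final class ExtendedCrawlerPatternDemo {
    // Assumed custom value: the default expression plus a "spider"/"Spider" alternative.
    static final String CRAWLER_USER_AGENTS =
            ".*[bB]ot.*|.*[sS]pider.*|.*Yahoo! Slurp.*|.*Feedfetcher-Google.*";

    static final Pattern UA_PATTERN = Pattern.compile(CRAWLER_USER_AGENTS);

    // Require the whole header value to match, as the valve's check does.
    static boolean isCrawler(String userAgent) {
        return userAgent != null && UA_PATTERN.matcher(userAgent).matches();
    }

    public static void main(String[] args) {
        // Illustrative header values built around the robots.txt names.
        String[] samples = {
            "Mozilla/5.0 (compatible; Baiduspider/2.0)",
            "Mozilla/5.0 (compatible; 360Spider)",
            "Mozilla/5.0 (compatible; Yisouspider)",
            "Mozilla/5.0 (compatible; Sogouspider)",
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/120.0 Safari/537.36"
        };
        for (String ua : samples) {
            System.out.println(isCrawler(ua) + "  " + ua);
        }
    }
}
```

In server.xml, a string like this would go into the `crawlerUserAgents` attribute of the Valve element whose `className` is `org.apache.catalina.valves.CrawlerSessionManagerValve`. A broader expression also raises the chance of matching a non-crawler client, so it is worth testing against real access-log samples first.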

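[This article is an original contribution by column author 侯樹成. To reprint it, please obtain authorization through the author's WeChat public account 『Tomcat那些事兒』.]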
