HOWTO Fetch Internet Resources Using The urllib Package

Author

Michael Foord

Note

There is a French translation of an earlier revision of this HOWTO, available at urllib2 - Le Manuel manquant.

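Introduction

You may also find the following article on fetching web resources with Python useful:

Basic Authentication

A tutorial on Basic Authentication, with examples in Python.

urllib.request is a Python module for fetching URLs (Uniform Resource Locators). It offers a very simple interface, in the form of the urlopen function, which is capable of fetching URLs using a variety of different protocols. It also offers a slightly more complex interface for handling common situations, such as basic authentication, cookies and proxies. These are provided by objects called handlers and openers.

urllib.request supports fetching URLs for many "URL schemes" (identified by the string before the ":" in the URL; for example "ftp" is the URL scheme of "ftp://python.org/") using their associated network protocols (e.g. FTP, HTTP). This tutorial focuses on the most common case, HTTP.

For straightforward situations urlopen is very easy to use. But as soon as you encounter errors or non-trivial cases when opening HTTP URLs, you will need some understanding of the HyperText Transfer Protocol. The most comprehensive and authoritative reference to HTTP is RFC 2616 (since superseded by RFC 9110). This is a technical document and not intended to be easy to read. This HOWTO aims to illustrate using urllib, with enough detail about HTTP to help you through. It is not intended to replace the urllib.request docs, but is supplementary to them.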

Fetching URLs

The simplest way to use urllib.request is as follows:

  import urllib.request

  with urllib.request.urlopen('http://python.org/') as response:
      html = response.read()  # the response body, as bytes
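Note that read() returns bytes, not text. As a minimal sketch of turning the body into a str, the helper below (fetch_text is a name made up for this example) asks the response headers for the declared character set and falls back to UTF-8; that fallback is an assumption, not something this HOWTO prescribes. A data: URL is used so the snippet runs without network access; an http: URL should behave the same way.

```python
import urllib.request


def fetch_text(url, default_charset="utf-8"):
    """Fetch url and decode the body using the charset the server declared."""
    with urllib.request.urlopen(url) as response:
        raw = response.read()  # bytes
        # response.headers is an email.message.Message (or subclass), so it
        # can parse the charset parameter of the Content-Type header.
        charset = response.headers.get_content_charset() or default_charset
    return raw.decode(charset)


# data: URLs are handled by urllib itself, so no server is contacted.
text = fetch_text("data:text/plain;charset=utf-8,Hello%20urllib")
print(text)
```

If the server declares the wrong charset, decode() may raise UnicodeDecodeError; handling that is left out of the sketch.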

If you wish to retrieve a resource via URL and store it in a temporary location, you can do so with the shutil.copyfileobj() and tempfile.NamedTemporaryFile() functions:

  import shutil
  import tempfile
  import urllib.request

  with urllib.request.urlopen('http://python.org/') as response:
      # delete=False keeps the file on disk after the with block ends
      with tempfile.NamedTemporaryFile(delete=False) as tmp_file:
          shutil.copyfileobj(response, tmp_file)

  with open(tmp_file.name) as html:
      pass

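Many uses of urllib will be that simple (note that instead of an 'http:' URL we could have used a URL starting with 'ftp:', 'file:', etc.). However, the purpose of this tutorial is to explain the more complicated cases, concentrating on HTTP.

HTTP is based on requests and responses: the client makes requests and servers send responses. urllib.request mirrors this with a Request object which represents the HTTP request you are making. In its simplest form you create a Request object that specifies the URL you want to fetch. Calling urlopen with this Request object returns a response object for the URL requested. This response is a file-like object, which means you can for example call .read() on the response:

  import urllib.request

  req = urllib.request.Request('http://www.voidspace.org.uk')
  with urllib.request.urlopen(req) as response:
      the_page = response.read()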

Note that urllib.request makes use of the same Request interface to handle all URL schemes. For example, you can make an FTP request like so:

  req = urllib.request.Request('ftp://example.com/')
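To see that the Request interface really is scheme-independent, you can inspect a Request before sending it. The sketch below only builds the objects and prints a few of their attributes; nothing is fetched, and the paths in the URLs are made up for the example.

```python
import urllib.request

http_req = urllib.request.Request("http://www.voidspace.org.uk/index.html")
ftp_req = urllib.request.Request("ftp://example.com/pub/file.txt")

for req in (http_req, ftp_req):
    # type is the URL scheme, host the network location, and selector
    # the part of the URL that follows the host.
    print(req.type, req.host, req.selector)
```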

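In the case of HTTP, there are two extra things that Request objects allow you to do. First, you can pass data to be sent to the server. Second, you can pass extra information ("metadata") about the data or about the request itself to the server; this information is sent as HTTP "headers". Let's look at each of these in turn.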

Data

Sometimes you want to send data to a URL (often the URL will refer to a CGI (Common Gateway Interface) script or other web application). With HTTP, this is often done using what's known as a POST request. This is often what your browser does when you submit an HTML form that you filled in on the web. Not all POSTs have to come from forms: you can use a POST to transmit arbitrary data to your own application. In the common case of HTML forms, the data needs to be encoded in a standard way, and then passed to the Request object as the data argument. The encoding is done using a function from the urllib.parse library:

  import urllib.parse
  import urllib.request

  url = 'http://www.someserver.com/cgi-bin/register.cgi'
  values = {'name' : 'Michael Foord',
            'location' : 'Northampton',
            'language' : 'Python' }

  data = urllib.parse.urlencode(values)
  data = data.encode('ascii') # data should be bytes
  req = urllib.request.Request(url, data)
  with urllib.request.urlopen(req) as response:
      the_page = response.read()
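Since not every POST carries form data, here is a hedged sketch of sending a JSON body instead. The endpoint URL and the payload are invented for illustration, so the request is only built and inspected, not sent.

```python
import json
import urllib.request

# Hypothetical endpoint, used only to build the request object.
url = "http://www.someserver.com/api/register"
payload = {"name": "Michael Foord", "language": "Python"}

body = json.dumps(payload).encode("utf-8")  # data must be bytes
req = urllib.request.Request(
    url,
    data=body,
    headers={"Content-Type": "application/json"},
)

print(req.get_method())  # POST, because data was supplied
# Request stores header names capitalized, hence "Content-type" here.
print(req.get_header("Content-type"))
# urllib.request.urlopen(req) would send it; omitted as the URL is made up.
```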

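Note that other encodings are sometimes required (e.g. for file upload from HTML forms; see HTML Specification, Form Submission for more details).

If you do not pass the data argument, urllib uses a GET request. One way in which GET and POST requests differ is that POST requests often have "side-effects": they change the state of the system in some way (for example by placing an order with the website for a hundredweight of tinned spam to be delivered to your door). Though the HTTP standard makes it clear that POSTs are intended to always cause side-effects, and GET requests never to cause side-effects, nothing prevents a GET request from having side-effects, nor a POST request from having no side-effects. Data can also be passed in an HTTP GET request by encoding it in the URL itself.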
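The rule "no data means GET, data means POST" can be checked without contacting a server, because a Request reports the method it would use. The sketch below also shows the method argument of Request, which overrides that default; the URL is a placeholder.

```python
import urllib.parse
import urllib.request

url = "http://www.example.com/example.cgi"
data = urllib.parse.urlencode({"language": "Python"}).encode("ascii")

get_req = urllib.request.Request(url)                  # no data
post_req = urllib.request.Request(url, data)           # data supplied
head_req = urllib.request.Request(url, method="HEAD")  # explicit method

print(get_req.get_method(), post_req.get_method(), head_req.get_method())
```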

This is done as follows:

  >>> import urllib.request
  >>> import urllib.parse
  >>> data = {}
  >>> data['name'] = 'Somebody Here'
  >>> data['location'] = 'Northampton'
  >>> data['language'] = 'Python'
  >>> url_values = urllib.parse.urlencode(data)
  >>> print(url_values)  # The order may differ from below.
  name=Somebody+Here&language=Python&location=Northampton
  >>> url = 'http://www.example.com/example.cgi'
  >>> full_url = url + '?' + url_values
  >>> data = urllib.request.urlopen(full_url)

Notice that the full URL is created by adding a ? to the URL, followed by the encoded values.
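As a small offline check of the encoding step, the sketch below builds the same kind of URL and then decodes the query string again with urllib.parse.parse_qs. The values are the ones used above; the round trip is only an illustration and is not needed for fetching.

```python
import urllib.parse

params = {"name": "Somebody Here", "location": "Northampton", "language": "Python"}
query = urllib.parse.urlencode(params)
full_url = "http://www.example.com/example.cgi" + "?" + query
print(full_url)

# Round trip: split the URL again and decode the query string.
parts = urllib.parse.urlsplit(full_url)
decoded = urllib.parse.parse_qs(parts.query)
print(decoded["name"])  # parse_qs returns a list of values per key
```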

Headers

We'll discuss here one particular HTTP header, to illustrate how to add headers to your HTTP request.

Some websites [1] dislike being browsed by programs, or send different versions to different browsers [2]. By default urllib identifies itself as Python-urllib/x.y (where x and y are the major and minor version numbers of the Python release, e.g. Python-urllib/2.5), which may confuse the site, or just plain not work. The way a browser identifies itself is through the User-Agent header [3]. When you create a Request object you can pass a dictionary of headers in. The following example makes the same request as above, but identifies itself as a version of Internet Explorer [4]:

  import urllib.parse
  import urllib.request

  url = 'http://www.someserver.com/cgi-bin/register.cgi'
  user_agent = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64)'
  values = {'name': 'Michael Foord',
            'location': 'Northampton',
            'language': 'Python' }
  headers = {'User-Agent': user_agent}

  data = urllib.parse.urlencode(values)
  data = data.encode('ascii')
  req = urllib.request.Request(url, data, headers)
  with urllib.request.urlopen(req) as response:
      the_page = response.read()
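Headers can also be added after the Request has been created, using add_header. The following sketch builds a request and lists the headers it would send; nothing is transmitted. Note that Request stores header names in capitalized form, so "User-Agent" is looked up as "User-agent".

```python
import urllib.request

req = urllib.request.Request("http://www.someserver.com/cgi-bin/register.cgi")
req.add_header("User-Agent", "Mozilla/5.0 (Windows NT 6.1; Win64; x64)")
req.add_header("Accept-Language", "en")

print(req.has_header("User-agent"))
for name, value in req.header_items():
    print(name, "=", value)
```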

The response also has two useful methods. See the section on info and geturl, which comes after we have a look at what happens when things go wrong.

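Handling Exceptions

urlopen raises URLError when it cannot handle a response (though, as usual with Python APIs, built-in exceptions such as ValueError, TypeError etc. may also be raised).

HTTPError is the subclass of URLError raised in the specific case of HTTP URLs.

The exception classes are exported from the urllib.error module.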

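URLError

Often, URLError is raised because there is no network connection (no route to the specified server), or the specified server doesn't exist. In this case, the exception raised will have a 'reason' attribute. In the example below it is a tuple containing an error code and a text error message; in current Python 3 versions it is typically a message string or another exception instance, and the exact output depends on the platform.

e.g.

  >>> req = urllib.request.Request('http://www.pretend_server.org')
  >>> try: urllib.request.urlopen(req)
  ... except urllib.error.URLError as e:
  ...     print(e.reason)
  ...
  (4, 'getaddrinfo failed')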
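A URLError can be provoked without any network access, which is handy for trying out error handling. The sketch below uses two deliberately bad URLs (both made up): one with a scheme urllib does not know, and one pointing at a local file that is assumed not to exist. The helper name failure_reason is hypothetical.

```python
import urllib.error
import urllib.request


def failure_reason(url):
    """Return the reason attribute of the URLError raised for url, if any."""
    try:
        with urllib.request.urlopen(url):
            return None
    except urllib.error.URLError as e:
        return e.reason


# Unknown scheme: the reason is a plain message string.
unknown = failure_reason("nosuchscheme://example.com/")
# Missing local file: the reason wraps the underlying OSError.
missing = failure_reason("file:///no/such/dir/no-such-file.txt")

print(repr(unknown))
print(type(missing).__name__)
```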

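HTTPError

Every HTTP response from the server contains a numeric "status code". Sometimes the status code indicates that the server is unable to fulfil the request. The default handlers will handle some of these responses for you (for example, if the response is a "redirection" that requests the client fetch the document from a different URL, urllib will handle that for you). For those it can't handle, urlopen will raise an HTTPError. Typical errors include '404' (page not found), '403' (request forbidden), and '401' (authentication required).

See RFC 2616 for a reference on all the HTTP error codes.

The HTTPError instance raised will have an integer 'code' attribute, which corresponds to the error sent by the server.

Error Codes

Because the default handlers handle redirects (codes in the 300 range), and codes in the 100-299 range indicate success, you will usually only see error codes in the 400-599 range.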

http.server.BaseHTTPRequestHandler.responses is a useful dictionary of response codes that shows all the response codes used by RFC 2616. The dictionary is reproduced here for convenience:

  # Table mapping response codes to messages; entries have the
  # form {code: (shortmessage, longmessage)}.
  responses = {
      100: ('Continue', 'Request received, please continue'),
      101: ('Switching Protocols',
            'Switching to new protocol; obey Upgrade header'),

      200: ('OK', 'Request fulfilled, document follows'),
      201: ('Created', 'Document created, URL follows'),
      202: ('Accepted',
            'Request accepted, processing continues off-line'),
      203: ('Non-Authoritative Information', 'Request fulfilled from cache'),
      204: ('No Content', 'Request fulfilled, nothing follows'),
      205: ('Reset Content', 'Clear input form for further input.'),
      206: ('Partial Content', 'Partial content follows.'),

      300: ('Multiple Choices',
            'Object has several resources -- see URI list'),
      301: ('Moved Permanently', 'Object moved permanently -- see URI list'),
      302: ('Found', 'Object moved temporarily -- see URI list'),
      303: ('See Other', 'Object moved -- see Method and URL list'),
      304: ('Not Modified',
            'Document has not changed since given time'),
      305: ('Use Proxy',
            'You must use proxy specified in Location to access this '
            'resource.'),
      307: ('Temporary Redirect',
            'Object moved temporarily -- see URI list'),

      400: ('Bad Request',
            'Bad request syntax or unsupported method'),
      401: ('Unauthorized',
            'No permission -- see authorization schemes'),
      402: ('Payment Required',
            'No payment -- see charging schemes'),
      403: ('Forbidden',
            'Request forbidden -- authorization will not help'),
      404: ('Not Found', 'Nothing matches the given URI'),
      405: ('Method Not Allowed',
            'Specified method is invalid for this server.'),
      406: ('Not Acceptable', 'URI not available in preferred format.'),
      407: ('Proxy Authentication Required', 'You must authenticate with '
            'this proxy before proceeding.'),
      408: ('Request Timeout', 'Request timed out; try again later.'),
      409: ('Conflict', 'Request conflict.'),
      410: ('Gone',
            'URI no longer exists and has been permanently removed.'),
      411: ('Length Required', 'Client must specify Content-Length.'),
      412: ('Precondition Failed', 'Precondition in headers is false.'),
      413: ('Request Entity Too Large', 'Entity is too large.'),
      414: ('Request-URI Too Long', 'URI is too long.'),
      415: ('Unsupported Media Type', 'Entity body in unsupported format.'),
      416: ('Requested Range Not Satisfiable',
            'Cannot satisfy request range.'),
      417: ('Expectation Failed',
            'Expect condition could not be satisfied.'),

      500: ('Internal Server Error', 'Server got itself in trouble'),
      501: ('Not Implemented',
            'Server does not support this operation'),
      502: ('Bad Gateway', 'Invalid responses from another server/proxy.'),
      503: ('Service Unavailable',
            'The server cannot process the request due to a high load'),
      504: ('Gateway Timeout',
            'The gateway server did not receive a timely response'),
      505: ('HTTP Version Not Supported', 'Cannot fulfill request.'),
      }
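If you only need the short message for a code, the standard library's http.HTTPStatus enumeration offers the same information without copying the table. The helper below (describe is a made-up name) is a small sketch that also groups codes into the ranges discussed above; HTTPStatus(code) raises ValueError for codes it does not know.

```python
from http import HTTPStatus


def describe(code):
    """Return a short label for an HTTP status code."""
    phrase = HTTPStatus(code).phrase
    if 400 <= code < 500:
        kind = "client error"
    elif 500 <= code < 600:
        kind = "server error"
    else:
        kind = "not an error"
    return f"{code} {phrase} ({kind})"


print(describe(404))
print(describe(503))
```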

When an error is raised the server responds by returning an HTTP error code and an error page. You can use the HTTPError instance as a response on the page returned. This means that as well as the code attribute, it also has read, geturl, and info methods, as returned by the urllib.response module:

  >>> req = urllib.request.Request('http://www.python.org/fish.html')
  >>> try:
  ...     urllib.request.urlopen(req)
  ... except urllib.error.HTTPError as e:
  ...     print(e.code)
  ...     print(e.read())
  ...
  404
  b'<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">\n\n\n<html
    ...
    <title>Page Not Found</title>\n
    ...
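To experiment with this behaviour offline, an HTTPError can be built by hand. This is only a sketch: urlopen normally creates the exception for you, and the header and body used here are invented. It shows that the exception object can be read like a response.

```python
import io
from email.message import Message
from urllib.error import HTTPError

headers = Message()
headers["Content-Type"] = "text/html"
body = io.BytesIO(b"<title>Page Not Found</title>")

# Arguments: url, code, msg, hdrs, fp. Built by hand for illustration only.
err = HTTPError("http://www.python.org/fish.html", 404, "Not Found", headers, body)

print(err.code)
print(err.geturl())
print(err.info()["Content-Type"])
page = err.read()  # the error page, as bytes
print(page)
err.close()
```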


Wrapping it Up

So if you want to be prepared for HTTPError or URLError there are two basic approaches. I prefer the second approach.

Number 1

  from urllib.request import Request, urlopen
  from urllib.error import URLError, HTTPError

  # someurl is a placeholder for the URL you want to fetch
  req = Request(someurl)
  try:
      response = urlopen(req)
  except HTTPError as e:
      print('The server couldn\'t fulfill the request.')
      print('Error code: ', e.code)
  except URLError as e:
      print('We failed to reach a server.')
      print('Reason: ', e.reason)
  else:
      # everything is fine
      pass

Note

The except HTTPError must come first, otherwise except URLError will also catch an HTTPError.

Number 2

  from urllib.request import Request, urlopen
  from urllib.error import URLError

  # someurl is a placeholder for the URL you want to fetch
  req = Request(someurl)
  try:
      response = urlopen(req)
  except URLError as e:
      # Test for 'code' first: in current Python 3 an HTTPError
      # also has a 'reason' attribute.
      if hasattr(e, 'code'):
          print('The server couldn\'t fulfill the request.')
          print('Error code: ', e.code)
      elif hasattr(e, 'reason'):
          print('We failed to reach a server.')
          print('Reason: ', e.reason)
  else:
      # everything is fine
      pass
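One subtlety when telling the two exceptions apart by their attributes: in current Python 3 versions an HTTPError exposes reason as well as code, so testing for code is the more reliable way to recognise it. The sketch below demonstrates this with exceptions constructed by hand (urlopen would normally raise them); classify is a hypothetical helper name.

```python
import io
from email.message import Message
from urllib.error import HTTPError, URLError


def classify(exc):
    """Label an exception caught from urlopen; test for 'code' first."""
    if hasattr(exc, "code"):
        return f"server answered with error code {exc.code}"
    if hasattr(exc, "reason"):
        return f"failed to reach a server: {exc.reason}"
    return "unexpected exception"


# Both exceptions are constructed by hand here, purely for illustration.
http_err = HTTPError("http://example.com/x", 404, "Not Found", Message(), io.BytesIO())
url_err = URLError("getaddrinfo failed")

print(hasattr(http_err, "reason"))  # HTTPError exposes reason as well
print(classify(http_err))
print(classify(url_err))
http_err.close()
```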
 
 
 
info and geturl

The response returned by urlopen (or the HTTPError instance) has two useful methods, info() and geturl(), and is defined in the module urllib.response.

geturl - this returns the real URL of the page fetched. This is useful because urlopen (or the opener object used) may have followed a redirect. The URL of the page fetched may not be the same as the URL requested.

info - this returns a dictionary-like object that describes the page fetched, particularly the headers sent by the server. It is currently an http.client.HTTPMessage instance.

Typical headers include 'Content-length', 'Content-type', and so on. See the Quick Reference to HTTP Headers for a useful listing of HTTP headers with brief explanations of their meaning and use.
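Both methods can be tried without a network connection by opening a data: URL, for which urllib synthesises the Content-type and Content-length headers itself. This is a sketch under that assumption; with an http: URL the headers come from the server, and geturl() reflects any redirect that was followed. The same information is also available through the url and headers attributes of the response.

```python
import urllib.request

url = "data:text/plain;charset=utf-8,Hello%20urllib"
with urllib.request.urlopen(url) as response:
    final_url = response.geturl()
    headers = response.info()
    content_type = headers.get_content_type()
    length = headers["Content-Length"]  # header lookup is case-insensitive

print(final_url == url)
print(content_type, length)
```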

Openers and Handlers

When you fetch a URL you use an opener (an instance of the perhaps confusingly named urllib.request.OpenerDirector). Normally we have been using the default opener, via urlopen, but you can create custom openers. Openers use handlers. All the "heavy lifting" is done by the handlers. Each handler knows how to open URLs for a particular URL scheme (http, ftp, etc.), or how to handle an aspect of URL opening, for example HTTP redirections or HTTP cookies.

You will want to create openers if you want to fetch URLs with specific handlers installed, for example to get an opener that handles cookies, or to get an opener that does not handle redirections.

To create an opener, instantiate an OpenerDirector, and then call .add_handler(some_handler_instance) repeatedly.

Alternatively, you can use build_opener, which is a convenience function for creating opener objects with a single function call. build_opener adds several handlers by default, but provides a quick way to add more and/or override the default handlers.

Other sorts of handlers you might want can handle proxies, authentication, and other common but slightly specialised situations.

install_opener can be used to make an opener object the (global) default opener. This means that calls to urlopen will use the opener you have installed.

Opener objects have an open method, which can be called directly to fetch URLs in the same way as the urlopen function: there's no need to call install_opener, except as a convenience.
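As a minimal sketch of the build_opener route, the example below creates an opener that handles cookies and uses it directly through its open method. The handlers attribute is inspected only to show what build_opener assembled; a data: URL keeps the example offline, so no cookies are actually exchanged here.

```python
import http.cookiejar
import urllib.request

cookie_jar = http.cookiejar.CookieJar()
cookie_handler = urllib.request.HTTPCookieProcessor(cookie_jar)

# build_opener adds the default handlers plus the one passed in.
opener = urllib.request.build_opener(cookie_handler)
print(sorted(type(h).__name__ for h in opener.handlers))

# The opener is used directly; install_opener is not needed for this.
with opener.open("data:,hello") as response:
    body = response.read()
print(body)
```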
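
Basic Authentication

To illustrate creating and installing a handler we will use the HTTPBasicAuthHandler. For a more detailed discussion of this subject, including an explanation of how Basic Authentication works, see the Basic Authentication Tutorial.

When authentication is required, the server sends a header (as well as the 401 error code) requesting authentication. This specifies the authentication scheme and a 'realm'. The header looks like: WWW-Authenticate: SCHEME realm="REALM".

e.g.

  WWW-Authenticate: Basic realm="cPanel Users"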
 
 
 
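The client should then retry the request with the appropriate name and password for the realm included as a header in the request. This is 'basic authentication'. In order to simplify this process we can create an instance of HTTPBasicAuthHandler and an opener to use this handler.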
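For orientation, the header the client sends on the retry has the form Authorization: Basic followed by the base64 encoding of "username:password". HTTPBasicAuthHandler takes care of this for you; the sketch below (with the hypothetical helper name basic_auth_header) only shows what the value looks like. Base64 is an encoding, not encryption, so the credentials are readable by anyone who sees the request.

```python
import base64


def basic_auth_header(username, password):
    """Return the Authorization header value used by Basic authentication."""
    credentials = f"{username}:{password}".encode("utf-8")
    return "Basic " + base64.b64encode(credentials).decode("ascii")


# Example credentials taken from the HTTP authentication RFCs.
print(basic_auth_header("Aladdin", "open sesame"))
```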

The HTTPBasicAuthHandler uses an object called a password manager to handle the mapping of URLs and realms to passwords and usernames. If you know what the realm is (from the authentication header sent by the server), then you can use a HTTPPasswordMgr. Frequently one doesn't care what the realm is. In that case, it is convenient to use HTTPPasswordMgrWithDefaultRealm. This allows you to specify a default username and password for a URL. This will be supplied in the absence of you providing an alternative combination for a specific realm. We indicate this by providing None as the realm argument to the add_password method.

The top-level URL is the first URL that requires authentication. URLs "deeper" than the URL you pass to .add_password() will also match.

  # username, password and a_url are placeholders to be supplied by you

  # create a password manager
  password_mgr = urllib.request.HTTPPasswordMgrWithDefaultRealm()

  # Add the username and password.
  # If we knew the realm, we could use it instead of None.
  top_level_url = "http://example.com/foo/"
  password_mgr.add_password(None, top_level_url, username, password)

  handler = urllib.request.HTTPBasicAuthHandler(password_mgr)

  # create "opener" (OpenerDirector instance)
  opener = urllib.request.build_opener(handler)

  # use the opener to fetch a URL
  opener.open(a_url)

  # Install the opener.
  # Now all calls to urllib.request.urlopen use our opener.
  urllib.request.install_opener(opener)
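The matching rule for "deeper" URLs can be checked without a server by asking the password manager directly. The credentials below are placeholders invented for this sketch; find_user_password is the lookup the authentication handler performs when the server asks for credentials.

```python
import urllib.request

password_mgr = urllib.request.HTTPPasswordMgrWithDefaultRealm()
# Placeholder credentials for illustration.
password_mgr.add_password(None, "http://example.com/foo/", "joe", "secret")

# A deeper URL matches, whatever realm the server announces.
deeper = password_mgr.find_user_password("Some Realm", "http://example.com/foo/bar.html")
# A URL outside the registered prefix does not.
outside = password_mgr.find_user_password("Some Realm", "http://example.com/other/")

print(deeper)
print(outside)
```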
 
 
 
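Note

In the above example we only supplied our HTTPBasicAuthHandler to build_opener. By default openers have the handlers for normal situations: ProxyHandler (if a proxy setting such as an http_proxy environment variable is set), UnknownHandler, HTTPHandler, HTTPDefaultErrorHandler, HTTPRedirectHandler, FTPHandler, FileHandler, DataHandler, HTTPErrorProcessor.

top_level_url is in fact either a full URL (including the 'http:' scheme component and the hostname and optionally the port number), e.g. "http://example.com/", or an "authority" (i.e. the hostname, optionally including the port number), e.g. "example.com" or "example.com:8080" (the latter example includes a port number). The authority, if present, must NOT contain the "userinfo" component; for example "joe:password@example.com" is not correct.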
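
Proxies

urllib will auto-detect your proxy settings and use those. This is through the ProxyHandler, which is part of the normal handler chain when a proxy setting is detected. Normally that's a good thing, but there are occasions when it may not be helpful [5]. One way to avoid this is to set up our own ProxyHandler, with no proxies defined. This is done using similar steps to setting up a Basic Authentication handler:

  >>> proxy_support = urllib.request.ProxyHandler({})
  >>> opener = urllib.request.build_opener(proxy_support)
  >>> urllib.request.install_opener(opener)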
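To see which proxy settings urllib would detect on your machine, and to contrast the "no proxies" opener with one that uses an explicit proxy, consider the following sketch. The proxy address is made up, and neither opener is installed globally or used to fetch anything.

```python
import urllib.request

# Proxy settings urllib would pick up from the environment or the system.
print(urllib.request.getproxies())

# An empty mapping means "use no proxies at all" for this opener.
no_proxy_opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))

# An explicit mapping overrides detection; the proxy address is made up.
explicit = urllib.request.ProxyHandler({"http": "http://proxy.example.com:3128"})
proxy_opener = urllib.request.build_opener(explicit)
print(explicit.proxies)
```

Passing either opener to install_opener would make urlopen use it, as in the session above.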
 
 
 
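Note

Currently urllib.request does not support fetching of https locations through a proxy. However, this can be enabled by extending urllib.request as shown in the recipe [6].

Note

HTTP_PROXY will be ignored if a variable REQUEST_METHOD is set; see the documentation on getproxies().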

Sockets and Layers

The Python support for fetching resources from the web is layered. urllib uses the http.client library, which in turn uses the socket library.

As of Python 2.3 you can specify how long a socket should wait for a response before timing out. This can be useful in applications which have to fetch web pages. By default the socket module has no timeout and can hang. In current Python versions urlopen accepts a timeout argument for a single call; you can also set the default timeout globally for all sockets:

  import socket
  import urllib.request

  # timeout in seconds
  timeout = 10
  socket.setdefaulttimeout(timeout)

  # this call to urllib.request.urlopen now uses the default timeout
  # we have set in the socket module
  req = urllib.request.Request('http://www.voidspace.org.uk')
  response = urllib.request.urlopen(req)
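The two ways of setting a timeout can be compared in a short sketch. A data: URL is used so the example runs offline; since no socket is opened for it, the timeout has no visible effect here and is shown only for the call signature. The global default is restored afterwards so that other code is not affected, which is a design choice of this sketch rather than a requirement.

```python
import socket
import urllib.request

# Per-call timeout (in seconds); only this request is affected.
with urllib.request.urlopen("data:,ping", timeout=5) as response:
    body = response.read()
print(body)

# Global default, restored afterwards so other code is not affected.
previous = socket.getdefaulttimeout()
socket.setdefaulttimeout(10)
try:
    current = socket.getdefaulttimeout()
    print(current)
finally:
    socket.setdefaulttimeout(previous)
```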
 
 
 

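Footnotes

This document was reviewed and revised by John Lee.

[1] Google for example.

[2] Browser sniffing is a very bad practice for website design; building sites using web standards is much more sensible. Unfortunately a lot of sites still send different versions to different browsers.

[3] The user agent for MSIE 6 is 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)'

[4] For details of more HTTP request headers, see Quick Reference to HTTP Headers.

[5] In my case I have to use a proxy to access the internet at work. If you attempt to fetch localhost URLs through this proxy it blocks them. IE is set to use the proxy, which urllib picks up on. In order to test scripts with a localhost server, I have to prevent urllib from using the proxy.

[6] urllib opener for SSL proxy (CONNECT method): ASPN Cookbook Recipe.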
                                                
