Monday, June 10, 2013

urllib2 -- The Missing Mannual

urllib2 The Missing Mannual

Fetching URLs

The simplest way to use urllib2 is as follows :
import urllib2
response = urllib2.urlopen('')
html =
Request object

import urllib2

req = urllib2.Request('')
response = urllib2.urlopen(req)
the_page =

In the case of HTTP, there are two extra things that Request objects allow you to do: First, you can pass data to be sent to the server. Second, you can pass extra information ("metadata") about the data or the about request itself, to the server - this information is sent as HTTP "headers". Let's look at each of these in turn.


Sometimes you want to send data to a URL (often the URL will refer to a CGI (Common Gateway Interface) script [1] or other web application). With HTTP, this is often done using what's known as a POST request.

you can use a POST to transmit arbitrary data to your own application. In the common case of HTML forms, the data needs to be encoded in a standard way, and then passed to the Request object as the data argument. The encoding is done using a function from the urllib library

import urllib
import urllib2

url = ''
values = {'name' : 'Michael Foord',
          'location' : 'Northampton',
          'language' : 'Python' }

data = urllib.urlencode(values)
req = urllib2.Request(url, data)
response = urllib2.urlopen(req)
the_page =
urlencode, Request
If you do not pass the data argument, urllib2 uses a GET request. One way in which GET and POST requests differ is that POST requests often have "side-effects": they change the state of the system in some way (for example by placing an order with the website for a hundredweight of tinned spam to be delivered to your door). Though the HTTP standard makes it clear that POSTs are intended to always cause side-effects, and GET requests never to cause side-effects, nothing prevents a GET request from having side-effects, nor a POST requests from having no side-effects. Data can also be passed in an HTTP GET request by encoding it in the URL itself.
v style="background-color: white; font-family: Arial, Helvetica, sans-serif; font-size: 12px; line-height: 21.33333396911621px;"> This is done as follows.
>>> import urllib2
>>> import urllib
>>> data = {}
>>> data['name'] = 'Somebody Here'
>>> data['location'] = 'Northampton'
>>> data['language'] = 'Python'
>>> url_values = urllib.urlencode(data)
>>> print url_values
>>> url = ''
>>> full_url = url + '?' + url_values
>>> data = urllib2.urlopen(full_url)
Notice that the full URL is created by adding a ? to the URL, followed by the encoded values.

Difference between POST and GET

GET requests a representation of the specified resource. Note that GET should not be used for operations that cause side-effects, such as using it for taking actions in web applications. One reason for this is that GET may be used arbitrarily by robots or crawlers, which should not need to consider the side effects that a request should cause.
POST submits data to be processed (e.g., from an HTML form) to the identified resource. The data is included in the body of the request. This may result in the creation of a new resource or the updates of existing resources or both.So essentially GET is used to retrieve remote data, and POST is used to insert/update remote data.


We'll discuss here one particular HTTP header, to illustrate how to add headers to your HTTP request.
Some websites (like google for example) dislike being browsed by programs. By default urllib2 identifies itself as Python-urllib/x.y (where x and y are the major and minor version numbers of the Python release, e.g. Python-urllib/2.5), which may confuse the site, or just plain not work. The way a browser identifies itself is through the User-Agent header [3]. When you create a Request object you can pass a dictionary of headers in. The following example makes the same request as above, but identifies itself as a version of Internet Explorer [4].
import urllib
import urllib2

url = ''
user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
values = {'name' : 'Michael Foord',
          'location' : 'Northampton',
          'language' : 'Python' }
headers = { 'User-Agent' : user_agent }

data = urllib.urlencode(values)
req = urllib2.Request(url, data, headers)
response = urllib2.urlopen(req)
the_page =
headers. Request(url, data, headers). The response also has two useful methods. See the section on info and geturl which comes after we have a look at what happens when things go wrong.

Handling Exceptions

urlopen raises URLError when it cannot handle a response.
HTTPError is the subclass of URLError.


Often, URLError is raised because there is no network connection (no route to the specified server), or the specified server doesn't exist. In this case, the exception raised will have a 'reason' attribute, which is a tuple containing an error code and a text error message.
>>> req = urllib2.Request('')
>>> try: urllib2.urlopen(req)
>>> except URLError, e:
>>>    print e.reason
(4, 'getaddrinfo failed')


Every HTTP response from the server contains a numeric "status code". Sometimes the status code indicates that the server is unable to fulfil the request. The default handlers will handle some of these responses for you (for example, if the response is a "redirection" that requests the client fetch the document from a different URL, urllib2 will handle that for you). For those it can't handle, urlopen will raise an HTTPError. Typical errors include '404' (page not found), '403' (request forbidden), and '401' (authentication required).

The HTTPError instance raised will have an integer 'code' attribute, which corresponds to the error sent by the server.

When an error is raised the server responds by returning an HTTP error code and an error page. You can use the HTTPError instance as a response on the page returned. This means that as well as the code attribute, it also has read, geturl, and info, methods.
>>> req = urllib2.Request('')
>>> try:
>>>     urllib2.urlopen(req)
>>> except URLError, e:
>>>     print e.code
>>>     print
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
<?xml-stylesheet href="./css/ht2html.css"
<html><head><title>Error 404: File Not Found</title>
...... etc...
Exception Handling

from urllib2 import Request, urlopen, URLError
req = Request(someurl)
    response = urlopen(req)
except URLError, e:
    if hasattr(e, 'reason'):
        print 'We failed to reach a server.'
        print 'Reason: ', e.reason
    elif hasattr(e, 'code'):
        print 'The server couldn\'t fulfill the request.'
        print 'Error code: ', e.code
    # everything is fine

info and geturl

The response returned by urlopen (or the HTTPError instance) has two useful methods info and geturl.
geturl - this returns the real URL of the page fetched. This is useful because urlopen (or the opener object used) may have followed a redirect. The URL of the page fetched may not be the same as the URL requested.
info - this returns a dictionary-like object that describes the page fetched, particularly the headers sent by the server. It is currently an httplib.HTTPMessage instance.
Typical headers include 'Content-length', 'Content-type', and so on. See the Quick Reference to HTTP Headers for a useful listing of HTTP headers with brief explanations of their meaning and use.

Openers and Handlers

When you fetch a URL you use an opener (an instance of the perhaps confusingly-namedurllib2.OpenerDirector). Normally we have been using the default opener - via urlopen - but you can create custom openers. Openers use handlers. All the "heavy lifting" is done by the handlers. Each handler knows how to open URLs for a particular URL scheme (http, ftp, etc.), or how to handle an aspect of URL opening, for example HTTP redirections or HTTP cookies.
You will want to create openers if you want to fetch URLs with specific handlers installed, for example to get an opener that handles cookies, or to get an opener that does not handle redirections.
To create an opener, instantiate an OpenerDirector, and then call .add_handler(some_handler_instance) repeatedly.
Alternatively, you can use build_opener, which is a convenience function for creating opener objects with a single function call. build_opener adds several handlers by default, but provides a quick way to add more and/or override the default handlers.
Other sorts of handlers you might want to can handle proxies, authentication, and other common but slightly specialised situations.
install_opener can be used to make an opener object the (global) default opener. This means that calls tourlopen will use the opener you have installed.
Opener objects have an open method, which can be called directly to fetch urls in the same way as the urlopenfunction: there's no need to call install_opener, except as a convenience.

Basic Authentication

To illustrate creating and installing a handler we will use the HTTPBasicAuthHandler. For a more detailed discussion of this subject - including an explanation of how Basic Authentication works - see the Basic Authentication Tutorial.
When authentication is required, the server sends a header (as well as the 401 error code) requesting authentication. This specifies the authentication scheme and a 'realm'. The header looks like : Www-authenticate: SCHEMErealm="REALM".
Www-authenticate: Basic realm="cPanel Users"
The client should then retry the request with the appropriate name and password for the realm included as a header in the request. This is 'basic authentication'. In order to simplify this process we can create an instance ofHTTPBasicAuthHandler and an opener to use this handler.
The HTTPBasicAuthHandler uses an object called a password manager to handle the mapping of URLs and realms to passwords and usernames. If you know what the realm is (from the authentication header sent by the server), then you can use a HTTPPasswordMgr. Frequently one doesn't care what the realm is. In that case, it is convenient to useHTTPPasswordMgrWithDefaultRealm. This allows you to specify a default username and password for a URL. This will be supplied in the absence of you providing an alternative combination for a specific realm. We indicate this by providing None as the realm argument to the add_password method.
The top-level URL is the first URL that requires authentication. URLs "deeper" than the URL you pass to .add_password() will also match.
# create a password manager
password_mgr = urllib2.HTTPPasswordMgrWithDefaultRealm()

# Add the username and password.
# If we knew the realm, we could use it instead of ``None``.
top_level_url = ""
password_mgr.add_password(None, top_level_url, username, password)

handler = urllib2.HTTPBasicAuthHandler(password_mgr)

# create "opener" (OpenerDirector instance)
opener = urllib2.build_opener(handler)

# use the opener to fetch a URL

# Install the opener.
# Now all calls to urllib2.urlopen use our opener.
In the above example we only supplied our HHTPBasicAuthHandler to build_opener. By default openers have the handlers for normal situations - ProxyHandlerUnknownHandler,HTTPHandlerHTTPDefaultErrorHandlerHTTPRedirectHandlerFTPHandler,FileHandlerHTTPErrorProcessor.
top_level_url is in fact either a full URL (including the 'http:' scheme component and the hostname and optionally the port number) e.g. " an "authority" (i.e. the hostname, optionally including the port number) e.g. "" or "" (the latter example includes a port number). The authority, if present, must NOT contain the "userinfo" component - for example "" is not correct.

No comments:

Post a Comment