I would like to share a few different ways to use Python to download files from a website.
Usually files are returned by clicking on links but sometimes there may be embedded files as well, for instance an image or PDF embedded into a web page.
We will be using the BeautifulSoup library here to parse the webpages and make them easier to navigate, but the heavy lifting is done by the urllib2 library, which is included with Python by default.
First we will have a look at the urllib2 library in Python. It allows opening webpages and files from the web using URLs.
To open an arbitrary URL, you can use

import urllib2
resp = urllib2.urlopen( 'http://www.testurl.com' )
The response is a file-like object returned by the server; its contents can be read with resp.read().
Next, we will use the BeautifulSoup library to view the webpage with ease. It is a very simple library that makes navigating through the HTML of webpages painless. You can get the library from here: http://www.crummy.com/software/BeautifulSoup/#Download
The library can sometimes be tricky to install, so you can get the tarball directly from http://www.crummy.com/software/BeautifulSoup/bs4/download/4.0/, unzip it, and place the bs4 folder in your project folder.
You can then import the library in Python as
from bs4 import BeautifulSoup
First, we will go through the basics of BeautifulSoup and its use in easily navigating through a webpage’s code.
A soup can be created from the response object returned by urllib2:
soup = BeautifulSoup( resp.read() )
Now is the time for some magic: you can easily process the soup using tags. For instance, to find all hyperlinks, you can use
links = soup.find_all( 'a' ) # links is a list of all hyperlink tags
Getting the URL from each object in the list is as easy as:
for link in links:
    # Process each link and get the URL value
    url = link.get( 'href' )
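As an aside, if installing BeautifulSoup proves troublesome, the same link extraction can be done with Python's built-in HTMLParser. This is just a minimal sketch (the sample HTML string is made up for illustration):

```python
try:
    from HTMLParser import HTMLParser   # Python 2
except ImportError:
    from html.parser import HTMLParser  # Python 3

class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag it sees."""
    def __init__(self):
        HTMLParser.__init__(self)
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag's attributes
        if tag == 'a':
            for name, value in attrs:
                if name == 'href':
                    self.links.append(value)

parser = LinkCollector()
parser.feed('<html><body><a href="/a.pdf">A</a> <a href="/b.jpg">B</a></body></html>')
print(parser.links)  # ['/a.pdf', '/b.jpg']
```

BeautifulSoup remains the more convenient choice; this is only a fallback when you cannot install it.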
Now let us see how to download files.
Suppose the file is embedded in the page's HTML; take the example of a JPEG embedded in the site.
We can first find the image in the page easily using Beautiful Soup by
images = soup.find_all( 'img' )
You can get the URL path for each image using the value of 'src':
for image in images:
    # Process each image tag and get the URL value
    filename = image.get( 'src' )
To get the file, you need to do something like
data = urllib2.urlopen( filename ).read()
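One thing to watch out for: the 'src' value is often a relative path (like /images/photo.jpg), and urlopen needs an absolute URL. The stdlib urljoin function resolves it against the page's URL. A small sketch, with a made-up page URL for illustration (the module moved between Python 2 and 3, hence the guarded import):

```python
try:
    from urlparse import urljoin        # Python 2
except ImportError:
    from urllib.parse import urljoin    # Python 3

page_url = 'http://www.testurl.com/gallery/index.html'  # the page we opened
src = '/images/photo.jpg'                               # value of the img 'src'

# Resolve the relative path against the page URL before calling urlopen
absolute = urljoin(page_url, src)
print(absolute)  # http://www.testurl.com/images/photo.jpg
```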
The final step is saving the file
with open( "myfile.jpeg", "wb" ) as code:
    code.write( data )
There might be another case, when the file is returned on clicking a link in a browser. In our case, it wouldn’t be a click but a request using
res = urllib2.urlopen( url )
Now, we need to identify that the response is a file. How do we do that?
The response header for a file is somewhat different from that of a webpage; it looks like
Content-Disposition: attachment; filename="filename.extension"
We can access the response header using
header = res.info()
and check whether the response has a Content-Disposition header in it.
It is as simple as doing
if 'Content-Disposition' in str( header ): # It is a file
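The check above, run against some sample header text (both header strings here are made up for illustration):

```python
def looks_like_file(header_text):
    """A response meant to be saved as a file carries a Content-Disposition header."""
    return 'Content-Disposition' in header_text

# Sample raw header text, made up for illustration
file_header = ('Content-Type: application/pdf\r\n'
               'Content-Disposition: attachment; filename="report.pdf"\r\n')
page_header = 'Content-Type: text/html; charset=utf-8\r\n'

print(looks_like_file(file_header))  # True
print(looks_like_file(page_header))  # False
```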
Now to download and save it, we can proceed the same way as before. Note that we write res.read(), the contents of the response, not the response object itself:

with open( "myfile", "wb" ) as code:
    code.write( res.read() )
You can get the file name as well from the Content-Disposition header.
A single line of Python does that:
filename = res.info()['Content-Disposition'].split( '=' )[-1].strip( '"' )
Basically, the line reads the Content-Disposition value from the response header, then splits it at the '=' sign. That leaves a list of strings, and we are interested only in the last one, i.e. "filename.extension". We drop the surrounding quotes using .strip( '"' ), and done, we have the filename of the attachment.
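Here is the same split/strip pipeline run on a sample header value (the value itself is made up for illustration):

```python
# Sample Content-Disposition value, as it would come out of res.info()
disposition = 'attachment; filename="yearly report.pdf"'

# Split at '=', keep the last piece, then strip the surrounding quotes
filename = disposition.split('=')[-1].strip('"')
print(filename)  # yearly report.pdf
```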
One important thing to note is that the filename may come back in the form File%20name.txt (for File name.txt), because URLs are percent-encoded into ASCII characters. See http://www.w3schools.com/tags/ref_urlencode.asp for more details.
It can easily be fixed by
import urllib
filename = urllib.unquote( filename )
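Note that unquote lives in the urllib module on Python 2 but moved to urllib.parse in Python 3; a guarded import covers both:

```python
try:
    from urllib import unquote        # Python 2
except ImportError:
    from urllib.parse import unquote  # Python 3

# Percent-decode the filename back to its original form
decoded = unquote('File%20name.txt')
print(decoded)  # File name.txt
```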
That’s all! We can now download and save files from websites using Python 🙂