Downloading files from the internet is a common task in programming. As a Python developer, you'll often need to retrieve files from URLs to work with them in your projects.
In this comprehensive guide, we'll explore five different methods to download files from URLs using Python.
Overview of the 5 Ways to Download Files from URLs in Python
Here's a quick overview of the various approaches we'll cover:
- urllib.request – Python's built-in HTTP request library
- requests – Popular third-party HTTP library for Python
- urllib3 – More robust alternative to urllib with connection pooling
- wget – Command line downloader tool also available as a Python package
- PyCURL – Python bindings for the cURL library
We'll look at code examples of using each of these libraries/tools to download a sample image file from a URL.
Let's get started!
urllib.request – Python's Built-in HTTP Library
Python has a built-in urllib.request module that makes HTTP requests simple. It is included in the Python standard library, so you don't need to install anything extra.
The urllib.request module provides functions like urlopen() to fetch URLs and urlretrieve() to download files from the internet.
Here's an example to download an image from a URL using urllib.request:
import urllib.request
url = 'https://imgs.xkcd.com/comics/python.png'
urllib.request.urlretrieve(url, 'xkcd_comic.png')
We import urllib.request, define the URL of the image, and call urlretrieve() to download it to a local file named xkcd_comic.png.
The urlretrieve() function takes the URL as its first argument and the filename to save to as its second. It fetches the resource from the internet and saves it to the local filesystem.
Some key advantages of using urllib.request:
- Simple and straightforward to use
- Included in standard library, no dependencies
- Handles HTTP, FTP and other URL schemes
- Supports features like proxies, cookies, authentication
On the downside, it lacks some more advanced capabilities like connection pooling and retries.
Overall, urllib.request is a handy way to download files from URLs when you don't want to add third-party dependencies.
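Besides urlretrieve(), the urlopen() function mentioned above returns a file-like response object, so you can stream the download straight to disk. A minimal sketch using the same sample image (the output filename is our own choice):

```python
import shutil
import urllib.request

url = 'https://imgs.xkcd.com/comics/python.png'

# urlopen() returns a file-like object we can stream from;
# copyfileobj() copies it to disk in chunks rather than all at once
with urllib.request.urlopen(url) as response, open('python.png', 'wb') as f:
    shutil.copyfileobj(response, f)
```

This avoids holding the whole file in memory, which matters once downloads grow beyond small images.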
requests – Powerful Third-Party HTTP Library
The requests library is one of the most popular third-party Python packages for working with HTTP. With over 18 million downloads per week on PyPI, requests has proven to be a favorite among Pythonistas.
To install requests:
pip install requests
Once installed, you can import requests and use it to download files:
import requests
url = 'https://imgs.xkcd.com/comics/python.png'
response = requests.get(url)
with open('comic.png', 'wb') as f:
    f.write(response.content)
We use requests.get() to send a GET request to the URL and fetch the image content. This content is available in the response.content attribute.
We open a file for writing bytes ('wb') and write the image data to it. This saves the downloaded image as comic.png.
Requests makes it easy to make HTTP calls with very little code. Some of its benefits include:
- Simple and elegant API
- Supports authentication, sessions, proxies
- Automatic connection pooling; configurable retries
- Works with JSON and other data formats
- Wide range of helper methods (status codes, headers etc)
Due to its simplicity and feature set, requests is likely the most commonly used package for downloading files in Python.
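The helper attributes mentioned above also make it easy to inspect a download before saving it. A small sketch using the same sample image:

```python
import requests

url = 'https://imgs.xkcd.com/comics/python.png'
response = requests.get(url)

print(response.status_code)              # e.g. 200 on success
print(response.ok)                       # True for any 2xx status
print(response.headers['Content-Type'])  # e.g. image/png
print(len(response.content), 'bytes')
```

Checking status and content type before writing to disk is a cheap way to avoid saving an HTML error page with a .png extension.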
urllib3 – Advanced Connection Pooling and Retries
The urllib3 library focuses on improving URL handling performance in Python. It provides advanced connection management options lacking in urllib.request.
Let's see an example using urllib3:
import urllib3
http = urllib3.PoolManager()
url = 'https://imgs.xkcd.com/comics/python.png'
response = http.request('GET', url)
with open('xkcd.png', 'wb') as f:
    f.write(response.data)
We create a PoolManager to manage connections. This handles pooling and reusing connections efficiently behind the scenes.
We make a GET request to the URL using http.request() and write the response data to a file.
Some key features of urllib3:
- Smart connection pooling and reuse
- Retries on failed requests and timeouts
- Thread safety and multiprocessing support
- Supports gzip, deflate encoding
- Helper methods for headers, cookies etc
It provides lower-level control than requests, with strong performance and robustness. urllib3 is a great choice for building custom HTTP clients with retries, connection pools, and so on.
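As an illustration of the retry support, urllib3's Retry class lets you tune retry counts and backoff; the specific numbers below are arbitrary choices for the sketch:

```python
import urllib3
from urllib3.util.retry import Retry

# Retry up to 3 times, backing off between attempts,
# and force a retry on these server error statuses
retries = Retry(total=3, backoff_factor=0.5, status_forcelist=[500, 502, 503])
http = urllib3.PoolManager(retries=retries)

response = http.request('GET', 'https://imgs.xkcd.com/comics/python.png')
print(response.status)  # 200 on success
```

The backoff_factor inserts a growing delay between attempts, which is kinder to a struggling server than hammering it immediately.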
wget – Command line Download Tool
wget is a popular command line utility for downloading files from the internet. It is designed for robustness over slow or unstable connections.
wget is also available as a Python module. To install:
pip install wget
Downloading a file is as easy as:
import wget
url = 'https://imgs.xkcd.com/comics/python.png'
wget.download(url, 'python.png')
We simply call wget.download() with the URL and filename, and it downloads the file for us.
Some notable features of the wget command line tool include:
- Resumes broken downloads
- Retries failed downloads
- Follows redirects
- Can download entire websites
- Supports proxies
- Offers many command line options
Since it is designed for the command line, the Python API lacks some flexibility. But it is very robust and great for simple file downloads.
PyCURL – Python interface for cURL
PyCURL provides Python bindings for the libcurl library. It allows making HTTP requests and working with a range of web protocols using the power of cURL.
To install PyCURL:
pip install pycurl
Here is an example of using PyCURL to download a file:
import pycurl
from io import BytesIO

buffer = BytesIO()
c = pycurl.Curl()
c.setopt(c.URL, 'https://imgs.xkcd.com/comics/python.png')
c.setopt(c.WRITEDATA, buffer)
c.perform()
c.close()
with open('comic.png', 'wb') as f:
    f.write(buffer.getvalue())
We create a BytesIO object to store the downloaded data. We initialize a pycurl Curl object, set the URL option, and direct the response data to the buffer.
The call to c.perform() executes the download request. Finally, we write the image data to a file.
PyCURL allows very fine-grained control over HTTP requests – cookies, authentication, proxies, and so on. Some key features:
- Access to the full libcurl API
- Tuning of connection options
- Callbacks and data streaming
- Multi-part file uploads
- Multi interface for driving concurrent transfers
If you need low-level control of HTTP requests, PyCURL is quite powerful albeit complex.
Which Method Should You Use?
We've explored several different approaches to downloading files using Python – some simple and others more robust.
Here are some guidelines on which option to pick:
- urllib.request – Good for simple, basic downloads using the standard library.
- requests – Best for most common cases. Simple yet powerful API.
- urllib3 – Choose when you need connection pooling, retries, etc.
- wget – Great for mirroring websites or robust command line downloads.
- PyCURL – Only if you need low-level access to tune HTTP requests.
For most purposes, the requests module is recommended as it provides the right combination of simplicity, features, and performance.
But the others are useful in certain situations. urllib.request is compact, wget is hardy, urllib3 is fast, and PyCURL is customizable.
Downloading Large Files in Python
So far we looked at examples of downloading small image files. When dealing with larger downloads, it is important to stream the data in chunks instead of loading the entire file in memory.
Let's see how to download large files in chunks using the requests module:
import requests
url = 'http://examples.com/large-file.zip'
r = requests.get(url, stream=True)
with open('large-file.zip', 'wb') as f:
    for chunk in r.iter_content(chunk_size=1024):
        if chunk:
            f.write(chunk)
We make the GET request with stream=True so the body is not downloaded immediately, then read it incrementally using iter_content().
The chunk_size argument to iter_content() sets the size of each chunk in bytes – 1024 here. The chunks are written to the file in a loop.
Doing this avoids loading the entire download in memory. This works even for multi-GB files.
We can also track download progress by comparing the number of bytes downloaded against the Content-Length header.
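A minimal sketch of that bookkeeping, using the small sample image from earlier so it stays quick:

```python
import requests

url = 'https://imgs.xkcd.com/comics/python.png'
r = requests.get(url, stream=True)
# Content-Length may be absent (e.g. chunked encoding), so default to 0
total = int(r.headers.get('content-length', 0))

downloaded = 0
with open('python.png', 'wb') as f:
    for chunk in r.iter_content(chunk_size=1024):
        f.write(chunk)
        downloaded += len(chunk)

if total:
    print(f'{downloaded}/{total} bytes ({downloaded * 100 // total}%)')
```

Note the guard: servers are not required to send Content-Length, so progress reporting should degrade gracefully when it is missing.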
Handling HTTP Errors and Exceptions
When working with HTTP requests, there are many potential errors that can occur:
- Server errors like 500 or 503
- Timeouts due to slow response
- Network failures and disconnections
- Authorization errors like 401 or 403
- Resource not found 404 errors
It is a good practice to handle these scenarios in your code.
Here is an example with basic error handling using the requests module:
import requests
url = 'https://examples.com/file.zip'
try:
    response = requests.get(url)
    response.raise_for_status()  # Raises an error for 4xx/5xx statuses
except requests.exceptions.HTTPError as e:
    print(f'HTTP Error: {e}')
except requests.exceptions.ConnectionError as e:
    print(f'Connection Error: {e}')
except requests.exceptions.Timeout as e:
    print(f'Timeout Error: {e}')
except requests.exceptions.RequestException as e:
    print(f'Error: {e}')
We try making the GET request inside a try block. The .raise_for_status() call raises an exception if the status code indicates an error.
Specific exception classes like HTTPError, ConnectionError, and Timeout are handled separately. A generic RequestException catches any other requests-related errors.
This ensures that HTTP-related issues while downloading files are handled properly in your Python program.
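Building on those exception classes, a small hypothetical helper (download_with_retries is our own name, not a requests API) can retry transient failures with exponential backoff:

```python
import time
import requests

def download_with_retries(url, dest, attempts=3, backoff=2):
    """Retry a download on transient errors, waiting longer each time."""
    for attempt in range(1, attempts + 1):
        try:
            r = requests.get(url, timeout=10)
            r.raise_for_status()
            with open(dest, 'wb') as f:
                f.write(r.content)
            return dest
        except requests.exceptions.RequestException:
            if attempt == attempts:
                raise  # out of attempts, surface the error to the caller
            time.sleep(backoff ** attempt)  # 2s, then 4s, then 8s, ...

download_with_retries('https://imgs.xkcd.com/comics/python.png', 'comic.png')
```

Catching the broad RequestException here is deliberate: any transient failure, whether a timeout or a 503, gets the same retry treatment.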
Progress Bars for File Downloads
To provide a better user experience, we can display a progress bar during large file downloads. Python has some useful modules for progress bars:
- tqdm – Quick simple progress bars. Just wrap any iterator.
- click – For command line progress bars.
- progressbar2 – More featured progress bars.
Here is a simple example using tqdm to show progress while downloading with requests:
import requests
from tqdm import tqdm
url = 'http://examples.com/large-file.zip'
r = requests.get(url, stream=True)
with open('large-file.zip', 'wb') as f:
    total_length = int(r.headers.get('content-length', 0))
    for chunk in tqdm(r.iter_content(chunk_size=1024), total=total_length // 1024, unit='KB'):
        if chunk:
            f.write(chunk)
We pass the iter_content() iterable to tqdm() to display a progress bar during the download loop.
The total parameter gives the expected number of chunks (the total size divided by the chunk size), and unit labels the progress in KB. This gives a nice progress indicator for large downloads.
Progress bars provide visual feedback to the user and work across different Python interfaces like console, Jupyter notebooks, and GUIs.
Downloading Files Asynchronously with Threads/Processes
For downloading multiple files in parallel, we can use threads or processes in Python.
The concurrent.futures module and other asynchronous libraries like asyncio make it easy to run downloads concurrently.
For example:
import concurrent.futures
import requests

url_list = ['https://url1.com/file1.zip', 'https://url2.com/file2.zip']

# Download function
def download_file(url):
    r = requests.get(url)
    filename = url.split('/')[-1]  # Extract filename from URL
    with open(filename, 'wb') as f:
        f.write(r.content)
    print(f'{filename} download complete!')

# Thread pool executor
with concurrent.futures.ThreadPoolExecutor() as executor:
    futures = [executor.submit(download_file, url) for url in url_list]
    for future in concurrent.futures.as_completed(futures):
        future.result()
We start multiple threads using a ThreadPoolExecutor. The download_file() function runs in each thread to download a particular URL.
The as_completed() function yields the futures as they finish. This allows parallel downloads using threads.
Similarly, processes can be used via a ProcessPoolExecutor. asyncio also provides async alternatives.
Concurrency helps speed up bulk downloads by processing in parallel.
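For completeness, one simple asyncio-flavored variant runs the same kind of blocking download function in worker threads via asyncio.to_thread (Python 3.9+); this is a sketch, not the only way to do it:

```python
import asyncio
import requests

def download_file(url, dest):
    """Ordinary blocking download, unchanged from the threaded version."""
    r = requests.get(url)
    with open(dest, 'wb') as f:
        f.write(r.content)
    return dest

async def download_all(jobs):
    # Run the blocking downloads in a thread pool, awaited concurrently
    return await asyncio.gather(
        *(asyncio.to_thread(download_file, url, dest) for url, dest in jobs)
    )

jobs = [('https://imgs.xkcd.com/comics/python.png', 'xkcd1.png')]
results = asyncio.run(download_all(jobs))
print(results)
```

This keeps the download code synchronous while still letting an async application fan out many downloads at once.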
Resuming Broken Downloads
Sometimes downloads may get interrupted halfway due to network failures or connection issues.
To handle such scenarios, we need to resume partial downloads by starting from where it left off.
With requests, we can do this by sending a Range header along with the stream=True argument:
import requests
url = 'https://examples.com/file.zip'
r = requests.get(url, stream=True, headers={'Range': 'bytes=500-999'})
with open('file.zip', 'ab') as f:
    for chunk in r.iter_content(1024):
        if chunk:
            f.write(chunk)
We make a GET request with a byte range header like bytes=500-999 to indicate the starting byte offset.
Opening the file in append mode ('ab') continues writing from the end of the existing partial file.
If the server handles the range request properly, it returns just the requested partial content.
For this to work correctly, the server must support range requests. The Accept-Ranges and Content-Range headers can be checked to verify this.
Resuming downloads comes in handy when working with large files over unstable connections.
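In practice, the starting offset usually comes from the size of the partial file already on disk. A sketch of that logic as a function (resume_download is our own name; any URL you pass it must come from a server that supports range requests):

```python
import os
import requests

def resume_download(url, dest, chunk_size=1024):
    """Resume a download from wherever the partial file left off."""
    # Start from the current size of the partial file, if any
    offset = os.path.getsize(dest) if os.path.exists(dest) else 0
    headers = {'Range': f'bytes={offset}-'} if offset else {}

    r = requests.get(url, stream=True, headers=headers)
    # 206 Partial Content means the server honored the range request;
    # otherwise start over from scratch
    mode = 'ab' if r.status_code == 206 else 'wb'
    with open(dest, mode) as f:
        for chunk in r.iter_content(chunk_size):
            if chunk:
                f.write(chunk)
```

Checking for status 206 before appending is the important detail: a server that ignores the Range header sends the whole file, and appending that would corrupt the download.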
Caching Downloads to Avoid Repeated Fetches
When your program needs to download files regularly, it can save bandwidth by caching contents locally. This avoids repeated remote downloads of unchanged resources.
The third-party cachecontrol package implements HTTP caching for requests.
To cache downloads, first install it:
pip install cachecontrol
Then we can wrap a requests session with caching:
import requests
import cachecontrol

session = cachecontrol.CacheControl(requests.Session())
response = session.get('https://examples.com/file.zip')
The cache works transparently, storing responses (in memory by default, or on disk with a file cache) and validating resources using cache headers like ETags.
Cached responses are returned for repeated requests without going to the network. This speeds up your program and reduces bandwidth usage.
cachecontrol supports pluggable cache backends, including in-memory, file, and Redis stores.
Caching is useful for downloading static assets that change infrequently. It complements bandwidth-saving measures like compression and minification.
Download Accelerators and Download Managers
Apart from the Python libraries we've discussed so far, there are also external download accelerator programs that can help speed up downloads.
Some popular open source download managers:
- axel – Lightweight CLI download accelerator
- uGet – Full featured Gtk+ downloader for Linux
- FreeRapid – Java-based multi-threaded downloader
These tools use techniques like connection reuse, parallel connections, and file segmentation to achieve faster speeds.
You can call these tools from within Python code using the subprocess module:
import subprocess
url = 'http://examples.com/large.iso'
subprocess.run(['axel', '-n', '10', url])  # 10 connections
This launches axel to download the file using 10 parallel connections for added speed.
Similarly, other download managers can be invoked as needed.
External download accelerators integrate nicely with Python scripts to optimize large file downloads.
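Since these are external programs, it is worth checking they exist before shelling out; the standard library's shutil.which does that portably (axel and the URL here are just the earlier example's placeholders):

```python
import shutil
import subprocess

url = 'http://examples.com/large.iso'

if shutil.which('axel'):
    # axel is on PATH; use 10 parallel connections
    subprocess.run(['axel', '-n', '10', url])
else:
    print('axel is not installed; fall back to a pure-Python download')
```

This keeps the script usable on machines where the accelerator is missing instead of crashing with a FileNotFoundError.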
Conclusion
In this comprehensive guide, we explored various methods and tools for downloading files using Python:
- The standard urllib.request module provides basic downloading capabilities.
- requests is the most popular third-party library, with an elegant API.
- urllib3 offers more advanced connection pooling and retry capabilities.
- The command line tool wget can be used for simple, robust downloads.
- PyCURL provides low-level control for customizing requests.
We also looked at techniques like streaming, progress bars, error handling, resuming and caching to improve the download experience.
The requests module is recommended for most download tasks as it strikes a nice balance of simplicity, features, and performance.
To take your skills further, you can explore asynchronous downloads using threads/processes and integrating external download accelerators.
Happy downloading!