Downloading files from the internet is a common task in programming. As a Python developer, you'll often need to retrieve files from URLs to work with them in your projects.
In this comprehensive guide, we'll explore five different methods to download files from URLs using Python.
Overview of the 5 Ways to Download Files from URLs in Python
Here's a quick overview of the various approaches we'll cover:
- urllib.request – Python's built-in HTTP request library
- requests – Popular third-party HTTP library for Python
- urllib3 – More robust alternative to urllib with connection pooling
- wget – Command line downloader tool also available as a Python package
- PyCURL – Python bindings for the cURL library
We'll look at code examples of using each of these libraries/tools to download a sample image file from a URL.
Let's get started!
urllib.request – Python's Built-in HTTP Library
Python has a built-in urllib.request module that makes HTTP requests simple. It is included in the Python standard library, so you don't need to install anything extra.
The urllib.request module provides functions like urlopen() to fetch URLs and urlretrieve() to download files from the internet.
Here's an example to download an image from a URL using urllib.request:
import urllib.request
url = 'https://imgs.xkcd.com/comics/python.png'
urllib.request.urlretrieve(url, 'xkcd_comic.png')
We import urllib.request, define the URL of the image, and call urlretrieve() to download it to a local file named xkcd_comic.png.
The urlretrieve() function takes the URL as its first argument and the filename to save to as its second. It fetches the resource from the internet and saves it to the local filesystem.
Some key advantages of using urllib.request:
- Simple and straightforward to use
- Included in standard library, no dependencies
- Handles HTTP, FTP and other URL schemes
- Supports features like proxies, cookies, authentication
On the downside, it lacks some more advanced capabilities like connection pooling and retries.
Overall, urllib.request is a handy way to download files from URLs when you don't want to add third-party dependencies.
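Besides urlretrieve(), the urlopen() function mentioned above returns a file-like response object, so you can stream the download straight to disk. A minimal sketch using the same sample image (the output filename is our own choice):

```python
import shutil
import urllib.request

url = 'https://imgs.xkcd.com/comics/python.png'

# urlopen() returns a file-like object we can stream from;
# copyfileobj() copies it to disk in chunks rather than all at once
with urllib.request.urlopen(url) as response, open('python.png', 'wb') as f:
    shutil.copyfileobj(response, f)
```

This avoids holding the whole file in memory, which matters once downloads grow beyond small images.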
requests – Powerful Third-Party HTTP Library
The requests library is one of the most popular third-party Python packages for working with HTTP. With over 18 million downloads per week on PyPI, requests has proven to be a favorite among Pythonistas.
To install requests:
pip install requests
Once installed, you can import requests and use it to download files:
import requests
url = 'https://imgs.xkcd.com/comics/python.png'
response = requests.get(url)
with open('comic.png', 'wb') as f:
    f.write(response.content)
We use requests.get() to send a GET request to the URL and fetch the image content. This content is available in the response.content attribute.
We open a file for writing bytes ('wb') and write the image data to it. This saves the downloaded image as comic.png.
Requests makes it easy to make HTTP calls with very little code. Some of its benefits include:
- Simple and elegant API
- Supports authentication, sessions, proxies
- Automatic connection pooling; configurable retries
- Works with JSON and other data formats
- Wide range of helper methods (status codes, headers etc)
Due to its simplicity and feature set, requests is likely the most commonly used package for downloading files in Python.
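The helper attributes mentioned above also make it easy to inspect a download before saving it. A small sketch using the same sample image:

```python
import requests

url = 'https://imgs.xkcd.com/comics/python.png'
response = requests.get(url)

print(response.status_code)              # e.g. 200 on success
print(response.ok)                       # True for any 2xx status
print(response.headers['Content-Type'])  # e.g. image/png
print(len(response.content), 'bytes')
```

Checking status and content type before writing to disk is a cheap way to avoid saving an HTML error page with a .png extension.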
urllib3 – Advanced Connection Pooling and Retries
The urllib3 library focuses on improving URL handling performance in Python. It provides advanced connection management options lacking in urllib.request.
Let's see an example using urllib3:
import urllib3
http = urllib3.PoolManager()
url = 'https://imgs.xkcd.com/comics/python.png'
response = http.request('GET', url)
with open('xkcd.png', 'wb') as f:
    f.write(response.data)
We create a PoolManager to manage connections. This handles pooling and reusing connections efficiently behind the scenes.
We make a GET request to the URL using http.request() and write the response data to a file.
Some key features of urllib3:
- Smart connection pooling and reuse
- Retries on failed requests and timeouts
- Thread safety and multiprocessing support
- Supports gzip, deflate encoding
- Helper methods for headers, cookies etc
It provides lower-level control than requests, with strong performance and robustness. urllib3 is a great choice for building custom HTTP clients with retries, connection pools, and so on.
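As an illustration of the retry support, urllib3's Retry class lets you tune retry counts and backoff; the specific numbers below are arbitrary choices for the sketch:

```python
import urllib3
from urllib3.util.retry import Retry

# Retry up to 3 times, backing off between attempts,
# and force a retry on these server error statuses
retries = Retry(total=3, backoff_factor=0.5, status_forcelist=[500, 502, 503])
http = urllib3.PoolManager(retries=retries)

response = http.request('GET', 'https://imgs.xkcd.com/comics/python.png')
print(response.status)  # 200 on success
```

The backoff_factor inserts a growing delay between attempts, which is kinder to a struggling server than hammering it immediately.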
wget – Command line Download Tool
wget is a popular command line utility for downloading files from the internet. It is designed for robustness over slow or unstable connections.
wget is also available as a Python module. To install:
pip install wget
Downloading a file is as easy as:
import wget
url = 'https://imgs.xkcd.com/comics/python.png'
wget.download(url, 'python.png')
We simply call wget.download() with the URL and filename, and it downloads the file for us.
Some notable features of the wget command line tool include:
- Resumes broken downloads
- Retries failed downloads
- Follows redirects
- Can download entire websites
- Supports proxies
- Offers many command line options
Since it is designed for the command line, the Python API lacks some flexibility. But it is very robust and great for simple file downloads.
PyCURL – Python interface for cURL
PyCURL provides Python bindings for the libcurl library. It allows making HTTP requests and working with a range of web protocols using the power of cURL.
To install PyCURL:
pip install pycurl
Here is an example of using PyCURL to download a file:
import pycurl
from io import BytesIO

buffer = BytesIO()
c = pycurl.Curl()
c.setopt(c.URL, 'https://imgs.xkcd.com/comics/python.png')
c.setopt(c.WRITEDATA, buffer)
c.perform()
c.close()
with open('comic.png', 'wb') as f:
    f.write(buffer.getvalue())
We create a BytesIO object to store the downloaded data. We initialize a pycurl Curl object, set the URL option, and direct the response data to the buffer.
The call to c.perform() executes the download request. Finally, we write the image data to a file.
PyCURL allows very fine-grained control over HTTP requests – cookies, authentication, proxies, and so on. Some key features:
- Access to the full libcurl API
- Tuning of connection options
- Callbacks and data streaming
- Multi-part file uploads
- Multi interface for driving concurrent transfers
If you need low-level control of HTTP requests, PyCURL is quite powerful albeit complex.
Which Method Should You Use?
We've explored several different approaches to downloading files using Python – some simple and others more robust.
Here are some guidelines on which option to pick:
- urllib.request – Good for simple, basic downloads using the standard library.
- requests – Best for most common cases. Simple yet powerful API.
- urllib3 – Choose when you need connection pooling, retries, etc.
- wget – Great for mirroring websites or robust command line downloads.
- PyCURL – Only if you need low-level access to tune HTTP requests.
For most purposes, the requests module is recommended as it provides the right combination of simplicity, features, and performance.
But the others are useful in certain situations. urllib.request is compact, wget is hardy, urllib3 is fast, and PyCURL is customizable.
Downloading Large Files in Python
So far we looked at examples of downloading small image files. When dealing with larger downloads, it is important to stream the data in chunks instead of loading the entire file in memory.
Let's see how to download large files in chunks using the requests module:
import requests
url = 'http://examples.com/large-file.zip'
r = requests.get(url, stream=True)
with open('large-file.zip', 'wb') as f:
    for chunk in r.iter_content(chunk_size=1024):
        if chunk:
            f.write(chunk)
We make the GET request with stream=True so the body is not downloaded immediately, then read it incrementally using iter_content().
The chunk_size argument to iter_content() sets the size of each chunk in bytes – 1024 here. The chunks are written to the file in a loop.
Doing this avoids loading the entire download in memory. This works even for multi-GB files.
We can also track download progress by comparing the number of bytes downloaded against the Content-Length header.
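A minimal sketch of that bookkeeping, using the small sample image from earlier so it stays quick:

```python
import requests

url = 'https://imgs.xkcd.com/comics/python.png'
r = requests.get(url, stream=True)
# Content-Length may be absent (e.g. chunked encoding), so default to 0
total = int(r.headers.get('content-length', 0))

downloaded = 0
with open('python.png', 'wb') as f:
    for chunk in r.iter_content(chunk_size=1024):
        f.write(chunk)
        downloaded += len(chunk)

if total:
    print(f'{downloaded}/{total} bytes ({downloaded * 100 // total}%)')
```

Note the guard: servers are not required to send Content-Length, so progress reporting should degrade gracefully when it is missing.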
Handling HTTP Errors and Exceptions
When working with HTTP requests, there are many potential errors that can occur:
- Server errors like 500 or 503
- Timeouts due to slow response
- Network failures and disconnections
- Authorization errors like 401 or 403
- Resource not found 404 errors
It is a good practice to handle these scenarios in your code.
Here is an example with basic error handling using the requests module:
import requests
url = 'https://examples.com/file.zip'
try:
    response = requests.get(url)
    response.raise_for_status()  # Raises an error for 4xx/5xx statuses
except requests.exceptions.HTTPError as e:
    print(f'HTTP Error: {e}')
except requests.exceptions.ConnectionError as e:
    print(f'Connection Error: {e}')
except requests.exceptions.Timeout as e:
    print(f'Timeout Error: {e}')
except requests.exceptions.RequestException as e:
    print(f'Error: {e}')
We try making the GET request inside a try block. The .raise_for_status() call raises an exception if the status code indicates an error.
Specific exception classes like HTTPError, ConnectionError, and Timeout are handled separately. A generic RequestException catches any other requests-related errors.
This ensures that HTTP-related issues while downloading files are handled properly in your Python program.
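Building on those exception classes, a small hypothetical helper (download_with_retries is our own name, not a requests API) can retry transient failures with exponential backoff:

```python
import time
import requests

def download_with_retries(url, dest, attempts=3, backoff=2):
    """Retry a download on transient errors, waiting longer each time."""
    for attempt in range(1, attempts + 1):
        try:
            r = requests.get(url, timeout=10)
            r.raise_for_status()
            with open(dest, 'wb') as f:
                f.write(r.content)
            return dest
        except requests.exceptions.RequestException:
            if attempt == attempts:
                raise  # out of attempts, surface the error to the caller
            time.sleep(backoff ** attempt)  # 2s, then 4s, then 8s, ...

download_with_retries('https://imgs.xkcd.com/comics/python.png', 'comic.png')
```

Catching the broad RequestException here is deliberate: any transient failure, whether a timeout or a 503, gets the same retry treatment.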
Progress Bars for File Downloads
To provide a better user experience, we can display a progress bar during large file downloads. Python has some useful modules for progress bars:
- tqdm – Quick simple progress bars. Just wrap any iterator.
- click – For command line progress bars.
- progressbar2 – More featured progress bars.
Here is a simple example using tqdm to show progress while downloading with requests:
import requests
from tqdm import tqdm
url = 'http://examples.com/large-file.zip'
r = requests.get(url, stream=True)
with open('large-file.zip', 'wb') as f:
    total_length = int(r.headers.get('content-length', 0))
    for chunk in tqdm(r.iter_content(chunk_size=1024), total=total_length // 1024, unit='KB'):
        if chunk:
            f.write(chunk)
We pass the iter_content() iterable to tqdm() to display a progress bar during the download loop.
The total parameter gives the expected number of chunks (the total size divided by the chunk size), and unit labels the progress in KB. This gives a nice progress indicator for large downloads.
Progress bars provide visual feedback to the user and work across different Python interfaces like console, Jupyter notebooks, and GUIs.
Downloading Files Asynchronously with Threads/Processes
For downloading multiple files in parallel, we can use threads or processes in Python.
The concurrent.futures module and other asynchronous libraries like asyncio make it easy to run downloads concurrently.
For example:
import concurrent.futures
import requests

url_list = ['https://url1.com/file1.zip', 'https://url2.com/file2.zip']

# Download function
def download_file(url):
    r = requests.get(url)
    filename = url.split('/')[-1]  # Extract filename from URL
    with open(filename, 'wb') as f:
        f.write(r.content)
    print(f'{filename} download complete!')

# Thread pool executor
with concurrent.futures.ThreadPoolExecutor() as executor:
    futures = [executor.submit(download_file, url) for url in url_list]
    for future in concurrent.futures.as_completed(futures):
        future.result()
We start multiple threads using a ThreadPoolExecutor. The download_file() function runs in each thread to download a particular URL.
The as_completed() function yields the futures as they finish. This allows parallel downloads using threads.
Similarly, processes can be used via a ProcessPoolExecutor. asyncio also provides async alternatives.
Concurrency helps speed up bulk downloads by processing in parallel.
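For completeness, one simple asyncio-flavored variant runs the same kind of blocking download function in worker threads via asyncio.to_thread (Python 3.9+); this is a sketch, not the only way to do it:

```python
import asyncio
import requests

def download_file(url, dest):
    """Ordinary blocking download, unchanged from the threaded version."""
    r = requests.get(url)
    with open(dest, 'wb') as f:
        f.write(r.content)
    return dest

async def download_all(jobs):
    # Run the blocking downloads in a thread pool, awaited concurrently
    return await asyncio.gather(
        *(asyncio.to_thread(download_file, url, dest) for url, dest in jobs)
    )

jobs = [('https://imgs.xkcd.com/comics/python.png', 'xkcd1.png')]
results = asyncio.run(download_all(jobs))
print(results)
```

This keeps the download code synchronous while still letting an async application fan out many downloads at once.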
Resuming Broken Downloads
Sometimes downloads may get interrupted halfway due to network failures or connection issues.
To handle such scenarios, we need to resume partial downloads by starting from where it left off.
With requests, we can do this by sending a Range header along with the stream=True argument:
import requests
url = 'https://examples.com/file.zip'
r = requests.get(url, stream=True, headers={'Range': 'bytes=500-999'})
with open('file.zip', 'ab') as f:
    for chunk in r.iter_content(1024):
        if chunk:
            f.write(chunk)
We make a GET request with a byte range header like bytes=500-999 to indicate the starting byte offset.
Opening the file in append mode ('ab') continues writing from the end of the existing partial file.
If the server handles the range request properly, it returns just the requested partial content.
For this to work correctly, the server must support range requests. The Accept-Ranges and Content-Range headers can be checked to verify this.
Resuming downloads comes in handy when working with large files over unstable connections.
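In practice, the starting offset usually comes from the size of the partial file already on disk. A sketch of that logic as a function (resume_download is our own name; any URL you pass it must come from a server that supports range requests):

```python
import os
import requests

def resume_download(url, dest, chunk_size=1024):
    """Resume a download from wherever the partial file left off."""
    # Start from the current size of the partial file, if any
    offset = os.path.getsize(dest) if os.path.exists(dest) else 0
    headers = {'Range': f'bytes={offset}-'} if offset else {}

    r = requests.get(url, stream=True, headers=headers)
    # 206 Partial Content means the server honored the range request;
    # otherwise start over from scratch
    mode = 'ab' if r.status_code == 206 else 'wb'
    with open(dest, mode) as f:
        for chunk in r.iter_content(chunk_size):
            if chunk:
                f.write(chunk)
```

Checking for status 206 before appending is the important detail: a server that ignores the Range header sends the whole file, and appending that would corrupt the download.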
Caching Downloads to Avoid Repeated Fetches
When your program needs to download files regularly, it can save bandwidth by caching contents locally. This avoids repeated remote downloads of unchanged resources.
The third-party cachecontrol package implements HTTP caching for requests.
To cache downloads, first install it:
pip install cachecontrol
Then we can wrap a requests session with caching:
import requests
import cachecontrol

session = cachecontrol.CacheControl(requests.Session())
response = session.get('https://examples.com/file.zip')
The cache works transparently, storing responses (in memory by default, or on disk with a file cache) and validating resources using cache headers like ETags.
Cached responses are returned for repeated requests without going to the network. This speeds up your program and reduces bandwidth usage.
cachecontrol supports pluggable cache backends, including in-memory, file, and Redis stores.
Caching is useful for downloading static assets that change infrequently. It complements bandwidth-saving measures like compression and minification.
Download Accelerators and Download Managers
Apart from the Python libraries we've discussed so far, there are also external download accelerator programs that can help speed up downloads.
Some popular open source download managers:
- axel – Lightweight CLI download accelerator
- uGet – Full featured Gtk+ downloader for Linux
- FreeRapid – Java-based multi-threaded downloader
These tools use techniques like connection reuse, parallel connections, and file segmentation to achieve faster speeds.
You can call these tools from within Python code using the subprocess module:
import subprocess
url = 'http://examples.com/large.iso'
subprocess.run(['axel', '-n', '10', url])  # 10 connections
This launches axel to download the file using 10 parallel connections for added speed.
Similarly, other download managers can be invoked as needed.
External download accelerators integrate nicely with Python scripts to optimize large file downloads.
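Since these are external programs, it is worth checking they exist before shelling out; the standard library's shutil.which does that portably (axel and the URL here are just the earlier example's placeholders):

```python
import shutil
import subprocess

url = 'http://examples.com/large.iso'

if shutil.which('axel'):
    # axel is on PATH; use 10 parallel connections
    subprocess.run(['axel', '-n', '10', url])
else:
    print('axel is not installed; fall back to a pure-Python download')
```

This keeps the script usable on machines where the accelerator is missing instead of crashing with a FileNotFoundError.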
Conclusion
In this comprehensive guide, we explored various methods and tools for downloading files using Python:
- The standard urllib.request module provides basic downloading capabilities.
- requests is the most popular third-party library, with an elegant API.
- urllib3 offers more advanced connection pooling and retry capabilities.
- The command line tool wget can be used for simple, robust downloads.
- PyCURL provides low-level control for customizing requests.
We also looked at techniques like streaming, progress bars, error handling, resuming and caching to improve the download experience.
The requests module is recommended for most download tasks as it strikes a nice balance of simplicity, features, and performance.
To take your skills further, you can explore asynchronous downloads using threads/processes and integrating external download accelerators.
Happy downloading!