The following code requires
- The requests package which can be installed from the command line or terminal using the following command:
Code: Select all
pip install requests
- Python 3.6 or newer. To check your Python version, launch the Python interpreter; the version is printed on the first line.
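You can also check the version from the command line without starting an interactive session (on some systems the command is "python" rather than "python3"):

```shell
# Prints something like "Python 3.11.4"
python3 --version
```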
- An Earthdata login token. If you don't have one, you can generate one at https://urs.earthdata.nasa.gov/.
- Copy the example script. Save the script text to a file ending in .py. Advanced users may want to edit the code to add features and more advanced filtering.
- Select a top level URL. Search https://asdc.larc.nasa.gov/data/ for a directory or file that contains the data you want to download.
- Find your Earthdata login token. You can generate a token or copy an existing token by going to https://urs.earthdata.nasa.gov/ and selecting the "Generate Token" option from the top menu. Please note the token's expiration date; after that date a new token must be generated.
- Add your URL and token to the script.
- Run the script.
Code: Select all
import requests
from pathlib import Path
url = "<the URL of the file you want to download>"
token = "<your token>"
header = {"Authorization": f"Bearer {token}"}
response = requests.get(url, headers=header)
content = response.content
file_name = url.split('/')[-1]
data_path = Path(file_name)
data_path.write_bytes(content)
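The minimal script above saves whatever the server returns, even an HTML error page if the token or URL is wrong. A hardened sketch of the same idea (the function names and the 120-second timeout are illustrative choices, not part of the original script):

```python
from pathlib import Path

import requests

def file_name_for(url):
    # The local file name is the last component of the URL path.
    return url.rstrip("/").split("/")[-1]

def download_file(url, token, out_dir=Path(".")):
    # Fail loudly on 401 (bad token) or 404 (bad URL) instead of
    # silently writing the error page to disk.
    response = requests.get(url, headers={"Authorization": f"Bearer {token}"}, timeout=120)
    response.raise_for_status()
    dest = out_dir / file_name_for(url)
    dest.write_bytes(response.content)
    return dest
```

Called as download_file(url, token), this saves the file next to the script, just like the original.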
- Copy the example script. Save the script text to a file ending in .py. Advanced users may want to edit the code to add features and more advanced filtering.
- Select a top level URL. Search https://asdc.larc.nasa.gov/data/ for a directory that contains the data you want to download.
- Find your Earthdata login token. You can generate a token or copy an existing token by going to https://urs.earthdata.nasa.gov/ and selecting the "Generate Token" option from the top menu. Please note the token's expiration date; after that date a new token must be generated.
- Run the script.
- When prompted, provide your top level URL. You can also enter "test" to download a small dataset, letting you confirm the script is working.
- When prompted, paste your token into the window. Please note, much like a website's password entry field, the value you enter is hidden. Simply paste the token and hit enter. Some systems clear the clipboard once you paste the token, so you may have to copy it again if you want to rerun the script.
- Wait for the script to find links. The script will check each page in the hierarchy and collect a list of file links. Depending on the number of pages that must be checked, this may take a while.
- Let the script know if you want to remove existing files. If you've downloaded the data before and only want updates, you may want to leave the existing files. The script will only re-download previously downloaded data if the file sizes have changed.
- Verify you have enough space for the download. The script will use the file headers to get the size of each file to be downloaded and report the total download size in MB. Ensure you have enough drive space for your download.
- Download the files. The script will download any files that are not already in the data folder, or that are in the data folder but have changed in size.
Code: Select all
from getpass import getpass
from http.client import NOT_FOUND, UNAUTHORIZED
from pathlib import Path
from requests import Session
def url_to_path(url, output_dir):
    # The local file name is the last component of the URL
    return output_dir.joinpath(url.split('/')[-1])

print("Welcome to the ASDC Download Script!\nThis script downloads data from https://asdc.larc.nasa.gov/data/")
with Session() as session:
    # get login
    url = input("Enter the top level URL (you can also enter 'test' to download a small dataset)\n\turl: ")
    if url == "test":
        url = "https://asdc.larc.nasa.gov/data/AJAX/CH2O_1/"
    token = getpass("Enter your token, if you don't have a token, get one from https://urs.earthdata.nasa.gov/\n\ttoken: ")
    if not token:
        print("Token cannot be blank, exiting.")
        exit()
    session.headers = {"Authorization": f"Bearer {token}"}
    # verify login works
    response = session.get(url)
    if not response.ok:
        if response.status_code == UNAUTHORIZED:
            print("Earthdata Login responded with Unauthorized, did you enter a valid token?")
            exit()
        if response.status_code == NOT_FOUND:
            print("The top level URL does not exist, select a URL within https://asdc.larc.nasa.gov/data/")
            exit()
    output_dir = Path('data')
    # get a list of all urls (appending to `pages` while iterating walks the whole hierarchy)
    pages = [url]
    file_urls = []
    print("Getting file links")
    for i, page in enumerate(pages):
        print(f"Checking {page} for links", end="\r", flush=True)
        response = session.get(page)
        if not response.ok:
            if response.status_code == NOT_FOUND:
                print(f"The following page was not found: {page}")
            else:
                print(f"Received {response.reason} status for {page}")
            continue
        content = response.content.decode('utf-8')
        if '<table id="indexlist">' not in content:
            print(f"Data table not found for {page}")
            continue
        table_content = content.split('<table id="indexlist">')[-1].split('</table>')[0]
        hrefs = {part.split('"')[0] for i, part in enumerate(table_content.split('href="')) if i}
        for href in hrefs:
            if href.endswith('/'):
                pages.append(page + href)
            else:
                file_urls.append(page + href)
    if not file_urls:
        print("No files found, exiting.")
        exit()
    # offer to remove existing data
    output_dir.mkdir(exist_ok=True)
    if any(output_dir.iterdir()):
        if input(f"There's already data in {output_dir.absolute()}, \n\tRemove it? [y/n]: ") == "y":
            for path in output_dir.iterdir():
                path.unlink()
    # get a list of new files (skip already downloaded files if the size is unchanged)
    print("Getting size")
    total_size = 0
    file_count = len(file_urls)
    new_files = []
    for i, url in enumerate(file_urls):
        print(f"Getting size for file {i+1} of {file_count}", end="\r", flush=True)
        _response = session.head(url)
        size = int(_response.headers.get('content-length', 0))
        if url_to_path(url, output_dir).exists() and size == url_to_path(url, output_dir).stat().st_size:
            continue
        total_size += size
        new_files.append(url)
    if not new_files:
        print("No new files, exiting.")
        exit()
    if input(f"Found {len(new_files)} files totaling {total_size // 1024**2} MB in {output_dir.absolute()}.\n\tDownload [y/n]: ") == 'n':
        exit()
    # download files
    for i, url in enumerate(new_files):
        print(f"Downloading file {i+1} of {len(new_files)}", end="\r", flush=True)
        _response = session.get(url)
        with url_to_path(url, output_dir).open('wb') as file:
            file.write(_response.content)
    print("\nDownload Complete")
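The script above finds links by string-splitting on href=" inside the index table, which works for ASDC's current page layout but is brittle. A more robust sketch using only the standard library's html.parser (the table id "indexlist" is taken from the script above; the sample HTML is illustrative):

```python
from html.parser import HTMLParser

class IndexLinkParser(HTMLParser):
    """Collect href values from anchors inside the <table id="indexlist"> element."""

    def __init__(self):
        super().__init__()
        self.in_index = False  # True while we are inside the index table
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "table" and attrs.get("id") == "indexlist":
            self.in_index = True
        elif tag == "a" and self.in_index and "href" in attrs:
            self.hrefs.append(attrs["href"])

    def handle_endtag(self, tag):
        if tag == "table":
            self.in_index = False

parser = IndexLinkParser()
parser.feed('<table id="indexlist"><tr><td><a href="sub/">sub/</a></td>'
            '<td><a href="file.hdf">file.hdf</a></td></tr></table>')
print(parser.hrefs)  # hrefs ending with "/" are subdirectories, the rest are files
```

The same ends-with-"/" convention as in the script then separates pages to crawl from files to download.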