Downloading ASDC Data with Python 3

njester
Posts: 8
Joined: Sat Mar 06, 2021 9:03 am America/New_York

Downloading ASDC Data with Python 3

by njester » Mon May 24, 2021 11:49 am America/New_York

Below are two example scripts: the first downloads a single file and the second downloads multiple files.

The following code requires
  1. The requests package which can be installed from the command line or terminal using the following command:

    Code: Select all

    pip install requests
  2. Python 3.6 or newer. To check Python's version, run python --version from the command line or terminal, or launch the Python application and read the version printed on the first line.
  3. Earthdata login credentials. If you don't have Earthdata credentials, you can set up an account at https://urs.earthdata.nasa.gov.
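The version requirement in item 2 can also be checked from inside a script. This is a minimal sketch using only the standard library; the helper name meets_minimum is illustrative and is not part of the scripts in this post.

```python
import sys


def meets_minimum(version_info=sys.version_info, minimum=(3, 6)):
    """Return True when the given version is at least `minimum`."""
    return tuple(version_info[:2]) >= tuple(minimum)


if __name__ == "__main__":
    print(f"Python {sys.version_info.major}.{sys.version_info.minor}")
    if not meets_minimum():
        print("Python 3.6 or newer is needed for the scripts in this post.")
```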
Downloading A File
The following example code downloads a single file from the ASDC using Python. For downloading multiple files, use the second code example or wget.

The code sets the variables needed for the download and prompts for your Earthdata username and password. After authentication, the file is downloaded to your current working directory.

Code: Select all

from getpass import getpass
from pathlib import Path

from requests import Session

# This code will prompt you to enter your username and password
username = input("Earthdata username: ")
password = getpass("Earthdata password: ")

# Replace this URL with the URL you want to download
url = "https://asdc.larc.nasa.gov/data/ACEPOL/MetNav_AircraftInSitu_Data_1/ACEPOL-HSKPING_ER2_20171019_R0.ict"

# The file is saved to your current working directory (where you ran python from)
file_name = Path(url).name

session = Session()
session.auth = (username, password)
# Request the file, then request the URL the first response was redirected to
_redirect = session.get(url)
_response = session.get(_redirect.url)
# Stop with an error on a failed request (for example 401 Unauthorized) instead of saving the error page
_response.raise_for_status()
with open(file_name, 'wb') as file:
    file.write(_response.content)
Download Multiple Files
The following code can be used to download multiple files within an asdc.larc.nasa.gov/data directory. Simply find a top level directory that contains all the files to be downloaded.

To run the script:
  1. Define The Top Level URL
    When the script starts, it prompts for the top level URL of the data you want to download. Entering test instead of a URL uses a small, 20 MB directory.
  2. Add Filters
    The script then prompts for the file types to download. Enter one or more extensions separated by commas (for example .ict, .h5) to download only the matching files. If you want to download all files, leave the prompt blank.
  3. To Run the Code
    This code should be added to a file with the .py extension. Once this code has been added to the file, it can be run by double clicking the file, or from the command line or terminal with python followed by the file name.
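In the script below, the answer to the file-type prompt is split on commas, and a file is kept when its name ends with any of the entries, ignoring case. Because every string ends with the empty string, a blank answer keeps every file. Here is a small standalone sketch of that matching; the names parse_filters and matches are illustrative and do not appear in the script.

```python
from typing import List


def parse_filters(filter_text: str) -> List[str]:
    """Split a comma separated answer such as '.ict, .h5' into a list."""
    return [part.strip() for part in filter_text.split(',')]


def matches(file_name: str, filters: List[str]) -> bool:
    """Return True when the file name ends with any filter, ignoring case."""
    return any(file_name.lower().endswith(f.lower()) for f in filters)


if __name__ == "__main__":
    wanted = parse_filters(".ict, .h5")
    print(wanted)
    print(matches("ACEPOL-HSKPING_ER2_20171019_R0.ict", wanted))
    # A blank answer becomes [''], which every file name matches
    print(matches("granule.hdf", parse_filters("")))
```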

Code: Select all

from getpass import getpass
from html.parser import HTMLParser
from pathlib import Path
from typing import List

from requests import Response, Session


# This class is used to process an HTML page and collect the href of every link
class MyHTMLParser(HTMLParser):
    def __init__(self, *args, **kwargs) -> None:
        super().__init__(*args, **kwargs)
        self.hrefs: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for (attr, value) in attrs:
                if attr == "href":
                    self.hrefs.append(value)

    def get_hrefs(self, response: Response) -> List[str]:
        self.feed(response.text)
        return self.hrefs

def get_session() -> Session:
    # This code will prompt you to enter your username and password
    username = input("Earthdata username: ")
    password = getpass("Earthdata password: ")

    # create a session that can be used to log in
    session = Session()
    session.auth = (username, password)
    
    return session

def is_correct_ext(file: str, filters: List[str]) -> bool:
    # return True if the file name ends with any of the requested extensions
    for ext in filters:
        if file.lower().endswith(ext.lower()):
            return True
    return False

def get_files_to_download(session: Session, filter_list: List[str]) -> List[str]:

    # this function returns true if a href is a child ref
    is_child_href = lambda href: not href.startswith('http') and not (href.startswith('/') or href.startswith('#') or href.startswith('mailto'))

    # walk the directory to find all directory and file URLs
    print("Getting download files.")
    urls = [starting_url]
    files = []
    for url in urls:

        # get the page's contents
        print(f"Checking {url}")
        _redirect = session.get(url)
        _response = session.get(_redirect.url)

        # this class populates the hrefs list
        parser = MyHTMLParser()
        hrefs = parser.get_hrefs(_response)
        for href in hrefs:
            new_url = url+href
            if is_child_href(href) and href != url:
                if href.endswith('/'):
                    urls.append(new_url)
                elif is_correct_ext(new_url, filter_list) and new_url not in files:
                    files.append(new_url)
    return files

def verify_download(session: Session, files: List[str], output_dir: Path) -> List[str]:
    # create a data directory in the current working directory to store the downloaded files
    output_dir.mkdir(exist_ok=True)

    # should existing files be overwritten
    print("Checking for existing files")
    local_files = [file.name.lower() for file in output_dir.iterdir()]
    existing_files = list(filter(lambda file: Path(file).name.lower() in local_files, files))
    if existing_files and input(f"Overwrite {len(existing_files)} existing files? [y/n]: ").lower() not in ('y', 'ye', 'yes'):
        files = list(filter(lambda file: file not in existing_files, files))

    # calculate total file size so that the user can verify they have enough space
    print("Getting download size.")
    total_size = 0
    if not files:
        print("No files matched the filter or no files were found in the directory, exiting")
        exit()
    for file in files:
        _redirect = session.head(file)
        _response = session.head(_redirect.url)
        # a missing content-length header is counted as 0 bytes
        total_size += int(_response.headers.get('content-length', 0))

    # inform the user before starting download
    if input(f"Download {len(files)} files, {total_size // 1024**2} MB? [y/n]: ").lower() not in ('y', 'ye', 'yes'):
        print("Exiting, consider adding more filters or starting at a lower level folder")
        exit()

    return files

def download(session: Session, files: List[str], output_dir: Path) -> None:
    # The following code downloads the files
    print(f"Downloading {len(files)} files.")
    for i, file in enumerate(files):
        print(f"Downloading file {i+1} of {len(files)}", end="\r")
        file_path = output_dir.joinpath(file.split('/')[-1])
        # request the file, then request the URL the first response was redirected to
        _redirect = session.get(file)
        _response = session.get(_redirect.url)
        with file_path.open('wb') as out_file:
            out_file.write(_response.content)

if __name__ == "__main__":
    # This URL is the starting directory
    starting_url = input("Enter the top level URL: ")
    # enter "test" for the URL to download from a small, 20 MB directory
    if starting_url == "test":
        starting_url = "https://asdc.larc.nasa.gov/data/MISR/MI3DCLDN.002/2000.09.20/"
    # links found on each page are appended to the URL, so it must end with a slash
    if not starting_url.endswith('/'):
        starting_url += '/'

    # the output directory Path('data') saves the files to a folder called 'data' in the current working directory
    output_dir = Path('data')

    filter_text = input("File types to download. (Ex: .ict, .h5, blank for all): ")
    filter_list = [filter.strip() for filter in filter_text.split(',')]

    session = get_session()
    files = get_files_to_download(session, filter_list)
    files = verify_download(session, files, output_dir)
    download(session, files, output_dir)
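The directory walk in the script relies on reading every link on a page and keeping only the relative ones: links ending in a slash are treated as subdirectories and the rest as files. The sketch below repeats that idea on a made-up listing page using only the standard library. The class name LinkCollector and the sample HTML are illustrative, and the real ASDC pages may contain other kinds of links.

```python
from html.parser import HTMLParser
from typing import List


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""

    def __init__(self) -> None:
        super().__init__()
        self.hrefs: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.hrefs.extend(value for name, value in attrs if name == "href" and value)


def is_child_href(href: str) -> bool:
    """Same rule as the script: skip absolute, root-relative, anchor and mail links."""
    return not href.startswith(("http", "/", "#", "mailto"))


# A made-up listing page, only for illustration
sample_page = """
<a href="/data/MISR/">Parent Directory</a>
<a href="2000.09.21/">2000.09.21/</a>
<a href="file_one.hdf">file_one.hdf</a>
<a href="mailto:someone@example.com">Contact</a>
"""

collector = LinkCollector()
collector.feed(sample_page)
children = [href for href in collector.hrefs if is_child_href(href)]
directories = [href for href in children if href.endswith('/')]
files = [href for href in children if not href.endswith('/')]
print(directories, files)
```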
Last edited by njester on Wed Sep 29, 2021 1:52 pm America/New_York, edited 6 times in total.


faisal1313
Posts: 1
Joined: Mon Aug 16, 2021 6:03 am America/New_York

Re: Downloading ASDC Data with Python 3

by faisal1313 » Mon Aug 16, 2021 6:03 am America/New_York

'str' object has no attribute 'open'

ASDC - cheyenne.e.land
User Services
Posts: 29
Joined: Mon Mar 22, 2021 3:55 pm America/New_York
Answers: 1

Re: Downloading ASDC Data with Python 3

by ASDC - cheyenne.e.land » Mon Aug 16, 2021 11:21 am America/New_York

Hello,

Thank you for notifying us. A Subject Matter Expert has been notified and will make any necessary edits soon.

ASDC - cheyenne.e.land
User Services
Posts: 29
Joined: Mon Mar 22, 2021 3:55 pm America/New_York
Answers: 1

Re: Downloading ASDC Data with Python 3

by ASDC - cheyenne.e.land » Mon Aug 16, 2021 12:03 pm America/New_York

Hello,

The code has been updated, thanks once again.

ariesds
Posts: 1
Joined: Sat Sep 25, 2021 2:10 pm America/New_York

Re: Downloading ASDC Data with Python 3

by ariesds » Sat Sep 25, 2021 2:18 pm America/New_York

Hello,

I have tried the code for downloading multiple files. I'm wondering how to download the real capacities of the files. In my case, I want to download the CALIPSO VFM data for September 2010 (https://asdc.larc.nasa.gov/data/CALIPSO ... 20/2010/09). I got 93 KB for every file.

I'm a beginner in Python and do not know which line or what syntax should be changed.

I'm very grateful for your help.

Best regards,
Aries

ASDC - cheyenne.e.land
User Services
Posts: 29
Joined: Mon Mar 22, 2021 3:55 pm America/New_York
Answers: 1

Re: Downloading ASDC Data with Python 3

by ASDC - cheyenne.e.land » Tue Sep 28, 2021 8:19 am America/New_York

Hello,

Thank you for your question. The author of this code has been notified and will answer your question shortly.

Regards,
ASDC User Services

njester
Posts: 8
Joined: Sat Mar 06, 2021 9:03 am America/New_York

Re: Downloading ASDC Data with Python 3

by njester » Wed Sep 29, 2021 1:58 pm America/New_York

What do you mean when you say "real capacities"? If all the files are the same size, there's a chance that your credentials were not correct, which would cause the script to download the HTML page that tells you the login failed. I need to update the script to handle this case. Could you open one of the files and let me know if the contents are HTML code? If so, make sure you have valid credentials, or create an account if you don't have one. You can test your credentials or log in here:
https://urs.earthdata.nasa.gov/profile
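The check described above can also be done for a whole folder at once. This is a rough sketch, assuming the files are in the data folder the multi-file script creates; the helper names are hypothetical. It only inspects the first bytes of each file, so treat the result as a hint, not proof.

```python
from pathlib import Path


def looks_like_html(path, probe_bytes: int = 512) -> bool:
    """Return True when the start of the file looks like an HTML page."""
    with open(path, "rb") as handle:
        head = handle.read(probe_bytes).lstrip().lower()
    return head.startswith((b"<!doctype html", b"<html"))


def report_html_files(directory="data"):
    """Return the files in `directory` that look like HTML pages."""
    return [p for p in sorted(Path(directory).iterdir())
            if p.is_file() and looks_like_html(p)]


if __name__ == "__main__":
    # Only runs the check when the folder exists
    if Path("data").is_dir():
        for path in report_html_files("data"):
            print(f"{path} looks like an HTML page, not data")
```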

barronh
Posts: 2
Joined: Wed Nov 10, 2021 3:06 pm America/New_York

Re: Downloading ASDC Data with Python 3

by barronh » Wed Nov 10, 2021 3:31 pm America/New_York

I have Python 3.6.1 with requests v 2.26.0 and the code above does not work for me. I can use the wget script without a problem using the same username and password. Below is a slight modification of the code to output versions and diagnostic information.

```
from getpass import getpass
from pathlib import Path

from requests import Session

import requests
import sys

print(f'Python Version: {sys.version_info.major}.{sys.version_info.minor}.{sys.version_info.micro}')
print(f'requests Version {requests.__version__}')

# This code will prompt you to enter your username and password
username = input("Earthdata username: ")
password = getpass("Earthdata password: ")

# Replace this URL with the URL you want to download
url = "https://asdc.larc.nasa.gov/data/ACEPOL/ ... 019_R0.ict"

# This code downloads the file to your current working directory (where you ran python from)
file_name = Path(url).name

session = Session()
session.auth = (username, password)
_redirect = session.get(url)
print(repr(_redirect))
_response = session.get(_redirect.url)
print(repr(_response ))
with open(file_name, 'wb') as file:
    file.write(_response._content)
```

I run the command with output shown below.
```
$ python temp.py
Python Version: 3.6.1
requests Version 2.26.0
Earthdata username: xxxx
Earthdata password:
<Response [401]>
<Response [401]>
```

The downloaded file is also pasted:

```
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>401 Unauthorized</title>
</head><body>
<h1>Unauthorized</h1>
<p>This server could not verify that you
are authorized to access the document
requested. Either you supplied the wrong
credentials (e.g., bad password), or your
browser doesn't understand how to supply
the credentials required.</p>
</body></html>
```

I also tried on another machine:
```
Python Version: 3.9.5
requests Version 2.25.1
Earthdata username: xxxx
Earthdata password: ········
<Response [401]>
<Response [401]>
```

I've deleted my username and hidden my password in the outputs. I have no problem using the wget approach with .urs_cookies and the same login information. I'd prefer to use a python-based approach that I could mix with pings to CMR to determine which data I need.

Thanks for any help.

barronh
Posts: 2
Joined: Wed Nov 10, 2021 3:06 pm America/New_York

Re: Downloading ASDC Data with Python 3

by barronh » Wed Nov 10, 2021 4:07 pm America/New_York

I figured out how to make it work on my machine... I am not sure why, but the username and password must be given as auth to the session.get method:

```
from getpass import getpass
from pathlib import Path

from requests import Session

# Replace this URL with the URL you want to download
url = "https://asdc.larc.nasa.gov/data/ACEPOL/ ... 019_R0.ict"

# This code downloads the file to your current working directory (where you ran python from)
file_name = Path(url).name

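# This code will prompt you to enter your username and password
username = input("Earthdata username: ")
password = getpass("Earthdata password: ")

session = Session()
# session.auth is deliberately left unset here
#session.auth = (username, password)
_redirect = session.get(url)
_response = session.get(_redirect.url, auth=(username, password))
with open(file_name, 'wb') as file:
    file.write(_response._content)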
```

Note that if I add the `session.auth = (username, password)` back in, it fails. So I have to both not add session.auth and use auth in the session.get call. The same logic must be applied to the large script too.

I don't understand why this would be the case, but I hope it is useful for others. Again, I am on Python 3.6.1 and requests 2.25.1.

Thanks for the script!
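A possible explanation, offered as an assumption and not confirmed anywhere in this thread: requests removes the Authorization header when a redirect moves to a different host, so credentials sent to asdc.larc.nasa.gov may never reach urs.earthdata.nasa.gov, the login host linked earlier in the thread. requests can also look up credentials per host in a .netrc file, which might avoid passing auth by hand. The sketch below only builds such an entry and reads it back with the standard library netrc module. The helper name, the placeholder credentials and the temporary file location are illustrative, and whether ASDC downloads then succeed without auth= has not been tested here.

```python
import netrc
import os
import tempfile

URS_HOST = "urs.earthdata.nasa.gov"


def write_netrc_entry(path, username, password, host=URS_HOST):
    """Write a single netrc entry for `host` to `path` and restrict its permissions."""
    with open(path, "w") as handle:
        handle.write(f"machine {host} login {username} password {password}\n")
    os.chmod(path, 0o600)


# Demonstration with placeholder credentials in a temporary file
demo_path = os.path.join(tempfile.mkdtemp(), "demo_netrc")
write_netrc_entry(demo_path, "example_user", "example_password")

# authenticators() returns a (login, account, password) tuple for the host
login, _account, password = netrc.netrc(demo_path).authenticators(URS_HOST)
print(login)
```

For real use the entry would go in the .netrc file in your home directory, which is where requests looks by default.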

njester
Posts: 8
Joined: Sat Mar 06, 2021 9:03 am America/New_York

Re: Downloading ASDC Data with Python 3

by njester » Mon Nov 15, 2021 9:18 am America/New_York

Thanks for the info, I'll see if I need to update the post.
