Downloading ASDC Data with Python 3

njester
Posts: 5
Joined: Sat Mar 06, 2021 9:03 am America/New_York

by njester » Mon May 24, 2021 11:49 am America/New_York

Below are two example scripts: the first downloads a single file and the second downloads multiple files.

The following code requires
  1. The requests package which can be installed from the command line or terminal using the following command:

    Code: Select all

    pip install requests
  2. Python 3.6 or newer. To check the version, run python --version from the command line, or launch the Python application and read the version printed on the first line.
  3. Earthdata Login credentials. If you don't have Earthdata credentials, you can set up an account here.
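The Python and requests requirements above can also be checked from within Python. This is a minimal sketch; the helper names python_is_supported and requests_is_installed are made up for illustration and are not part of the scripts in this post.

```python
import importlib.util
import sys

# Minimum version stated in the requirements above.
MINIMUM_PYTHON = (3, 6)


def python_is_supported(version_info=sys.version_info, minimum=MINIMUM_PYTHON) -> bool:
    """Return True if the (major, minor) version is at least `minimum`."""
    return tuple(version_info[:2]) >= tuple(minimum)


def requests_is_installed() -> bool:
    """Return True if the requests package can be found by the import system."""
    return importlib.util.find_spec("requests") is not None


if __name__ == "__main__":
    print("Python version:", sys.version.split()[0])
    print("Python 3.6 or newer:", python_is_supported())
    print("requests installed:", requests_is_installed())
```

If either check prints False, install or upgrade the missing piece before running the download scripts.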
Downloading A File
The following example code downloads a single file from the ASDC using Python. To download multiple files, use the second code example or wget.

The code sets the variables needed for the download and prompts for your Earthdata username and password. After authenticating, it saves the file to your current working directory.

Code: Select all

from getpass import getpass
from pathlib import Path

from requests import Session 

# This code will prompt you to enter your username and password
username = input("Earthdata username: ")
password = getpass("Earthdata password: ")

# Replace this URL with the URL you want to download
url = "https://asdc.larc.nasa.gov/data/ACEPOL/MetNav_AircraftInSitu_Data_1/ACEPOL-HSKPING_ER2_20171019_R0.ict"

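# Save the file in the current working directory (where you ran Python from),
# using the last part of the URL as the file name
file_name = Path(url).name

session = Session()
session.auth = (username, password)
# The first request follows any redirect; the second requests the final URL with the session credentials
_redirect = session.get(url)
_response = session.get(_redirect.url)
# Stop on an HTTP error (for example, a failed login) rather than saving an error page
_response.raise_for_status()
with open(file_name, 'wb') as file:
    file.write(_response.content)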
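For large files, it may be preferable not to hold the whole response in memory before writing it. The sketch below writes the download to disk in chunks using the stream=True option and iter_content from requests. It assumes session is a requests Session with auth already set, as in the script above; download_streamed and chunk_size are illustrative names, not part of the original script.

```python
from pathlib import Path


def download_streamed(session, url, output_path, chunk_size=1024 * 1024):
    """Download `url` to `output_path` in chunks and return the bytes written.

    `session` is assumed to be a requests.Session with `auth` already set.
    """
    # Mirror the script above: request the URL once to get the final URL
    # after any redirect, without reading the body.
    redirect = session.get(url, stream=True)
    final_url = redirect.url
    redirect.close()

    # Request the final URL and write the body to disk one chunk at a time.
    response = session.get(final_url, stream=True)
    try:
        response.raise_for_status()  # stop on an HTTP error status
        written = 0
        with Path(output_path).open("wb") as out_file:
            for chunk in response.iter_content(chunk_size=chunk_size):
                out_file.write(chunk)
                written += len(chunk)
        return written
    finally:
        response.close()
```

With the variables from the script above, a call would look like download_streamed(session, url, file_name).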
Downloading Multiple Files
The following code can be used to download multiple files within an asdc.larc.nasa.gov/data directory. Simply find a top level directory that contains all the files to be downloaded.

To run the script:
  1. Define the Top Level URL
    When the script runs, it prompts for the top level URL of the data to download (the input() call below if __name__ == "__main__":). Entering test instead of a URL uses a small example directory of about 20 MB.
  2. Add Filters
    Filters define which files should be downloaded. The script prompts for a comma-separated list of file extensions (for example, .ict, .h5) and downloads only the files whose names end with one of them. Leave the prompt blank to download all files.
  3. Run the Code
    Save the code in a file with the .py extension. It can then be run by double clicking the file, or from the command line with python followed by the file name.
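The file-type filter in the script below can be tried out in isolation. This is a minimal sketch of the same matching rule; parse_extensions and matches_extension are illustrative names rather than functions from the script.

```python
from typing import List


def parse_extensions(text: str) -> List[str]:
    """Split a comma-separated string such as '.ict, .h5' into a list."""
    return [part.strip() for part in text.split(",")]


def matches_extension(file_name: str, extensions: List[str]) -> bool:
    """Return True if file_name ends with any extension, ignoring case."""
    return any(file_name.lower().endswith(ext.lower()) for ext in extensions)


extensions = parse_extensions(".ict, .h5")
print(extensions)                                      # ['.ict', '.h5']
print(matches_extension("FLIGHT_R0.ICT", extensions))  # True
print(matches_extension("readme.txt", extensions))     # False

# A blank answer produces [''], and every string ends with '', so all files match.
print(matches_extension("readme.txt", parse_extensions("")))  # True
```

This is why leaving the file type prompt blank downloads everything in the directory.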

Code: Select all

from getpass import getpass
from html.parser import HTMLParser
from pathlib import Path
from typing import List

from requests import Response, Session


# This class is used to process the HTML of a directory page and collect the href of every link
class MyHTMLParser(HTMLParser):
    def __init__(self, *args, **kwargs) -> None:
        super().__init__(*args, **kwargs)
        self.hrefs: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for (attr, value) in attrs:
                if attr == "href":
                    self.hrefs.append(value)

    def get_hrefs(self, response: Response) -> List[str]:
        self.feed(response.text)
        return self.hrefs

def get_session() -> Session:
    # This code will prompt you to enter your username and password
    username = input("Earthdata username: ")
    password = getpass("Earthdata password: ")

    # create a session that can be used to log in
    session = Session()
    session.auth = (username, password)
    return session

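def is_correct_ext(file: str, extensions: List[str]) -> bool:
    # return True if the file name ends with any of the extensions (case-insensitive)
    for extension in extensions:
        if file.lower().endswith(extension.lower()):
            return True
    return False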

def get_files_to_download(session: Session, filter_list: List[str]) -> List[str]:

    # this function returns true if a href is a child ref
    is_child_href = lambda href: not href.startswith('http') and not (href.startswith('/') or href.startswith('#') or href.startswith('mailto'))

    # walk the directory to find all directory and file URLs
    # (starting_url is the module-level variable set under if __name__ == "__main__")
    print("Getting download files.")
    urls = [starting_url]
    files = []
    for url in urls:

        # get the page's contents
        print(f"Checking {url}")
        _redirect = session.get(url)
        _response = session.get(_redirect.url)

        # this class populates the hrefs list
        parser = MyHTMLParser()
        hrefs = parser.get_hrefs(_response)
        for href in hrefs:
            new_url = url+href
            if is_child_href(href) and href != url:
                if href.endswith('/'):
                    urls.append(new_url)
                elif is_correct_ext(new_url, filter_list) and new_url not in files:
                    files.append(new_url)
    return files

def verify_download(session: Session, files: List[str], output_dir: Path) -> List[str]:
    # create a data directory in the current working directory to store the downloaded files
    output_dir.mkdir(exist_ok=True)

    # should existing files be overwritten
    print("Checking for existing files")
    local_files = [file.name.lower() for file in output_dir.iterdir()]
    existing_files = list(filter(lambda file: Path(file).name.lower() in local_files, files))
    if existing_files and input(f"Overwrite {len(existing_files)} existing files? [y/n]: ").lower() not in ('y', 'ye', 'yes'):
        files = list(filter(lambda file: file not in existing_files, files))

    # calculate total file size so that the user can verify they have enough space
    print("Getting download size.")
    total_size = 0
    if not files:
        print("No files matched the filter or no files were found in the directory, exiting")
        exit()
    for file in files:
        _redirect = session.head(file)
        _response = session.head(_redirect.url)
        # treat a missing content-length header as 0 instead of failing on int(None)
        total_size += int(_response.headers.get('content-length', 0))

    # inform the user before starting download
    if input(f"Download {len(files)} files, {total_size // 1024**2} MB? [y/n]: ").lower() not in ('y', 'ye', 'yes'):
        print("Exiting, consider adding more filters or starting at a lower level folder")
        exit()

    return files

def download(session: Session, files: List[str], output_dir: Path) -> None:
    # The following code downloads the files
    print(f"Downloading {len(files)} files.")
    for i, file in enumerate(files):
        print(f"Downloading file {i+1} of {len(files)}", end="\r")
        file_path = output_dir.joinpath(file.split('/')[-1])
        # the first request follows any redirect; the second requests the final URL
        _redirect = session.get(file)
        _response = session.get(_redirect.url)
        with file_path.open('wb') as out_file:
            out_file.write(_response.content)

if __name__ == "__main__":
    # This URL is the starting directory
    starting_url = input("Enter the top level URL: ")
    # enter "test" for the URL to download from a small, 20 MB directory
    if starting_url == "test":
        starting_url = "https://asdc.larc.nasa.gov/data/MISR/MI3DCLDN.002/2000.09.20/"

    # the output directory Path('data') saves the files to a folder called 'data' in the current working directory
    output_dir = Path('data')

    filter_text = input("File types to download. (Ex: .ict, .h5, blank for all): ")
    filter_list = [filter.strip() for filter in filter_text.split(',')]

    session = get_session()
    files = get_files_to_download(session, filter_list)
    files = verify_download(session, files, output_dir)
    download(session, files, output_dir)
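
To see how the directory walk decides which links to follow, the link collection and the child-link test can be run on a small HTML snippet without any network access. The HTML below is made up for illustration and may not match the actual ASDC directory pages; LinkCollector and is_child_href are simplified stand-ins for the parser class and lambda in the script.

```python
from html.parser import HTMLParser
from typing import List


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""

    def __init__(self) -> None:
        super().__init__()
        self.hrefs: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.hrefs.append(value)


def is_child_href(href: str) -> bool:
    """Same rule as the script: skip absolute URLs, root paths, anchors and mailto links."""
    return not href.startswith(("http", "/", "#", "mailto"))


# Hypothetical listing page, for illustration only.
sample_html = """
<a href="/data/">Parent Directory</a>
<a href="2000.09.20/">2000.09.20/</a>
<a href="granule.hdf">granule.hdf</a>
<a href="https://www.nasa.gov">NASA</a>
<a href="mailto:someone@example.com">Contact</a>
<a href="#top">Top</a>
"""

collector = LinkCollector()
collector.feed(sample_html)
children = [href for href in collector.hrefs if is_child_href(href)]
print(children)  # ['2000.09.20/', 'granule.hdf']
```

In the script, child links ending in / are queued as directories to visit, and the remaining child links are checked against the file type filter.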
Last edited by njester on Mon Aug 30, 2021 10:07 am America/New_York, edited 4 times in total.

faisal1313
Posts: 1
Joined: Mon Aug 16, 2021 6:03 am America/New_York

Re: Downloading ASDC Data with Python 3

by faisal1313 » Mon Aug 16, 2021 6:03 am America/New_York

'str' object has no attribute 'open'

ASDC - cheyenne.e.land
User Services
Posts: 27
Joined: Mon Mar 22, 2021 3:55 pm America/New_York
Answers: 1

Re: Downloading ASDC Data with Python 3

by ASDC - cheyenne.e.land » Mon Aug 16, 2021 11:21 am America/New_York

Hello,

Thank you for notifying us. A Subject Matter Expert has been notified and will make any necessary edits soon.

ASDC - cheyenne.e.land
User Services
Posts: 27
Joined: Mon Mar 22, 2021 3:55 pm America/New_York
Answers: 1

Re: Downloading ASDC Data with Python 3

by ASDC - cheyenne.e.land » Mon Aug 16, 2021 12:03 pm America/New_York

Hello,

The code has been updated, thanks once again.
