parse_urls: Parse URLs from various sources

The parse_urls module provides basic functions for parsing URLs from different sources. The available functions are listed below.

Different functions to parse URLs

| Functions | Description |
| --- | --- |
| from_file() | Parse URLs from a file that contains only URLs |
| from_html() | Parse URLs from an HTML page |
| from_sentinel_meta4() | Parse URLs from a given JSON file |
| from_EarthExplorer_order() | Parse URLs from orders in EarthExplorer |

Import parse_urls at the beginning of your script:

from data_downloader import parse_urls

The following sections briefly introduce these functions.

from_file

This function parses URLs from a given file that contains only URLs.

Tip

This function is only useful when the file contains nothing but URLs (a single column). If the file contains multiple columns, use pandas to read it instead.
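As a rough illustration of the pandas route mentioned in the tip, the sketch below reads a small multi-column listing and keeps only the URL column. The column names (name, url), the comma separator, and the example.com URLs are assumptions made for this example; adapt them to the layout of your own file.

```python
import io

import pandas as pd

# Hypothetical two-column listing: a product name and its URL.
# In practice, pass a file path to pd.read_csv instead of a StringIO buffer.
listing = io.StringIO(
    "name,url\n"
    "tile_a,https://example.com/data/tile_a.nc\n"
    "tile_b,https://example.com/data/tile_b.nc\n"
)

table = pd.read_csv(listing)

# Keep only the URL column, drop missing values, and convert to a plain list.
urls = table["url"].dropna().tolist()
print(urls)
```

The resulting list of strings can then be used in place of the output of from_file.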

Example:

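from data_downloader import parse_urls, downloader

# Text file that lists one URL per line
url_file = '/media/fancy/gpm/subset_GPM_3IMERGM_06_20200513_134318.txt'
urls = parse_urls.from_file(url_file)

# Output folder for the downloaded files (placeholder path, adjust as needed)
folder_out = '/media/fancy/gpm'
downloader.download_datas(urls, folder_out)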

For a complete use case, see gpm_example.
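To make the expected input concrete, the following sketch writes a small URL-only file (one URL per line) and reads it back with plain Python. It is not the implementation of from_file; the helper name read_url_list and the example.com URLs are hypothetical and only illustrate the file layout the function is meant for.

```python
import tempfile
from pathlib import Path


def read_url_list(path):
    """Return the non-empty, stripped lines of a text file as a list."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return [line.strip() for line in lines if line.strip()]


# Write a small hypothetical URL file: one URL per line, nothing else.
with tempfile.TemporaryDirectory() as tmp:
    url_file = Path(tmp) / "urls.txt"
    url_file.write_text(
        "https://example.com/data/a.nc\n"
        "https://example.com/data/b.nc\n"
        "\n",
        encoding="utf-8",
    )
    urls = read_url_list(url_file)

print(urls)
```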

from_html

This function parses URLs from a given HTML page (url). It can restrict the results to URLs with a specific suffix and depth. The following example parses URLs with the suffix .nc, first with suffix_depth=1 alone and then with url_depth=1 added.

Example:

from data_downloader import parse_urls

url = 'https://cds-espri.ipsl.upmc.fr/espri/pubipsl/iasib_CH4_2014_uk.jsp'
urls = parse_urls.from_html(url, suffix=['.nc'], suffix_depth=1)
urls_all = parse_urls.from_html(url, suffix=['.nc'], suffix_depth=1, url_depth=1)

print(f"Found {len(urls)} urls, {len(urls_all)} urls in total")
Found 357 urls, 2903 urls in total
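The general idea of collecting links with a given suffix from a page can be sketched with the standard library alone. This is not how from_html is implemented, and it does not model the suffix_depth or url_depth parameters; the class LinkCollector, the helper links_with_suffix, and the sample page are hypothetical names and data chosen for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect absolute URLs from <a href="..."> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page URL.
                self.links.append(urljoin(self.base_url, value))


def links_with_suffix(html, base_url, suffixes):
    """Return links in `html` whose URL ends with one of `suffixes`."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    return [u for u in collector.links if u.endswith(tuple(suffixes))]


# Hypothetical page content; a real page would be fetched over HTTP first.
page = """
<a href="data/2014_01.nc">January</a>
<a href="data/2014_02.nc">February</a>
<a href="readme.html">Readme</a>
"""
found = links_with_suffix(page, "https://example.com/archive/", [".nc"])
print(found)
```

Only the two .nc links are kept here; the readme.html link is filtered out by the suffix check.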

Tip

This function is used to parse URLs for the LiCSARService and SentinelOrbit services. For more details, you can refer to the source code of these services.