parse_urls#

parse_urls module provides functions to parse URLs from different sources. The module provides functions to parse URLs from:

Different functions to parse URLs#

Functions

Description

from_file()

parse urls from a file which only contains urls

from_html()

parse urls from html website

from_sentinel_meta4()

parse a urls from a given JSON file

from_EarthExplorer_order()

parse urls from orders in earthexplorer

Functions#

data_downloader.parse_urls.from_file(url_file: str | Path) list#

parse urls from a file which only contains urls

Added in version 1.2.

Parameters:#

url_file: str

path to file which only contains urls

Return:#

a list contains urls

data_downloader.parse_urls.from_html(url: str, suffix: list[str] | None = None, suffix_depth: int = 0, url_depth: int = 0) list#

parse urls from html website

Parameters:#

url: str

the website contains data

suffix: list[str] | None, optional

data format. suffix should be a list contains multipart. if suffix_depth is 0, all ‘.’ will parsed. Examples:

  • when set ‘suffix_depth=0’:
    • suffix of ‘xxx8.1_GLOBAL.nc’ should be [‘.1_GLOBAL’, ‘.nc’]

    • suffix of ‘xxx.tar.gz’ should be [‘.tar’, ‘.gz’]

  • when set ‘suffix_depth=1’:
    • suffix of ‘xxx8.1_GLOBAL.nc’ should be [‘.nc’]

    • suffix of ‘xxx.tar.gz’ should be [‘.gz’]

suffix_depth: int

Number of suffixes

url_depth: int

depth of url in website will parsed

Return:#

a list contains urls

Example:#

>>> from downloader import parse_urls
>>> url = 'https://cds-espri.ipsl.upmc.fr/espri/pubipsl/iasib_CH4_2014_uk.jsp'
>>> urls = parse_urls.from_html(url, suffix=['.nc'], suffix_depth=1)
>>> urls_all = parse_urls.from_html(url, suffix=['.nc'], suffix_depth=1, url_depth=1)
>>> print(len(urls_all)-len(urls))
data_downloader.parse_urls.from_sentinel_meta4(url_file: str | Path) list#

parse urls from sentinel products.meta4 file downloaded from https://scihub.copernicus.eu/dhus

Parameters:#

url_file: str

path to products.meta4

Return:#

a list contains urls

data_downloader.parse_urls.from_EarthExplorer_order(username: str | None = None, passwd: str | None = None, email: str | None = None, order: str | dict | None = None, url_host: str | None = None) dict#

parse urls from orders in earthexplorer.

Reference: [bulk-downloader](https://code.usgs.gov/espa/bulk-downloader)

Parameters:#

username, passwd: str, optional

your username and passwd to login in EarthExplorer. Could be None when you have save them in .netrc

email: str, optional

email address for the user that submitted the order

order: str or dict

which order to download. If None, all orders retrieved from EarthExplorer will be used.

url_host: str

if host is not USGS ESPA

Return:#

a dict in format of {orderid: urls}

Example:#

>>> from pathlib import Path
>>> from data_downloader import downloader, parse_urls
>>> folder_out = Path('D:\data')
>>> urls_info = parse_urls.from_EarthExplorer_order('your username', 'your passwd')
>>> for odr in urls_info.keys():
>>>     folder = folder_out.joinpath(odr)
>>>     if not folder.exists():
>>>         folder.mkdir()
>>>     urls = urls_info[odr]
>>>     downloader.download_datas(urls, folder)
data_downloader.parse_urls.from_urls_file(url_file: str | Path) list#

parse urls from a file which only contains urls

Warning

This function will be deprecated in the future. Please use from_file() instead.

See also

from_file()

Parameters:#

url_file: str

path to file which only contains urls

Return:#

a list contains urls