parse_urls#

The parse_urls module provides functions to parse URLs from different sources:

Different functions to parse URLs#
| Function | Description |
| --- | --- |
| `from_file()` | parse URLs from a file that contains only URLs |
| `from_html()` | parse URLs from an HTML web page |
| `from_sentinel_meta4()` | parse URLs from a Sentinel products.meta4 file |
| `from_EarthExplorer_order()` | parse URLs from orders in EarthExplorer |

Functions#

data_downloader.parse_urls.from_file(url_file: str | Path) → list#

Parse URLs from a file that contains only URLs.

Added in version 1.2.

Parameters:#

url_file: str | Path: path to a file that contains only URLs

Return:#

a list of URLs
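To make the expected input concrete, here is a minimal standalone sketch of reading such a file. It assumes one URL per line with blank lines ignored, which this page does not state explicitly. `read_url_list` is a hypothetical helper written for illustration, not the library's implementation.

```python
import tempfile
from pathlib import Path


def read_url_list(url_file):
    """Hypothetical stand-in: return the non-empty lines of a text file."""
    text = Path(url_file).read_text(encoding="utf-8")
    # Assumption: one URL per line; surrounding whitespace and blank lines are dropped.
    return [line.strip() for line in text.splitlines() if line.strip()]


# Write a small sample file and read it back.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "urls.txt"
    sample.write_text(
        "https://example.com/a.nc\n\nhttps://example.com/b.nc\n",
        encoding="utf-8",
    )
    sample_urls = read_url_list(sample)

print(sample_urls)
```

With the real module, the equivalent call would be `parse_urls.from_file(sample)`.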

data_downloader.parse_urls.from_html(url: str, suffix: list[str] | None = None, suffix_depth: int = 0, url_depth: int = 0) → list#

Parse URLs from an HTML web page.

Parameters:#

url: str

URL of the web page that lists the data

suffix: list[str] | None, optional

Data format. suffix should be a list of the suffix parts of the file name. If suffix_depth is 0, the file name is split at every '.'. Examples:

when 'suffix_depth=0':
- suffix of 'xxx8.1_GLOBAL.nc' should be ['.1_GLOBAL', '.nc']
- suffix of 'xxx.tar.gz' should be ['.tar', '.gz']

when 'suffix_depth=1':
- suffix of 'xxx8.1_GLOBAL.nc' should be ['.nc']
- suffix of 'xxx.tar.gz' should be ['.gz']

suffix_depth: int

Number of trailing suffix parts to match. 0 means every part after the first '.' is matched.

url_depth: int

Depth to which links found in the web page are followed and parsed

Return:#

a list of URLs

Example:#

>>> from data_downloader import parse_urls

>>> url = 'https://cds-espri.ipsl.upmc.fr/espri/pubipsl/iasib_CH4_2014_uk.jsp'
>>> urls = parse_urls.from_html(url, suffix=['.nc'], suffix_depth=1)
>>> urls_all = parse_urls.from_html(url, suffix=['.nc'], suffix_depth=1, url_depth=1)
>>> print(len(urls_all)-len(urls))
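The suffix rules described for `from_html()` can be illustrated with the standard library alone. The sketch below uses `pathlib` to show which suffix parts correspond to each `suffix_depth` in the documented examples. `split_suffix` is a hypothetical helper for illustration, and the library's own matching logic may differ.

```python
from pathlib import PurePosixPath


def split_suffix(filename, suffix_depth=0):
    """Hypothetical helper: list the suffix parts compared at a given suffix_depth."""
    suffixes = PurePosixPath(filename).suffixes
    # Depth 0 keeps every '.'-separated part; depth n keeps only the last n parts.
    return suffixes if suffix_depth == 0 else suffixes[-suffix_depth:]


print(split_suffix("xxx8.1_GLOBAL.nc"))     # ['.1_GLOBAL', '.nc']
print(split_suffix("xxx.tar.gz"))           # ['.tar', '.gz']
print(split_suffix("xxx8.1_GLOBAL.nc", 1))  # ['.nc']
print(split_suffix("xxx.tar.gz", 1))        # ['.gz']
```

The value passed as `suffix` should therefore have as many entries as the chosen `suffix_depth` keeps.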

data_downloader.parse_urls.from_sentinel_meta4(url_file: str | Path) → list#

Parse URLs from a Sentinel products.meta4 file downloaded from https://scihub.copernicus.eu/dhus

Parameters:#

url_file: str | Path: path to the products.meta4 file

Return:#

a list of URLs
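For orientation, the sketch below shows one way the URLs in a `.meta4` file could be extracted. It assumes the file is a standard Metalink 4 XML document (RFC 5854), with `url` elements nested inside `file` elements. The sample content and the helper `urls_from_meta4_text` are made up for illustration and are not the library's implementation.

```python
import xml.etree.ElementTree as ET

# XML namespace used by Metalink 4 documents (RFC 5854).
META4_NS = {"ml": "urn:ietf:params:xml:ns:metalink"}

# Made-up sample content; a real products.meta4 file lists actual products.
SAMPLE_META4 = """<metalink xmlns="urn:ietf:params:xml:ns:metalink">
  <file name="product_a.zip">
    <url>https://example.com/product_a.zip</url>
  </file>
  <file name="product_b.zip">
    <url>https://example.com/product_b.zip</url>
  </file>
</metalink>
"""


def urls_from_meta4_text(xml_text):
    """Hypothetical helper: collect the text of every file/url element."""
    root = ET.fromstring(xml_text)
    return [el.text.strip() for el in root.findall("ml:file/ml:url", META4_NS) if el.text]


meta4_urls = urls_from_meta4_text(SAMPLE_META4)
print(meta4_urls)
```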

`from_EarthExplorer_order()`: parse URLs from orders in EarthExplorer.

Reference: [bulk-downloader](https://code.usgs.gov/espa/bulk-downloader)

Parameters:#

username, passwd: str, optional: your username and password for logging in to EarthExplorer. Both can be None if you have saved them in .netrc
email: str, optional: email address of the user who submitted the order
order: str or dict: the order to download. If None, all orders retrieved from EarthExplorer are used.
url_host: str: host URL, needed only if the host is not USGS ESPA

Return:#

a dict in the format {orderid: urls}

Example:#

>>> from pathlib import Path
>>> from data_downloader import downloader, parse_urls
>>> folder_out = Path(r'D:\data')  # raw string keeps the backslash literal
>>> urls_info = parse_urls.from_EarthExplorer_order('your username', 'your passwd')
>>> for odr in urls_info.keys():
...     folder = folder_out.joinpath(odr)
...     if not folder.exists():
...         folder.mkdir()
...     urls = urls_info[odr]
...     downloader.download_datas(urls, folder)
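The parameter description says the username and password can be omitted once they are saved in .netrc. The sketch below shows how credentials stored in a netrc-format file can be looked up with Python's standard `netrc` module. The host name `example.org`, the sample credentials, and the helper `lookup_credentials` are placeholders for illustration. Which host entry the library actually reads is not stated on this page.

```python
import netrc
import tempfile
from pathlib import Path


def lookup_credentials(netrc_path, host):
    """Hypothetical helper: return (login, password) for host, or None if absent."""
    entry = netrc.netrc(str(netrc_path)).authenticators(host)
    if entry is None:
        return None
    login, _account, password = entry
    return login, password


# Build a throwaway netrc-format file rather than touching the real ~/.netrc.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample_netrc"
    sample.write_text("machine example.org login alice password s3cret\n")
    found = lookup_credentials(sample, "example.org")
    missing = lookup_credentials(sample, "other.example.org")

print(found)
print(missing)
```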

data_downloader.parse_urls.from_urls_file(url_file: str | Path) → list#

Parse URLs from a file that contains only URLs.

Warning

This function will be deprecated in the future. Please use from_file() instead.

Parameters:#

url_file: str | Path: path to a file that contains only URLs

Return:#

a list of URLs
