parse_urls#
parse_urls module provides functions to parse URLs from different sources. The module provides functions to parse URLs from:
Functions |
Description |
|---|---|
parse urls from a file which only contains urls |
|
parse urls from html website |
|
parse a urls from a given JSON file |
|
parse urls from orders in earthexplorer |
Functions#
- data_downloader.parse_urls.from_file(url_file: str | Path) list#
parse urls from a file which only contains urls
Added in version 1.2.
Parameters:#
- url_file: str
path to file which only contains urls
Return:#
a list contains urls
- data_downloader.parse_urls.from_html(url: str, suffix: list[str] | None = None, suffix_depth: int = 0, url_depth: int = 0) list#
parse urls from html website
Parameters:#
- url: str
the website contains data
- suffix: list[str] | None, optional
data format. suffix should be a list contains multipart. if suffix_depth is 0, all ‘.’ will parsed. Examples:
- when set ‘suffix_depth=0’:
suffix of ‘xxx8.1_GLOBAL.nc’ should be [‘.1_GLOBAL’, ‘.nc’]
suffix of ‘xxx.tar.gz’ should be [‘.tar’, ‘.gz’]
- when set ‘suffix_depth=1’:
suffix of ‘xxx8.1_GLOBAL.nc’ should be [‘.nc’]
suffix of ‘xxx.tar.gz’ should be [‘.gz’]
- suffix_depth: int
Number of suffixes
- url_depth: int
depth of url in website will parsed
Return:#
a list contains urls
Example:#
>>> from downloader import parse_urls
>>> url = 'https://cds-espri.ipsl.upmc.fr/espri/pubipsl/iasib_CH4_2014_uk.jsp' >>> urls = parse_urls.from_html(url, suffix=['.nc'], suffix_depth=1) >>> urls_all = parse_urls.from_html(url, suffix=['.nc'], suffix_depth=1, url_depth=1) >>> print(len(urls_all)-len(urls))
- data_downloader.parse_urls.from_sentinel_meta4(url_file: str | Path) list#
parse urls from sentinel products.meta4 file downloaded from https://scihub.copernicus.eu/dhus
Parameters:#
- url_file: str
path to products.meta4
Return:#
a list contains urls
- data_downloader.parse_urls.from_EarthExplorer_order(username: str | None = None, passwd: str | None = None, email: str | None = None, order: str | dict | None = None, url_host: str | None = None) dict#
parse urls from orders in earthexplorer.
Reference: [bulk-downloader](https://code.usgs.gov/espa/bulk-downloader)
Parameters:#
- username, passwd: str, optional
your username and passwd to login in EarthExplorer. Could be None when you have save them in .netrc
- email: str, optional
email address for the user that submitted the order
- order: str or dict
which order to download. If None, all orders retrieved from EarthExplorer will be used.
- url_host: str
if host is not USGS ESPA
Return:#
a dict in format of {orderid: urls}
Example:#
>>> from pathlib import Path >>> from data_downloader import downloader, parse_urls >>> folder_out = Path('D:\data') >>> urls_info = parse_urls.from_EarthExplorer_order('your username', 'your passwd') >>> for odr in urls_info.keys(): >>> folder = folder_out.joinpath(odr) >>> if not folder.exists(): >>> folder.mkdir() >>> urls = urls_info[odr] >>> downloader.download_datas(urls, folder)
- data_downloader.parse_urls.from_urls_file(url_file: str | Path) list#
parse urls from a file which only contains urls
Warning
This function will be deprecated in the future. Please use
from_file()instead.See also
Parameters:#
- url_file: str
path to file which only contains urls
Return:#
a list contains urls