SitemapRequestLoader
Hierarchy
- RequestLoader
- SitemapRequestLoader
Index
Methods
__init__
Initialize the sitemap request loader.
Parameters
sitemap_urls: list[str]
URLs of the sitemaps to load.
http_client: HttpClient
The instance of HttpClient to use for fetching sitemaps.
optional keyword-only proxy_info: ProxyInfo | None = None
Optional proxy to use for fetching sitemaps.
optional keyword-only include: list[re.Pattern[Any] | Glob] | None = None
List of glob or regex patterns to include URLs.
optional keyword-only exclude: list[re.Pattern[Any] | Glob] | None = None
List of glob or regex patterns to exclude URLs.
optional keyword-only max_buffer_size: int = 200
Maximum number of URLs to buffer in memory.
optional keyword-only parse_sitemap_options: ParseSitemapOptions | None = None
Options for parsing sitemaps, such as SitemapSource and max_urls.
Returns None
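The include and exclude parameters accept a mix of regex and glob patterns. The sketch below shows one plausible way such filtering could work, using only the standard library. It is not the loader's implementation: plain strings stand in for Glob objects, the helper names matches and keep_url are hypothetical, and the rule that exclude takes precedence over include is an assumption.

```python
import re
from fnmatch import fnmatchcase


def matches(url, patterns):
    """Return True if the URL matches any compiled regex or glob-style string."""
    for pattern in patterns:
        if isinstance(pattern, re.Pattern):
            if pattern.search(url):
                return True
        elif fnmatchcase(url, pattern):  # plain strings stand in for Glob objects
            return True
    return False


def keep_url(url, include=None, exclude=None):
    """Decide whether a URL passes the filters.

    Assumptions: exclude wins over include, and include=None accepts everything.
    """
    if exclude and matches(url, exclude):
        return False
    if include is not None:
        return matches(url, include)
    return True


urls = [
    "https://example.com/docs/intro",
    "https://example.com/docs/private/keys",
    "https://example.com/blog/post",
]
include = ["https://example.com/docs/*"]  # glob-style pattern
exclude = [re.compile(r"/private/")]      # regex pattern

kept = [u for u in urls if keep_url(u, include, exclude)]
print(kept)
```

Only the first URL survives here: the second is rejected by the exclude regex and the third never matches the include glob.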
abort_loading
Abort the sitemap loading process.
Returns None
fetch_next_request
Fetch the next request to process.
Returns Request | None
get_handled_count
Return the number of handled requests.
Returns int
get_total_count
Return the total number of URLs found so far.
Returns int
is_empty
Check if there are no more URLs to process.
Returns bool
is_finished
Check if all URLs have been processed.
Returns bool
mark_request_as_handled
Mark a request as successfully handled.
Parameters
request: Request
Returns ProcessedRequest | None
to_tandem
Combine the loader with a request manager to support adding and reclaiming requests.
Parameters
optional request_manager: RequestManager | None = None
Request manager to combine the loader with. If None is given, the default request queue is used.
Returns RequestManagerTandem
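To illustrate why a loader might be combined with a request manager, here is a minimal synchronous sketch. A read-only loader can only hand out URLs, while a manager can also accept new and reclaimed ones. All class names below are hypothetical stand-ins, and the "loader first, then manager" ordering is an assumption, not a statement about how RequestManagerTandem behaves.

```python
from collections import deque


class ReadOnlyLoaderSketch:
    """Stand-in for a loader: hands out URLs, cannot accept new ones."""

    def __init__(self, urls):
        self._urls = deque(urls)

    def fetch_next_request(self):
        return self._urls.popleft() if self._urls else None


class QueueManagerSketch:
    """Stand-in for a request manager: supports adding and reclaiming."""

    def __init__(self):
        self._pending = deque()

    def add_request(self, url):
        self._pending.append(url)

    def reclaim_request(self, url):
        # A reclaimed request goes back into the queue to be retried later.
        self._pending.append(url)

    def fetch_next_request(self):
        return self._pending.popleft() if self._pending else None


class TandemSketch:
    """Reads from the loader first, then falls back to the manager (assumed order)."""

    def __init__(self, loader, manager):
        self._loader = loader
        self._manager = manager

    def add_request(self, url):
        self._manager.add_request(url)

    def reclaim_request(self, url):
        self._manager.reclaim_request(url)

    def fetch_next_request(self):
        url = self._loader.fetch_next_request()
        if url is not None:
            return url
        return self._manager.fetch_next_request()


tandem = TandemSketch(ReadOnlyLoaderSketch(["a", "b"]), QueueManagerSketch())
tandem.add_request("c")          # a link discovered while crawling

order = []
first = tandem.fetch_next_request()
order.append(first)
tandem.reclaim_request(first)    # pretend processing "a" failed
while (url := tandem.fetch_next_request()) is not None:
    order.append(url)
print(order)
```

The loader alone could not have accepted "c" or retried "a"; the manager half of the pair supplies both capabilities.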
A request loader that reads URLs from sitemap(s).
The loader fetches and parses sitemaps in the background, allowing crawling to start before all URLs are loaded. It supports filtering URLs using glob and regex patterns.
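The background loading described above, together with max_buffer_size, can be pictured as a producer filling a bounded queue while the crawler consumes from it. The following is a rough sketch under stated assumptions, not the library's implementation: BufferedLoaderSketch is a hypothetical stand-in, a plain list of URLs replaces real sitemap fetching and parsing, and the methods are written as coroutines on the assumption that the real loader is asynchronous.

```python
import asyncio


class BufferedLoaderSketch:
    """Hypothetical stand-in that mimics the method names documented above."""

    def __init__(self, urls, max_buffer_size=200):
        self._source = list(urls)  # stands in for parsed sitemap entries
        self._queue = asyncio.Queue(maxsize=max_buffer_size)
        self._in_progress = set()
        self._total = 0
        self._handled = 0
        self._done_loading = False
        self._task = None

    def start(self):
        # Load in the background so consumers can begin immediately.
        self._task = asyncio.create_task(self._load())

    async def _load(self):
        for url in self._source:
            await self._queue.put(url)  # waits while the buffer is full
            self._total += 1
        self._done_loading = True

    async def fetch_next_request(self):
        while True:
            if not self._queue.empty():
                url = self._queue.get_nowait()
                self._in_progress.add(url)
                return url
            if self._done_loading:
                return None
            await asyncio.sleep(0)  # yield so the background task can run

    async def mark_request_as_handled(self, url):
        self._in_progress.discard(url)
        self._handled += 1

    async def get_handled_count(self):
        return self._handled

    async def get_total_count(self):
        return self._total

    async def is_empty(self):
        return self._done_loading and self._queue.empty()

    async def is_finished(self):
        return await self.is_empty() and not self._in_progress


async def main():
    urls = [f"https://example.com/page/{i}" for i in range(5)]
    loader = BufferedLoaderSketch(urls, max_buffer_size=2)
    loader.start()

    seen = []
    while (url := await loader.fetch_next_request()) is not None:
        seen.append(url)  # a real crawler would process the page here
        await loader.mark_request_as_handled(url)

    return (
        seen,
        await loader.get_handled_count(),
        await loader.get_total_count(),
        await loader.is_finished(),
    )


seen, handled, total, finished = asyncio.run(main())
print(handled, total, finished)
```

With a buffer of two and five URLs, the producer has to pause until the consumer drains the queue, which is the memory-bounding effect max_buffer_size is meant to provide. In this sketch, is_empty means nothing is left to fetch, while is_finished additionally requires that every fetched request has been marked as handled.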