websweep.utils package
Submodules
websweep.utils.backend module
- websweep.utils.backend.detect_existing_overview_backend(base_folder)
Detect an existing overview store in base_folder.
- Parameters:
base_folder (Path)
- Return type:
str | None
websweep.utils.json_io module
- websweep.utils.json_io.json_dumps(obj)
Serialize an object to UTF-8 JSON bytes (orjson when available).
- Return type:
bytes
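The "orjson when available" behaviour can be illustrated with an optional-import fallback. This is a minimal sketch, not the package's actual implementation: the name json_dumps_sketch and the compact separators in the fallback branch are assumptions made for illustration.

```python
import json

try:
    import orjson  # optional third-party dependency; faster when installed
except ImportError:
    orjson = None


def json_dumps_sketch(obj):
    """Hypothetical helper: serialize obj to UTF-8 encoded JSON bytes."""
    if orjson is not None:
        # orjson.dumps already returns bytes encoded as UTF-8.
        return orjson.dumps(obj)
    # Standard-library fallback: produce a str, then encode it to bytes.
    text = json.dumps(obj, ensure_ascii=False, separators=(",", ":"))
    return text.encode("utf-8")
```

Either branch returns bytes, so callers can write the result straight to a file opened in binary mode without checking which library was used.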
websweep.utils.public_suffix module
Utilities for loading and refreshing the public suffix list (PSL).
websweep.utils.source_urls module
- websweep.utils.source_urls.read_source_urls(source_file_path)
Parse a source CSV/TSV and return URLs with optional identifiers.
Supported headers:
  - url / website / domain
  - identifier / id (optional)
Input hygiene:
  - auto-detects CSV vs TSV delimiters
  - keeps only level-0 rows when a level column exists
  - normalizes URLs and skips non-web schemes
  - removes exact duplicate (url, identifier) pairs while preserving order
- Parameters:
source_file_path (Path)
- Return type:
List[Tuple[str, str | None]]
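The input hygiene rules listed above can be sketched roughly as follows. This is an illustration under stated assumptions, not the library's code: the function names are hypothetical, the sketch takes file contents as a string rather than a Path, the level column is assumed to be named "level" with the string "0" marking level-0 rows, and the normalization rule (default to https, lowercase the host, drop fragments) is a guess at what "normalizes URLs" might involve.

```python
import csv
import io
from urllib.parse import urlsplit, urlunsplit


def normalize_url_sketch(raw):
    """Hypothetical normalizer: return an http(s) URL, or None to skip."""
    raw = raw.strip()
    if not raw:
        return None
    if "://" not in raw:
        if raw.lower().startswith(("mailto:", "tel:", "javascript:")):
            return None  # non-web scheme without "//"
        raw = "https://" + raw  # assumption: bare domains default to https
    parts = urlsplit(raw)
    if parts.scheme.lower() not in ("http", "https") or not parts.netloc:
        return None
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), parts.path, parts.query, "")
    )


def read_urls_sketch(text):
    """Hypothetical parser returning a list of (url, identifier or None)."""
    lines = text.splitlines()
    if not lines:
        return []
    # Delimiter detection: compare tab and comma counts in the header line.
    delimiter = "\t" if lines[0].count("\t") > lines[0].count(",") else ","
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    headers = {name.strip().lower(): name for name in (reader.fieldnames or [])}
    url_col = next((headers[h] for h in ("url", "website", "domain") if h in headers), None)
    id_col = next((headers[h] for h in ("identifier", "id") if h in headers), None)
    level_col = headers.get("level")  # assumed column name
    if url_col is None:
        raise ValueError("no url / website / domain column found")

    seen = set()
    result = []
    for row in reader:
        if level_col is not None and (row.get(level_col) or "").strip() != "0":
            continue  # keep only level-0 rows
        url = normalize_url_sketch(row.get(url_col) or "")
        if url is None:
            continue
        identifier = None
        if id_col is not None:
            identifier = (row.get(id_col) or "").strip() or None
        pair = (url, identifier)
        if pair not in seen:  # drop exact duplicates, keep first-seen order
            seen.add(pair)
            result.append(pair)
    return result
```

Tracking seen pairs in a set while appending to a list keeps de-duplication linear in the number of rows and preserves the original row order.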
websweep.utils.utils module
- websweep.utils.utils.create_regex_pattern(keywords, regex)
Build a case-insensitive regex from literal keywords and raw regex text.
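One plausible way to combine literal keywords with raw regex text is shown below. This is a sketch only: the argument types (two lists of strings), the None result for empty input, and the name build_pattern_sketch are assumptions, since the reference entry does not document them.

```python
import re


def build_pattern_sketch(keywords, raw_patterns):
    """Hypothetical helper: compile one case-insensitive alternation."""
    # Literal keywords are escaped so characters like "." match themselves.
    parts = [re.escape(keyword) for keyword in keywords if keyword]
    # Raw regex fragments are wrapped in non-capturing groups so that any
    # "|" inside a fragment cannot leak into the outer alternation.
    parts += [f"(?:{pattern})" for pattern in raw_patterns if pattern]
    if not parts:
        return None  # assumption: no rules means no pattern
    return re.compile("|".join(parts), re.IGNORECASE)
```

For example, build_pattern_sketch(["contact"], [r"about[-_]us"]) yields a pattern whose search method matches both "/CONTACT" and "/About-Us".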
- websweep.utils.utils.set_regex(classification_file_path=None, allow_extensions=None, block_extensions=None)
Load URL classification rules and return compiled regex/extension filters.