websweep.utils package

Submodules

websweep.utils.backend module

websweep.utils.backend.detect_existing_overview_backend(base_folder)

Detect an existing overview store in base_folder.

Parameters:

base_folder (Path)

Return type:

str | None

websweep.utils.backend.duckdb_available()

Return True when the optional duckdb dependency can be imported.

Return type:

bool
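An optional-dependency check of this kind can be sketched as follows. This is a minimal illustration, not the websweep implementation: the helper name optional_module_available is hypothetical, and it simply attempts the import and reports whether it succeeded.

```python
import importlib


def optional_module_available(name: str) -> bool:
    """Return True when the module `name` can be imported (hypothetical helper)."""
    try:
        importlib.import_module(name)
    except ImportError:
        # Covers ModuleNotFoundError as well, which subclasses ImportError.
        return False
    return True


# Example call; the result depends on whether duckdb is installed locally.
has_duckdb = optional_module_available("duckdb")
```

Attempting the real import catches broken installations that a path lookup alone would miss, at the cost of paying the import time once.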

websweep.utils.backend.resolve_overview_backend(base_folder, use_database, override_backend, urls_count=None)

Resolve which overview backend to use (duckdb/sqlite/csv).

Parameters:
  • base_folder (Path)

  • use_database (bool)

  • override_backend (str | None)

  • urls_count (int | None)

Return type:

str

websweep.utils.json_io module

websweep.utils.json_io.json_dumps(obj)

Serialize an object to UTF-8 JSON bytes (orjson when available).

Return type:

bytes

websweep.utils.json_io.json_loads(value)

Parse JSON from bytes or text using the active JSON backend.
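The "orjson when available" pattern described for json_dumps and json_loads might look like the sketch below. The function names dumps_bytes and loads_any are hypothetical, and the fallback settings for the standard library encoder are assumptions rather than the module's actual choices.

```python
import json

try:
    import orjson  # optional, faster JSON backend
except ImportError:
    orjson = None


def dumps_bytes(obj) -> bytes:
    """Serialize obj to UTF-8 JSON bytes (hypothetical helper)."""
    if orjson is not None:
        return orjson.dumps(obj)  # orjson already returns bytes
    # Assumed fallback: keep non-ASCII characters as-is and encode to UTF-8.
    return json.dumps(obj, ensure_ascii=False).encode("utf-8")


def loads_any(value):
    """Parse JSON from bytes or str with whichever backend is active."""
    if orjson is not None:
        return orjson.loads(value)
    return json.loads(value)  # the stdlib parser accepts str and bytes


round_trip = loads_any(dumps_bytes({"url": "example.com", "title": "café"}))
```

Callers always receive bytes from the serializer, so the rest of the code does not need to know which backend is installed.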

websweep.utils.json_io.append_jsonl(path, records)

Append dictionaries to an NDJSON file, one JSON object per line.

Parameters:

records (Iterable[dict])

Return type:

None
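Appending to an NDJSON file can be sketched as below. This is an illustration under assumptions, not the module's code: append_jsonl_sketch is a hypothetical name, and it uses the standard json module rather than the package's active JSON backend.

```python
import json
import tempfile
from pathlib import Path
from typing import Iterable


def append_jsonl_sketch(path: Path, records: Iterable[dict]) -> None:
    """Append each record as one JSON object per line (hypothetical helper)."""
    # Binary append mode keeps existing lines and avoids newline translation.
    with open(path, "ab") as handle:
        for record in records:
            handle.write(json.dumps(record, ensure_ascii=False).encode("utf-8"))
            handle.write(b"\n")


# Usage: two separate calls accumulate lines in the same file.
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "pages.jsonl"
    append_jsonl_sketch(target, [{"url": "example.com"}])
    append_jsonl_sketch(target, [{"url": "example.org"}, {"url": "example.net"}])
    lines = target.read_text(encoding="utf-8").splitlines()
```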

websweep.utils.public_suffix module

Utilities for loading and refreshing the public suffix list (PSL).

websweep.utils.public_suffix.ensure_public_suffix_list()

Return a local PSL path, updating it from GitHub when configured.

Return type:

Path

websweep.utils.public_suffix.build_tldextract_extractor(tldextract_module)

Build a configured TLDExtract instance backed by the local PSL file.

websweep.utils.source_urls module

websweep.utils.source_urls.read_source_urls(source_file_path)

Parse a source CSV/TSV and return URLs with optional identifiers.

Supported headers:
  • url / website / domain

  • identifier / id (optional)

Input hygiene:
  • auto-detects CSV vs TSV delimiters

  • keeps only level-0 rows when a level column exists

  • normalizes URLs and skips non-web schemes

  • removes exact duplicate (url, identifier) pairs while preserving order

Parameters:

source_file_path (Path)

Return type:

List[Tuple[str, str | None]]
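The input hygiene steps above can be approximated with the sketch below. It is not the websweep implementation: parse_source_text is a hypothetical helper that works on text instead of a file path, the delimiter check (counting tabs and commas in the header line) and the list of skipped schemes are assumptions, and URL normalization is left out.

```python
import csv
import io
from typing import List, Optional, Tuple

URL_HEADERS = ("url", "website", "domain")
ID_HEADERS = ("identifier", "id")
SKIPPED_SCHEMES = ("mailto:", "tel:", "javascript:")  # assumed non-web schemes


def parse_source_text(text: str) -> List[Tuple[str, Optional[str]]]:
    """Return ordered, de-duplicated (url, identifier) pairs from CSV/TSV text."""
    header = next(iter(text.splitlines()), "")
    # Assumed heuristic: more tabs than commas in the header means TSV.
    delimiter = "\t" if header.count("\t") > header.count(",") else ","
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    fields = {name.strip().lower(): name for name in (reader.fieldnames or [])}

    url_key = next((fields[h] for h in URL_HEADERS if h in fields), None)
    if url_key is None:
        return []
    id_key = next((fields[h] for h in ID_HEADERS if h in fields), None)
    level_key = fields.get("level")

    seen = set()
    pairs: List[Tuple[str, Optional[str]]] = []
    for row in reader:
        # Keep only level-0 rows when a level column exists.
        if level_key is not None and (row.get(level_key) or "").strip() != "0":
            continue
        url = (row.get(url_key) or "").strip()
        if not url or url.lower().startswith(SKIPPED_SCHEMES):
            continue
        identifier = None
        if id_key is not None:
            identifier = (row.get(id_key) or "").strip() or None
        pair = (url, identifier)
        if pair not in seen:  # drop exact duplicates, keep first-seen order
            seen.add(pair)
            pairs.append(pair)
    return pairs


tsv_text = (
    "website\tid\tlevel\n"
    "example.com\tA1\t0\n"
    "example.com\tA1\t0\n"
    "example.org/about\tA2\t1\n"
    "mailto:info@example.net\tA3\t0\n"
)
tsv_pairs = parse_source_text(tsv_text)
csv_pairs = parse_source_text("url,identifier\nexample.com,\nexample.org,B\n")
```

A set tracks which pairs were already seen while a list preserves the input order, which matches the "preserving order" requirement without sorting.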

websweep.utils.utils module

websweep.utils.utils.create_regex_pattern(keywords, regex)

Build a case-insensitive regex from literal keywords and raw regex text.
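One way to combine literal keywords with raw regex text is sketched below. The name build_pattern, the alternation layout, and the behavior for empty input are assumptions made for illustration; the actual function may combine its inputs differently.

```python
import re
from typing import Iterable, Optional


def build_pattern(keywords: Iterable[str], raw_regex: Optional[str] = None) -> re.Pattern:
    """Compile a case-insensitive alternation (hypothetical helper)."""
    # Escape keywords so characters such as "." match literally.
    parts = [re.escape(word) for word in keywords if word]
    if raw_regex:
        # Wrap the raw regex in a group so its own "|" cannot leak into ours.
        parts.append(f"(?:{raw_regex})")
    if not parts:
        return re.compile(r"(?!)")  # assumed choice: empty input never matches
    return re.compile("|".join(parts), re.IGNORECASE)


pattern = build_pattern(["contact", "about.us"], r"team-\d+")
```

Escaping only the keywords keeps them literal while still letting the caller pass a hand-written expression through unchanged.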

websweep.utils.utils.set_regex(classification_file_path=None, allow_extensions=None, block_extensions=None)

Load URL classification rules and return compiled regex/extension filters.

websweep.utils.utils.classify_url(url, level, url_regex_mail, negative_regex, url_regex, allowed_extensions=None, blocked_extensions=None)

Return whether a URL should be crawled for the given crawl depth.

Return type:

bool

websweep.utils.utils.clean_url(url)

Strip scheme and www. prefix for lightweight URL normalization.
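Lightweight normalization of this kind can be sketched with a single anchored regex. The helper name strip_scheme_and_www is hypothetical, and details such as trimming whitespace or accepting any scheme (not only http and https) are assumptions.

```python
import re

# Optional "scheme://" followed by an optional "www." at the start of the string.
_SCHEME_WWW = re.compile(r"^(?:[a-z][a-z0-9+.-]*://)?(?:www\.)?", re.IGNORECASE)


def strip_scheme_and_www(url: str) -> str:
    """Remove a leading scheme and www. prefix (hypothetical helper)."""
    return _SCHEME_WWW.sub("", url.strip(), count=1)


cleaned = strip_scheme_and_www("https://www.example.com/path")
```

Because the path, query, and letter case are left untouched, this is useful for comparing hosts quickly but is not full URL canonicalization.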

Module contents