websweep.consolidator package
Submodules
websweep.consolidator.consolidator module
This module provides the Consolidator model-controller.
- class websweep.consolidator.consolidator.Domain(domain, identifier, phone, email, fax, zipcode, address, kvk, btw, text)
Bases:
object
A data class representing a domain with various attributes.
- Parameters:
domain (str)
identifier (str)
phone (Counter)
email (Counter)
fax (Counter)
zipcode (Counter)
address (Counter)
kvk (Counter)
btw (Counter)
text (str)
- domain
The domain name.
- Type:
str
- identifier
The identifier of the domain.
- Type:
str
- phone
A counter for phone numbers.
- Type:
Counter
- email
A counter for email addresses.
- Type:
Counter
- fax
A counter for fax numbers.
- Type:
Counter
- zipcode
A counter for zip codes.
- Type:
Counter
- address
A counter for addresses.
- Type:
Counter
- kvk
A counter for KVK (Dutch Chamber of Commerce registration) numbers.
- Type:
Counter
- btw
A counter for BTW (Dutch VAT) numbers.
- Type:
Counter
- text
The text associated with the domain.
- Type:
str
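To illustrate the shape of this class, here is a minimal sketch of a data class with the same field names and types. `DomainSketch` is a hypothetical stand-in written for this example, not the websweep implementation, and the sample values are invented.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class DomainSketch:
    """Hypothetical stand-in mirroring the documented Domain fields."""

    domain: str
    identifier: str
    phone: Counter
    email: Counter
    fax: Counter
    zipcode: Counter
    address: Counter
    kvk: Counter
    btw: Counter
    text: str


# Invented sample values: each Counter maps a found value to how often it occurred.
example = DomainSketch(
    domain="example.nl",
    identifier="0001",
    phone=Counter(["010-1234567", "010-1234567", "020-7654321"]),
    email=Counter(["info@example.nl"]),
    fax=Counter(),
    zipcode=Counter(["1234 AB"]),
    address=Counter(),
    kvk=Counter(),
    btw=Counter(),
    text="Welcome to example.nl",
)

# The most frequent value is a natural candidate for the domain-level value.
most_common_phone = example.phone.most_common(1)[0]
```

Storing each attribute as a `Counter` keeps every candidate value together with its frequency, so the choice of a single "best" value can be deferred.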
- class websweep.consolidator.consolidator.Consolidator(input_file=None, target_folder_path=None, output_file=None, chunk_size=10000)
Bases:
object
Process domain-level information from NDJSON files.
The consolidator reads extracted page-level records in chunks, aggregates values per domain, and writes a merged domain-level output file.
- Parameters:
input_file (str | Path | None)
target_folder_path (str | Path | None)
output_file (str | Path | None)
chunk_size (int)
- save_orjson_loads(line)
Loads a line from an NDJSON file using orjson.
- Parameters:
line (str) – A line from an NDJSON file.
- Returns:
A dictionary representing the line.
- Return type:
Dict[str, Any]
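The idea of loading one NDJSON line defensively can be sketched as follows. This is an illustration only: it uses the standard-library `json` module in place of orjson so that it runs without third-party packages, and the fallback of returning an empty dictionary for unparsable or non-object lines is an assumption, since the error behavior of `save_orjson_loads` is not documented here. The name `safe_loads_sketch` is hypothetical.

```python
import json
from typing import Any, Dict


def safe_loads_sketch(line: str) -> Dict[str, Any]:
    """Parse one NDJSON line; return {} when the line is not a JSON object.

    Assumption: falling back to an empty dict is a choice made for this
    sketch, not the documented behavior of the websweep method.
    """
    try:
        record = json.loads(line)
    except json.JSONDecodeError:
        return {}
    # NDJSON records are expected to be JSON objects; reject anything else.
    return record if isinstance(record, dict) else {}
```

Returning a dictionary in every case lets the caller skip bad lines with a simple truthiness check instead of wrapping each call in its own `try` block.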
- read_ndjson_in_chunks()
Reads an NDJSON file in chunks.
- Yields:
List[Dict[str, Any]] – A chunk of parsed records: a list of dictionaries, one per line of the NDJSON file.
- Return type:
Generator[List[Dict[str, Any]], None, None]
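Chunked reading of this kind can be sketched with `itertools.islice`. The function below is a hypothetical, standalone illustration: it takes the path and chunk size as arguments (the real method takes none and presumably uses the values given to the constructor), and it parses lines with the standard-library `json` module rather than orjson.

```python
import json
from itertools import islice
from typing import Any, Dict, Generator, List


def read_ndjson_in_chunks_sketch(
    path: str, chunk_size: int = 10000
) -> Generator[List[Dict[str, Any]], None, None]:
    """Yield lists of at most chunk_size parsed records from an NDJSON file."""
    with open(path, "r", encoding="utf-8") as handle:
        while True:
            # islice pulls the next chunk_size lines without reading the whole file.
            lines = list(islice(handle, chunk_size))
            if not lines:
                break
            # Blank lines are skipped; a malformed line would raise here.
            chunk = [json.loads(line) for line in lines if line.strip()]
            if chunk:
                yield chunk
```

Because only one chunk is held in memory at a time, peak memory depends on `chunk_size` rather than on the size of the input file.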
- create_domain_info(chunk, output_file)
Creates domain information from sorted chunks and writes it to an output file.
- Parameters:
chunk (List[Dict[str, Any]]) – A list of dictionaries, each representing a domain.
output_file (str) – The path to the output file where the domain information will be written.
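The per-domain aggregation described for this class can be sketched roughly as below. The record layout is an assumption made for the example: each record is taken to carry a `"domain"` key plus optional string or list values under the same field names as the `Domain` attributes. The helper names `aggregate_chunk_sketch` and `write_domains_sketch` are hypothetical and do not reflect how `create_domain_info` is actually implemented.

```python
import json
from collections import Counter, defaultdict
from typing import Any, Dict, List

# Field names taken from the documented Domain attributes.
FIELDS = ("phone", "email", "fax", "zipcode", "address", "kvk", "btw")


def aggregate_chunk_sketch(chunk: List[Dict[str, Any]]) -> Dict[str, Dict[str, Counter]]:
    """Merge records into one set of Counters per domain (assumed layout)."""
    merged: Dict[str, Dict[str, Counter]] = defaultdict(
        lambda: {name: Counter() for name in FIELDS}
    )
    for record in chunk:
        domain = record.get("domain")
        if not domain:
            continue  # records without a domain cannot be attributed
        for name in FIELDS:
            values = record.get(name) or []
            if isinstance(values, str):
                values = [values]  # treat a single string as one value
            merged[domain][name].update(values)
    return dict(merged)


def write_domains_sketch(merged: Dict[str, Dict[str, Counter]], output_file: str) -> None:
    """Append one NDJSON line per domain to output_file."""
    with open(output_file, "a", encoding="utf-8") as out:
        for domain, counters in merged.items():
            row = {"domain": domain}
            row.update({name: dict(counter) for name, counter in counters.items()})
            out.write(json.dumps(row) + "\n")
```

Converting each `Counter` to a plain dictionary before serialization keeps the value frequencies in the output, so a later step can still pick the most common value per field.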