websweep.consolidator package

Submodules

websweep.consolidator.consolidator module

This module provides the Consolidator model-controller.

class websweep.consolidator.consolidator.Domain(domain, identifier, phone, email, fax, zipcode, address, kvk, btw, text)

Bases: object

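A data class holding one domain-level record: the domain name, an identifier, a counter of observed values for each extracted contact or registration field, and the associated text.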

Parameters:
  • domain (str)

  • identifier (str)

  • phone (Counter)

  • email (Counter)

  • fax (Counter)

  • zipcode (Counter)

  • address (Counter)

  • kvk (Counter)

  • btw (Counter)

  • text (str)

domain

The domain name.

Type:

str

identifier

The identifier of the domain.

Type:

str

phone

A counter for phone numbers.

Type:

Counter

email

A counter for email addresses.

Type:

Counter

fax

A counter for fax numbers.

Type:

Counter

zipcode

A counter for zip codes.

Type:

Counter

address

A counter for addresses.

Type:

Counter

kvk

A counter for KVK (Dutch Chamber of Commerce) numbers.

Type:

Counter

btw

A counter for BTW (Dutch VAT) numbers.

Type:

Counter

text

The text associated with the domain.

Type:

str

to_dict()

Converts the Domain object into a dictionary.

Returns:

A dictionary representation of the Domain object.

Return type:

Dict[str, Any]

classmethod from_dict(d)

Creates a Domain object from a dictionary.

Parameters:

d (Dict[str, Any]) – A dictionary containing Domain attributes.

Returns:

A new Domain object created from the dictionary.

Return type:

Domain
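
The round trip between to_dict() and from_dict() can be illustrated with a stand-in class. The sketch below is not the library's implementation: DomainSketch and COUNTER_FIELDS are hypothetical names, and it assumes counters are serialised as plain dictionaries and rebuilt as Counter objects on load.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Any, Dict

# Hypothetical constant: the Counter-typed fields listed in the reference above.
COUNTER_FIELDS = ("phone", "email", "fax", "zipcode", "address", "kvk", "btw")


@dataclass
class DomainSketch:
    """Stand-in that mirrors the documented Domain fields (an assumption)."""

    domain: str
    identifier: str
    phone: Counter
    email: Counter
    fax: Counter
    zipcode: Counter
    address: Counter
    kvk: Counter
    btw: Counter
    text: str

    def to_dict(self) -> Dict[str, Any]:
        # Assumption: Counters become plain dicts so the result is JSON-friendly.
        result: Dict[str, Any] = {
            "domain": self.domain,
            "identifier": self.identifier,
            "text": self.text,
        }
        for name in COUNTER_FIELDS:
            result[name] = dict(getattr(self, name))
        return result

    @classmethod
    def from_dict(cls, d: Dict[str, Any]) -> "DomainSketch":
        # Rebuild each counter; a missing field becomes an empty Counter.
        counters = {name: Counter(d.get(name, {})) for name in COUNTER_FIELDS}
        return cls(
            domain=d["domain"],
            identifier=d["identifier"],
            text=d.get("text", ""),
            **counters,
        )
```

With this design, DomainSketch.from_dict(obj.to_dict()) yields an object equal to obj, which is the property a serialisation pair like this is usually expected to have.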

class websweep.consolidator.consolidator.Consolidator(input_file=None, target_folder_path=None, output_file=None, chunk_size=10000)

Bases: object

Process domain-level information from NDJSON files.

The consolidator reads extracted page-level records in chunks, aggregates values per domain, and writes a merged domain-level output file.

Parameters:
  • input_file (str | Path | None)

  • target_folder_path (str | Path | None)

  • output_file (str | Path | None)

  • chunk_size (int)

save_orjson_loads(line)

Parses one line of an NDJSON file using orjson.

Parameters:

line (str) – A line from an NDJSON file.

Returns:

A dictionary representing the line.

Return type:

Dict[str, Any]
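
The reference does not say how malformed lines are handled. As a rough illustration of tolerant line parsing, the sketch below uses the standard-library json module in place of orjson so that it runs without extra dependencies. The helper name tolerant_loads and the choice to return an empty dictionary on failure are assumptions, not documented behaviour.

```python
import json
from typing import Any, Dict


def tolerant_loads(line: str) -> Dict[str, Any]:
    """Hypothetical helper: parse one NDJSON line, returning {} on failure."""
    try:
        record = json.loads(line)
    except (json.JSONDecodeError, TypeError):
        # Malformed JSON or a non-string input: skip rather than abort the run.
        return {}
    # NDJSON records are expected to be objects; ignore arrays and scalars.
    return record if isinstance(record, dict) else {}
```

Returning an empty dictionary keeps the return type uniform, so callers can filter out bad lines with a simple truthiness check.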

read_ndjson_in_chunks()

Reads the NDJSON input file in chunks.

Yields:

Generator[List[Dict[str, Any]], None, None] – A generator that yields lists of dictionaries, each dictionary representing one line of the NDJSON file.

Return type:

Generator[List[Dict[str, Any]], None, None]
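
Chunked reading of a line-delimited file can be sketched with itertools.islice. This is a minimal illustration, not the method's actual code: iter_chunks is a hypothetical function that takes an open text handle, whereas the documented method takes no arguments and presumably reads the configured input file.

```python
import io
import json
from itertools import islice
from typing import Any, Dict, Iterator, List, TextIO


def iter_chunks(handle: TextIO, chunk_size: int = 10000) -> Iterator[List[Dict[str, Any]]]:
    """Yield lists of parsed records, reading at most chunk_size lines at a time."""
    while True:
        # islice pulls the next chunk_size lines without loading the whole file.
        lines = list(islice(handle, chunk_size))
        if not lines:
            break
        # Blank lines are skipped, so a chunk may hold fewer records than lines.
        yield [json.loads(line) for line in lines if line.strip()]


# Usage with an in-memory file: four lines, one of them blank.
sample = io.StringIO('{"n": 0}\n{"n": 1}\n\n{"n": 2}\n')
sizes = [len(chunk) for chunk in iter_chunks(sample, chunk_size=2)]
# sizes == [2, 1]
```

Because only one chunk is held in memory at a time, peak memory is bounded by chunk_size rather than by the size of the input file.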

create_domain_info(chunk, output_file)

Creates domain information from sorted chunks and writes to an output file.

Parameters:
  • chunk (List[Dict[str, Any]]) – A list of dictionaries, each representing a domain.

  • output_file (str) – The path to the output file where the domain information will be written.
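
The per-domain aggregation step can be sketched with collections.Counter. The record layout here is an assumption: each record is taken to carry a "domain" key plus optional fields holding either a single string or a list of strings. The function name aggregate_chunk is hypothetical, and writing the result to the output file is omitted.

```python
from collections import Counter, defaultdict
from typing import Any, Dict, List

# Hypothetical constant: the Counter-typed fields of the Domain class.
COUNTER_FIELDS = ("phone", "email", "fax", "zipcode", "address", "kvk", "btw")


def aggregate_chunk(chunk: List[Dict[str, Any]]) -> Dict[str, Dict[str, Counter]]:
    """Count how often each value occurs per field, grouped by domain."""
    totals: Dict[str, Dict[str, Counter]] = defaultdict(
        lambda: {name: Counter() for name in COUNTER_FIELDS}
    )
    for record in chunk:
        domain = record.get("domain")
        if not domain:
            continue  # records without a domain cannot be grouped
        for name in COUNTER_FIELDS:
            values = record.get(name) or []
            if isinstance(values, str):
                values = [values]  # avoid counting individual characters
            totals[domain][name].update(values)
    return dict(totals)
```

Counting occurrences rather than keeping a set of values preserves how often each value was seen, which makes it possible to pick the most common value for a domain later.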

merge_domain_files(input_files, final_output)

Merges multiple domain files into a single file.

Parameters:
  • input_files (List[str]) – A list of file paths to be merged.

  • final_output (str) – Path to the final output file.
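
Merging per-chunk files amounts to summing the counters of records that share a domain. The sketch below assumes each intermediate file is NDJSON with one record per domain and counters stored as {value: count} objects. It holds all domains in memory for simplicity, which the actual method may well avoid, and merge_counts is a hypothetical name.

```python
import json
from collections import Counter, defaultdict
from typing import Dict, List


def merge_counts(input_files: List[str], final_output: str) -> None:
    """Sum per-domain counters across several NDJSON files into one file."""
    merged: Dict[str, Dict[str, Counter]] = defaultdict(lambda: defaultdict(Counter))
    for path in input_files:
        with open(path, encoding="utf-8") as handle:
            for line in handle:
                if not line.strip():
                    continue
                record = json.loads(line)
                fields = merged[record.pop("domain")]  # registers the domain
                for name, counts in record.items():
                    if isinstance(counts, dict):
                        fields[name].update(counts)  # adds counts per value
    with open(final_output, "w", encoding="utf-8") as out:
        for domain in sorted(merged):
            row = {"domain": domain}
            row.update({name: dict(c) for name, c in merged[domain].items()})
            out.write(json.dumps(row) + "\n")
```

Counter.update adds counts instead of replacing them, so a value seen twice in one chunk and once in another ends up with a count of three.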

consolidate(final_output=None)

Run full consolidation: chunk, aggregate per chunk, then merge chunks.

Parameters:

final_output (str | Path | None)

Module contents