websweep.extractor package
Submodules
websweep.extractor.add_host module
Legacy host-enrichment helpers.
Historically this module contained one-off analysis scripts that were never part of the runtime pipeline. It is kept as an intentionally empty module to preserve import paths referenced by older docs.
websweep.extractor.extractor module
This module provides the Extractor model-controller.
- class websweep.extractor.extractor.FileExtractor(info)[source]
Bases: object
A class for extracting data from one specific file. This class is used by the extractor pipeline. Custom FileExtractor subclasses can be built to extend the extraction functionality.
- Parameters:
info – tuple A tuple containing metadata about the file to extract data from, including the domain, level, website, date and path.
- extracting()[source]
Initiates the extraction of data from the file at the specified file path. Calls extract_default_metadata() and extract_extended_metadata().
- extract_default_metadata()[source]
Contains the default extraction functionality.
- extract_extended_metadata()
Contains the extendable extraction functionality, intended to be overridden in subclasses.
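The subclassing pattern described above can be sketched as follows. This is an illustration only: the attributes of the real FileExtractor are not documented here, so the block uses a stand-in base class (`StubFileExtractor`, hypothetical) that mirrors just the documented method names. The field order of `info`, the `results` dictionary, and the return value of `extracting()` are all assumptions.

```python
class StubFileExtractor:
    """Hypothetical stand-in mirroring the documented FileExtractor interface."""

    def __init__(self, info):
        # Assumed field order; the docs list these fields but not their positions.
        self.domain, self.level, self.website, self.date, self.path = info
        self.results = {}  # hypothetical container for extracted values

    def extracting(self):
        # Documented behaviour: run the default step, then the extension hook.
        self.extract_default_metadata()
        self.extract_extended_metadata()
        return self.results  # returning the results is an assumption

    def extract_default_metadata(self):
        # Toy default fields, copied straight from the info tuple.
        self.results["domain"] = self.domain
        self.results["date"] = self.date

    def extract_extended_metadata(self):
        # No-op in the base class; subclasses override this hook.
        pass


class PathDepthExtractor(StubFileExtractor):
    """Hypothetical subclass that adds one extra field."""

    def extract_extended_metadata(self):
        # Count path components as a toy example of an extended field.
        self.results["path_depth"] = self.path.strip("/").count("/") + 1


info = (
    "example.com",
    1,
    "https://example.com",
    "2024-01-01",
    "/data/example.com/index.html",
)
result = PathDepthExtractor(info).extracting()
```

With the real package, the equivalent subclass would presumably derive from websweep.extractor.extractor.FileExtractor and be passed to Extractor through the file_extractor parameter.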
- class websweep.extractor.extractor.Extractor(target_folder_path, use_database=True, extractor_delete_files=False, start_date='0000-01-01', end_date='9999-01-01', file_extractor=None, overview_backend=None, workers=None, imap_chunksize=50, maxtasksperchild=1000, extract_timeout_seconds=10, **kwargs)[source]
Bases: object
A class for extracting data from files and storing it in the target folder.
- Parameters:
target_folder_path – str The path to the folder where the extracted data is stored.
use_database – bool, optional Whether or not to use a database backend (duckdb/sqlite) for the overview file. If False, TSV is used. Default is True.
extractor_delete_files – bool, optional Whether or not to delete the original files after extracting data. Default is False.
file_extractor (FileExtractor) – FileExtractor, optional A custom instance of a FileExtractor class used to extract data from files. Default is None, in which case the default FileExtractor class is used.
overview_backend (str | None)
workers (int | None)
imap_chunksize (int)
maxtasksperchild (int)
extract_timeout_seconds (int)
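As a rough illustration of how the keyword arguments and their defaults combine, the sketch below copies the documented signature onto a stub function (`extractor_signature_stub`, hypothetical) and resolves a set of example settings against it with the standard library's `inspect` module. It does not import or run websweep. The path, the date value, and the worker count are made-up example values, and the date format is assumed to match the documented defaults.

```python
import inspect


def extractor_signature_stub(
    target_folder_path,
    use_database=True,
    extractor_delete_files=False,
    start_date="0000-01-01",
    end_date="9999-01-01",
    file_extractor=None,
    overview_backend=None,
    workers=None,
    imap_chunksize=50,
    maxtasksperchild=1000,
    extract_timeout_seconds=10,
    **kwargs,
):
    """Hypothetical stub copying the documented Extractor signature."""


signature = inspect.signature(extractor_signature_stub)

# Example settings; every value here is illustrative.
settings = {
    "target_folder_path": "./crawl_output",  # hypothetical path
    "use_database": False,  # per the docs, TSV is used when this is False
    "start_date": "2024-01-01",  # assumed to follow the default's format
    "workers": 4,
}

# Resolve the settings against the signature and fill in the defaults.
bound = signature.bind(**settings)
bound.apply_defaults()
resolved = dict(bound.arguments)

# A misspelled keyword is not rejected by the signature itself:
# it lands in **kwargs instead of raising a TypeError.
typo = signature.bind(target_folder_path="./crawl_output", worker=2)
swallowed = typo.arguments["kwargs"]
```

Because the signature accepts `**kwargs`, a misspelled parameter name such as `worker` would be absorbed at the signature level rather than flagged, so it is worth double-checking keyword names when configuring the class. Whether the real Extractor validates the extra keywords afterwards is not stated in this reference.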