websweep package
Subpackages
- websweep.consolidator package
- Submodules
- websweep.consolidator.consolidator module
- Domain: Domain.domain, Domain.identifier, Domain.phone, Domain.email, Domain.fax, Domain.zipcode, Domain.address, Domain.kvk, Domain.btw, Domain.text, Domain.to_dict(), Domain.from_dict()
- Consolidator
- Module contents
- websweep.crawler package
- websweep.extractor package
- websweep.utils package
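The contents entry for `websweep.consolidator.consolidator` lists a `Domain` class with the attributes `domain`, `identifier`, `phone`, `email`, `fax`, `zipcode`, `address`, `kvk`, `btw` and `text`, plus `to_dict()` and `from_dict()`. A minimal sketch of a record with that shape might look like the following. The class name `DomainRecord`, the field types and the defaults are assumptions made for illustration; the package's actual implementation is not shown on this page and may differ.

```python
from dataclasses import asdict, dataclass, field, fields
from typing import List, Optional


@dataclass
class DomainRecord:
    """Hypothetical stand-in for websweep's Domain; all types are assumed."""

    domain: str
    identifier: Optional[str] = None
    phone: List[str] = field(default_factory=list)
    email: List[str] = field(default_factory=list)
    fax: List[str] = field(default_factory=list)
    zipcode: List[str] = field(default_factory=list)
    address: List[str] = field(default_factory=list)
    kvk: List[str] = field(default_factory=list)  # assumed: Dutch Chamber of Commerce numbers
    btw: List[str] = field(default_factory=list)  # assumed: Dutch VAT numbers
    text: str = ""

    def to_dict(self) -> dict:
        # Serialise every field into a plain dictionary.
        return asdict(self)

    @classmethod
    def from_dict(cls, data: dict) -> "DomainRecord":
        # Ignore keys that are not fields so that extra data does not raise.
        known = {f.name for f in fields(cls)}
        return cls(**{k: v for k, v in data.items() if k in known})


record = DomainRecord(domain="example.nl", email=["info@example.nl"])
restored = DomainRecord.from_dict(record.to_dict())
```

Round-tripping through `to_dict()` and `from_dict()` in this way is what makes such a record easy to write to and read from line-delimited JSON.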
Submodules
websweep.config module
This module provides the WebSweep config functionality.
- websweep.config.current_websweep_instance()
Return the current websweep location
- Return type:
Path
- websweep.config.init_app(target_folder_path, source_file_path, extractor_delete_files, use_database, extractor_addon_file=None)
Initialize the application.
- Parameters:
target_folder_path (str)
source_file_path (str)
extractor_delete_files (bool)
use_database (bool)
extractor_addon_file (Path | None)
- Return type:
int
- websweep.config.restore_app(target_folder_path)
Restore existing application.
- Parameters:
target_folder_path (Path)
- Return type:
int
- websweep.config.get_target_folder_path(config_file=None)
Return the current WebSweep instance location path
- Parameters:
config_file (Path)
- Return type:
Path
- websweep.config.get_source_file_path(config_file=None)
Return the current source file path
- Parameters:
config_file (Path)
- Return type:
Path
- websweep.config.get_extractor_delete(config_file=None)
Return whether to delete processed raw files
- Parameters:
config_file (Path)
- Return type:
bool
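The three getters above share one shape: an optional `config_file` argument and a typed return value. The sketch below illustrates that pattern with the standard library only. It assumes an INI-style file, a `[General]` section, the key names shown, and a fallback to a default path when `config_file` is `None`; none of these details are confirmed by this reference, and the function names are deliberately different from the real ones.

```python
import configparser
import tempfile
from pathlib import Path
from typing import Optional

# Hypothetical default location; the real path is not given in this reference.
DEFAULT_CONFIG_FILE = Path(tempfile.gettempdir()) / "websweep_sketch" / "config.ini"


def _load(config_file: Optional[Path]) -> configparser.ConfigParser:
    # Fall back to the default file when no explicit path is passed (assumed behaviour).
    path = DEFAULT_CONFIG_FILE if config_file is None else config_file
    parser = configparser.ConfigParser()
    parser.read(path)
    return parser


def read_target_folder(config_file: Optional[Path] = None) -> Path:
    # Section and key names are illustrative assumptions.
    return Path(_load(config_file)["General"]["target_folder"])


def read_extractor_delete(config_file: Optional[Path] = None) -> bool:
    return _load(config_file).getboolean("General", "extractor_delete_files")


# Demonstration against a throwaway config file.
with tempfile.TemporaryDirectory() as tmp:
    cfg = Path(tmp) / "config.ini"
    cfg.write_text(
        "[General]\n"
        "target_folder = /data/websweep\n"
        "extractor_delete_files = true\n"
    )
    target = read_target_folder(cfg)
    delete_raw = read_extractor_delete(cfg)
```

Accepting the path as a parameter keeps such getters easy to test, because a temporary file can stand in for the application config.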
websweep.main module
- websweep.main.operate()
Validate active instance configuration before running operational commands.
- websweep.main.init(headless=<typer.models.OptionInfo object>)
Initialise a new WebSweep instance. The instance location is stored in the application config file, a new folder is created at that location, and a settings file is created within this folder.
- Parameters:
headless (bool)
- Return type:
None
- websweep.main.main(version=<typer.models.OptionInfo object>)
Typer root callback.
- Parameters:
version (bool | None)
- Return type:
None
- websweep.main.restore(headless=<typer.models.OptionInfo object>)
Restore the configuration of an existing WebSweep instance. The existing location is stored in the application config file, and the existing settings in the settings file are validated.
- Parameters:
headless (bool)
- Return type:
None
- websweep.main.cli_config(delete_processed_files=<typer.models.OptionInfo object>, source_file_path=<typer.models.OptionInfo object>)
Alter WebSweep configuration settings
- Parameters:
delete_processed_files (bool)
source_file_path (str)
- Return type:
None
- websweep.main.websweep_address()
Open configured WebSweep instance folder
- Return type:
None
- websweep.main.crawl(complement=<typer.models.OptionInfo object>, sock_connect=<typer.models.OptionInfo object>, extract=<typer.models.OptionInfo object>, classification_file=<typer.models.OptionInfo object>, allow_extensions=<typer.models.OptionInfo object>, block_extensions=<typer.models.OptionInfo object>, target_temp_folder_path=<typer.models.OptionInfo object>)
Start crawling websites.
- Parameters:
complement (str)
sock_connect (int)
extract (bool)
classification_file (Path)
allow_extensions (str)
block_extensions (str)
target_temp_folder_path (Path)
- Return type:
None
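The `allow_extensions` and `block_extensions` parameters of `crawl` are typed as strings, but this reference does not document their format or precedence. The sketch below shows one plausible reading, purely as an assumption: each string is a comma-separated list of file extensions, blocked extensions always win, and URLs without an extension are always fetched. The helper names `parse_extensions` and `should_fetch` are hypothetical and are not part of WebSweep.

```python
from pathlib import PurePosixPath
from typing import Optional, Set
from urllib.parse import urlparse


def parse_extensions(raw: Optional[str]) -> Set[str]:
    # Turn "pdf, .ZIP" into {".pdf", ".zip"}; an empty or missing value gives an empty set.
    if not raw:
        return set()
    return {
        "." + part.strip().lower().lstrip(".")
        for part in raw.split(",")
        if part.strip()
    }


def should_fetch(
    url: str,
    allow_extensions: Optional[str] = None,
    block_extensions: Optional[str] = None,
) -> bool:
    # Compare only the extension of the URL path, ignoring query strings and case.
    suffix = PurePosixPath(urlparse(url).path).suffix.lower()
    if suffix in parse_extensions(block_extensions):
        return False
    allowed = parse_extensions(allow_extensions)
    # Assumed rule: an allow list restricts only URLs that have an extension.
    if allowed and suffix and suffix not in allowed:
        return False
    return True


pdf_blocked = should_fetch("https://example.nl/report.pdf", block_extensions="pdf, .zip")
page_allowed = should_fetch("https://example.nl/about", allow_extensions="html")
image_allowed = should_fetch("https://example.nl/logo.JPG", allow_extensions="html")
```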
- websweep.main.extract(start_date=<typer.models.OptionInfo object>, end_date=<typer.models.OptionInfo object>, workers=<typer.models.OptionInfo object>)
Start extracting data from fetched files.
- Parameters:
start_date (str)
end_date (str)
workers (int)
- Return type:
None
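The `extract` command combines a date window (`start_date`, `end_date`) with a worker count (`workers`). A rough sketch of that combination follows. It assumes ISO `YYYY-MM-DD` date strings, inclusive bounds, and a thread pool; the real command may parse dates and parallelise differently. `extract_one` and `extract_window` are hypothetical placeholders for the real extraction work.

```python
from concurrent.futures import ThreadPoolExecutor
from datetime import date
from typing import Dict, List, Optional, Tuple


def in_window(day: date, start_date: Optional[str], end_date: Optional[str]) -> bool:
    # Bounds are treated as inclusive; a missing bound leaves that side open.
    if start_date and day < date.fromisoformat(start_date):
        return False
    if end_date and day > date.fromisoformat(end_date):
        return False
    return True


def extract_one(name: str) -> Dict[str, str]:
    # Placeholder for the real per-file extraction step.
    return {"file": name, "status": "extracted"}


def extract_window(
    files: List[Tuple[str, date]],
    start_date: Optional[str] = None,
    end_date: Optional[str] = None,
    workers: int = 2,
) -> List[Dict[str, str]]:
    # Select files fetched inside the window, then process them in parallel.
    selected = [name for name, day in files if in_window(day, start_date, end_date)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(extract_one, selected))


fetched = [
    ("a.html", date(2024, 1, 5)),
    ("b.html", date(2024, 2, 1)),
    ("c.html", date(2024, 3, 1)),
]
results = extract_window(fetched, start_date="2024-01-10", end_date="2024-02-28")
```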
- websweep.main.consolidate(input_file=<typer.models.OptionInfo object>, output_file=<typer.models.OptionInfo object>, chunk_size=<typer.models.OptionInfo object>)
Consolidate page-level extracted NDJSON into domain-level NDJSON.
- Parameters:
input_file (Path | None)
output_file (Path | None)
chunk_size (int)
- Return type:
None
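To make "page-level NDJSON into domain-level NDJSON" concrete, the sketch below reads one JSON object per line in chunks of `chunk_size`, merges the pages that share a domain, and writes one line per domain. The record keys (`domain`, `email`, `phone`), the merge rule (a sorted union of values) and the function names are assumptions for illustration; WebSweep's own consolidation logic is not described on this page.

```python
import io
import json
from collections import defaultdict
from itertools import islice
from typing import Dict, Iterator, List, Set, TextIO


def read_chunks(handle: TextIO, chunk_size: int) -> Iterator[List[dict]]:
    # Yield lists of at most chunk_size parsed records, skipping blank lines.
    lines = (line for line in handle if line.strip())
    while True:
        chunk = [json.loads(line) for line in islice(lines, chunk_size)]
        if not chunk:
            return
        yield chunk


def consolidate_pages(source: TextIO, target: TextIO, chunk_size: int = 1000) -> int:
    merged: Dict[str, Dict[str, Set[str]]] = defaultdict(
        lambda: {"email": set(), "phone": set()}
    )
    for chunk in read_chunks(source, chunk_size):
        for page in chunk:
            entry = merged[page["domain"]]
            entry["email"].update(page.get("email", []))
            entry["phone"].update(page.get("phone", []))
    # Write one NDJSON line per domain, in a stable order.
    for domain in sorted(merged):
        row = {
            "domain": domain,
            "email": sorted(merged[domain]["email"]),
            "phone": sorted(merged[domain]["phone"]),
        }
        target.write(json.dumps(row) + "\n")
    return len(merged)


pages = io.StringIO(
    '{"domain": "a.nl", "url": "https://a.nl/", "email": ["info@a.nl"]}\n'
    '{"domain": "a.nl", "url": "https://a.nl/contact", '
    '"email": ["info@a.nl", "sales@a.nl"], "phone": ["0201234567"]}\n'
    '{"domain": "b.nl", "url": "https://b.nl/", "phone": []}\n'
)
domains = io.StringIO()
count = consolidate_pages(pages, domains, chunk_size=2)
```

Reading in chunks bounds the number of parsed page records held at once, although in this sketch the merged per-domain state still grows with the number of distinct domains.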