websweep package

Subpackages

Submodules

websweep.config module

This module manages the WebSweep configuration: initializing or restoring an instance and reading its stored settings.

websweep.config.current_websweep_instance()

Return the location of the current WebSweep instance.

Return type:

Path

websweep.config.init_app(target_folder_path, source_file_path, extractor_delete_files, use_database, extractor_addon_file=None)

Initialize the application.

Parameters:
  • target_folder_path (str)

  • source_file_path (str)

  • extractor_delete_files (bool)

  • use_database (bool)

  • extractor_addon_file (Path | None)

Return type:

int

websweep.config.restore_app(target_folder_path)

Restore an existing application.

Parameters:

target_folder_path (Path)

Return type:

int

websweep.config.get_target_folder_path(config_file=None)

Return the path of the current WebSweep instance location.

Parameters:

config_file (Path)

Return type:

Path

websweep.config.get_source_file_path(config_file=None)

Return the current source file path.

Parameters:

config_file (Path)

Return type:

Path

websweep.config.get_extractor_delete(config_file=None)

Return whether processed raw files should be deleted.

Parameters:

config_file (Path)

Return type:

bool

websweep.config.get_extractor_addon_file(config_file=None)

Return the configured extractor add-on file path or None.

Parameters:

config_file (Path)

Return type:

Path | None

websweep.config.get_use_database(config_file=None)

Return whether overview data should use a database backend.

Parameters:

config_file (Path)

Return type:

bool
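
The getters above share one shape: an optional config_file argument and a typed return value. A minimal sketch of that pattern follows, assuming an INI-style file read with the standard library. The function names, the default path, and the "General" section and its keys are hypothetical and are not taken from WebSweep's actual config format.

```python
import configparser
from pathlib import Path
from typing import Optional

# Hypothetical default location, used only when no config_file is passed.
DEFAULT_CONFIG_FILE = Path.home() / ".websweep_example.ini"


def _load(config_file: Optional[Path]) -> configparser.ConfigParser:
    """Parse the given file, or the hypothetical default when None."""
    path = config_file if config_file is not None else DEFAULT_CONFIG_FILE
    parser = configparser.ConfigParser()
    parser.read(path)  # a missing file is skipped without raising
    return parser


def read_use_database(config_file: Optional[Path] = None) -> bool:
    """Return a boolean setting; section and key names are assumptions."""
    return _load(config_file).getboolean("General", "use_database", fallback=False)


def read_addon_file(config_file: Optional[Path] = None) -> Optional[Path]:
    """Return an optional path setting, or None when it is unset or empty."""
    value = _load(config_file).get("General", "extractor_addon_file", fallback="")
    return Path(value) if value else None
```

Passing an explicit config_file keeps such getters easy to test against a temporary file, while the None default lets callers rely on the application-wide config.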

websweep.main module

websweep.main.operate()

Validate active instance configuration before running operational commands.

websweep.main.init(headless=<typer.models.OptionInfo object>)

Initialise a new WebSweep instance. The instance location is stored in the application config file, a new folder is created at that location, and a settings file is created within that folder.

Parameters:

headless (bool)

Return type:

None

websweep.main.main(version=<typer.models.OptionInfo object>)

Typer root callback.

Parameters:

version (bool | None)

Return type:

None

websweep.main.restore(headless=<typer.models.OptionInfo object>)

Restore the configuration of an existing WebSweep instance. The existing location is stored in the application config file, and the existing settings in the settings file are validated.

Parameters:

headless (bool)

Return type:

None

websweep.main.cli_config(delete_processed_files=<typer.models.OptionInfo object>, source_file_path=<typer.models.OptionInfo object>)

Alter WebSweep configuration settings.

Parameters:
  • delete_processed_files (bool)

  • source_file_path (str)

Return type:

None

websweep.main.websweep_address()

Open the configured WebSweep instance folder.

Return type:

None

websweep.main.crawl(complement=<typer.models.OptionInfo object>, sock_connect=<typer.models.OptionInfo object>, extract=<typer.models.OptionInfo object>, classification_file=<typer.models.OptionInfo object>, allow_extensions=<typer.models.OptionInfo object>, block_extensions=<typer.models.OptionInfo object>, target_temp_folder_path=<typer.models.OptionInfo object>)

Start crawling websites.

Parameters:
  • complement (str)

  • sock_connect (int)

  • extract (bool)

  • classification_file (Path)

  • allow_extensions (str)

  • block_extensions (str)

  • target_temp_folder_path (Path)

Return type:

None
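
The allow_extensions and block_extensions options are documented only as strings. As an illustration of how such options could drive URL filtering, the sketch below assumes a comma-separated list of extensions and a rule where the block-list wins over the allow-list. The helper names, the input format, and the precedence rule are all hypothetical and may differ from what the crawler actually does.

```python
from pathlib import PurePosixPath
from typing import Optional, Set
from urllib.parse import urlparse


def parse_extensions(raw: str) -> Set[str]:
    """Normalize a comma-separated string into lowercase dotted suffixes."""
    result = set()
    for item in raw.split(","):
        item = item.strip().lower()
        if item:
            result.add(item if item.startswith(".") else "." + item)
    return result


def is_allowed(url: str, allow: Optional[Set[str]], block: Set[str]) -> bool:
    """Decide whether a URL passes the (assumed) extension rules."""
    # Only the URL path is inspected, so query strings do not affect the suffix.
    suffix = PurePosixPath(urlparse(url).path).suffix.lower()
    if suffix in block:
        return False
    if allow:
        # With a non-empty allow-list, pages without an extension still pass.
        return suffix == "" or suffix in allow
    return True
```

For example, `is_allowed("https://example.com/report.pdf", None, parse_extensions("pdf"))` evaluates to False under these assumptions.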

websweep.main.extract(start_date=<typer.models.OptionInfo object>, end_date=<typer.models.OptionInfo object>, workers=<typer.models.OptionInfo object>)

Start extracting data from fetched files.

Parameters:
  • start_date (str)

  • end_date (str)

  • workers (int)

Return type:

None

websweep.main.consolidate(input_file=<typer.models.OptionInfo object>, output_file=<typer.models.OptionInfo object>, chunk_size=<typer.models.OptionInfo object>)

Consolidate page-level extracted NDJSON into domain-level NDJSON.

Parameters:
  • input_file (Path | None)

  • output_file (Path | None)

  • chunk_size (int)

Return type:

None
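
To make the idea of consolidating page-level NDJSON into domain-level NDJSON concrete, here is a minimal sketch using only the standard library. It is not WebSweep's implementation: the function names, the "domain" field, the output record layout, and the way chunk_size is used (records parsed per batch) are assumptions made for illustration.

```python
import json
from collections import defaultdict
from itertools import islice
from pathlib import Path
from typing import Dict, Iterator, List


def read_chunks(path: Path, chunk_size: int) -> Iterator[List[dict]]:
    """Yield lists of at most chunk_size parsed records from an NDJSON file."""
    with path.open(encoding="utf-8") as handle:
        # Blank lines are skipped; every other line must be one JSON object.
        records = (json.loads(line) for line in handle if line.strip())
        while True:
            chunk = list(islice(records, chunk_size))
            if not chunk:
                return
            yield chunk


def consolidate_pages(input_file: Path, output_file: Path, chunk_size: int = 1000) -> int:
    """Group page records by domain and write one NDJSON line per domain."""
    pages_by_domain: Dict[str, List[dict]] = defaultdict(list)
    for chunk in read_chunks(input_file, chunk_size):
        for record in chunk:
            # "domain" is a hypothetical field name for the grouping key.
            pages_by_domain[record["domain"]].append(record)
    with output_file.open("w", encoding="utf-8") as handle:
        for domain, pages in sorted(pages_by_domain.items()):
            row = {"domain": domain, "page_count": len(pages), "pages": pages}
            handle.write(json.dumps(row) + "\n")
    return len(pages_by_domain)
```

This sketch keeps all grouped records in memory, which is fine for small inputs. A large crawl would more likely need an on-disk or sorted-merge strategy.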

Module contents