websweep.crawler package

Submodules

websweep.crawler.crawler module

This module provides the Crawler model-controller.

class websweep.crawler.crawler.Crawler(target_folder_path, target_temp_folder_path=None, save_html=True, max_level=3, classification_file_path=None, allow_extensions=None, block_extensions=None, verify_ssl=False, concurrency_base_urls=60, threads_bs4=10, threads_download=120, use_database=True, sock_connect=180, extract=False, headers=None, file_extractor=None, max_pages_per_domain=50, min_days_between_crawls=30, chunk_size=1000000, overview_backend=None, concurrency_pages=None, page_batch_size=500, base_url_batch_size=1000, max_concurrency_per_domain=1, overview_create_indexes=None, duckdb_deduplicate=False, **kwargs)

Bases: object

Crawl websites to a bounded depth and store a crawl overview together with the raw pages.

Parameters:

overview_backend (str | None)
concurrency_pages (int | None)
page_batch_size (int)
base_url_batch_size (int)
max_concurrency_per_domain (int)
overview_create_indexes (bool | None)
duckdb_deduplicate (bool)
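
To make the "bounded depth" idea concrete, here is a minimal sketch of a breadth-first walk that stops at a depth limit and a per-domain page cap. It is not websweep's implementation: the LINKS graph and the bounded_crawl function are hypothetical, and only the parameter names max_level and max_pages_per_domain (with their defaults) are taken from the signature above.

```python
from collections import deque

# Hypothetical in-memory link graph standing in for real pages (assumption).
LINKS = {
    "a.example/": ["a.example/about", "a.example/news"],
    "a.example/about": ["a.example/team"],
    "a.example/news": ["a.example/news/1"],
    "a.example/team": ["a.example/team/jobs"],
    "a.example/news/1": [],
    "a.example/team/jobs": [],
}


def bounded_crawl(base_url, max_level=3, max_pages_per_domain=50):
    """Breadth-first walk from base_url, bounded by depth and by page count."""
    seen = {base_url}
    queue = deque([(base_url, 0)])  # (url, level); the base URL is level 0
    visited = []
    while queue and len(visited) < max_pages_per_domain:
        url, level = queue.popleft()
        visited.append((url, level))
        if level == max_level:
            continue  # do not follow links past the depth bound
        for link in LINKS.get(url, []):
            if link not in seen:
                seen.add(link)
                queue.append((link, level + 1))
    return visited


shallow = bounded_crawl("a.example/", max_level=1)
capped = bounded_crawl("a.example/", max_pages_per_domain=2)
```

With max_level=1 the walk visits the base URL and its direct links only; with max_pages_per_domain=2 it stops after two pages regardless of depth.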

get_urls(r, url, domain, level, identifier): Parse the page code and return its content and URLs. It is defined here so that it can be pickled and processed in a thread pool.
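
As an illustration of parsing pages in a pool, the sketch below extracts links with a module-level function and runs it in a thread pool. Everything in it is an assumption: parse_page and _LinkCollector are hypothetical names, the standard-library html.parser is swapped in to keep the example self-contained, and the same-domain filter is a guess at the kind of filtering a crawler might apply. It does not reproduce the behaviour of get_urls. Keeping the function at module level means it can be looked up by name, which is what pickle requires for functions.

```python
from concurrent.futures import ThreadPoolExecutor
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class _LinkCollector(HTMLParser):
    """Collect href values from <a> tags."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def parse_page(html, url, domain, level):
    """Hypothetical stand-in: return the page text and same-domain links one level deeper."""
    collector = _LinkCollector()
    collector.feed(html)
    absolute = [urljoin(url, href) for href in collector.hrefs]  # resolve relative links
    same_domain = [u for u in absolute if urlparse(u).netloc == domain]
    return html, [(u, level + 1) for u in same_domain]


# Hypothetical inputs: (html, url, domain, level)
pages = [
    ('<a href="/about">About</a> <a href="https://other.example/">Out</a>',
     "https://a.example/", "a.example", 0),
    ('<a href="jobs">Jobs</a>', "https://a.example/team/", "a.example", 1),
]

with ThreadPoolExecutor(max_workers=2) as pool:
    results = list(pool.map(lambda args: parse_page(*args), pages))
```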

crawl_base_urls(urls)

Create the initial asynchronous task to fetch all URLs.

Parameters: urls – List of all level-0 URLs to visit
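
The general pattern of fetching a list of base URLs asynchronously under a concurrency limit can be sketched as follows. This is an assumption-laden illustration, not the body of crawl_base_urls: fetch and crawl_level0 are hypothetical names, the fetch is faked with asyncio.sleep so the example runs without network access, and the default of 60 simply mirrors concurrency_base_urls in the class signature.

```python
import asyncio


async def fetch(url):
    """Hypothetical fetch: a real crawler would issue an HTTP request here."""
    await asyncio.sleep(0)  # yield to the event loop, as network I/O would
    return url, 200


async def crawl_level0(urls, concurrency=60):
    """Fetch every level-0 URL with at most `concurrency` requests in flight."""
    limit = asyncio.Semaphore(concurrency)

    async def bounded(url):
        async with limit:
            return await fetch(url)

    # gather returns results in the same order as the input URLs
    return await asyncio.gather(*(bounded(u) for u in urls))


results = asyncio.run(
    crawl_level0(["https://a.example/", "https://b.example/"], concurrency=1)
)
```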

crawl_complement_base_urls(complement_date): Re-crawl failed level-0 URLs from a specific crawl session_date.
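
One way to picture the selection step is a filter over overview rows. The sketch below is purely illustrative: the row layout (url, level, status, session_date), the rule that any status other than 200 counts as failed, and the failed_base_urls helper are all assumptions, since the overview schema is not documented here.

```python
from datetime import date

# Hypothetical overview rows; the real overview schema is an assumption.
overview = [
    {"url": "https://a.example/", "level": 0, "status": 200, "session_date": date(2024, 1, 5)},
    {"url": "https://b.example/", "level": 0, "status": 500, "session_date": date(2024, 1, 5)},
    {"url": "https://b.example/x", "level": 1, "status": 404, "session_date": date(2024, 1, 5)},
    {"url": "https://c.example/", "level": 0, "status": None, "session_date": date(2024, 2, 9)},
]


def failed_base_urls(rows, complement_date):
    """Return level-0 URLs from the given session whose fetch did not return HTTP 200."""
    return [
        row["url"]
        for row in rows
        if row["level"] == 0
        and row["session_date"] == complement_date
        and row["status"] != 200
    ]


retry = failed_base_urls(overview, date(2024, 1, 5))
```

The resulting list could then be handed to a level-0 crawl such as crawl_base_urls.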

Module contents