websweep.crawler package
Submodules
websweep.crawler.crawler module
This module provides the Crawler model-controller.
- class websweep.crawler.crawler.Crawler(target_folder_path, target_temp_folder_path=None, save_html=True, max_level=3, classification_file_path=None, allow_extensions=None, block_extensions=None, verify_ssl=False, concurrency_base_urls=60, threads_bs4=10, threads_download=120, use_database=True, sock_connect=180, extract=False, headers=None, file_extractor=None, max_pages_per_domain=50, min_days_between_crawls=30, chunk_size=1000000, overview_backend=None, concurrency_pages=None, page_batch_size=500, base_url_batch_size=1000, max_concurrency_per_domain=1, overview_create_indexes=None, duckdb_deduplicate=False, **kwargs)
Bases: object
Crawl websites to a bounded depth and store a crawl overview plus the raw pages.
- Parameters:
overview_backend (str | None)
concurrency_pages (int | None)
page_batch_size (int)
base_url_batch_size (int)
max_concurrency_per_domain (int)
overview_create_indexes (bool | None)
duckdb_deduplicate (bool)
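The constructor is easiest to read as a set of keyword overrides. The following is a minimal sketch, assuming websweep is installed and that the keyword names behave as the signature above suggests. The output folder path and the helper name build_crawler are placeholders introduced for illustration, not part of the package.

```python
# Keyword names come from the documented signature; the values are
# illustrative overrides, and the folder path is a hypothetical placeholder.
crawler_settings = {
    "target_folder_path": "./crawl_output",  # hypothetical output folder
    "max_level": 2,                     # crawl depth bound (documented default: 3)
    "max_pages_per_domain": 25,         # documented default: 50
    "max_concurrency_per_domain": 1,    # documented default: 1
    "page_batch_size": 500,             # documented default: 500
    "base_url_batch_size": 1000,        # documented default: 1000
    "duckdb_deduplicate": False,        # documented default: False
}


def build_crawler(settings):
    """Hypothetical helper: import lazily and pass the settings as keywords."""
    # The import lives inside the function so this sketch can be read and
    # loaded even where websweep is not installed.
    from websweep.crawler.crawler import Crawler

    return Crawler(**settings)
```

With websweep available, `crawler = build_crawler(crawler_settings)` would create the object. What the constructor does at creation time (for example, whether it touches the target folder or a database) is not stated here, so check the source before running it against real data.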
- get_urls(r, url, domain, level, identifier)
Parse the page code and return its content and URLs. It is defined here so that it can be pickled and processed in a thread pool.
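The idea of a picklable, top-level parsing function that a pool can run can be sketched with the standard library alone. This is not the websweep implementation: extract_links, _LinkCollector, the same-domain filter, and the sample pages are all assumptions made for illustration, and the real get_urls takes different arguments (r, url, domain, level, identifier).

```python
from concurrent.futures import ThreadPoolExecutor
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class _LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_links(html, url, domain, level):
    """Hypothetical stand-in for get_urls.

    Returns the page content, the absolute same-domain links found in it,
    and the depth level those links would be crawled at.
    """
    collector = _LinkCollector()
    collector.feed(html)
    links = []
    for href in collector.hrefs:
        absolute = urljoin(url, href)  # resolve relative links against the page URL
        if urlparse(absolute).netloc == domain and absolute not in links:
            links.append(absolute)
    return html, links, level + 1


# Made-up sample pages: (html, url, domain, level)
pages = [
    ("<a href='/about'>About</a> <a href='https://other.example/x'>X</a>",
     "https://example.com/", "example.com", 0),
    ("<a href='team.html'>Team</a>",
     "https://example.com/about/", "example.com", 1),
]

# A module-level function (not a lambda or nested function) can be pickled,
# so the same call also works with a process pool.
with ThreadPoolExecutor(max_workers=2) as pool:
    results = list(pool.map(extract_links, *zip(*pages)))
```

Each entry in `results` is a `(content, links, next_level)` tuple. A crawler bounded by max_level would stop following links once `next_level` exceeds that bound.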