WebSweep Example for Researchers
This notebook is split into two parts:
Part A: the smallest default workflow (crawler -> extractor -> consolidator)
Part B: extended usage patterns (one-pass mode, custom FileExtractor, and parameter defaults)
What each pipeline step does:
Crawler: starts from base URLs (domains), crawls pages by following only within-domain links, applies exclusion rules, and stops at depth max_level (default 3).
Extractor: reads crawled pages and extracts page-level fields such as cleaned text (text), metadata (meta_*), and others (e.g., zipcode, address).
Consolidator: merges page-level records into one domain-level record with concatenated text and aggregated data (e.g., zipcode counts: the most frequent can be treated as the main postcode, the others as additional postcodes).
Input:
  URL list
      |
      v
[Crawler]
  In:  URL list + crawl rules + max_level (default 3)
  Out: crawled_data/*.zip + overview_urls.{duckdb|db|tsv}
      |
      v
[Extractor]
  In:  overview file + crawled_data/*.zip
  Out: extracted_data/*.ndjson (text, metadata, postcode/address, ...)
      |
      v
[Consolidator]
  In:  extracted_data/*.ndjson
  Out: consolidated_data/*.ndjson (domain-level, postcode counts, concatenated text)
This notebook demonstrates the library API. For the CLI workflow (instance setup, recurring runs, and command options), see the README or the User Guide in the docs.
[12]:
from pathlib import Path
import json
urls = [
"https://www.dggrootverbruik.nl/",
"https://www.gosliga.nl/",
"https://www.heeren2.nl/",
]
# Set up the paths
run_dir = Path("./data")
output_dir = run_dir / "research_output"
output_dir.mkdir(parents=True, exist_ok=True)
print("urls:", urls)
print("output_dir:", output_dir)
urls: ['https://www.dggrootverbruik.nl/', 'https://www.gosliga.nl/', 'https://www.heeren2.nl/']
output_dir: data/research_output
Part A. Default Workflow (No Extra Parameters)
Step 1. Crawl (defaults)
The crawler starts from the base URL list, downloads pages, follows links that stay within the same domain, skips excluded URLs/files, and continues up to depth max_level=3 by default.
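The crawl rule described here (stay within the domain, stop at a depth limit) can be illustrated with a small breadth-first walk. This is a minimal sketch and not WebSweep's implementation: it runs on a made-up in-memory link graph instead of the network, the helper names `registered_host` and `crawl_order` are hypothetical, and it assumes that pages at level `max_level` are still visited but not expanded.

```python
from collections import deque
from urllib.parse import urlparse


def registered_host(url: str) -> str:
    """Return the host without a leading 'www.' (a simplification, not a full domain parser)."""
    host = urlparse(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


def crawl_order(start_url: str, links: dict, max_level: int = 3) -> list:
    """Breadth-first walk over a hypothetical in-memory link graph.

    `links` maps a URL to the URLs found on that page; no network access happens.
    Returns (url, level) pairs in visiting order.
    """
    domain = registered_host(start_url)
    seen = {start_url}
    queue = deque([(start_url, 0)])
    visited = []
    while queue:
        url, level = queue.popleft()
        visited.append((url, level))
        if level >= max_level:
            continue  # assumed rule: pages at the depth limit are not expanded
        for target in links.get(url, []):
            if target in seen or registered_host(target) != domain:
                continue  # skip repeats and links that leave the domain
            seen.add(target)
            queue.append((target, level + 1))
    return visited


# Toy link graph with made-up URLs.
toy_links = {
    "https://www.example.org/": [
        "https://www.example.org/about",
        "https://other.example.com/",
    ],
    "https://www.example.org/about": ["https://www.example.org/team"],
    "https://www.example.org/team": ["https://www.example.org/team/alice"],
}
toy_result = crawl_order("https://www.example.org/", toy_links, max_level=2)
print(toy_result)
```

In this toy run the external link is never queued, and the page below `/team` is not reached because `/team` already sits at the depth limit.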
Input
urls
Output
crawled_data/*.zip
overview_urls.{duckdb|db|tsv}
[13]:
from websweep.crawler.crawler import Crawler
crawler = Crawler(target_folder_path=output_dir)
crawler.crawl_base_urls(urls)
# Note: the crawler detects existing data, so running this cell a second time skips crawling.
100%|█████████████████████████████████████████████| 3/3 [00:11<00:00, 3.74s/it]
Crawled 19 pages from 3 urls to level 3 in 11.2 seconds.
[14]:
# Print what it has downloaded
from pathlib import Path
print('Crawled data files:')
for p in sorted((output_dir / 'crawled_data').rglob('*')):
    if p.is_file():
        print(p.relative_to(output_dir))
Crawled data files:
crawled_data/dggrootverbruik.nl.zip
crawled_data/gosliga.nl.zip
crawled_data/heeren2.nl.zip
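To peek inside one of these archives without unpacking it, the standard library `zipfile` module is enough. The sketch below is self-contained: it builds a throwaway archive with made-up member names, because the internal layout of WebSweep's zip files is not shown in this notebook. The helper name `list_zip_members` is hypothetical.

```python
import tempfile
import zipfile
from pathlib import Path


def list_zip_members(zip_path: Path, limit: int = 5) -> list:
    """Return up to `limit` member names stored in a zip archive."""
    with zipfile.ZipFile(zip_path) as archive:
        return archive.namelist()[:limit]


# Demo on a temporary archive; the member names are invented for illustration.
with tempfile.TemporaryDirectory() as tmp:
    demo_zip = Path(tmp) / "example.org.zip"
    with zipfile.ZipFile(demo_zip, "w") as archive:
        archive.writestr("example.org/index.html", "<html></html>")
        archive.writestr("example.org/contact.html", "<html></html>")
    demo_members = list_zip_members(demo_zip)
    print(demo_members)
```

With the paths used in this notebook, the same helper could be pointed at `output_dir / "crawled_data" / "gosliga.nl.zip"`.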
Step 2. Extract (defaults)
The extractor reads crawled pages and writes one record per page with structured fields, including cleaned text (text), metadata (meta_*), and location fields (zipcode, address).
Input
overview_urls.*
crawled_data/*.zip
Output
extracted_data/*.ndjson
[15]:
from websweep.extractor.extractor import Extractor
extractor = Extractor(target_folder_path=output_dir)
extractor.extract_urls()
100%|███████████████████████████████████████████| 19/19 [00:00<00:00, 94.72it/s]
Extracted data from 19 pages (0 errors) in 0.2 seconds.
[19]:
# Print the first 200 characters of the first 10 extracted page records
extracted_files = sorted((output_dir / "extracted_data").glob("*.ndjson"))
print("extracted files:", [f.name for f in extracted_files])
test_extracted = extracted_files[0]
with test_extracted.open("r", encoding="utf-8", errors="ignore") as f:
    for i, line in enumerate(f):
        if i >= 10:
            break
        print(line.rstrip()[:200])
extracted files: ['extracted_data_2026-02-23_0-1000000.ndjson']
{"domain":"gosliga.nl","identifier":"gosliga.nl","level":0,"website":"https://www.gosliga.nl/","date":"2026-02-23","path":"data/research_output/crawled_data/gosliga.nl/gosliga.nl/2026-02-23/www.goslig
{"domain":"gosliga.nl","identifier":"gosliga.nl","level":1,"website":"https://www.gosliga.nl/home","date":"2026-02-23","path":"data/research_output/crawled_data/gosliga.nl/gosliga.nl/2026-02-23/home",
{"domain":"gosliga.nl","identifier":"gosliga.nl","level":1,"website":"https://www.gosliga.nl/transport/","date":"2026-02-23","path":"data/research_output/crawled_data/gosliga.nl/gosliga.nl/2026-02-23/
{"domain":"gosliga.nl","identifier":"gosliga.nl","level":1,"website":"https://www.gosliga.nl/contactformulier/","date":"2026-02-23","path":"data/research_output/crawled_data/gosliga.nl/gosliga.nl/2026
{"domain":"gosliga.nl","identifier":"gosliga.nl","level":1,"website":"https://www.gosliga.nl/over-ons/","date":"2026-02-23","path":"data/research_output/crawled_data/gosliga.nl/gosliga.nl/2026-02-23/o
{"domain":"dggrootverbruik.nl","identifier":"dggrootverbruik.nl","level":1,"website":"https://www.dggrootverbruik.nl","date":"2026-02-23","path":"data/research_output/crawled_data/dggrootverbruik.nl/d
{"domain":"gosliga.nl","identifier":"gosliga.nl","level":1,"website":"https://www.gosliga.nl/diensten/opslag-of-loodsen","date":"2026-02-23","path":"data/research_output/crawled_data/gosliga.nl/goslig
{"domain":"dggrootverbruik.nl","identifier":"dggrootverbruik.nl","level":0,"website":"https://www.dggrootverbruik.nl/","date":"2026-02-23","path":"data/research_output/crawled_data/dggrootverbruik.nl/
{"domain":"gosliga.nl","identifier":"gosliga.nl","level":1,"website":"https://www.gosliga.nl/home/","date":"2026-02-23","path":"data/research_output/crawled_data/gosliga.nl/gosliga.nl/2026-02-23/home"
{"domain":"gosliga.nl","identifier":"gosliga.nl","level":1,"website":"https://www.gosliga.nl/diensten/silovervoer","date":"2026-02-23","path":"data/research_output/crawled_data/gosliga.nl/gosliga.nl/2
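Because each line is one JSON object, the extracted file can be summarized with the standard library alone. The sketch below counts page-level records per domain. It relies only on the `domain` field visible in the output above; the helper name `pages_per_domain` and the sample lines are illustrative.

```python
import json
from collections import Counter


def pages_per_domain(ndjson_lines) -> Counter:
    """Count page-level records per domain, skipping blank lines."""
    counts = Counter()
    for line in ndjson_lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        counts[record.get("domain", "<missing>")] += 1
    return counts


# Shortened sample records shaped like the output above.
sample_lines = [
    '{"domain": "gosliga.nl", "level": 0, "website": "https://www.gosliga.nl/"}',
    '{"domain": "gosliga.nl", "level": 1, "website": "https://www.gosliga.nl/home"}',
    '{"domain": "dggrootverbruik.nl", "level": 0, "website": "https://www.dggrootverbruik.nl/"}',
]
sample_counts = pages_per_domain(sample_lines)
print(sample_counts.most_common())
```

An open file handle can be passed in place of the list, since iterating over a text file yields one line at a time.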
Step 3. Consolidate (defaults)
The consolidator groups page-level records into one record per domain. It keeps aggregated postcode information (e.g., zipcode counts: the most frequent as the main postcode, the others as additional postcodes) and concatenates text from pages of the same domain.
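The grouping idea can be sketched in a few lines of plain Python. This is an illustration of the concept, not the Consolidator's code: it assumes each page record carries a `zipcode` list of strings, and the output keys `zipcode_counts`, `main_zipcode`, and `additional_zipcodes` are hypothetical names chosen for readability.

```python
from collections import Counter, defaultdict


def consolidate_pages(page_records: list) -> dict:
    """Merge page-level records into one summary per domain."""
    grouped = defaultdict(list)
    for record in page_records:
        grouped[record["domain"]].append(record)

    result = {}
    for domain, pages in grouped.items():
        # Count every postcode occurrence across all pages of the domain.
        zip_counts = Counter(z for page in pages for z in page.get("zipcode", []))
        ranked = [z for z, _ in zip_counts.most_common()]
        result[domain] = {
            "domain": domain,
            "text": " ".join(page.get("text", "") for page in pages).strip(),
            "zipcode_counts": dict(zip_counts),
            "main_zipcode": ranked[0] if ranked else None,
            "additional_zipcodes": ranked[1:],
        }
    return result


# Made-up page records for a single domain.
sample_pages = [
    {"domain": "example.org", "text": "Home page.", "zipcode": ["1234AB"]},
    {"domain": "example.org", "text": "Contact us.", "zipcode": ["1234AB", "5678CD"]},
]
sample_summary = consolidate_pages(sample_pages)
print(sample_summary["example.org"])
```

Treating the most frequent postcode as the main one is a heuristic: a postcode repeated in every page footer will usually win over one mentioned once on a contact page.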
Input
extracted_data/*.ndjson
Output
consolidated_data/consolidated.ndjson
[ ]:
from websweep.consolidator.consolidator import Consolidator
# Consolidator automatically reads the latest extracted_data/*.ndjson
# and writes to consolidated_data/consolidated.ndjson inside output_dir.
consolidator = Consolidator(target_folder_path=output_dir)
consolidator.consolidate()
consolidated_path = output_dir / "consolidated_data" / "consolidated.ndjson"
print("consolidated file:", consolidated_path)
consolidated_text = consolidated_path.read_text(encoding='utf-8', errors='ignore')
# First 200 characters of the consolidated output
print(consolidated_text[:200])
[21]:
if consolidated_path.exists():
    with consolidated_path.open("r", encoding="utf-8") as f:
        first_domain_record = json.loads(f.readline())
    print("consolidated keys:", sorted(first_domain_record.keys()))
    print("example domain:", first_domain_record.get("domain"))
consolidated keys: ['address', 'btw', 'domain', 'email', 'fax', 'identifier', 'kvk', 'phone', 'text', 'zipcode']
example domain: dggrootverbruik.nl
Part B. Extended Usage
This section shows optional, more advanced patterns to use once you understand the default workflow.
B1. Crawl + Extract in One Pass (save disk) + Extension Filters
Use one-pass mode when you want to skip saving raw HTML zip files. You can also pass allow_extensions / block_extensions to Crawler to control which linked file types are followed.
[ ]:
from websweep.crawler.crawler import Crawler
one_pass_dir = run_dir / "research_output_one_pass"
one_pass_dir.mkdir(parents=True, exist_ok=True)
# Optional extension controls (comma-separated string or list both work).
# Keep PDFs and PNGs discoverable, while skipping common binary/archive types.
one_pass_crawler = Crawler(
    target_folder_path=one_pass_dir,
    extract=True,
    save_html=False,
    allow_extensions="pdf,png",
    block_extensions="zip,gz,rar,jpg,jpeg",
)
one_pass_crawler.crawl_base_urls(urls)
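To make the extension controls concrete, here is a minimal sketch of how a comma-separated string or a list could be normalized and applied to a URL. The precedence rule (allowed extensions win, blocked ones are skipped, everything else passes) is an assumption made for this example and may differ from how the Crawler combines the two options. All helper names are hypothetical.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def normalize_extensions(value) -> set:
    """Accept 'pdf,png' or ['pdf', '.PNG'] and return {'pdf', 'png'}."""
    if value is None:
        return set()
    items = value.split(",") if isinstance(value, str) else value
    return {item.strip().lower().lstrip(".") for item in items if item.strip()}


def url_extension(url: str) -> str:
    """Return the lowercase file extension of a URL path, or '' if there is none."""
    return PurePosixPath(urlparse(url).path).suffix.lower().lstrip(".")


def should_follow(url: str, allow=None, block=None) -> bool:
    """Assumed rule: allowed extensions win, blocked ones are skipped, the rest pass."""
    ext = url_extension(url)
    if ext in normalize_extensions(allow):
        return True
    return ext not in normalize_extensions(block)


allow = "pdf,png"
block = "zip,gz,rar,jpg,jpeg"
for candidate in [
    "https://example.org/report.pdf",
    "https://example.org/archive.zip",
    "https://example.org/about/",
]:
    print(candidate, should_follow(candidate, allow, block))
```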
B2. Custom FileExtractor
By default, the extractor keeps a conservative set of fields. You can add your own fields by subclassing FileExtractor.
[27]:
import re2 as re
from websweep.extractor.extractor import Extractor, FileExtractor
# Repository add-on example: addons/firmbackbone_extractor.py
class ResearchFileExtractor(FileExtractor):
    def _extract_fax(self) -> list:
        # Match a fax label followed by a phone-like number within a few characters.
        pattern = re.compile(
            r"(?is)\b(?:faxnumber|fax|f)\b[^0-9\+]{0,12}"
            r"([\+]?[0-9][0-9\-\s\(\)]{7,20})\b"
        )
        return sorted({m.strip() for m in re.findall(pattern, str(self.soup))})
custom_extractor = Extractor(
    target_folder_path=output_dir,
    file_extractor=ResearchFileExtractor,
)
custom_extractor.extract_urls()  # run extraction with the custom FileExtractor
100%|██████████████████████████████████████████| 19/19 [00:00<00:00, 147.48it/s]
Extracted data from 19 pages (0 errors) in 0.2 seconds.
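Before running a custom pattern over many pages, it can help to try it on a small string first. The sketch below reuses the fax pattern from the cell above but swaps in the standard library `re` module for `re2`, on the assumption that both behave the same for this pattern. The sample HTML and the helper name `find_fax_numbers` are made up.

```python
import re

# Same pattern as in the custom extractor, compiled with the stdlib `re` module.
FAX_PATTERN = re.compile(
    r"(?is)\b(?:faxnumber|fax|f)\b[^0-9\+]{0,12}"
    r"([\+]?[0-9][0-9\-\s\(\)]{7,20})\b"
)


def find_fax_numbers(text: str) -> list:
    """Return the sorted, de-duplicated fax numbers found in `text`."""
    return sorted({match.strip() for match in FAX_PATTERN.findall(text)})


sample_html = "<p>Tel: 020-7654321 | Fax: +31 20 123 4567</p>"
fax_numbers = find_fax_numbers(sample_html)
print(fax_numbers)
```

The phone number after "Tel:" is not returned, because the pattern requires one of the fax labels directly before the digits. Note that the single-letter label `f` is permissive and can produce false positives on text such as "F 2024 1234567".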
B3. Show Default Function Parameters
This is useful when you want to understand defaults such as max_level, threads_download, and extractor/consolidator defaults.
[23]:
import inspect
from websweep.crawler.crawler import Crawler
from websweep.extractor.extractor import Extractor
from websweep.consolidator.consolidator import Consolidator
print("Crawler.__init__ defaults:")
print(inspect.signature(Crawler.__init__))
print()
print("Extractor.__init__ defaults:")
print(inspect.signature(Extractor.__init__))
print()
print("Consolidator.__init__ defaults:")
print(inspect.signature(Consolidator.__init__))
Crawler.__init__ defaults:
(self, target_folder_path, target_temp_folder_path=None, save_html=True, max_level=3, classification_file_path=None, allow_extensions=None, block_extensions=None, verify_ssl=False, concurrency_base_urls=60, threads_bs4=10, threads_download=120, use_database=True, sock_connect=180, extract=False, headers=None, file_extractor=None, max_pages_per_domain=50, min_days_between_crawls=30, chunk_size=1000000, overview_backend: Optional[str] = None, concurrency_pages: Optional[int] = None, page_batch_size: int = 500, base_url_batch_size: int = 1000, **kwargs)
Extractor.__init__ defaults:
(self, target_folder_path, use_database=True, extractor_delete_files=False, start_date='0000-01-01', end_date='9999-01-01', file_extractor: websweep.extractor.extractor.FileExtractor = None, overview_backend: Optional[str] = None, workers: Optional[int] = None, imap_chunksize: int = 50, maxtasksperchild: int = 1000, extract_timeout_seconds: int = 10, **kwargs)
Consolidator.__init__ defaults:
(self, input_file: str, chunk_size: int = 10000)
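The printed signatures are dense, so it can be easier to pull the defaults into a dictionary and look up individual values. The sketch below uses only `inspect` from the standard library. `DemoCrawler` is a hypothetical stand-in so the example runs without WebSweep installed; the same helper should work when given `Crawler.__init__`.

```python
import inspect


def default_parameters(func) -> dict:
    """Map each parameter that has a default to that default value."""
    return {
        name: param.default
        for name, param in inspect.signature(func).parameters.items()
        if param.default is not inspect.Parameter.empty
    }


class DemoCrawler:
    """Hypothetical stand-in class, not part of WebSweep."""

    def __init__(self, target_folder_path, save_html=True, max_level=3, **kwargs):
        self.target_folder_path = target_folder_path


demo_defaults = default_parameters(DemoCrawler.__init__)
print(demo_defaults)
```

Required parameters such as `target_folder_path` and catch-all parameters such as `**kwargs` have no default, so they are left out of the result.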
Optional add-on module
Repository add-on example:
addons/firmbackbone_extractor.py