WebSweep Example for Researchers

This notebook is split into two parts:

Part A: the smallest default workflow (crawler -> extractor -> consolidator)
Part B: extended usage patterns (one-pass mode, custom FileExtractor, and parameter defaults)

What each pipeline step does:

Crawler: starts from the base URLs (domains), crawls pages by following only within-domain links, applies exclusion rules, and stops at depth max_level (default 3).
Extractor: reads the crawled pages and extracts page-level fields such as cleaned text (text), metadata (meta_*), and other fields (e.g., zipcode, address).
Consolidator: merges page-level records into one domain-level record with concatenated text and aggregated data (e.g., zipcode counts: the most frequent can be treated as the main postcode, the others as additional postcodes).

Input:
  URL list
      |
      v
  [Crawler]
    In:  URL list + crawl rules + max_level (default 3)
    Out: crawled_data/*.zip + overview_urls.{duckdb|db|tsv}
      |
      v
  [Extractor]
    In:  overview file + crawled_data/*.zip
    Out: extracted_data/*.ndjson (text, metadata, postcode/address, ...)
      |
      v
  [Consolidator]
    In:  extracted_data/*.ndjson
    Out: consolidated_data/*.ndjson (domain-level, postcode counts, concatenated text)

This notebook demonstrates the library API. For the CLI workflow (instance setup, recurring runs, and command options), see the README or the User Guide in the docs.

[12]:

from pathlib import Path
import json

urls = [
    "https://www.dggrootverbruik.nl/",
    "https://www.gosliga.nl/",
    "https://www.heeren2.nl/",
]

# Set up the paths
run_dir = Path("./data")
output_dir = run_dir / "research_output"
output_dir.mkdir(parents=True, exist_ok=True)

print("urls:", urls)
print("output_dir:", output_dir)

urls: ['https://www.dggrootverbruik.nl/', 'https://www.gosliga.nl/', 'https://www.heeren2.nl/']
output_dir: data/research_output

Part A. Default Workflow (No Extra Parameters)

Step 1. Crawl (defaults)

The crawler starts from the base URL list, downloads pages, follows links that stay within the same domain, skips excluded URLs/files, and continues up to depth max_level=3 by default.

Input

urls

Output

crawled_data/*.zip
overview_urls.{duckdb|db|tsv}
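
The crawl policy described above (stay within the domain, stop at max_level) can be illustrated with a minimal breadth-first sketch over an in-memory link graph. This is not WebSweep's implementation: LINKS, host, and crawl_levels are hypothetical names introduced here, and no network access takes place.

```python
from collections import deque
from urllib.parse import urlparse

# Hypothetical link graph standing in for real pages (no network access).
LINKS = {
    "https://example.nl/": ["https://example.nl/a", "https://other.com/x"],
    "https://example.nl/a": ["https://example.nl/b"],
    "https://example.nl/b": ["https://example.nl/c"],
    "https://example.nl/c": ["https://example.nl/d"],
    "https://example.nl/d": [],
}


def host(url):
    """Return the lowercase host of a URL without a leading 'www.'."""
    netloc = urlparse(url).netloc.lower()
    return netloc[4:] if netloc.startswith("www.") else netloc


def crawl_levels(base_url, max_level=3):
    """Breadth-first walk that records the depth of each within-domain page."""
    levels = {base_url: 0}  # the base URL is level 0
    queue = deque([base_url])
    while queue:
        url = queue.popleft()
        if levels[url] >= max_level:
            continue  # pages at max_level are kept but not expanded further
        for link in LINKS.get(url, []):
            # Skip pages already seen and links that leave the domain.
            if link not in levels and host(link) == host(base_url):
                levels[link] = levels[url] + 1
                queue.append(link)
    return levels


print(crawl_levels("https://example.nl/"))
```

In this sketch the page at /d is never reached because /c already sits at level 3, and the link to other.com is dropped because it leaves the domain.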

[13]:

from websweep.crawler.crawler import Crawler

crawler = Crawler(target_folder_path=output_dir)
crawler.crawl_base_urls(urls)

# Note: the crawler detects existing data. If you run this cell twice, the second run skips crawling.

100%|█████████████████████████████████████████████| 3/3 [00:11<00:00,  3.74s/it]

Crawled 19 pages from 3 urls to level 3 in 11.2 seconds.

[14]:

# List the files the crawler downloaded
from pathlib import Path
print('Crawled data files:')
for p in sorted((output_dir / 'crawled_data').rglob('*')):
    if p.is_file():
        print(p.relative_to(output_dir))

Crawled data files:
crawled_data/dggrootverbruik.nl.zip
crawled_data/gosliga.nl.zip
crawled_data/heeren2.nl.zip

Step 2. Extract (defaults)

The extractor reads crawled pages and writes one record per page with structured fields, including cleaned text (text), metadata (meta_*), and location fields (zipcode, address).

Input

overview_urls.*
crawled_data/*.zip

Output

extracted_data/*.ndjson

[15]:

from websweep.extractor.extractor import Extractor

extractor = Extractor(target_folder_path=output_dir)
extractor.extract_urls()

100%|███████████████████████████████████████████| 19/19 [00:00<00:00, 94.72it/s]

Extracted data from 19 pages (0 errors) in 0.2 seconds.

[19]:

# Print the first 200 characters of the first 10 extracted page records
extracted_files = sorted((output_dir / "extracted_data").glob("*.ndjson"))
print("extracted files:", [f.name for f in extracted_files])

test_extracted = extracted_files[0]
with test_extracted.open("r", encoding="utf-8", errors="ignore") as f:
    for i, line in enumerate(f):
        if i >= 10:
            break
        print(line.rstrip()[:200])

extracted files: ['extracted_data_2026-02-23_0-1000000.ndjson']
{"domain":"gosliga.nl","identifier":"gosliga.nl","level":0,"website":"https://www.gosliga.nl/","date":"2026-02-23","path":"data/research_output/crawled_data/gosliga.nl/gosliga.nl/2026-02-23/www.goslig
{"domain":"gosliga.nl","identifier":"gosliga.nl","level":1,"website":"https://www.gosliga.nl/home","date":"2026-02-23","path":"data/research_output/crawled_data/gosliga.nl/gosliga.nl/2026-02-23/home",
{"domain":"gosliga.nl","identifier":"gosliga.nl","level":1,"website":"https://www.gosliga.nl/transport/","date":"2026-02-23","path":"data/research_output/crawled_data/gosliga.nl/gosliga.nl/2026-02-23/
{"domain":"gosliga.nl","identifier":"gosliga.nl","level":1,"website":"https://www.gosliga.nl/contactformulier/","date":"2026-02-23","path":"data/research_output/crawled_data/gosliga.nl/gosliga.nl/2026
{"domain":"gosliga.nl","identifier":"gosliga.nl","level":1,"website":"https://www.gosliga.nl/over-ons/","date":"2026-02-23","path":"data/research_output/crawled_data/gosliga.nl/gosliga.nl/2026-02-23/o
{"domain":"dggrootverbruik.nl","identifier":"dggrootverbruik.nl","level":1,"website":"https://www.dggrootverbruik.nl","date":"2026-02-23","path":"data/research_output/crawled_data/dggrootverbruik.nl/d
{"domain":"gosliga.nl","identifier":"gosliga.nl","level":1,"website":"https://www.gosliga.nl/diensten/opslag-of-loodsen","date":"2026-02-23","path":"data/research_output/crawled_data/gosliga.nl/goslig
{"domain":"dggrootverbruik.nl","identifier":"dggrootverbruik.nl","level":0,"website":"https://www.dggrootverbruik.nl/","date":"2026-02-23","path":"data/research_output/crawled_data/dggrootverbruik.nl/
{"domain":"gosliga.nl","identifier":"gosliga.nl","level":1,"website":"https://www.gosliga.nl/home/","date":"2026-02-23","path":"data/research_output/crawled_data/gosliga.nl/gosliga.nl/2026-02-23/home"
{"domain":"gosliga.nl","identifier":"gosliga.nl","level":1,"website":"https://www.gosliga.nl/diensten/silovervoer","date":"2026-02-23","path":"data/research_output/crawled_data/gosliga.nl/gosliga.nl/2
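
Because each NDJSON line is a standalone JSON object, the extracted records can be parsed one line at a time. The sketch below counts pages per domain; the sample lines are abbreviated stand-ins that reuse field names visible in the output above (domain, level, website), and pages_per_domain is a hypothetical helper, not part of WebSweep.

```python
import json
from collections import Counter

# Abbreviated sample lines; real records carry many more fields.
sample_lines = [
    '{"domain":"gosliga.nl","level":0,"website":"https://www.gosliga.nl/"}',
    '{"domain":"gosliga.nl","level":1,"website":"https://www.gosliga.nl/home"}',
    '{"domain":"dggrootverbruik.nl","level":0,"website":"https://www.dggrootverbruik.nl/"}',
]


def pages_per_domain(lines):
    """Count page-level records per domain from an iterable of NDJSON lines."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        counts[record["domain"]] += 1
    return counts


print(pages_per_domain(sample_lines))
```

An open file handle can be passed in place of the list, since iterating over a text file also yields one line at a time.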

Step 3. Consolidate (defaults)

The consolidator groups page-level records into one record per domain. It keeps aggregated postcode information (e.g., zipcode counts: the most frequent as the main postcode, the others as additional postcodes) and concatenates the text from all pages of the same domain.

Input

extracted_data/*.ndjson

Output

consolidated_data/consolidated.ndjson
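
The grouping idea can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the Consolidator's actual code: the sample records, the consolidate helper, and the main_zipcode / additional_zipcodes field names are all hypothetical, and the real layout of the zipcode field in the consolidated output may differ.

```python
from collections import Counter, defaultdict

# Hypothetical page-level records; real records carry more fields.
pages = [
    {"domain": "example.nl", "text": "Home", "zipcode": ["1234AB"]},
    {"domain": "example.nl", "text": "Contact", "zipcode": ["1234AB", "5678CD"]},
    {"domain": "other.nl", "text": "Welcome", "zipcode": []},
]


def consolidate(records):
    """Merge page-level records into one record per domain."""
    texts = defaultdict(list)
    zip_counts = defaultdict(Counter)
    for rec in records:
        texts[rec["domain"]].append(rec.get("text", ""))
        zip_counts[rec["domain"]].update(rec.get("zipcode", []))

    result = []
    for domain, parts in texts.items():
        # Rank postcodes by how often they occur across the domain's pages.
        ranked = [code for code, _ in zip_counts[domain].most_common()]
        result.append({
            "domain": domain,
            "text": " ".join(parts),
            "zipcode": dict(zip_counts[domain]),
            "main_zipcode": ranked[0] if ranked else None,  # hypothetical field
            "additional_zipcodes": ranked[1:],               # hypothetical field
        })
    return result


for record in consolidate(pages):
    print(record)
```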

[ ]:

from websweep.consolidator.consolidator import Consolidator

# Consolidator automatically reads the latest extracted_data/*.ndjson
# and writes to consolidated_data/consolidated.ndjson inside output_dir.
consolidator = Consolidator(target_folder_path=output_dir)
consolidator.consolidate()

consolidated_path = output_dir / "consolidated_data" / "consolidated.ndjson"
print("consolidated file:", consolidated_path)

consolidated_text = consolidated_path.read_text(encoding='utf-8', errors='ignore')
# First 200 characters of the consolidated output
print(consolidated_text[:200])

[21]:

if consolidated_path.exists():
    with consolidated_path.open("r", encoding="utf-8") as f:
        first_domain_record = json.loads(f.readline())
    print("consolidated keys:", sorted(first_domain_record.keys()))
    print("example domain:", first_domain_record.get("domain"))

consolidated keys: ['address', 'btw', 'domain', 'email', 'fax', 'identifier', 'kvk', 'phone', 'text', 'zipcode']
example domain: dggrootverbruik.nl

Part B. Extended Usage

This section shows optional, more advanced patterns to try once you understand the default workflow.

B1. Crawl + Extract in One Pass (save disk) + Extension Filters

Use one-pass mode when you want to skip saving raw HTML zip files. You can also pass allow_extensions / block_extensions to Crawler to control which linked file types are followed.
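
To make the idea of extension filters concrete, here is a minimal sketch of how allow and block lists could be combined. It rests on assumptions that are not confirmed by WebSweep: DEFAULT_BLOCKED, normalize_extensions, and should_follow are hypothetical, and the precedence rule (allow removes entries from a default block set, block adds entries) is a guess at reasonable semantics rather than the library's documented behavior.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Hypothetical default block set, for illustration only.
DEFAULT_BLOCKED = {"pdf", "png", "jpg", "zip"}


def normalize_extensions(value):
    """Accept 'pdf,png' or ['pdf', '.PNG'] and return a set like {'pdf', 'png'}."""
    if value is None:
        return set()
    items = value.split(",") if isinstance(value, str) else value
    return {item.strip().lower().lstrip(".") for item in items if item.strip()}


def should_follow(url, allow=None, block=None):
    """Decide whether a link is followed, based on its file extension."""
    ext = PurePosixPath(urlparse(url).path).suffix.lower().lstrip(".")
    if not ext:
        return True  # no file extension: treat the link as a regular page
    # Assumed precedence: allow lifts default blocks, block adds new ones.
    blocked = (DEFAULT_BLOCKED - normalize_extensions(allow)) | normalize_extensions(block)
    return ext not in blocked


print(should_follow("https://example.nl/report.pdf", allow="pdf,png"))
print(should_follow("https://example.nl/archive.zip", block="zip,gz"))
```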

[ ]:

from websweep.crawler.crawler import Crawler

one_pass_dir = run_dir / "research_output_one_pass"
one_pass_dir.mkdir(parents=True, exist_ok=True)

# Optional extension controls (either a comma-separated string or a list works).
# Keep PDFs and PNGs discoverable while skipping common binary/archive types.
one_pass_crawler = Crawler(
    target_folder_path=one_pass_dir,
    extract=True,
    save_html=False,
    allow_extensions="pdf,png",
    block_extensions="zip,gz,rar,jpg,jpeg",
)
one_pass_crawler.crawl_base_urls(urls)

B2. Custom FileExtractor

By default, the extractor keeps a conservative set of fields. You can add your own fields by subclassing FileExtractor and defining additional extraction methods, as the fax-number example below does.

[27]:

import re2 as re
from websweep.extractor.extractor import Extractor, FileExtractor

# Repository add-on example: addons/firmbackbone_extractor.py

class ResearchFileExtractor(FileExtractor):
    def _extract_fax(self) -> list:
        pattern = re.compile(
            r"(?is)\b(?:faxnumber|fax|f)\b[^0-9\+]{0,12}"
            r"([\+]?[0-9][0-9\-\s\(\)]{7,20})\b"
        )
        return sorted({m.strip() for m in re.findall(pattern, str(self.soup))})

custom_extractor = Extractor(
    target_folder_path=output_dir,
    file_extractor=ResearchFileExtractor,
)
custom_extractor.extract_urls()  # runs the custom extraction (comment out to skip)

100%|██████████████████████████████████████████| 19/19 [00:00<00:00, 147.48it/s]

Extracted data from 19 pages (0 errors) in 0.2 seconds.
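
Before plugging a pattern into a FileExtractor subclass, it can help to try it on a short string in isolation. The sketch below reuses the fax pattern from the cell above but compiles it with the standard-library re module instead of re2, so matching behavior may differ slightly from the notebook; find_fax_numbers and the sample text are hypothetical.

```python
import re  # standard library; the notebook itself imports re2

FAX_PATTERN = re.compile(
    r"(?is)\b(?:faxnumber|fax|f)\b[^0-9\+]{0,12}"
    r"([\+]?[0-9][0-9\-\s\(\)]{7,20})\b"
)


def find_fax_numbers(text):
    """Return the sorted, de-duplicated fax numbers found in text."""
    return sorted({match.strip() for match in FAX_PATTERN.findall(text)})


# Made-up contact line: the phone number has no fax keyword, so it is ignored.
sample = "Tel: 058 765 4321 | Fax: +31 58 123 4567 | info@example.nl"
print(find_fax_numbers(sample))
```

Note that the pattern also accepts a standalone "f" as a keyword, so short labels such as "F: 058 ..." match as well; whether that is desirable depends on the pages being crawled.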

B3. Show Default Function Parameters

This is useful when you want to understand defaults such as max_level, threads_download, and extractor/consolidator defaults.

[23]:

import inspect
from websweep.crawler.crawler import Crawler
from websweep.extractor.extractor import Extractor
from websweep.consolidator.consolidator import Consolidator

print("Crawler.__init__ defaults:")
print(inspect.signature(Crawler.__init__))
print()
print("Extractor.__init__ defaults:")
print(inspect.signature(Extractor.__init__))
print()
print("Consolidator.__init__ defaults:")
print(inspect.signature(Consolidator.__init__))

Crawler.__init__ defaults:
(self, target_folder_path, target_temp_folder_path=None, save_html=True, max_level=3, classification_file_path=None, allow_extensions=None, block_extensions=None, verify_ssl=False, concurrency_base_urls=60, threads_bs4=10, threads_download=120, use_database=True, sock_connect=180, extract=False, headers=None, file_extractor=None, max_pages_per_domain=50, min_days_between_crawls=30, chunk_size=1000000, overview_backend: Optional[str] = None, concurrency_pages: Optional[int] = None, page_batch_size: int = 500, base_url_batch_size: int = 1000, **kwargs)

Extractor.__init__ defaults:
(self, target_folder_path, use_database=True, extractor_delete_files=False, start_date='0000-01-01', end_date='9999-01-01', file_extractor: websweep.extractor.extractor.FileExtractor = None, overview_backend: Optional[str] = None, workers: Optional[int] = None, imap_chunksize: int = 50, maxtasksperchild: int = 1000, extract_timeout_seconds: int = 10, **kwargs)

Consolidator.__init__ defaults:
(self, input_file: str, chunk_size: int = 10000)
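
When only the defaults matter, the signature can be reduced to a dictionary of parameter names and default values. The sketch below uses a hypothetical DemoCrawler stand-in so that it runs without WebSweep installed; default_parameters is likewise a helper introduced here, and it should work the same way when given Crawler.__init__ or Extractor.__init__.

```python
import inspect


class DemoCrawler:
    """Hypothetical stand-in class, used only to keep the sketch self-contained."""

    def __init__(self, target_folder_path, max_level=3, save_html=True, **kwargs):
        self.target_folder_path = target_folder_path


def default_parameters(callable_obj):
    """Map each parameter that has a default value to that default."""
    params = inspect.signature(callable_obj).parameters
    return {
        name: param.default
        for name, param in params.items()
        if param.default is not inspect.Parameter.empty
    }


print(default_parameters(DemoCrawler.__init__))
```

Parameters without a default (such as target_folder_path) and catch-all parameters (such as **kwargs) are left out, which keeps the listing focused on values that can be overridden.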

Optional add-on module

Repository add-on example:

addons/firmbackbone_extractor.py
