WebSweep Example for Researchers

This notebook is split into two parts:

  • Part A: the smallest default workflow (crawler -> extractor -> consolidator)

  • Part B: extended usage patterns (one-pass mode, custom FileExtractor, and parameter defaults)

What each pipeline step does:

  • Crawler: starts from base URLs (domains), crawls pages by following only within-domain links, applies exclusion rules, and stops at depth max_level (default 3).

  • Extractor: reads crawled pages and extracts page-level fields such as cleaned text (text), metadata (meta_*), and other fields (e.g., zipcode, address).

  • Consolidator: merges page-level records into one domain-level record with concatenated text and aggregated data (e.g., zipcode counts: the most frequent can be treated as the main postcode, the others as additional postcodes).

Input:
  URL list
      |
      v
  [Crawler]
    In:  URL list + crawl rules + max_level (default 3)
    Out: crawled_data/*.zip + overview_urls.{duckdb|db|tsv}
      |
      v
  [Extractor]
    In:  overview file + crawled_data/*.zip
    Out: extracted_data/*.ndjson (text, metadata, postcode/address, ...)
      |
      v
  [Consolidator]
    In:  extracted_data/*.ndjson
    Out: consolidated_data/*.ndjson (domain-level, postcode counts, concatenated text)

This notebook demonstrates the library API. For the CLI workflow (instance setup, recurring runs, and command options), see the README or the User Guide in the docs.

[12]:
from pathlib import Path
import json

urls = [
    "https://www.dggrootverbruik.nl/",
    "https://www.gosliga.nl/",
    "https://www.heeren2.nl/",
]

# Set up the paths
run_dir = Path("./data")
output_dir = run_dir / "research_output"
output_dir.mkdir(parents=True, exist_ok=True)

print("urls:", urls)
print("output_dir:", output_dir)

urls: ['https://www.dggrootverbruik.nl/', 'https://www.gosliga.nl/', 'https://www.heeren2.nl/']
output_dir: data/research_output

Part A. Default Workflow (No Extra Parameters)

Step 1. Crawl (defaults)

The crawler starts from the base URL list, downloads pages, follows links that stay within the same domain, skips excluded URLs/files, and continues up to depth max_level=3 by default.

Input

  • urls

Output

  • crawled_data/*.zip

  • overview_urls.{duckdb|db|tsv}
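
To make the depth limit concrete, here is a minimal sketch of a depth-limited, same-domain traversal over a toy in-memory link graph. It is not WebSweep's implementation: the helper names (same_domain, crawl_order), the toy_links graph, and the example.org URLs are all hypothetical, and the real crawler also downloads pages and applies exclusion rules.

```python
from collections import deque
from urllib.parse import urlparse


def same_domain(url_a: str, url_b: str) -> bool:
    """Compare hosts, ignoring a leading 'www.' (a simplification)."""
    def host(url: str) -> str:
        netloc = urlparse(url).netloc.lower()
        return netloc[4:] if netloc.startswith("www.") else netloc
    return host(url_a) == host(url_b)


def crawl_order(base_url, links, max_level=3):
    """Return (url, level) pairs reachable from base_url within max_level.

    `links` is a hypothetical link graph: {url: [linked urls]}.
    """
    seen = {base_url}
    queue = deque([(base_url, 0)])  # the base URL sits at level 0
    visited = []
    while queue:
        url, level = queue.popleft()
        visited.append((url, level))
        if level == max_level:
            continue  # do not follow links beyond the depth limit
        for target in links.get(url, []):
            if target not in seen and same_domain(base_url, target):
                seen.add(target)
                queue.append((target, level + 1))
    return visited


toy_links = {
    "https://www.example.org/": [
        "https://www.example.org/about",
        "https://other.example.net/",  # different domain: skipped
    ],
    "https://www.example.org/about": ["https://www.example.org/team"],
    "https://www.example.org/team": ["https://www.example.org/team/jobs"],
}
result = crawl_order("https://www.example.org/", toy_links, max_level=2)
for url, level in result:
    print(level, url)
```

With max_level=2 the page linked from /team is never reached, because /team itself already sits at the depth limit.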

[13]:
from websweep.crawler.crawler import Crawler

crawler = Crawler(target_folder_path=output_dir)
crawler.crawl_base_urls(urls)

# Note: the crawler detects existing data, so running this cell a second time skips crawling.
100%|█████████████████████████████████████████████| 3/3 [00:11<00:00,  3.74s/it]
Crawled 19 pages from 3 urls to level 3 in 11.2 seconds.

[14]:
# List the files the crawler downloaded
from pathlib import Path
print('Crawled data files:')
for p in sorted((output_dir / 'crawled_data').rglob('*')):
    if p.is_file():
        print(p.relative_to(output_dir))
Crawled data files:
crawled_data/dggrootverbruik.nl.zip
crawled_data/gosliga.nl.zip
crawled_data/heeren2.nl.zip

Step 2. Extract (defaults)

The extractor reads crawled pages and writes one record per page with structured fields, including cleaned text (text), metadata (meta_*), and location fields (zipcode, address).

Input

  • overview_urls.*

  • crawled_data/*.zip

Output

  • extracted_data/*.ndjson

[15]:
from websweep.extractor.extractor import Extractor

extractor = Extractor(target_folder_path=output_dir)
extractor.extract_urls()

100%|███████████████████████████████████████████| 19/19 [00:00<00:00, 94.72it/s]
Extracted data from 19 pages (0 errors) in 0.2 seconds.

[19]:
# Print the first 200 characters of each of the first 10 extracted records
extracted_files = sorted((output_dir / "extracted_data").glob("*.ndjson"))
print("extracted files:", [f.name for f in extracted_files])

test_extracted = extracted_files[0]
with test_extracted.open("r", encoding="utf-8", errors="ignore") as f:
    for i, line in enumerate(f):
        if i >= 10:
            break
        print(line.rstrip()[:200])

extracted files: ['extracted_data_2026-02-23_0-1000000.ndjson']
{"domain":"gosliga.nl","identifier":"gosliga.nl","level":0,"website":"https://www.gosliga.nl/","date":"2026-02-23","path":"data/research_output/crawled_data/gosliga.nl/gosliga.nl/2026-02-23/www.goslig
{"domain":"gosliga.nl","identifier":"gosliga.nl","level":1,"website":"https://www.gosliga.nl/home","date":"2026-02-23","path":"data/research_output/crawled_data/gosliga.nl/gosliga.nl/2026-02-23/home",
{"domain":"gosliga.nl","identifier":"gosliga.nl","level":1,"website":"https://www.gosliga.nl/transport/","date":"2026-02-23","path":"data/research_output/crawled_data/gosliga.nl/gosliga.nl/2026-02-23/
{"domain":"gosliga.nl","identifier":"gosliga.nl","level":1,"website":"https://www.gosliga.nl/contactformulier/","date":"2026-02-23","path":"data/research_output/crawled_data/gosliga.nl/gosliga.nl/2026
{"domain":"gosliga.nl","identifier":"gosliga.nl","level":1,"website":"https://www.gosliga.nl/over-ons/","date":"2026-02-23","path":"data/research_output/crawled_data/gosliga.nl/gosliga.nl/2026-02-23/o
{"domain":"dggrootverbruik.nl","identifier":"dggrootverbruik.nl","level":1,"website":"https://www.dggrootverbruik.nl","date":"2026-02-23","path":"data/research_output/crawled_data/dggrootverbruik.nl/d
{"domain":"gosliga.nl","identifier":"gosliga.nl","level":1,"website":"https://www.gosliga.nl/diensten/opslag-of-loodsen","date":"2026-02-23","path":"data/research_output/crawled_data/gosliga.nl/goslig
{"domain":"dggrootverbruik.nl","identifier":"dggrootverbruik.nl","level":0,"website":"https://www.dggrootverbruik.nl/","date":"2026-02-23","path":"data/research_output/crawled_data/dggrootverbruik.nl/
{"domain":"gosliga.nl","identifier":"gosliga.nl","level":1,"website":"https://www.gosliga.nl/home/","date":"2026-02-23","path":"data/research_output/crawled_data/gosliga.nl/gosliga.nl/2026-02-23/home"
{"domain":"gosliga.nl","identifier":"gosliga.nl","level":1,"website":"https://www.gosliga.nl/diensten/silovervoer","date":"2026-02-23","path":"data/research_output/crawled_data/gosliga.nl/gosliga.nl/2
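
Because each NDJSON line is one page-level record, a quick sanity check is to count records per domain. The sketch below assumes only the domain field visible in the output above. The helper name pages_per_domain is hypothetical, and the sample lines are shortened stand-ins for real records.

```python
import json
from collections import Counter


def pages_per_domain(ndjson_lines):
    """Count page-level records per domain from an iterable of NDJSON lines."""
    counts = Counter()
    for line in ndjson_lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        counts[record.get("domain", "<unknown>")] += 1
    return counts


# Shortened stand-ins for real extracted records.
sample_lines = [
    '{"domain": "gosliga.nl", "level": 0, "website": "https://www.gosliga.nl/"}',
    '{"domain": "gosliga.nl", "level": 1, "website": "https://www.gosliga.nl/home"}',
    '{"domain": "dggrootverbruik.nl", "level": 0, "website": "https://www.dggrootverbruik.nl/"}',
]
sample_counts = pages_per_domain(sample_lines)
print(sample_counts.most_common())
```

For a real run, an open file handle on one of the extracted_data/*.ndjson files can be passed in place of sample_lines, since a file object iterates line by line.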

Step 3. Consolidate (defaults)

The consolidator groups page-level records into one record per domain. It keeps aggregated postcode information (e.g., zipcode counts: the most frequent as the main postcode, the others as additional postcodes) and concatenates the text from all pages of the same domain.

Input

  • extracted_data/*.ndjson

Output

  • consolidated_data/consolidated.ndjson

[ ]:
from websweep.consolidator.consolidator import Consolidator

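# Consolidator automatically reads the latest extracted_data/*.ndjson
# and writes to consolidated_data/consolidated.ndjson inside output_dir.
# Note: the signature printed in B3 shows Consolidator(input_file, chunk_size);
# if your installed version rejects target_folder_path, check that signature first.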
consolidator = Consolidator(target_folder_path=output_dir)
consolidator.consolidate()

consolidated_path = output_dir / "consolidated_data" / "consolidated.ndjson"
print("consolidated file:", consolidated_path)

consolidated_text = consolidated_path.read_text(encoding='utf-8', errors='ignore')
# First 200 characters of the consolidated output
print(consolidated_text[:200])

[21]:
if consolidated_path.exists():
    with consolidated_path.open("r", encoding="utf-8") as f:
        first_domain_record = json.loads(f.readline())
    print("consolidated keys:", sorted(first_domain_record.keys()))
    print("example domain:", first_domain_record.get("domain"))

consolidated keys: ['address', 'btw', 'domain', 'email', 'fax', 'identifier', 'kvk', 'phone', 'text', 'zipcode']
example domain: dggrootverbruik.nl
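
The "most frequent postcode as main postcode" rule from Step 3 can be sketched as follows. This assumes the counts are available as a mapping from postcode to count. The actual layout of the zipcode field in consolidated.ndjson is not shown in this notebook, so inspect a record before adapting this. The helper name split_postcodes and the sample postcodes are made up for illustration.

```python
from collections import Counter


def split_postcodes(zipcode_counts):
    """Split a {postcode: count} mapping into (main, additional).

    The main postcode is the most frequent one; the remaining postcodes
    follow in descending order of count.
    """
    ranked = Counter(zipcode_counts).most_common()
    if not ranked:
        return None, []
    main = ranked[0][0]
    additional = [code for code, _ in ranked[1:]]
    return main, additional


# Made-up counts for illustration only.
main_postcode, additional_postcodes = split_postcodes(
    {"8501 AB": 5, "1012 JS": 1, "3511 CE": 2}
)
print("main:", main_postcode)
print("additional:", additional_postcodes)
```

Ties are resolved by insertion order here, so a real analysis may want an explicit tie-breaking rule.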

Part B. Extended Usage

This section shows optional advanced patterns to use once you are comfortable with the default workflow.

B1. Crawl + Extract in One Pass (save disk) + Extension Filters

Use one-pass mode when you want to skip saving raw HTML zip files. You can also pass allow_extensions / block_extensions to Crawler to control which linked file types are followed.

[ ]:
from websweep.crawler.crawler import Crawler

one_pass_dir = run_dir / "research_output_one_pass"
one_pass_dir.mkdir(parents=True, exist_ok=True)

# Optional extension controls (a comma-separated string or a list both work).
# Keep PDFs and PNGs discoverable, while skipping common binary/archive types.
one_pass_crawler = Crawler(
    target_folder_path=one_pass_dir,
    extract=True,
    save_html=False,
    allow_extensions="pdf,png",
    block_extensions="zip,gz,rar,jpg,jpeg",
)
one_pass_crawler.crawl_base_urls(urls)
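
As a rough illustration of the extension controls, the sketch below normalizes either input form (comma-separated string or list) into a set and reads the file extension from a URL path. The helper names normalize_extensions and url_extension are hypothetical. How WebSweep itself parses these values and combines the allow and block lists is not shown here, so treat this as an assumption.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def normalize_extensions(value):
    """Turn 'pdf,png' or ['pdf', 'png'] into {'pdf', 'png'}."""
    if value is None:
        return set()
    items = value.split(",") if isinstance(value, str) else value
    return {item.strip().lower().lstrip(".") for item in items if item.strip()}


def url_extension(url):
    """Return the lowercase file extension of a URL path, or '' if none."""
    suffix = PurePosixPath(urlparse(url).path).suffix
    return suffix.lower().lstrip(".")


blocked = normalize_extensions("zip,gz,rar,jpg,jpeg")
allowed = normalize_extensions(["pdf", "png"])

for url in [
    "https://example.org/files/report.PDF?download=1",
    "https://example.org/data/archive.tar.gz",
    "https://example.org/about/",
]:
    ext = url_extension(url)
    print(url, "->", repr(ext), "blocked" if ext in blocked else "not blocked")
```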

B2. Custom FileExtractor

By default, the extractor keeps a conservative set of fields. You can add your own fields by subclassing FileExtractor and defining additional extraction methods.

[27]:
import re2 as re
from websweep.extractor.extractor import Extractor, FileExtractor

# Repository add-on example: addons/firmbackbone_extractor.py

class ResearchFileExtractor(FileExtractor):
    def _extract_fax(self) -> list:
        pattern = re.compile(
            r"(?is)\b(?:faxnumber|fax|f)\b[^0-9\+]{0,12}"
            r"([\+]?[0-9][0-9\-\s\(\)]{7,20})\b"
        )
        return sorted({m.strip() for m in re.findall(pattern, str(self.soup))})

custom_extractor = Extractor(
    target_folder_path=output_dir,
    file_extractor=ResearchFileExtractor,
)
custom_extractor.extract_urls()  # runs the extraction with the custom FileExtractor


100%|██████████████████████████████████████████| 19/19 [00:00<00:00, 147.48it/s]
Extracted data from 19 pages (0 errors) in 0.2 seconds.
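
Before wiring a new pattern into a FileExtractor subclass, it can help to try it on a small string first. The sketch below reuses the fax pattern from the cell above but runs it with the standard library re module instead of re2, on the assumption that both behave the same for this pattern. The function name extract_fax_numbers and the sample HTML (with made-up numbers) are hypothetical.

```python
import re

# Same pattern as in ResearchFileExtractor._extract_fax, compiled with stdlib re.
FAX_PATTERN = re.compile(
    r"(?is)\b(?:faxnumber|fax|f)\b[^0-9\+]{0,12}"
    r"([\+]?[0-9][0-9\-\s\(\)]{7,20})\b"
)


def extract_fax_numbers(html: str) -> list:
    """Return the sorted, de-duplicated fax numbers found in an HTML string."""
    return sorted({match.strip() for match in FAX_PATTERN.findall(html)})


# Made-up contact block for illustration only.
sample_html = "<p>Tel: 0513 123456</p><p>Fax: 0513 654321</p>"
fax_numbers = extract_fax_numbers(sample_html)
print(fax_numbers)
```

Only the number that follows the "Fax" label is captured. The telephone number is ignored because no fax keyword precedes it.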

B3. Show Default Function Parameters

Printing the constructor signatures is a quick way to check default values such as max_level and threads_download, as well as the extractor and consolidator defaults.

[23]:
import inspect
from websweep.crawler.crawler import Crawler
from websweep.extractor.extractor import Extractor
from websweep.consolidator.consolidator import Consolidator

print("Crawler.__init__ defaults:")
print(inspect.signature(Crawler.__init__))
print()
print("Extractor.__init__ defaults:")
print(inspect.signature(Extractor.__init__))
print()
print("Consolidator.__init__ defaults:")
print(inspect.signature(Consolidator.__init__))

Crawler.__init__ defaults:
(self, target_folder_path, target_temp_folder_path=None, save_html=True, max_level=3, classification_file_path=None, allow_extensions=None, block_extensions=None, verify_ssl=False, concurrency_base_urls=60, threads_bs4=10, threads_download=120, use_database=True, sock_connect=180, extract=False, headers=None, file_extractor=None, max_pages_per_domain=50, min_days_between_crawls=30, chunk_size=1000000, overview_backend: Optional[str] = None, concurrency_pages: Optional[int] = None, page_batch_size: int = 500, base_url_batch_size: int = 1000, **kwargs)

Extractor.__init__ defaults:
(self, target_folder_path, use_database=True, extractor_delete_files=False, start_date='0000-01-01', end_date='9999-01-01', file_extractor: websweep.extractor.extractor.FileExtractor = None, overview_backend: Optional[str] = None, workers: Optional[int] = None, imap_chunksize: int = 50, maxtasksperchild: int = 1000, extract_timeout_seconds: int = 10, **kwargs)

Consolidator.__init__ defaults:
(self, input_file: str, chunk_size: int = 10000)
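
The long signature strings above are easier to scan as a dictionary of parameter names and defaults. The sketch below shows one way to do that with inspect. The helper name default_parameters is hypothetical, and example_init is a stand-in function so the block runs without WebSweep installed.

```python
import inspect


def default_parameters(func):
    """Map each parameter that has a default value to that default."""
    return {
        name: param.default
        for name, param in inspect.signature(func).parameters.items()
        if param.default is not inspect.Parameter.empty
    }


def example_init(self, target_folder_path, save_html=True, max_level=3, **kwargs):
    """Stand-in with a signature shaped like a small part of Crawler.__init__."""


defaults = default_parameters(example_init)
print(defaults)
```

Passing Crawler.__init__ or Extractor.__init__ instead of example_init should give the corresponding defaults for the installed version.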

Optional add-on module

Repository add-on example:

  • addons/firmbackbone_extractor.py