Using ClickHouse for OLAP on High-Velocity Web Scrape Streams
Architecture guide: ingest continuous web‑scrape telemetry into ClickHouse for real‑time analytics and tiered retention cost savings.
Stop losing insights to laggy analytics and exploding storage bills
If you operate a fleet of scrapers that emit millions of telemetry events per hour — request timings, HTTP statuses, proxy IDs, fingerprints and error traces — you need an OLAP system that can ingest without blocking your crawlers, provide real‑time observability, and cost‑effectively hold months or years of history for trend analysis and model training. In 2026, teams choose ClickHouse for this workload because it delivers sub‑second analytical queries at a fraction of the cost of columnar cloud data warehouses. This guide shows a production architecture and concrete ClickHouse patterns to ingest continuous web‑scrape telemetry, serve real‑time dashboards and keep long‑term retention affordable.
Executive summary (most important recommendations)
- Ingest via streaming buffers: use Kafka (or a Buffer table) plus materialized views to protect ClickHouse from spikes.
- Design for fast inserts: compact small events into micro‑batches (10k–50k rows) and use MergeTree variants with good ORDER BY keys.
- Use tiered storage + TTLs: keep hot day/week data on fast NVMe, move older data to S3 or cold disks via ClickHouse storage policies.
- Pre‑aggregate for dashboards: projections or AggregatingMergeTree / materialized views to avoid scanning large raw tables.
- Control cost with compression and column types: LowCardinality for repeated strings, ZSTD with tuned levels, and sparse columns for optional fields.
- Monitor merge queues and parts: backpressure shows up as many small parts and stalled merges — tune parts/merges and merge throttles.
Why ClickHouse for high‑velocity scraping telemetry in 2026
ClickHouse continues to see rapid enterprise adoption and investment—driving feature velocity in distributed OLAP, tiered storage and streaming integrations. In late 2025 the vendor’s growth spotlighted its role as a high‑throughput analytical engine for observability and telemetry workloads. For telemetry from scrapers, ClickHouse brings:
- Massive insert throughput with MergeTree engines and ingestion patterns tuned for streams.
- Low‑latency queries via ORDER BY keys, data‑skipping indexes and projection pre‑aggregations.
- Flexible tiering and TTL rules to move cold data to object storage and reduce compute footprint.
- Built‑in connectors for Kafka, RabbitMQ, and HTTP that simplify real‑time pipelines.
Reference architecture — components and flow
At a glance, the pipeline has four layers: ingestion, buffer/streaming, ClickHouse cluster (hot+warm tiers), and archive/analytics. Below is a practical architecture that balances reliability, performance and cost.
1) Ingestion layer (scrapers → brokers)
- Scrapers emit lightweight JSON/Avro events to a message broker (Kafka recommended for order guarantees and backpressure handling).
- Event schema: keep a narrow, typed shape (timestamp, scrape_id, domain, status, duration_ms, bytes, proxy_id, error_code, fingerprint, user_agent_hash). Avoid storing full page HTML in telemetry — forward payloads to object storage if needed.
2) Stream buffer (Kafka → ClickHouse)
- Create a ClickHouse Kafka table or use an intermediate consumer that batches into ClickHouse. Materialized views move messages into a MergeTree table atomically and with idempotency when possible.
- Alternative for simple fleets: ClickHouse Buffer table engine in front of MergeTree to absorb spikes and flush in configurable intervals.
3) ClickHouse cluster (hot + warm)
- Hot tier (local NVMe, small retention window) for sub‑second dashboards and anomaly detection. See patterns from edge/backends designs when planning local NVMe and CPU separation.
- Warm tier (cheap disks or remote SSDs) for weekly/monthly analytics.
- Use Distributed tables across shards with ReplicatedMergeTree engines (coordinated by ClickHouse Keeper or ZooKeeper) for resilience and scale-out.
4) Archive (S3/object store)
- Configure storage policies and TTLs to move old parts to S3 or delete them after retention. This preserves queryability (cold scans are slower) while cutting storage cost.
- Maintain compacted daily/monthly (downsampled) rollups for long‑term trend analysis without the raw event volume.
Concrete table design: telemetry schema and MergeTree settings
Below is a battle‑tested schema and storage pattern. Adjust column selection based on your events, but keep these principles: typed columns, LowCardinality for repeating strings, and a stable ORDER BY that groups common query patterns.
CREATE TABLE scrape_telemetry (
    event_time   DateTime64(3),
    scrape_id    UUID,
    site_domain  LowCardinality(String),
    status_code  UInt16,
    latency_ms   UInt32,
    bytes        UInt32,
    proxy_id     LowCardinality(String),
    ua_id        UInt32,                 -- map user agent to an integer in a lookup table
    error_type   LowCardinality(String),
    fingerprint  FixedString(16),
    attempt      UInt8,
    version      UInt64                  -- for ReplacingMergeTree dedup
)
ENGINE = ReplicatedReplacingMergeTree('/clickhouse/tables/{shard}/scrape_telemetry', '{replica}', version)
PARTITION BY toYYYYMMDD(event_time)
ORDER BY (site_domain, event_time, scrape_id)
TTL event_time + INTERVAL 90 DAY TO VOLUME 'cold'
SETTINGS storage_policy = 'hot_cold';   -- policy name is illustrative; it must define a 'cold' volume
Key reasoning:
- PARTITION BY daily or hourly depending on write volume — smaller partitions improve parallel writes but increase part count; daily is a good default for high velocity.
- ORDER BY (site_domain, event_time, scrape_id) speeds queries that filter by site and recent time ranges (most dashboards do), while scrape_id makes the sorting key unique per event.
- ReplacingMergeTree + version enables inexpensive deduplication when scrapers re‑emit events on retries: rows sharing the same sorting key collapse to the one with the highest version during merges.
- LowCardinality reduces memory and storage for repeated dimensions (domain, proxy_id, error_type).
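Because deduplication happens lazily during background merges, queries that must see fully deduplicated data can read with FINAL (slower but exact). A minimal sketch of a dashboard‑style query under that assumption:
SELECT site_domain, count() AS events, quantile(0.95)(latency_ms) AS p95_latency
FROM scrape_telemetry FINAL
WHERE event_time >= now() - INTERVAL 1 HOUR
GROUP BY site_domain
ORDER BY events DESC;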
Ingestion patterns and operational knobs
To get ClickHouse to sustain millions of small events per second, optimize how rows reach the MergeTree engine and how parts are compacted.
Kafka + materialized view approach (recommended)
- Produce events to Kafka in Avro/JSON with schema registry.
- Create a ClickHouse table with the Kafka engine that reads the topic.
- Create a MATERIALIZED VIEW that consumes Kafka rows and INSERTs into the MergeTree table in batches.
CREATE TABLE kafka_scrapes (
    event_time DateTime64(3),
    ...
)
ENGINE = Kafka
SETTINGS kafka_broker_list = 'k1:9092,k2:9092', kafka_topic_list = 'scrape-events',
         kafka_group_name = 'ch-ingest-v1', kafka_format = 'JSONEachRow',
         kafka_skip_broken_messages = 100;  -- tolerate a bounded number of malformed messages per block
CREATE MATERIALIZED VIEW mv_kafka_to_main TO scrape_telemetry AS
SELECT * FROM kafka_scrapes;
Best practices:
- Keep Kafka consumer batch sizes moderate (10k–50k messages) to reduce overhead of small inserts — a classic trap covered in the serverless vs dedicated crawlers playbook.
- Set kafka_skip_broken_messages and the correct kafka_format to avoid blocking on malformed events.
- Monitor the Kafka consumer lag; persistent lag means ClickHouse is underprovisioned or batch sizes are too small.
Buffer table approach (intra-cluster smoothing)
When you cannot use Kafka, ClickHouse’s Buffer engine can absorb bursts and write to MergeTree in background intervals. Tune buffer size and flush interval to trade latency for write efficiency.
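A minimal sketch of the Buffer pattern, assuming the database is named default (thresholds are illustrative; Buffer flushes when all min thresholds are met or any max threshold is hit):
CREATE TABLE scrape_telemetry_buffer AS scrape_telemetry
ENGINE = Buffer(default, scrape_telemetry, 16,
                5, 30,              -- min/max seconds before a flush
                10000, 100000,      -- min/max rows per buffer layer
                1000000, 10000000); -- min/max bytes per buffer layer
Producers insert into scrape_telemetry_buffer; ClickHouse flushes to scrape_telemetry in the background, so the MergeTree table receives fewer, larger parts.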
Real‑time analytics patterns
Real‑time dashboards and alerts must be fast without scanning raw history. Use these patterns:
- Materialized views for rollups: maintain per‑minute aggregates by site, status_code and proxy to serve dashboards instantly — similar ideas appear in edge-first coverage patterns like edge-first live coverage.
- Projections: use ClickHouse projections to store precomputed sorts/aggregates close to the data for ultra‑fast lookups.
- Approximate functions for high cardinality: uniqCombined or uniqHLL12 where precise counts are unnecessary.
CREATE MATERIALIZED VIEW mv_1m_rollup TO scrape_rollups AS
SELECT
    toStartOfMinute(event_time) AS minute,
    site_domain,
    status_code,
    count() AS hits,
    sum(latency_ms) AS latency_ms_sum  -- store the sum; compute avg as latency_ms_sum / hits at query time
FROM scrape_telemetry
GROUP BY minute, site_domain, status_code;
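The view writes into scrape_rollups, which is referenced above but not defined. A minimal sketch, assuming a SummingMergeTree target so that hits and latency sums from separate insert blocks merge correctly (partitioning and keys are illustrative):
CREATE TABLE scrape_rollups (
    minute         DateTime,
    site_domain    LowCardinality(String),
    status_code    UInt16,
    hits           UInt64,
    latency_ms_sum UInt64
)
ENGINE = SummingMergeTree
PARTITION BY toYYYYMM(minute)
ORDER BY (site_domain, status_code, minute);
For the projection pattern mentioned above, a hedged example (the projection name and grouping are assumptions, not from the source):
ALTER TABLE scrape_telemetry ADD PROJECTION p_status_minute
(
    SELECT site_domain, toStartOfMinute(event_time), status_code, count()
    GROUP BY site_domain, toStartOfMinute(event_time), status_code
);
ALTER TABLE scrape_telemetry MATERIALIZE PROJECTION p_status_minute;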
Downsampling and long‑term retention (cost control)
Raw telemetry is high cardinality and expensive to keep forever. The common lifecycle splits into hot (7–30 days), warm (30–90 days) and cold (90+ days). Implement these steps:
- TTL moves to object storage: configure storage policies to move parts older than X days to S3 (or a 'cold' disk). Example: TTL event_time + INTERVAL 90 DAY TO VOLUME 'cold'.
- Downsample on move: maintain daily/monthly rollups (AggregatingMergeTree or materialized view) for analytics that tolerate lower granularity.
- Delete raw older than retention: when legal/analytics needs permit, purge raw events older than 365 days to cap storage growth.
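A hedged sketch of this lifecycle expressed as table TTLs (intervals are illustrative; the 'cold' volume must exist in the table's storage policy):
ALTER TABLE scrape_telemetry
    MODIFY TTL
        event_time + INTERVAL 90 DAY TO VOLUME 'cold',   -- move parts to object storage
        event_time + INTERVAL 365 DAY DELETE;            -- purge raw events after a year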
Compression, encodings and cost tuning
Small wins add up at telemetry scale. Use column‑level compression and encode repetitive strings efficiently.
- Set the compression codec to ZSTD with a tuned level for the best space/CPU tradeoff: ZSTD(3) to ZSTD(5) is usually excellent for telemetry.
- Use LowCardinality(String) for domains, proxies, error types — this reduces memory and index sizes drastically.
- Store optional structured fields as Nullable or use Tuple/Map sparingly to avoid sparse wide columns.
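For example, per‑column codecs can be set at creation time or retrofitted with ALTER; the choices below are a reasonable starting point for monotonic timestamps and small integers, not a definitive tuning:
ALTER TABLE scrape_telemetry
    MODIFY COLUMN event_time DateTime64(3) CODEC(DoubleDelta, ZSTD(3)),
    MODIFY COLUMN latency_ms UInt32 CODEC(T64, ZSTD(3)),
    MODIFY COLUMN bytes UInt32 CODEC(T64, ZSTD(3));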
Scaling ClickHouse for write and query throughput
ClickHouse scales horizontally with shards and replicas. Key operational patterns for scrape telemetry:
- Sharding key: choose a shard key that spreads writes (e.g., a hash of scrape_id or of the domain) while keeping common query patterns local where possible.
- Replicas for availability: use at least 2 replicas per shard. Modern ClickHouse deployments use ClickHouse Keeper or external coordination for consensus.
- Distributed tables: expose a Distributed table for application queries that routes to appropriate shards.
- Resharding: plan for adding shards by using consistent hashing or by rebalancing parts — resharding can be operationally heavy; consider using a managed ClickHouse service for resizing.
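A minimal sketch of the Distributed layer, assuming a cluster named scrape_cluster in your remote_servers configuration (the cluster name and sharding expression are illustrative):
CREATE TABLE scrape_telemetry_dist AS scrape_telemetry
ENGINE = Distributed(scrape_cluster, currentDatabase(), scrape_telemetry,
                     cityHash64(toString(scrape_id)));  -- spreads writes evenly across shards
Applications query scrape_telemetry_dist; inserts can go through it as well, or directly to shard‑local tables for higher throughput.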
Monitoring and alerting (avoid silent backpressure)
Telemetry pipelines fail silently when ClickHouse cannot keep up. Monitor these signals:
- system.merges and system.mutations queue size — stalled merges indicate too many small parts.
- system.parts count per table and average part size.
- Insert latency and Kafka consumer lag metrics.
- Disk usage by volume and S3 egress for cold reads. For edge and passive monitoring patterns, see Edge Observability and Passive Monitoring.
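A quick health query along these lines surfaces part explosion early (alerting thresholds are up to you):
SELECT database, table,
       count() AS active_parts,
       formatReadableSize(sum(bytes_on_disk)) AS on_disk,
       formatReadableSize(avg(bytes_on_disk)) AS avg_part_size
FROM system.parts
WHERE active
GROUP BY database, table
ORDER BY active_parts DESC;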
Operational pitfalls and how to avoid them
- Too many tiny inserts: causes explosion of parts and slow merges. Fix with batching at producer or Buffer engine — a common theme in serverless vs dedicated crawlers guidance.
- Bad ORDER BY key: results in poor compression and slow queries. Align ORDER BY with common filters (site + time).
- Blind TTL deletions: ensure compliance and auditability before automatic purges — keep a legal view of retention rules.
- Storing raw HTML in the table: move payloads to object store and keep only references/sha256 fingerprints in ClickHouse.
Security, privacy and compliance
Telemetry can contain PII (IPs, cookies, headers). Apply these rules:
- Redact or hash PII before inserting into ClickHouse — protecting identifiers helps avoid exposures like those discussed in domain reselling scams.
- Document consent and robots.txt rules in your legal playbook and reflect them in your telemetry ingestion policy.
- Use network controls and column‑level access (via Materialized Views or separate tables) to limit who can query raw events — tie this into modern auth deployments such as MicroAuthJS adoption for enterprise access control.
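As a minimal sketch, hashing can happen inside the ingest query itself so raw identifiers never reach disk; the literals below stand in for hypothetical raw fields that are not part of the schema above:
SELECT
    lower(hex(SHA256('198.51.100.7'))) AS ip_hash,                     -- one-way hash of a raw client IP
    toString(sipHash64('session=abc123; uid=42')) AS cookie_hash;      -- 64-bit hash of a cookie header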
Example operational checklist before going to production
- Define the event schema and contract; publish to schema registry.
- Choose ingestion pattern (Kafka + MV or Buffer) and configure batch sizes.
- Create MergeTree table with appropriate PARTITION, ORDER BY, LowCardinality and TTL rules.
- Provision hot and cold storage and define storage policies that move parts to S3.
- Implement materialized views / projections for real‑time dashboards.
- Set up alerts for consumer lag, merging backlog and disk pressure.
- Test recovery and replays from Kafka/S3 to validate idempotency and deduplication strategies.
2026 trends to plan for
Two trends are shaping telemetry architectures in 2026 and should influence your ClickHouse designs:
- Tiered compute + cheaper object storage: more teams are separating CPU for hot queries and cheap S3 for cold parts. ClickHouse’s storage policies and low‑cost object tiers make this practical. See related patterns in Cloud‑Native Observability for Trading Firms.
- AI thirst for reliable telemetry: enterprise research continues to show (see industry reports in late 2025) that weak data management blocks AI value. High‑quality, well‑retained scrape telemetry is becoming a first‑class dataset for model training and observability.
"Treat your scrape telemetry as a product: define SLAs for freshness, completeness and retention. This pays back when teams rely on the dataset for detection, forecasting and model features."
Real world example: rolling up 1B events/day with predictable cost
A scraping platform we worked with in 2025 handled ~1B telemetry events/day. Key outcomes after moving to ClickHouse:
- Ingested via Kafka with consumers batching to 25k messages per insert — peak sustained writes of 70k rows/sec per shard.
- Hot retention: 14 days on NVMe; warm: 76 days on cheaper SSD; cold: >90 days on S3 via TTL moves. Storage cost dropped ~60% vs keeping everything on SSD.
- Dashboards used minute‑level rollups in AggregatingMergeTree, shaving query times to sub‑second without scanning raw partitions.
- Data scientists trained models from downsampled monthly rollups, reducing feature engineering time and avoiding legal exposure to raw payloads.
Actionable next steps (start now)
- Map your current telemetry schema and estimate event cardinality (distinct sites, proxies, UAs).
- Pilot a pipeline: set up Kafka topic → ClickHouse Kafka table → materialized view → MergeTree. Validate latency and consumer lag under load.
- Measure part sizes and merge rates; tune batch sizes so average part sizes stay above roughly 8–32 MB to optimize merges.
- Implement TTL rules to move >30 day data to cheaper storage and build daily rollups for historic queries.
Further reading and references
- ClickHouse docs: MergeTree engines, Kafka engine, storage policies and TTL (refer to your ClickHouse version docs for exact syntax).
- Industry reports (2025–2026) on telemetry and data management that show enterprises need better data pipelines to scale AI and analytics.
Conclusion & call to action
ClickHouse is an excellent fit for high‑velocity web scraping telemetry when you build a pipeline that buffers writes, uses MergeTree patterns tuned for analytics, and applies TTL/tiering to control long‑term costs. Start small with Kafka + materialized views, measure parts and merges, and evolve toward projections and downsampled rollups for long‑term analytics.
Ready to prototype? Webscraper.cloud provides ClickHouse starter templates, prebuilt Kafka ingestion materialized views and retention presets tuned for scraping workloads. Contact us to get a production‑ready repo and a 30‑minute architecture review tailored to your scale.
Related Reading
- Serverless vs Dedicated Crawlers: Cost and Performance Playbook (2026)
- Edge Observability and Passive Monitoring: The New Backbone of Bitcoin Infrastructure in 2026
- Cloud‑Native Observability for Trading Firms: Protecting Your Edge (2026)
- Designing Resilient Edge Backends for Live Sellers: Serverless Patterns & SSR
- Family-Friendly Nightlife: Designing Immersive Evenings for Parents in Dubai (2026)
- Preparing for Storms When Geopolitics Disrupt Energy: Practical Backup Plans for Commuters and Travelers
- Top 5 Executor Weapon & Armor Combos After the Nightreign Buff
- Security Risks of Abandoned VR Services: What to Do When a SaaS/Hardware Vendor Exits
- Monte Carlo for Macro: Adapting 10,000-Simulation Betting Models to Economic Forecasting