This document explains the security architecture of web-csv-toolbox and the reasoning behind its design decisions.
When building applications that process user-uploaded CSV files, you face several security threats:
Attack Scenario: An attacker uploads multiple large CSV files simultaneously to overwhelm your server.
Attacker → 100 concurrent requests × 100MB CSV each
↓
Server → Spawns 100 workers
→ Consumes 10GB+ memory
→ CPU exhaustion
↓
Result → Application crashes or becomes unresponsive
Impact: CPU and memory are exhausted, and the service becomes unavailable to legitimate users.
Attack Scenario: An attacker uploads a CSV with extremely long fields or an enormous number of records.
Impact: Memory grows without bound during parsing, leading to out-of-memory crashes.
Attack Scenario: An attacker uploads maliciously crafted CSV files that are computationally expensive to parse (e.g., deeply nested quotes, complex escaping).
Impact: A single request can monopolize the CPU for minutes, starving every other request.
Attack Scenario: An attacker uploads a small compressed file that expands to enormous size when decompressed.
Example:
Input: small.csv.gz (100KB)
Output: 10GB uncompressed data
Impact: A tiny upload balloons into gigabytes of data in memory once decompressed.
web-csv-toolbox implements a defense-in-depth approach with multiple security layers.
The library provides built-in limits that protect against basic attacks:
maxBufferSizeDefault: 10M characters (10 × 1024 × 1024)
Purpose: Prevents memory exhaustion from unbounded input accumulation.
How It Works:
// Internal buffer management
if (buffer.length > maxBufferSize) {
throw new RangeError('Buffer size exceeded maximum limit');
}
Why 10M: Large enough for legitimate CSV content, small enough to keep worst-case per-request memory bounded.
maxFieldCountDefault: 100,000 fields per record
Purpose: Prevents attacks that create records with millions of columns.
How It Works:
// Field counting during parsing
if (fieldCount > maxFieldCount) {
throw new RangeError('Field count exceeded maximum limit');
}
Why 100k: Far beyond any realistic column count, yet low enough to stop pathological records with millions of fields.
maxBinarySizeDefault: 100MB (100 × 1024 × 1024 bytes)
Purpose: Prevents processing of excessively large binary inputs.
How It Works:
// Size check before processing
if (buffer.byteLength > maxBinarySize) {
throw new RangeError('Binary size exceeded maximum limit');
}
Why 100MB: Generous for typical uploads, while capping how much binary input a single request can force the library to hold.
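If the defaults don't fit your workload, two things matter in practice: exceeding any of these limits throws a RangeError (as shown above), and because they are documented as defaults, the ceilings can presumably be adjusted through the engine configuration. The sketch below assumes the option names maxBufferSize and maxFieldCount, inferred from the default constants above; confirm the exact shape against the EngineConfig documentation before relying on it:
try {
  for await (const record of parse(csv, {
    // Assumed option names: verify against EngineConfig.
    maxBufferSize: 1 * 1024 * 1024, // 1M characters instead of the 10M default
    maxFieldCount: 1_000,           // Expect far fewer columns than the 100k default
  })) {
    // Process record
  }
} catch (error) {
  if (error instanceof RangeError) {
    // A built-in limit was exceeded: reject this input cleanly instead of crashing.
  } else {
    throw error;
  }
}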
The Problem: Without resource management, each CSV processing request could spawn a new worker, leading to:
Request 1 → Worker 1 (uses CPU + memory)
Request 2 → Worker 2 (uses CPU + memory)
Request 3 → Worker 3 (uses CPU + memory)
...
Request 100 → Worker 100 (💥 system overwhelmed)
The Solution: WorkerPool
const pool = new ReusableWorkerPool({ maxWorkers: 4 });
How It Works:
Pool Initialization:
Pool: [Empty] maxWorkers=4
Request Arrives:
Request 1 → Pool creates Worker 1
Pool: [Worker 1] (1/4 workers)
Multiple Requests:
Request 2 → Pool creates Worker 2
Request 3 → Pool creates Worker 3
Request 4 → Pool creates Worker 4
Pool: [Worker 1, Worker 2, Worker 3, Worker 4] (4/4 workers)
Pool Full:
Request 5 → Pool is full (4/4 workers)
→ Request reuses an existing worker (round-robin; see the sketch below)
→ OR rejected early (see Layer 3)
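Conceptually, the reuse step is just an index cycling over the workers that already exist. The class below is an illustration of round-robin selection only, not web-csv-toolbox's actual implementation:
// Illustration only: round-robin reuse over a fixed set of already-created workers.
class RoundRobinIllustration<W> {
  private next = 0;
  constructor(private readonly workers: W[]) {}
  acquire(): W {
    const worker = this.workers[this.next];
    this.next = (this.next + 1) % this.workers.length; // Wrap around to reuse
    return worker;
  }
}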
Key Benefits:
✅ Bounded Resource Usage:
4 workers × ~50MB each = ~200MB maximum
Instead of unbounded growth.
✅ Worker Reuse: Workers are shared across requests, reducing initialization overhead.
✅ Predictable Performance: System performance remains consistent regardless of request volume.
Alternative 1: One Worker Per Request
// ❌ Dangerous: a new worker for every request
async function handleRequest(csv: string) {
  const worker = new Worker(csvWorkerUrl); // csvWorkerUrl: your worker entry point; unbounded growth under load
  for await (const record of parse(csv, { worker })) {
    // Process record
  }
  worker.terminate();
}
Problems: Worker startup and teardown costs on every request, and no upper bound on how many workers exist under concurrent load.
Alternative 2: Single Shared Worker
// ❌ Bottleneck: one shared worker for all requests
const worker = new Worker(csvWorkerUrl);
async function handleRequest(csv: string) {
  for await (const record of parse(csv, { worker })) {
    // Sequential processing: requests queue behind each other
  }
}
Problems: Every request is serialized through one worker, so throughput collapses and a single slow parse blocks everyone.
Our Solution: WorkerPool (Goldilocks Approach)
// ✅ Balanced: a bounded pool, created once and shared by every request
const pool = new ReusableWorkerPool({ maxWorkers: 4 });
async function handleRequest(csv: string) {
  for await (const record of parse(csv, { workerPool: pool })) {
    // Process record
  }
}
Advantages: Bounded resource usage, parallelism up to maxWorkers, and worker reuse across requests.
The Problem: Even with WorkerPool limits, accepting new requests while the pool is saturated means they pile up behind busy workers, hold memory while they wait, and eventually time out.
The Solution: isFull() Check
if (pool.isFull()) {
return c.json({ error: 'Service busy' }, 503);
}
How It Works:
class WorkerPool {
isFull(): boolean {
// Counts both active workers and pending worker creations
const totalWorkers = this.workers.length + this.pendingWorkerCreations.size;
return totalWorkers >= this.maxWorkers;
}
}
Why This Matters:
Without Early Rejection:
Time: 0s
Request 1-4 → Processing (pool full)
Request 5 → Queued, waits 30s → Timeout (poor UX)
User Experience: 30s wait → 408 Timeout Error
With Early Rejection:
Time: 0s
Request 1-4 → Processing (pool full)
Request 5 → Immediate 503 response
User Experience: Instant feedback → Retry with backoff
Benefits:
✅ Immediate Feedback: Clients receive instant 503 response instead of waiting for timeout.
✅ Resource Protection: Prevents queuing of requests that will fail anyway.
✅ Better UX: Enables clients to implement intelligent retry logic:
// Client-side retry with exponential backoff
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));
async function uploadCSV(file, retries = 3) {
  for (let i = 0; i < retries; i++) {
    const response = await fetch('/validate-csv', {
      method: 'POST',
      headers: { 'Content-Type': 'text/csv' }, // Matches the server's Content-Type check
      body: file,
    });
    if (response.status === 503) {
      // Server busy: wait and retry
      await sleep(1000 * 2 ** i); // 1s, 2s, 4s, ...
      continue;
    }
    return response;
  }
  throw new Error('Service unavailable after retries');
}
✅ Load Shedding: Protects backend services from cascading failures.
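On the server side, the same check can also tell clients when to come back. Pairing the 503 with a Retry-After header is a common pattern; the route, app, pool, and the 2-second hint below are illustrative:
// Sketch: reject immediately and hint at a reasonable retry delay.
app.post('/validate-csv', async (c) => {
  if (pool.isFull()) {
    c.header('Retry-After', '2'); // Seconds; tune to your typical parse time
    return c.json({ error: 'Service busy' }, 503);
  }
  // ... continue with the validation layers described below
});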
Alternative: Request Queuing
// ❌ Queue requests when pool is full
if (pool.isFull()) {
await queueRequest(request); // Wait for worker to become available
}
Problems: The queue can grow without bound, queued payloads keep holding memory, and clients still end up waiting until they time out.
Our Solution: Fail Fast
// ✅ Reject immediately
if (pool.isFull()) {
  return c.json({ error: 'Service busy' }, 503); // Instant feedback
}
Advantages: Constant-time rejection, nothing to queue or store, and clients get an immediate, actionable signal to back off.
Multiple validation layers protect against malicious input. The first verifies the Content-Type header:
const contentType = c.req.header('Content-Type');
if (!contentType?.startsWith('text/csv')) {
return c.json({ error: 'Content-Type must be text/csv' }, 415);
}
Why:
startsWith() ensures the media type appears at the beginning of the header (e.g., text/csv; charset=utf-8 is valid, but application/json; text/csv is not).
A Content-Length check then rejects oversized requests up front:
const contentLength = c.req.header('Content-Length');
if (contentLength && Number.parseInt(contentLength, 10) > MAX_SIZE) {
  return c.json({ error: 'Request too large' }, 413);
}
Why: It cheaply rejects obviously oversized uploads before the body is read. Content-Length can be absent or inaccurate, however, which is why the streaming size check below is still needed:
class SizeLimitStream extends TransformStream {
constructor(maxBytes) {
let bytesRead = 0;
super({
transform(chunk, controller) {
bytesRead += chunk.length;
if (bytesRead > maxBytes) {
controller.error(new Error('Size limit exceeded'));
} else {
controller.enqueue(chunk);
}
}
});
}
}
Why: It enforces the limit on the bytes actually received, so requests with a missing or understated Content-Length header are still bounded.
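In a handler, the guard simply sits in the pipe between the raw request body and the parser. The sketch below assumes a Hono handler (c.req.raw.body is the raw ReadableStream) and reuses the MAX_SIZE constant and SizeLimitStream from above:
// Sketch: enforce the byte limit on the stream itself before the parser sees it.
const body = c.req.raw.body;
if (!body) {
  return c.json({ error: 'Request body required' }, 400);
}
const csvStream = body
  .pipeThrough(new SizeLimitStream(MAX_SIZE))  // Errors as soon as the cap is exceeded
  .pipeThrough(new TextDecoderStream());       // Bytes → text for parseStringStream
The resulting csvStream is what the timeout example below feeds into parseStringStream. For compressed uploads, the same guard can be placed after a DecompressionStream so the limit applies to the decompressed bytes, which is what defeats the compression-bomb scenario described earlier.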
A per-request timeout then bounds how long any single parse can run:
const signal = AbortSignal.timeout(30000); // 30 seconds
for await (const record of parseStringStream(csvStream, {
signal,
// ...
})) {
// Processing
}
How It Works:
// Internal signal handling
if (signal.aborted) {
throw new DOMException('Operation timed out', 'AbortError');
}
Why Timeouts Matter:
Without Timeout:
Malicious CSV → Complex escaping → 10 minutes to parse
↓
Server resources tied up for 10 minutes
↓
DoS achieved
With Timeout:
Malicious CSV → Complex escaping → 30 seconds
↓
AbortError thrown
↓
Resources freed immediately
Benefits:
✅ Predictable Resource Usage: No request can consume resources indefinitely.
✅ DoS Prevention: Limits impact of CPU-intensive attack payloads.
✅ Better UX: Clients receive timely error responses.
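In practice the timeout surfaces as a rejected parse. Checking signal.aborted is a robust way to translate it into an HTTP response, since the exact error type thrown on abort can vary (a sketch, reusing csvStream from the earlier example):
const signal = AbortSignal.timeout(30_000);
try {
  for await (const record of parseStringStream(csvStream, { signal })) {
    // Process record
  }
} catch (error) {
  if (signal.aborted) {
    // Parsing exceeded its 30-second budget: free the worker and tell the client.
    return c.json({ error: 'CSV processing timed out' }, 408);
  }
  throw error; // Not a timeout: let other error handling deal with it
}
Where a client disconnect should also cancel parsing, runtimes that support AbortSignal.any can combine the timeout with the request's own signal, for example AbortSignal.any([AbortSignal.timeout(30_000), c.req.raw.signal]).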
After parsing, validate data with schema validation:
import { z } from 'zod';
const recordSchema = z.object({
name: z.string().min(1).max(100),
email: z.string().email(),
age: z.coerce.number().int().min(0).max(150),
});
for await (const record of parse(csv)) {
try {
const validated = recordSchema.parse(record);
// Use validated data
} catch (error) {
// Handle validation error
}
}
Why: The parser only guarantees well-formed CSV, not trustworthy values. Schema validation rejects malformed or malicious field contents before they reach your application logic.
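Where a single bad row shouldn't abort the whole upload, safeParse lets you collect structured issues instead of throwing, for example to feed the SSE error reporting shown in the pipeline below. A sketch using the recordSchema defined above:
// Sketch: collect per-row validation issues rather than failing on the first one.
const issues: Array<{ row: number; problems: string[] }> = [];
let row = 0;
for await (const record of parse(csv)) {
  row++;
  const result = recordSchema.safeParse(record);
  if (!result.success) {
    issues.push({
      row,
      problems: result.error.issues.map((i) => `${i.path.join('.')}: ${i.message}`),
    });
    continue;
  }
  // result.data is the typed, validated record
}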
Putting it all together:
┌─────────────────────────────────────────────────────────┐
│ Client Request │
└─────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────┐
│ Layer 1: Early Rejection (pool.isFull()) │
│ Status: 503 if saturated │
└─────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────┐
│ Layer 2: Content-Type Verification │
│ Status: 415 if not text/csv │
└─────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────┐
│ Layer 3: Content-Length Check │
│ Status: 413 if too large │
└─────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────┐
│ Layer 4: Stream Processing with Timeout │
│ - WorkerPool manages workers (max 4) │
│ - AbortSignal enforces timeout (30s) │
│ - maxBufferSize limits memory (10M chars) │
│ - maxFieldCount limits fields (100k) │
└─────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────┐
│ Layer 5: Data Validation (Zod schema) │
│ - Validates each record │
│ - Reports errors via SSE │
└─────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────┐
│ Success Response (SSE) │
│ - Validation errors sent in real-time │
│ - Summary sent at end │
└─────────────────────────────────────────────────────────┘
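As a rough end-to-end sketch (not the project's example app): a Hono handler that applies each layer in order, reusing MAX_SIZE, SizeLimitStream, recordSchema, and the worker pool from the sections above. Import paths, the SSE event shapes, and the omitted worker-pool wiring are assumptions:
import { Hono } from 'hono';
import { streamSSE } from 'hono/streaming';
import { parseStringStream } from 'web-csv-toolbox';
const app = new Hono();
app.post('/validate-csv', async (c) => {
  // Early rejection while the worker pool is saturated.
  if (pool.isFull()) {
    return c.json({ error: 'Service busy' }, 503);
  }
  // Content-Type verification.
  const contentType = c.req.header('Content-Type');
  if (!contentType?.startsWith('text/csv')) {
    return c.json({ error: 'Content-Type must be text/csv' }, 415);
  }
  // Declared-size check (cheap, but not sufficient on its own).
  const contentLength = c.req.header('Content-Length');
  if (contentLength && Number.parseInt(contentLength, 10) > MAX_SIZE) {
    return c.json({ error: 'Request too large' }, 413);
  }
  const body = c.req.raw.body;
  if (!body) {
    return c.json({ error: 'Request body required' }, 400);
  }
  // Stream processing with a hard byte limit and a timeout.
  const csvStream = body
    .pipeThrough(new SizeLimitStream(MAX_SIZE))
    .pipeThrough(new TextDecoderStream());
  const signal = AbortSignal.timeout(30_000);
  return streamSSE(c, async (stream) => {
    let rows = 0;
    let errors = 0;
    try {
      // Worker-pool wiring omitted here; see the WorkerPool section for parse(csv, { workerPool: pool }).
      for await (const record of parseStringStream(csvStream, { signal })) {
        rows++;
        // Per-record schema validation, reported in real time.
        const result = recordSchema.safeParse(record);
        if (!result.success) {
          errors++;
          await stream.writeSSE({
            event: 'validation-error',
            data: JSON.stringify({ row: rows, issues: result.error.issues }),
          });
        }
      }
      // Summary sent at the end, as in the diagram above.
      await stream.writeSSE({ event: 'summary', data: JSON.stringify({ rows, errors }) });
    } catch {
      const reason = signal.aborted ? 'timeout' : 'parse-error';
      await stream.writeSSE({ event: 'fatal', data: JSON.stringify({ reason }) });
    }
  });
});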
Defense in Depth: Multiple layers ensure that if one layer fails, others still provide protection.
Fail Fast: Reject invalid requests as early as possible to minimize resource consumption.
Bounded Resources: All resources (memory, CPU, workers) have explicit upper bounds.
Graceful Degradation: When limits are reached, return clear error messages instead of crashing.
Least Privilege: Workers run in isolated contexts with minimal privileges.
For advanced configuration options, refer to the EngineConfig type documentation in your IDE or the API Reference.