web-csv-toolbox - v0.14.0

    WebAssembly Architecture

    This document explains the WebAssembly (WASM) implementation in web-csv-toolbox and how it achieves high-performance CSV parsing.

    web-csv-toolbox includes an optional WebAssembly module that provides improved CSV parsing performance compared to the JavaScript implementation. The WASM module is a compiled version of optimized parsing code that runs at near-native speed.

    The library provides two entry points for WASM functionality:

    • Main entry point (web-csv-toolbox): Automatic WASM initialization with embedded binary
    • Slim entry point (web-csv-toolbox/slim): Manual initialization with external WASM loading
    ┌─────────────────────────────────────────────────────────────┐
    │ High-Level API (parse, parseString, etc.)                   │
    └─────────────────────────────────────────────────────────────┘
                              ↓
    ┌─────────────────────────────────────────────────────────────┐
    │ Execution Router                                             │
    │ - Selects execution strategy based on EngineConfig          │
    └─────────────────────────────────────────────────────────────┘
                              ↓
            ┌─────────────────┴─────────────────┐
            ↓                                    ↓
    ┌──────────────────┐              ┌──────────────────┐
    │ JavaScript       │              │ WebAssembly      │
    │ Implementation   │              │ Implementation   │
    │                  │              │                  │
    │ - All features   │              │ - Compiled code  │
    │ - All encodings  │              │ - UTF-8 only     │
    │ - All options    │              │ - Limited options│
    └──────────────────┘              └──────────────────┘
    

    Key Points:

    • WASM is optional (JavaScript fallback always available)
    • Initialization is automatic when using WASM-enabled features
    • Can be combined with Worker Threads for non-blocking parsing
    • Compiled from Rust code using LLVM optimization

    Entry Points

    This project ships two entry points (Main and Slim) that differ only in how WebAssembly is initialized and delivered. For a practical comparison and guidance on when to use each:

    → See: Main vs Slim Entry Points


    Performance:

    • Compiled to machine code
    • Efficient memory management
    • Optimized by LLVM compiler

    Portability:

    • Runs on all major browsers
    • Supported in Node.js and Deno
    • Single binary for all platforms

    Safety:

    • Memory-safe by design
    • Sandboxed execution environment
    • No access to system resources

    The WASM module is compiled from Rust code because:

    Performance:

    • Zero-cost abstractions
    • Optimized by LLVM compiler
    • Minimal runtime overhead

    Memory Safety:

    • No null pointer dereferences
    • No buffer overflows
    • Memory managed at compile time

    WASM Support:

    • First-class WASM support via wasm-bindgen
    • Easy JavaScript interop
    • Automatic TypeScript definitions

    WASM is opt-in rather than always-on because:

    Trade-offs:

    • Size: WASM binary adds to bundle size
    • Initialization: Module loading adds overhead
    • Limitations: UTF-8 only, double-quote only, object output only (array tuples require the JavaScript engine)

    Flexibility:

    • Users can choose based on their needs
    • Automatic fallback to JavaScript for unsupported features
    • JavaScript parser provides full feature compatibility

    // loadWASM.ts (conceptual)
    import init, { type InitInput } from 'web-csv-toolbox-wasm';
    // In the web-csv-toolbox distribution, the WASM asset is exported as `web-csv-toolbox/csv.wasm`.
    import wasmUrl from 'web-csv-toolbox/csv.wasm';

    export async function loadWASM(input?: InitInput) {
      await init({ module_or_path: input ?? wasmUrl });
    }

    How it works:

    1. The WASM binary is distributed as a separate asset (csv.wasm)
    2. init() loads and instantiates the module (via URL or Buffer)
    3. Module is cached globally for reuse
    4. Subsequent calls are instant (already initialized)
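The "cache once, reuse afterwards" behavior in steps 3 and 4 can be illustrated with a small memoized loader. This is a minimal sketch, not the library's loader: `createCachedLoader`, `InitFn`, and the fake initializer are hypothetical names introduced here, and the reset-on-failure behavior is an assumption.

```typescript
// Hypothetical stand-in for the generated init() function; a real one
// would fetch and instantiate csv.wasm.
type InitFn = () => Promise<void>;

// Wraps an init function so initialization runs at most once.
// Repeated and concurrent calls share the same promise.
function createCachedLoader(init: InitFn): () => Promise<void> {
  let pending: Promise<void> | undefined;
  return () => {
    if (pending === undefined) {
      pending = init().catch((error: unknown) => {
        // Assumption: reset on failure so a later call can retry.
        pending = undefined;
        throw error;
      });
    }
    return pending;
  };
}

// Usage: count how many times the (fake) initialization actually runs.
let initCount = 0;
const loadOnce = createCachedLoader(async () => {
  initCount += 1;
});

const firstCall = loadOnce();
const secondCall = loadOnce();
console.log(firstCall === secondCall, initCount);
```

Caching the promise rather than a boolean flag means callers that arrive while loading is still in progress wait for the same initialization instead of starting a second one.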

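    // parseStringToArraySyncWASM.ts (simplified)
    // CommonOptions and CSVRecord are the library's own types (imports omitted here).
    import { parseStringToArraySync } from "web-csv-toolbox-wasm";

    export function parseStringToArraySyncWASM<Header>(
      csv: string,
      options: CommonOptions = {}
    ): CSVRecord<Header>[] {
      // Read options, defaulting to comma and double quote
      const { delimiter = ",", quotation = '"' } = options;

      // Validate options
      if (quotation !== '"') {
        throw new RangeError("Invalid quotation, must be double quote on WASM.");
      }

      // Call WASM function; it returns a JSON string that is parsed on the JS side
      const delimiterCode = delimiter.charCodeAt(0);
      return JSON.parse(parseStringToArraySync(csv, delimiterCode));
    }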

    Key implementation details:

    • WASM function returns JSON string (not JavaScript objects)
    • JSON parsing happens in JavaScript (efficient for object creation)
    • Single-character delimiter passed as char code (u8 in Rust)
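The two boundary conventions above (delimiter as a byte-sized char code, results as a JSON string) can be tried without the WASM module. The sketch below is illustrative only: `toDelimiterCode` and `fakeWasmParse` are hypothetical helpers, the exact validation rules are assumptions, and the stand-in parser ignores quoting entirely.

```typescript
// Converts a delimiter to the single byte a `u8` parameter can hold.
// Hypothetical helper: the name and the exact checks are assumptions.
function toDelimiterCode(delimiter: string): number {
  if (delimiter.length !== 1) {
    throw new RangeError("Delimiter must be a single character.");
  }
  const code = delimiter.charCodeAt(0);
  if (code > 0xff) {
    throw new RangeError("Delimiter must fit in one byte (u8).");
  }
  return code;
}

// Stand-in for the WASM export: returns records as a JSON string.
// It splits naively and does not handle quoted fields.
function fakeWasmParse(csv: string, delimiterCode: number): string {
  const delimiter = String.fromCharCode(delimiterCode);
  const lines = csv.trim().split("\n");
  const header = (lines[0] ?? "").split(delimiter);
  const rows = lines.slice(1).map((line) => {
    const fields = line.split(delimiter);
    const record: Record<string, string> = {};
    header.forEach((name, i) => {
      record[name] = fields[i] ?? "";
    });
    return record;
  });
  return JSON.stringify(rows);
}

// The caller turns the JSON string into objects on the JavaScript side.
const records: Record<string, string>[] = JSON.parse(
  fakeWasmParse("name,age\nAlice,30\nBob,25", toDelimiterCode(",")),
);
console.log(records);
```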

    ┌──────────────────┐                    ┌──────────────────┐
    │ JavaScript Heap  │                    │ WASM Linear      │
    │                  │                    │ Memory           │
    │ - JS Objects     │  Copy data         │                  │
    │ - Strings        │ ────────────────>  │ - CSV String     │
    │ - Arrays         │                    │ - Parsing State  │
    │                  │  Copy result       │ - Output Buffer  │
    │                  │ <────────────────  │                  │
    └──────────────────┘                    └──────────────────┘
    

    Data Flow:

    1. Input: JavaScript string copied to WASM linear memory
    2. Processing: WASM parses CSV entirely in linear memory
    3. Output: JSON string copied back to JavaScript heap
    4. Cleanup: WASM memory automatically freed after parsing
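The copy-in and copy-out steps can be sketched with a plain byte array standing in for linear memory. This is an assumption-laden illustration: in the real module the glue code generated by wasm-bindgen performs these copies, and `linearMemory`, `copyIn`, and `copyOut` are hypothetical names.

```typescript
// Stand-in for WASM linear memory: one 64 KiB page of raw bytes.
const linearMemory = new Uint8Array(64 * 1024);

// Copies a JavaScript string into "linear memory" as UTF-8 bytes
// and returns how many bytes were written.
function copyIn(text: string, offset: number): number {
  const bytes = new TextEncoder().encode(text);
  if (offset + bytes.length > linearMemory.length) {
    throw new RangeError("Input does not fit in linear memory.");
  }
  linearMemory.set(bytes, offset);
  return bytes.length;
}

// Copies bytes back out of "linear memory" into a new JavaScript string.
function copyOut(offset: number, length: number): string {
  return new TextDecoder().decode(linearMemory.subarray(offset, offset + length));
}

const csv = "name,city\nAlice,東京";
const written = copyIn(csv, 0);
const roundTripped = copyOut(0, written);
console.log(csv.length, written, roundTripped === csv);
```

The byte count can exceed the string length because JavaScript strings are UTF-16 while the copy is UTF-8, which is one reason the data has to be copied rather than shared.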

    Memory Efficiency:

    • Input string: Temporary copy in WASM memory
    • Parsing state: Small, constant size
    • Output: JSON string (similar size to input)
    • Total overhead: Approximately 3x input size during parsing (original string, WASM copy, and JSON output)

    WASM respects the same maxBufferSize limit as JavaScript:

    const lexer = new FlexibleStringCSVLexer({ maxBufferSize: 10 * 1024 * 1024 }); // Example: 10MB
    

    Why:

    • Prevents memory exhaustion
    • Consistent behavior across implementations
    • Protection against malicious input
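How a size limit guards against unbounded growth can be shown with a small bounded buffer. This is a sketch under stated assumptions: `BoundedBuffer` is a hypothetical class, the limit is counted in string characters here, and throwing `RangeError` is an assumption rather than a documented behavior.

```typescript
// Hypothetical buffer that refuses to grow past a configured limit.
class BoundedBuffer {
  private buffer = "";
  private readonly maxBufferSize: number;

  constructor(maxBufferSize: number) {
    this.maxBufferSize = maxBufferSize;
  }

  // Appends a chunk, rejecting it if the limit would be exceeded.
  append(chunk: string): void {
    if (this.buffer.length + chunk.length > this.maxBufferSize) {
      throw new RangeError(
        `Buffer size would exceed maxBufferSize (${this.maxBufferSize}).`,
      );
    }
    this.buffer += chunk;
  }

  get size(): number {
    return this.buffer.length;
  }
}

const buffer = new BoundedBuffer(16);
buffer.append("a,b,c\n"); // 6 characters, accepted
try {
  buffer.append("x".repeat(32)); // would exceed the limit, rejected
} catch (error) {
  console.log(error instanceof RangeError, buffer.size);
}
```

Checking before appending keeps the buffer unchanged when a chunk is rejected, so the caller can decide whether to abort or handle the oversized input differently.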

    Performance depends on many factors:

    • CSV structure and size
    • Runtime environment (browser, Node.js, Deno)
    • System capabilities

    Theoretical advantages of WASM:

    • Compiled ahead of time to a compact binary format (vs JavaScript, which is parsed and JIT-compiled at runtime)
    • Efficient memory access patterns
    • Optimized by LLVM compiler

    Actual performance: For measured performance in various scenarios, see CodSpeed benchmarks.


    // First call - module loading
    await loadWASM();

    // Subsequent calls - instant (module cached)
    await loadWASM();

    Considerations:

    • Module loading adds initial overhead
    • Once loaded, module is cached for subsequent use
    • Performance trade-offs depend on file size and parsing frequency
    • Benchmark your specific use case to determine the best approach
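For the benchmarking advice above, a small timing harness is often enough to compare two configurations on your own data. The sketch below is generic: `medianRuntimeMs` is a hypothetical helper, and the workload is a naive splitter standing in for a real parser call, so the numbers it prints say nothing about this library.

```typescript
// Times a synchronous function over several runs and returns the
// median duration in milliseconds.
function medianRuntimeMs(run: () => void, iterations = 5): number {
  const samples: number[] = [];
  for (let i = 0; i < iterations; i++) {
    const start = performance.now();
    run();
    samples.push(performance.now() - start);
  }
  samples.sort((a, b) => a - b);
  return samples[Math.floor(samples.length / 2)] ?? 0;
}

// Hypothetical workload: replace with the parser call you want to measure.
const sample = "a,b,c\n".repeat(10000);
const elapsed = medianRuntimeMs(() => {
  sample.split("\n").map((line) => line.split(","));
});
console.log(`median: ${elapsed.toFixed(3)} ms`);
```

When comparing engines, measure the first call separately from later calls, since module loading is paid only once.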

    Both implementations have similar memory usage:

    Stage     JavaScript                       WASM
    -------   ------------------------------   --------------------------------
    Input     String (in heap)                 String (copied to linear memory)
    Parsing   CSVLexer buffer (configurable)   Parsing state (configurable)
    Output    Objects (in heap)                JSON string → Objects

    Total: Both implementations use approximately 2x input size temporarily during parsing.


    for await (const record of parse(csv, {
      engine: { wasm: true }
    })) {
      console.log(record);
    }

    Architecture:

    Main Thread:
      1. Load CSV string
      2. Call WASM function
      3. Parse CSV in WASM
      4. Return results
      5. Yield records
    

    Characteristics:

    • ✅ Uses compiled WASM code
    • ✅ No worker communication overhead
    • ❌ Blocks main thread during parsing
    • Performance trade-off: Faster execution (no communication cost) but UI becomes unresponsive
    • Use case: Server-side parsing, scenarios where blocking is acceptable

    for await (const record of parse(csv, {
      engine: { worker: true, wasm: true }
    })) {
      console.log(record);
    }

    Architecture:

    Main Thread:                 Worker Thread:
      1. Transfer CSV data  -->    1. Receive CSV data
      2. Wait for results          2. Call WASM function
      3. Receive records    <--    3. Parse CSV in WASM
      4. Yield records             4. Send results back
    

    Characteristics:

    • ✅ Non-blocking UI
    • ✅ Uses compiled WASM code
    • ✅ Offloads parsing to worker thread
    • ⚠️ Worker communication adds overhead (data transfer between threads)
    • Performance trade-off: Execution time may increase due to communication cost, but UI remains responsive
    • Use case: Browser applications, scenarios requiring UI responsiveness

    Limitation: WASM parser only supports UTF-8 encoded strings.

    Why:

    • Simplifies implementation
    • UTF-8 is the web standard
    • Smaller WASM binary size

    Workaround: For non-UTF-8 encodings, the router automatically falls back to JavaScript:

    // Automatic fallback for Shift-JIS
    for await (const record of parse(csv, {
      engine: { wasm: true },
      charset: 'shift-jis' // Falls back to JavaScript
    })) {
      console.log(record);
    }

    Limitation: WASM parser only supports double-quote (") as quotation character.

    Why:

    • Simplifies state machine
    • Double-quote is CSV standard (RFC 4180)
    • Smaller WASM binary size

    Workaround: For single-quote CSVs, use JavaScript parser:

    for await (const record of parse(csv, {
      engine: { wasm: false },
      quotation: "'"
    })) {
      console.log(record);
    }
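The encoding and quotation limitations above both come down to a simple decision before parsing starts. The following is a minimal sketch of such a decision function, not the library's router: `selectEngine`, `SelectionInput`, and the field names are hypothetical, and the accepted spellings of the UTF-8 label are an assumption.

```typescript
type Engine = "wasm" | "js";

interface SelectionInput {
  wasmRequested: boolean; // e.g. engine: { wasm: true }
  wasmLoaded: boolean; // whether module initialization succeeded
  charset?: string; // treated as UTF-8 when omitted
  quotation?: string; // treated as a double quote when omitted
}

// Hypothetical decision function mirroring the limitations described above.
function selectEngine(input: SelectionInput): Engine {
  const charset = (input.charset ?? "utf-8").toLowerCase();
  const quotation = input.quotation ?? '"';

  if (!input.wasmRequested || !input.wasmLoaded) return "js";
  if (charset !== "utf-8" && charset !== "utf8") return "js";
  if (quotation !== '"') return "js";
  return "wasm";
}

console.log(selectEngine({ wasmRequested: true, wasmLoaded: true }));
console.log(selectEngine({ wasmRequested: true, wasmLoaded: true, charset: "shift-jis" }));
console.log(selectEngine({ wasmRequested: true, wasmLoaded: true, quotation: "'" }));
```

Ordering the checks from cheapest to most specific keeps the common case (WASM not requested or not loaded) on the shortest path.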

    Object Output Only

    Limitation: WASM parser always emits object-shaped records. outputFormat: 'array' (named tuples) currently runs only on the JavaScript engine.

    Why:

    • WASM returns JSON that maps headers to values (object form)
    • Supporting tuple output would require a different serialization path and additional memory copying

    Workaround: Force the JavaScript engine whenever you need array output or includeHeader:

    const rows = await parse.toArray(csv, {
      header: ["name", "age"] as const,
      outputFormat: "array",
      includeHeader: true,
      engine: { wasm: false }, // Skip WASM, use JS implementation
    });

    Limitation: WASM parser processes the entire CSV string at once.

    Why:

    • Simpler implementation
    • Optimized for complete strings
    • Avoids complex state management across calls

    Impact:

    • Memory usage proportional to file size
    • Not suitable for unbounded streams
    • Choose appropriate approach based on your file size and memory constraints

    Workaround: For incremental parsing, the JavaScript implementation supports chunk-by-chunk processing:

    const lexer = new FlexibleStringCSVLexer();

    for (const chunk of chunks) {
      for (const token of lexer.lex(chunk, true)) {
        // Process tokens incrementally
      }
    }

    lexer.flush();

    The execution router automatically falls back to JavaScript when WASM is unavailable or incompatible:

    ┌─────────────────────────────────────────────────────────────┐
    │ User requests WASM execution                                 │
    └─────────────────────────────────────────────────────────────┘
                              ↓
    ┌─────────────────────────────────────────────────────────────┐
    │ Check: Is WASM loaded?                                       │
    └─────────────────────────────────────────────────────────────┘
            ↓ No                              ↓ Yes
    ┌──────────────────┐              ┌──────────────────┐
    │ Fallback to JS   │              │ Check: UTF-8?    │
    └──────────────────┘              └──────────────────┘
                                              ↓ No         ↓ Yes
                                      ┌──────────────┐  ┌──────────────┐
                                      │ Fallback     │  │ Check:       │
                                      │ to JS        │  │ Double-quote?│
                                      └──────────────┘  └──────────────┘
                                                              ↓ No      ↓ Yes
                                                        ┌──────────┐  ┌──────────┐
                                                        │ Fallback │  │ Use WASM │
                                                        │ to JS    │  └──────────┘
                                                        └──────────┘
    

    Fallback scenarios:

    1. WASM initialization failed or module could not be loaded
    2. Non-UTF-8 encoding specified
    3. Single-quote quotation character specified
    4. WASM not supported in runtime (rare)

    WASM runs in a sandboxed environment:

    Isolation:

    • No access to file system
    • No access to network
    • No access to system calls
    • Cannot escape sandbox

    Memory Safety:

    • No buffer overflows (Rust guarantees)
    • No null pointer dereferences
    • Bounds checking on all memory access

    WASM respects the same resource limits as JavaScript:

    // maxBufferSize applies to both JS and WASM
    const lexer = new FlexibleStringCSVLexer({ maxBufferSize: 10 * 1024 * 1024 }); // Example

    Why:

    • Prevents memory exhaustion
    • Protection against CSV bombs
    • Consistent security model

    WASM features in this library depend on your runtime’s native WebAssembly support. Verify your environment before relying on WASM acceleration.

    If your runtime doesn’t support WebAssembly or you choose not to use it, the JavaScript parser remains available as a fallback.
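One way to verify the environment is a small feature check run before enabling WASM. This is a minimal sketch: `hasWebAssemblySupport` and `WebAssemblyLike` are hypothetical names, and the check only confirms that the runtime exposes a working `WebAssembly.validate`, not that this library's module will load.

```typescript
// Minimal structural type so the check compiles without DOM typings.
interface WebAssemblyLike {
  validate(bytes: Uint8Array): boolean;
}

// Returns true when the runtime exposes a usable WebAssembly global.
function hasWebAssemblySupport(): boolean {
  const wasm = (globalThis as { WebAssembly?: WebAssemblyLike }).WebAssembly;
  if (typeof wasm !== "object" || typeof wasm.validate !== "function") {
    return false;
  }
  // The smallest valid module: the "\0asm" magic number plus version 1.
  const emptyModule = new Uint8Array([0x00, 0x61, 0x73, 0x6d, 0x01, 0x00, 0x00, 0x00]);
  return wasm.validate(emptyModule);
}

// Usage: derive the engine option from the check.
const supported = hasWebAssemblySupport();
const engine = { wasm: supported };
console.log(engine);
```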

    The WASM binary is bundled with the npm package:

    web-csv-toolbox/
    ├── dist/
    │   ├── main.web.js / main.node.js            # Main entry points
    │   ├── slim.web.js / slim.node.js            # Slim entry points
    │   ├── csv.wasm                               # WASM binary (exported as web-csv-toolbox/csv.wasm)
    │   ├── _virtual/                              # Build-time virtual modules for inlined WASM (main entry)
    │   └── wasm/
    │       └── loaders/                           # loadWASM / loadWASMSync loaders
    

    Bundler support:

    • Webpack: Automatically handles WASM
    • Vite: Built-in WASM support
    • Rollup: Requires @rollup/plugin-wasm


    web-csv-toolbox's WebAssembly implementation provides:

    1. Compiled Execution: Uses WASM compiled from Rust code
    2. Portability: Runs on all modern browsers and runtimes
    3. Safety: Memory-safe, sandboxed execution
    4. Flexibility: Optional, automatic fallback to JavaScript
    5. Integration: Works with Worker Threads for non-blocking parsing

    Trade-offs:

    • UTF-8 only (no Shift-JIS, EUC-JP, etc.)
    • Double-quote only (no single-quote)
    • Processes entire string at once (not incremental)
    • Module loading adds initial overhead

    When to use WASM:

    • Evaluate performance for your specific use case
    • Consider WASM when:
      • Working with UTF-8 CSV files
      • Using standard double-quote quotation
      • Processing complete CSV strings
    • Benchmark your actual data to make informed decisions

    Performance: See CodSpeed benchmarks for actual measured performance across different scenarios.