Zero-copy is a term borrowed from operating systems programming that has found its way into data engineering vocabulary. Like many technical terms adopted by marketing, it has accumulated imprecision. This post explains what zero-copy actually means, where it applies in a data pipeline, and why it matters for CRM sync latency.
Origins: Zero-Copy in Operating Systems
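In OS terms, zero-copy refers to techniques that move data from one place to another, for example from a file to a network socket, without the CPU copying it between kernel and user-space buffers along the way. System calls such as sendfile on Linux let the kernel hand the data over directly, often with help from DMA hardware. The benefit is fewer CPU cycles, fewer context switches, and lower memory bandwidth consumption.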
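To make the OS-level idea concrete, here is a minimal Python sketch. It relies on the standard library's socket.sendfile, which delegates to os.sendfile where the platform supports it and quietly falls back to ordinary send() calls where it does not, so whether the transfer is truly zero-copy depends on the operating system. The helper names, the payload, and the use of a local socket pair in place of a real network connection are assumptions made for illustration.

```python
import os
import socket
import tempfile


def send_file_zero_copy(path: str, sock: socket.socket) -> int:
    """Send a whole file over a connected socket and return the bytes sent."""
    with open(path, "rb") as f:
        # Where os.sendfile is available, the kernel moves the file data to
        # the socket itself, so it never passes through a Python bytes object.
        return sock.sendfile(f)


def demo() -> tuple[int, bytes]:
    # Hypothetical payload, kept small so it fits in the socket buffer.
    payload = b"account_id=42,status=active\n" * 100
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(payload)
        path = tmp.name

    # A connected socket pair stands in for a real network connection.
    left, right = socket.socketpair()
    try:
        sent = send_file_zero_copy(path, left)
        left.shutdown(socket.SHUT_WR)  # signal end-of-stream to the reader
        chunks = []
        while True:
            chunk = right.recv(65536)
            if not chunk:
                break
            chunks.append(chunk)
        return sent, b"".join(chunks)
    finally:
        left.close()
        right.close()
        os.unlink(path)


sent, received = demo()
print(f"sent {sent} bytes, received {len(received)} bytes")
```

The conventional alternative is a read() into a user-space buffer followed by a send() from that buffer, which copies every byte across the kernel boundary twice.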
Zero-Copy in Data Pipelines
Applied to data pipelines, zero-copy refers to architectures where data moves from source to destination without being materialized to intermediate storage at any step. The usage is looser than the OS one: bytes are still copied in memory, and the "zero" refers to copies written to intermediate storage. Traditional ETL copies data into a staging area, transforms it there, then loads it to the destination. Zero-copy ETL performs the transform in memory as the data streams through, with no write to disk at intermediate stages.
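The streaming shape described above can be sketched with Python generators. Each stage pulls one event at a time from the stage before it, so nothing is collected into a staging table or file between steps. This is an illustration under stated assumptions, not a description of any particular product: the function names, the event fields, and the list standing in for the CRM are all hypothetical.

```python
import json
from typing import Iterable, Iterator


def read_changes(lines: Iterable[str]) -> Iterator[dict]:
    """Source stage: parse change events one at a time (hypothetical format)."""
    for line in lines:
        yield json.loads(line)


def to_crm_record(events: Iterable[dict]) -> Iterator[dict]:
    """Transform stage: reshape each event in memory as it streams past."""
    for event in events:
        yield {
            "crm_id": f"acct-{event['id']}",
            "email": event["email"].strip().lower(),
        }


def deliver(records: Iterable[dict], sink: list) -> int:
    """Destination stage: the only write in the pipeline."""
    count = 0
    for record in records:
        sink.append(record)  # stands in for a CRM API call
        count += 1
    return count


source = [
    '{"id": 1, "email": " Ada@Example.com "}',
    '{"id": 2, "email": "bob@example.com"}',
]
crm = []  # hypothetical destination
delivered = deliver(to_crm_record(read_changes(source)), crm)
print(delivered, crm)
```

A staged version of the same pipeline would write the output of read_changes to a table, read it back for the transform, write the transformed rows to a second table, and only then load the destination.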
Why It Matters for Latency
Every time data is written to disk in a pipeline, latency accumulates. The write itself takes time, and if the next stage must wait for the write to complete before it can start, that wait is added on top. In a multi-stage pipeline with three intermediate materialization steps, end-to-end latency includes the sum of all three write-wait cycles, in addition to the processing time of each stage.
Zero-copy eliminates these wait cycles. Data enters the pipeline, transforms run in memory as a streaming computation, and the output is written once, to the destination. For CRM sync, this can be the difference between roughly 100ms of end-to-end latency and several seconds.
The Storage Cost Angle
Intermediate materialization also costs money. Every staging table in a traditional ETL architecture is storage that must be provisioned, maintained, backed up, and eventually cleaned up. A zero-copy architecture has no staging tables, so this overhead disappears.
How Salmon Labs Implements Zero-Copy
Salmon Labs' pipeline processes events entirely in memory from CDC capture through transform to CRM delivery. No staging tables, no intermediate files, no storage writes between stages. The result is sub-100ms end-to-end latency at 500 million events per day.
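To put the volume figure in perspective, the short calculation below converts 500 million events per day into an average per-second rate and uses Little's law (items in flight = arrival rate × time in system) to estimate how many events are inside the pipeline at any moment. It assumes events arrive evenly across the day and takes 100ms as the upper bound on latency. Real traffic is bursty, so peak figures would be higher.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def average_events_per_second(events_per_day: float) -> float:
    """Average arrival rate, assuming events are spread evenly over the day."""
    return events_per_day / SECONDS_PER_DAY


def events_in_flight(rate_per_second: float, latency_seconds: float) -> float:
    """Little's law: items in the system = arrival rate * time in system."""
    return rate_per_second * latency_seconds


rate = average_events_per_second(500_000_000)
in_flight = events_in_flight(rate, 0.100)  # 100ms taken as the upper bound
print(f"{rate:,.0f} events/s average, about {in_flight:,.0f} in flight")
# prints "5,787 events/s average, about 579 in flight"
```

A few hundred events in flight is a small working set, which is consistent with processing held entirely in memory.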