Your pipeline appends 50 GB of new data. It still rewrites 5 TB every run.
That is the full rewrite anti-pattern.
It usually looks respectable on the surface: fetch a delta, create a replacement table by unioning history with the delta, drop the old table, rename the new table. The job calls itself incremental because the input was incremental.
The output is not incremental. The output photocopies the table.
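A minimal sketch makes the cost model visible. The sizes below are the hypothetical 5 TB table and 50 GB delta from above; the function names are illustrative, not any engine's API:

```python
# Hypothetical sizes: a 5 TB table receiving a 50 GB delta per run.
TABLE_BYTES = 5 * 1024**4   # ~5 TiB of history
DELTA_BYTES = 50 * 1024**3  # ~50 GiB of new data

def full_rewrite_bytes_written(table_bytes: int, delta_bytes: int) -> int:
    # The rename dance: union history with the delta, write a full
    # replacement table, swap it in. Every run materializes everything.
    return table_bytes + delta_bytes

def partition_replace_bytes_written(touched_partition_bytes: int) -> int:
    # Partition-level replacement: only the partitions the delta
    # touches get rewritten.
    return touched_partition_bytes

full = full_rewrite_bytes_written(TABLE_BYTES, DELTA_BYTES)
# Assume the delta lands in partitions about twice its own size.
incremental = partition_replace_bytes_written(2 * DELTA_BYTES)

amplification = full // incremental  # => 51, roughly 50x write amplification
```

The input shrank to one percent of the table; the write volume did not.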
Why it feels attractive
The rename dance is easy to reason about. You can inspect a complete table before replacing the old one. You avoid thinking about deletes, partition replacement, and idempotency. The code is often simple enough to survive for years.
That is exactly why it is dangerous.
The cost model is hidden inside the write path. A small logical change produces a full physical rewrite. S3 fills with old files. Snapshot history grows. Compute spends most of its time rewriting rows that did not change.
The number to check
Look at write bytes per run versus total table size.
If they are close, your incremental pipeline is probably lying.
The delta size is not the number that matters. The number that matters is what the platform actually writes.
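One way to operationalize the check is a small monitor comparing write bytes per run against total table size. This is a hypothetical sketch; the function names and the 0.5 threshold are illustrative, and the two inputs would come from your engine's write metrics and table metadata:

```python
def rewrite_ratio(write_bytes_per_run: int, table_bytes: int) -> float:
    """Fraction of the table physically rewritten each run."""
    if table_bytes == 0:
        return 0.0
    return write_bytes_per_run / table_bytes

def looks_like_full_rewrite(write_bytes_per_run: int, table_bytes: int,
                            threshold: float = 0.5) -> bool:
    # If each run writes more than half the table, the "incremental"
    # pipeline is probably rewriting everything.
    return rewrite_ratio(write_bytes_per_run, table_bytes) >= threshold

# Sizes in GB: a 50 GB delta on a 5 TB table.
looks_like_full_rewrite(5_050, 5_000)  # full rewrite: flagged
looks_like_full_rewrite(60, 5_000)     # healthy incremental write
```

The second call is what the metric should look like when the delta shrinks: write volume shrinks with it.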
Better patterns
For partitioned data, use partition-level replacement where possible. With Spark and Iceberg, dynamic overwrite can replace only the affected partitions in one atomic commit. For engines that do not support atomic partition replacement cleanly, a delete plus insert can still be better than a full rewrite, but it needs idempotency and failure handling.
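The partition-replacement idea can be sketched engine-agnostically. This is a toy in-memory model, not Iceberg's implementation: the table is a mapping from partition key to rows, and a run swaps only the partitions the delta touches in a single step:

```python
from collections import defaultdict

def dynamic_overwrite(table: dict, delta_rows: list, partition_key) -> dict:
    """Toy model of dynamic partition overwrite.

    Group the delta by partition, then replace exactly those
    partitions while keeping every untouched partition as-is.
    """
    new_partitions = defaultdict(list)
    for row in delta_rows:
        new_partitions[partition_key(row)].append(row)

    committed = dict(table)           # untouched partitions carried over
    committed.update(new_partitions)  # touched partitions replaced
    return committed                  # one atomic swap of the mapping

table = {
    "2024-01-01": [{"day": "2024-01-01", "value": 1}],
    "2024-01-02": [{"day": "2024-01-02", "value": 2}],
}
delta = [{"day": "2024-01-02", "value": 99}]  # late-arriving correction

table = dynamic_overwrite(table, delta, partition_key=lambda r: r["day"])
# Only 2024-01-02 was rewritten; 2024-01-01's files were never touched.
```

In a real engine the "swap" is a metadata commit that points the table at new files for the affected partitions, which is why write volume tracks the delta instead of the table.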
The right design depends on table shape, engine support, and failure semantics. The wrong design is approving a 5 TB rewrite because the code path was easy.
Review checklist
Before approving a rewrite, ask:
- What fraction of the table actually changes?
- Does write volume shrink when the delta shrinks?
- What files become unreachable after each run?
- Can the job fail between delete and insert?
- Is compaction cleaning up a real problem or masking the bad write pattern?
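The failure question is the hard one for delete plus insert. A hedged sketch of what idempotency means there, using an in-memory list and an illustrative `id` key: delete exactly the delta's keys, then insert the delta, so a rerun after a mid-job crash converges to the same end state.

```python
def delete_plus_insert(rows: list, delta: list, key=lambda r: r["id"]) -> list:
    """Idempotent delete+insert keyed on the delta.

    If the job dies between the delete and the insert, rerunning it
    repeats the (now no-op) delete and completes the insert, so the
    final table matches a clean run.
    """
    delta_keys = {key(r) for r in delta}
    survivors = [r for r in rows if key(r) not in delta_keys]
    return survivors + delta

rows = [{"id": 1, "v": "a"}, {"id": 2, "v": "b"}]
delta = [{"id": 2, "v": "b2"}, {"id": 3, "v": "c"}]

once = delete_plus_insert(rows, delta)
twice = delete_plus_insert(once, delta)  # a retry converges, not duplicates
assert once == twice
```

The key property is that the delete predicate is derived from the delta itself, so replaying the pair never double-counts.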
Full rewrites are sometimes justified: backfills, schema corrections, and physical layout changes can all require one.
But if the pipeline rewrites the lake every night because nobody modeled the update, that is not architecture. That is a recurring invoice.