feat: Streaming sorted-merge diff for PostgreSQL, SQLite sha3 hash, MySQL CHECKSUM pre-scan by jasdeepkhalsa · Pull Request #158 · DBDiff/DBDiff

jasdeepkhalsa · 2026-03-26T00:19:12Z

Streaming Sorted-Merge Diff Algorithm

Summary

Introduces StreamingMergeDiff — a two-phase, driver-agnostic diff algorithm that dramatically improves performance for PostgreSQL (same-server), cross-server diffs (all drivers), and SQLite changed-row detection.

Problem

Driver	Previous approach	Issue
PostgreSQL (same-server)	`fetchIndexed()` — loads both full tables into PHP memory	Exhausts memory on tables >500K rows
SQLite	Field-by-field `IS NOT` comparison in `WHERE` clause	N comparisons per row (one per column); slow for wide tables
MySQL	SHA2 hash JOIN (already efficient)	No pre-scan to skip identical tables
Cross-server (all drivers)	`ArrayDiff` batched 1,000-row buckets	10,000 round-trips for a 10M-row table; no hash short-circuit

Solution

StreamingMergeDiff (new class)

A reusable two-phase algorithm that works across all drivers:

Phase 1 — Hash stream: Queries both databases for PK + row_hash in sorted PK order. A merge-sort walk identifies inserts (PK only in source), deletes (PK only in target), and potential updates (same PK, different hash) — without transferring full row data.

Phase 2 — Targeted fetch: Only the PKs identified in Phase 1 are fetched in full (WHERE pk IN (...), batched in groups of 500). Actual column values are compared to produce InsertData, DeleteData, and UpdateData objects.

Driver-specific optimisations

Driver	Hash expression	Additional optimisation
PostgreSQL	`md5(COALESCE(col::text, '') \|\| E'\\x1f' \|\| ...)`	Replaces full-memory PHP load with streaming merge
SQLite	`sha3(COALESCE(CAST(col AS TEXT), '') \|\| X'1f' \|\| ..., 256)`	Runtime probe with fallback to field-by-field `IS NOT`
MySQL	`SHA2(CONCAT(CAST(col AS CHAR CHARACTER SET utf8), ...), 256)`	`CHECKSUM TABLE` pre-scan skips identical tables

Column values are separated by \x1f (Unit Separator, ASCII 31) — a valid UTF-8 character that won't appear in normal data and avoids PostgreSQL's rejection of null bytes in text strings.

Performance impact

For a 10M-row table with 1,000 changes:

Phase 1: transfers ~10M × (PK + 32-byte hash) — a fraction of full row data
Phase 2: transfers only ~1,000 full rows

For identical tables (MySQL): CHECKSUM TABLE returns in milliseconds, skipping the full SHA2 scan entirely.

Changes

New files

src/DB/Data/StreamingMergeDiff.php — Two-phase streaming sorted-merge algorithm

Modified files

src/DB/Data/LocalTableData.php — PostgreSQL uses StreamingMergeDiff; SQLite sha3() probe + fallback; MySQL CHECKSUM TABLE pre-scan
src/DB/Data/DistTableData.php — Delegates to StreamingMergeDiff (replaces ArrayDiff batching)

Test changes

tests/DataDiff/StreamingMergeDiffTest.php — 15 unit tests using SQLite in-memory databases (no external services needed)
Updated SQLite test fixtures with actual data differences (previously had identical data)
Re-recorded all 7 SQLite expected-output baselines to match the updated fixtures
Registered tests/DataDiff/ in both phpunit.xml and phpunit.v9.xml

…ySQL CHECKSUM pre-scan Introduce StreamingMergeDiff — a two-phase, driver-agnostic diff algorithm that replaces the previous full-memory PHP comparison for PostgreSQL and the batched ArrayDiff approach for cross-server diffs. Phase 1 streams PK + row-hash from both databases in sorted order (merge-sort walk) to identify inserts, deletes, and potential updates without transferring full row data. Phase 2 fetches only the differing rows by PK for SQL generation. Driver-specific optimisations: - PostgreSQL (same-server): md5(COALESCE(col::text, '') || E'\\x00' || ...) replaces the O(n) full-memory fetchIndexed() approach that exhausted memory on tables >500K rows. - SQLite: runtime sha3() probe with cached result; when available, replaces the field-by-field IS NOT comparison with a single hash comparison per row. Falls back gracefully to the original IS NOT approach. - MySQL: CHECKSUM TABLE pre-scan skips identical tables entirely before running the SHA2 hash JOIN, wrapped in try/catch for engine compatibility. - Cross-server (all drivers): StreamingMergeDiff replaces the 1000-row batched ArrayDiff, cutting data transfer to PK+hash in Phase 1. Test changes: - Add StreamingMergeDiffTest (15 test methods) using SQLite in-memory DBs: inserts, deletes, updates, composite PKs, NULLs, fieldsToIgnore, large datasets (2000 rows), and helper method coverage (comparePK, hash exprs). - Update SQLite test fixtures to include actual data differences (previously had identical data as a workaround). - Remove 7 stale SQLite expected-output baselines that need re-recording via DBDIFF_RECORD_MODE=true after fixture data changes. - Register tests/DataDiff/ in phpunit.xml and phpunit.v9.xml Unit suite."

PostgreSQL rejects \\x00 (null byte) in UTF-8 text strings with: SQLSTATE[22021]: invalid byte sequence for encoding \"UTF8\": 0x00 Replace the null byte column separator with \\x1f (Unit Separator, ASCII 31) which is valid UTF-8 and equally unlikely to appear in real data. Affected drivers: - PostgreSQL: E'\\x00' → E'\\x1f' in md5() hash expression - SQLite: X'00' → X'1f' in hex()/sha3() hash expressions Also re-record the 7 SQLite expected-output baselines that were previously deleted (fixture data was updated to include real data differences)."

LocalTableData::getDiff (4 returns → 3): Extract MySQL CHECKSUM + diff logic into getDiffMySQL(), leaving getDiff() with only 3 exit points (sqlite, pgsql, mysql). StreamingMergeDiff::getDiff (cognitive complexity 49 → low): Extract collectInserts(), collectDeletes(), collectUpdates(), and buildSingleUpdate() so that getDiff() is a thin orchestrator with no nested loops. Each helper has a single clear responsibility. StreamingMergeDiff::buildHashExpression (4 returns → 2): Replace switch/case with a match expression that returns a single computed value, reducing return statements from 4 to 2. StreamingMergeDiff: dedicated exception (RuntimeException → DataException): Replace generic \RuntimeException with DBDiff\Exceptions\DataException in the unsupported-driver default branch."

CONCAT() returns NULL if any argument is NULL. Without COALESCE, any row whose hash expression contained a NULL column produced a NULL hash. In Phase 1 of the streaming merge, PHP compares hashes with !==; since null !== null evaluates to false, a changed row where both source and target shared the same NULL column(s) but differed in other columns would silently pass through as identical — the update would be missed. Also add CHAR(31) as the column separator (consistent with the pgsql and sqlite drivers, which already use \x1f) to prevent cross-column hash collisions where different value distributions produce the same concatenated string. Fix: wrap each column in COALESCE(CAST(col AS CHAR CHARACTER SET utf8), '') and separate columns with CHAR(31) in the CONCAT call. Updated testBuildHashExpressionMySQL to assert COALESCE and CHAR(31) are both present in the generated expression."

sonarqubecloud · 2026-03-26T01:34:46Z

Quality Gate passed

Issues
0 New issues
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarQube Cloud

jasdeepkhalsa added 5 commits March 25, 2026 23:19

Merge branch 'master' into feature/diff-algos-postgres-sqlite

8163325

jasdeepkhalsa merged commit b7d0619 into master Mar 26, 2026
62 checks passed

This was referenced Apr 2, 2026

Feature: parallel / threaded table processing #165

Open

Add a optional checksum comparison and threaded processes #134

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: Streaming sorted-merge diff for PostgreSQL, SQLite sha3 hash, MySQL CHECKSUM pre-scan#158

feat: Streaming sorted-merge diff for PostgreSQL, SQLite sha3 hash, MySQL CHECKSUM pre-scan#158
jasdeepkhalsa merged 5 commits intomasterfrom
feature/diff-algos-postgres-sqlite

jasdeepkhalsa commented Mar 26, 2026

Uh oh!

sonarqubecloud bot commented Mar 26, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

jasdeepkhalsa commented Mar 26, 2026

Streaming Sorted-Merge Diff Algorithm

Summary

Problem

Solution

StreamingMergeDiff (new class)

Driver-specific optimisations

Performance impact

Changes

New files

Modified files

Test changes

Uh oh!

sonarqubecloud bot commented Mar 26, 2026

Quality Gate passed

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant