Skip to content

feat: Streaming sorted-merge diff for PostgreSQL, SQLite sha3 hash, MySQL CHECKSUM pre-scan#158

Merged
jasdeepkhalsa merged 5 commits intomasterfrom
feature/diff-algos-postgres-sqlite
Mar 26, 2026
Merged

feat: Streaming sorted-merge diff for PostgreSQL, SQLite sha3 hash, MySQL CHECKSUM pre-scan#158
jasdeepkhalsa merged 5 commits intomasterfrom
feature/diff-algos-postgres-sqlite

Conversation

@jasdeepkhalsa
Copy link
Copy Markdown
Member

Streaming Sorted-Merge Diff Algorithm

Summary

Introduces StreamingMergeDiff — a two-phase, driver-agnostic diff algorithm that dramatically improves performance for PostgreSQL (same-server), cross-server diffs (all drivers), and SQLite changed-row detection.

Problem

Driver Previous approach Issue
PostgreSQL (same-server) fetchIndexed() — loads both full tables into PHP memory Exhausts memory on tables >500K rows
SQLite Field-by-field IS NOT comparison in WHERE clause N comparisons per row (one per column); slow for wide tables
MySQL SHA2 hash JOIN (already efficient) No pre-scan to skip identical tables
Cross-server (all drivers) ArrayDiff batched 1,000-row buckets 10,000 round-trips for a 10M-row table; no hash short-circuit

Solution

StreamingMergeDiff (new class)

A reusable two-phase algorithm that works across all drivers:

Phase 1 — Hash stream: Queries both databases for PK + row_hash in sorted PK order. A merge-sort walk identifies inserts (PK only in source), deletes (PK only in target), and potential updates (same PK, different hash) — without transferring full row data.

Phase 2 — Targeted fetch: Only the PKs identified in Phase 1 are fetched in full (WHERE pk IN (...), batched in groups of 500). Actual column values are compared to produce InsertData, DeleteData, and UpdateData objects.

Driver-specific optimisations

Driver Hash expression Additional optimisation
PostgreSQL md5(COALESCE(col::text, '') || E'\\x1f' || ...) Replaces full-memory PHP load with streaming merge
SQLite sha3(COALESCE(CAST(col AS TEXT), '') || X'1f' || ..., 256) Runtime probe with fallback to field-by-field IS NOT
MySQL SHA2(CONCAT(CAST(col AS CHAR CHARACTER SET utf8), ...), 256) CHECKSUM TABLE pre-scan skips identical tables

Column values are separated by \x1f (Unit Separator, ASCII 31) — a valid UTF-8 character that won't appear in normal data and avoids PostgreSQL's rejection of null bytes in text strings.

Performance impact

For a 10M-row table with 1,000 changes:

  • Phase 1: transfers ~10M × (PK + 32-byte hash) — a fraction of full row data
  • Phase 2: transfers only ~1,000 full rows

For identical tables (MySQL): CHECKSUM TABLE returns in milliseconds, skipping the full SHA2 scan entirely.

Changes

New files

  • src/DB/Data/StreamingMergeDiff.php — Two-phase streaming sorted-merge algorithm

Modified files

  • src/DB/Data/LocalTableData.php — PostgreSQL uses StreamingMergeDiff; SQLite sha3() probe + fallback; MySQL CHECKSUM TABLE pre-scan
  • src/DB/Data/DistTableData.php — Delegates to StreamingMergeDiff (replaces ArrayDiff batching)

Test changes

  • tests/DataDiff/StreamingMergeDiffTest.php — 15 unit tests using SQLite in-memory databases (no external services needed)
  • Updated SQLite test fixtures with actual data differences (previously had identical data)
  • Re-recorded all 7 SQLite expected-output baselines to match the updated fixtures
  • Registered tests/DataDiff/ in both phpunit.xml and phpunit.v9.xml

…ySQL CHECKSUM pre-scan

Introduce StreamingMergeDiff — a two-phase, driver-agnostic diff algorithm
that replaces the previous full-memory PHP comparison for PostgreSQL and the
batched ArrayDiff approach for cross-server diffs.

Phase 1 streams PK + row-hash from both databases in sorted order (merge-sort
walk) to identify inserts, deletes, and potential updates without transferring
full row data. Phase 2 fetches only the differing rows by PK for SQL generation.

Driver-specific optimisations:
- PostgreSQL (same-server): md5(COALESCE(col::text, '') || E'\\x00' || ...)
  replaces the O(n) full-memory fetchIndexed() approach that exhausted memory
  on tables >500K rows.
- SQLite: runtime sha3() probe with cached result; when available, replaces
  the field-by-field IS NOT comparison with a single hash comparison per row.
  Falls back gracefully to the original IS NOT approach.
- MySQL: CHECKSUM TABLE pre-scan skips identical tables entirely before
  running the SHA2 hash JOIN, wrapped in try/catch for engine compatibility.
- Cross-server (all drivers): StreamingMergeDiff replaces the 1000-row
  batched ArrayDiff, cutting data transfer to PK+hash in Phase 1.

Test changes:
- Add StreamingMergeDiffTest (15 test methods) using SQLite in-memory DBs:
  inserts, deletes, updates, composite PKs, NULLs, fieldsToIgnore, large
  datasets (2000 rows), and helper method coverage (comparePK, hash exprs).
- Update SQLite test fixtures to include actual data differences (previously
  had identical data as a workaround).
- Remove 7 stale SQLite expected-output baselines that need re-recording
  via DBDIFF_RECORD_MODE=true after fixture data changes.
- Register tests/DataDiff/ in phpunit.xml and phpunit.v9.xml Unit suite."
PostgreSQL rejects \\x00 (null byte) in UTF-8 text strings with:
  SQLSTATE[22021]: invalid byte sequence for encoding \"UTF8\": 0x00

Replace the null byte column separator with \\x1f (Unit Separator, ASCII 31)
which is valid UTF-8 and equally unlikely to appear in real data.

Affected drivers:
- PostgreSQL: E'\\x00' → E'\\x1f' in md5() hash expression
- SQLite: X'00' → X'1f' in hex()/sha3() hash expressions

Also re-record the 7 SQLite expected-output baselines that were previously
deleted (fixture data was updated to include real data differences)."
LocalTableData::getDiff (4 returns → 3):
  Extract MySQL CHECKSUM + diff logic into getDiffMySQL(), leaving
  getDiff() with only 3 exit points (sqlite, pgsql, mysql).

StreamingMergeDiff::getDiff (cognitive complexity 49 → low):
  Extract collectInserts(), collectDeletes(), collectUpdates(), and
  buildSingleUpdate() so that getDiff() is a thin orchestrator with no
  nested loops. Each helper has a single clear responsibility.

StreamingMergeDiff::buildHashExpression (4 returns → 2):
  Replace switch/case with a match expression that returns a single
  computed value, reducing return statements from 4 to 2.

StreamingMergeDiff: dedicated exception (RuntimeException → DataException):
  Replace generic \RuntimeException with DBDiff\Exceptions\DataException
  in the unsupported-driver default branch."
CONCAT() returns NULL if any argument is NULL. Without COALESCE, any
row whose hash expression contained a NULL column produced a NULL hash.
In Phase 1 of the streaming merge, PHP compares hashes with !==; since
null !== null evaluates to false, a changed row where both source and
target shared the same NULL column(s) but differed in other columns
would silently pass through as identical — the update would be missed.

Also add CHAR(31) as the column separator (consistent with the pgsql
and sqlite drivers, which already use \x1f) to prevent cross-column
hash collisions where different value distributions produce the same
concatenated string.

Fix: wrap each column in COALESCE(CAST(col AS CHAR CHARACTER SET utf8), '')
and separate columns with CHAR(31) in the CONCAT call.

Updated testBuildHashExpressionMySQL to assert COALESCE and CHAR(31)
are both present in the generated expression."
@sonarqubecloud
Copy link
Copy Markdown

@jasdeepkhalsa jasdeepkhalsa merged commit b7d0619 into master Mar 26, 2026
62 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant