feat: Streaming sorted-merge diff for PostgreSQL, SQLite sha3 hash, MySQL CHECKSUM pre-scan#158
Merged
jasdeepkhalsa merged 5 commits intomasterfrom Mar 26, 2026
Merged
Conversation
…ySQL CHECKSUM pre-scan Introduce StreamingMergeDiff — a two-phase, driver-agnostic diff algorithm that replaces the previous full-memory PHP comparison for PostgreSQL and the batched ArrayDiff approach for cross-server diffs. Phase 1 streams PK + row-hash from both databases in sorted order (merge-sort walk) to identify inserts, deletes, and potential updates without transferring full row data. Phase 2 fetches only the differing rows by PK for SQL generation. Driver-specific optimisations: - PostgreSQL (same-server): md5(COALESCE(col::text, '') || E'\\x00' || ...) replaces the O(n) full-memory fetchIndexed() approach that exhausted memory on tables >500K rows. - SQLite: runtime sha3() probe with cached result; when available, replaces the field-by-field IS NOT comparison with a single hash comparison per row. Falls back gracefully to the original IS NOT approach. - MySQL: CHECKSUM TABLE pre-scan skips identical tables entirely before running the SHA2 hash JOIN, wrapped in try/catch for engine compatibility. - Cross-server (all drivers): StreamingMergeDiff replaces the 1000-row batched ArrayDiff, cutting data transfer to PK+hash in Phase 1. Test changes: - Add StreamingMergeDiffTest (15 test methods) using SQLite in-memory DBs: inserts, deletes, updates, composite PKs, NULLs, fieldsToIgnore, large datasets (2000 rows), and helper method coverage (comparePK, hash exprs). - Update SQLite test fixtures to include actual data differences (previously had identical data as a workaround). - Remove 7 stale SQLite expected-output baselines that need re-recording via DBDIFF_RECORD_MODE=true after fixture data changes. - Register tests/DataDiff/ in phpunit.xml and phpunit.v9.xml Unit suite."
PostgreSQL rejects \\x00 (null byte) in UTF-8 text strings with: SQLSTATE[22021]: invalid byte sequence for encoding \"UTF8\": 0x00 Replace the null byte column separator with \\x1f (Unit Separator, ASCII 31) which is valid UTF-8 and equally unlikely to appear in real data. Affected drivers: - PostgreSQL: E'\\x00' → E'\\x1f' in md5() hash expression - SQLite: X'00' → X'1f' in hex()/sha3() hash expressions Also re-record the 7 SQLite expected-output baselines that were previously deleted (fixture data was updated to include real data differences)."
LocalTableData::getDiff (4 returns → 3): Extract MySQL CHECKSUM + diff logic into getDiffMySQL(), leaving getDiff() with only 3 exit points (sqlite, pgsql, mysql). StreamingMergeDiff::getDiff (cognitive complexity 49 → low): Extract collectInserts(), collectDeletes(), collectUpdates(), and buildSingleUpdate() so that getDiff() is a thin orchestrator with no nested loops. Each helper has a single clear responsibility. StreamingMergeDiff::buildHashExpression (4 returns → 2): Replace switch/case with a match expression that returns a single computed value, reducing return statements from 4 to 2. StreamingMergeDiff: dedicated exception (RuntimeException → DataException): Replace generic \RuntimeException with DBDiff\Exceptions\DataException in the unsupported-driver default branch."
CONCAT() returns NULL if any argument is NULL. Without COALESCE, any row whose hash expression contained a NULL column produced a NULL hash. In Phase 1 of the streaming merge, PHP compares hashes with !==; since null !== null evaluates to false, a changed row where both source and target shared the same NULL column(s) but differed in other columns would silently pass through as identical — the update would be missed. Also add CHAR(31) as the column separator (consistent with the pgsql and sqlite drivers, which already use \x1f) to prevent cross-column hash collisions where different value distributions produce the same concatenated string. Fix: wrap each column in COALESCE(CAST(col AS CHAR CHARACTER SET utf8), '') and separate columns with CHAR(31) in the CONCAT call. Updated testBuildHashExpressionMySQL to assert COALESCE and CHAR(31) are both present in the generated expression."
|
This was referenced Apr 2, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.



Streaming Sorted-Merge Diff Algorithm
Summary
Introduces
StreamingMergeDiff— a two-phase, driver-agnostic diff algorithm that dramatically improves performance for PostgreSQL (same-server), cross-server diffs (all drivers), and SQLite changed-row detection.Problem
fetchIndexed()— loads both full tables into PHP memoryIS NOTcomparison inWHEREclauseArrayDiffbatched 1,000-row bucketsSolution
StreamingMergeDiff (new class)
A reusable two-phase algorithm that works across all drivers:
Phase 1 — Hash stream: Queries both databases for
PK + row_hashin sorted PK order. A merge-sort walk identifies inserts (PK only in source), deletes (PK only in target), and potential updates (same PK, different hash) — without transferring full row data.Phase 2 — Targeted fetch: Only the PKs identified in Phase 1 are fetched in full (
WHERE pk IN (...), batched in groups of 500). Actual column values are compared to produceInsertData,DeleteData, andUpdateDataobjects.Driver-specific optimisations
md5(COALESCE(col::text, '') || E'\\x1f' || ...)sha3(COALESCE(CAST(col AS TEXT), '') || X'1f' || ..., 256)IS NOTSHA2(CONCAT(CAST(col AS CHAR CHARACTER SET utf8), ...), 256)CHECKSUM TABLEpre-scan skips identical tablesColumn values are separated by
\x1f(Unit Separator, ASCII 31) — a valid UTF-8 character that won't appear in normal data and avoids PostgreSQL's rejection of null bytes in text strings.Performance impact
For a 10M-row table with 1,000 changes:
For identical tables (MySQL):
CHECKSUM TABLEreturns in milliseconds, skipping the full SHA2 scan entirely.Changes
New files
src/DB/Data/StreamingMergeDiff.php— Two-phase streaming sorted-merge algorithmModified files
src/DB/Data/LocalTableData.php— PostgreSQL uses StreamingMergeDiff; SQLite sha3() probe + fallback; MySQL CHECKSUM TABLE pre-scansrc/DB/Data/DistTableData.php— Delegates to StreamingMergeDiff (replaces ArrayDiff batching)Test changes
tests/DataDiff/StreamingMergeDiffTest.php— 15 unit tests using SQLite in-memory databases (no external services needed)tests/DataDiff/in bothphpunit.xmlandphpunit.v9.xml