adapters: fix OOM in ad-hoc query hashing for large data#6016

Merged
swanandx merged 1 commit into main from query-hash-fix on Apr 14, 2026

Conversation

@swanandx
Contributor

Replace arrow-digest's RecordDigestV0 (order-dependent, required sorting the entire DataFrame) with a streaming, order-independent BatchHasher.

The old approach sorted rows by every column before hashing; for large views this led to OOM. The new approach uses DataFusion's create_hashes with two independent ahash seeds, accumulating row hashes via wrapping addition, which is commutative, so no sort is needed.

Memory usage is O(batch_size).
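The combining scheme can be sketched with a stdlib hasher standing in for DataFusion's create_hashes. This is a minimal illustration, not the PR's implementation: hash_rows and the string-slice row type are hypothetical, and the real BatchHasher works over Arrow RecordBatches with two seeded ahash states.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Toy stand-in for the PR's BatchHasher: hash each row independently,
/// then fold the per-row hashes together with wrapping addition.
/// Addition is commutative, so the result does not depend on row order,
/// and rows can be consumed batch-by-batch without buffering the table.
fn hash_rows(rows: &[&str]) -> u64 {
    rows.iter()
        .map(|row| {
            let mut h = DefaultHasher::new();
            row.hash(&mut h);
            h.finish()
        })
        .fold(0u64, |acc, rh| acc.wrapping_add(rh))
}

fn main() {
    let a = hash_rows(&["alpha", "beta", "gamma"]);
    let b = hash_rows(&["gamma", "alpha", "beta"]); // same rows, shuffled
    assert_eq!(a, b); // order-independent
    let c = hash_rows(&["alpha", "beta"]);
    assert_ne!(a, c); // different row set, different hash
    println!("ok");
}
```

Since the accumulator only needs the current batch's row hashes plus a running sum, memory stays O(batch_size) regardless of table size.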

Describe Manual Test Plan

tested by adding tests 💡

Checklist

  • Unit tests added/updated

Breaking Changes?

  • Others (specify)

Describe Incompatible Changes

Yes, it changes the hash generation, invalidating older hashes for the same data. But the hash is only used internally, so this should be fine.


@mythical-fred mythical-fred left a comment

One blocker: please squash the [ci] apply automatic fixes commit. Dirty history is still a hard no for ready PRs.

fn new() -> Self {
    Self {
        rs1: ahash::RandomState::with_seeds('M' as u64, 'U' as u64, 'A' as u64, 'Y' as u64),
        rs2: ahash::RandomState::with_seeds('T' as u64, 'H' as u64, 'A' as u64, 'I' as u64),
    }
}
Contributor


lol

@swanandx swanandx force-pushed the query-hash-fix branch 2 times, most recently from 4aaaacf to e12ebf1 Compare April 13, 2026 08:47
@swanandx swanandx requested a review from mythical-fred April 13, 2026 08:48

@mythical-fred mythical-fred left a comment


LGTM

@swanandx swanandx added this pull request to the merge queue Apr 14, 2026
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Apr 14, 2026
Signed-off-by: Swanand Mulay <73115739+swanandx@users.noreply.github.com>
@swanandx swanandx enabled auto-merge April 14, 2026 06:44
@swanandx swanandx added this pull request to the merge queue Apr 14, 2026
Merged via the queue into main with commit dc44a5f Apr 14, 2026
1 check passed
@swanandx swanandx deleted the query-hash-fix branch April 14, 2026 08:13
@Karakatiza666
Contributor

@swanandx Does this hash rows, or record log (inserts + deletes)?
