Skip to content

Conversation

@isha822
Copy link

@isha822 isha822 commented Dec 7, 2025

The nan_euclidean_distances function produced slightly asymmetric matrices (on the order of 1e-15) when computing distances between X and itself, causing issues with downstream tools like scipy.spatial.distance.squareform.
The Fix:
When Y is None or Y is X, the code now explicitly symmetrizes the result matrix (D + D.T) / 2 and sets the diagonal to 0, ensuring perfect symmetry.
Closes #32848

@ogrisel
Copy link
Member

ogrisel commented Dec 8, 2025

This PR has some broken tests.

Furthermore, I suspect that a similar bug happens when Y is not None. I think I would prefer to identify the cause of the bad NaN handling in the masking logic instead of trying to do a postprocessing fix for the special case of Y is not None.

@ogrisel
Copy link
Member

ogrisel commented Dec 8, 2025

Also, please always include a non-regression test in bugfix PRs.

@github-actions github-actions bot added the CI:Linter failure The linter CI is failing on this PR label Dec 8, 2025
@github-actions github-actions bot removed the CI:Linter failure The linter CI is failing on this PR label Dec 9, 2025
@isha822
Copy link
Author

isha822 commented Dec 9, 2025

@ogrisel Thanks for the feedback!
I have updated the PR with the following changes:
Linting: Fixed the line length issue in pairwise.py (split the long comment).
Tests: Added a non-regression test test_nan_euclidean_distances_is_symmetric to sklearn/metrics/tests/test_pairwise.py as requested.
All code tests are passing. I noticed the Check Changelog failed—could you let me know if I need to add an entry to doc/whats_new/, or if this check can be bypassed for this fix?

@abhiyushS
Copy link

abhiyushS commented Dec 9, 2025

Hi @isha822! 👋

I also worked on this issue independently today (started before seeing your PR update) and wanted to share my findings in case they’re helpful.

What I observed while testing:

Using a random matrix with around 10% NaNs and calling nan_euclidean_distances(X) with Y=None, the distance matrix still has a few asymmetric pairs with differences on the order of ~4×10⁻¹⁶. These are tiny, but they make a strict symmetry check like (dist == dist.. T).all() fail.

To address this, my approach enforces exact symmetry after the sqrt step:

if X is Y:
distances = (distances + distances.T) / 2

Preserve NaN for fully missing rows, zero out others

diag_indices = np.arange(len(distances))
diag_vals = distances[diag_indices, diag_indices]
non_nan_mask = ~np.isnan(diag_vals)
distances[diag_indices[non_nan_mask], diag_indices[non_nan_mask]] = 0.0

This keeps NaN on the diagonal only for rows that are completely missing, and sets other self‑distances to 0.

Tests I used:

  • A non‑regression test with ~10% NaN entries
  • Strict symmetry check with assert_array_equal(dist, dist.T)
  • Explicit check that the diagonal is zero for non‑all‑NaN rows

All existing nan_euclidean_distances tests pass with this change.

Question for maintainers (@ogrisel): Is np.allclose-based symmetry (as in your new test) sufficient here, or would you prefer enforcing bitwise symmetry as above? I’m happy to adjust my approach or close my PR if your solution is preferred.

For reference, I’ve opened PR #32874 with this alternative implementation and test so you can compare both approaches side by side.

@isha822
Copy link
Author

isha822 commented Dec 10, 2025

Hi @abhiyushS, thanks for the detailed investigation!
@ogrisel — As noted, my current implementation satisfies the standard np.allclose check (which is consistent with how floating-point operations are usually handled in the library). However, if you prefer enforcing exact bitwise symmetry (using the (dist + dist.T) / 2 method mentioned above) to cover those edge cases, I am happy to update this PR to include that immediately.
I'll wait for your guidance on which approach you prefer!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

nan_euclidean_distances producing distance matrix not symmetrical due to floating point precision

3 participants