Skip to content

Conversation

@samruddhibaviskar11
Copy link

Summary

Enhancement: Add groups parameter to TargetEncoder
This PR extends TargetEncoder with an optional groups argument:
If groups is provided → use GroupKFold.
Otherwise → fall back to existing behavior (KFold for continuous targets, StratifiedKFold for classification).
Also adds _validate_cv_no_overlap utility to raise if custom CV splitters produce overlapping validation sets.

Added tests for:

Using groups with binary, continuous, and multiclass targets.
Behavior difference with/without groups.
Error when using a custom CV splitter that overlaps.

Motivation

Currently, TargetEncoder always uses KFold or StratifiedKFold for internal cross-fitting.
This can cause data leakage when samples are not independent (e.g., repeated measures, clustered patients, or time-series grouped by entity).
By adding a groups parameter, users can ensure that samples from the same group always appear in the same fold, preventing leakage and producing more reliable encodings.

Remaining work

Update the documentation (user guide + TargetEncoder docstring).

Related Issue
Closes: #32076

Thanks to @MatthiasLoefflerQC for opening the issue and @adrinjalali & @thomasjpfan for guiding the implementation.

@github-actions
Copy link

github-actions bot commented Sep 21, 2025

❌ Linting issues

This PR is introducing linting issues. Here's a summary of the issues. Note that you can avoid having linting issues by enabling pre-commit hooks. Instructions to enable them can be found here.

You can see the details of the linting issues under the lint job here


ruff check

ruff detected issues. Please run ruff check --fix --output-format=full locally, fix the remaining issues, and push the changes. Here you can see the detected issues. Note that the installed ruff version is ruff=0.11.7.

Details

sklearn/preprocessing/_target_encoder.py:329:12: F821 Undefined name `groups`
    |
327 |         # overlapping validation folds, otherwise the fit_transform output will
328 |         # not be well-specified.
329 |         if groups is not None:
    |            ^^^^^^ F821
330 |             cv = GroupKFold(n_splits=self.cv)
331 |         elif self.target_type_ == "continuous":
    |

sklearn/preprocessing/_target_encoder.py:339:47: F821 Undefined name `groups`
    |
337 |         # Only validate non-approved splitters to avoid overhead
338 |         if not isinstance(cv, APPROVED_CV_SPLITTERS):
339 |             _validate_cv_no_overlap(cv, X, y, groups)
    |                                               ^^^^^^ F821
340 |
341 |         # If 'multiclass' multiply axis=1 by num classes else keep shape the same
    |

sklearn/preprocessing/_target_encoder.py:350:51: F821 Undefined name `groups`
    |
348 |             X_out = np.empty_like(X_ordinal, dtype=np.float64)
349 |
350 |         for train_idx, test_idx in cv.split(X, y, groups):
    |                                                   ^^^^^^ F821
351 |             X_train, y_train = X_ordinal[train_idx, :], y_encoded[train_idx]
352 |             y_train_mean = np.mean(y_train, axis=0)
    |

Found 3 errors.

Generated for commit: bb0918d. Link to the linter CI: here

Copy link
Member

@adrinjalali adrinjalali left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needs to be using metadata routing to route groups, and isn't at the moment.

We also decided if the user wants to user groups, they need to explicitly pass a grouped cv object, and we won't do that for the user.

Note that handling metadata routing is not necessarily easy and it's okay if you don't want to continue the work.

val_set = set(val_idx)

# Check for overlap with previous validation sets
if all_val_indices.intersection(val_set):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can avoid using iterative intersection operations by simply creating a union, and counting the number of indices returned at each iteration. In the end the set's size needs to be exactly n_samples, and the sum of counts should be the same number.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hii, I’ve updated the overlap validation logic as suggested, replacing the iterative intersection check with a union-based approach.

@adrinjalali adrinjalali added the Closing candidate indicating we might close this one soon / at a later review label Sep 22, 2025
@samruddhibaviskar11
Copy link
Author

This needs to be using metadata routing to route groups, and isn't at the moment.

We also decided if the user wants to user groups, they need to explicitly pass a grouped cv object, and we won't do that for the user.

Note that handling metadata routing is not necessarily easy and it's okay if you don't want to continue the work.

I understand that the preferred direction is to handle groups through metadata routing and require users to explicitly pass a grouped CV object, rather than switching automatically inside TargetEncoder.

I’d be very interested in contributing to this area, but I’ll need to study the metadata routing system more deeply before I can contribute meaningfully here. In the meantime, others are of course welcome to continue building on this PR.

Thanks again to @adrinjalali, and @thomasjpfan for the guidance!

Copy link
Member

@adrinjalali adrinjalali left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@samruddhibaviskar11 if you like to get a handle on metadata routing, once you go through the documentation and checking some of our historical PRs in this area, let us know and we're happy to help you out.

@samruddhibaviskar11
Copy link
Author

Thanks, @adrinjalali! I’ll go through the metadata routing documentation and past PRs to get a better handle on it. I’ll reach out if I need any guidance.

@samruddhibaviskar11 samruddhibaviskar11 force-pushed the add-groups-to-target-encoder branch from bb0918d to 606b0e1 Compare October 26, 2025 13:45
@adrinjalali
Copy link
Member

Thanks for all the work here @samruddhibaviskar11 but from what it looks like, it seems it's best if you tackle easier issues before getting to this one. So closing the PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Closing candidate indicating we might close this one soon / at a later review module:preprocessing

Projects

None yet

Development

Successfully merging this pull request may close these issues.

TargetEncoder should take groups as an argument

2 participants