-
-
Notifications
You must be signed in to change notification settings - Fork 26.5k
ENH Adds groups parameter to TargetEncoder with validation for non-overlapping CV splits #32239
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ENH Adds groups parameter to TargetEncoder with validation for non-overlapping CV splits #32239
Conversation
…erlapping CV splits
❌ Linting issuesThis PR is introducing linting issues. Here's a summary of the issues. Note that you can avoid having linting issues by enabling You can see the details of the linting issues under the
|
adrinjalali
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This needs to be using metadata routing to route groups, and isn't at the moment.
We also decided if the user wants to user groups, they need to explicitly pass a grouped cv object, and we won't do that for the user.
Note that handling metadata routing is not necessarily easy and it's okay if you don't want to continue the work.
| val_set = set(val_idx) | ||
|
|
||
| # Check for overlap with previous validation sets | ||
| if all_val_indices.intersection(val_set): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can avoid using iterative intersection operations by simply creating a union, and counting the number of indices returned at each iteration. In the end the set's size needs to be exactly n_samples, and the sum of counts should be the same number.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hii, I’ve updated the overlap validation logic as suggested, replacing the iterative intersection check with a union-based approach.
I understand that the preferred direction is to handle groups through metadata routing and require users to explicitly pass a grouped CV object, rather than switching automatically inside TargetEncoder. I’d be very interested in contributing to this area, but I’ll need to study the metadata routing system more deeply before I can contribute meaningfully here. In the meantime, others are of course welcome to continue building on this PR. Thanks again to @adrinjalali, and @thomasjpfan for the guidance! |
adrinjalali
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@samruddhibaviskar11 if you like to get a handle on metadata routing, once you go through the documentation and checking some of our historical PRs in this area, let us know and we're happy to help you out.
|
Thanks, @adrinjalali! I’ll go through the metadata routing documentation and past PRs to get a better handle on it. I’ll reach out if I need any guidance. |
bb0918d to
606b0e1
Compare
|
Thanks for all the work here @samruddhibaviskar11 but from what it looks like, it seems it's best if you tackle easier issues before getting to this one. So closing the PR. |
Summary
Enhancement: Add groups parameter to TargetEncoder
This PR extends TargetEncoder with an optional groups argument:
If groups is provided → use GroupKFold.
Otherwise → fall back to existing behavior (KFold for continuous targets, StratifiedKFold for classification).
Also adds _validate_cv_no_overlap utility to raise if custom CV splitters produce overlapping validation sets.
Added tests for:
Using groups with binary, continuous, and multiclass targets.
Behavior difference with/without groups.
Error when using a custom CV splitter that overlaps.
Motivation
Currently, TargetEncoder always uses KFold or StratifiedKFold for internal cross-fitting.
This can cause data leakage when samples are not independent (e.g., repeated measures, clustered patients, or time-series grouped by entity).
By adding a groups parameter, users can ensure that samples from the same group always appear in the same fold, preventing leakage and producing more reliable encodings.
Remaining work
Update the documentation (user guide + TargetEncoder docstring).
Related Issue
Closes: #32076
Thanks to @MatthiasLoefflerQC for opening the issue and @adrinjalali & @thomasjpfan for guiding the implementation.