ENH Add grouped splitters to be used in TargetEncoder cross fitting #32843
Conversation
# The cv splitter is voluntarily restricted to a pre-specified subset to enforce
# non overlapping validation folds, otherwise the fit_transform output will not
# be well-specified.
Current understanding of non-overlap: each index should appear in exactly one validation set across all the splits. This also implies that every index of X has to appear in some validation fold.
I am attempting to rewrite this comment.
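As a small illustration of that non-overlap property (a minimal sketch using plain KFold, not the PR's internal splitter selection): with a partitioning splitter, every sample index lands in exactly one validation fold, so the folds together cover all of X.

import numpy as np
from sklearn.model_selection import KFold

n_samples = 10
X = np.zeros((n_samples, 1))

test_folds = [test for _, test in KFold(n_splits=5).split(X)]
all_test_indices = np.concatenate(test_folds)

# Each index appears exactly once across all validation folds,
# and together the folds cover every row of X.
assert sorted(all_test_indices.tolist()) == list(range(n_samples))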
groups input param to TargetEncoder:

groups : array-like of shape (n_samples,), default=None
    Always ignored, exists for API compatibility.

    .. versionadded:: 1.9

**fit_params : dict
    Always ignored, exists for API compatibility.

    .. versionadded:: 1.9
I am not sure if we actually need this here.
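For context on the "Always ignored, exists for API compatibility" pattern quoted above, here is a sketch against the existing scikit-learn splitter API (independent of this PR): split() methods accept groups even when they ignore it, which is what keeps splitters interchangeable.

import numpy as np
from sklearn.model_selection import KFold, GroupKFold

X = np.zeros((8, 1))
y = np.zeros(8)
groups = np.array([0, 0, 1, 1, 2, 2, 3, 3])

# KFold accepts `groups` but ignores it: the splits are unchanged.
kf = KFold(n_splits=2)
for (_, test_a), (_, test_b) in zip(kf.split(X, y), kf.split(X, y, groups=groups)):
    assert np.array_equal(test_a, test_b)

# GroupKFold actually consumes `groups`: no group is split across
# train and test within the same fold.
for train, test in GroupKFold(n_splits=2).split(X, y, groups=groups):
    assert set(groups[train]).isdisjoint(groups[test])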
cv_ : str
    Class name of the cv splitter used in `fit_transform` for :term:`cross fitting`.
Since the cv strategy is determined internally and (now that cv can take more forms and groups can be passed into fit_transform) has become quite complex, I think it is useful to expose a cv_ attribute so users can inspect what happened internally.
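A minimal sketch of that inspection, assuming the `cv_` attribute lands as quoted above (it is introduced by this PR and is not part of released scikit-learn):

from sklearn.datasets import make_regression
from sklearn.preprocessing import TargetEncoder

X, y = make_regression(n_samples=100, n_features=1, random_state=0)
X_cat = (X > 0).astype(int)  # a single binary "categorical" feature

enc = TargetEncoder(target_type="continuous")
enc.fit_transform(X_cat, y)
# `cv_` is the attribute proposed in this PR: the class name of the
# splitter chosen internally, e.g. "KFold" for a continuous target.
print(enc.cv_)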
raise ValueError(
    "Expected `cv` as an integer, a cross-validation object "
    "(from sklearn.model_selection), or an iterable yielding "
    f"(train, test) splits as arrays of indices. Got {self.cv}."
)
I tend to remove that, though I am wondering: when I write a test for this, I can see that validate_parameter_constraints already catches it.
So, it seems that

import pytest
from sklearn.datasets import make_regression
from sklearn.preprocessing import TargetEncoder

X, y = make_regression(n_samples=100, n_features=3, random_state=0)
encoder = TargetEncoder(target_type="continuous", cv="lolo")
msg = "Expected `cv` as an integer, a cross-validation object"
with pytest.raises(ValueError, match=msg):
    encoder.fit_transform(X, y)

is not necessary.
But why then does codecov not complain that lines 414-418 never get executed?
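A sketch of that observation, under the assumption that `cv` is covered by `TargetEncoder._parameter_constraints` in this PR: the declarative check runs at the start of fit_transform, so the manual raise at lines 414-418 is never reached. `InvalidParameterError` subclasses `ValueError`, which is why the invalid string never makes it to the handwritten check.

from sklearn.datasets import make_regression
from sklearn.preprocessing import TargetEncoder
from sklearn.utils._param_validation import InvalidParameterError

X, y = make_regression(n_samples=100, n_features=3, random_state=0)

try:
    TargetEncoder(target_type="continuous", cv="lolo").fit_transform(X, y)
except InvalidParameterError as exc:
    # The declarative parameter validation fires first, with its own message,
    # so the manual ValueError (and a test matching its message) is redundant.
    print(type(exc).__name__, "-", exc)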
| "Expected `cv` as an integer, a cross-validation object " | ||
| "(from sklearn.model_selection), or an iterable yielding (train, test) " | ||
| f"splits as arrays of indices. Got {cv}." |
Improving the error message as a side quest, since strings are also iterables.
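A quick illustration of why the wording matters (independent of the PR, using only the standard library): a plain string already satisfies a naive iterability check, so it has to be rejected explicitly.

from collections.abc import Iterable

# A string is an iterable (of characters), so "is it an iterable?" alone
# cannot distinguish a typo like "lolo" from a legitimate iterable of splits.
assert isinstance("lolo", Iterable)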
class ConsumingSplitterInheritingFromGroupKFold(ConsumingSplitter, GroupKFold):
    """Helper class that can be used to test TargetEncoder, that only takes specific
    splitters."""
As an alternative, ConsumingSplitter itself could inherit from GroupKFold?
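A rough sketch of the pattern behind that alternative (ConsumingSplitter and GroupKFold are the real names from scikit-learn's test helpers and model_selection; the mixin and combined class below are purely illustrative): putting the metadata-recording behaviour in a mixin and combining it with GroupKFold via multiple inheritance keeps the grouped splitting logic while recording what split() received.

import numpy as np
from sklearn.model_selection import GroupKFold

class RecordingSplitMixin:
    """Record the metadata passed to split(), then defer to the parent splitter."""

    def split(self, X, y=None, groups=None):
        self.recorded_groups_ = groups
        yield from super().split(X, y, groups=groups)

class RecordingGroupKFold(RecordingSplitMixin, GroupKFold):
    """GroupKFold that remembers the `groups` it was given."""

X = np.zeros((6, 1))
groups = np.array([0, 0, 1, 1, 2, 2])
splitter = RecordingGroupKFold(n_splits=3)
list(splitter.split(X, groups=groups))
assert splitter.recorded_groups_ is groups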
Reference Issues/PRs
closes #32076
supersedes #32239
What does this implement/fix? Explain your changes.
- `groups` param to `TargetEncoder.fit_transform` that then internally picks a suitable non-overlapping cross-validation strategy (in its cross-fitting).
- `cv` init param (as discussed here: "TargetEncoder should take `groups` as an argument", #32076 (comment)).
- `TargetEncoder` into a metadata router that routes `groups` to the internal splitter.
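A hedged sketch of the intended usage described in the first bullet (the `groups` argument to `TargetEncoder.fit_transform` is what this PR proposes; it does not exist in released scikit-learn):

import numpy as np
from sklearn.preprocessing import TargetEncoder

rng = np.random.default_rng(0)
X = rng.integers(0, 3, size=(12, 1))   # one categorical feature with 3 levels
y = rng.normal(size=12)
groups = np.repeat([0, 1, 2, 3], 3)    # e.g. one group per subject

enc = TargetEncoder(target_type="continuous")
# `groups` is routed to the internally chosen, non-overlapping splitter
# (this keyword is introduced by this PR).
X_enc = enc.fit_transform(X, y, groups=groups)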
AI usage disclosure
I used AI assistance for:
Any other comments?
I also fixed some minor documentation issues across the cross-validation docs; sorry, I couldn't hold back.