Skip to content

Conversation

@StefanieSenger
Copy link
Member

@StefanieSenger StefanieSenger commented Dec 5, 2025

Reference Issues/PRs

closes #32076
supersedes #32239

What does this implement/fix? Explain your changes.

  • adds a groups param to TargetEncoder.fit_transform that then internally picks a suitable non-overlapping cross-validation strategy (in its cross-fitting).
  • adds more input options for cv init param (as discussed here TargetEncoder should take groups as an argument #32076 (comment))
  • exposes cv_ attribute
  • turns TargetEncoder into a metadata router that routes groups to the internal splitter

AI usage disclosure

I used AI assistance for:

  • Code generation (e.g., when writing an implementation or fixing a bug)
  • Test/benchmark generation
  • Documentation (including examples)
  • Research and understanding

Any other comments?

I also fixed some minor documentation issues across cross-validation, sorry I couldn't hold back.

Comment on lines 351 to 353
# The cv splitter is voluntarily restricted to a pre-specified subset to enforce
# non overlapping validation folds, otherwise the fit_transform output will not
# be well-specified.
Copy link
Member Author

@StefanieSenger StefanieSenger Dec 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Current understanding on non-overlap: each index from validation set should appear across all the validation sets exactly once. This also implies that all the indices from X have to appear in a validation fold.

I am doing an attempt to re-write this comment.

@StefanieSenger StefanieSenger changed the title ENH Add groups input param to TargetEncoder ENH Add grouped splitters to be used in TargetEncoder cross fitting Dec 5, 2025
Comment on lines 274 to 283
groups : array-like of shape (n_samples,), default=None
Always ignored, exists for API compatibility.
.. versionadded:: 1.9
**fit_params : dict
Always ignored, exists for API compatibility.
.. versionadded:: 1.9
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure if we actually need this here.

@StefanieSenger StefanieSenger marked this pull request as ready for review December 11, 2025 07:41
Comment on lines +183 to +185
cv_ : str
Class name of the cv splitter used in `fit_transform` for :term:`cross fitting`.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since the cv strategy is internally determined and (now with the expansions on what the cv can be and passing groups into fit_transform) also quite complex, I think it is useful to expose a cv_ attribute so users can inspect what has happened internally.

Comment on lines +414 to +418
raise ValueError(
"Expected `cv` as an integer, a cross-validation object "
"(from sklearn.model_selection), or an iterable yielding "
f"(train, test) splits as arrays of indices. Got {self.cv}."
)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tend to remove that, though I am wondering: When I write as test for this, I can see that validate_parameter_constraints already catches it.

So, it seems that

X, y = make_regression(n_samples=100, n_features=3, random_state=0)
encoder = TargetEncoder(target_type="continuous", cv="lolo")
msg = "Expected `cv` as an integer, a cross-validation object"
with pytest.raises(ValueError, match=msg):
    encoder.fit_transform(X, y)

is not necessary.

But why then does codecov not complain that lines 414-418 never get executed?

Comment on lines +2748 to +2750
"Expected `cv` as an integer, a cross-validation object "
"(from sklearn.model_selection), or an iterable yielding (train, test) "
f"splits as arrays of indices. Got {cv}."
Copy link
Member Author

@StefanieSenger StefanieSenger Dec 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Improving the error message as a side quest, since strings are also iterables.

Comment on lines +483 to +485
class ConsumingSplitterInheritingFromGroupKFold(ConsumingSplitter, GroupKFold):
"""Helper class that can be used to test TargetEncoder, that only takes specific
splitters."""
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As an alternative, ConsumingSplitter itself could inherit from GroupKFold?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

TargetEncoder should take groups as an argument

1 participant