ENH Add grouped splitters to be used in TargetEncoder cross fitting #32843
Conversation
# The cv splitter is voluntarily restricted to a pre-specified subset to enforce
# non overlapping validation folds, otherwise the fit_transform output will not
# be well-specified.
Current understanding of non-overlap: each index should appear in exactly one validation set across all the splits. This also implies that every index of X has to appear in some validation fold.
I am attempting to rewrite this comment.
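As a small illustration of that non-overlap property (a minimal sketch using plain KFold, not the PR's internal splitter selection): with a partitioning splitter, every sample index lands in exactly one validation fold, so the folds together cover all of X.

import numpy as np
from sklearn.model_selection import KFold

n_samples = 10
X = np.zeros((n_samples, 1))

test_folds = [test for _, test in KFold(n_splits=5).split(X)]
all_test_indices = np.concatenate(test_folds)

# Each index appears exactly once across all validation folds,
# and together the folds cover every row of X.
assert sorted(all_test_indices.tolist()) == list(range(n_samples))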
groups input param to TargetEncoder:

groups : array-like of shape (n_samples,), default=None
    Always ignored, exists for API compatibility.

    .. versionadded:: 1.9

**fit_params : dict
    Always ignored, exists for API compatibility.

    .. versionadded:: 1.9
I am not sure if we actually need this here.
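For context on the "Always ignored, exists for API compatibility" pattern quoted above, here is a sketch against the existing scikit-learn splitter API (independent of this PR): split() methods accept groups even when they ignore it, which is what keeps splitters interchangeable.

import numpy as np
from sklearn.model_selection import KFold, GroupKFold

X = np.zeros((8, 1))
y = np.zeros(8)
groups = np.array([0, 0, 1, 1, 2, 2, 3, 3])

# KFold accepts `groups` but ignores it: the splits are unchanged.
kf = KFold(n_splits=2)
for (_, test_a), (_, test_b) in zip(kf.split(X, y), kf.split(X, y, groups=groups)):
    assert np.array_equal(test_a, test_b)

# GroupKFold actually consumes `groups`: no group is split across
# train and test within the same fold.
for train, test in GroupKFold(n_splits=2).split(X, y, groups=groups):
    assert set(groups[train]).isdisjoint(groups[test])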
cv_ : str
    Class name of the cv splitter used in `fit_transform` for :term:`cross fitting`.
Since the cv strategy is determined internally and (now that cv can take more forms and groups can be passed into fit_transform) has become quite complex, I think it is useful to expose a cv_ attribute so users can inspect what happened internally.
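A minimal sketch of that inspection, assuming the `cv_` attribute lands as quoted above (it is introduced by this PR and is not part of released scikit-learn):

from sklearn.datasets import make_regression
from sklearn.preprocessing import TargetEncoder

X, y = make_regression(n_samples=100, n_features=1, random_state=0)
X_cat = (X > 0).astype(int)  # a single binary "categorical" feature

enc = TargetEncoder(target_type="continuous")
enc.fit_transform(X_cat, y)
# `cv_` is the attribute proposed in this PR: the class name of the
# splitter chosen internally, e.g. "KFold" for a continuous target.
print(enc.cv_)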
raise ValueError(
    "Expected `cv` as an integer, a cross-validation object "
    "(from sklearn.model_selection), or an iterable yielding "
    f"(train, test) splits as arrays of indices. Got {self.cv}."
)
I tend to remove that, though I am wondering: when I write a test for this, I can see that validate_parameter_constraints already catches it.
So, it seems that

import pytest
from sklearn.datasets import make_regression
from sklearn.preprocessing import TargetEncoder

X, y = make_regression(n_samples=100, n_features=3, random_state=0)
encoder = TargetEncoder(target_type="continuous", cv="lolo")
msg = "Expected `cv` as an integer, a cross-validation object"
with pytest.raises(ValueError, match=msg):
    encoder.fit_transform(X, y)

is not necessary.
But why then does codecov not complain that lines 414-418 never get executed?
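A sketch of that observation, under the assumption that `cv` is covered by `TargetEncoder._parameter_constraints` in this PR: the declarative check runs at the start of fit_transform, so the manual raise at lines 414-418 is never reached. `InvalidParameterError` subclasses `ValueError`, which is why the invalid string never makes it to the handwritten check.

from sklearn.datasets import make_regression
from sklearn.preprocessing import TargetEncoder
from sklearn.utils._param_validation import InvalidParameterError

X, y = make_regression(n_samples=100, n_features=3, random_state=0)

try:
    TargetEncoder(target_type="continuous", cv="lolo").fit_transform(X, y)
except InvalidParameterError as exc:
    # The declarative parameter validation fires first, with its own message,
    # so the manual ValueError (and a test matching its message) is redundant.
    print(type(exc).__name__, "-", exc)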
| "Expected `cv` as an integer, a cross-validation object " | ||
| "(from sklearn.model_selection), or an iterable yielding (train, test) " | ||
| f"splits as arrays of indices. Got {cv}." |
Improving the error message as a side quest, since strings are also iterables.
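A quick illustration of why the wording matters (independent of the PR, using only the standard library): a plain string already satisfies a naive iterability check, so it has to be rejected explicitly.

from collections.abc import Iterable

# A string is an iterable (of characters), so "is it an iterable?" alone
# cannot distinguish a typo like "lolo" from a legitimate iterable of splits.
assert isinstance("lolo", Iterable)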
class ConsumingSplitterInheritingFromGroupKFold(ConsumingSplitter, GroupKFold):
    """Helper class that can be used to test TargetEncoder, that only takes specific
    splitters."""
As an alternative, ConsumingSplitter itself could inherit from GroupKFold?
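A rough sketch of the pattern behind that alternative (ConsumingSplitter and GroupKFold are the real names from scikit-learn's test helpers and model_selection; the mixin and combined class below are purely illustrative): putting the metadata-recording behaviour in a mixin and combining it with GroupKFold via multiple inheritance keeps the grouped splitting logic while recording what split() received.

import numpy as np
from sklearn.model_selection import GroupKFold

class RecordingSplitMixin:
    """Record the metadata passed to split(), then defer to the parent splitter."""

    def split(self, X, y=None, groups=None):
        self.recorded_groups_ = groups
        yield from super().split(X, y, groups=groups)

class RecordingGroupKFold(RecordingSplitMixin, GroupKFold):
    """GroupKFold that remembers the `groups` it was given."""

X = np.zeros((6, 1))
groups = np.array([0, 0, 1, 1, 2, 2])
splitter = RecordingGroupKFold(n_splits=3)
list(splitter.split(X, groups=groups))
assert splitter.recorded_groups_ is groups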
Reference Issues/PRs
closes #32076
supersedes #32239
What does this implement/fix? Explain your changes.
- `groups` param to `TargetEncoder.fit_transform` that then internally picks a suitable non-overlapping cross-validation strategy (in its cross-fitting).
- `cv` init param (as discussed here: "TargetEncoder should take `groups` as an argument", #32076 (comment)).
- `TargetEncoder` into a metadata router that routes `groups` to the internal splitter.
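A hedged sketch of the intended usage described in the first bullet (the `groups` argument to `TargetEncoder.fit_transform` is what this PR proposes; it does not exist in released scikit-learn):

import numpy as np
from sklearn.preprocessing import TargetEncoder

rng = np.random.default_rng(0)
X = rng.integers(0, 3, size=(12, 1))   # one categorical feature with 3 levels
y = rng.normal(size=12)
groups = np.repeat([0, 1, 2, 3], 3)    # e.g. one group per subject

enc = TargetEncoder(target_type="continuous")
# `groups` is routed to the internally chosen, non-overlapping splitter
# (this keyword is introduced by this PR).
X_enc = enc.fit_transform(X, y, groups=groups)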
AI usage disclosure
I used AI assistance for:
Any other comments?
I also fixed some minor documentation issues across the cross-validation docs; sorry, I couldn't hold back.