Fix #30310: Allow sparse intermediate outputs in ColumnTransformer with pandas output #32892

saijithendr · 2025-12-12T04:51:43Z

Problem

Calling set_output(transform='pandas') on a ColumnTransformer makes a contained OneHotEncoder with sparse_output=True (the default) raise a ValueError, even when a downstream transformer such as TruncatedSVD converts the sparse output to dense before it reaches the pandas container.

ValueError: Pandas output does not support sparse data.
Set sparse_output=False to output pandas dataframes or
disable Pandas output via ohe.set_output(transform="default").

Root Cause

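OneHotEncoder.transform() runs its sparse-compatibility check at the start of the method, before any output is built. The check looks only at the encoder's own configuration, so it does not account for pipeline composition, where a sparse intermediate output that a later step densifies is normal.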

Solution

Moved the check to the end of the method, after the sparse matrix is constructed.

Convert the sparse result to dense with .toarray() instead of raising an error.

Added a UserWarning that points users to sparse_output=False for better performance.
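The three steps above can be sketched outside scikit-learn as a standalone function. This is only an illustration under stated assumptions: finalize_encoder_output is a hypothetical helper, not the code in sklearn/preprocessing/_encoders.py, and the warning text is made up for the example rather than copied from the PR.

```python
import warnings

import numpy as np
from scipy import sparse


def finalize_encoder_output(X_out, transform_output):
    """Hypothetical helper: densify sparse output when a dataframe container is requested."""
    if transform_output != "default" and sparse.issparse(X_out):
        # Warn instead of raising, then fall back to a dense array.
        warnings.warn(
            f"{transform_output.capitalize()} output does not support sparse data; "
            "converting to a dense array. Set sparse_output=False to avoid the "
            "extra conversion.",
            UserWarning,
        )
        return X_out.toarray()
    return X_out


encoded = sparse.csr_matrix(np.eye(3))

# Default output: the sparse matrix passes through untouched.
passthrough = finalize_encoder_output(encoded, "default")

# Pandas output: the matrix is densified and a UserWarning is recorded.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    dense = finalize_encoder_output(encoded, "pandas")
```

Running the check after construction means the decision is made on the actual result rather than on the sparse_output flag alone. The cost is that .toarray() materializes the full dense matrix, which is why the warning recommends sparse_output=False.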

Reference Issues/PRs

Fix #30310

What does this implement/fix? Explain your changes.

File: sklearn/preprocessing/_encoders.py

Removed: Error-raising code that checked if transform_output != "default" and self.sparse_output:
Added (after sparse matrix construction): Auto-conversion logic that converts sparse to dense when pandas output is configured
Added: UserWarning with guidance on using sparse_output=False for better performance

File: sklearn/preprocessing/tests/test_encoders.py

Added five regression tests for issue #30310:

test_onehotencoder_sparse_output_with_pandas_set_output() - Standalone OneHotEncoder with sparse + pandas output
test_onehotencoder_in_pipeline_with_sparse_and_pandas_output() - OneHotEncoder in Pipeline
test_onehotencoder_in_columntransformer_with_sparse_and_pandas_output() - Main use case from issue (ColumnTransformer)
test_onehotencoder_sparse_false_no_warning() - No warning when sparse_output=False
test_onehotencoder_sparse_output_default_transform_output() - Behavior comparison between pandas and default output

AI usage disclosure

I used AI assistance for:

Code generation (e.g., when writing an implementation or fixing a bug)
Test/benchmark generation
Documentation (including examples)
Research and understanding

Any other comments?

github-actions · 2025-12-12T04:57:55Z

⏰ This pull request might be automatically closed in two weeks from now.

Thank you for your contribution to scikit-learn and for the effort you have put into this PR. This pull request does not yet meet the quality and clarity needed for an effective review. Reviewing time is limited, and our goal is to prioritize well-prepared contributions to keep scikit-learn maintainable. Unless this PR is improved, it will be automatically closed after two weeks.

To avoid autoclose and increase the chance of a productive review, please:

Ensure your contribution aligns with our contribution guide.
Include a clear motivation and concise explanation in the pull request description of why you chose this solution.
Make sure the code runs and passes tests locally (pytest) and in the CI.
Submit only code you can explain and maintain; reviewers will ask for clarifications and changes. Disclose any AI assistance per our Automated Contributions Policy.
Keep the changes minimal and directly relevant to the described issue or enhancement.

We cannot provide one-to-one guidance on every PR, though we encourage you to ask focused, actionable questions that show you have tried to explore the problem and are interested to engage with the project. 💬 Sometimes a maintainer or someone else from the community might be able to offer pointers.

If you improve your PR within the two-week window, the autoclose label can be removed by maintainers.

saijithendr added 2 commits December 12, 2025 05:18

Fix scikit-learn#30310: Allow sparse intermediate outputs in ColumnTransformer with pandas output

86b72a3

Fix scikit-learn#30310: Allows sparse intermediate outputs in ColumnTransformer with pandas output

beece03

github-actions bot added the module:preprocessing and CI:Linter failure (The linter CI is failing on this PR) labels Dec 12, 2025

Fix scikit-learn#30310: linting issues format fixed

210fb60

lucyleeow added the autoclose label (PR automatically closed 14 days after setting the label) Dec 12, 2025

saijithendr added 2 commits December 12, 2025 06:58

Fix linting errors in test_encoders.py

c4f78fa

- Remove unused imports
- Organize imports alphabetically

Fix I001: linting, import organization - auto-fixed with ruff

6fbdf95

github-actions bot removed the CI:Linter failure label (The linter CI is failing on this PR) Dec 12, 2025
