@saijithendr

Problem

When pandas output is enabled with ColumnTransformer.set_output(transform='pandas'), scikit-learn raises a ValueError if a OneHotEncoder inside the ColumnTransformer is configured with sparse_output=True. This happens even when a downstream transformer (such as TruncatedSVD) turns the sparse intermediate result into dense output.

ValueError: Pandas output does not support sparse data.
Set sparse_output=False to output pandas dataframes or
disable Pandas output via ohe.set_output(transform="default").

Root Cause

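The sparse compatibility check in OneHotEncoder.transform() runs at the start of the method and raises as soon as pandas output is combined with sparse_output=True. It does not account for pipeline composition, where a sparse intermediate output is normal and a later step produces the final dense result.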

Solution

  • Moved the check to the end of the method (after sparse matrix construction)

  • Auto-convert sparse to dense using .toarray() instead of raising an error

  • Added a UserWarning guiding users to use sparse_output=False for better performance
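
The behavior described in this list can be sketched in isolation. The helper below is hypothetical: its name, signature, and warning text are assumptions made for illustration and are not the code in this PR or part of scikit-learn's API. It only shows the "warn, then densify" idea on a SciPy sparse matrix.

```python
import warnings

from scipy import sparse


def to_dense_if_dataframe_output(X_out, transform_output):
    """Hypothetical helper: densify sparse output when a dataframe is requested."""
    # Nothing to do for the default (array) output or for already-dense data.
    if transform_output == "default" or not sparse.issparse(X_out):
        return X_out
    warnings.warn(
        f"{transform_output} output does not support sparse data; converting "
        "to a dense array. Set sparse_output=False to avoid this conversion.",
        UserWarning,
    )
    return X_out.toarray()


X_sparse = sparse.csr_matrix([[1.0, 0.0], [0.0, 1.0]])

# Record warnings so the example can show that exactly one is emitted.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    dense = to_dense_if_dataframe_output(X_sparse, "pandas")

# With the default output the sparse matrix is passed through untouched.
unchanged = to_dense_if_dataframe_output(X_sparse, "default")
```

Note that .toarray() materializes every zero, so for high-cardinality columns the dense copy can be far larger than the sparse matrix; this is why the warning points users to sparse_output=False as an explicit choice.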

Reference Issues/PRs

Fix #30310

What does this implement/fix? Explain your changes.

File: sklearn/preprocessing/_encoders.py

  • Removed: the error-raising check, if transform_output != "default" and self.sparse_output

  • Added (after sparse matrix construction): Auto-conversion logic that converts sparse to dense when pandas output is configured

  • Added: UserWarning with guidance on using sparse_output=False for better performance

File: sklearn/preprocessing/tests/test_encoders.py
Added 5 comprehensive regression tests for issue #30310:

  • test_onehotencoder_sparse_output_with_pandas_set_output() - Standalone OneHotEncoder with sparse + pandas output

  • test_onehotencoder_in_pipeline_with_sparse_and_pandas_output() - OneHotEncoder in Pipeline

  • test_onehotencoder_in_columntransformer_with_sparse_and_pandas_output() - Main use case from issue (ColumnTransformer)

  • test_onehotencoder_sparse_false_no_warning() - No warning when sparse_output=False

  • test_onehotencoder_sparse_output_default_transform_output() - Behavior comparison between pandas and default output

AI usage disclosure

I used AI assistance for:

  • Code generation (e.g., when writing an implementation or fixing a bug)
  • Test/benchmark generation
  • Documentation (including examples)
  • Research and understanding

Any other comments?

@github-actions (bot) added the labels module:preprocessing and CI:Linter failure ("The linter CI is failing on this PR") on Dec 12, 2025
@lucyleeow added the autoclose label ("PR automatically closed 14 days after setting the label") on Dec 12, 2025
@github-actions

⏰ This pull request might be automatically closed in two weeks from now.

Thank you for your contribution to scikit-learn and for the effort you have put into this PR. This pull request does not yet meet the quality and clarity needed for an effective review. Reviewing time is limited, and our goal is to prioritize well-prepared contributions to keep scikit-learn maintainable. Unless this PR is improved, it will be automatically closed after two weeks.

To avoid autoclose and increase the chance of a productive review, please:

  • Ensure your contribution aligns with our contribution guide.
  • Include a clear motivation and concise explanation in the pull request description of why you chose this solution.
  • Make sure the code runs and passes tests locally (pytest) and in the CI.
  • Submit only code you can explain and maintain; reviewers will ask for clarifications and changes. Disclose any AI assistance per our Automated Contributions Policy.
  • Keep the changes minimal and directly relevant to the described issue or enhancement.

We cannot provide one-to-one guidance on every PR, though we encourage you to ask focused, actionable questions that show you have tried to explore the problem and are interested in engaging with the project. 💬 Sometimes a maintainer or someone else from the community might be able to offer pointers.

If you improve your PR within the two-week window, the autoclose label can be removed by maintainers.

@github-actions (bot) removed the CI:Linter failure label ("The linter CI is failing on this PR") on Dec 12, 2025

Development

Successfully merging this pull request may close these issues.

Error with set_output(transform='pandas') in ColumnTransformer when using OneHotEncoder with sparse output in intermediate steps
