Fix #30310: Allow sparse intermediate outputs in ColumnTransformer with pandas output #32892
+226
−26
Problem
When using ColumnTransformer.set_output(transform='pandas'), scikit-learn raises a ValueError if OneHotEncoder is configured with sparse_output=True, even when downstream transformers (like TruncatedSVD) convert the sparse output to dense.
Root Cause
The sparse-compatibility check in OneHotEncoder.transform() runs at the start of the method and does not account for pipeline composition, where sparse intermediate outputs are normal.
Solution
Moved the check to the end of the method (after sparse matrix construction)
Auto-converted sparse output to dense using .toarray() instead of raising an error
Added a UserWarning advising users to set sparse_output=False for better performance
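The three steps above can be sketched in isolation. This is not the code from the PR: `densify_for_pandas` is a hypothetical helper, the `transform_output` argument stands in for the configured output container, and the warning text is a placeholder. It assumes NumPy and SciPy are installed.

```python
import warnings

import numpy as np
from scipy import sparse


def densify_for_pandas(X_out, transform_output):
    """Hypothetical helper: convert sparse output to dense for non-default containers.

    `transform_output` is assumed to be "default" or a container name such as "pandas".
    """
    if transform_output != "default" and sparse.issparse(X_out):
        # Placeholder message; the wording used in the PR is not reproduced here.
        warnings.warn(
            "Sparse output was converted to dense for pandas output; "
            "set sparse_output=False to avoid the conversion.",
            UserWarning,
        )
        return X_out.toarray()
    return X_out


# Usage: a 3x3 one-hot style sparse matrix.
encoded = sparse.csr_matrix(np.eye(3))

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    dense = densify_for_pandas(encoded, "pandas")

print(type(dense).__name__, len(caught))
print(sparse.issparse(densify_for_pandas(encoded, "default")))
```

Note the trade-off in this approach: `.toarray()` materialises every zero, so a high-cardinality encoding can use far more memory than its sparse form, which is why the warning points to sparse_output=False.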
Reference Issues/PRs
Fix #30310
What does this implement/fix? Explain your changes.
File: sklearn/preprocessing/_encoders.py
Removed: Error-raising code that checked if transform_output != "default" and self.sparse_output:
Added (after sparse matrix construction): Auto-conversion logic that converts sparse to dense when pandas output is configured
Added: UserWarning with guidance on using sparse_output=False for better performance
File: sklearn/preprocessing/tests/test_encoders.py
Added five regression tests for issue #30310:
test_onehotencoder_sparse_output_with_pandas_set_output() - Standalone OneHotEncoder with sparse + pandas output
test_onehotencoder_in_pipeline_with_sparse_and_pandas_output() - OneHotEncoder in Pipeline
test_onehotencoder_in_columntransformer_with_sparse_and_pandas_output() - Main use case from issue (ColumnTransformer)
test_onehotencoder_sparse_false_no_warning() - No warning when sparse_output=False
test_onehotencoder_sparse_output_default_transform_output() - Behavior comparison between pandas and default output
AI usage disclosure
I used AI assistance for:
Any other comments?