
[quantization] Quantize lm_head #631

Open
stamalakhov wants to merge 1 commit into Samsung:main from stamalakhov:GPTQ_lm_head

Conversation

Contributor

@stamalakhov stamalakhov commented Apr 14, 2026

This PR quantizes `lm_head` in GPTQ to improve accuracy.
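As a sketch of the change, GPTQ pipelines typically walk the module tree and collect the linear layers to quantize; the effect of this PR is that `lm_head` is no longer excluded from that set. The snippet below is a minimal illustration of that selection logic — `collect_quant_targets` and the `skip_lm_head` flag are hypothetical names, not the actual TICO API:

```python
def collect_quant_targets(named_modules, skip_lm_head=False):
    """Return names of linear-like modules to hand to GPTQ.

    `named_modules` is a list of (name, module_kind) pairs, standing in
    for an iteration over a real model's submodules.
    """
    targets = []
    for name, kind in named_modules:
        if kind != "Linear":
            continue  # GPTQ only quantizes weight matrices of linear layers
        if skip_lm_head and name.endswith("lm_head"):
            continue  # previous behaviour: keep the head in full precision
        targets.append(name)
    return targets


modules = [
    ("model.layers.0.self_attn.q_proj", "Linear"),
    ("model.layers.0.mlp.up_proj", "Linear"),
    ("model.norm", "RMSNorm"),
    ("lm_head", "Linear"),
]

print(collect_quant_targets(modules, skip_lm_head=True))   # head excluded
print(collect_quant_targets(modules, skip_lm_head=False))  # head included
```

With `skip_lm_head=False`, `lm_head` goes through the same GPTQ weight update as the other linear layers instead of staying in FP, which is what the PPL comparisons below measure.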

./ccex test --include-internal -k quantization.algorithm.test_gptq

RUN unit tests with -k quantization.algorithm.test_gptq ...
test_gptq_config_validate_rejects_non_positive_weight_bits_override (quantization.algorithm.test_gptq.GPTQTest) ... ok
test_gptq_config_validate_weight_bits_overrides (quantization.algorithm.test_gptq.GPTQTest) ... ok
test_groupwise_conv1d (quantization.algorithm.test_gptq.GPTQTest) ... ok
test_groupwise_conv2d (quantization.algorithm.test_gptq.GPTQTest) ... ok
test_model (quantization.algorithm.test_gptq.GPTQTest) ... <frozen importlib._bootstrap>:241: DeprecationWarning: builtin type SwigPyPacked has no __module__ attribute
<frozen importlib._bootstrap>:241: DeprecationWarning: builtin type SwigPyObject has no __module__ attribute
ok
test_net (quantization.algorithm.test_gptq.GPTQTest) ... No specialized wrapper found for ModuleList; applying recursive wrapping.
ok
test_net_on_zero_inputs (quantization.algorithm.test_gptq.GPTQTest) ... ok
test_normconv1d (quantization.algorithm.test_gptq.GPTQTest) ... ok
test_normconv1d_with_logits (quantization.algorithm.test_gptq.GPTQTest) ... ok
test_normconv2d (quantization.algorithm.test_gptq.GPTQTest) ... ok
test_normconv2d_on_zero_inputs (quantization.algorithm.test_gptq.GPTQTest) ... ok
test_normconv2d_with_logits (quantization.algorithm.test_gptq.GPTQTest) ... ok
test_normconv3d (quantization.algorithm.test_gptq.GPTQTest) ... ok
test_normconv3d_on_zero_inputs (quantization.algorithm.test_gptq.GPTQTest) ... ok
test_normconv3d_with_logits (quantization.algorithm.test_gptq.GPTQTest) ... ok
test_paddednormconv2d (quantization.algorithm.test_gptq.GPTQTest) ... ok
test_paddednormconv3d (quantization.algorithm.test_gptq.GPTQTest) ... ok
test_resolve_weight_bits_priority (quantization.algorithm.test_gptq.GPTQTest) ... ok
test_transposed_conv2d (quantization.algorithm.test_gptq.GPTQTest) ... ok
test_transposed_conv2d_with_logits (quantization.algorithm.test_gptq.GPTQTest) ... ok
test_weight_bits_overrides_are_applied_per_module (quantization.algorithm.test_gptq.GPTQTest) ... ok

----------------------------------------------------------------------
Ran 21 tests in 119.973s

OK

Value tests:

HuggingFaceTB/SmolLM2-135M-Instruct

| Config ID | PPL |
|---|---|
| FP32 | 17.40 |
| GPTQ_MSE_w4A16_head4 | 27.74 |
| GPTQ_MSE_w4A16_head_GPTQ_4 | 25.01 |
| GPTQ_SMSE_w4A16_head4 | 27.19 |
| GPTQ_SMSE_w4A16_head_GPTQ_4 | 24.14 |

TinyLlama/TinyLlama-1.1B-Chat-v1.0:

| Config ID | PPL |
|---|---|
| FP32 | 7.97 |
| GPTQ_MSE_w4A16_head4 | 8.66 |
| GPTQ_MSE_w4A16_head_GPTQ_4 | 8.54 |
| GPTQ_SMSE_w4A16_head4 | 8.52 |
| GPTQ_SMSE_w4A16_head_GPTQ_4 | 8.42 |

unsloth/Llama-3.2-1B-Instruct:

| Config ID | PPL |
|---|---|
| FP32 | 13.17 |
| GPTQ_MSE_w4A16_head4 | 18.59 |
| GPTQ_MSE_w4A16_head_GPTQ_4 | 18.30 |
| GPTQ_SMSE_w4A16_head4 | 15.26 |
| GPTQ_SMSE_w4A16_head_GPTQ_4 | 15.00 |

unsloth/Llama-3.2-3B-Instruct:

| Config ID | PPL |
|---|---|
| FP32 | 11.05 |
| GPTQ_MSE_w4A16_head4 | 12.96 |
| GPTQ_MSE_w4A16_head_GPTQ_4 | 12.67 |
| GPTQ_SMSE_w4A16_head4 | 12.31 |
| GPTQ_SMSE_w4A16_head_GPTQ_4 | 12.17 |
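For reference, the PPL metric reported above is conventionally the exponential of the mean negative log-likelihood over the evaluated tokens (lower is better, and FP32 is the floor the quantized configs are compared against). A minimal self-contained version of that computation:

```python
import math


def perplexity(token_logprobs):
    """PPL = exp(-mean log-likelihood) over the evaluated tokens.

    `token_logprobs` holds the natural-log probability the model
    assigned to each ground-truth token.
    """
    nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(nll)


# A model that assigns every token probability 1/e has PPL e ≈ 2.718.
print(perplexity([-1.0, -1.0, -1.0]))  # → 2.718281828...
```

This is the standard definition; the exact evaluation harness (dataset, context length, striding) used for the numbers above is not specified in the PR.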

Related: this serves as a fallback to #624.
TICO-DCO-1.0-Signed-off-by: s.malakhov <s.malakhov@partner.samsung.com>

@stamalakhov stamalakhov self-assigned this Apr 14, 2026
@stamalakhov stamalakhov force-pushed the GPTQ_lm_head branch 6 times, most recently from d18e64e to 23a40c6 on April 15, 2026 08:58
@stamalakhov stamalakhov marked this pull request as ready for review April 15, 2026 08:59
@stamalakhov stamalakhov requested a review from mhs4670go April 15, 2026 09:01
This PR quantizes `lm_head` in GPTQ to improve accuracy.

TICO-DCO-1.0-Signed-off-by: s.malakhov <s.malakhov@partner.samsung.com>