
[quantization] Quantize lm_head #631

Open
stamalakhov wants to merge 1 commit into Samsung:main from stamalakhov:GPTQ_lm_head

Conversation

Contributor

@stamalakhov stamalakhov commented Apr 14, 2026

This PR quantizes `lm_head` in GPTQ to improve accuracy.
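As a sketch of the change, GPTQ pipelines typically walk the module tree and collect the linear layers to quantize; the effect of this PR is that `lm_head` is no longer excluded from that set. The snippet below is a minimal illustration of that selection logic — `collect_quant_targets` and the `skip_lm_head` flag are hypothetical names, not the actual TICO API:

```python
def collect_quant_targets(named_modules, skip_lm_head=False):
    """Return names of linear-like modules to hand to GPTQ.

    `named_modules` is a list of (name, module_kind) pairs, standing in
    for an iteration over a real model's submodules.
    """
    targets = []
    for name, kind in named_modules:
        if kind != "Linear":
            continue  # GPTQ only quantizes weight matrices of linear layers
        if skip_lm_head and name.endswith("lm_head"):
            continue  # previous behaviour: keep the head in full precision
        targets.append(name)
    return targets


modules = [
    ("model.layers.0.self_attn.q_proj", "Linear"),
    ("model.layers.0.mlp.up_proj", "Linear"),
    ("model.norm", "RMSNorm"),
    ("lm_head", "Linear"),
]

print(collect_quant_targets(modules, skip_lm_head=True))   # head excluded
print(collect_quant_targets(modules, skip_lm_head=False))  # head included
```

With `skip_lm_head=False`, `lm_head` goes through the same GPTQ weight update as the other linear layers instead of staying in FP, which is what the PPL comparisons below measure.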

./ccex test --include-internal -k quantization.algorithm.test_gptq

RUN unit tests with -k quantization.algorithm.test_gptq ...
test_gptq_config_validate_rejects_non_positive_weight_bits_override (quantization.algorithm.test_gptq.GPTQTest) ... ok
test_gptq_config_validate_weight_bits_overrides (quantization.algorithm.test_gptq.GPTQTest) ... ok
test_groupwise_conv1d (quantization.algorithm.test_gptq.GPTQTest) ... ok
test_groupwise_conv2d (quantization.algorithm.test_gptq.GPTQTest) ... ok
test_model (quantization.algorithm.test_gptq.GPTQTest) ... <frozen importlib._bootstrap>:241: DeprecationWarning: builtin type SwigPyPacked has no __module__ attribute
<frozen importlib._bootstrap>:241: DeprecationWarning: builtin type SwigPyObject has no __module__ attribute
ok
test_net (quantization.algorithm.test_gptq.GPTQTest) ... No specialized wrapper found for ModuleList; applying recursive wrapping.
ok
test_net_on_zero_inputs (quantization.algorithm.test_gptq.GPTQTest) ... ok
test_normconv1d (quantization.algorithm.test_gptq.GPTQTest) ... ok
test_normconv1d_with_logits (quantization.algorithm.test_gptq.GPTQTest) ... ok
test_normconv2d (quantization.algorithm.test_gptq.GPTQTest) ... ok
test_normconv2d_on_zero_inputs (quantization.algorithm.test_gptq.GPTQTest) ... ok
test_normconv2d_with_logits (quantization.algorithm.test_gptq.GPTQTest) ... ok
test_normconv3d (quantization.algorithm.test_gptq.GPTQTest) ... ok
test_normconv3d_on_zero_inputs (quantization.algorithm.test_gptq.GPTQTest) ... ok
test_normconv3d_with_logits (quantization.algorithm.test_gptq.GPTQTest) ... ok
test_paddednormconv2d (quantization.algorithm.test_gptq.GPTQTest) ... ok
test_paddednormconv3d (quantization.algorithm.test_gptq.GPTQTest) ... ok
test_resolve_weight_bits_priority (quantization.algorithm.test_gptq.GPTQTest) ... ok
test_transposed_conv2d (quantization.algorithm.test_gptq.GPTQTest) ... ok
test_transposed_conv2d_with_logits (quantization.algorithm.test_gptq.GPTQTest) ... ok
test_weight_bits_overrides_are_applied_per_module (quantization.algorithm.test_gptq.GPTQTest) ... ok

----------------------------------------------------------------------
Ran 21 tests in 119.973s

OK

Value tests:

HuggingFaceTB/SmolLM2-135M-Instruct

| Config ID | PPL |
|---|---|
| FP32 | 17.40 |
| GPTQ_MSE_w4A16_head4 | 27.74 |
| GPTQ_MSE_w4A16_head_GPTQ_4 | 25.01 |
| GPTQ_SMSE_w4A16_head4 | 27.19 |
| GPTQ_SMSE_w4A16_head_GPTQ_4 | 24.14 |

TinyLlama/TinyLlama-1.1B-Chat-v1.0:

| Config ID | PPL |
|---|---|
| FP32 | 7.97 |
| GPTQ_MSE_w4A16_head4 | 8.66 |
| GPTQ_MSE_w4A16_head_GPTQ_4 | 8.54 |
| GPTQ_SMSE_w4A16_head4 | 8.52 |
| GPTQ_SMSE_w4A16_head_GPTQ_4 | 8.42 |

unsloth/Llama-3.2-1B-Instruct:

| Config ID | PPL |
|---|---|
| FP32 | 13.17 |
| GPTQ_MSE_w4A16_head4 | 18.59 |
| GPTQ_MSE_w4A16_head_GPTQ_4 | 18.30 |
| GPTQ_SMSE_w4A16_head4 | 15.26 |
| GPTQ_SMSE_w4A16_head_GPTQ_4 | 15.00 |

unsloth/Llama-3.2-3B-Instruct:

| Config ID | PPL |
|---|---|
| FP32 | 11.05 |
| GPTQ_MSE_w4A16_head4 | 12.96 |
| GPTQ_MSE_w4A16_head_GPTQ_4 | 12.67 |
| GPTQ_SMSE_w4A16_head4 | 12.31 |
| GPTQ_SMSE_w4A16_head_GPTQ_4 | 12.17 |
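For reference, the PPL metric reported above is conventionally the exponential of the mean negative log-likelihood over the evaluated tokens (lower is better, and FP32 is the floor the quantized configs are compared against). A minimal self-contained version of that computation:

```python
import math


def perplexity(token_logprobs):
    """PPL = exp(-mean log-likelihood) over the evaluated tokens.

    `token_logprobs` holds the natural-log probability the model
    assigned to each ground-truth token.
    """
    nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(nll)


# A model that assigns every token probability 1/e has PPL e ≈ 2.718.
print(perplexity([-1.0, -1.0, -1.0]))  # → 2.718281828...
```

This is the standard definition; the exact evaluation harness (dataset, context length, striding) used for the numbers above is not specified in the PR.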

Related: this serves as a fallback to #624.
TICO-DCO-1.0-Signed-off-by: s.malakhov <s.malakhov@partner.samsung.com>

@stamalakhov stamalakhov self-assigned this Apr 14, 2026
@stamalakhov stamalakhov force-pushed the GPTQ_lm_head branch 6 times, most recently from d18e64e to 23a40c6 on April 15, 2026 08:58
@stamalakhov stamalakhov marked this pull request as ready for review April 15, 2026 08:59
@stamalakhov stamalakhov requested a review from mhs4670go April 15, 2026 09:01
This PR quantizes `lm_head` in GPTQ to improve accuracy.

TICO-DCO-1.0-Signed-off-by: s.malakhov <s.malakhov@partner.samsung.com>