Background
Currently, we maintain separate quantization wrappers for the prefill and decode paths (e.g., `QuantLlamaDecoderLayerPrefill`, etc.).
However, this introduces several issues:
- Code duplication across wrappers (observer registration, module wiring, quant logic)
- Risk of inconsistent observer placement between prefill and decode
- Increased maintenance and testing overhead
- Tight coupling between:
  - calibration/evaluation logic (HF-compatible API)
  - export requirements (accelerator-specific API)
At the same time, export requires a different input/output contract, such as:
- Removing control-flow inputs like `use_cache`
- Fixing return types (e.g., always returning the KV cache, or never returning it)
- Providing accelerator-friendly static inputs
Problem
We are currently mixing two different concerns into the same abstraction:
- **Runtime / Calibration Interface**
  - HF-compatible `forward`
  - Used for:
    - `prepare`
    - calibration
    - evaluation
    - regression checks
- **Export Interface**
  - Accelerator-specific static graph
  - Requires:
    - fixed input/output signature
    - no control-flow flags (`use_cache`, `output_attentions`, etc.)

This leads to:
- Unnecessary complexity in `forward`
- Export graph pollution (unused inputs like `use_cache`)
- Difficulty maintaining consistency between prefill/decode variants
Proposal
1. Unify Quantization Wrappers (Single Wrapper)
Use a single quantization wrapper for both prefill and decode:
- Example: `QuantLlamaDecoderLayer`, `QuantLlamaAttention`
Key ideas:
- Keep one unified `forward`, similar to the original HF model
- Use:
  - `past_key_value is None` → prefill behavior
  - `past_key_value is not None` → decode behavior
- Maintain full HF compatibility for:
- calibration
- evaluation
- integration
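The dispatch rule above can be sketched as follows. This is a minimal illustration, not the real implementation: `inner` stands in for the wrapped HF decoder layer computation, and the actual wrapper would also register activation observers here, once, so prefill and decode share identical observer placement.

```python
import torch
import torch.nn as nn

class QuantLlamaDecoderLayer(nn.Module):
    """Unified quantization wrapper (sketch only).

    `inner` is a stand-in for the wrapped HF decoder layer; the real
    wrapper would additionally attach quant observers in one place,
    serving both prefill and decode.
    """

    def __init__(self, inner):
        super().__init__()
        self.inner = inner

    def forward(self, hidden_states, past_key_value=None,
                use_cache=False, output_attentions=False):
        # The mode is implied by the inputs, exactly as in the HF model:
        #   past_key_value is None     -> prefill
        #   past_key_value is not None -> decode
        out = self.inner(hidden_states)
        # HF-style tuple return keeps calibration/evaluation code unchanged
        return (out, past_key_value) if use_cache else (out,)
```

Because there is a single `forward`, observer registration and quant logic live in exactly one place for both modes.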
2. Separate Export via Thin Adapters
Introduce export-specific adapter modules instead of modifying the main wrapper.
Examples:
- `DecoderLayerPrefillExportAdapter`
- `DecoderLayerDecodeExportAdapter`
Responsibilities of adapters:
- Fix input/output contract
- Remove control-flow arguments: `use_cache`, `output_attentions`
- Ensure deterministic return structure
- Provide accelerator-friendly signatures
Example:

```python
import torch.nn as nn

class DecoderLayerPrefillExportAdapter(nn.Module):
    def __init__(self, layer):
        super().__init__()
        self.layer = layer

    def forward(self, hidden_states):
        # Fixed contract: no control-flow flags reach the exported graph
        out = self.layer(
            hidden_states=hidden_states,
            use_cache=False,
            output_attentions=False,
        )
        # Deterministic return structure: hidden states only
        return out[0] if isinstance(out, tuple) else out
```
Export usage:

```python
export_layer = DecoderLayerPrefillExportAdapter(qlayer)
cm = tico.convert(
    export_layer,
    (example_hidden,),
)
```
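The decode-side adapter would follow the same pattern. A hedged sketch, assuming the wrapped layer accepts a `past_key_value` tuple and returns `(hidden_states, new_cache)` when `use_cache=True`; the exact cache layout depends on the wrapped layer's signature:

```python
import torch
import torch.nn as nn

class DecoderLayerDecodeExportAdapter(nn.Module):
    """Export adapter for the decode path (sketch).

    Takes the KV cache as explicit static inputs instead of HF
    control-flow flags, so the exported graph has a fixed signature.
    """

    def __init__(self, layer):
        super().__init__()
        self.layer = layer

    def forward(self, hidden_states, past_key, past_value):
        out = self.layer(
            hidden_states=hidden_states,
            past_key_value=(past_key, past_value),
            use_cache=True,  # decode always carries a cache
            output_attentions=False,
        )
        # Deterministic return structure: hidden states + updated cache
        hidden, new_cache = out[0], out[1]
        return hidden, new_cache[0], new_cache[1]
```

Flattening the cache into separate tensor inputs/outputs keeps the signature static and accelerator-friendly.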
3. Keep Calibration Workflow Unchanged
No changes to existing workflow:
prepare → calibration → convert → export
- Calibration continues to use the unified wrapper
- Observers are collected consistently
- Export is the only step that introduces specialization
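The unchanged pipeline can be sketched end to end. Note that `prepare` and `convert` below are hypothetical stand-ins, not an existing API; they are included only to show where the unified wrapper sits relative to export-time specialization.

```python
import torch
import torch.nn as nn

# Hypothetical stand-ins for the real prepare/convert steps
# (names are illustrative, not an existing API).
def prepare(model):
    # real flow: swap HF layers for the unified quant wrapper
    # and attach observers
    model.calib_batches_seen = 0
    return model

def convert(model):
    # real flow: fold observer statistics into quantization parameters
    model.converted = True
    return model

model = nn.Linear(4, 4)
qmodel = prepare(model)
for batch in [torch.randn(2, 4) for _ in range(3)]:  # calibration
    qmodel(batch)
    qmodel.calib_batches_seen += 1
qmodel = convert(qmodel)
# export specialization (the adapters) happens only after this point
```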
Considerations
- Do we need mode-specific observers for decode (due to different activation distributions)?
- Should we expose a helper API like `qlayer.as_export_module(mode="prefill")` for easier usage?
- Are there cases where decode requires fundamentally different computation (beyond I/O)?
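If the helper API above were adopted, one shape it could take is a thin dispatch table over the export adapters. A sketch, shown as a free function for brevity (the question proposes a method on the wrapper); minimal stand-in adapter classes are defined inline so the example is self-contained:

```python
import torch.nn as nn

# Minimal stand-ins for the adapter classes from the proposal,
# so this sketch runs on its own.
class DecoderLayerPrefillExportAdapter(nn.Module):
    def __init__(self, layer):
        super().__init__()
        self.layer = layer

class DecoderLayerDecodeExportAdapter(nn.Module):
    def __init__(self, layer):
        super().__init__()
        self.layer = layer

_EXPORT_ADAPTERS = {
    "prefill": DecoderLayerPrefillExportAdapter,
    "decode": DecoderLayerDecodeExportAdapter,
}

def as_export_module(qlayer, mode="prefill"):
    """Return an export-ready adapter wrapping `qlayer` for the given mode."""
    try:
        return _EXPORT_ADAPTERS[mode](qlayer)
    except KeyError:
        raise ValueError(f"unknown export mode: {mode!r}")
```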