
Refactor PTQ Wrapper Design for Unified Calibration and Export Separation #612

@mhs4670go

Description

Background

Currently, we maintain separate quantization wrappers for the prefill and decode paths (e.g., QuantLlamaDecoderLayerPrefill).

However, this introduces several issues:

  • Code duplication across wrappers (observer registration, module wiring, quant logic)
  • Risk of inconsistent observer placement between prefill and decode
  • Increased maintenance and testing overhead
  • Tight coupling between:
    • calibration/evaluation logic (HF-compatible API)
    • and export requirements (accelerator-specific API)

At the same time, export requires a different input/output contract, such as:

  • Removing control-flow inputs like use_cache
  • Fixing the return structure (e.g., deciding once whether the KV cache is returned)
  • Providing accelerator-friendly static inputs

Problem

We are currently mixing two different concerns into the same abstraction:

  1. Runtime / Calibration Interface

    • HF-compatible forward
    • Used for:
      • prepare
      • calibration
      • evaluation
      • regression checks
  2. Export Interface

    • Accelerator-specific static graph
    • Requires:
      • fixed input/output signature
      • no control-flow flags (use_cache, output_attentions, etc.)

This leads to:

  • Unnecessary complexity in forward
  • Export graph pollution (unused inputs like use_cache)
  • Difficulty maintaining consistency between prefill/decode variants

Proposal

1. Unify Quantization Wrappers (Single Wrapper)

Use a single quantization wrapper for both prefill and decode:

  • Example:
    • QuantLlamaDecoderLayer
    • QuantLlamaAttention

Key ideas:

  • Keep one unified forward, similar to the original HF model
  • Use:
    • past_key_value is None → prefill behavior
    • past_key_value is not None → decode behavior
  • Maintain full HF compatibility for:
    • calibration
    • evaluation
    • integration
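
The dispatch above can be sketched with a toy layer standing in for the real QuantLlamaDecoderLayer (the linear projection, cache layout, and shapes here are illustrative assumptions, not the actual wrapper internals):

```python
import torch
import torch.nn as nn

class QuantToyDecoderLayer(nn.Module):
    """Toy stand-in showing prefill/decode dispatch in one unified forward."""

    def __init__(self, hidden_size=8):
        super().__init__()
        self.proj = nn.Linear(hidden_size, hidden_size)

    def forward(self, hidden_states, past_key_value=None, use_cache=True):
        if past_key_value is None:
            # past_key_value is None -> prefill: build a fresh KV cache.
            kv = (hidden_states, hidden_states)
        else:
            # past_key_value is not None -> decode: append to the cache.
            k, v = past_key_value
            kv = (torch.cat([k, hidden_states], dim=1),
                  torch.cat([v, hidden_states], dim=1))
        out = self.proj(hidden_states)
        # HF-style return: optionally include the present KV cache.
        return (out, kv) if use_cache else (out,)
```

Because both paths share one forward, observers registered on this module see prefill and decode activations through the same hooks.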

2. Separate Export via Thin Adapters

Introduce export-specific adapter modules instead of modifying the main wrapper.

Examples:

  • DecoderLayerPrefillExportAdapter
  • DecoderLayerDecodeExportAdapter

Responsibilities of adapters:

  • Fix input/output contract
  • Remove control-flow arguments:
    • use_cache
    • output_attentions
  • Ensure deterministic return structure
  • Provide accelerator-friendly signatures

Example:

```python
import torch.nn as nn

class DecoderLayerPrefillExportAdapter(nn.Module):
    def __init__(self, layer):
        super().__init__()
        self.layer = layer

    def forward(self, hidden_states):
        # Pin control-flow flags so they never appear in the export graph.
        out = self.layer(
            hidden_states=hidden_states,
            use_cache=False,
            output_attentions=False,
        )
        # The wrapped layer may return a tuple; expose only hidden_states.
        return out[0] if isinstance(out, tuple) else out
```

Export usage:

```python
export_layer = DecoderLayerPrefillExportAdapter(qlayer)

cm = tico.convert(
    export_layer,
    (example_hidden,),
)
```
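
The decode-side adapter could follow the same pattern. This is a hedged sketch that assumes the unified wrapper returns (hidden_states, present_key_value) when use_cache=True; it flattens the cache tuple so the exporter sees only plain tensor inputs and outputs:

```python
import torch
import torch.nn as nn

class DecoderLayerDecodeExportAdapter(nn.Module):
    """Sketch: pins the decode export contract to flat tensors."""

    def __init__(self, layer):
        super().__init__()
        self.layer = layer

    def forward(self, hidden_states, past_key, past_value):
        out = self.layer(
            hidden_states=hidden_states,
            past_key_value=(past_key, past_value),
            use_cache=True,
            output_attentions=False,
        )
        # Assumed wrapper contract: (hidden_states, (key, value)).
        hidden, (key, value) = out[0], out[1]
        # Deterministic, flat return structure for the exporter.
        return hidden, key, value
```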

3. Keep Calibration Workflow Unchanged

No changes to existing workflow:

prepare → calibration → convert → export

  • Calibration continues to use the unified wrapper
  • Observers are collected consistently
  • Export is the only step that introduces specialization

Considerations

  • Do we need mode-specific observers for decode (due to different activation distributions)?
  • Should we expose a helper API like qlayer.as_export_module(mode="prefill") for easier usage?
  • Are there cases where decode requires fundamentally different computation (beyond I/O)?
