
Refactor PTQ Wrapper Design for Unified Calibration and Export Separation #612

@mhs4670go

Description

Background

Currently, we maintain separate quantization wrappers for the prefill and decode paths (e.g., QuantLlamaDecoderLayerPrefill).

However, this introduces several issues:

  • Code duplication across wrappers (observer registration, module wiring, quant logic)
  • Risk of inconsistent observer placement between prefill and decode
  • Increased maintenance and testing overhead
  • Tight coupling between:
    • calibration/evaluation logic (HF-compatible API)
    • and export requirements (accelerator-specific API)

At the same time, export requires a different input/output contract, such as:

  • Removing control-flow inputs like use_cache
  • Fixing the return structure (e.g., deciding once whether the KV cache is returned)
  • Providing accelerator-friendly static inputs

Problem

We are currently mixing two different concerns into the same abstraction:

  1. Runtime / Calibration Interface

    • HF-compatible forward
    • Used for:
      • prepare
      • calibration
      • evaluation
      • regression checks
  2. Export Interface

    • Accelerator-specific static graph
    • Requires:
      • fixed input/output signature
      • no control-flow flags (use_cache, output_attentions, etc.)

This leads to:

  • Unnecessary complexity in forward
  • Export graph pollution (unused inputs like use_cache)
  • Difficulty maintaining consistency between prefill/decode variants

Proposal

1. Unify Quantization Wrappers (Single Wrapper)

Use a single quantization wrapper for both prefill and decode:

  • Example:
    • QuantLlamaDecoderLayer
    • QuantLlamaAttention

Key ideas:

  • Keep one unified forward, similar to the original HF model
  • Use:
    • past_key_value is None → prefill behavior
    • past_key_value is not None → decode behavior
  • Maintain full HF compatibility for:
    • calibration
    • evaluation
    • integration
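
The dispatch above can be sketched with a toy layer standing in for the real QuantLlamaDecoderLayer (the linear projection, cache layout, and shapes here are illustrative assumptions, not the actual wrapper internals):

```python
import torch
import torch.nn as nn

class QuantToyDecoderLayer(nn.Module):
    """Toy stand-in showing prefill/decode dispatch in one unified forward."""

    def __init__(self, hidden_size=8):
        super().__init__()
        self.proj = nn.Linear(hidden_size, hidden_size)

    def forward(self, hidden_states, past_key_value=None, use_cache=True):
        if past_key_value is None:
            # past_key_value is None -> prefill: build a fresh KV cache.
            kv = (hidden_states, hidden_states)
        else:
            # past_key_value is not None -> decode: append to the cache.
            k, v = past_key_value
            kv = (torch.cat([k, hidden_states], dim=1),
                  torch.cat([v, hidden_states], dim=1))
        out = self.proj(hidden_states)
        # HF-style return: optionally include the present KV cache.
        return (out, kv) if use_cache else (out,)
```

Because both paths share one forward, observers registered on this module see prefill and decode activations through the same hooks.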

2. Separate Export via Thin Adapters

Introduce export-specific adapter modules instead of modifying the main wrapper.

Examples:

  • DecoderLayerPrefillExportAdapter
  • DecoderLayerDecodeExportAdapter

Responsibilities of adapters:

  • Fix input/output contract
  • Remove control-flow arguments:
    • use_cache
    • output_attentions
  • Ensure deterministic return structure
  • Provide accelerator-friendly signatures

Example:

```python
import torch.nn as nn

class DecoderLayerPrefillExportAdapter(nn.Module):
    def __init__(self, layer):
        super().__init__()
        self.layer = layer

    def forward(self, hidden_states):
        # Pin control-flow flags so they never appear in the export graph.
        out = self.layer(
            hidden_states=hidden_states,
            use_cache=False,
            output_attentions=False,
        )
        # The wrapped layer may return a tuple; expose only hidden_states.
        return out[0] if isinstance(out, tuple) else out
```

Export usage:

```python
export_layer = DecoderLayerPrefillExportAdapter(qlayer)

cm = tico.convert(
    export_layer,
    (example_hidden,),
)
```
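
The decode-side adapter could follow the same pattern. This is a hedged sketch that assumes the unified wrapper returns (hidden_states, present_key_value) when use_cache=True; it flattens the cache tuple so the exporter sees only plain tensor inputs and outputs:

```python
import torch
import torch.nn as nn

class DecoderLayerDecodeExportAdapter(nn.Module):
    """Sketch: pins the decode export contract to flat tensors."""

    def __init__(self, layer):
        super().__init__()
        self.layer = layer

    def forward(self, hidden_states, past_key, past_value):
        out = self.layer(
            hidden_states=hidden_states,
            past_key_value=(past_key, past_value),
            use_cache=True,
            output_attentions=False,
        )
        # Assumed wrapper contract: (hidden_states, (key, value)).
        hidden, (key, value) = out[0], out[1]
        # Deterministic, flat return structure for the exporter.
        return hidden, key, value
```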

3. Keep Calibration Workflow Unchanged

No changes to existing workflow:

prepare → calibration → convert → export

  • Calibration continues to use the unified wrapper
  • Observers are collected consistently
  • Export is the only step that introduces specialization

Considerations

  • Do we need mode-specific observers for decode (due to different activation distributions)?
  • Should we expose a helper API like qlayer.as_export_module(mode="prefill") for easier usage?
  • Are there cases where decode requires fundamentally different computation (beyond I/O)?
