Added column-major storage of weights and scales in INT4 quantization for model load time improvement in TRT-RTX #811

hthadicherla · 2026-01-23T11:42:49Z

What does this PR do?

Type of change: New feature

Overview:
TensorRT-RTX requires the weights and scales in ONNX models to be in column-major format. As a result, every time a model loads, the TRT-RTX JIT transposes the weights and scales, which increases load time.

The proposed feature transposes the weights and scales in the DQ node after quantization and adds a Transpose node right after it, i.e.
A × B = A × ((Bᵀ)ᵀ)

The transformation is a post-processing step and is disabled by default. It can be enabled by quantizing with --use_column_major.
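The identity above can be checked with a few lines of NumPy. This is a minimal sketch with made-up matrices, not code from the PR: it shows that the row-major bytes of Bᵀ are exactly the column-major bytes of B, and that multiplying by the transposed-back weight gives the same product.

```python
import numpy as np

# Illustrative activations (2x3) and weights (3x4); values are arbitrary.
A = np.arange(6, dtype=np.int64).reshape(2, 3)
B = np.arange(12, dtype=np.int64).reshape(3, 4)

# Storing B^T contiguously in row-major order yields the same byte sequence
# as storing B in column-major (Fortran) order.
B_t = np.ascontiguousarray(B.T)
same_bytes = B_t.tobytes() == B.tobytes(order="F")

# A Transpose applied to the stored B^T recovers B, so the product is unchanged.
same_result = np.array_equal(A @ B, A @ B_t.T)

print(same_bytes, same_result)
```

Integer inputs are used so the comparison is exact; with floating-point weights the two products would be compared with a tolerance instead.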

Usage

python -m modelopt.onnx.quantization --onnx_path "model.onnx" --output_path "model_quant.onnx" --quantize_mode int4 --calibration_method awq_lite --use_column_major --skip_shared_constants_duplication

Testing

Tested a few LLMs and compared their MMLU scores with and without this transformation. No degradation was observed.

Summary by CodeRabbit

Release Notes

New Features
- Added --use_column_major command-line flag to ONNX quantization script for enabling column-major weight storage optimization compatible with NvTensorRtRtx execution provider. This optimization applies to DQ-only quantization modes (rtn_dq, awq_lite, awq_clip).


…improvement in TRT-RTX Signed-off-by: Hrishith Thadicherla <[email protected]>

coderabbitai · 2026-01-23T11:43:10Z

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.


Walkthrough

This PR introduces a column-major storage optimization feature for ONNX INT4 quantization targeting the NvTensorRtRtx execution provider. It adds a CLI flag to the quantization script, integrates it through the quantization pipeline, and provides utility functions for applying column-major transformations to GEMM weights and inserting transpose operations in DQ-only quantization modes.

Changes

| Cohort / File(s) | Summary |
|---|---|
| CLI & API Integration: `examples/windows/onnx_ptq/genai_llm/quantize.py`, `modelopt/onnx/quantization/int4.py` | Adds the `--use_column_major` CLI argument and threads it through the quantization function signature. Integrates flag handling into the `quantize_rtn`, `quantize`, `_quantize_awq_clip`, and `_quantize_awq_lite` pathways. When enabled, control flow branches to apply the column-major transformation to GEMM weights before DQ node creation. The flag is logged and guarded against use in incompatible modes (e.g., QDQ mode). |
| Transformation Utilities: `modelopt/onnx/quantization/qdq_utils.py` | Adds three new functions: `_apply_transpose_perm_to_shape()` for computing transposed shapes, `apply_column_major_transformation()` to transpose quantized weights/scales in-place and return DQ attributes with axis set to 1, and `add_transpose_nodes_for_column_major()` to conditionally insert Transpose nodes after DQ nodes feeding MatMul/Gemm and update graph connections. Includes safeguards to skip already-processed nodes and to avoid altering Gemm when transB is set. |
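Why the DQ axis changes to 1 can be illustrated with a small NumPy sketch. It assumes block-wise scales that run along the K dimension of a (K, N) weight, and `dequantize_blockwise` is a hypothetical helper written for this example, not a function from `qdq_utils.py`. After transposing both the quantized weights and the scales, the blocks lie along axis 1, and transposing the dequantized result back recovers the original values.

```python
import numpy as np

def dequantize_blockwise(q, scales, block_size, axis):
    """Hypothetical helper: scale each run of `block_size` elements along `axis`."""
    expanded = np.repeat(scales, block_size, axis=axis)
    return q.astype(np.float32) * expanded

rng = np.random.default_rng(0)
K, N, block_size = 8, 4, 4  # illustrative sizes only

# INT4-range values held in int8, with one scale per block of K per column.
q = rng.integers(-8, 8, size=(K, N), dtype=np.int8)
scales = rng.uniform(0.01, 0.1, size=(K // block_size, N)).astype(np.float32)

# Row-major layout: blocks run along axis 0.
row_major = dequantize_blockwise(q, scales, block_size, axis=0)

# Column-major layout: store the transposes, so blocks now run along axis 1.
q_t = np.ascontiguousarray(q.T)
scales_t = np.ascontiguousarray(scales.T)
col_major = dequantize_blockwise(q_t, scales_t, block_size, axis=1)

# The Transpose node inserted after the DQ node undoes the layout change.
restored = col_major.T
print(np.array_equal(row_major, restored))
```

Each output element is the product of the same integer and the same scale in both layouts, so the two results match exactly rather than approximately.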

Sequence Diagram(s)

sequenceDiagram
    actor User
    participant CLI as quantize.py<br/>(CLI)
    participant API as int4.py<br/>(quantize)
    participant Transform as qdq_utils.py<br/>(apply_column_major)
    participant Graph as Graph<br/>(ONNX)
    
    User->>CLI: --use_column_major flag
    CLI->>API: quantize(...,<br/>use_column_major=True)
    API->>Transform: apply_column_major_transformation(<br/>weights, scales, ...)
    Transform->>Transform: Transpose weights &<br/>scales in-place
    Transform->>API: Return DQ attributes<br/>(axis=1)
    API->>Graph: Create DQ nodes with<br/>column-major attributes
    API->>Transform: add_transpose_nodes_for_column_major(graph)
    Transform->>Graph: Insert Transpose nodes<br/>after DQ nodes
    Transform->>Graph: Update MatMul/Gemm<br/>inputs
    Graph-->>User: Optimized ONNX model
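The graph rewrite in the last steps of the diagram can be sketched on a toy graph. The `Node` dataclass and `insert_transposes_after_dq` below are hypothetical stand-ins written for this illustration; they are not the graph library or the utility used in the PR, and the Gemm transB handling is deliberately left out.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    """Toy graph node; tensors are referred to by name."""
    op: str
    name: str
    inputs: list
    outputs: list
    attrs: dict = field(default_factory=dict)

def insert_transposes_after_dq(nodes):
    """Insert a Transpose between each DequantizeLinear and its MatMul/Gemm consumer."""
    producers = {out: n for n in nodes for out in n.outputs}
    transposed = {}  # DQ output name -> Transpose output name
    result = []
    for node in nodes:
        if node.op in ("MatMul", "Gemm") and len(node.inputs) > 1:
            weight = node.inputs[1]
            producer = producers.get(weight)
            # Already-processed consumers are fed by a Transpose, so they are skipped.
            if producer is not None and producer.op == "DequantizeLinear":
                if weight not in transposed:
                    t_out = weight + "_transposed"
                    result.append(
                        Node("Transpose", producer.name + "_transpose",
                             [weight], [t_out], {"perm": [1, 0]})
                    )
                    transposed[weight] = t_out
                node.inputs[1] = transposed[weight]
        result.append(node)
    return result

graph = [
    Node("DequantizeLinear", "dq", ["w_q", "w_scale"], ["w_dq"], {"axis": 1}),
    Node("MatMul", "mm", ["x", "w_dq"], ["y"]),
]
once = insert_transposes_after_dq(graph)
twice = insert_transposes_after_dq(once)  # second pass should change nothing
print([n.op for n in once], len(twice))
```

Emitting the Transpose immediately before its first consumer keeps the node list in topological order, and caching by DQ output name lets several consumers share one Transpose.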

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 3

✅ Passed checks (3 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The PR title clearly and specifically describes the main change: adding column-major storage of weights and scales in INT4 quantization for TRT-RTX model load time improvement. It directly summarizes the primary objective and is well-suited to the changeset. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 90.00% which is sufficient. The required threshold is 80.00%. |


coderabbitai

Actionable comments posted: 1

🤖 Fix all issues with AI agents

In `@modelopt/onnx/quantization/qdq_utils.py`:
- Around line 1083-1091: The current Gemm handling in
apply_column_major_transformation (qdq_utils.py) skips nodes with node.op ==
"Gemm" when node.attrs contains transB=1, which breaks semantics for
column-major weights; instead, when encountering a Gemm with transB set, update
the node.attrs transB to 0 (or remove/normalize it to zero) so the graph expects
B^T (matching the earlier weight transpose) and do not skip inserting the
transpose-back; locate the Gemm handling block (check for node.op == "Gemm" and
the transB logic) and replace the early continue with logic that flips
node.attrs["transB"] to 0 (or deletes the attr) so outputs remain correct while
keeping the transpose-back insertion.

coderabbitai · 2026-01-23T11:50:26Z

modelopt/onnx/quantization/qdq_utils.py

+            # For Gemm nodes, check if transB is already set
+            if node.op == "Gemm":
+                trans_b = False
+                if hasattr(node, "attrs") and "transB" in node.attrs:
+                    trans_b = node.attrs["transB"] > 0
+                if trans_b:
+                    logger.debug(f"Gemm node {node.name} already has transB=1, skipping")
+                    continue
+


⚠️ Potential issue | 🔴 Critical

Gemm transB=1 skip breaks correctness with column‑major weights.
apply_column_major_transformation already transposes weights. If a Gemm has transB=1, skipping the transpose‑back makes Gemm consume B instead of B^T, changing outputs. Either always insert the transpose‑back or flip transB to 0 so Gemm consumes B^T directly.
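The four combinations of "insert the transpose-back" and "flip transB" can be compared numerically. This is a minimal NumPy sketch under the assumption stated in the comment above, namely that the stored weight has already been transposed; `gemm` is a simplified stand-in that models only the transB attribute of ONNX Gemm (no alpha, beta, or C).

```python
import numpy as np

def gemm(a, b, trans_b):
    """Simplified Gemm: only the transB attribute is modelled."""
    return a @ (b.T if trans_b else b)

A = np.arange(6).reshape(2, 3)
B = np.arange(9).reshape(3, 3)      # square, so every variant is shape-valid
reference = gemm(A, B, trans_b=1)   # original graph: Gemm with transB=1

B_col = np.ascontiguousarray(B.T)   # weight after the column-major transpose

transpose_back = gemm(A, B_col.T, trans_b=1)      # insert Transpose, keep transB=1
flip_only = gemm(A, B_col, trans_b=0)             # no Transpose, set transB=0
skip_only = gemm(A, B_col, trans_b=1)             # no Transpose, keep transB=1
flip_and_transpose = gemm(A, B_col.T, trans_b=0)  # Transpose and set transB=0

for name, out in [("transpose_back", transpose_back), ("flip_only", flip_only),
                  ("skip_only", skip_only), ("flip_and_transpose", flip_and_transpose)]:
    print(name, np.array_equal(out, reference))
```

Only the first two variants reproduce the original output: either the Transpose or the transB flip compensates for the stored transpose, while applying both, or neither, leaves the Gemm with the wrong operand.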

🐛 Proposed fix (flip transB to 0 and keep semantics)

-            # For Gemm nodes, check if transB is already set
-            if node.op == "Gemm":
-                trans_b = False
-                if hasattr(node, "attrs") and "transB" in node.attrs:
-                    trans_b = node.attrs["transB"] > 0
-                if trans_b:
-                    logger.debug(f"Gemm node {node.name} already has transB=1, skipping")
-                    continue
+            # For Gemm nodes with transB=1, flip to 0 since weights are already transposed
+            if node.op == "Gemm":
+                trans_b = bool((node.attrs or {}).get("transB", 0))
+                if trans_b:
+                    node.attrs = node.attrs or {}
+                    node.attrs["transB"] = 0
+                    logger.debug(
+                        f"Gemm node {node.name} has transB=1; setting transB=0 for column-major weights"
+                    )
+                    continue



codecov · 2026-01-23T11:53:17Z

Codecov Report

❌ Patch coverage is 24.28571% with 53 lines in your changes missing coverage. Please review.
✅ Project coverage is 73.84%. Comparing base (4f4558a) to head (73252a0).
⚠️ Report is 11 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| modelopt/onnx/quantization/qdq_utils.py | 6.38% | 44 Missing ⚠️ |
| modelopt/onnx/quantization/int4.py | 60.86% | 9 Missing ⚠️ |

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #811      +/-   ##
==========================================
- Coverage   74.17%   73.84%   -0.33%     
==========================================
  Files         192      192              
  Lines       19246    19731     +485     
==========================================
+ Hits        14276    14571     +295     
- Misses       4970     5160     +190


galagam · 2026-01-25T08:09:07Z

@tcherckez-nvidia - do you mind reviewing?

vishalpandya1990 · 2026-01-28T04:17:00Z

modelopt/onnx/quantization/int4.py

    dimension, and in ONNX, weights are always plugged into the RHS (i.e. y = x @ W).
+
+    Args:
+        use_column_major: If True, apply column-major storage optimization for NvTensorRtRtx.


I suggest keeping this feature and its documentation generic rather than tying it to the TRT-RTX EP, so that any EP that needs this feature can use it. Today the TRT-RTX EP is the one that needs it, but nothing in the implementation is specific to that EP.

Similar suggestion for other such comments.

vishalpandya1990 · 2026-01-28T04:19:27Z

examples/windows/onnx_ptq/genai_llm/quantize.py

+        action="store_true",
+        help=(
+            "Apply column-major storage optimization for NvTensorRtRtx execution provider. "
+            "Only applicable for DQ-only quantization (e.g., rtn_dq, awq_lite, awq_clip)."


We need to update README as well, for this new option.

vishalpandya1990 · 2026-01-28T04:20:22Z

examples/windows/onnx_ptq/genai_llm/quantize.py

        f"\n--Quantize-Script-- algo={args.algo}, dataset={args.dataset}, calib_size={args.calib_size}, "
        f"batch_size={args.batch_size}, block_size={args.block_size}, add-position-ids={args.add_position_ids}, "
        f"past-kv={args.add_past_kv_inputs}, rcalib={args.use_random_calib}, device={args.device}, "
        f"use_zero_point={args.use_zero_point}, use_fp32={args.use_fp32} enable_mixed_quant={args.enable_mixed_quant}, "


We should dump this new setting param here.

vishalpandya1990 · 2026-01-28T04:47:43Z

modelopt/onnx/quantization/int4.py

    )
+    # Add transpose nodes for column-major if needed
+    if use_column_major:
+        qdq.add_transpose_nodes_for_column_major(graph_gs)


Minor: maybe rename this utility to insert_***, in line with the existing insert_dq_nodes and other qdq utilities.

vishalpandya1990 · 2026-01-28T04:58:33Z

modelopt/onnx/quantization/int4.py

            gemm_weights_quantized[name] = numpy.asarray(qw)
+        # Apply column-major optimization if flag is set
+        if use_column_major:
+            dq_node_attributes = qdq.apply_column_major_transformation(


(a) We should perhaps add a one-line comment here, something like "transpose the weights and scales inline".

(b) Regarding the apply() method, I feel we should decouple (i) the transpose operation and (ii) the DQ attribute setting. The latter can be handled explicitly, outside of apply(). What do you think?

hthadicherla · 2026-01-28

(b) The problem is that apply_column_major_transformation() is a very small function, so I feel it might be unnecessary to split the transpose and the DQ attribute setting into separate functions.

I'm curious why you think we should do it.

vishalpandya1990 · 2026-01-28T05:01:52Z

Please add a unit test for this.

For reference: https://github.com/NVIDIA/Model-Optimizer/blob/main/tests/unit/onnx

Besides, also check/compare create_test_model_with_int4_dq_reshape_transpose_matmul() in https://github.com/NVIDIA/Model-Optimizer/blob/main/tests/unit/onnx/test_qdq_utils.py.

hthadicherla · 2026-01-28T17:10:12Z

> Please add a unit test for this.
>
> For reference: https://github.com/NVIDIA/Model-Optimizer/blob/main/tests/unit/onnx
>
> Besides, also check/compare create_test_model_with_int4_dq_reshape_transpose_matmul() in https://github.com/NVIDIA/Model-Optimizer/blob/main/tests/unit/onnx/test_qdq_utils.py.

The pattern in the test case you mentioned seems to be DequantizeLinear -> Reshape -> Transpose -> MatMul. I'm not sure why this is being tested; I saw that the Reshape and Transpose nodes are removed later by int4quantexporter anyway. See https://github.com/NVIDIA/Model-Optimizer/blob/main/modelopt/onnx/export/int4_exporter.py#L33-L121

Regardless, it is different from our pattern, which is DequantizeLinear(W^T) -> Transpose -> MatMul. What would the test case check, though, once we create the pattern? One test case I'm thinking of: take dummy weight values and activation/layernorm values, build both the DequantizeLinear(W^T) -> Transpose -> MatMul pattern and the DequantizeLinear(W) -> MatMul pattern, and check whether the MatMul outputs are the same.
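The test idea in the last paragraph could look roughly like the following. It is a NumPy-only sketch that emulates both patterns numerically instead of building and running ONNX models; the helper names, tensor sizes, and tolerances are assumptions made for illustration, and a real unit test would construct the two graphs and run them through an inference session.

```python
import numpy as np

def dequantize(q, scales, block_size, axis):
    """Hypothetical block-wise dequantization along `axis`."""
    return q.astype(np.float32) * np.repeat(scales, block_size, axis=axis)

def matmul_row_major(x, q, scales, block_size):
    # DequantizeLinear(W) -> MatMul
    return x @ dequantize(q, scales, block_size, axis=0)

def matmul_column_major(x, q, scales, block_size):
    # DequantizeLinear(W^T) -> Transpose -> MatMul
    q_t = np.ascontiguousarray(q.T)
    scales_t = np.ascontiguousarray(scales.T)
    w_t = dequantize(q_t, scales_t, block_size, axis=1)
    return x @ w_t.T

def test_column_major_matches_row_major():
    rng = np.random.default_rng(42)
    k, n, block_size = 16, 8, 4
    x = rng.standard_normal((2, k)).astype(np.float32)      # dummy activations
    q = rng.integers(-8, 8, size=(k, n), dtype=np.int8)     # dummy INT4-range weights
    scales = rng.uniform(0.01, 0.1, size=(k // block_size, n)).astype(np.float32)
    expected = matmul_row_major(x, q, scales, block_size)
    actual = matmul_column_major(x, q, scales, block_size)
    # Tolerances allow for a different floating-point summation order.
    np.testing.assert_allclose(actual, expected, rtol=1e-5, atol=1e-5)

test_column_major_matches_row_major()
```

The dequantized weights are identical in both patterns, so any difference in the outputs can only come from accumulation order in the matrix multiply, which is why the comparison uses a tolerance rather than exact equality.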

…viders that need it), added use_column_major to log output and README, and renamed add_transpose_nodes_for_column_major to insert_transpose_nodes_for_column_major with inline comments. Signed-off-by: Hrishith Thadicherla <[email protected]>

hthadicherla · 2026-01-28T17:16:18Z

@vishalpandya1990 I addressed most of the comments. Can you look at the new changes I made, and also at the questions I had about some of the changes you suggested?

Added column-major storage of weights and scales for model load time …

dc4096d

…improvement in TRT-RTX Signed-off-by: Hrishith Thadicherla <[email protected]>

hthadicherla requested review from a team as code owners January 23, 2026 11:42

hthadicherla requested review from galagam and ynankani January 23, 2026 11:42

hthadicherla requested a review from vishalpandya1990 January 23, 2026 11:48

coderabbitai bot reviewed Jan 23, 2026


vishalpandya1990 reviewed Jan 28, 2026

