
Conversation

Contributor

@Kh4ster Kh4ster commented Jan 27, 2026

Closes #791

This PR allows PDLP to solve a batch of problems. Currently, the only supported difference among problems is a single variable bound that differs per climber.
This PR also includes:

  • batch_pdlp_solve API to be leveraged within strong branching
  • Additional cuopt_cli parameter to enable Batch PDLP over Dual Simplex for strong branching
  • Better warm start for Stable3
  • Proper handling of the PDLP solver mode hyper-parameters on the C++ side
  • New infeasibility detection based on the cuPDLPx method (not fully supported in batch mode)
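
To illustrate the intended usage, here is a hypothetical sketch of driving the new batch API; the call shape, argument names, and the (index, lower, upper) bound-change encoding are illustrative assumptions, not the final signature:

#include <tuple>
#include <vector>

// Hypothetical sketch only -- the real signature is defined in solve.hpp in this PR.
// One (variable index, new lower bound, new upper bound) triple per climber:
std::vector<std::tuple<int, double, double>> new_bounds = {
  {3, 0.0, 0.0},  // climber 0: branch variable x3 fixed down
  {3, 1.0, 1.0},  // climber 1: branch variable x3 fixed up
};
// auto solutions = cuopt::linear_programming::batch_pdlp_solve(problem, settings, new_bounds);
// for (std::size_t id = 0; id < new_bounds.size(); ++id) {
//   double obj = solutions.get_objective_value(id);  // per-climber accessor added by this PR
// }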

This PR is not ready to be merged:

  • Need collaborators' approval
  • Need to make sure the non-batch PDLP has no regression

Summary by CodeRabbit

Release Notes

  • New Features

    • Added batch PDLP solver for efficiently solving multiple optimization problems in parallel
    • Added configurable initial step size and primal weight settings for customized optimization behavior
    • Added support for dynamic adjustment of problem bounds during solving
  • Improvements

    • Refined solver parameter configuration through structured settings interface
    • Updated default tolerance values for enhanced convergence behavior
    • Improved multi-mode solver support with additional strategy variants

✏️ Tip: You can customize this high-level summary in your review settings.

@Kh4ster Kh4ster requested review from a team as code owners January 27, 2026 18:19
@Kh4ster Kh4ster mentioned this pull request Jan 27, 2026
@Kh4ster Kh4ster added feature request New feature or request non-breaking Introduces a non-breaking change pdlp mip labels Jan 27, 2026

coderabbitai bot commented Jan 27, 2026

📝 Walkthrough

Walkthrough

This pull request introduces comprehensive batch PDLP solve capabilities, refactors hyper-parameter handling from static globals into runtime-configurable structs, and implements multi-climber context management. The changes enable solving multiple problems with different variable bounds simultaneously while maintaining backward compatibility for single-problem scenarios.
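
As a rough illustration of the hyper-parameter refactor described above (member names are taken from the review comments below; default values and the settings type's template parameters are assumptions):

// Before: tuning knobs lived in extern globals set via setup_device_symbols and friends.
// After (sketch): a plain struct with in-class defaults, carried on the solver settings.
namespace pdlp_hyper_params {
struct pdlp_hyper_params_t {
  bool update_primal_weight_on_initial_solution = false;  // defaults assumed for illustration
  bool update_step_size_on_initial_solution     = false;
  // ... remaining per-mode tuning knobs ...
};
}  // namespace pdlp_hyper_params

// Per-solve configuration instead of mutating process-wide state:
pdlp_solver_settings_t<int, double> settings;  // template arguments assumed
settings.hyper_params.update_primal_weight_on_initial_solution = true;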

Changes

Cohort / File(s) Summary
PDLP Hyper-parameters Refactoring
cpp/include/cuopt/linear_programming/pdlp/pdlp_hyper_params.cuh, cpp/src/linear_programming/pdlp_hyper_params.cu, benchmarks/linear_programming/cuopt/benchmark_helper.hpp, benchmarks/linear_programming/cuopt/run_pdlp.cu
Migrates from extern global hyper-parameters to struct-based pdlp_hyper_params_t with member initialization. Updates benchmark helper and run_pdlp to accept hyper_params as constructor arguments. Removes static global variables.
Batch Solver Infrastructure
cpp/src/linear_programming/pdhg.hpp, cpp/src/linear_programming/pdhg.cu, cpp/src/linear_programming/saddle_point.hpp, cpp/src/linear_programming/saddle_point.cu, cpp/src/linear_programming/pdlp.cuh
Extends constructors with batch-mode parameters (climber_strategies, hyper_params, new_bounds). Adds swap_context() and resize_context() methods. Refactors scalar step-size/weight storage to device_uvector for per-batch operations. Introduces batch-aware buffer management and GPU graph support.
Restart & Termination Strategies
cpp/src/linear_programming/restart_strategy/pdlp_restart_strategy.*, cpp/src/linear_programming/restart_strategy/localized_duality_gap_container.*, cpp/src/linear_programming/termination_strategy/termination_strategy.*, cpp/src/linear_programming/termination_strategy/convergence_information.*, cpp/src/linear_programming/termination_strategy/infeasibility_information.*
Expands constructor signatures to include climber_strategies and hyper_params. Replaces scalar termination/convergence state with per-climber device_uvector storage. Adds segmented-reduction paths for batch computations. Introduces swap/resize context APIs.
Step-Size & Scaling Strategies
cpp/src/linear_programming/step_size_strategy/adaptive_step_size_strategy.*, cpp/src/linear_programming/cusparse_view.*, cpp/src/linear_programming/initial_scaling_strategy/initial_scaling.*
Refactors from device_scalar to device_uvector for primal_weight, step_size, and related quantities. Adds batch-mode branching logic and per-batch CUDA kernel variants. Introduces new cuSPARSE descriptor wrapper and preprocessing utilities for CUDA 12.4+.
Solver Settings & Solutions
cpp/include/cuopt/linear_programming/pdlp/solver_settings.hpp, cpp/include/cuopt/linear_programming/mip/solver_settings.hpp, cpp/include/cuopt/linear_programming/pdlp/solver_solution.hpp, cpp/src/linear_programming/solver_settings.cu, cpp/src/linear_programming/solver_solution.cu, cpp/src/linear_programming/restart_strategy/weighted_average_solution.*
Adds hyper_params member and optional initial-step-size/primal-weight getters/setters. Refactors termination/convergence data from scalars to vectors. Updates constructors to accept rvalue vectors for batch termination info. Tightens default tolerances to 1e-10.
Main Solve Flow & Batch APIs
cpp/include/cuopt/linear_programming/solve.hpp, cpp/src/linear_programming/solve.cuh, cpp/src/linear_programming/solve.cu
Adds batch_pdlp_solve() overloads and simplex_problem_to_mps_data_model() conversion. Updates set_pdlp_solver_mode() to accept non-const settings with parameterized hyper_params. Introduces batch memory estimator and multi-climber warm-start integration.
Batch Utilities & Handlers
cpp/src/linear_programming/optimal_batch_size_handler/*, cpp/src/linear_programming/utilities/ping_pong_graph.cuh, cpp/src/linear_programming/pdlp_climber_strategy.hpp, cpp/src/linear_programming/pdlp_constants.hpp, cpp/src/linear_programming/swap_and_resize_helper.cuh, cpp/src/linear_programming/utils.cuh
Introduces SpMM-based batch-size heuristic, swap/resize helper utilities, climber strategy struct, and batch-aware functional objects (a_times_scalar, batch_safe_div, etc.). Renames is_batch_mode to is_legacy_batch_mode in graph utilities.
Strong Branching & Dual Simplex
cpp/src/dual_simplex/pseudo_costs.*, cpp/src/dual_simplex/branch_and_bound.cpp, cpp/src/dual_simplex/simplex_solver_settings.hpp
Extends strong_branching() to accept user_problem as first parameter and leverage batch_pdlp_solve. Adds mip_batch_pdlp_strong_branching flag to settings. Integrates MPS data model conversion for batch mode dispatch.
MIP Integration
cpp/src/mip/solve.cu, cpp/src/mip/solver.cu, cpp/src/math_optimization/solver_settings.cu, cpp/src/mip/diversity/recombiners/sub_mip.cuh
Removes setup_device_symbols call, propagates hyper_params to scaling constructors, and adds mip_batch_pdlp_strong_branching parameter. Updates sub_mip to use context.settings.hyper_params instead of static constants.
Support & Metadata
cpp/include/cuopt/linear_programming/constants.h, cpp/src/linear_programming/CMakeLists.txt, cpp/src/utilities/copy_helpers.hpp, cpp/src/linear_programming/optimization_problem.cu, cpp/src/mip/diversity/diversity_manager.cu
Reflowed LP/MIP parameter strings for alignment. Replaces pdlp_hyper_params.cu with optimal_batch_size_handler.cu in build. Adds thrust::universal_host_pinned_vector support in copy helpers. Minor type casting consistency updates.
Test Coverage
cpp/tests/linear_programming/pdlp_test.cu, cpp/tests/linear_programming/utilities/pdlp_test_utilities.cuh, cpp/tests/mip/*
Adds extensive batch PDLP test scenarios (simple and complex batch problems, new-bounds validation). Refactors test signatures to use solver_settings.hyper_params and per-batch accessors. Updates termination info extraction to batch-aware API. Removes setup_pdlp helper functions.

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1
❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 3.35% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'Batch PDLP' is concise and directly reflects the main feature being added: batch processing capability for the PDLP solver.
Linked Issues check ✅ Passed The PR successfully implements all primary objectives from issue #791: batch_pdlp_solve API, PDLP solver mode hyperparameter configuration, per-solution termination info, context swap/resize APIs, and expanded test coverage.
Out of Scope Changes check ✅ Passed All changes directly support batch PDLP functionality, including core solver enhancements, API updates, test utilities, and configuration infrastructure. No unrelated refactoring or scope creep detected.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing touches
  • 📝 Generate docstrings

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.


Comment @coderabbitai help to get the list of available commands and usage tips.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 12

Note

Due to the large number of review comments, Critical and Major severity comments were prioritized as inline comments.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (8)
cpp/src/linear_programming/utilities/problem_checking.cu (1)

59-81: Restore bounds validation for non-batch modes.

The bounds check is fully disabled, so non-batch PDLP and MIP warm starts can slip through with out-of-bounds values, which can corrupt solver state or yield incorrect results. Please gate this by an explicit batch-mode flag rather than removing the check globally.

🔧 Suggested fix (gate by a batch-mode flag)
-    // TODO batch mode: Akif/Alice why?
-    // This clashes with batch PDLP warm start which by design will not be within the bounds
-    /*cuopt_expects(!thrust::any_of(op_problem.get_handle_ptr()->get_thrust_policy(),
-                                  thrust::make_counting_iterator(0),
-                                  thrust::make_counting_iterator(0) + op_problem.get_n_variables(),
-                                  [lower_bounds = make_span(op_problem.get_variable_lower_bounds()),
-                                   upper_bounds = make_span(op_problem.get_variable_upper_bounds()),
-                                   assignment_span = make_span(primal_initial_solution),
-                                   int_tol         = 1e-8] __device__(i_t idx) -> bool {
-                                    return assignment_span[idx] < lower_bounds[idx] - int_tol ||
-                                           assignment_span[idx] > upper_bounds[idx] + int_tol;
-                                  }),
-                  error_type_t::ValidationError,
-                  "Initial solution violates variable bounds.");*/
+    // Allow out-of-bounds only when batch mode explicitly requests it.
+    if (!allow_out_of_bounds) {
+      cuopt_expects(!thrust::any_of(op_problem.get_handle_ptr()->get_thrust_policy(),
+                                    thrust::make_counting_iterator(0),
+                                    thrust::make_counting_iterator(0) +
+                                      op_problem.get_n_variables(),
+                                    [lower_bounds = make_span(
+                                       op_problem.get_variable_lower_bounds()),
+                                     upper_bounds = make_span(
+                                       op_problem.get_variable_upper_bounds()),
+                                     assignment_span = make_span(primal_initial_solution),
+                                     int_tol         = 1e-8] __device__(i_t idx) -> bool {
+                                      return assignment_span[idx] < lower_bounds[idx] - int_tol ||
+                                             assignment_span[idx] > upper_bounds[idx] + int_tol;
+                                    }),
+                    error_type_t::ValidationError,
+                    "Initial solution violates variable bounds.");
+    }

You’ll need to thread allow_out_of_bounds from the caller (e.g., batch PDLP settings) so non-batch paths keep the validation. Based on learnings, validate correct initialization of variable bounds before solving.

cpp/src/linear_programming/pdhg.cu (1)

1020-1052: Missing batch matrix view updates after solution buffer swaps in batch mode.

The update_solution method swaps solution buffers and updates single-vector cuSparse views, but does not update the batch matrix views (batch_dual_solutions, batch_current_AtYs, batch_next_AtYs) when batch mode is enabled. Since these batch descriptors point to the underlying buffers that are being swapped, they become stale and point to invalid pointers after the swaps complete (lines 1030-1032).

Add corresponding cusparseDnMatSetValues calls for batch matrix descriptors when batch_mode_ is true, following the pattern used elsewhere in pdlp.cu (e.g., lines 1715, 1756).

cpp/src/mip/diversity/diversity_manager.cu (1)

90-101: config_id is parsed but never used.

The variable config_id is read from the CUOPT_CONFIG_ID environment variable and logged, but it's never actually used within the function or passed to any other code. The [[maybe_unused]] attribute suppresses the compiler warning, indicating the developer was aware of the unused variable. While a function get_local_search_and_lm_from_config in multi_armed_bandit.cuh accepts a config_id parameter, the parsed value is never provided to it.

Complete the implementation if configuration ID is intended to affect solver behavior, or remove the unused parsing if not needed. If this is placeholder code for future work, add a // TODO comment clarifying the intention.

cpp/src/linear_programming/initial_scaling_strategy/initial_scaling.cuh (1)

45-54: Fix dangling reference to hyper_params.

Line 98 passes a reference to the constructor parameter settings.hyper_params, but settings_ (the member copy created at line 89) should be used instead. After construction, the parameter reference becomes dangling. Change line 98 from settings.hyper_params to settings_.hyper_params.

cpp/src/linear_programming/pdlp.cuh (1)

88-90: Fix parameter name mismatch for clarity.

The declaration uses initial_primal_weight for set_initial_step_size, which is confusing and inconsistent with the method intent.

🧹 Proposed fix
-  void set_initial_step_size(f_t initial_primal_weight);
+  void set_initial_step_size(f_t initial_step_size);
cpp/src/linear_programming/termination_strategy/termination_strategy.cu (1)

318-338: Batch + per_constraint_residual uses a single residual for all climbers.

In batch mode, the per-constraint residual path reads a scalar residual (no per-climber indexing), which can incorrectly mark termination when climbers diverge. Please either compute per-climber residuals or explicitly disallow this mode in batch.

🛡️ Suggested guard (disallow until per-climber residuals exist)
 void pdlp_termination_strategy_t<i_t, f_t>::check_termination_criteria()
 {
 #ifdef PDLP_VERBOSE_MODE
   RAFT_CUDA_TRY(cudaDeviceSynchronize());
 #endif
+  cuopt_assert(!(settings_.per_constraint_residual && climber_strategies_.size() > 1),
+               "per_constraint_residual is not supported in batch mode yet");
   const auto [grid_size, block_size] = kernel_config_from_batch_size(climber_strategies_.size());
   check_termination_criteria_kernel<i_t, f_t>
cpp/include/cuopt/linear_programming/pdlp/solver_solution.hpp (1)

243-259: Add Doxygen for new batch accessors

get_terminations_status() and the vector-returning get_additional_termination_informations() are new public APIs but lack Doxygen and batch semantics/ownership notes. Please add brief docs (including id bounds and batch-size expectations). As per coding guidelines, please add Doxygen for new public API.
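
A sketch of the kind of Doxygen being requested above (the exact return types and wording are assumptions):

/**
 * @brief Returns the termination status of each climber in the batch.
 *
 * @return One pdlp_termination_status_t per climber, sized to the batch.
 */
std::vector<pdlp_termination_status_t> get_terminations_status() const;

/**
 * @brief Returns the additional termination information for every climber.
 *
 * @return One additional_termination_information_t entry per climber, in batch order.
 */
std::vector<additional_termination_information_t> get_additional_termination_informations() const;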

cpp/src/linear_programming/solver_solution.cu (1)

25-43: Ensure termination_stats_ is non-empty and aligned with termination_status_

The constructors that only set termination_status_ leave termination_stats_ empty, but getters index termination_stats_[id] after only checking termination_status_.size(). That can pass the assert and still read out of bounds (e.g., get_objective_value(0) on an error-status instance). Initialize a default entry and enforce size parity in getters.

🐛 Proposed fix
optimization_problem_solution_t<i_t, f_t>::optimization_problem_solution_t(
  pdlp_termination_status_t termination_status, rmm::cuda_stream_view stream_view)
  : primal_solution_{0, stream_view},
    dual_solution_{0, stream_view},
    reduced_cost_{0, stream_view},
-    termination_status_{termination_status},
+    termination_status_{termination_status},
+    termination_stats_{additional_termination_information_t{}},
    error_status_(cuopt::logic_error("", cuopt::error_type_t::Success))
{
}

optimization_problem_solution_t<i_t, f_t>::optimization_problem_solution_t(
  cuopt::logic_error error_status_, rmm::cuda_stream_view stream_view)
  : primal_solution_{0, stream_view},
    dual_solution_{0, stream_view},
    reduced_cost_{0, stream_view},
-    termination_status_{pdlp_termination_status_t::NoTermination},
+    termination_status_{pdlp_termination_status_t::NoTermination},
+    termination_stats_{additional_termination_information_t{}},
    error_status_(error_status_)
{
}

f_t optimization_problem_solution_t<i_t, f_t>::get_objective_value(i_t id) const
{
  cuopt_assert(id < termination_status_.size(), "id too big for batch size");
+  cuopt_assert(termination_stats_.size() == termination_status_.size(),
+               "termination_stats_ size mismatch");
  return termination_stats_[id].primal_objective;
}

Also applies to: 288-293, 320-331, 386-391

🤖 Fix all issues with AI agents
In `@cpp/include/cuopt/linear_programming/solve.hpp`:
- Around line 17-19: The public header solve.hpp currently includes an
internal-only header <dual_simplex/user_problem.hpp>, preventing external users
from resolving dual_simplex::user_problem_t; either move/export user_problem.hpp
into the public include tree (e.g., place it under the public include path such
as cpp/include/cuopt/linear_programming/dual_simplex/user_problem.hpp) and
update the project's install/export targets so the header is installed, or
change the include in solve.hpp to a public-facing path that will be installed;
ensure the declaration/definition of dual_simplex::user_problem_t remains
available to any translation unit that includes solve.hpp and update
build/install CMake/EXPORT rules accordingly.

In `@cpp/include/cuopt/linear_programming/utilities/segmented_sum_handler.cuh`:
- Around line 30-31: The default constructor segmented_sum_handler_t() leaves
member stream_view_ uninitialized causing undefined behavior when
segmented_sum_helper or segmented_reduce_helper are later called; fix by
initializing stream_view_ to a safe default (e.g., rmm::cuda_stream_default or
equivalent rmm::cuda_stream_view{}) in the default constructor (or alternatively
delete the default constructor if you want to forbid use), and/or add an
assert/check in segmented_sum_helper/segmented_reduce_helper that stream_view_
is valid before use; update the segmented_sum_handler_t() definition and any
constructors to ensure stream_view_ is always initialized.

In `@cpp/src/dual_simplex/simplex_solver_settings.hpp`:
- Around line 158-159: The member variable mip_batch_pdlp_strong_branching is
declared but not initialized, causing indeterminate values; update the class
constructor (the initializer list in simplex_solver_settings constructor) to
initialize mip_batch_pdlp_strong_branching (e.g., to 0 or 1 as appropriate) or
give it an in-class default at its declaration so the flag has a well-defined
value whenever code checks mip_batch_pdlp_strong_branching.
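
For reference, the in-class default mentioned above could look like this (the class name and the member's type are assumed from the surrounding files):

struct simplex_solver_settings_t {
  // ... existing members ...
  int mip_batch_pdlp_strong_branching = 0;  // 0 = strong branching via dual simplex (assumed meaning)
  // ... existing members ...
};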

In
`@cpp/src/linear_programming/optimal_batch_size_handler/optimal_batch_size_handler.cu`:
- Around line 25-27: The allocation sizes for x, y (and
buffer_non_transpose_batch init) use int multiplication (primal_size *
current_batch_size and dual_size * current_batch_size) which can overflow;
change the computation to cast operands to size_t before multiplying (e.g.,
(size_t)primal_size * (size_t)current_batch_size) and/or compute into a size_t
temp and add an overflow guard (check against SIZE_MAX or compare
current_batch_size > SIZE_MAX / primal_size) before constructing x, y or
buffer_non_transpose_batch; update any local variable types used for these sizes
to size_t and ensure the constructor calls for x, y, buffer_non_transpose_batch
use the size_t result.
- Around line 191-213: In optimal_batch_size_handler, guard against non-positive
max_batch_size before computing log2/pow: add a precondition like "if
(max_batch_size <= 0) return 1;" (or clamp max_batch_size = 1) near the top of
the function before computing current_batch_size so std::log2(std::min(...))
never receives 0 or negative values; this fixes potential NaNs/UB for
current_batch_size derived from initial_batch_size and max_batch_size.
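
Sketches for both optimal_batch_size_handler items above; the variable names come from the comments, while the guards, casts, and surrounding code are illustrative:

// Item 1: widen to size_t before multiplying so the buffer sizes cannot overflow int.
const std::size_t x_size = static_cast<std::size_t>(primal_size) *
                           static_cast<std::size_t>(current_batch_size);
const std::size_t y_size = static_cast<std::size_t>(dual_size) *
                           static_cast<std::size_t>(current_batch_size);
// x(x_size, handle_ptr->get_stream()), y(y_size, handle_ptr->get_stream()), ...

// Item 2: guard before the log2-based heuristic so std::log2 never sees <= 0.
if (max_batch_size <= 0) { return 1; }  // fall back to a single climber
// ... existing current_batch_size computation over std::min(initial_batch_size, max_batch_size) ...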

In `@cpp/src/linear_programming/pdhg.hpp`:
- Around line 27-33: The constructor pdhg_solver_t currently stores const
references to incoming climber_strategies and hyper_params which can dangle
(e.g., when callers pass temporaries or default pdlp_solver_settings_t{} is
used); change the class members to own copies instead of references by replacing
the reference members with value members (e.g.,
pdlp_hyper_params::pdlp_hyper_params_t hyper_params_; and
std::vector<pdlp_climber_strategy_t> climber_strategies_); update the
pdhg_solver_t constructor initialization to copy the passed-in hyper_params and
climber_strategies into these members (remove const& member usage), and ensure
any places that relied on the reference types (including pdlp_solver_t defaults)
now work with the owned copies.

In `@cpp/src/linear_programming/pdlp_constants.hpp`:
- Around line 17-22: The function kernel_config_from_batch_size can divide by
zero when batch_size == 0 because block_size becomes 0 and cuda::ceil_div is
called; add an early guard in kernel_config_from_batch_size to handle batch_size
== 0 (either assert(batch_size > 0) or return a safe pair like {0,0} or {1,0}
per project convention) before computing block_size and calling cuda::ceil_div,
and ensure the change references kernel_config_from_batch_size, block_size,
grid_size, and cuda::ceil_div so reviewers can find the fix.
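
One possible shape for the guarded helper (the block-size cap of 256 and the exact signature are assumptions; only the grid/block pair and the cuda::ceil_div usage come from the review comment):

#include <cuda/cmath>  // cuda::ceil_div
#include <algorithm>
#include <utility>

inline std::pair<int, int> kernel_config_from_batch_size(int batch_size)
{
  cuopt_assert(batch_size > 0, "batch_size must be positive");  // guards against a zero block size
  const int block_size = std::min(batch_size, 256);
  const int grid_size  = cuda::ceil_div(batch_size, block_size);
  return {grid_size, block_size};
}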

In `@cpp/src/linear_programming/restart_strategy/pdlp_restart_strategy.cu`:
- Around line 277-295: The calls to cub::DeviceSegmentedReduce::Sum in
pdlp_restart_strategy.cu (the three usages that pass nullptr, byte_needed,
thrust::make_transform_iterator(..., power_two_func_t<f_t>{}),
reusable_device_scalar_1_.data(), climber_strategies_.size(),
primal_size_h_/dual_size_h_, stream_view_) ignore the returned cudaError_t; wrap
each call with RAFT_CUDA_TRY so failures are checked and propagated (i.e.,
replace the bare cub::DeviceSegmentedReduce::Sum(...) invocations with
RAFT_CUDA_TRY(cub::DeviceSegmentedReduce::Sum(...)) for the segments computing
dot_product_bytes).

In `@cpp/src/linear_programming/solve.cu`:
- Around line 562-572: The mapping of row senses to constraint bounds in the
loop handling user_problem.row_sense is inverted: for 'G' (Ax >= b) set
constraint_lower[i] = user_problem.rhs[i] and constraint_upper[i] =
std::numeric_limits<f_t>::infinity(), and for 'L' (Ax <= b) set
constraint_lower[i] = -std::numeric_limits<f_t>::infinity() and
constraint_upper[i] = user_problem.rhs[i]; update the assignments for
constraint_lower and constraint_upper accordingly where those variables and
user_problem.rhs/row_sense are used.
- Around line 711-719: The code declares an empty std::vector named info of type
optimization_problem_solution_t<i_t, f_t>::additional_termination_information_t
and then accesses info[0], causing undefined behavior; fix by ensuring the
vector contains at least one element before accessing it — e.g., resize info to
1 or construct the element via info.emplace_back() or info.push_back(...) prior
to setting info[0].primal_objective, info[0].number_of_steps_taken, and
info[0].solve_time; reference the symbols info,
optimization_problem_solution_t::additional_termination_information_t,
vertex_solution, and start_solver to locate where to add the resize/emplace.
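
A sketch of the corrected sense-to-bounds mapping from the first solve.cu item above (the loop bounds and the 'E' case are assumptions; only the 'G'/'L' assignments come from the comment):

for (i_t i = 0; i < num_rows; ++i) {
  switch (user_problem.row_sense[i]) {
    case 'G':  // Ax >= b
      constraint_lower[i] = user_problem.rhs[i];
      constraint_upper[i] = std::numeric_limits<f_t>::infinity();
      break;
    case 'L':  // Ax <= b
      constraint_lower[i] = -std::numeric_limits<f_t>::infinity();
      constraint_upper[i] = user_problem.rhs[i];
      break;
    case 'E':  // Ax == b (assumed handling)
      constraint_lower[i] = constraint_upper[i] = user_problem.rhs[i];
      break;
  }
}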

In `@cpp/src/linear_programming/swap_and_resize_helper.cuh`:
- Around line 48-84: matrix_swap currently calls cub::DeviceTransform with
in_zip and out_zip built from the same permutation iterators
(left_index/right_index), which can produce overlapping partial-addresses and
violates CUB's no-overlap precondition; fix by making the transform
non-overlapping: allocate a temporary rmm::device_uvector<f_t> temp(total_items)
(or a temp buffer sized to vector_size * swap_count), run DeviceTransform
writing into temp (using left_index/right_index-derived input and a linear
output), then copy/scatter the swapped values from temp back into matrix at the
target permuted positions (or implement a custom CUDA kernel that performs the
swaps in-place safely); update references in this function (matrix_swap,
left_index, right_index, in_zip, out_zip) accordingly and ensure matrix.stream()
and proper synchronization are preserved.
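
A minimal sketch of the "custom CUDA kernel" option mentioned above, assuming the batch-interleaved layout data[row * batch_size + climber] noted later in the pdhg.cu comments; names and launch parameters are illustrative:

template <typename f_t>
__global__ void swap_climber_columns_kernel(f_t* data, int vector_size, int batch_size,
                                            int left, int right)
{
  const int row = blockIdx.x * blockDim.x + threadIdx.x;
  if (row < vector_size) {
    const f_t tmp                  = data[row * batch_size + left];
    data[row * batch_size + left]  = data[row * batch_size + right];
    data[row * batch_size + right] = tmp;
  }
}

// Per swap pair:
// swap_climber_columns_kernel<<<cuda::ceil_div(vector_size, 256), 256, 0, matrix.stream()>>>(
//   matrix.data(), vector_size, batch_size, left_index, right_index);
// RAFT_CUDA_TRY(cudaPeekAtLastError());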

In `@cpp/src/mip/solve.cu`:
- Around line 62-65: You modified a local copy hyper_params but never propagated
those flag changes into the PDLP solvers; update the flow so the PDLP flags
actually reach the solvers: either assign the modified hyper_params back into
the global settings before constructing mip_solver_t (e.g.,
settings.hyper_params = hyper_params) so mip_solver_t and downstream code uses
the updated flags, or change the PDLP settings initialization path
(pdlp_solver_settings_t and set_pdlp_solver_mode) to accept and preserve these
hyper_params (pass hyper_params into pdlp_solver_settings_t or have
set_pdlp_solver_mode merge/prefer explicit hyper_params instead of overwriting
them). Ensure the chosen fix covers all solver creation sites (mip_solver_t
construction and places that create pdlp_solver_settings_t) so
update_primal_weight_on_initial_solution and
update_step_size_on_initial_solution are honored by PDLP.
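
The first option described above, sketched (the flag values are placeholders; only the member and flag names come from the comment):

auto hyper_params = settings.hyper_params;               // local copy that gets modified
hyper_params.update_primal_weight_on_initial_solution = true;
hyper_params.update_step_size_on_initial_solution     = true;
settings.hyper_params = hyper_params;                    // write back so mip_solver_t sees the flags
// ... then construct mip_solver_t / pdlp_solver_settings_t from `settings` as before ...
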
🟡 Minor comments (13)
cpp/src/linear_programming/pdhg.cu-764-773 (1)

764-773: Missing CUDA error check after kernel launch.

Per coding guidelines, every CUDA kernel launch should have error checking. The refine_initial_primal_projection_kernel launch at line 765 is missing RAFT_CUDA_TRY(cudaPeekAtLastError()) that is present for other kernel launches (e.g., line 175).

🐛 Proposed fix
   refine_initial_primal_projection_kernel<<<grid_size, block_size, 0, stream_view_>>>(
     (i_t)climber_strategies_.size(),
     problem_ptr->n_variables,
     raft::device_span<const i_t>(new_bounds_idx_.data(), new_bounds_idx_.size()),
     raft::device_span<const f_t>(new_bounds_lower_.data(), new_bounds_lower_.size()),
     raft::device_span<const f_t>(new_bounds_upper_.data(), new_bounds_upper_.size()),
     raft::device_span<f_t>(current_saddle_point_state_.get_primal_solution().data(),
                            current_saddle_point_state_.get_primal_solution().size()));
+  RAFT_CUDA_TRY(cudaPeekAtLastError());
 }
cpp/src/linear_programming/pdhg.cu-929-943 (1)

929-943: Missing CUDA error check after kernel launch.

Same issue as above - the refine_primal_projection_batch_kernel launch is missing RAFT_CUDA_TRY(cudaPeekAtLastError()).

🐛 Proposed fix
         refine_primal_projection_batch_kernel<<<grid_size, block_size, 0, stream_view_>>>(
           (i_t)climber_strategies_.size(),
           problem_ptr->n_variables,
           raft::device_span<const i_t>(new_bounds_idx_.data(), new_bounds_idx_.size()),
           raft::device_span<const f_t>(new_bounds_lower_.data(), new_bounds_lower_.size()),
           raft::device_span<const f_t>(new_bounds_upper_.data(), new_bounds_upper_.size()),
           raft::device_span<const f_t>(current_saddle_point_state_.get_primal_solution().data(),
                                        current_saddle_point_state_.get_primal_solution().size()),
           raft::device_span<const f_t>(problem_ptr->objective_coefficients.data(),
                                        problem_ptr->objective_coefficients.size()),
           raft::device_span<const f_t>(current_saddle_point_state_.get_current_AtY().data(),
                                        current_saddle_point_state_.get_current_AtY().size()),
           raft::device_span<const f_t>(primal_step_size.data(), primal_step_size.size()),
           raft::device_span<f_t>(reflected_primal_.data(), reflected_primal_.size()));
+        RAFT_CUDA_TRY(cudaPeekAtLastError());
       }
cpp/src/linear_programming/pdhg.cu-825-842 (1)

825-842: Missing CUDA error check after kernel launch in graph capture block.

The refine_primal_projection_major_batch_kernel launch is missing error checking. While inside a graph capture block, cudaPeekAtLastError() should still be called to catch launch configuration errors.

🐛 Proposed fix
         refine_primal_projection_major_batch_kernel<<<grid_size, block_size, 0, stream_view_>>>(
           (i_t)climber_strategies_.size(),
           problem_ptr->n_variables,
           raft::device_span<const i_t>(new_bounds_idx_.data(), new_bounds_idx_.size()),
           raft::device_span<const f_t>(new_bounds_lower_.data(), new_bounds_lower_.size()),
           raft::device_span<const f_t>(new_bounds_upper_.data(), new_bounds_upper_.size()),
           raft::device_span<const f_t>(current_saddle_point_state_.get_primal_solution().data(),
                                        current_saddle_point_state_.get_primal_solution().size()),
           raft::device_span<const f_t>(problem_ptr->objective_coefficients.data(),
                                        problem_ptr->objective_coefficients.size()),
           raft::device_span<const f_t>(current_saddle_point_state_.get_current_AtY().data(),
                                        current_saddle_point_state_.get_current_AtY().size()),
           raft::device_span<const f_t>(primal_step_size.data(), primal_step_size.size()),
           raft::device_span<f_t>(potential_next_primal_solution_.data(),
                                  potential_next_primal_solution_.size()),
           raft::device_span<f_t>(dual_slack_.data(), dual_slack_.size()),
           raft::device_span<f_t>(reflected_primal_.data(), reflected_primal_.size()));
+        RAFT_CUDA_TRY(cudaPeekAtLastError());
       }
cpp/src/linear_programming/pdhg.cu-189-192 (1)

189-192: Assertion message inconsistent with condition.

The assertion checks new_size < batch_size (strictly less than), but the error message says "less than or equal to". This inconsistency could cause confusion during debugging.

🐛 Proposed fix

If only shrinking is intended:

-  cuopt_assert(new_size < batch_size, "New size must be less than or equal to batch size");
+  cuopt_assert(new_size < batch_size, "New size must be less than batch size");

Or if same-size should be allowed (no-op resize):

-  cuopt_assert(new_size < batch_size, "New size must be less than or equal to batch size");
+  cuopt_assert(new_size <= batch_size, "New size must be less than or equal to batch size");
cpp/src/utilities/copy_helpers.hpp-325-329 (1)

325-329: Remove or document the make_span(universal_host_pinned_vector) overload.

The function at lines 325–329 is never called in the codebase. When universal_host_pinned_vector is actually used (e.g., in pdlp.cu), code correctly bypasses this overload and uses thrust::raw_pointer_cast() directly. Either remove this as dead code, or if intended for future use, add explicit documentation that it requires zero-copy/UVA semantics and should only be passed to device kernels that support unified memory.

cpp/src/linear_programming/saddle_point.cu-77-92 (1)

77-92: Assertion message inconsistent with code logic.

Line 83 states "New size must be less than or equal to batch size" but the actual assertion uses strict less-than (new_size < batch_size). If equality should be allowed, update the condition; otherwise, fix the message.

🔧 Suggested fix
   cuopt_assert(new_size > 0, "New size must be greater than 0");
-  cuopt_assert(new_size < batch_size, "New size must be less than or equal to batch size");
+  cuopt_assert(new_size <= batch_size, "New size must be less than or equal to batch size");

Or if strict less-than is intended:

   cuopt_assert(new_size > 0, "New size must be greater than 0");
-  cuopt_assert(new_size < batch_size, "New size must be less than or equal to batch size");
+  cuopt_assert(new_size < batch_size, "New size must be strictly less than batch size");
cpp/src/mip/diversity/recombiners/sub_mip.cuh-73-84 (1)

73-84: Address or remove the TODO comment before merging.

The TODO at line 73 indicates uncertainty about this change. Please either:

  1. Get confirmation from Akif and Alice that this approach is correct, then remove the TODO
  2. Document why this approach is appropriate

The change itself is sound—replacing hard-coded scaling constants with configurable hyper-parameters from context.settings.hyper_params improves maintainability and aligns with the PR's hyper-parameter consolidation effort.

cpp/include/cuopt/linear_programming/solve.hpp-77-124 (1)

77-124: Tighten docs + fix typos in new batch API.

Line 80+ has typos (“accross”, “thie”) and the doc lacks a @return. Consider clarifying size constraints and error behavior. As per coding guidelines, public headers should clearly document new APIs.

📝 Suggested doc fixes
- * The only difference accross climbers will be one variable bound change.
+ * The only difference across climbers will be one variable bound change.
@@
- *   Let the optimal objective value of thie problem be obj_down
+ *   Let the optimal objective value of this problem be obj_down
@@
- */
+ *
+ * @return optimization_problem_solution_t<i_t, f_t> owning container for the solver solution
+ */
cpp/include/cuopt/linear_programming/pdlp/pdlp_hyper_params.cuh-58-59 (1)

58-59: Fix typo in TODO comment.

“constpexr” should be “constexpr”.

📝 Suggested fix
-// constpexr version of each (Stable2, Stable1, Methodical1, Fast1, Stable3...)
+// constexpr version of each (Stable2, Stable1, Methodical1, Fast1, Stable3...)
cpp/src/linear_programming/restart_strategy/localized_duality_gap_container.cu-107-120 (1)

107-120: Inconsistent assertion message in resize_context.

Line 114 asserts new_size < batch_size but the message says "less than or equal to batch size". Also, restricting new_size < batch_size prevents resizing to the same size, which may be a valid no-op.

Suggested fix
-  cuopt_assert(new_size < batch_size, "New size must be less than or equal to batch size");
+  cuopt_assert(new_size <= batch_size, "New size must be less than or equal to batch size");
cpp/include/cuopt/linear_programming/pdlp/solver_settings.hpp-120-133 (1)

120-133: Resolve the temporary batch-mode TODOs before release.

The public header exposes setters with “TODO batch mode: tmp”, which suggests the API is provisional. Please either finalize the behavior or gate it behind an internal-only path and track a cleanup task to avoid shipping TODOs in the public interface.

cpp/src/linear_programming/step_size_strategy/adaptive_step_size_strategy.cu-154-165 (1)

154-165: Assertion condition for resize_context may be too restrictive.

The assertion new_size < batch_size on line 160 prevents resizing to the current batch size. If resizing to the same size is a no-op, it should be allowed. Consider changing to <=.

🐛 Proposed fix
   [[maybe_unused]] const auto batch_size = static_cast<i_t>(interaction_.size());
   cuopt_assert(batch_size > 0, "Batch size must be greater than 0");
   cuopt_assert(new_size > 0, "New size must be greater than 0");
-  cuopt_assert(new_size < batch_size, "New size must be less than or equal to batch size");
+  cuopt_assert(new_size <= batch_size, "New size must be less than or equal to batch size");
cpp/src/linear_programming/termination_strategy/convergence_information.cu-249-270 (1)

249-270: Assertion condition in resize_context is too restrictive.

Similar to the adaptive_step_size_strategy, the assertion new_size < batch_size on line 255 prevents resizing to the current size. The comment says "less than or equal" but the code uses strict less-than.

🐛 Proposed fix
   [[maybe_unused]] const auto batch_size = static_cast<i_t>(primal_objective_.size());
   cuopt_assert(batch_size > 0, "Batch size must be greater than 0");
   cuopt_assert(new_size > 0, "New size must be greater than 0");
-  cuopt_assert(new_size < batch_size, "New size must be less than or equal to batch size");
+  cuopt_assert(new_size <= batch_size, "New size must be less than or equal to batch size");
🧹 Nitpick comments (26)
cpp/src/linear_programming/pdhg.cu (3)

97-109: Consider using structured bindings or std::transform for cleaner tuple unpacking.

The loop manually unpacks each tuple element. While functional, this could be simplified using std::transform with a projection or structured bindings in a single pass.

♻️ Optional refactor using std::transform
   if (!new_bounds.empty()) {
     std::vector<i_t> idx(new_bounds.size());
     std::vector<f_t> lower(new_bounds.size());
     std::vector<f_t> upper(new_bounds.size());
-    for (size_t i = 0; i < new_bounds.size(); ++i) {
-      idx[i]   = std::get<0>(new_bounds[i]);
-      lower[i] = std::get<1>(new_bounds[i]);
-      upper[i] = std::get<2>(new_bounds[i]);
-    }
+    std::transform(new_bounds.begin(), new_bounds.end(), idx.begin(),
+                   [](const auto& t) { return std::get<0>(t); });
+    std::transform(new_bounds.begin(), new_bounds.end(), lower.begin(),
+                   [](const auto& t) { return std::get<1>(t); });
+    std::transform(new_bounds.begin(), new_bounds.end(), upper.begin(),
+                   [](const auto& t) { return std::get<2>(t); });
     raft::copy(new_bounds_idx_.data(), idx.data(), idx.size(), stream_view_);
     raft::copy(new_bounds_lower_.data(), lower.data(), lower.size(), stream_view_);
     raft::copy(new_bounds_upper_.data(), upper.data(), upper.size(), stream_view_);
   }

445-541: Consider templating to reduce duplication between batch and non-batch projection functors.

There are 4 pairs of nearly identical functors (non-batch with pointer and batch with value parameter). Per coding guidelines, this duplication (8 functors with similar logic) could be reduced using a template approach.

♻️ Example: Unified template approach
template <typename f_t, bool IsBatch>
struct primal_reflected_projection_impl {
  using f_t2 = typename type_2<f_t>::type;
  
  // For non-batch: step_size_ptr points to single scalar
  // For batch: step_size_ptr points to array, indexed by batch_idx
  const f_t* step_size_ptr;
  
  HDI f_t get_step_size([[maybe_unused]] int batch_idx) const {
    if constexpr (IsBatch) {
      return step_size_ptr[batch_idx];
    } else {
      return *step_size_ptr;
    }
  }
  
  HDI f_t operator()(f_t current_primal, f_t objective, f_t Aty, f_t2 bounds, int batch_idx = 0) {
    const f_t step_size = get_step_size(batch_idx);
    cuopt_assert(step_size != f_t(0.0), "Scalar can't be 0");
    const f_t next = current_primal - step_size * (objective - Aty);
    const f_t next_clamped = raft::max<f_t>(raft::min<f_t>(next, bounds.y), bounds.x);
    return f_t(2.0) * next_clamped - current_primal;
  }
};

556-559: Indexing pattern is consistent but consider documenting the memory layout.

The bulk operations use batch_idx = idx % batch_size and var_idx = idx / batch_size, indicating a batch-interleaved (column-major w.r.t. batch) memory layout. This enables coalesced access for the batch dimension. A brief comment documenting this layout choice would aid maintainability.

+  // Memory layout: data[var_idx * batch_size + batch_idx]
+  // This interleaves batch elements for coalesced access when processing batches in parallel
   HDI void operator()(size_t idx)
   {
     const int batch_idx = idx % batch_size;
     const int var_idx   = idx / batch_size;
cpp/include/cuopt/linear_programming/constants.h (1)

55-55: New public API constant for batch PDLP strong branching.

The new CUOPT_MIP_BATCH_PDLP_STRONG_BRANCHING constant is added to support the batch PDLP feature. As per coding guidelines for public header files, consider adding a Doxygen comment describing this parameter's purpose and valid values (0 or 1 based on the solver_settings.cu registration).

cpp/src/linear_programming/optimization_problem.cu (1)

731-732: Inconsistent type usage remains in findMaxAbs.

Line 732 still returns bare 0.0 for empty vectors, while the analogous case in findMinAbs (line 745) was updated to f_t(0.0). Consider updating for consistency:

♻️ Suggested fix
   auto findMaxAbs = [](const std::vector<f_t>& vec) -> f_t {
-    if (vec.empty()) { return 0.0; }
+    if (vec.empty()) { return f_t(0.0); }
cpp/src/linear_programming/pdlp_climber_strategy.hpp (1)

19-21: Optional: Consider adding documentation for pdlp_climber_strategy_t.

The struct stores an original_index used for batch-mode climber strategy handling. Adding a brief doc comment explaining its purpose would improve maintainability.

Regarding templating with i_t: While the struct is used alongside templated classes that parameterize i_t, the current non-templated design with int is used consistently throughout the codebase as a fixed type passed in std::vector<pdlp_climber_strategy_t>. Templating would require updating multiple usage sites across 7+ files. This could be considered if future requirements demand runtime selection of index type sizes, but is not necessary for current usage patterns.

cpp/include/cuopt/linear_programming/mip/solver_settings.hpp (1)

86-102: Document new MIP PDLP settings and resolve the TODO.

These new public members change the API surface; please add Doxygen comments (defaults/behavior) and either clear the TODO or track it in an issue. Also confirm ABI/backward‑compat expectations in release notes.

✍️ Suggested doc update
-  i_t mip_batch_pdlp_strong_branching = 0;
+  /** Enable batch PDLP for strong branching (0 = disabled). */
+  i_t mip_batch_pdlp_strong_branching = 0;
   bool log_to_console                 = true;
@@
-  // TODO check with Akif and Alice
-  pdlp_hyper_params::pdlp_hyper_params_t hyper_params;
+  /** PDLP hyper-parameters used by strong-branching batch PDLP runs. */
+  pdlp_hyper_params::pdlp_hyper_params_t hyper_params;

As per coding guidelines, public C++ headers should document new API fields and call out compatibility considerations.

cpp/src/linear_programming/swap_and_resize_helper.cuh (1)

86-96: Consider using size_t for index parameters to match container size type.

The left_swap_index and right_swap_index parameters are int, but host_vector.size() returns size_t. This could cause issues with large vectors (though unlikely in practice). More importantly, the assertion left_swap_index < right_swap_index doesn't validate that indices are non-negative.

🔧 Suggested improvement
 template <typename host_vector_t>
-void host_vector_swap(host_vector_t& host_vector, int left_swap_index, int right_swap_index)
+void host_vector_swap(host_vector_t& host_vector, size_t left_swap_index, size_t right_swap_index)
 {
-  cuopt_assert(left_swap_index < host_vector.size(), "Left swap index is out of bounds");
-  cuopt_assert(right_swap_index < host_vector.size(), "Right swap index is out of bounds");
+  cuopt_assert(left_swap_index < host_vector.size(), "Left swap index out of bounds");
+  cuopt_assert(right_swap_index < host_vector.size(), "Right swap index out of bounds");
   cuopt_assert(left_swap_index < right_swap_index,
                "Left swap index must be less than right swap index");
cpp/src/linear_programming/saddle_point.cu (1)

57-75: primal_gradient_ excluded from swap_context - verify this is intentional.

The swap_context method swaps all batched buffers except primal_gradient_. This is consistent with the constructor comment that primal_gradient_ is only used in trust region restart mode which doesn't support batch mode. However, if swap_context is ever called in a context where primal_gradient_ matters, this could lead to subtle bugs.

Consider adding a comment here explaining why primal_gradient_ is excluded, mirroring the constructor comment.

📝 Suggested documentation
   matrix_swap(dual_gradient_, dual_size_, swap_pairs);
   matrix_swap(current_AtY_, primal_size_, swap_pairs);
   matrix_swap(next_AtY_, primal_size_, swap_pairs);
+  // Note: primal_gradient_ is not swapped as it's only used in trust region restart mode
+  // which does not support batch mode
 }
cpp/include/cuopt/linear_programming/utilities/segmented_sum_handler.cuh (2)

38-51: Inconsistent stream parameter usage between helper methods.

segmented_sum_helper passes stream_view_ directly to CUB (lines 40, 50), while segmented_reduce_helper uses stream_view_.value() (lines 69, 71, 81). Both should work, but the inconsistency could cause confusion. Consider using the same approach in both methods.

♻️ Suggested consistency fix
   template <typename InputIteratorT, typename OutputIteratorT>
   void segmented_sum_helper(InputIteratorT input,
                             OutputIteratorT output,
                             i_t batch_size,
                             i_t problem_size)
   {
     cub::DeviceSegmentedReduce::Sum(
-      nullptr, byte_needed_, input, output, batch_size, problem_size, stream_view_);
+      nullptr, byte_needed_, input, output, batch_size, problem_size, stream_view_.value());

-    segmented_sum_storage_.resize(byte_needed_, stream_view_);
+    segmented_sum_storage_.resize(byte_needed_, stream_view_.value());

     cub::DeviceSegmentedReduce::Sum(segmented_sum_storage_.data(),
                                     byte_needed_,
                                     input,
                                     output,
                                     batch_size,
                                     problem_size,
-                                    stream_view_);
+                                    stream_view_.value());
   }

Also applies to: 53-82


84-86: Consider making members private with accessors.

The struct exposes byte_needed_, segmented_sum_storage_, and stream_view_ as public members. While this may be intentional for a utility struct, it could lead to accidental misuse. Consider making them private if external access isn't needed.

cpp/src/linear_programming/utils.cuh (2)

91-98: Assertion followed by fallback behavior - clarify intent.

The batch_safe_div functor asserts b != 0 but then returns a when b == 0 anyway. In debug builds, the assertion will trigger; in release builds, it silently returns a. This mixed approach may cause confusion about expected behavior.

Consider either:

  1. Removing the assertion if returning a on division-by-zero is the intended fallback
  2. Keeping only the assertion if division-by-zero is truly an error
♻️ Option 1: Remove assertion if fallback is intended
 template <typename f_t>
 struct batch_safe_div {
   HDI f_t operator()(f_t a, f_t b)
   {
-    cuopt_assert(b != f_t(0), "Division by zero");
     return b != f_t(0) ? a / b : a;
   }
 };
♻️ Option 2: Keep only assertion if it's an error
 template <typename f_t>
 struct batch_safe_div {
   HDI f_t operator()(f_t a, f_t b)
   {
     cuopt_assert(b != f_t(0), "Division by zero");
-    return b != f_t(0) ? a / b : a;
+    return a / b;
   }
 };

194-221: Address TODO: Use i_t instead of int for index types.

Both batch_wrapped_iterator and problem_wrapped_iterator use int for problem_size_, and there's already a TODO comment on line 219. Using int could cause issues with large problem sizes exceeding INT_MAX. Consider templating these on i_t for consistency with the rest of the codebase.

♻️ Suggested fix
-template <typename f_t>
+template <typename i_t, typename f_t>
 struct batch_wrapped_iterator {
-  batch_wrapped_iterator(const f_t* problem_input, int problem_size)
+  batch_wrapped_iterator(const f_t* problem_input, i_t problem_size)
     : problem_input_(problem_input), problem_size_(problem_size)
   {
   }
-  HDI f_t operator()(int id) { return problem_input_[id / problem_size_]; }
+  HDI f_t operator()(i_t id) { return problem_input_[id / problem_size_]; }

   const f_t* problem_input_;
-  int problem_size_;
+  i_t problem_size_;
 };

-template <typename f_t>
+template <typename i_t, typename f_t>
 struct problem_wrapped_iterator {
-  problem_wrapped_iterator(const f_t* problem_input, int problem_size)
+  problem_wrapped_iterator(const f_t* problem_input, i_t problem_size)
     : problem_input_(problem_input), problem_size_(problem_size)
   {
   }
-  HDI f_t operator()(int id) { return problem_input_[id % problem_size_]; }
+  HDI f_t operator()(i_t id) { return problem_input_[id % problem_size_]; }

   const f_t* problem_input_;
-  // TODO use i_t
-  int problem_size_;
+  i_t problem_size_;
 };
cpp/src/linear_programming/restart_strategy/pdlp_restart_strategy.cuh (1)

388-389: Storing const references as class members poses lifetime risk.

The members climber_strategies_ and hyper_params_ are stored as const references. If the referenced objects are destroyed before this pdlp_restart_strategy_t instance, dangling reference access will cause undefined behavior. This is particularly risky in a solver context where objects may be moved or destroyed during algorithm phase transitions.

Consider storing by value or using std::reference_wrapper with explicit lifetime documentation:

♻️ Suggested alternatives
- const std::vector<pdlp_climber_strategy_t>& climber_strategies_;
- const pdlp_hyper_params::pdlp_hyper_params_t& hyper_params_;
+ std::vector<pdlp_climber_strategy_t> climber_strategies_;
+ pdlp_hyper_params::pdlp_hyper_params_t hyper_params_;

Or if references are intentional for performance, add a comment documenting the lifetime requirement.

benchmarks/linear_programming/cuopt/run_pdlp.cu (1)

82-95: Missing else before if (mode == "Stable2") creates inconsistent flow.

The function has an inconsistency: the Stable2 check at line 86 uses a plain if while the subsequent checks use else if. While this doesn't affect correctness since early returns are used, it makes the code style inconsistent.

♻️ Suggested fix for consistency
 static cuopt::linear_programming::pdlp_solver_mode_t string_to_pdlp_solver_mode(
   const std::string& mode)
 {
   if (mode == "Stable1") return cuopt::linear_programming::pdlp_solver_mode_t::Stable1;
-  if (mode == "Stable2")
+  else if (mode == "Stable2")
     return cuopt::linear_programming::pdlp_solver_mode_t::Stable2;
   else if (mode == "Methodical1")
cpp/include/cuopt/linear_programming/pdlp/pdlp_hyper_params.cuh (1)

12-56: Add Doxygen docs for this new public struct.

This is a new public API type; please document its purpose and the key defaults/behaviors for consumers. As per coding guidelines, public headers should document new API surfaces.

cpp/src/mip/solve.cu (1)

61-61: Resolve the TODO before merge.

Line 61 indicates uncertainty about parameter propagation. Please clarify in code or track via an issue.

Would you like me to draft an issue or propose the exact change?

cpp/src/linear_programming/restart_strategy/localized_duality_gap_container.cu (1)

52-59: Incorrect second argument to cudaMemsetAsync.

cudaMemsetAsync takes a byte value (an int), not a floating-point value. Passing f_t(0.0) is implicitly converted to int, which happens to work for zeroing memory, but it's misleading and fragile. Use 0 directly.

Suggested fix
  RAFT_CUDA_TRY(cudaMemsetAsync(primal_solution_.data(),
-                               f_t(0.0),
+                               0,
                                sizeof(f_t) * primal_solution_.size(),
                                handle_ptr->get_stream()));
  RAFT_CUDA_TRY(cudaMemsetAsync(dual_solution_.data(),
-                               f_t(0.0),
+                               0,
                                sizeof(f_t) * dual_solution_.size(),
                                handle_ptr->get_stream()));
cpp/src/dual_simplex/pseudo_costs.cpp (2)

175-180: Consider using reserve for fraction_values to avoid reallocations.

Since the size is known upfront (fractional.size()), reserving capacity improves performance.

Suggested fix
    std::vector<f_t> fraction_values;
+   fraction_values.reserve(fractional.size());

    for (i_t k = 0; k < fractional.size(); k++) {
      const i_t j = fractional[k];
      fraction_values.push_back(original_root_soln_x[j]);
    }

186-196: TODO comment indicates incomplete infeasibility handling.

The code notes that infeasible branches should count as "done" but this is not yet implemented. This could lead to misleading progress statistics in the log output.

Would you like me to help implement the infeasibility counting logic, or open an issue to track this TODO?

cpp/include/cuopt/linear_programming/pdlp/solver_settings.hpp (1)

192-200: Document the tighter infeasibility tolerances.

The default infeasibility tolerances moved to 1e-10, which can materially change convergence behavior and runtime. Please update the public docs/release notes to reflect the new defaults and expected impact. Based on learnings, document numerical tolerances when defaults change.

cpp/src/linear_programming/solve.cu (2)

815-838: Memory estimation loop may not terminate efficiently for large batch sizes.

The batch size reduction loop decrements by 1 each iteration, which can be slow for large batch sizes with significant memory pressure. Consider using binary search or proportional reduction.

♻️ Suggested improvement
   if (memory_estimate > free_mem) {
     use_optimal_batch_size = true;
     // Decrement batch size iteratively until we find a batch size that fits
-    while (memory_max_batch_size > 1) {
-      const size_t memory_estimate = batch_pdlp_memory_estimator(problem, memory_max_batch_size);
-      if (memory_estimate <= free_mem) { break; }
-      memory_max_batch_size--;
+    // Use binary search for faster convergence
+    int low = 1, high = memory_max_batch_size;
+    while (low < high) {
+      int mid = low + (high - low + 1) / 2;
+      if (batch_pdlp_memory_estimator(problem, mid) <= free_mem) {
+        low = mid;
+      } else {
+        high = mid - 1;
+      }
     }
+    memory_max_batch_size = low;
   }

948-949: Misleading variable initialization for only_upper flag.

The variable only_upper is set to false, but the comment suggests "Lower bounds can sometimes generate infeasible instances". The naming is confusing since only_upper = false means both upper and lower bounds are used, not "only upper". Consider renaming for clarity.

cpp/src/linear_programming/step_size_strategy/adaptive_step_size_strategy.hpp (1)

129-130: Dangling reference risk with const reference members.

Storing climber_strategies_ and hyper_params_ as const references creates lifetime dependency on the caller. If the referenced objects are destroyed before this class, you'll have dangling references. This is acceptable if the ownership contract is documented, but consider adding a comment clarifying the expected lifetime.

cpp/include/cuopt/linear_programming/pdlp/solver_solution.hpp (1)

128-147: Document/mitigate the breaking constructor change

These public constructors now require rvalue vectors for termination data (and warm-start moves), which breaks existing callers. Consider adding deprecated overloads or a migration note in the public API docs/release notes to preserve compatibility guidance. As per coding guidelines, please document the API change/migration.

cpp/src/linear_programming/termination_strategy/termination_strategy.hpp (1)

34-44: Verify lifetime of climber_strategies_ reference

pdlp_termination_strategy_t stores climber_strategies_ as a const std::vector<...>&. If the caller passes a temporary or later resizes that vector, this becomes dangling. Please confirm the lifetime contract or store a copy/shared_ptr to make the ownership explicit.

Also applies to: 191-192

Comment on lines +30 to +31
// Empty constructor for when used in non batch mode
segmented_sum_handler_t() {}

⚠️ Potential issue | 🟠 Major

Default constructor leaves stream_view_ uninitialized - potential undefined behavior.

The default constructor doesn't initialize stream_view_, but the helper methods use it. If a caller uses the default constructor and then calls segmented_sum_helper or segmented_reduce_helper, this will result in undefined behavior.

Consider either:

  1. Deleting the default constructor if it shouldn't be used
  2. Initializing to a valid default (e.g., rmm::cuda_stream_default)
  3. Adding runtime checks in the helper methods
🔧 Suggested fix - Option 1: Delete default constructor
-  // Empty constructor for when used in non batch mode
-  segmented_sum_handler_t() {}
+  // Delete default constructor - stream_view_ must be initialized
+  segmented_sum_handler_t() = delete;
🔧 Suggested fix - Option 2: Initialize to default stream
   // Empty constructor for when used in non batch mode
-  segmented_sum_handler_t() {}
+  segmented_sum_handler_t() : stream_view_(rmm::cuda_stream_default) {}
🤖 Prompt for AI Agents
In `@cpp/include/cuopt/linear_programming/utilities/segmented_sum_handler.cuh`
around lines 30 - 31, The default constructor segmented_sum_handler_t() leaves
member stream_view_ uninitialized causing undefined behavior when
segmented_sum_helper or segmented_reduce_helper are later called; fix by
initializing stream_view_ to a safe default (e.g., rmm::cuda_stream_default or
equivalent rmm::cuda_stream_view{}) in the default constructor (or alternatively
delete the default constructor if you want to forbid use), and/or add an
assert/check in segmented_sum_helper/segmented_reduce_helper that stream_view_
is valid before use; update the segmented_sum_handler_t() definition and any
constructors to ensure stream_view_ is always initialized.

Comment on lines +25 to +27
: x(primal_size * current_batch_size, handle_ptr->get_stream()),
y(dual_size * current_batch_size, handle_ptr->get_stream()),
buffer_non_transpose_batch(0, handle_ptr->get_stream()),

⚠️ Potential issue | 🟠 Major

Potential overflow in buffer sizing for large problems.

primal_size * current_batch_size and dual_size * current_batch_size are computed in int, which can overflow before the value is converted to size_t, leading to undersized allocations and memory corruption. Cast to size_t (and optionally add a guard) before multiplying.

🐛 Proposed fix
-    : x(primal_size * current_batch_size, handle_ptr->get_stream()),
-      y(dual_size * current_batch_size, handle_ptr->get_stream()),
+    : x(static_cast<size_t>(primal_size) * static_cast<size_t>(current_batch_size),
+        handle_ptr->get_stream()),
+      y(static_cast<size_t>(dual_size) * static_cast<size_t>(current_batch_size),
+        handle_ptr->get_stream()),
🤖 Prompt for AI Agents
In
`@cpp/src/linear_programming/optimal_batch_size_handler/optimal_batch_size_handler.cu`
around lines 25 - 27, The allocation sizes for x, y (and
buffer_non_transpose_batch init) use int multiplication (primal_size *
current_batch_size and dual_size * current_batch_size) which can overflow;
change the computation to cast operands to size_t before multiplying (e.g.,
(size_t)primal_size * (size_t)current_batch_size) and/or compute into a size_t
temp and add an overflow guard (check against SIZE_MAX or compare
current_batch_size > SIZE_MAX / primal_size) before constructing x, y or
buffer_non_transpose_batch; update any local variable types used for these sizes
to size_t and ensure the constructor calls for x, y, buffer_non_transpose_batch
use the size_t result.

Comment on lines 191 to 213
int optimal_batch_size_handler(const optimization_problem_t<i_t, f_t>& op_problem,
int max_batch_size)
{
if (max_batch_size == 1) return 1;

// Try to quickly find what is the optimal batch size for the problem
// We run the two most ran SpMMs for both A and A_T and compute "time / batch_size"
// The one with the best ratio has the optimal batch size (since can solve most amount of work in
// least time) To try to have something representative we run each SpMM 5 times and take the
// average We do it for both A and A_T and take the sum since both will be run for each batch size

// We start with batch size 128 (power two better for cuSparse) and try to improve by either
// multitipling or dividing by 2 each time At max we take 5 steps of search

constexpr int max_steps = 4; // 4 because we already do one step for direction
constexpr int initial_batch_size = 128;
constexpr int benchmark_runs = 5;
// Take the floor power of two
// This ensures that we always start with a batch size that is a power of two or
// initial_batch_size
int current_batch_size =
std::pow(2, std::floor(std::log2(std::min(initial_batch_size, max_batch_size))));
int optimal_batch_size = current_batch_size;

⚠️ Potential issue | 🟠 Major

Guard max_batch_size before log2/pow.

If max_batch_size <= 0, std::log2 yields -inf (for zero) or NaN (for negative values), and the result propagates through std::pow into current_batch_size. Add a precondition check to avoid this.

🐛 Proposed fix
 int optimal_batch_size_handler(const optimization_problem_t<i_t, f_t>& op_problem,
                                int max_batch_size)
 {
+  cuopt_assert(max_batch_size >= 1, "max_batch_size must be >= 1");
   if (max_batch_size == 1) return 1;
🤖 Prompt for AI Agents
In
`@cpp/src/linear_programming/optimal_batch_size_handler/optimal_batch_size_handler.cu`
around lines 191 - 213, In optimal_batch_size_handler, guard against
non-positive max_batch_size before computing log2/pow: add a precondition like
"if (max_batch_size <= 0) return 1;" (or clamp max_batch_size = 1) near the top
of the function before computing current_batch_size so std::log2(std::min(...))
never receives 0 or negative values; this fixes potential NaNs/UB for
current_batch_size derived from initial_batch_size and max_batch_size.

Comment on lines +277 to +295
cub::DeviceSegmentedReduce::Sum(
nullptr,
byte_needed,
thrust::make_transform_iterator(reusable_device_scalar_1_.data(), power_two_func_t<f_t>{}),
reusable_device_scalar_1_.data(),
climber_strategies_.size(),
primal_size_h_,
stream_view_);
dot_product_bytes = std::max(dot_product_bytes, byte_needed);

cub::DeviceSegmentedReduce::Sum(
nullptr,
byte_needed,
thrust::make_transform_iterator(reusable_device_scalar_1_.data(), power_two_func_t<f_t>{}),
reusable_device_scalar_1_.data(),
climber_strategies_.size(),
dual_size_h_,
stream_view_);
dot_product_bytes = std::max(dot_product_bytes, byte_needed);

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# First, let's view the actual code at the mentioned lines
cat -n cpp/src/linear_programming/restart_strategy/pdlp_restart_strategy.cu | sed -n '270,300p'

Repository: NVIDIA/cuopt

Length of output: 1285


🏁 Script executed:

# Also check lines 1249-1257
cat -n cpp/src/linear_programming/restart_strategy/pdlp_restart_strategy.cu | sed -n '1240,1265p'

Repository: NVIDIA/cuopt

Length of output: 1332


🏁 Script executed:

# Search for RAFT_CUDA_TRY usage in the codebase to understand the pattern
rg "RAFT_CUDA_TRY" -A 2 -B 2 | head -50

Repository: NVIDIA/cuopt

Length of output: 4526


🏁 Script executed:

# Check if RAFT_CUDA_TRY is defined in headers
rg "define RAFT_CUDA_TRY\|#define.*CUDA_TRY" -A 3

Repository: NVIDIA/cuopt

Length of output: 38


🏁 Script executed:

# Look for CUB error checking patterns in the file itself
rg "cub::" cpp/src/linear_programming/restart_strategy/pdlp_restart_strategy.cu -B 1 -A 1

Repository: NVIDIA/cuopt

Length of output: 619


🏁 Script executed:

# Search for RAFT_CUDA_TRY definition
find . -name "*.h" -o -name "*.hpp" | xargs grep -l "RAFT_CUDA_TRY" | head -5

Repository: NVIDIA/cuopt

Length of output: 204


🏁 Script executed:

# Look at the definition of RAFT_CUDA_TRY
rg "define.*RAFT_CUDA_TRY|RAFT_CUDA_TRY\s*=" -A 5 | head -30

Repository: NVIDIA/cuopt

Length of output: 38


🏁 Script executed:

# Check how DeviceSegmentedReduce is typically used in other codebases or NVIDIA projects
rg "DeviceSegmentedReduce" -B 2 -A 2 | head -80

Repository: NVIDIA/cuopt

Length of output: 7548


🏁 Script executed:

# Check imports/includes in the file to understand error handling mechanisms
head -50 cpp/src/linear_programming/restart_strategy/pdlp_restart_strategy.cu

Repository: NVIDIA/cuopt

Length of output: 1482


🏁 Script executed:

# Check the actual definition of RAFT_CUDA_TRY by looking at cuda_utils.cuh
find . -name "cuda_utils.cuh" | xargs cat | grep -A 20 "RAFT_CUDA_TRY"

Repository: NVIDIA/cuopt

Length of output: 38


🏁 Script executed:

# Check if CUB documentation or any wrapper indicates DeviceSegmentedReduce returns cudaError_t
rg "cudaError_t.*DeviceSegmentedReduce|DeviceSegmentedReduce.*cudaError_t"

Repository: NVIDIA/cuopt

Length of output: 38


🏁 Script executed:

# Search for other similar patterns in the codebase - look for unwrapped CUB calls that might be considered acceptable
rg "cub::(Device|Block)" -B 1 -A 1 | grep -v "RAFT_CUDA_TRY\|RAFT_CUBLAS" | head -40

Repository: NVIDIA/cuopt

Length of output: 3313


🏁 Script executed:

# Check the segmented_sum_handler to see if it has error checking (might be a wrapper class)
cat -n cpp/include/cuopt/linear_programming/utilities/segmented_sum_handler.cuh | head -100

Repository: NVIDIA/cuopt

Length of output: 4119


Add CUDA error checks around CUB segmented reductions

cub::DeviceSegmentedReduce::Sum returns cudaError_t, but the return status is ignored in all three calls. Wrap these with RAFT_CUDA_TRY to ensure CUDA errors surface immediately rather than being silently dropped.

Proposed fix
-    cub::DeviceSegmentedReduce::Sum(
+    RAFT_CUDA_TRY(cub::DeviceSegmentedReduce::Sum(
       nullptr,
       byte_needed,
       thrust::make_transform_iterator(reusable_device_scalar_1_.data(), power_two_func_t<f_t>{}),
       reusable_device_scalar_1_.data(),
       climber_strategies_.size(),
       primal_size_h_,
-      stream_view_);
+      stream_view_));

-    cub::DeviceSegmentedReduce::Sum(
+    RAFT_CUDA_TRY(cub::DeviceSegmentedReduce::Sum(
       nullptr,
       byte_needed,
       thrust::make_transform_iterator(reusable_device_scalar_1_.data(), power_two_func_t<f_t>{}),
       reusable_device_scalar_1_.data(),
       climber_strategies_.size(),
       dual_size_h_,
-      stream_view_);
+      stream_view_));

-    cub::DeviceSegmentedReduce::Sum(
+    RAFT_CUDA_TRY(cub::DeviceSegmentedReduce::Sum(
       dot_product_storage.data(),
       dot_product_bytes,
       thrust::make_transform_iterator(tmp.data(), power_two_func_t<f_t>{}),
       distance_moved.data(),
       climber_strategies_.size(),
       size_of_solutions_h,
-      stream_view_);
+      stream_view_));
🤖 Prompt for AI Agents
In `@cpp/src/linear_programming/restart_strategy/pdlp_restart_strategy.cu` around
lines 277 - 295, The calls to cub::DeviceSegmentedReduce::Sum in
pdlp_restart_strategy.cu (the three usages that pass nullptr, byte_needed,
thrust::make_transform_iterator(..., power_two_func_t<f_t>{}),
reusable_device_scalar_1_.data(), climber_strategies_.size(),
primal_size_h_/dual_size_h_, stream_view_) ignore the returned cudaError_t; wrap
each call with RAFT_CUDA_TRY so failures are checked and propagated (i.e.,
replace the bare cub::DeviceSegmentedReduce::Sum(...) invocations with
RAFT_CUDA_TRY(cub::DeviceSegmentedReduce::Sum(...)) for the segments computing
dot_product_bytes).

Comment on lines 48 to 84
template <typename i_t, typename f_t>
void matrix_swap(rmm::device_uvector<f_t>& matrix,
i_t vector_size,
const thrust::universal_host_pinned_vector<swap_pair_t<i_t>>& swap_pairs)
{
if (swap_pairs.empty()) { return; }

const i_t batch_size = matrix.size() / vector_size;
cuopt_assert(vector_size > 0, "Vector size must be greater than 0");
cuopt_assert(batch_size > 0, "Batch size must be greater than 0");

const size_t swap_count = swap_pairs.size();
const size_t total_items = swap_count * static_cast<size_t>(vector_size);

auto counting = thrust::make_counting_iterator<size_t>(0);
auto left_index = thrust::make_transform_iterator(
counting,
matrix_swap_index_functor<i_t>{thrust::raw_pointer_cast(swap_pairs.data()), vector_size, true});
auto right_index = thrust::make_transform_iterator(
counting,
matrix_swap_index_functor<i_t>{
thrust::raw_pointer_cast(swap_pairs.data()), vector_size, false});

auto left_perm = thrust::make_permutation_iterator(matrix.data(), left_index);
auto right_perm = thrust::make_permutation_iterator(matrix.data(), right_index);
auto in_zip = thrust::make_zip_iterator(left_perm, right_perm);
auto out_zip = thrust::make_zip_iterator(left_perm, right_perm);

cub::DeviceTransform::Transform(
in_zip,
out_zip,
total_items,
[] HD(thrust::tuple<f_t, f_t> values) -> thrust::tuple<f_t, f_t> {
return thrust::make_tuple(thrust::get<1>(values), thrust::get<0>(values));
},
matrix.stream());
}

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🌐 Web query:

CUB DeviceTransform::Transform in-place operations overlapping input output iterators

💡 Result:

cub::DeviceTransform::Transform does not document any guarantees about overlapping input/output ranges (unlike some other CUB algorithms that explicitly spell out when in-place is supported). The header shows it’s a device-wide parallel transform that takes random-access input iterator(s) and an output iterator, with no special “overlap” contract stated in the API. [1]

Because the operation is executed in parallel, you should assume the same rule as thrust::transform: only exact in-place is OK (i.e., output == input for a unary transform), but any partial overlap (e.g., output = input + k, or overlapping subranges) is not safe / effectively undefined due to read-after-write races. Thrust explicitly allows first == result but forbids overlapping ranges otherwise. [2]

Practical guidance

  • Safe: d_out == d_in (true in-place, same base and same indexing).
  • Unsafe: d_out points into [d_in, d_in + num_items) but is not equal to d_in (any shifted/partial overlap), or overlaps across multiple input ranges.
  • If you need shifted/overlapping behavior, use a separate output buffer (or stage through a temp buffer) and then copy back.

Sources

  1. cub/device/device_transform.cuh program listing (shows signatures/dispatch, no overlap preconditions stated). [1]
  2. Thrust transform documentation precondition on overlap. [2]

🏁 Script executed:

# Find files related to swap_pairs and matrix_swap to understand construction and validation
rg -l "swap_pair" --type cpp --type cu --type cuh

Repository: NVIDIA/cuopt

Length of output: 82


🏁 Script executed:

# Examine the swap_and_resize_helper.cuh file in detail
cat -n cpp/src/linear_programming/swap_and_resize_helper.cuh

Repository: NVIDIA/cuopt

Length of output: 4089


🏁 Script executed:

rg "swap_pair" --type-list | head -20

Repository: NVIDIA/cuopt

Length of output: 445


🏁 Script executed:

rg "swap_pairs" -A 3 -B 3

Repository: NVIDIA/cuopt

Length of output: 39341


🏁 Script executed:

rg "matrix_swap" -A 5 -B 5

Repository: NVIDIA/cuopt

Length of output: 9571


Use a temporary buffer or custom kernel for in-place matrix swap.

cub::DeviceTransform::Transform does not support partial overlap between input and output iterators. Here, in_zip and out_zip are created from the same iterator objects but access different memory locations via permutation (left_index and right_index compute different addresses). This violates CUB's precondition that only exact in-place (identical addresses with identical indexing) is safe.

Although calling code validates that pair.left < pair.right, this only ensures distinct indices within each pair. The documented API contract still forbids overlapping access patterns like this. Either:

  • Allocate a temporary buffer, transform into it, then copy back, or
  • Implement a custom CUDA kernel that explicitly handles in-place swaps with proper synchronization semantics.
🤖 Prompt for AI Agents
In `@cpp/src/linear_programming/swap_and_resize_helper.cuh` around lines 48 - 84,
matrix_swap currently calls cub::DeviceTransform with in_zip and out_zip built
from the same permutation iterators (left_index/right_index), which can produce
overlapping partial-addresses and violates CUB's no-overlap precondition; fix by
making the transform non-overlapping: allocate a temporary
rmm::device_uvector<f_t> temp(total_items) (or a temp buffer sized to
vector_size * swap_count), run DeviceTransform writing into temp (using
left_index/right_index-derived input and a linear output), then copy/scatter the
swapped values from temp back into matrix at the target permuted positions (or
implement a custom CUDA kernel that performs the swaps in-place safely); update
references in this function (matrix_swap, left_index, right_index, in_zip,
out_zip) accordingly and ensure matrix.stream() and proper synchronization are
preserved.
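
For illustration, a minimal sketch of the custom-kernel route, assuming the swap pairs reference disjoint columns and that each batch column is stored contiguously (index = column * vector_size + element); the struct and kernel names below are illustrative, not the cuOpt code, and the addressing must be adapted if the real layout is batch-major:

template <typename i_t>
struct swap_pair_sketch_t {
  i_t left;
  i_t right;
};

// One thread per (pair, element): each thread exchanges exactly one pair of
// addresses, so the in-place swap is race-free as long as no column appears in
// more than one pair. Launch with ceil(total / block_size) blocks on matrix.stream().
template <typename i_t, typename f_t>
__global__ void matrix_swap_kernel(f_t* matrix,
                                   i_t vector_size,
                                   const swap_pair_sketch_t<i_t>* swap_pairs,
                                   i_t swap_count)
{
  const size_t tid   = static_cast<size_t>(blockIdx.x) * blockDim.x + threadIdx.x;
  const size_t total = static_cast<size_t>(swap_count) * static_cast<size_t>(vector_size);
  if (tid >= total) { return; }

  const size_t pair_idx = tid / vector_size;  // which swap pair
  const size_t elem_idx = tid % vector_size;  // which element within the column

  const size_t left  = static_cast<size_t>(swap_pairs[pair_idx].left) * vector_size + elem_idx;
  const size_t right = static_cast<size_t>(swap_pairs[pair_idx].right) * vector_size + elem_idx;

  const f_t tmp = matrix[left];
  matrix[left]  = matrix[right];
  matrix[right] = tmp;
}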

… reduce cost indexing, returning optimal climber results even if overall batch hit iteration/time limit (with test), corrected memory estimator with solution, use fast mod div in pdhg, correct row sense handling

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 11

🤖 Fix all issues with AI agents
In `@cpp/src/linear_programming/pdhg.cu`:
- Around line 767-775: The kernel launch of
refine_initial_primal_projection_kernel lacks CUDA error checking; after the
launch that uses grid_size, block_size and stream_view_, add a CUDA error check
(e.g., wrap a RAFT_CUDA_TRY(cudaPeekAtLastError()) or
CUDA_CHECK(cudaPeekAtLastError()) call) to catch launch failures and ensure
stream synchronization errors are surfaced; update the surrounding code where
refine_initial_primal_projection_kernel is invoked to call the chosen macro
immediately after the <<<...>>> invocation using the same stream_view_ context.
- Around line 827-843: The kernel launch of
refine_primal_projection_major_batch_kernel currently has no CUDA error
checking; after the launch add a CUDA error check (e.g., call cudaGetLastError()
and wrap it with your project’s CUDA_CHECK macro) and also check the stream
(synchronize or cudaStreamQuery on stream_view_ as your codebase prefers) to
surface async launch failures; reference the
refine_primal_projection_major_batch_kernel launch and the stream_view_ used for
the call when adding the checks.
- Line 193: The cuopt_assert call using the condition new_size < batch_size has
a mismatched message; update the assertion to make the message and condition
consistent by either changing the condition to new_size <= batch_size or
changing the message text to "New size must be less than batch size"; locate the
cuopt_assert call that references new_size and batch_size (cuopt_assert(new_size
< batch_size, "...")) and apply one of the two fixes so the condition and the
string match.
- Around line 931-944: After launching refine_primal_projection_batch_kernel,
add CUDA error checking and stream synchronization to catch launch failures:
immediately call the project's CUDA error-check macro or use cudaGetLastError()
and check the return, and also synchronize or check the associated stream
(stream_view_) to surface asynchronous errors; place this right after the kernel
launch that passes new_bounds_idx_, new_bounds_lower_, new_bounds_upper_,
current_saddle_point_state_.get_primal_solution(),
problem_ptr->objective_coefficients,
current_saddle_point_state_.get_current_AtY(), primal_step_size, and
reflected_primal_ to ensure any launch or async execution errors are reported.
- Line 309: The code currently reads the hard-coded global
deterministic_batch_pdlp (defined in pdlp_constants.hpp) when choosing
CUSPARSE_SPMM_CSR_ALG3 vs ALG2 in pdhg.cu; make this behavior
runtime-configurable by adding a boolean field (e.g., deterministic_batch_pdlp)
to the pdlp_solver_settings_t struct, give it the existing default (true), and
thread that setting through all places that construct or receive solver settings
so the PDHG code can read settings->deterministic_batch_pdlp instead of the
global. Update any factory/constructor/parsing logic that builds
pdlp_solver_settings_t to accept/configure the new flag, and replace references
in pdhg.cu (the SpMM algorithm selection sites) and any other files that
currently reference the global to use the settings field. Ensure backward
compatibility by initializing the new field to the previous default when
settings are not explicitly provided.

In `@cpp/src/linear_programming/solve.cu`:
- Around line 956-958: The code currently leaves only_upper=false, which uses
lower bounds that can create undetected infeasible instances; change only_upper
to true (so batch_size = fractional.size()) until the infeasibility-detection
work is completed, and add a clear inline comment at the only_upper declaration
referencing the known detection limitation and the TODO to re-enable lower
bounds once detect_infeasible_instances (or the partial implementation) is
finished; update any logic that depends on batch_size or lower-bound paths
(symbols: only_upper, batch_size, fractional) to behave correctly under
only_upper=true.
- Around line 807-812: Decide and implement behavior for PDLP concerning
pre-solving/scaling: either run on the original unpresolved/unscaled problem by
skipping any presolve/scale steps or add a clear boolean flag (e.g.,
run_on_unscaled or apply_presolve_scale) and ensure PDLP's entry point checks
that flag (refer to variables
primal_dual_init/primal_weight_init/use_optimal_batch_size to find the PDLP init
area) and document the choice in comments; make iteration_limit configurable by
replacing the constexpr int iteration_limit with a configurable parameter read
from a settings struct or function argument (keep the current 100000 as the
default) and ensure the solver returns a partial solution plus a status/flag
when the iteration limit is hit so callers can detect and handle incomplete
results.
- Around line 742-800: The function batch_pdlp_memory_estimator contains an
orphaned trailing comment and a typo in a comment: inside
batch_pdlp_memory_estimator(...) remove the stray comment "Data from saddle
point state" immediately before the return statement, and correct the
misspelling "convergeance" to "convergence" in the comment that currently reads
"Data for the convergeance information"; no code logic changes are needed—only
update those two comment lines within batch_pdlp_memory_estimator.

In `@cpp/src/linear_programming/termination_strategy/convergence_information.cu`:
- Around line 262-269: The assertion in convergence_information_t<i_t,
f_t>::resize_context incorrectly forbids new_size == batch_size while the
message says "less than or equal"; modify the check to allow no-op resizes by
changing the condition to permit equality (i.e., assert new_size <= batch_size)
or, if you prefer to keep strictness, update the assertion message to remove "or
equal"; update the assertion referencing primal_objective_.size() and the
cuopt_assert that currently checks new_size < batch_size so it matches the
intended behavior.

In `@cpp/src/linear_programming/utils.cuh`:
- Around line 241-248: The call to cub::DeviceTransform::Transform (invoked with
problem_wrap_container(op_problem.constraint_lower_bounds/upper_bounds),
combined_bounds, combine_finite_abs_bounds<f_t>(),
op_problem.handle_ptr->get_stream()) can fail and its return status must be
checked; update this call to capture the returned cub::Result (or cudaError_t as
appropriate for your CUB wrapper) and verify it with the project's CUDA
error-checking macro/function (e.g., CUDA_CHECK or equivalent), logging or
propagating the error if the call indicates failure so that failures in
DeviceTransform are detected and handled.

In `@cpp/tests/linear_programming/pdlp_test.cu`:
- Around line 326-357: The test currently sets solver_settings.hyper_params
flags after constructing
cuopt::linear_programming::detail::pdlp_solver_t<int,double> (symbol:
pdlp_solver_t), so the solver instance never observes those toggles; move each
modification of solver_settings.hyper_params
(update_step_size_on_initial_solution, update_primal_weight_on_initial_solution)
to occur before constructing the corresponding pdlp_solver_t instance and
creating the timer_t, then construct the solver, run_solver, synchronize, assert
EXPECT_NEAR, and finally reset the hyper_params flags afterward as currently
done.
🧹 Nitpick comments (10)
cpp/src/linear_programming/termination_strategy/convergence_information.cu (1)

336-349: Clean up duplicate assert and fix the size-mismatch message.

Line 337’s message references primal_size_h_ while the check uses dual_size_h_, and the l2_primal_residual_ assert is duplicated. This is low-risk but improves correctness and readability.

🧹 Suggested cleanup
-  cuopt_assert(
-    primal_residual_.size() == dual_size_h_ * climber_strategies_.size(),
-    "primal_residual_ size must be equal to primal_size_h_ * climber_strategies_.size()");
+  cuopt_assert(
+    primal_residual_.size() == dual_size_h_ * climber_strategies_.size(),
+    "primal_residual_ size must be equal to dual_size_h_ * climber_strategies_.size()");
   cuopt_assert(primal_iterate.size() == primal_size_h_ * climber_strategies_.size(),
                "primal_iterate size must be equal to primal_size_h_ * climber_strategies_.size()");
   cuopt_assert(dual_residual_.size() == primal_size_h_ * climber_strategies_.size(),
                "dual_residual_ size must be equal to primal_size_h_ * climber_strategies_.size()");
   cuopt_assert(dual_iterate.size() == dual_size_h_ * climber_strategies_.size(),
                "dual_iterate size must be equal to dual_size_h_ * climber_strategies_.size()");
   cuopt_assert(l2_primal_residual_.size() == climber_strategies_.size(),
                "l2_primal_residual_ size must be equal to climber_strategies_.size()");
-  cuopt_assert(l2_primal_residual_.size() == climber_strategies_.size(),
-               "l2_primal_residual_ size must be equal to climber_strategies_.size()");
   cuopt_assert(l2_dual_residual_.size() == climber_strategies_.size(),
                "l2_dual_residual_ size must be equal to climber_strategies_.size()");
cpp/src/linear_programming/solve.cu (2)

847-848: Remove unused type alias.

The type alias f_t2 is defined but never used in the function.

🧹 Proposed cleanup
-  using f_t2 = typename type_2<f_t>::type;
-
   // If need warm start, solve the LP alone

834-839: Consider binary search for optimal batch size.

The current linear search for the right batch size (decrementing by 1) could be slow when memory is very constrained. Consider using binary search for better performance.

⚡ Proposed optimization
-    while (memory_max_batch_size > 1) {
-      const size_t memory_estimate =
-        batch_pdlp_memory_estimator(problem, memory_max_batch_size, max_batch_size);
-      if (memory_estimate <= free_mem) { break; }
-      memory_max_batch_size--;
-    }
+    int left = 1, right = memory_max_batch_size;
+    while (left < right) {
+      int mid = left + (right - left + 1) / 2;
+      const size_t memory_estimate =
+        batch_pdlp_memory_estimator(problem, mid, max_batch_size);
+      if (memory_estimate <= free_mem) {
+        left = mid;
+      } else {
+        right = mid - 1;
+      }
+    }
+    memory_max_batch_size = left;
cpp/src/linear_programming/pdhg.cu (2)

447-668: Significant code duplication in projection functors.

The batch and non-batch versions of projection functors (primal_reflected_major_projection, dual_reflected_major_projection, etc.) have nearly identical logic, differing only in whether step size is a pointer or parameter. Similarly, the bulk operation functors duplicate the projection logic.

Consider templating these functors or using a unified implementation to reduce duplication and maintenance burden.

Based on learnings, as per coding guidelines: refactor code duplication in solver components (3+ occurrences) into shared utilities.
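
A sketch of the templating idea, using a small accessor type to abstract over scalar vs. per-climber step sizes so the projection body exists only once (names and the projection math are illustrative, not the actual PDHG functors):

template <typename f_t>
struct scalar_step_size {
  f_t value;
  __host__ __device__ f_t operator()(int /*batch_idx*/) const { return value; }
};

template <typename f_t>
struct batched_step_size {
  const f_t* values;  // one step size per climber
  __host__ __device__ f_t operator()(int batch_idx) const { return values[batch_idx]; }
};

template <typename f_t, typename step_size_accessor_t>
struct primal_projection_sketch {
  step_size_accessor_t step_size;

  __host__ __device__ f_t operator()(f_t x, f_t grad, f_t lb, f_t ub, int batch_idx) const
  {
    const f_t step      = step_size(batch_idx);
    const f_t candidate = x - step * grad;                // gradient step
    const f_t low       = candidate < lb ? lb : candidate;  // clamp to lower bound
    return low > ub ? ub : low;                            // clamp to upper bound
  }
};

Instantiating the functor with scalar_step_size covers the single-problem path and batched_step_size covers the batch path, so the clamp/step logic is written once and maintained in one place.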


556-577: Verify memory coalescing in batch bulk operations.

The bulk operation at Lines 560-577 uses the indexing pattern:

batch_idx = idx % batch_size
var_idx = idx / batch_size

This results in accesses to primal_solution[var_idx * batch_size + batch_idx], objective_coefficients[var_idx], and current_AtY[var_idx * batch_size + batch_idx].

For optimal coalescing, consecutive threads should access consecutive memory addresses. With the current pattern, threads in a warp access different batch indices of the same variable, which may lead to strided access patterns depending on batch size. Consider whether a different data layout (e.g., batch-major storage) would improve memory coalescing.

As per coding guidelines: detect inefficient GPU kernel memory access patterns; optimize for coalesced memory access.

cpp/src/linear_programming/utils.cuh (3)

91-98: Inconsistent assertion with fallback behavior.

The assertion fires when b == 0, but execution continues and returns a anyway. In debug builds this creates confusing behavior (assertion fails but code proceeds), and in release builds the assertion may be stripped, silently masking potential bugs.

Consider either:

  1. Making the assertion the only guard (remove the conditional return), or
  2. Removing the assertion and documenting that returning a on zero divisor is intentional behavior.
Option 1: Assertion-only (fail fast)
 template <typename f_t>
 struct batch_safe_div {
   HDI f_t operator()(f_t a, f_t b)
   {
     cuopt_assert(b != f_t(0), "Division by zero");
-    return b != f_t(0) ? a / b : a;
+    return a / b;
   }
 };
Option 2: Document intentional fallback (remove assertion)
 template <typename f_t>
 struct batch_safe_div {
+  // Returns a / b, or a unchanged if b is zero (safe fallback for batch operations)
   HDI f_t operator()(f_t a, f_t b)
   {
-    cuopt_assert(b != f_t(0), "Division by zero");
     return b != f_t(0) ? a / b : a;
   }
 };

194-205: Integer type limits problem scalability.

Using int for problem_size_ and the index parameter limits batch sizes to ~2.1 billion elements. For large-scale LP problems with millions of variables across many climbers, this could overflow. The TODO comment (line 203) acknowledges this.

Consider templating on the index type to match the solver's i_t parameter, ensuring consistency across problem transformations as per coding guidelines.

♻️ Suggested refactor
-template <typename f_t>
+template <typename i_t, typename f_t>
 struct problem_wrapped_iterator {
-  problem_wrapped_iterator(const f_t* problem_input, int problem_size)
+  problem_wrapped_iterator(const f_t* problem_input, i_t problem_size)
     : problem_input_(problem_input), problem_size_(problem_size)
   {
   }
-  HDI f_t operator()(int id) { return problem_input_[id % problem_size_]; }
+  HDI f_t operator()(i_t id) { return problem_input_[id % problem_size_]; }

   const f_t* problem_input_;
-  // TODO use i_t
-  int problem_size_;
+  i_t problem_size_;
 };

207-213: Counting iterator type should match index type.

thrust::make_counting_iterator(0) produces an int iterator. If problem_wrapped_iterator is updated to use i_t, this function should also be templated on the index type and use thrust::make_counting_iterator(i_t(0)) for consistency.

♻️ Suggested refactor (if adopting i_t)
-template <typename f_t>
+template <typename i_t, typename f_t>
 static inline auto problem_wrap_container(const rmm::device_uvector<f_t>& in)
 {
-  return thrust::make_transform_iterator(thrust::make_counting_iterator(0),
-                                         problem_wrapped_iterator<f_t>(in.data(), in.size()));
+  return thrust::make_transform_iterator(thrust::make_counting_iterator(i_t(0)),
+                                         problem_wrapped_iterator<i_t, f_t>(in.data(), in.size()));
 }
cpp/tests/linear_programming/pdlp_test.cu (2)

602-604: Remove the duplicate set_initial_dual_solution call.
It’s redundant and can be confusing if the setter manages ownership/state.

🧹 Proposed cleanup
-    solver3.set_initial_dual_solution(d_initial_dual);
     solver3.set_initial_dual_solution(d_initial_dual);

1412-1573: Consider gating the 1000-batch stress test if CI time is tight.
If runtime becomes an issue, marking it as slow or parameterizing the batch size could help.
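
If gating is the chosen route, one lightweight option is an opt-in environment variable (sketch only; the test and variable names are illustrative):

#include <cstdlib>
#include <gtest/gtest.h>

TEST(BatchPdlpStressTest, ThousandClimbers)
{
  // Skip by default; opt in explicitly on machines where the runtime is acceptable.
  if (std::getenv("CUOPT_RUN_SLOW_TESTS") == nullptr) {
    GTEST_SKIP() << "set CUOPT_RUN_SLOW_TESTS=1 to run the 1000-climber stress test";
  }
  // ... existing stress-test body from pdlp_test.cu ...
}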

cusparse_view_.batch_dual_solutions,
reusable_device_scalar_value_0_.data(),
cusparse_view_.batch_current_AtYs,
(deterministic_batch_pdlp) ? CUSPARSE_SPMM_CSR_ALG3 : CUSPARSE_SPMM_CSR_ALG2,

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

rg -nC5 'deterministic_batch_pdlp' --type=cpp --type=cu

Repository: NVIDIA/cuopt

Length of output: 82


🏁 Script executed:

rg -nC5 'deterministic_batch_pdlp'

Repository: NVIDIA/cuopt

Length of output: 15847


🏁 Script executed:

# Check if there's any solver settings or parameters class that might expose deterministic_batch_pdlp
rg -n 'class.*[Ss]etting|class.*[Cc]onfig|class.*[Pp]arameter' cpp/src/linear_programming | head -20

Repository: NVIDIA/cuopt

Length of output: 258


🏁 Script executed:

# Find the solver_settings header file
find . -name "*solver_settings*" -type f

Repository: NVIDIA/cuopt

Length of output: 779


🏁 Script executed:

# Check the PDLP solver settings header for deterministic_batch_pdlp
cat -n cpp/include/cuopt/linear_programming/pdlp/solver_settings.hpp | head -100

Repository: NVIDIA/cuopt

Length of output: 4124


🏁 Script executed:

# Continue reading the PDLP solver settings header - search for deterministic or batch
rg -n 'deterministic|batch' cpp/include/cuopt/linear_programming/pdlp/solver_settings.hpp

Repository: NVIDIA/cuopt

Length of output: 275


🏁 Script executed:

# Check the context around the batch-related TODO comments and cudss_deterministic
cat -n cpp/include/cuopt/linear_programming/pdlp/solver_settings.hpp | sed -n '115,260p'

Repository: NVIDIA/cuopt

Length of output: 7260


🏁 Script executed:

# Check if deterministic_batch_pdlp is ever modified at runtime
rg -n 'deterministic_batch_pdlp\s*=' --type=cpp --type=cu

Repository: NVIDIA/cuopt

Length of output: 82


🏁 Script executed:

# Check if deterministic_batch_pdlp is ever modified at runtime
rg -n 'deterministic_batch_pdlp\s*='

Repository: NVIDIA/cuopt

Length of output: 149


Make deterministic_batch_pdlp runtime-configurable through solver settings.

The flag deterministic_batch_pdlp is currently a hard-coded static global (initialized to true in pdlp_constants.hpp line 37) that controls algorithm selection for batch SpMM operations across multiple locations (lines 309, 341 in pdhg.cu and elsewhere). This flag must be exposed as a configurable parameter in pdlp_solver_settings_t to allow users to trade off between determinism (ALG3) and performance (ALG2) at runtime, consistent with the PR's goal of making hyperparameters configurable.

🤖 Prompt for AI Agents
In `@cpp/src/linear_programming/pdhg.cu` at line 309, The code currently reads the
hard-coded global deterministic_batch_pdlp (defined in pdlp_constants.hpp) when
choosing CUSPARSE_SPMM_CSR_ALG3 vs ALG2 in pdhg.cu; make this behavior
runtime-configurable by adding a boolean field (e.g., deterministic_batch_pdlp)
to the pdlp_solver_settings_t struct, give it the existing default (true), and
thread that setting through all places that construct or receive solver settings
so the PDHG code can read settings->deterministic_batch_pdlp instead of the
global. Update any factory/constructor/parsing logic that builds
pdlp_solver_settings_t to accept/configure the new flag, and replace references
in pdhg.cu (the SpMM algorithm selection sites) and any other files that
currently reference the global to use the settings field. Ensure backward
compatibility by initializing the new field to the previous default when
settings are not explicitly provided.
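
A minimal sketch of the settings-based approach described above; the struct below is a trimmed stand-in for pdlp_solver_settings_t, not the real declaration, and the call-site snippet is shown as a comment:

// Stand-in for pdlp_solver_settings_t with only the new field shown.
struct pdlp_solver_settings_sketch_t {
  // Defaults to the current global's value so behavior is unchanged unless a
  // caller opts into the faster, non-deterministic SpMM path.
  bool deterministic_batch_pdlp{true};
};

// At the SpMM call sites in pdhg.cu, the algorithm choice would then read the
// settings instead of the global constant:
//   auto alg = settings.deterministic_batch_pdlp ? CUSPARSE_SPMM_CSR_ALG3
//                                                : CUSPARSE_SPMM_CSR_ALG2;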

Comment on lines 767 to 775
refine_initial_primal_projection_kernel<<<grid_size, block_size, 0, stream_view_>>>(
(i_t)climber_strategies_.size(),
problem_ptr->n_variables,
raft::device_span<const i_t>(new_bounds_idx_.data(), new_bounds_idx_.size()),
raft::device_span<const f_t>(new_bounds_lower_.data(), new_bounds_lower_.size()),
raft::device_span<const f_t>(new_bounds_upper_.data(), new_bounds_upper_.size()),
raft::device_span<f_t>(current_saddle_point_state_.get_primal_solution().data(),
current_saddle_point_state_.get_primal_solution().size()));
}

⚠️ Potential issue | 🔴 Critical

Missing CUDA error checking after kernel launch.

The kernel launch for refine_initial_primal_projection_kernel lacks error checking. Per coding guidelines, every CUDA kernel launch must have error checking with CUDA_CHECK or RAFT_CUDA_TRY(cudaPeekAtLastError()).

🐛 Proposed fix
   refine_initial_primal_projection_kernel<<<grid_size, block_size, 0, stream_view_>>>(
     (i_t)climber_strategies_.size(),
     problem_ptr->n_variables,
     raft::device_span<const i_t>(new_bounds_idx_.data(), new_bounds_idx_.size()),
     raft::device_span<const f_t>(new_bounds_lower_.data(), new_bounds_lower_.size()),
     raft::device_span<const f_t>(new_bounds_upper_.data(), new_bounds_upper_.size()),
     raft::device_span<f_t>(current_saddle_point_state_.get_primal_solution().data(),
                            current_saddle_point_state_.get_primal_solution().size()));
+  RAFT_CUDA_TRY(cudaPeekAtLastError());

As per coding guidelines.

🤖 Prompt for AI Agents
In `@cpp/src/linear_programming/pdhg.cu` around lines 767 - 775, The kernel launch
of refine_initial_primal_projection_kernel lacks CUDA error checking; after the
launch that uses grid_size, block_size and stream_view_, add a CUDA error check
(e.g., wrap a RAFT_CUDA_TRY(cudaPeekAtLastError()) or
CUDA_CHECK(cudaPeekAtLastError()) call) to catch launch failures and ensure
stream synchronization errors are surfaced; update the surrounding code where
refine_initial_primal_projection_kernel is invoked to call the chosen macro
immediately after the <<<...>>> invocation using the same stream_view_ context.

Comment on lines 931 to 944
refine_primal_projection_batch_kernel<<<grid_size, block_size, 0, stream_view_>>>(
(i_t)climber_strategies_.size(),
problem_ptr->n_variables,
raft::device_span<const i_t>(new_bounds_idx_.data(), new_bounds_idx_.size()),
raft::device_span<const f_t>(new_bounds_lower_.data(), new_bounds_lower_.size()),
raft::device_span<const f_t>(new_bounds_upper_.data(), new_bounds_upper_.size()),
raft::device_span<const f_t>(current_saddle_point_state_.get_primal_solution().data(),
current_saddle_point_state_.get_primal_solution().size()),
raft::device_span<const f_t>(problem_ptr->objective_coefficients.data(),
problem_ptr->objective_coefficients.size()),
raft::device_span<const f_t>(current_saddle_point_state_.get_current_AtY().data(),
current_saddle_point_state_.get_current_AtY().size()),
raft::device_span<const f_t>(primal_step_size.data(), primal_step_size.size()),
raft::device_span<f_t>(reflected_primal_.data(), reflected_primal_.size()));

⚠️ Potential issue | 🔴 Critical

Missing CUDA error checking after kernel launch.

The kernel launch for refine_primal_projection_batch_kernel lacks error checking. Per coding guidelines, every CUDA kernel launch must have error checking.

🐛 Proposed fix
         refine_primal_projection_batch_kernel<<<grid_size, block_size, 0, stream_view_>>>(
           (i_t)climber_strategies_.size(),
           problem_ptr->n_variables,
           raft::device_span<const i_t>(new_bounds_idx_.data(), new_bounds_idx_.size()),
           raft::device_span<const f_t>(new_bounds_lower_.data(), new_bounds_lower_.size()),
           raft::device_span<const f_t>(new_bounds_upper_.data(), new_bounds_upper_.size()),
           raft::device_span<const f_t>(current_saddle_point_state_.get_primal_solution().data(),
                                        current_saddle_point_state_.get_primal_solution().size()),
           raft::device_span<const f_t>(problem_ptr->objective_coefficients.data(),
                                        problem_ptr->objective_coefficients.size()),
           raft::device_span<const f_t>(current_saddle_point_state_.get_current_AtY().data(),
                                        current_saddle_point_state_.get_current_AtY().size()),
           raft::device_span<const f_t>(primal_step_size.data(), primal_step_size.size()),
           raft::device_span<f_t>(reflected_primal_.data(), reflected_primal_.size()));
+        RAFT_CUDA_TRY(cudaPeekAtLastError());
       }

As per coding guidelines.

🤖 Prompt for AI Agents
In `@cpp/src/linear_programming/pdhg.cu` around lines 931 - 944, After launching
refine_primal_projection_batch_kernel, add CUDA error checking and stream
synchronization to catch launch failures: immediately call the project's CUDA
error-check macro or use cudaGetLastError() and check the return, and also
synchronize or check the associated stream (stream_view_) to surface
asynchronous errors; place this right after the kernel launch that passes
new_bounds_idx_, new_bounds_lower_, new_bounds_upper_,
current_saddle_point_state_.get_primal_solution(),
problem_ptr->objective_coefficients,
current_saddle_point_state_.get_current_AtY(), primal_step_size, and
reflected_primal_ to ensure any launch or async execution errors are reported.

Comment on lines +807 to +812
bool primal_dual_init = true;
bool primal_weight_init = true;
bool use_optimal_batch_size = false;
constexpr int iteration_limit = 100000;
// Shouldn't we work on the unpresolved and/or unscaled problem for PDLP?
// Shouldn't we put an iteration limit? If yes what should we do with the partial solutions?

🛠️ Refactor suggestion | 🟠 Major

Resolve TODO and make iteration_limit configurable.

Two issues:

  1. Line 811: The TODO comment raises an important question about whether batch PDLP should operate on the unpresolved/unscaled problem. This should be resolved before merging.

  2. Line 810: The iteration_limit = 100000 is hardcoded. Consider making this configurable via settings or document why this specific value was chosen.

🤖 Prompt for AI Agents
In `@cpp/src/linear_programming/solve.cu` around lines 807 - 812, Decide and
implement behavior for PDLP concerning pre-solving/scaling: either run on the
original unpresolved/unscaled problem by skipping any presolve/scale steps or
add a clear boolean flag (e.g., run_on_unscaled or apply_presolve_scale) and
ensure PDLP's entry point checks that flag (refer to variables
primal_dual_init/primal_weight_init/use_optimal_batch_size to find the PDLP init
area) and document the choice in comments; make iteration_limit configurable by
replacing the constexpr int iteration_limit with a configurable parameter read
from a settings struct or function argument (keep the current 100000 as the
default) and ensure the solver returns a partial solution plus a status/flag
when the iteration limit is hit so callers can detect and handle incomplete
results.
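
A sketch of the configurable-limit idea, assuming a small settings struct and a status enum; the names are illustrative, not the cuOpt API, and the solve loop is shown only as a comment:

struct batch_pdlp_settings_sketch_t {
  int iteration_limit{100000};  // keep the current hardcoded value as the default
};

enum class batch_solve_status_t { converged, iteration_limit_reached };

// The solve loop would then surface the limit instead of burying it:
//   for (int it = 0; it < settings.iteration_limit; ++it) { ... }
//   return converged_all ? batch_solve_status_t::converged
//                        : batch_solve_status_t::iteration_limit_reached;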

Comment on lines +956 to +958
constexpr bool only_upper = false;
int batch_size = only_upper ? fractional.size() : fractional.size() * 2;


⚠️ Potential issue | 🟠 Major

Document or address the infeasibility detection limitation.

The comment indicates a known issue: "Lower bounds can sometimes generate infeasible instances that we struggle to detect." However, only_upper is set to false, meaning lower bounds are still being used. This could lead to incorrect results if infeasibility goes undetected.

Please either:

  1. Address the infeasibility detection issue (as mentioned in PR objectives, this is partially implemented)
  2. Document this limitation clearly in user-facing documentation
  3. Enable only_upper = true until the detection is reliable
🤖 Prompt for AI Agents
In `@cpp/src/linear_programming/solve.cu` around lines 956 - 958, The code
currently leaves only_upper=false, which uses lower bounds that can create
undetected infeasible instances; change only_upper to true (so batch_size =
fractional.size()) until the infeasibility-detection work is completed, and add
a clear inline comment at the only_upper declaration referencing the known
detection limitation and the TODO to re-enable lower bounds once
detect_infeasible_instances (or the partial implementation) is finished; update
any logic that depends on batch_size or lower-bound paths (symbols: only_upper,
batch_size, fractional) to behave correctly under only_upper=true.
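
The conservative setting described above is a one-line change at the declaration site; a sketch mirroring the quoted snippet (fractional comes from the surrounding code):

// Lower bounds can sometimes generate infeasible instances that batch PDLP
// cannot yet reliably detect; keep only upper-bound climbers until the
// infeasibility-detection work is complete.
constexpr bool only_upper = true;  // TODO: re-enable lower bounds afterwards
int batch_size = only_upper ? fractional.size() : fractional.size() * 2;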

Comment on lines 262 to 269
template <typename i_t, typename f_t>
void convergence_information_t<i_t, f_t>::resize_context(i_t new_size)
{
[[maybe_unused]] const auto batch_size = static_cast<i_t>(primal_objective_.size());
cuopt_assert(batch_size > 0, "Batch size must be greater than 0");
cuopt_assert(new_size > 0, "New size must be greater than 0");
cuopt_assert(new_size < batch_size, "New size must be less than or equal to batch size");


⚠️ Potential issue | 🟡 Minor

Make resize_context accept a no-op resize (or fix the assertion message).

Line 268 currently rejects new_size == batch_size, but the error message says “less than or equal.” Either allow equality (no-op) or adjust the message to avoid confusion.

🔧 Suggested fix (allow no-op resize)
-  cuopt_assert(new_size < batch_size, "New size must be less than or equal to batch size");
+  cuopt_assert(new_size <= batch_size, "New size must be less than or equal to batch size");
+  if (new_size == batch_size) { return; }
🤖 Prompt for AI Agents
In `@cpp/src/linear_programming/termination_strategy/convergence_information.cu`
around lines 262 - 269, The assertion in convergence_information_t<i_t,
f_t>::resize_context incorrectly forbids new_size == batch_size while the
message says "less than or equal"; modify the check to allow no-op resizes by
changing the condition to permit equality (i.e., assert new_size <= batch_size)
or, if you prefer to keep strictness, update the assertion message to remove "or
equal"; update the assertion referencing primal_objective_.size() and the
cuopt_assert that currently checks new_size < batch_size so it matches the
intended behavior.

Comment on lines +241 to 248
cub::DeviceTransform::Transform(
cuda::std::make_tuple(problem_wrap_container(op_problem.constraint_lower_bounds),
problem_wrap_container(op_problem.constraint_upper_bounds)),
combined_bounds.data(),
combined_bounds.size(),
combine_finite_abs_bounds<f_t>(),
op_problem.handle_ptr->get_stream());
}

⚠️ Potential issue | 🟡 Minor

Missing CUDA error check on DeviceTransform.

cub::DeviceTransform::Transform can fail, but its return status isn't checked. Per coding guidelines, CUDA operations require error verification.

🔧 Suggested fix
-    cub::DeviceTransform::Transform(
+    RAFT_CUDA_TRY(cub::DeviceTransform::Transform(
       cuda::std::make_tuple(problem_wrap_container(op_problem.constraint_lower_bounds),
                             problem_wrap_container(op_problem.constraint_upper_bounds)),
       combined_bounds.data(),
       combined_bounds.size(),
       combine_finite_abs_bounds<f_t>(),
-      op_problem.handle_ptr->get_stream());
+      op_problem.handle_ptr->get_stream()));
🤖 Prompt for AI Agents
In `@cpp/src/linear_programming/utils.cuh` around lines 241 - 248, The call to
cub::DeviceTransform::Transform (invoked with
problem_wrap_container(op_problem.constraint_lower_bounds/upper_bounds),
combined_bounds, combine_finite_abs_bounds<f_t>(),
op_problem.handle_ptr->get_stream()) can fail and its return status must be
checked; update this call to capture the returned cub::Result (or cudaError_t as
appropriate for your CUB wrapper) and verify it with the project's CUDA
error-checking macro/function (e.g., CUDA_CHECK or equivalent), logging or
propagating the error if the call indicates failure so that failures in
DeviceTransform are detected and handled.

Comment on lines 326 to 357
  {
    cuopt::linear_programming::detail::pdlp_solver_t<int, double> solver(problem, solver_settings);
    auto pdlp_timer = timer_t(solver_settings.time_limit);
-   cuopt::linear_programming::pdlp_hyper_params::update_step_size_on_initial_solution = true;
+   solver_settings.hyper_params.update_step_size_on_initial_solution = true;
    solver.run_solver(pdlp_timer);
    RAFT_CUDA_TRY(cudaStreamSynchronize(handle_.get_stream()));
-   EXPECT_NEAR(initial_step_size_afiro, solver.get_step_size_h(), factor_tolerance);
-   EXPECT_NEAR(initial_primal_weight_afiro, solver.get_primal_weight_h(), factor_tolerance);
-   cuopt::linear_programming::pdlp_hyper_params::update_step_size_on_initial_solution = false;
+   EXPECT_NEAR(initial_step_size_afiro, solver.get_step_size_h(0), factor_tolerance);
+   EXPECT_NEAR(initial_primal_weight_afiro, solver.get_primal_weight_h(0), factor_tolerance);
+   solver_settings.hyper_params.update_step_size_on_initial_solution = false;
  }
  {
    cuopt::linear_programming::detail::pdlp_solver_t<int, double> solver(problem, solver_settings);
    auto pdlp_timer = timer_t(solver_settings.time_limit);
-   cuopt::linear_programming::pdlp_hyper_params::update_primal_weight_on_initial_solution = true;
+   solver_settings.hyper_params.update_primal_weight_on_initial_solution = true;
    solver.run_solver(pdlp_timer);
    RAFT_CUDA_TRY(cudaStreamSynchronize(handle_.get_stream()));
-   EXPECT_NEAR(initial_step_size_afiro, solver.get_step_size_h(), factor_tolerance);
-   EXPECT_NEAR(initial_primal_weight_afiro, solver.get_primal_weight_h(), factor_tolerance);
-   cuopt::linear_programming::pdlp_hyper_params::update_primal_weight_on_initial_solution = false;
+   EXPECT_NEAR(initial_step_size_afiro, solver.get_step_size_h(0), factor_tolerance);
+   EXPECT_NEAR(initial_primal_weight_afiro, solver.get_primal_weight_h(0), factor_tolerance);
+   solver_settings.hyper_params.update_primal_weight_on_initial_solution = false;
  }
  {
    cuopt::linear_programming::detail::pdlp_solver_t<int, double> solver(problem, solver_settings);
    auto pdlp_timer = timer_t(solver_settings.time_limit);
-   cuopt::linear_programming::pdlp_hyper_params::update_primal_weight_on_initial_solution = true;
-   cuopt::linear_programming::pdlp_hyper_params::update_step_size_on_initial_solution = true;
+   solver_settings.hyper_params.update_primal_weight_on_initial_solution = true;
+   solver_settings.hyper_params.update_step_size_on_initial_solution = true;
    solver.run_solver(pdlp_timer);
    RAFT_CUDA_TRY(cudaStreamSynchronize(handle_.get_stream()));
-   EXPECT_NEAR(initial_step_size_afiro, solver.get_step_size_h(), factor_tolerance);
-   EXPECT_NEAR(initial_primal_weight_afiro, solver.get_primal_weight_h(), factor_tolerance);
-   cuopt::linear_programming::pdlp_hyper_params::update_primal_weight_on_initial_solution = false;
-   cuopt::linear_programming::pdlp_hyper_params::update_step_size_on_initial_solution = false;
+   EXPECT_NEAR(initial_step_size_afiro, solver.get_step_size_h(0), factor_tolerance);
+   EXPECT_NEAR(initial_primal_weight_afiro, solver.get_primal_weight_h(0), factor_tolerance);
+   solver_settings.hyper_params.update_primal_weight_on_initial_solution = false;
+   solver_settings.hyper_params.update_step_size_on_initial_solution = false;
  }

⚠️ Potential issue | 🟡 Minor

Move hyper_params toggles before solver construction.
The flags are set after pdlp_solver_t is created, so the solver may never see the new values and the test won’t actually validate the intended behavior.

🔧 Proposed fix
-  {
-    cuopt::linear_programming::detail::pdlp_solver_t<int, double> solver(problem, solver_settings);
-    auto pdlp_timer = timer_t(solver_settings.time_limit);
-    solver_settings.hyper_params.update_step_size_on_initial_solution = true;
+  {
+    solver_settings.hyper_params.update_step_size_on_initial_solution = true;
+    cuopt::linear_programming::detail::pdlp_solver_t<int, double> solver(problem, solver_settings);
+    auto pdlp_timer = timer_t(solver_settings.time_limit);
     solver.run_solver(pdlp_timer);
     RAFT_CUDA_TRY(cudaStreamSynchronize(handle_.get_stream()));
     EXPECT_NEAR(initial_step_size_afiro, solver.get_step_size_h(0), factor_tolerance);
     EXPECT_NEAR(initial_primal_weight_afiro, solver.get_primal_weight_h(0), factor_tolerance);
     solver_settings.hyper_params.update_step_size_on_initial_solution = false;
   }
-  {
-    cuopt::linear_programming::detail::pdlp_solver_t<int, double> solver(problem, solver_settings);
-    auto pdlp_timer = timer_t(solver_settings.time_limit);
-    solver_settings.hyper_params.update_primal_weight_on_initial_solution = true;
+  {
+    solver_settings.hyper_params.update_primal_weight_on_initial_solution = true;
+    cuopt::linear_programming::detail::pdlp_solver_t<int, double> solver(problem, solver_settings);
+    auto pdlp_timer = timer_t(solver_settings.time_limit);
     solver.run_solver(pdlp_timer);
     RAFT_CUDA_TRY(cudaStreamSynchronize(handle_.get_stream()));
     EXPECT_NEAR(initial_step_size_afiro, solver.get_step_size_h(0), factor_tolerance);
     EXPECT_NEAR(initial_primal_weight_afiro, solver.get_primal_weight_h(0), factor_tolerance);
     solver_settings.hyper_params.update_primal_weight_on_initial_solution = false;
   }
-  {
-    cuopt::linear_programming::detail::pdlp_solver_t<int, double> solver(problem, solver_settings);
-    auto pdlp_timer = timer_t(solver_settings.time_limit);
-    solver_settings.hyper_params.update_primal_weight_on_initial_solution = true;
-    solver_settings.hyper_params.update_step_size_on_initial_solution     = true;
+  {
+    solver_settings.hyper_params.update_primal_weight_on_initial_solution = true;
+    solver_settings.hyper_params.update_step_size_on_initial_solution     = true;
+    cuopt::linear_programming::detail::pdlp_solver_t<int, double> solver(problem, solver_settings);
+    auto pdlp_timer = timer_t(solver_settings.time_limit);
     solver.run_solver(pdlp_timer);
     RAFT_CUDA_TRY(cudaStreamSynchronize(handle_.get_stream()));
     EXPECT_NEAR(initial_step_size_afiro, solver.get_step_size_h(0), factor_tolerance);
     EXPECT_NEAR(initial_primal_weight_afiro, solver.get_primal_weight_h(0), factor_tolerance);
     solver_settings.hyper_params.update_primal_weight_on_initial_solution = false;
     solver_settings.hyper_params.update_step_size_on_initial_solution     = false;
   }
🤖 Prompt for AI Agents
In `@cpp/tests/linear_programming/pdlp_test.cu` around lines 326 - 357, The test
currently sets solver_settings.hyper_params flags after constructing
cuopt::linear_programming::detail::pdlp_solver_t<int,double> (symbol:
pdlp_solver_t), so the solver instance never observes those toggles; move each
modification of solver_settings.hyper_params
(update_step_size_on_initial_solution, update_primal_weight_on_initial_solution)
to occur before constructing the corresponding pdlp_solver_t instance and
creating the timer_t, then construct the solver, run_solver, synchronize, assert
EXPECT_NEAR, and finally reset the hyper_params flags afterward as currently
done.

…ut of bound, reenable initial solution checking if not in batch pdlp mode
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
cpp/include/cuopt/linear_programming/pdlp/solver_solution.hpp (1)

95-147: Update Doxygen for batch-aware constructors and per-id getters.

Docs still describe singular termination data and the new overload at Lines 139-147 lacks a Doxygen block. Please document the batch vector semantics and the meaning/bounds of the id parameter in the per-id accessors.

As per coding guidelines, for public header files (C++ API): Check if new public functions/classes have documentation comments (Doxygen format) and verify parameter descriptions match actual types/behavior.

Also applies to: 198-213
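
To make the request concrete, a rough shape such a block could take for one of the per-id accessors is sketched below; the method name, return type, and wording are illustrative placeholders inferred from this comment, not the header's actual declarations.

/**
 * @brief Returns the termination statistics recorded for one problem of the batch.
 *
 * @note Placeholder sketch; align names and types with the real overload.
 *
 * @param[in] id Index of the problem (climber) within the batch. Must satisfy
 *               0 <= id < number of problems passed to the batch solve; an
 *               out-of-range id is a caller error.
 * @return The termination statistics recorded for problem `id`.
 */
additional_termination_information_t get_additional_termination_information(i_t id) const;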

cpp/src/linear_programming/solver_solution.cu (1)

48-89: Enforce termination vector size invariants to prevent OOB access.

Accessors index termination_stats_ based on termination_status_.size(). If the vectors differ in size or are empty, this can go out-of-bounds. Add a size/non-empty validation in the vector-based constructors (or a shared validator) so accessor checks remain safe.

🛠️ Suggested guard (apply to both vector constructors)
   : primal_solution_(std::move(final_primal_solution)),
     dual_solution_(std::move(final_dual_solution)),
     reduced_cost_(std::move(final_reduced_cost)),
     pdlp_warm_start_data_(std::move(warm_start_data)),
     objective_name_(objective_name),
     var_names_(std::move(var_names)),
     row_names_(std::move(row_names)),
     termination_stats_(std::move(termination_stats)),
     termination_status_(std::move(termination_status)),
     error_status_(cuopt::logic_error("", cuopt::error_type_t::Success))
 {
+  cuopt_expects(termination_stats_.size() == termination_status_.size() &&
+                  !termination_stats_.empty(),
+                error_type_t::ValidationError,
+                "termination_stats and termination_status must have the same non-zero size");
 }

Also applies to: 321-332, 387-392

🧹 Nitpick comments (2)
cpp/src/linear_programming/solve.cu (1)

862-865: Consider using consistent stream for device_uvector copies.

The warm start device_uvectors are constructed using the source solution's stream. While this likely works since warm_start_settings uses the same handle, explicitly using the stream variable from line 814 would make the intent clearer and ensure consistency.

♻️ Optional clarification
     if (primal_dual_init) {
-      initial_primal = rmm::device_uvector<double>(
-        original_solution.get_primal_solution(), original_solution.get_primal_solution().stream());
-      initial_dual      = rmm::device_uvector<double>(original_solution.get_dual_solution(),
-                                                 original_solution.get_dual_solution().stream());
+      initial_primal = rmm::device_uvector<double>(
+        original_solution.get_primal_solution(), stream);
+      initial_dual   = rmm::device_uvector<double>(
+        original_solution.get_dual_solution(), stream);
       initial_step_size = original_solution.get_pdlp_warm_start_data().initial_step_size_;
     }
cpp/include/cuopt/linear_programming/pdlp/solver_solution.hpp (1)

243-259: Consider const overloads to avoid copies and unintended mutation.

get_additional_termination_informations() const returns by value, and there’s no const overload for get_terminations_status(). A const& return for both would avoid copies and accidental mutation in read-only contexts.
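
A minimal, self-contained sketch of what const-qualified accessors could look like; the class and member names below are stand-ins inferred from this comment, not the real solver_solution header.

#include <vector>

struct termination_stats_t {};                       // placeholder for the real stats type
enum class termination_status_t { Optimal, Error };  // placeholder for the real status enum

class solution_like_t {
 public:
  // const& avoids copying the vectors and prevents mutation in read-only contexts.
  const std::vector<termination_stats_t>& get_additional_termination_informations() const
  {
    return termination_stats_;
  }
  const std::vector<termination_status_t>& get_terminations_status() const
  {
    return termination_status_;
  }

 private:
  std::vector<termination_stats_t> termination_stats_;
  std::vector<termination_status_t> termination_status_;
};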

…roblem from public facing api, default init mip_batch_pdlp to 0
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@cpp/src/dual_simplex/pseudo_costs.cpp`:
- Around line 263-264: The batch_pdlp_solve call is missing the configured
settings and thus uses defaults; update the call to pass the existing settings
variable (batch_pdlp_solve(..., fractional, fraction_values, settings)) but
first adjust the solver time limit in settings for strong-branching use (e.g.,
reduce settings.time_limit or set a short per-branch timeout) so the PDLP solver
runs with the intended hyper-parameters; locate the call to batch_pdlp_solve and
the nearby settings variable to apply this change.
🧹 Nitpick comments (1)
cpp/src/dual_simplex/pseudo_costs.cpp (1)

145-146: Use template type i_t instead of int for consistency.

Variables m and n are declared as int but should use the template type i_t for consistency with the rest of the codebase.

📝 Proposed fix
   cuopt::mps_parser::mps_data_model_t<i_t, f_t> mps_model;
-  int m = user_problem.num_rows;
-  int n = user_problem.num_cols;
+  i_t m = user_problem.num_rows;
+  i_t n = user_problem.num_cols;

Comment on lines 263 to 264
const auto mps_model = simplex_problem_to_mps_data_model(original_problem);
const auto solutions = batch_pdlp_solve(original_problem.handle_ptr, mps_model, fractional, fraction_values);

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

# Search for batch_pdlp_solve definition and all call sites
rg -n 'batch_pdlp_solve' --type cpp --type hpp -A 2 -B 1

Repository: NVIDIA/cuopt

Length of output: 83


🏁 Script executed:

# Also search for the function definition in header files
fd . --name "*.hpp" --name "*.h" | xargs rg -l 'batch_pdlp_solve' | head -5

Repository: NVIDIA/cuopt

Length of output: 468


🏁 Script executed:

# Get context around lines 263-264 in pseudo_costs.cpp
sed -n '250,280p' cpp/src/dual_simplex/pseudo_costs.cpp | cat -n

Repository: NVIDIA/cuopt

Length of output: 1641


🏁 Script executed:

# Find batch_pdlp_solve signature in solve.hpp
cat cpp/include/cuopt/linear_programming/solve.hpp | grep -A 20 'batch_pdlp_solve'

Repository: NVIDIA/cuopt

Length of output: 987


🏁 Script executed:

# Check all call sites of batch_pdlp_solve in solve.cu
rg -n 'batch_pdlp_solve' cpp/src/linear_programming/solve.cu -A 1 -B 1

Repository: NVIDIA/cuopt

Length of output: 486


🏁 Script executed:

# Check all call sites in pdlp_test.cu
rg -n 'batch_pdlp_solve' cpp/tests/linear_programming/pdlp_test.cu -A 1 -B 1

Repository: NVIDIA/cuopt

Length of output: 297


🏁 Script executed:

# Get full context around the batch_pdlp_solve call to see what 'settings' is available
sed -n '240,320p' cpp/src/dual_simplex/pseudo_costs.cpp | cat -n

Repository: NVIDIA/cuopt

Length of output: 4069


🏁 Script executed:

# Check what 'settings' contains - look for settings type and hyper_params usage
rg -n 'settings\.' cpp/src/dual_simplex/pseudo_costs.cpp | head -20

Repository: NVIDIA/cuopt

Length of output: 1371


Pass configured settings to batch_pdlp_solve call.

The call to batch_pdlp_solve at line 264 omits the settings parameter and will use the default pdlp_solver_settings_t<i_t, f_t>{}. This ignores configured hyper-parameters that should influence the solver behavior during strong branching. The settings variable is already in scope and should be passed with appropriate time-limit adjustment.

🔧 Proposed fix
     const auto mps_model = simplex_problem_to_mps_data_model(original_problem);
-    const auto solutions = batch_pdlp_solve(original_problem.handle_ptr, mps_model, fractional, fraction_values);
+    // Create PDLP settings from simplex settings
+    cuopt::linear_programming::pdlp_solver_settings_t<i_t, f_t> pdlp_settings;
+    pdlp_settings.hyper_params = settings.hyper_params;
+    pdlp_settings.time_limit = std::max(0.0, settings.time_limit - toc(start_time));
+    const auto solutions = batch_pdlp_solve(original_problem.handle_ptr, mps_model, fractional, fraction_values, pdlp_settings);
🤖 Prompt for AI Agents
In `@cpp/src/dual_simplex/pseudo_costs.cpp` around lines 263 - 264, The
batch_pdlp_solve call is missing the configured settings and thus uses defaults;
update the call to pass the existing settings variable (batch_pdlp_solve(...,
fractional, fraction_values, settings)) but first adjust the solver time limit
in settings for strong-branching use (e.g., reduce settings.time_limit or set a
short per-branch timeout) so the PDLP solver runs with the intended
hyper-parameters; locate the call to batch_pdlp_solve and the nearby settings
variable to apply this change.

…usparse macro, call spmm preprocess only after a certain cuda version, optimal batch size risking 0 evaluation
…ays be equal, use bulk instead of custom kernel so that it's more efficient limits risk of out of bound and have an always good thread number
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

🤖 Fix all issues with AI agents
In `@cpp/src/linear_programming/cusparse_view.cu`:
- Around line 236-262: The call to cusparseSpMM_preprocess in
my_cusparsespmm_preprocess must be guarded by a runtime dlsym check like SpMV;
add a typedef cusparseSpMM_preprocess_sig matching cusparseSpMM_preprocess's
signature, use dynamic_load_runtime::function<cusparseSpMM_preprocess_sig> to
lookup the symbol (handle missing symbol by returning/throwing similarly to
my_cusparsespmmv_preprocess), then invoke the resolved function instead of
calling cusparseSpMM_preprocess directly and wrap the call with
RAFT_CUSPARSE_TRY (keep the existing cusparseSetStream(handle, stream) before
the call).

In
`@cpp/src/linear_programming/optimal_batch_size_handler/optimal_batch_size_handler.cu`:
- Around line 40-46: The assignments of size_t current_batch_size to int
variables (col_primal, ld_primal, col_dual, ld_dual) can silently truncate large
values; update the code to either keep these variables as a size-safe type or
validate-and-cast before use: add an explicit runtime check/assert that
current_batch_size <= INT_MAX (or the max required by the cuSPARSE API) and then
use a checked cast (e.g., static_cast<int>) when assigning to
col_primal/ld_primal/col_dual/ld_dual, or change their declarations from int to
a wider unsigned type that matches primal_size/dual_size and only convert when
calling cuSPARSE. Ensure you reference and modify the variables
current_batch_size, col_primal, ld_primal, col_dual, and ld_dual (and any call
sites that pass them into cuSPARSE descriptors) so no silent narrowing occurs.
- Around line 244-250: The batch-size doubling and sum expressions can overflow
32-bit ints; change multiplications to use a wider intermediate (e.g., int64_t /
long long) before clamping back to int: for the right_node assignment (currently
computing right_node = std::min(current_batch_size * 2, max_batch_size)) and for
the sum expression used elsewhere (the (current_batch_size * 2) +
current_batch_size calculation), compute the doubled/added value in an int64_t
variable (e.g., int64_t doubled = static_cast<int64_t>(current_batch_size) * 2;
int64_t sum = doubled + static_cast<int64_t>(current_batch_size);), then use
std::min with static_cast<int>(...) to clamp against max_batch_size and assign
to right_node or the target variable; apply the same pattern for the other
occurrence referenced in the review to avoid overflow during the multiplication
step.
🧹 Nitpick comments (3)
cpp/src/linear_programming/pdhg.cu (1)

774-782: Consider adding CUDA error check after CUB device operation.

cub::DeviceFor::Bulk launches a kernel internally. While CUB operations are generally reliable, adding RAFT_CUDA_TRY(cudaPeekAtLastError()) after the call would align with the project's error-checking guidelines and catch any launch failures.

As per coding guidelines: "Every CUDA kernel launch and memory operation must have error checking."

🔧 Suggested addition
   cub::DeviceFor::Bulk(climber_strategies_.size(),
                        refine_initial_primal_projection_bulk_op<i_t, f_t>{
                          make_span(new_bounds_idx_),
                          make_span(new_bounds_lower_),
                          make_span(new_bounds_upper_),
                          make_span(current_saddle_point_state_.get_primal_solution()),
                          problem_ptr->n_variables},
                        stream_view_.value());
+  RAFT_CUDA_TRY(cudaPeekAtLastError());
cpp/src/linear_programming/solver_solution.cu (2)

278-286: Clarify batch semantics for solve time.

In batch mode this writes/reads a single solve time across all climbers. Consider storing per-id solve times (or documenting that this is a global solve time) to avoid misleading per-solution stats.

Also applies to: 298-304


375-418: Avoid exposing mutable vectors that can break size invariants.

Returning non-const refs lets callers resize and violate the termination_stats_/termination_status_ size invariant. Consider returning const&/span and adding explicit setters that enforce size equality (or document and enforce “no resize”).
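
One way to keep that invariant while still allowing updates is a single setter that validates both vectors together. The following is a hedged, self-contained sketch with stand-in type and member names, not the actual class.

#include <stdexcept>
#include <utility>
#include <vector>

struct termination_stats_t {};
enum class termination_status_t { Optimal, Error };

class termination_data_t {
 public:
  // Replaces the two mutable-reference getters: both vectors are swapped in
  // together and the size invariant is checked once, here.
  void set_termination_data(std::vector<termination_stats_t> stats,
                            std::vector<termination_status_t> status)
  {
    if (stats.empty() || stats.size() != status.size()) {
      throw std::invalid_argument(
        "termination_stats and termination_status must have the same non-zero size");
    }
    termination_stats_  = std::move(stats);
    termination_status_ = std::move(status);
  }

 private:
  std::vector<termination_stats_t> termination_stats_;
  std::vector<termination_status_t> termination_status_;
};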

Comment on lines +236 to +262
// TODO add proper checking
#if CUDA_VER_12_4_UP
template <typename T,
          typename std::enable_if_t<std::is_same_v<T, float> || std::is_same_v<T, double>>*>
void my_cusparsespmm_preprocess(cusparseHandle_t handle,
                                cusparseOperation_t opA,
                                cusparseOperation_t opB,
                                const T* alpha,
                                const cusparseSpMatDescr_t matA,
                                const cusparseDnMatDescr_t matB,
                                const T* beta,
                                const cusparseDnMatDescr_t matC,
                                cusparseSpMMAlg_t alg,
                                void* externalBuffer,
                                cudaStream_t stream)
{
  auto constexpr float_type = []() constexpr {
    if constexpr (std::is_same_v<T, float>) {
      return CUDA_R_32F;
    } else if constexpr (std::is_same_v<T, double>) {
      return CUDA_R_64F;
    }
  }();
  CUSPARSE_CHECK(cusparseSetStream(handle, stream));
  RAFT_CUSPARSE_TRY(cusparseSpMM_preprocess(
    handle, opA, opB, alpha, matA, matB, beta, matC, float_type, alg, externalBuffer));
}

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

find . -type f -name "cusparse_view.cu" | head -5

Repository: NVIDIA/cuopt

Length of output: 141


🏁 Script executed:

# Once we find the file, let's examine the relevant lines
git ls-files | grep -E "(cusparse|linear_programming)" | head -20

Repository: NVIDIA/cuopt

Length of output: 1217


🏁 Script executed:

wc -l ./cpp/src/linear_programming/cusparse_view.cu

Repository: NVIDIA/cuopt

Length of output: 105


🏁 Script executed:

sed -n '230,270p' ./cpp/src/linear_programming/cusparse_view.cu

Repository: NVIDIA/cuopt

Length of output: 1680


🏁 Script executed:

# Search for SpMV implementation to understand the guard pattern being referenced
rg -A 15 "my_cusparsespmmv_preprocess|dlsym|dynamic_load_runtime" ./cpp/src/linear_programming/cusparse_view.cu | head -50

Repository: NVIDIA/cuopt

Length of output: 2339


🏁 Script executed:

# Check for CUDA version requirements in the file
rg -B 5 -A 5 "CUDA_VER_12_4|cusparseSpMM_preprocess" ./cpp/src/linear_programming/cusparse_view.cu

Repository: NVIDIA/cuopt

Length of output: 3035


🏁 Script executed:

# Search for cusparseSpMM_preprocess_sig definition
rg "cusparseSpMM_preprocess_sig" ./cpp/src/linear_programming/cusparse_view.cu

Repository: NVIDIA/cuopt

Length of output: 38


🏁 Script executed:

# Check where my_cusparsespmm_preprocess is actually called to understand impact
rg -B 3 -A 3 "my_cusparsespmm_preprocess\(" ./cpp/src/linear_programming/cusparse_view.cu | head -40

Repository: NVIDIA/cuopt

Length of output: 1785


🏁 Script executed:

# Check the actual signature of cusparseSpMM_preprocess in CUDA headers or documentation
# to understand what signature typedef would be needed
rg "cusparseSpMV_preprocess_sig|using cusparse_sig" ./cpp/src/linear_programming/cusparse_view.cu -A 5 | head -30

Repository: NVIDIA/cuopt

Length of output: 834


🏁 Script executed:

# Verify the pattern: check if similar runtime guards exist elsewhere in the codebase
rg -B 2 "dynamic_load_runtime::function" ./cpp/src/linear_programming/cusparse_view.cu | head -20

Repository: NVIDIA/cuopt

Length of output: 296


Add runtime symbol guard to cusparseSpMM_preprocess to match SpMV's dlsym protection.

The my_cusparsespmm_preprocess function (lines 242–260) directly calls cusparseSpMM_preprocess without runtime symbol verification. If the binary runs against a pre-12.4 cuSPARSE library, the symbol is absent and resolving it will fail. The my_cusparsespmmv_preprocess (SpMV) implementation correctly uses dynamic_load_runtime::function with a runtime dlsym check; SpMM needs the same guard.

Add a cusparseSpMM_preprocess_sig typedef matching the function's arguments and wrap the call with the dlsym guard pattern already demonstrated in SpMV.

🤖 Prompt for AI Agents
In `@cpp/src/linear_programming/cusparse_view.cu` around lines 236 - 262, The call
to cusparseSpMM_preprocess in my_cusparsespmm_preprocess must be guarded by a
runtime dlsym check like SpMV; add a typedef cusparseSpMM_preprocess_sig
matching cusparseSpMM_preprocess's signature, use
dynamic_load_runtime::function<cusparseSpMM_preprocess_sig> to lookup the symbol
(handle missing symbol by returning/throwing similarly to
my_cusparsespmmv_preprocess), then invoke the resolved function instead of
calling cusparseSpMM_preprocess directly and wrap the call with
RAFT_CUSPARSE_TRY (keep the existing cusparseSetStream(handle, stream) before
the call).
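
For illustration, a stripped-down version of such a guard using plain dlsym is sketched below; the real code would presumably go through the project's dynamic_load_runtime helper, and the exact parameter types should be taken from the installed cusparse.h rather than from this sketch.

#include <dlfcn.h>
#include <cusparse.h>

// Function-pointer type mirroring the arguments the existing call site already passes.
using cusparseSpMM_preprocess_sig = cusparseStatus_t (*)(cusparseHandle_t,
                                                         cusparseOperation_t,
                                                         cusparseOperation_t,
                                                         const void*,
                                                         cusparseSpMatDescr_t,
                                                         cusparseDnMatDescr_t,
                                                         const void*,
                                                         cusparseDnMatDescr_t,
                                                         cudaDataType,
                                                         cusparseSpMMAlg_t,
                                                         void*);

inline cusparseSpMM_preprocess_sig try_load_spmm_preprocess()
{
  // Look the symbol up in the already-loaded cusparse library; on runtimes that
  // do not ship it this returns nullptr and the caller can skip the optional
  // preprocess step instead of failing to load.
  return reinterpret_cast<cusparseSpMM_preprocess_sig>(
    dlsym(RTLD_DEFAULT, "cusparseSpMM_preprocess"));
}

A call site would then check the returned pointer for nullptr before invoking it, wrapping the invocation in RAFT_CUSPARSE_TRY as suggested above.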

Comment on lines +40 to +46
int rows_primal = primal_size;
int col_primal = current_batch_size;
int ld_primal = current_batch_size;

int rows_dual = dual_size;
int col_dual = current_batch_size;
int ld_dual = current_batch_size;

⚠️ Potential issue | 🟡 Minor

Narrowing conversion from size_t to int may truncate large batch sizes.

current_batch_size is size_t but is assigned to int variables col_primal and ld_primal. If current_batch_size > INT_MAX, this truncates silently, causing incorrect cuSPARSE descriptors and potential memory corruption.

🐛 Proposed fix
-    int rows_primal = primal_size;
-    int col_primal  = current_batch_size;
-    int ld_primal   = current_batch_size;
-
-    int rows_dual = dual_size;
-    int col_dual  = current_batch_size;
-    int ld_dual   = current_batch_size;
+    int64_t rows_primal = primal_size;
+    int64_t col_primal  = current_batch_size;
+    int64_t ld_primal   = current_batch_size;
+
+    int64_t rows_dual = dual_size;
+    int64_t col_dual  = current_batch_size;
+    int64_t ld_dual   = current_batch_size;

Alternatively, add an assertion to validate that current_batch_size fits in the type expected by the cuSPARSE API.

🤖 Prompt for AI Agents
In
`@cpp/src/linear_programming/optimal_batch_size_handler/optimal_batch_size_handler.cu`
around lines 40 - 46, The assignments of size_t current_batch_size to int
variables (col_primal, ld_primal, col_dual, ld_dual) can silently truncate large
values; update the code to either keep these variables as a size-safe type or
validate-and-cast before use: add an explicit runtime check/assert that
current_batch_size <= INT_MAX (or the max required by the cuSPARSE API) and then
use a checked cast (e.g., static_cast<int>) when assigning to
col_primal/ld_primal/col_dual/ld_dual, or change their declarations from int to
a wider unsigned type that matches primal_size/dual_size and only convert when
calling cuSPARSE. Ensure you reference and modify the variables
current_batch_size, col_primal, ld_primal, col_dual, and ld_dual (and any call
sites that pass them into cuSPARSE descriptors) so no silent narrowing occurs.

Comment on lines +244 to +250
// Sync before starting anything to make sure everything is done
RAFT_CUDA_TRY(cudaStreamSynchronize(stream_view));

// Evaluate current, left and right nodes to pick a direction

const int left_node = std::max(1, current_batch_size / 2);
const int right_node = std::min(current_batch_size * 2, max_batch_size);

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

# First, locate the file
find . -name "optimal_batch_size_handler.cu" -type f

Repository: NVIDIA/cuopt

Length of output: 141


🏁 Script executed:

# Search for call sites and function definition
rg -n "optimal_batch_size_handler" --type cpp --type cu -A 3 -B 3

Repository: NVIDIA/cuopt

Length of output: 82


🏁 Script executed:

# Look at the file structure and function signatures
rg -n "max_batch_size" -B 5 -A 10 --type cpp --type cu | head -100

Repository: NVIDIA/cuopt

Length of output: 82


🏁 Script executed:

# Read the file to examine the code at specified lines
cat -n ./cpp/src/linear_programming/optimal_batch_size_handler/optimal_batch_size_handler.cu | sed -n '240,260p'

Repository: NVIDIA/cuopt

Length of output: 1303


🏁 Script executed:

# Also check the other lines mentioned
cat -n ./cpp/src/linear_programming/optimal_batch_size_handler/optimal_batch_size_handler.cu | sed -n '280,295p'

Repository: NVIDIA/cuopt

Length of output: 767


🏁 Script executed:

# Check lines around 321, 365, 400-401
cat -n ./cpp/src/linear_programming/optimal_batch_size_handler/optimal_batch_size_handler.cu | sed -n '315,330p'

Repository: NVIDIA/cuopt

Length of output: 903


🏁 Script executed:

# Check line 365
cat -n ./cpp/src/linear_programming/optimal_batch_size_handler/optimal_batch_size_handler.cu | sed -n '360,375p'

Repository: NVIDIA/cuopt

Length of output: 912


🏁 Script executed:

# Check lines around 400-401
cat -n ./cpp/src/linear_programming/optimal_batch_size_handler/optimal_batch_size_handler.cu | sed -n '395,410p'

Repository: NVIDIA/cuopt

Length of output: 855


🏁 Script executed:

# Find the function definition to understand the parameter types
cat -n ./cpp/src/linear_programming/optimal_batch_size_handler/optimal_batch_size_handler.cu | head -50

Repository: NVIDIA/cuopt

Length of output: 2238


🏁 Script executed:

# Search for calls to optimal_batch_size_handler without type restrictions
grep -rn "optimal_batch_size_handler" --include="*.cu" --include="*.cpp" --include="*.hpp" -A 3 -B 1

Repository: NVIDIA/cuopt

Length of output: 4071


🏁 Script executed:

# Find memory_max_batch_size definition and usage
grep -rn "memory_max_batch_size" --include="*.cu" --include="*.cpp" --include="*.hpp" -B 2 -A 2

Repository: NVIDIA/cuopt

Length of output: 1709


🏁 Script executed:

# Check if there's any validation or constraint on max_batch_size in the function
cat -n ./cpp/src/linear_programming/optimal_batch_size_handler/optimal_batch_size_handler.cu | sed -n '196,230p'

Repository: NVIDIA/cuopt

Length of output: 1971


🏁 Script executed:

# Look for any constants or limits related to batch size
grep -rn "INT_MAX\|INT_MIN\|batch_size.*limit\|max.*batch" --include="*.hpp" --include="*.h" | grep -i batch | head -20

Repository: NVIDIA/cuopt

Length of output: 195


Verify integer overflow protection for batch size calculations.

On lines 250 and 365, current_batch_size * 2 is computed in int and can overflow if current_batch_size is large. While the result is capped by max_batch_size, the overflow happens before the comparison. A similar risk exists on line 321, where (current_batch_size * 2) + current_batch_size can overflow during the multiplication step.

However, line 287 performs division (current_batch_size / 2), not multiplication, and lines 400-401 don't involve doubling ((current_batch_size / 2) + current_batch_size), so these don't have the same overflow risk.

To fix: Use wider intermediate types for all multiplications involving batch sizes:

   const int left_node  = std::max(1, current_batch_size / 2);
-  const int right_node = std::min(current_batch_size * 2, max_batch_size);
+  const int right_node = std::min(static_cast<int>(static_cast<int64_t>(current_batch_size) * 2),
+                                  max_batch_size);

Apply similar fixes to lines 321 and 365.

🤖 Prompt for AI Agents
In
`@cpp/src/linear_programming/optimal_batch_size_handler/optimal_batch_size_handler.cu`
around lines 244 - 250, The batch-size doubling and sum expressions can overflow
32-bit ints; change multiplications to use a wider intermediate (e.g., int64_t /
long long) before clamping back to int: for the right_node assignment (currently
computing right_node = std::min(current_batch_size * 2, max_batch_size)) and for
the sum expression used elsewhere (the (current_batch_size * 2) +
current_batch_size calculation), compute the doubled/added value in an int64_t
variable (e.g., int64_t doubled = static_cast<int64_t>(current_batch_size) * 2;
int64_t sum = doubled + static_cast<int64_t>(current_batch_size);), then use
std::min with static_cast<int>(...) to clamp against max_batch_size and assign
to right_node or the target variable; apply the same pattern for the other
occurrence referenced in the review to avoid overflow during the multiplication
step.

…assert size comment, changed from non-const static to constexpr, fix potential segfault in non-concurrent mode, correction use f_t instead of double, protect again wrong size in matrix swap and optimal batch size handler, guarded again SegemendeReduce buffer allocation issues
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 9

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
cpp/src/linear_programming/saddle_point.cu (1)

94-107: Add validation or documentation clarifying copy() batch semantics.

The copy() method copies only primal_size_ and dual_size_ elements, while all other state buffers are allocated as batch_size * dimension. Unlike primal_gradient_ (which has an explicit comment stating it does not support batch mode), copy() lacks documentation clarifying whether it intentionally copies only a single problem's solution or if this is a batch-aware design oversight.

Since swap_context() and resize_context() properly handle batch data, add either:

  • A validation check asserting batch_size == 1 before copying, or
  • Documentation explicitly stating that copy() only operates on single-batch problems.
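
A toy, self-contained sketch of the first option above; the struct, member names, and the plain assert are stand-ins for the real saddle point state and cuopt_assert.

#include <cassert>
#include <vector>

struct toy_saddle_point_state {
  int batch_size;
  int primal_size;
  std::vector<double> primal;  // laid out as batch_size * primal_size

  // Guard the single-problem copy so batched state is never silently truncated
  // to climber 0 only.
  void copy_from(const toy_saddle_point_state& other)
  {
    assert(batch_size == 1 && other.batch_size == 1 &&
           "copy() only supports single-batch saddle point states");
    primal.assign(other.primal.begin(), other.primal.begin() + other.primal_size);
  }
};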
🤖 Fix all issues with AI agents
In `@cpp/src/linear_programming/cusparse_view.cu`:
- Line 261: Replace the inconsistent use of CUSPARSE_CHECK with the project's
standard RAFT_CUSPARSE_TRY macro for the cusparseSetStream call: change the call
that currently invokes CUSPARSE_CHECK(cusparseSetStream(handle, stream)) to use
RAFT_CUSPARSE_TRY(cusparseSetStream(handle, stream)) so error handling matches
the rest of the file and uses the same macro semantics for the symbols
cusparseSetStream, handle, and stream.

In `@cpp/src/linear_programming/pdhg.cu`:
- Around line 170-177: The include for utilities/copy_helpers.hpp (which
provides make_span) is currently guarded by CUPDLP_DEBUG_MODE causing
compilation failures in release; remove the debug-only guard so make_span is
available in all builds, ensuring calls like make_span(...) passed into
pdhg_swap_bounds_kernel (and any other kernel launches) compile; update the
include that provides make_span to be unconditional (no CUPDLP_DEBUG_MODE
#if/#ifdef) so symbols like make_span are
and related uses.
- Around line 774-781: The CUB bulk launches (cub::DeviceFor::Bulk calls)
currently lack CUDA error checks; wrap each call with RAFT_CUDA_TRY so failures
are surfaced. Specifically, replace each raw call like
cub::DeviceFor::Bulk(climber_strategies_.size(),
refine_initial_primal_projection_bulk_op<i_t, f_t>{...}, stream_view_.value())
(and the similar calls using other bulk op functors at the other locations) with
RAFT_CUDA_TRY(cub::DeviceFor::Bulk(...)); keeping the same arguments (e.g.,
climber_strategies_.size(), the respective refine_*_bulk_op functor instances,
and stream_view_.value()) so the return cudaError_t is checked.

In `@cpp/src/linear_programming/solve.cu`:
- Around line 728-743: The loop that decrements memory_max_batch_size using
batch_pdlp_memory_estimator can exit at 1 while the estimated memory still
exceeds free_mem, risking OOM; after the while loop (or when setting
use_optimal_batch_size) recompute or recheck
batch_pdlp_memory_estimator(problem, memory_max_batch_size, max_batch_size)
against free_mem and if it still exceeds free_mem, fail fast with a clear
error/exception or return an error code (include context like
memory_max_batch_size, free_mem and the failing estimate) instead of proceeding;
update the logic around use_optimal_batch_size and memory_max_batch_size to
ensure callers see the failure (a hedged sketch of this check follows after this list).

In
`@cpp/src/linear_programming/step_size_strategy/adaptive_step_size_strategy.cu`:
- Around line 483-516: The three calls to cub::DeviceSegmentedReduce::Sum (the
calls that use tuple_multiplies<f_t> and power_two_func_t<f_t> and the
temporaries dot_product_storage/dot_product_bytes on stream_view_) are missing
CUDA error checking; capture the cudaError_t returned by each
DeviceSegmentedReduce::Sum call and pass it to the project's error-checking
macro (e.g., CUDA_CHECK) or explicitly check and log/fail on non-success, so
each segmented-reduce invocation is validated (include a short context message
identifying the call when reporting the error).

In `@cpp/src/linear_programming/swap_and_resize_helper.cuh`:
- Around line 78-85: cub::DeviceTransform::Transform returns a cudaError_t and
its return value must be checked; wrap the call to
cub::DeviceTransform::Transform (the lambda swap transform invocation) with the
project's CUDA error macro (e.g. RAFT_CUDA_TRY or CUDA_CHECK) to surface launch
failures and propagate errors, i.e. call
RAFT_CUDA_TRY(cub::DeviceTransform::Transform(...)) or equivalent around the
existing call and keep passing matrix.stream() and the same
in_zip/out_zip/total_items and lambda as before.

In `@cpp/src/linear_programming/termination_strategy/convergence_information.cu`:
- Around line 371-385: The two CUB calls (cub::DeviceSegmentedReduce::Sum and
cub::DeviceTransform::Transform) in the batch residual path ignore their return
codes; capture their returned status and check it using the project's CUDA
error-checking macro (e.g., RAFT_CUDA_TRY or CUDA_TRY) or by testing the
returned cudaError_t and handling/logging on failure. Specifically, wrap the
call to cub::DeviceSegmentedReduce::Sum(...) (using dot_product_storage_,
dot_product_bytes_, thrust::make_transform_iterator(primal_residual_.data(),
power_two_func_t<f_t>{}), l2_primal_residual_.data(),
climber_strategies_.size(), dual_size_h_, stream_view_) and the call to
cub::DeviceTransform::Transform(...) (using l2_primal_residual_.data(),
l2_primal_residual_.data(), l2_primal_residual_.size(), the sqrt lambda,
stream_view_) with the error check so any CUDA/CUB error is propagated or logged
consistently.
- Around line 137-188: The sizing calls to cub::DeviceSegmentedReduce::Sum (the
nullptr temp-storage queries using variables like byte_needed, dual_dot_,
power_two_func_t, tuple_multiplies, primal_objective_, stream_view_, and
dot_product_bytes_) are not checking the returned cudaError_t; assign each
call's return to a cudaError_t (e.g., cudaError_t err =
cub::DeviceSegmentedReduce::Sum(...)) and immediately validate it with the
project error macro (CUDA_CHECK(err) or equivalent) after every sizing
invocation to ensure failures are caught during the segmented-reduce sizing
phase.
- Around line 766-793: Wrap each CUB call in the dual objective computation (the
two cub::DeviceSegmentedReduce::Sum calls and the
cub::DeviceTransform::Transform call) with CUDA error checking: capture the
returned cudaError_t from each call and call the project's CUDA error-handling
utility (or check for != cudaSuccess and log/return) using the same context
variables (dot_product_storage_, dot_product_bytes_,
thrust::make_transform_iterator(...), dual_dot_.data(), primal_slack_.data(),
sum_primal_slack_.data(), dual_objective_.data(), dual_objective_.size(),
stream_view_) so any failure is detected and reported immediately; ensure you do
this in the block that computes the dual objective so failures in segmented
reduce or transform are handled.
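
For the solve.cu item above, a hedged sketch of the fail-fast shape being asked for; batch_pdlp_memory_estimator's signature, the surrounding variable names, and the error type are taken from the prompt text and are assumptions, not verified against the file.

// Shrink the batch until the estimate fits, but never accept a size that still does not fit.
while (memory_max_batch_size > 1 &&
       batch_pdlp_memory_estimator(problem, memory_max_batch_size, max_batch_size) > free_mem) {
  --memory_max_batch_size;
}
const auto final_estimate =
  batch_pdlp_memory_estimator(problem, memory_max_batch_size, max_batch_size);
cuopt_expects(final_estimate <= free_mem,
              error_type_t::ValidationError,  // illustrative error type, pick whichever fits
              "Batch PDLP does not fit in free device memory even with a batch size of 1");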
🧹 Nitpick comments (13)
cpp/src/linear_programming/saddle_point.cu (2)

23-38: Missing validation for batch_size parameter.

The constructor validates primal_size > 0 and dual_size > 0 but does not validate batch_size > 0. While zero-sized allocations may be handled by RMM, it would be more consistent to explicitly validate this parameter, especially since swap_context and resize_context assert batch_size > 0.

🔧 Suggested fix
 {
   EXE_CUOPT_EXPECTS(primal_size > 0, "Size of the primal problem must be larger than 0");
   EXE_CUOPT_EXPECTS(dual_size > 0, "Size of the dual problem must be larger than 0");
+  EXE_CUOPT_EXPECTS(batch_size > 0, "Batch size must be larger than 0");

77-92: Implementation looks correct; minor clarification suggestion.

The resize logic correctly shrinks all batched buffers while preserving the stream associations. The strict inequality new_size < batch_size is appropriate for the use case of removing terminated climbers.

Consider clarifying the assertion message to indicate it refers to the current batch size:

📝 Optional clarification
-  cuopt_assert(new_size < batch_size, "New size must be less than batch size");
+  cuopt_assert(new_size < batch_size, "New size must be less than current batch size");
cpp/src/linear_programming/optimal_batch_size_handler/optimal_batch_size_handler.cu (3)

130-143: Incorrect cast of buffer data to (f_t*).

The SpMM buffer is a void* opaque workspace returned by cuSPARSE, not typed data. Casting it to (f_t*) is semantically incorrect. While this may work because cuSPARSE interprets the pointer as void* anyway, the cast is misleading and could cause issues if the buffer is used elsewhere expecting f_t*.

♻️ Suggested fix
     RAFT_CUSPARSE_TRY(raft::sparse::detail::cusparsespmm(
       handle_ptr->get_cusparse_handle(),
       CUSPARSE_OPERATION_NON_TRANSPOSE,
       CUSPARSE_OPERATION_NON_TRANSPOSE,
       alpha.data(),
       A,
       x_descr,
       beta.data(),
       y_descr,
       (deterministic_batch_pdlp) ? CUSPARSE_SPMM_CSR_ALG3 : CUSPARSE_SPMM_CSR_ALG2,
-      (f_t*)buffer_non_transpose_batch.data(),
+      buffer_non_transpose_batch.data(),
       stream_view));
 
     RAFT_CUSPARSE_TRY(raft::sparse::detail::cusparsespmm(
       handle_ptr->get_cusparse_handle(),
       CUSPARSE_OPERATION_NON_TRANSPOSE,
       CUSPARSE_OPERATION_NON_TRANSPOSE,
       alpha.data(),
       A_T,
       y_descr,
       beta.data(),
       x_descr,
       (deterministic_batch_pdlp) ? CUSPARSE_SPMM_CSR_ALG3 : CUSPARSE_SPMM_CSR_ALG2,
-      (f_t*)buffer_transpose_batch.data(),
+      buffer_transpose_batch.data(),
       stream_view));

221-221: Unused variable stream_view.

The variable stream_view is declared on line 221 but is only used once on line 246 for synchronization. Consider inlining the call or removing the variable if the handle's stream can be used directly.

♻️ Suggested fix
-  rmm::cuda_stream_view stream_view = op_problem.get_handle_ptr()->get_stream();
 
   detail::problem_t<i_t, f_t> problem(op_problem);
 
   // ... (cusparse view creation) ...
 
   // Sync before starting anything to make sure everything is done
-  RAFT_CUDA_TRY(cudaStreamSynchronize(stream_view));
+  RAFT_CUDA_TRY(cudaStreamSynchronize(op_problem.get_handle_ptr()->get_stream()));

425-435: Unreachable code block is intentional but may trigger compiler warnings.

The else block at lines 426-431 handles the case where current_ratio is best and returns, so the code at lines 433-434 is never reached. While keeping it is good defensive programming, some compilers may warn about unreachable code.

Consider whether to keep the defensive assert or remove it to avoid warnings, depending on your CI/compiler warning policy.

cpp/src/linear_programming/cusparse_view.cu (3)

339-396: Batch mode dense matrix descriptors created without error checking on data validity.

The batch descriptor creation (lines 343-396) uses raw data pointers from current_saddle_point_state and input buffers. While there's an assert for _reflected_primal_solution at line 363, other pointers like get_dual_solution().data(), get_current_AtY().data() etc. are used directly without validation. Consider adding similar guards for critical pointers, especially since batch mode is new functionality.

Also note the typo on line 339: "mdoe" should be "mode".

✏️ Fix typo
-  // TODO batch mdoe: convert those to RAII views
+  // TODO batch mode: convert those to RAII views

481-512: Batch-mode buffer allocation lacks error checking after resize.

The buffer_transpose_batch_row_row_ and buffer_non_transpose_batch_row_row_ buffers are resized (lines 495-496, 510-511) but there's no subsequent check that the allocation succeeded before using .data() in the preprocessing calls (lines 570, 582). While RMM typically throws on allocation failure, consider adding explicit size validation for robustness in this new batch path.


559-584: Missing CUDA error check after my_cusparsespmm_preprocess calls in batch mode.

The preprocessing calls inside the if (batch_mode_) block (lines 560-571, 572-583) don't have explicit error checking after the function calls. While my_cusparsespmm_preprocess uses RAFT_CUSPARSE_TRY internally, it's good practice to verify the stream state after a sequence of GPU operations, especially in new batch-mode code paths.

cpp/src/linear_programming/step_size_strategy/adaptive_step_size_strategy.cu (2)

154-165: resize_context assertion may be overly restrictive.

Line 160 asserts new_size < batch_size, which prevents resizing to the same size. This might be intentional, but typically resize operations allow new_size <= batch_size (no-op when equal). If resizing to current size should be valid, consider:

✏️ Allow no-op resize
-  cuopt_assert(new_size < batch_size, "New size must be less than batch size");
+  cuopt_assert(new_size <= batch_size, "New size must be less than or equal to batch size");
+  if (new_size == batch_size) { return; }  // No-op

167-268: Kernel asserts batch mode not supported, but batch path exists elsewhere.

The compute_step_sizes_from_movement_and_interaction kernel (lines 176-177) asserts that primal_weight.size() == 1, meaning it doesn't support batch mode. However, the compute_step_sizes function (line 329) that calls this kernel also asserts !batch_mode_.

This is internally consistent, but the kernel assertion message could be more informative about why batch is unsupported here (presumably because per-climber adaptive step sizing isn't implemented yet).

✏️ Clarify assertion message
-  cuopt_assert(step_size_strategy_view.primal_weight.size() == 1,
-               "compute_step_sizes_from_movement_and_interaction not supported in batch");
+  cuopt_assert(step_size_strategy_view.primal_weight.size() == 1,
+               "Per-climber adaptive step size computation not yet implemented in batch mode");
cpp/src/linear_programming/restart_strategy/pdlp_restart_strategy.cu (3)

980-985: Redundant assignment in loop.

The weighted_average_solution_.iterations_since_last_restart_ = 0 assignment is set multiple times inside the loop but only needs to be set once outside.

♻️ Suggested fix
   // TODO later batch mode: remove if you have per climber restart
+  weighted_average_solution_.iterations_since_last_restart_ = 0;
   for (size_t i = 0; i < climber_strategies_.size(); ++i) {
-    weighted_average_solution_.iterations_since_last_restart_ = 0;
     last_trial_fixed_point_error_[i] = std::numeric_limits<f_t>::infinity();
   }

995-1003: Consider adding bounds validation for swap indices.

Unlike the similar swap_context in convergence_information.cu which validates pair.left < batch_size and pair.right < batch_size, this function skips bounds checking before accessing vector elements.

♻️ Suggested validation
 template <typename i_t, typename f_t>
 void pdlp_restart_strategy_t<i_t, f_t>::swap_context(
   const thrust::universal_host_pinned_vector<swap_pair_t<i_t>>& swap_pairs)
 {
   if (swap_pairs.empty()) { return; }

   last_restart_duality_gap_.swap_context(swap_pairs);

   cuopt_assert(!fixed_point_error_.empty(), "Fixed point error must not be empty");
+  const auto batch_size = static_cast<i_t>(fixed_point_error_.size());
   for (const auto& pair : swap_pairs) {
+    cuopt_assert(pair.left < batch_size && pair.right < batch_size, "Swap index out of bounds");
     host_vector_swap(fixed_point_error_, pair.left, pair.right);

709-717: Consider parallel execution for large batch sizes.

The TODO comment is valid - if batch sizes grow large, std::reduce without an execution policy may become a bottleneck. For now, this is acceptable, but consider using std::reduce(std::execution::par_unseq, ...) if batch sizes exceed a threshold.
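
A small standalone illustration of that switch; the function name and threshold are arbitrary, and whether the parallel policy helps in practice depends on the standard library backend (libstdc++ needs TBB for real parallelism).

#include <execution>
#include <numeric>
#include <vector>

double reduce_fixed_point_errors(const std::vector<double>& errors)
{
  // Illustrative cutoff, not tuned; below it the sequential reduce is usually cheaper.
  constexpr std::size_t kParallelThreshold = 1u << 14;
  if (errors.size() >= kParallelThreshold) {
    return std::reduce(std::execution::par_unseq, errors.begin(), errors.end(), 0.0);
  }
  return std::reduce(errors.begin(), errors.end(), 0.0);
}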

return CUDA_R_64F;
}
}();
CUSPARSE_CHECK(cusparseSetStream(handle, stream));

⚠️ Potential issue | 🟡 Minor

Inconsistent error-checking macro: use RAFT_CUSPARSE_TRY instead of CUSPARSE_CHECK.

Line 261 uses CUSPARSE_CHECK while the rest of the file consistently uses RAFT_CUSPARSE_TRY. This inconsistency could lead to different error handling behavior.

🔧 Proposed fix
-  CUSPARSE_CHECK(cusparseSetStream(handle, stream));
+  RAFT_CUSPARSE_TRY(cusparseSetStream(handle, stream));
🤖 Prompt for AI Agents
In `@cpp/src/linear_programming/cusparse_view.cu` at line 261, Replace the
inconsistent use of CUSPARSE_CHECK with the project's standard RAFT_CUSPARSE_TRY
macro for the cusparseSetStream call: change the call that currently invokes
CUSPARSE_CHECK(cusparseSetStream(handle, stream)) to use
RAFT_CUSPARSE_TRY(cusparseSetStream(handle, stream)) so error handling matches
the rest of the file and uses the same macro semantics for the symbols
cusparseSetStream, handle, and stream.

Comment on lines +170 to +177
  const auto [grid_size, block_size] =
    kernel_config_from_batch_size(static_cast<i_t>(swap_pairs.size()));
  pdhg_swap_bounds_kernel<i_t, f_t>
    <<<grid_size, block_size, 0, stream_view_>>>(thrust::raw_pointer_cast(swap_pairs.data()),
                                                 static_cast<i_t>(swap_pairs.size()),
                                                 make_span(new_bounds_idx_),
                                                 make_span(new_bounds_lower_),
                                                 make_span(new_bounds_upper_));

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Search for make_span definition and declarations
echo "=== Searching for make_span definition ==="
rg -n "make_span" cpp/src -g'*.{cu,cpp,cuh,hpp,h}' --max-count=20

echo ""
echo "=== Checking pdhg.cu includes ==="
head -50 cpp/src/linear_programming/pdhg.cu | grep -n "#include"

echo ""
echo "=== Checking lines around 170-177 in pdhg.cu with context ==="
sed -n '1,50p' cpp/src/linear_programming/pdhg.cu | grep -n "CUPDLP_DEBUG_MODE\|copy_helpers"

Repository: NVIDIA/cuopt

Length of output: 49128


🏁 Script executed:

#!/bin/bash
# Get the exact context around lines 170-177
echo "=== Code context at lines 165-180 ==="
sed -n '165,180p' cpp/src/linear_programming/pdhg.cu | cat -n

echo ""
echo "=== Checking if make_span is defined in copy_helpers.hpp ==="
rg -n "make_span" cpp/src/utilities/copy_helpers.hpp

Repository: NVIDIA/cuopt

Length of output: 1713


🏁 Script executed:

#!/bin/bash
# Check all includes in pdhg.cu to understand the full picture
echo "=== All includes in pdhg.cu ==="
grep -n "#include" cpp/src/linear_programming/pdhg.cu | head -30

echo ""
echo "=== Check for CUPDLP_DEBUG_MODE guards around copy_helpers include ==="
grep -B2 -A2 "copy_helpers" cpp/src/linear_programming/pdhg.cu

Repository: NVIDIA/cuopt

Length of output: 1006


Ensure make_span is available in non-debug builds.

make_span(...) is used at lines 175-177 outside debug-only blocks, but utilities/copy_helpers.hpp (where it's defined) is only included when CUPDLP_DEBUG_MODE is enabled. In release builds, this causes a compilation error. Remove the debug guard on the include.

Proposed fix
-#ifdef CUPDLP_DEBUG_MODE
-#include <utilities/copy_helpers.hpp>
-#endif
+#include <utilities/copy_helpers.hpp>
🤖 Prompt for AI Agents
In `@cpp/src/linear_programming/pdhg.cu` around lines 170 - 177, The include for
utilities/copy_helpers.hpp (which provides make_span) is currently guarded by
CUPDLP_DEBUG_MODE causing compilation failures in release; remove the debug-only
guard so make_span is available in all builds, ensuring calls like
make_span(...) passed into pdhg_swap_bounds_kernel (and any other kernel
launches) compile; update the include that provides make_span to be
unconditional (no CUPDLP_DEBUG_MODE #if/#ifdef) so symbols like make_span are
defined for pdhg_swap_bounds_kernel and related uses.

Comment on lines +774 to +781
  cub::DeviceFor::Bulk(climber_strategies_.size(),
                       refine_initial_primal_projection_bulk_op<i_t, f_t>{
                         make_span(new_bounds_idx_),
                         make_span(new_bounds_lower_),
                         make_span(new_bounds_upper_),
                         make_span(current_saddle_point_state_.get_primal_solution()),
                         problem_ptr->n_variables},
                       stream_view_.value());

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# First, let's check the actual code in pdhg.cu at the mentioned line ranges
fd "pdhg.cu" -t f

Repository: NVIDIA/cuopt

Length of output: 90


🌐 Web query:

cub::DeviceFor::Bulk return type cudaError_t CUB CUDA error checking

💡 Result:

cub::DeviceFor::Bulk(...) returns cudaError_t (just like other CUB device-scope algorithms). You should treat it as a CUDA runtime call: check the return immediately, and (if you need to catch asynchronous kernel failures) add a synchronize + last-error check. [1], [5]

What the return value means

  • The returned cudaError_t typically reflects immediate/launch-time issues (bad args, invalid device state, etc.). [3]
  • Because GPU work is asynchronous, runtime kernel failures may only surface later unless you explicitly synchronize. [3]

Recommended error-checking pattern

#include <cub/device/device_for.cuh>
#include <cuda_runtime.h>

#define CUDA_TRY(call) do {                                \
  cudaError_t _e = (call);                                 \
  if (_e != cudaSuccess) return _e;                        \
} while(0)

cudaError_t run(int n, MyOp op, cudaStream_t s)
{
  // If using the temp-storage overload:
  void*  d_temp = nullptr;
  size_t temp_bytes = 0;

  CUDA_TRY(cub::DeviceFor::Bulk(d_temp, temp_bytes, n, op, s)); // size query [1]
  CUDA_TRY(cudaMalloc(&d_temp, temp_bytes));

  CUDA_TRY(cub::DeviceFor::Bulk(d_temp, temp_bytes, n, op, s)); // launch [1]

  // Catch async failures (debug / when correctness matters):
  CUDA_TRY(cudaGetLastError());        // last launch error reporting guidance [3]
  CUDA_TRY(cudaDeviceSynchronize());   // forces async errors to surface [3]

  cudaFree(d_temp);
  return cudaSuccess;
}

Notes:

  • cub::DeviceFor::Bulk has overloads both with and without temp storage, and the temp-storage form follows the standard CUB “nullptr to query temp_storage_bytes” convention. [1]
  • CUB device-scope algorithms “always return cudaError_t” by design. [5]

🏁 Script executed:

# Check what CUB operations are already being used with error checking in the codebase
rg "RAFT_CUDA_TRY.*cub::" -A 2 -B 2

Repository: NVIDIA/cuopt

Length of output: 2013


🏁 Script executed:

# Look for existing patterns of DeviceFor or DeviceTransform usage
rg "DeviceFor::|DeviceTransform" -A 2 -B 2

Repository: NVIDIA/cuopt

Length of output: 50368


🏁 Script executed:

# Get line count to understand file size
wc -l cpp/src/linear_programming/pdhg.cu

Repository: NVIDIA/cuopt

Length of output: 95


🏁 Script executed:

# Extract the specific line ranges mentioned in the review
echo "=== Lines 774-781 ==="
sed -n '774,781p' cpp/src/linear_programming/pdhg.cu

echo "=== Lines 813-824 ==="
sed -n '813,824p' cpp/src/linear_programming/pdhg.cu

echo "=== Lines 873-884 ==="
sed -n '873,884p' cpp/src/linear_programming/pdhg.cu

echo "=== Lines 920-930 ==="
sed -n '920,930p' cpp/src/linear_programming/pdhg.cu

echo "=== Lines 981-990 ==="
sed -n '981,990p' cpp/src/linear_programming/pdhg.cu

Repository: NVIDIA/cuopt

Length of output: 3556


🏁 Script executed:

# Search for RAFT_CUDA_TRY usage with DeviceFor or DeviceTransform in pdhg.cu specifically
rg "RAFT_CUDA_TRY.*Device(For|Transform)" cpp/src/linear_programming/pdhg.cu

Repository: NVIDIA/cuopt

Length of output: 38


Add CUDA error checks for CUB bulk operations.

The cub::DeviceFor::Bulk calls at lines 774–781, 813–824, 873–884, 920–930, and 981–990 lack error checking despite returning cudaError_t. Wrap them with RAFT_CUDA_TRY to surface launch failures, consistent with error-handling patterns used for CUB operations elsewhere in the codebase.

🐛 Example fix (apply similarly to all five instances)
-  cub::DeviceFor::Bulk(climber_strategies_.size(),
-                       refine_initial_primal_projection_bulk_op<i_t, f_t>{
-                         make_span(new_bounds_idx_),
-                         make_span(new_bounds_lower_),
-                         make_span(new_bounds_upper_),
-                         make_span(current_saddle_point_state_.get_primal_solution()),
-                         problem_ptr->n_variables},
-                       stream_view_.value());
+  RAFT_CUDA_TRY(cub::DeviceFor::Bulk(climber_strategies_.size(),
+                                     refine_initial_primal_projection_bulk_op<i_t, f_t>{
+                                       make_span(new_bounds_idx_),
+                                       make_span(new_bounds_lower_),
+                                       make_span(new_bounds_upper_),
+                                       make_span(current_saddle_point_state_.get_primal_solution()),
+                                       problem_ptr->n_variables},
+                                     stream_view_.value()));

Per coding guidelines: Every CUDA kernel launch and memory operation must have error checking with CUDA_CHECK or equivalent verification.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
  cub::DeviceFor::Bulk(climber_strategies_.size(),
                       refine_initial_primal_projection_bulk_op<i_t, f_t>{
                         make_span(new_bounds_idx_),
                         make_span(new_bounds_lower_),
                         make_span(new_bounds_upper_),
                         make_span(current_saddle_point_state_.get_primal_solution()),
                         problem_ptr->n_variables},
                       stream_view_.value());
  RAFT_CUDA_TRY(cub::DeviceFor::Bulk(climber_strategies_.size(),
                                     refine_initial_primal_projection_bulk_op<i_t, f_t>{
                                       make_span(new_bounds_idx_),
                                       make_span(new_bounds_lower_),
                                       make_span(new_bounds_upper_),
                                       make_span(current_saddle_point_state_.get_primal_solution()),
                                       problem_ptr->n_variables},
                                     stream_view_.value()));
🤖 Prompt for AI Agents
In `@cpp/src/linear_programming/pdhg.cu` around lines 774 - 781, The CUB bulk
launches (cub::DeviceFor::Bulk calls) currently lack CUDA error checks; wrap
each call with RAFT_CUDA_TRY so failures are surfaced. Specifically, replace
each raw call like cub::DeviceFor::Bulk(climber_strategies_.size(),
refine_initial_primal_projection_bulk_op<i_t, f_t>{...}, stream_view_.value())
(and the similar calls using other bulk op functors at the other locations) with
RAFT_CUDA_TRY(cub::DeviceFor::Bulk(...)); keeping the same arguments (e.g.,
climber_strategies_.size(), the respective refine_*_bulk_op functor instances,
and stream_view_.value()) so the return cudaError_t is checked.
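
To make the wrapping concrete: a minimal, hypothetical stand-in for RAFT_CUDA_TRY (the real macro lives in RAFT and throws raft::cuda_error); it only illustrates checking the cudaError_t that CUB device-scope algorithms return.

#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

// Hypothetical stand-in: evaluate a CUDA/CUB call returning cudaError_t and
// fail loudly with file/line context if it did not succeed.
#define CHECK_CUDA_STATUS(call)                                          \
  do {                                                                   \
    cudaError_t status_ = (call);                                        \
    if (status_ != cudaSuccess) {                                        \
      std::fprintf(stderr, "%s:%d CUDA error: %s\n", __FILE__, __LINE__, \
                   cudaGetErrorString(status_));                         \
      std::abort();                                                      \
    }                                                                    \
  } while (0)

// Usage sketch: the same wrapping applies to every cub::DeviceFor::Bulk call.
// CHECK_CUDA_STATUS(cub::DeviceFor::Bulk(n, op, stream));

Note that this only surfaces launch-time failures; asynchronous kernel errors still require a later cudaGetLastError/cudaDeviceSynchronize check, as the web query above points out.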

Comment on lines +728 to +743
  // Check if we don't hit the limit using max_batch_size
  const size_t memory_estimate =
    batch_pdlp_memory_estimator(problem, max_batch_size, max_batch_size);
  size_t free_mem, total_mem;
  RAFT_CUDA_TRY(cudaMemGetInfo(&free_mem, &total_mem));

  if (memory_estimate > free_mem) {
    use_optimal_batch_size = true;
    // Decrement batch size iteratively until we find a batch size that fits
    while (memory_max_batch_size > 1) {
      const size_t memory_estimate =
        batch_pdlp_memory_estimator(problem, memory_max_batch_size, max_batch_size);
      if (memory_estimate <= free_mem) { break; }
      memory_max_batch_size--;
    }
  }

⚠️ Potential issue | 🟠 Major

Handle the case where even batch_size=1 exceeds available memory.

The loop stops at 1 even if the estimate still exceeds free memory, which can lead to OOM later. Consider a hard failure with a clear error when no batch size fits.

🐛 Proposed fix
   if (memory_estimate > free_mem) {
     use_optimal_batch_size = true;
     // Decrement batch size iteratively until we find a batch size that fits
     while (memory_max_batch_size > 1) {
       const size_t memory_estimate =
         batch_pdlp_memory_estimator(problem, memory_max_batch_size, max_batch_size);
       if (memory_estimate <= free_mem) { break; }
       memory_max_batch_size--;
     }
+    const size_t min_estimate =
+      batch_pdlp_memory_estimator(problem, memory_max_batch_size, max_batch_size);
+    cuopt_expects(min_estimate <= free_mem,
+                  error_type_t::RuntimeError,
+                  "Insufficient GPU memory for batch PDLP (even batch_size=1)");
   }
🤖 Prompt for AI Agents
In `@cpp/src/linear_programming/solve.cu` around lines 728 - 743, The loop that
decrements memory_max_batch_size using batch_pdlp_memory_estimator can exit at 1
while the estimated memory still exceeds free_mem, risking OOM; after the while
loop (or when setting use_optimal_batch_size) recompute or recheck
batch_pdlp_memory_estimator(problem, memory_max_batch_size, max_batch_size)
against free_mem and if it still exceeds free_mem, fail fast with a clear
error/exception or return an error code (include context like
memory_max_batch_size, free_mem and the failing estimate) instead of proceeding;
update the logic around use_optimal_batch_size and memory_max_batch_size to
ensure callers see the failure.
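
A small self-contained sketch of the fail-fast sizing logic suggested above; the helper name, the estimator callable, and the use of std::runtime_error instead of cuopt_expects are assumptions made for illustration.

#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper: shrink the batch until the memory estimate fits in
// free device memory, and fail with a clear error if even batch_size == 1
// does not fit, instead of letting a later allocation OOM.
template <typename EstimatorFn>
std::size_t pick_fitting_batch_size(EstimatorFn estimate_bytes,  // callable: batch_size -> bytes
                                    std::size_t max_batch_size,
                                    std::size_t free_mem)
{
  std::size_t batch = max_batch_size;
  while (batch > 1 && estimate_bytes(batch) > free_mem) { --batch; }
  if (estimate_bytes(batch) > free_mem) {
    throw std::runtime_error("Insufficient GPU memory for batch PDLP even with batch_size=1 (free bytes: " +
                             std::to_string(free_mem) + ")");
  }
  return batch;
}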

Comment on lines +483 to +516
  } else {
    // TODO later batch mode: remove this once you want to do per climber restart
    cub::DeviceSegmentedReduce::Sum(
      dot_product_storage.data(),
      dot_product_bytes,
      thrust::make_transform_iterator(
        thrust::make_zip_iterator(tmp_primal.data(),
                                  current_saddle_point_state.get_delta_primal().data()),
        tuple_multiplies<f_t>{}),
      interaction_.data(),
      climber_strategies_.size(),
      primal_size_,
      stream_view_);

    cub::DeviceSegmentedReduce::Sum(
      dot_product_storage.data(),
      dot_product_bytes,
      thrust::make_transform_iterator(current_saddle_point_state.get_delta_primal().data(),
                                      power_two_func_t<f_t>{}),
      norm_squared_delta_primal_.data(),
      climber_strategies_.size(),
      primal_size_,
      stream_view_);

    cub::DeviceSegmentedReduce::Sum(
      dot_product_storage.data(),
      dot_product_bytes,
      thrust::make_transform_iterator(current_saddle_point_state.get_delta_dual().data(),
                                      power_two_func_t<f_t>{}),
      norm_squared_delta_dual_.data(),
      climber_strategies_.size(),
      dual_size_,
      stream_view_);
  }

⚠️ Potential issue | 🟡 Minor

Missing CUDA error checks after cub::DeviceSegmentedReduce::Sum calls.

The three segmented reduce calls (lines 485-495, 497-505, 507-515) don't have error checking after them. CUB functions can fail due to insufficient temporary storage or other runtime issues. Add error checking for robustness.

🔧 Add error checking
     cub::DeviceSegmentedReduce::Sum(
       dot_product_storage.data(),
       dot_product_bytes,
       thrust::make_transform_iterator(...),
       interaction_.data(),
       climber_strategies_.size(),
       primal_size_,
       stream_view_);
+    RAFT_CUDA_TRY(cudaPeekAtLastError());

As per coding guidelines: "Every CUDA kernel launch and memory operation must have error checking with CUDA_CHECK or equivalent verification."

🤖 Prompt for AI Agents
In `@cpp/src/linear_programming/step_size_strategy/adaptive_step_size_strategy.cu`
around lines 483 - 516, The three calls to cub::DeviceSegmentedReduce::Sum (the
calls that use tuple_multiplies<f_t> and power_two_func_t<f_t> and the
temporaries dot_product_storage/dot_product_bytes on stream_view_) are missing
CUDA error checking; capture the cudaError_t returned by each
DeviceSegmentedReduce::Sum call and pass it to the project's error-checking
macro (e.g., CUDA_CHECK) or explicitly check and log/fail on non-success, so
each segmented-reduce invocation is validated (include a short context message
identifying the call when reporting the error).

Comment on lines +78 to +85
  cub::DeviceTransform::Transform(
    in_zip,
    out_zip,
    total_items,
    [] HD(thrust::tuple<f_t, f_t> values) -> thrust::tuple<f_t, f_t> {
      return thrust::make_tuple(thrust::get<1>(values), thrust::get<0>(values));
    },
    matrix.stream());

⚠️ Potential issue | 🟠 Major

Check CUB DeviceTransform return status.

cub::DeviceTransform::Transform returns cudaError_t; it should be wrapped in RAFT_CUDA_TRY (or equivalent) to surface launch failures.

🐛 Proposed fix
-  cub::DeviceTransform::Transform(
+  RAFT_CUDA_TRY(cub::DeviceTransform::Transform(
     in_zip,
     out_zip,
     total_items,
     [] HD(thrust::tuple<f_t, f_t> values) -> thrust::tuple<f_t, f_t> {
       return thrust::make_tuple(thrust::get<1>(values), thrust::get<0>(values));
     },
-    matrix.stream());
+    matrix.stream()));
As per coding guidelines: Every CUDA kernel launch and memory operation must have error checking with CUDA_CHECK or equivalent verification.
🤖 Prompt for AI Agents
In `@cpp/src/linear_programming/swap_and_resize_helper.cuh` around lines 78 - 85,
cub::DeviceTransform::Transform returns a cudaError_t and its return value must
be checked; wrap the call to cub::DeviceTransform::Transform (the lambda swap
transform invocation) with the project's CUDA error macro (e.g. RAFT_CUDA_TRY or
CUDA_CHECK) to surface launch failures and propagate errors, i.e. call
RAFT_CUDA_TRY(cub::DeviceTransform::Transform(...)) or equivalent around the
existing call and keep passing matrix.stream() and the same
in_zip/out_zip/total_items and lambda as before.
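
For reference, a standalone sketch of the swap-through-DeviceTransform pattern with its return status checked; the buffer names are illustrative and a plain cudaError_t check stands in for RAFT_CUDA_TRY. This assumes nvcc with --extended-lambda.

#include <cub/device/device_transform.cuh>
#include <thrust/device_vector.h>
#include <thrust/iterator/zip_iterator.h>
#include <thrust/tuple.h>
#include <cstdio>

int main()
{
  thrust::device_vector<double> a(4, 1.0), b(4, 2.0);
  thrust::device_vector<double> a_out(4), b_out(4);

  auto in_zip  = thrust::make_zip_iterator(a.begin(), b.begin());
  auto out_zip = thrust::make_zip_iterator(a_out.begin(), b_out.begin());

  // Elementwise swap of the two buffers; the CUB call returns cudaError_t, so check it.
  cudaError_t status = cub::DeviceTransform::Transform(
    in_zip,
    out_zip,
    static_cast<int>(a.size()),
    [] __device__(thrust::tuple<double, double> v) {
      return thrust::make_tuple(thrust::get<1>(v), thrust::get<0>(v));
    },
    /* stream = */ nullptr);
  if (status != cudaSuccess) {
    std::fprintf(stderr, "DeviceTransform failed: %s\n", cudaGetErrorString(status));
    return 1;
  }
  return 0;
}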

Comment on lines +137 to +188
    cub::DeviceSegmentedReduce::Sum(
      nullptr,
      byte_needed,
      thrust::make_transform_iterator(dual_dot_.data(), power_two_func_t<f_t>{}),
      dual_dot_.data(),
      climber_strategies_.size(),
      dual_size,
      stream_view_);
    dot_product_bytes_ = std::max(dot_product_bytes_, byte_needed);

    cub::DeviceSegmentedReduce::Sum(
      nullptr,
      byte_needed,
      thrust::make_transform_iterator(dual_dot_.data(), power_two_func_t<f_t>{}),
      dual_dot_.data(),
      climber_strategies_.size(),
      primal_size,
      stream_view_);
    dot_product_bytes_ = std::max(dot_product_bytes_, byte_needed);

    cub::DeviceSegmentedReduce::Sum(
      nullptr,
      byte_needed,
      thrust::make_transform_iterator(thrust::make_zip_iterator(dual_dot_.data(), dual_dot_.data()),
                                      tuple_multiplies<f_t>{}),
      dual_dot_.data(),
      climber_strategies_.size(),
      primal_size,
      stream_view_);
    dot_product_bytes_ = std::max(dot_product_bytes_, byte_needed);

    cub::DeviceSegmentedReduce::Sum(nullptr,
                                    byte_needed,
                                    dual_dot_.data(),
                                    dual_dot_.data(),
                                    climber_strategies_.size(),
                                    dual_size_h_,
                                    stream_view_);
    dot_product_bytes_ = std::max(dot_product_bytes_, byte_needed);

    cub::DeviceSegmentedReduce::Sum(
      nullptr,
      dot_product_bytes_,
      thrust::make_transform_iterator(
        thrust::make_zip_iterator(dual_dot_.data(),
                                  problem_wrap_container(problem_ptr->objective_coefficients)),
        tuple_multiplies<f_t>{}),
      primal_objective_.data(),
      climber_strategies_.size(),
      primal_size_h_,
      stream_view_);
    dot_product_bytes_ = std::max(dot_product_bytes_, byte_needed);

⚠️ Potential issue | 🟡 Minor

Add CUDA error checks around CUB segmented reduction sizing calls.

The multiple cub::DeviceSegmentedReduce::Sum calls used to determine dot_product_bytes_ ignore the returned cudaError_t. While these are sizing queries (passing nullptr for temp storage), error checking is still recommended.

🐛 Suggested fix pattern
-    cub::DeviceSegmentedReduce::Sum(
+    RAFT_CUDA_TRY(cub::DeviceSegmentedReduce::Sum(
       nullptr,
       byte_needed,
       thrust::make_transform_iterator(dual_dot_.data(), power_two_func_t<f_t>{}),
       dual_dot_.data(),
       climber_strategies_.size(),
       dual_size,
-      stream_view_);
+      stream_view_));

Apply this pattern to all sizing calls in lines 137-188.

As per coding guidelines: "Every CUDA kernel launch and memory operation must have error checking with CUDA_CHECK or equivalent verification."

🤖 Prompt for AI Agents
In `@cpp/src/linear_programming/termination_strategy/convergence_information.cu`
around lines 137 - 188, The sizing calls to cub::DeviceSegmentedReduce::Sum (the
nullptr temp-storage queries using variables like byte_needed, dual_dot_,
power_two_func_t, tuple_multiplies, primal_objective_, stream_view_, and
dot_product_bytes_) are not checking the returned cudaError_t; assign each
call's return to a cudaError_t (e.g., cudaError_t err =
cub::DeviceSegmentedReduce::Sum(...)) and immediately validate it with the
project error macro (CUDA_CHECK(err) or equivalent) after every sizing
invocation to ensure failures are caught during the segmented-reduce sizing
phase.

Comment on lines +371 to +385
    cub::DeviceSegmentedReduce::Sum(
      dot_product_storage_.data(),
      dot_product_bytes_,
      thrust::make_transform_iterator(primal_residual_.data(), power_two_func_t<f_t>{}),
      l2_primal_residual_.data(),
      climber_strategies_.size(),
      dual_size_h_,
      stream_view_);
    cub::DeviceTransform::Transform(
      l2_primal_residual_.data(),
      l2_primal_residual_.data(),
      l2_primal_residual_.size(),
      [] HD(f_t x) { return raft::sqrt(x); },
      stream_view_);
  }

⚠️ Potential issue | 🟡 Minor

Add CUDA error checks around CUB calls in batch mode residual computation.

The cub::DeviceSegmentedReduce::Sum and cub::DeviceTransform::Transform calls ignore returned error codes.

🐛 Suggested fix
   } else {
-    cub::DeviceSegmentedReduce::Sum(
+    RAFT_CUDA_TRY(cub::DeviceSegmentedReduce::Sum(
       dot_product_storage_.data(),
       dot_product_bytes_,
       thrust::make_transform_iterator(primal_residual_.data(), power_two_func_t<f_t>{}),
       l2_primal_residual_.data(),
       climber_strategies_.size(),
       dual_size_h_,
-      stream_view_);
-    cub::DeviceTransform::Transform(
+      stream_view_));
+    RAFT_CUDA_TRY(cub::DeviceTransform::Transform(
       l2_primal_residual_.data(),
       l2_primal_residual_.data(),
       l2_primal_residual_.size(),
       [] HD(f_t x) { return raft::sqrt(x); },
-      stream_view_);
+      stream_view_));
   }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
    cub::DeviceSegmentedReduce::Sum(
      dot_product_storage_.data(),
      dot_product_bytes_,
      thrust::make_transform_iterator(primal_residual_.data(), power_two_func_t<f_t>{}),
      l2_primal_residual_.data(),
      climber_strategies_.size(),
      dual_size_h_,
      stream_view_);
    cub::DeviceTransform::Transform(
      l2_primal_residual_.data(),
      l2_primal_residual_.data(),
      l2_primal_residual_.size(),
      [] HD(f_t x) { return raft::sqrt(x); },
      stream_view_);
  }
    RAFT_CUDA_TRY(cub::DeviceSegmentedReduce::Sum(
      dot_product_storage_.data(),
      dot_product_bytes_,
      thrust::make_transform_iterator(primal_residual_.data(), power_two_func_t<f_t>{}),
      l2_primal_residual_.data(),
      climber_strategies_.size(),
      dual_size_h_,
      stream_view_));
    RAFT_CUDA_TRY(cub::DeviceTransform::Transform(
      l2_primal_residual_.data(),
      l2_primal_residual_.data(),
      l2_primal_residual_.size(),
      [] HD(f_t x) { return raft::sqrt(x); },
      stream_view_));
  }
🤖 Prompt for AI Agents
In `@cpp/src/linear_programming/termination_strategy/convergence_information.cu`
around lines 371 - 385, The two CUB calls (cub::DeviceSegmentedReduce::Sum and
cub::DeviceTransform::Transform) in the batch residual path ignore their return
codes; capture their returned status and check it using the project's CUDA
error-checking macro (e.g., RAFT_CUDA_TRY or CUDA_TRY) or by testing the
returned cudaError_t and handling/logging on failure. Specifically, wrap the
call to cub::DeviceSegmentedReduce::Sum(...) (using dot_product_storage_,
dot_product_bytes_, thrust::make_transform_iterator(primal_residual_.data(),
power_two_func_t<f_t>{}), l2_primal_residual_.data(),
climber_strategies_.size(), dual_size_h_, stream_view_) and the call to
cub::DeviceTransform::Transform(...) (using l2_primal_residual_.data(),
l2_primal_residual_.data(), l2_primal_residual_.size(), the sqrt lambda,
stream_view_) with the error check so any CUDA/CUB error is propagated or logged
consistently.
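
The batch path above computes one L2 residual norm per climber by segment-summing squared entries and then taking an elementwise square root. A standalone sketch of that pattern follows; the names are illustrative, plain cudaError_t checks stand in for RAFT_CUDA_TRY, and thrust::transform stands in for cub::DeviceTransform in the sqrt step.

#include <cub/device/device_segmented_reduce.cuh>
#include <thrust/device_vector.h>
#include <thrust/iterator/transform_iterator.h>
#include <thrust/transform.h>
#include <cmath>
#include <cstdio>
#include <vector>

struct square_op {
  __host__ __device__ double operator()(double x) const { return x * x; }
};
struct sqrt_op {
  __host__ __device__ double operator()(double x) const { return sqrt(x); }
};

int main()
{
  const int num_segments = 3, segment_size = 5;
  thrust::device_vector<double> residual(num_segments * segment_size, 2.0);
  thrust::device_vector<double> l2_norms(num_segments);

  std::vector<int> h_offsets;
  for (int i = 0; i <= num_segments; ++i) { h_offsets.push_back(i * segment_size); }
  thrust::device_vector<int> offsets(h_offsets.begin(), h_offsets.end());

  // Squared entries are produced lazily through a transform iterator.
  auto squared = thrust::make_transform_iterator(
    thrust::raw_pointer_cast(residual.data()), square_op{});
  double* d_out        = thrust::raw_pointer_cast(l2_norms.data());
  const int* d_offsets = thrust::raw_pointer_cast(offsets.data());

  size_t temp_bytes  = 0;
  cudaError_t status = cub::DeviceSegmentedReduce::Sum(
    nullptr, temp_bytes, squared, d_out, num_segments, d_offsets, d_offsets + 1);
  if (status != cudaSuccess) { std::fprintf(stderr, "%s\n", cudaGetErrorString(status)); return 1; }

  thrust::device_vector<char> temp_storage(temp_bytes);
  status = cub::DeviceSegmentedReduce::Sum(
    thrust::raw_pointer_cast(temp_storage.data()), temp_bytes,
    squared, d_out, num_segments, d_offsets, d_offsets + 1);
  if (status != cudaSuccess) { std::fprintf(stderr, "%s\n", cudaGetErrorString(status)); return 1; }

  // Elementwise sqrt turns each per-segment sum of squares into an L2 norm.
  thrust::transform(l2_norms.begin(), l2_norms.end(), l2_norms.begin(), sqrt_op{});
  return 0;
}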

Comment on lines +766 to 793
    } else {
      cub::DeviceSegmentedReduce::Sum(
        dot_product_storage_.data(),
        dot_product_bytes_,
        thrust::make_transform_iterator(
          thrust::make_zip_iterator(dual_slack.data(), primal_solution.data()),
          tuple_multiplies<f_t>{}),
        dual_dot_.data(),
        climber_strategies_.size(),
        primal_size_h_,
        stream_view_);

      cub::DeviceSegmentedReduce::Sum(dot_product_storage_.data(),
                                      dot_product_bytes_,
                                      primal_slack_.data(),
                                      sum_primal_slack_.data(),
                                      climber_strategies_.size(),
                                      dual_size_h_,
                                      stream_view_);
    }

    const f_t sum = dual_dot_.value(stream_view_) + sum_primal_slack_.value(stream_view_);
    dual_objective_.set_value_async(sum, stream_view_);
    cub::DeviceTransform::Transform(
      cuda::std::make_tuple(dual_dot_.data(), sum_primal_slack_.data()),
      dual_objective_.data(),
      dual_objective_.size(),
      cuda::std::plus<>{},
      stream_view_);
  }

⚠️ Potential issue | 🟡 Minor

Add CUDA error checks around CUB calls in dual objective computation.

The batch mode cub::DeviceSegmentedReduce::Sum and cub::DeviceTransform::Transform calls should have error checking.

🐛 Suggested fix
     } else {
-      cub::DeviceSegmentedReduce::Sum(
+      RAFT_CUDA_TRY(cub::DeviceSegmentedReduce::Sum(
         dot_product_storage_.data(),
         dot_product_bytes_,
         thrust::make_transform_iterator(
           thrust::make_zip_iterator(dual_slack.data(), primal_solution.data()),
           tuple_multiplies<f_t>{}),
         dual_dot_.data(),
         climber_strategies_.size(),
         primal_size_h_,
-        stream_view_);
+        stream_view_));

-      cub::DeviceSegmentedReduce::Sum(dot_product_storage_.data(),
+      RAFT_CUDA_TRY(cub::DeviceSegmentedReduce::Sum(dot_product_storage_.data(),
                                       dot_product_bytes_,
                                       primal_slack_.data(),
                                       sum_primal_slack_.data(),
                                       climber_strategies_.size(),
                                       dual_size_h_,
-                                      stream_view_);
+                                      stream_view_));
     }
🤖 Prompt for AI Agents
In `@cpp/src/linear_programming/termination_strategy/convergence_information.cu`
around lines 766 - 793, Wrap each CUB call in the dual objective computation
(the two cub::DeviceSegmentedReduce::Sum calls and the
cub::DeviceTransform::Transform call) with CUDA error checking: capture the
returned cudaError_t from each call and call the project's CUDA error-handling
utility (or check for != cudaSuccess and log/return) using the same context
variables (dot_product_storage_, dot_product_bytes_,
thrust::make_transform_iterator(...), dual_dot_.data(), primal_slack_.data(),
sum_primal_slack_.data(), dual_objective_.data(), dual_objective_.size(),
stream_view_) so any failure is detected and reported immediately; ensure you do
this in the block that computes the dual objective so failures in segmented
reduce or transform are handled.

