Batch PDLP #805
base: release/26.02

Conversation
📝 Walkthrough
This pull request introduces comprehensive batch PDLP solve capabilities, refactoring hyper-parameter handling from static globals to runtime-configurable structs, and implementing multi-climber context management. The changes enable solving multiple problems with different variable bounds simultaneously while maintaining backward compatibility for single-problem scenarios.
Estimated code review effort: 🎯 5 (Critical) | ⏱️ ~120 minutes
🚥 Pre-merge checks: ✅ 4 passed | ❌ 1 failed (1 warning)
Actionable comments posted: 12
Note
Due to the large number of review comments, Critical and Major severity comments were prioritized as inline comments.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (8)
cpp/src/linear_programming/utilities/problem_checking.cu (1)
59-81: Restore bounds validation for non-batch modes. The bounds check is fully disabled, so non-batch PDLP and MIP warm starts can slip through with out-of-bounds values, which can corrupt solver state or yield incorrect results. Please gate this by an explicit batch-mode flag rather than removing the check globally.
🔧 Suggested fix (gate by a batch-mode flag)
- // TODO batch mode: Akif/Alice why? - // This clashes with batch PDLP warm start which by design will not be within the bounds - /*cuopt_expects(!thrust::any_of(op_problem.get_handle_ptr()->get_thrust_policy(), - thrust::make_counting_iterator(0), - thrust::make_counting_iterator(0) + op_problem.get_n_variables(), - [lower_bounds = make_span(op_problem.get_variable_lower_bounds()), - upper_bounds = make_span(op_problem.get_variable_upper_bounds()), - assignment_span = make_span(primal_initial_solution), - int_tol = 1e-8] __device__(i_t idx) -> bool { - return assignment_span[idx] < lower_bounds[idx] - int_tol || - assignment_span[idx] > upper_bounds[idx] + int_tol; - }), - error_type_t::ValidationError, - "Initial solution violates variable bounds.");*/ + // Allow out-of-bounds only when batch mode explicitly requests it. + if (!allow_out_of_bounds) { + cuopt_expects(!thrust::any_of(op_problem.get_handle_ptr()->get_thrust_policy(), + thrust::make_counting_iterator(0), + thrust::make_counting_iterator(0) + + op_problem.get_n_variables(), + [lower_bounds = make_span( + op_problem.get_variable_lower_bounds()), + upper_bounds = make_span( + op_problem.get_variable_upper_bounds()), + assignment_span = make_span(primal_initial_solution), + int_tol = 1e-8] __device__(i_t idx) -> bool { + return assignment_span[idx] < lower_bounds[idx] - int_tol || + assignment_span[idx] > upper_bounds[idx] + int_tol; + }), + error_type_t::ValidationError, + "Initial solution violates variable bounds."); + }You’ll need to thread
allow_out_of_bounds from the caller (e.g., batch PDLP settings) so non-batch paths keep the validation. Based on learnings, validate correct initialization of variable bounds before solving.

cpp/src/linear_programming/pdhg.cu (1)
1020-1052: Missing batch matrix view updates after solution buffer swaps in batch mode. The update_solution method swaps solution buffers and updates the single-vector cuSparse views, but does not update the batch matrix views (batch_dual_solutions, batch_current_AtYs, batch_next_AtYs) when batch mode is enabled. Since these batch descriptors point to the underlying buffers being swapped, they become stale and point to invalid memory after the swaps complete (lines 1030-1032). Add corresponding cusparseDnMatSetValues calls for the batch matrix descriptors when batch_mode_ is true, following the pattern used elsewhere in pdlp.cu (e.g., lines 1715, 1756).
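As a rough illustration only (the descriptor and getter names batch_dual_solutions_, get_dual_solution(), get_current_AtY(), and get_next_AtY() are assumptions based on this comment, not verified against the class), the refresh could look like:

if (batch_mode_) {
  // Re-point the batch dense-matrix descriptors at the freshly swapped buffers so
  // cuSPARSE no longer reads the stale pre-swap pointers.
  RAFT_CUSPARSE_TRY(cusparseDnMatSetValues(
    batch_dual_solutions_, current_saddle_point_state_.get_dual_solution().data()));
  RAFT_CUSPARSE_TRY(cusparseDnMatSetValues(
    batch_current_AtYs_, current_saddle_point_state_.get_current_AtY().data()));
  RAFT_CUSPARSE_TRY(cusparseDnMatSetValues(
    batch_next_AtYs_, current_saddle_point_state_.get_next_AtY().data()));
}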
cpp/src/mip/diversity/diversity_manager.cu (1)
90-101: config_id is parsed but never used. The variable config_id is read from the CUOPT_CONFIG_ID environment variable and logged, but it is never actually used within the function or passed to any other code. The [[maybe_unused]] attribute suppresses the compiler warning, indicating the developer was aware of the unused variable. While a function get_local_search_and_lm_from_config in multi_armed_bandit.cuh accepts a config_id parameter, the parsed value is never provided to it. Complete the implementation if the configuration ID is intended to affect solver behavior, or remove the unused parsing if not needed. If this is placeholder code for future work, add a // TODO comment clarifying the intention.

cpp/src/linear_programming/initial_scaling_strategy/initial_scaling.cuh (1)
45-54: Fix dangling reference to hyper_params. Line 98 passes a reference to the constructor parameter settings.hyper_params, but settings_ (the member copy created at line 89) should be used instead. After construction, the parameter reference becomes dangling. Change line 98 from settings.hyper_params to settings_.hyper_params.

cpp/src/linear_programming/pdlp.cuh (1)
88-90: Fix parameter name mismatch for clarity. The declaration uses initial_primal_weight for set_initial_step_size, which is confusing and inconsistent with the method's intent.
🧹 Proposed fix
- void set_initial_step_size(f_t initial_primal_weight);
+ void set_initial_step_size(f_t initial_step_size);

cpp/src/linear_programming/termination_strategy/termination_strategy.cu (1)
318-338: Batch + per_constraint_residual uses a single residual for all climbers. In batch mode, the per-constraint residual path reads a scalar residual (no per-climber indexing), which can incorrectly mark termination when climbers diverge. Please either compute per-climber residuals or explicitly disallow this mode in batch.
🛡️ Suggested guard (disallow until per-climber residuals exist)
void pdlp_termination_strategy_t<i_t, f_t>::check_termination_criteria()
{
#ifdef PDLP_VERBOSE_MODE
  RAFT_CUDA_TRY(cudaDeviceSynchronize());
#endif
+ cuopt_assert(!(settings_.per_constraint_residual && climber_strategies_.size() > 1),
+              "per_constraint_residual is not supported in batch mode yet");
  const auto [grid_size, block_size] = kernel_config_from_batch_size(climber_strategies_.size());
  check_termination_criteria_kernel<i_t, f_t>
243-259: Add Doxygen for new batch accessors. get_terminations_status() and the vector-returning get_additional_termination_informations() are new public APIs but lack Doxygen and batch semantics/ownership notes. Please add brief docs (including id bounds and batch-size expectations). As per coding guidelines, please add Doxygen for new public API.
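For illustration, a brief Doxygen block along these lines would capture the batch semantics (the exact signature shown is assumed from the surrounding discussion, not copied from the header):

/**
 * @brief Termination status of climber `id` in a batch solve.
 *
 * @param id Climber index; must be in [0, batch_size). Defaults to 0 for the
 *           single-problem (non-batch) case.
 * @return pdlp_termination_status_t of the requested climber.
 */
pdlp_termination_status_t get_terminations_status(i_t id = 0) const;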
cpp/src/linear_programming/solver_solution.cu (1)
25-43: Ensure termination_stats_ is non-empty and aligned with termination_status_. The constructors that only set termination_status_ leave termination_stats_ empty, but getters index termination_stats_[id] after only checking termination_status_.size(). That can pass the assert and still read out of bounds (e.g., get_objective_value(0) on an error-status instance). Initialize a default entry and enforce size parity in the getters.
🐛 Proposed fix
optimization_problem_solution_t<i_t, f_t>::optimization_problem_solution_t( pdlp_termination_status_t termination_status, rmm::cuda_stream_view stream_view) : primal_solution_{0, stream_view}, dual_solution_{0, stream_view}, reduced_cost_{0, stream_view}, - termination_status_{termination_status}, + termination_status_{termination_status}, + termination_stats_{additional_termination_information_t{}}, error_status_(cuopt::logic_error("", cuopt::error_type_t::Success)) { } optimization_problem_solution_t<i_t, f_t>::optimization_problem_solution_t( cuopt::logic_error error_status_, rmm::cuda_stream_view stream_view) : primal_solution_{0, stream_view}, dual_solution_{0, stream_view}, reduced_cost_{0, stream_view}, - termination_status_{pdlp_termination_status_t::NoTermination}, + termination_status_{pdlp_termination_status_t::NoTermination}, + termination_stats_{additional_termination_information_t{}}, error_status_(error_status_) { } f_t optimization_problem_solution_t<i_t, f_t>::get_objective_value(i_t id) const { cuopt_assert(id < termination_status_.size(), "id too big for batch size"); + cuopt_assert(termination_stats_.size() == termination_status_.size(), + "termination_stats_ size mismatch"); return termination_stats_[id].primal_objective; }Also applies to: 288-293, 320-331, 386-391
🤖 Fix all issues with AI agents
In `@cpp/include/cuopt/linear_programming/solve.hpp`:
- Around line 17-19: The public header solve.hpp currently includes an
internal-only header <dual_simplex/user_problem.hpp>, preventing external users
from resolving dual_simplex::user_problem_t; either move/export user_problem.hpp
into the public include tree (e.g., place it under the public include path such
as cpp/include/cuopt/linear_programming/dual_simplex/user_problem.hpp) and
update the project's install/export targets so the header is installed, or
change the include in solve.hpp to a public-facing path that will be installed;
ensure the declaration/definition of dual_simplex::user_problem_t remains
available to any translation unit that includes solve.hpp and update
build/install CMake/EXPORT rules accordingly.
In `@cpp/include/cuopt/linear_programming/utilities/segmented_sum_handler.cuh`:
- Around line 30-31: The default constructor segmented_sum_handler_t() leaves
member stream_view_ uninitialized causing undefined behavior when
segmented_sum_helper or segmented_reduce_helper are later called; fix by
initializing stream_view_ to a safe default (e.g., rmm::cuda_stream_default or
equivalent rmm::cuda_stream_view{}) in the default constructor (or alternatively
delete the default constructor if you want to forbid use), and/or add an
assert/check in segmented_sum_helper/segmented_reduce_helper that stream_view_
is valid before use; update the segmented_sum_handler_t() definition and any
constructors to ensure stream_view_ is always initialized.
In `@cpp/src/dual_simplex/simplex_solver_settings.hpp`:
- Around line 158-159: The member variable mip_batch_pdlp_strong_branching is
declared but not initialized, causing indeterminate values; update the class
constructor (the initializer list in simplex_solver_settings constructor) to
initialize mip_batch_pdlp_strong_branching (e.g., to 0 or 1 as appropriate) or
give it an in-class default at its declaration so the flag has a well-defined
value whenever code checks mip_batch_pdlp_strong_branching.
In
`@cpp/src/linear_programming/optimal_batch_size_handler/optimal_batch_size_handler.cu`:
- Around line 25-27: The allocation sizes for x, y (and
buffer_non_transpose_batch init) use int multiplication (primal_size *
current_batch_size and dual_size * current_batch_size) which can overflow;
change the computation to cast operands to size_t before multiplying (e.g.,
(size_t)primal_size * (size_t)current_batch_size) and/or compute into a size_t
temp and add an overflow guard (check against SIZE_MAX or compare
current_batch_size > SIZE_MAX / primal_size) before constructing x, y or
buffer_non_transpose_batch; update any local variable types used for these sizes
to size_t and ensure the constructor calls for x, y, buffer_non_transpose_batch
use the size_t result.
- Around line 191-213: In optimal_batch_size_handler, guard against non-positive
max_batch_size before computing log2/pow: add a precondition like "if
(max_batch_size <= 0) return 1;" (or clamp max_batch_size = 1) near the top of
the function before computing current_batch_size so std::log2(std::min(...))
never receives 0 or negative values; this fixes potential NaNs/UB for
current_batch_size derived from initial_batch_size and max_batch_size.
In `@cpp/src/linear_programming/pdhg.hpp`:
- Around line 27-33: The constructor pdhg_solver_t currently stores const
references to incoming climber_strategies and hyper_params which can dangle
(e.g., when callers pass temporaries or default pdlp_solver_settings_t{} is
used); change the class members to own copies instead of references by replacing
the reference members with value members (e.g.,
pdlp_hyper_params::pdlp_hyper_params_t hyper_params_; and
std::vector<pdlp_climber_strategy_t> climber_strategies_); update the
pdhg_solver_t constructor initialization to copy the passed-in hyper_params and
climber_strategies into these members (remove const& member usage), and ensure
any places that relied on the reference types (including pdlp_solver_t defaults)
now work with the owned copies.
In `@cpp/src/linear_programming/pdlp_constants.hpp`:
- Around line 17-22: The function kernel_config_from_batch_size can divide by
zero when batch_size == 0 because block_size becomes 0 and cuda::ceil_div is
called; add an early guard in kernel_config_from_batch_size to handle batch_size
== 0 (either assert(batch_size > 0) or return a safe pair like {0,0} or {1,0}
per project convention) before computing block_size and calling cuda::ceil_div,
and ensure the change references kernel_config_from_batch_size, block_size,
grid_size, and cuda::ceil_div so reviewers can find the fix.
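A minimal, self-contained sketch of such a guard is shown below; the 128-thread block cap and the exact return convention are assumptions, not the project's actual helper:

#include <cuda/cmath>  // cuda::ceil_div
#include <algorithm>
#include <cassert>
#include <utility>

inline std::pair<int, int> kernel_config_from_batch_size(int batch_size)
{
  assert(batch_size > 0 && "batch_size must be > 0");  // guard before dividing
  const int block_size = std::min(batch_size, 128);    // assumed block-size cap
  const int grid_size  = cuda::ceil_div(batch_size, block_size);
  return {grid_size, block_size};
}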
In `@cpp/src/linear_programming/restart_strategy/pdlp_restart_strategy.cu`:
- Around line 277-295: The calls to cub::DeviceSegmentedReduce::Sum in
pdlp_restart_strategy.cu (the three usages that pass nullptr, byte_needed,
thrust::make_transform_iterator(..., power_two_func_t<f_t>{}),
reusable_device_scalar_1_.data(), climber_strategies_.size(),
primal_size_h_/dual_size_h_, stream_view_) ignore the returned cudaError_t; wrap
each call with RAFT_CUDA_TRY so failures are checked and propagated (i.e.,
replace the bare cub::DeviceSegmentedReduce::Sum(...) invocations with
RAFT_CUDA_TRY(cub::DeviceSegmentedReduce::Sum(...)) for the segments computing
dot_product_bytes).
In `@cpp/src/linear_programming/solve.cu`:
- Around line 562-572: The mapping of row senses to constraint bounds in the
loop handling user_problem.row_sense is inverted: for 'G' (Ax >= b) set
constraint_lower[i] = user_problem.rhs[i] and constraint_upper[i] =
std::numeric_limits<f_t>::infinity(), and for 'L' (Ax <= b) set
constraint_lower[i] = -std::numeric_limits<f_t>::infinity() and
constraint_upper[i] = user_problem.rhs[i]; update the assignments for
constraint_lower and constraint_upper accordingly where those variables and
user_problem.rhs/row_sense are used.
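A hedged sketch of the corrected mapping (the num_rows field name and the 'E' equality branch are assumptions added for completeness):

for (i_t i = 0; i < user_problem.num_rows; ++i) {
  if (user_problem.row_sense[i] == 'G') {         // Ax >= b
    constraint_lower[i] = user_problem.rhs[i];
    constraint_upper[i] = std::numeric_limits<f_t>::infinity();
  } else if (user_problem.row_sense[i] == 'L') {  // Ax <= b
    constraint_lower[i] = -std::numeric_limits<f_t>::infinity();
    constraint_upper[i] = user_problem.rhs[i];
  } else {                                        // 'E': Ax == b (assumed)
    constraint_lower[i] = user_problem.rhs[i];
    constraint_upper[i] = user_problem.rhs[i];
  }
}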
- Around line 711-719: The code declares an empty std::vector named info of type
optimization_problem_solution_t<i_t, f_t>::additional_termination_information_t
and then accesses info[0], causing undefined behavior; fix by ensuring the
vector contains at least one element before accessing it — e.g., resize info to
1 or construct the element via info.emplace_back() or info.push_back(...) prior
to setting info[0].primal_objective, info[0].number_of_steps_taken, and
info[0].solve_time; reference the symbols info,
optimization_problem_solution_t::additional_termination_information_t,
vertex_solution, and start_solver to locate where to add the resize/emplace.
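A minimal sketch of the pre-sizing fix; the placeholder values stand in for whatever the surrounding code actually computes:

using info_t =
  typename optimization_problem_solution_t<i_t, f_t>::additional_termination_information_t;
std::vector<info_t> info(1);             // one element up front so info[0] is valid
info[0].primal_objective      = 0.0;     // placeholder; real value comes from the solve
info[0].number_of_steps_taken = 0;       // placeholder
info[0].solve_time            = 0.0;     // placeholder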
In `@cpp/src/linear_programming/swap_and_resize_helper.cuh`:
- Around line 48-84: matrix_swap currently calls cub::DeviceTransform with
in_zip and out_zip built from the same permutation iterators
(left_index/right_index), which can produce overlapping partial-addresses and
violates CUB's no-overlap precondition; fix by making the transform
non-overlapping: allocate a temporary rmm::device_uvector<f_t> temp(total_items)
(or a temp buffer sized to vector_size * swap_count), run DeviceTransform
writing into temp (using left_index/right_index-derived input and a linear
output), then copy/scatter the swapped values from temp back into matrix at the
target permuted positions (or implement a custom CUDA kernel that performs the
swaps in-place safely); update references in this function (matrix_swap,
left_index, right_index, in_zip, out_zip) accordingly and ensure matrix.stream()
and proper synchronization are preserved.
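One of the suggested directions, a custom in-place kernel, could look roughly like the following; the stride parameter for the batch layout and the disjointness of the swap pairs are assumptions:

// Each thread swaps one element between the two permuted columns of `matrix`.
// Safe as long as the (left_index, right_index) pairs are pairwise disjoint.
template <typename i_t, typename f_t>
__global__ void matrix_swap_kernel(f_t* matrix,
                                   const i_t* left_index,
                                   const i_t* right_index,
                                   i_t stride,       // leading dimension of the batch layout
                                   i_t vector_size,
                                   i_t swap_count)
{
  const size_t tid = static_cast<size_t>(blockIdx.x) * blockDim.x + threadIdx.x;
  if (tid >= static_cast<size_t>(vector_size) * swap_count) return;
  const i_t row  = static_cast<i_t>(tid / swap_count);
  const i_t pair = static_cast<i_t>(tid % swap_count);
  f_t* a = matrix + static_cast<size_t>(row) * stride + left_index[pair];
  f_t* b = matrix + static_cast<size_t>(row) * stride + right_index[pair];
  const f_t tmp = *a;
  *a = *b;
  *b = tmp;
}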
In `@cpp/src/mip/solve.cu`:
- Around line 62-65: You modified a local copy hyper_params but never propagated
those flag changes into the PDLP solvers; update the flow so the PDLP flags
actually reach the solvers: either assign the modified hyper_params back into
the global settings before constructing mip_solver_t (e.g.,
settings.hyper_params = hyper_params) so mip_solver_t and downstream code uses
the updated flags, or change the PDLP settings initialization path
(pdlp_solver_settings_t and set_pdlp_solver_mode) to accept and preserve these
hyper_params (pass hyper_params into pdlp_solver_settings_t or have
set_pdlp_solver_mode merge/prefer explicit hyper_params instead of overwriting
them). Ensure the chosen fix covers all solver creation sites (mip_solver_t
construction and places that create pdlp_solver_settings_t) so
update_primal_weight_on_initial_solution and
update_step_size_on_initial_solution are honored by PDLP.
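A hedged sketch of the first option (write the modified copy back before the solver is constructed; the mip_solver_t constructor shape is assumed):

// hyper_params was modified above (e.g. update_primal_weight_on_initial_solution,
// update_step_size_on_initial_solution); write the copy back so PDLP actually sees it.
settings.hyper_params = hyper_params;
mip_solver_t<i_t, f_t> solver(op_problem, settings);  // assumed constructor shape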
🟡 Minor comments (13)
cpp/src/linear_programming/pdhg.cu-764-773 (1)
764-773: Missing CUDA error check after kernel launch.Per coding guidelines, every CUDA kernel launch should have error checking. The
refine_initial_primal_projection_kernellaunch at line 765 is missingRAFT_CUDA_TRY(cudaPeekAtLastError())that is present for other kernel launches (e.g., line 175).🐛 Proposed fix
refine_initial_primal_projection_kernel<<<grid_size, block_size, 0, stream_view_>>>( (i_t)climber_strategies_.size(), problem_ptr->n_variables, raft::device_span<const i_t>(new_bounds_idx_.data(), new_bounds_idx_.size()), raft::device_span<const f_t>(new_bounds_lower_.data(), new_bounds_lower_.size()), raft::device_span<const f_t>(new_bounds_upper_.data(), new_bounds_upper_.size()), raft::device_span<f_t>(current_saddle_point_state_.get_primal_solution().data(), current_saddle_point_state_.get_primal_solution().size())); + RAFT_CUDA_TRY(cudaPeekAtLastError()); }cpp/src/linear_programming/pdhg.cu-929-943 (1)
929-943: Missing CUDA error check after kernel launch.Same issue as above - the
refine_primal_projection_batch_kernellaunch is missingRAFT_CUDA_TRY(cudaPeekAtLastError()).🐛 Proposed fix
refine_primal_projection_batch_kernel<<<grid_size, block_size, 0, stream_view_>>>( (i_t)climber_strategies_.size(), problem_ptr->n_variables, raft::device_span<const i_t>(new_bounds_idx_.data(), new_bounds_idx_.size()), raft::device_span<const f_t>(new_bounds_lower_.data(), new_bounds_lower_.size()), raft::device_span<const f_t>(new_bounds_upper_.data(), new_bounds_upper_.size()), raft::device_span<const f_t>(current_saddle_point_state_.get_primal_solution().data(), current_saddle_point_state_.get_primal_solution().size()), raft::device_span<const f_t>(problem_ptr->objective_coefficients.data(), problem_ptr->objective_coefficients.size()), raft::device_span<const f_t>(current_saddle_point_state_.get_current_AtY().data(), current_saddle_point_state_.get_current_AtY().size()), raft::device_span<const f_t>(primal_step_size.data(), primal_step_size.size()), raft::device_span<f_t>(reflected_primal_.data(), reflected_primal_.size())); + RAFT_CUDA_TRY(cudaPeekAtLastError()); }cpp/src/linear_programming/pdhg.cu-825-842 (1)
825-842: Missing CUDA error check after kernel launch in graph capture block.The
refine_primal_projection_major_batch_kernellaunch is missing error checking. While inside a graph capture block,cudaPeekAtLastError()should still be called to catch launch configuration errors.🐛 Proposed fix
refine_primal_projection_major_batch_kernel<<<grid_size, block_size, 0, stream_view_>>>( (i_t)climber_strategies_.size(), problem_ptr->n_variables, raft::device_span<const i_t>(new_bounds_idx_.data(), new_bounds_idx_.size()), raft::device_span<const f_t>(new_bounds_lower_.data(), new_bounds_lower_.size()), raft::device_span<const f_t>(new_bounds_upper_.data(), new_bounds_upper_.size()), raft::device_span<const f_t>(current_saddle_point_state_.get_primal_solution().data(), current_saddle_point_state_.get_primal_solution().size()), raft::device_span<const f_t>(problem_ptr->objective_coefficients.data(), problem_ptr->objective_coefficients.size()), raft::device_span<const f_t>(current_saddle_point_state_.get_current_AtY().data(), current_saddle_point_state_.get_current_AtY().size()), raft::device_span<const f_t>(primal_step_size.data(), primal_step_size.size()), raft::device_span<f_t>(potential_next_primal_solution_.data(), potential_next_primal_solution_.size()), raft::device_span<f_t>(dual_slack_.data(), dual_slack_.size()), raft::device_span<f_t>(reflected_primal_.data(), reflected_primal_.size())); + RAFT_CUDA_TRY(cudaPeekAtLastError()); }cpp/src/linear_programming/pdhg.cu-189-192 (1)
189-192: Assertion message inconsistent with condition.The assertion checks
new_size < batch_size(strictly less than), but the error message says "less than or equal to". This inconsistency could cause confusion during debugging.🐛 Proposed fix
If only shrinking is intended:
- cuopt_assert(new_size < batch_size, "New size must be less than or equal to batch size"); + cuopt_assert(new_size < batch_size, "New size must be less than batch size");Or if same-size should be allowed (no-op resize):
- cuopt_assert(new_size < batch_size, "New size must be less than or equal to batch size"); + cuopt_assert(new_size <= batch_size, "New size must be less than or equal to batch size");cpp/src/utilities/copy_helpers.hpp-325-329 (1)
325-329: Remove or document themake_span(universal_host_pinned_vector)overload.The function at lines 325–329 is never called in the codebase. When
universal_host_pinned_vectoris actually used (e.g., inpdlp.cu), code correctly bypasses this overload and usesthrust::raw_pointer_cast()directly. Either remove this as dead code, or if intended for future use, add explicit documentation that it requires zero-copy/UVA semantics and should only be passed to device kernels that support unified memory.cpp/src/linear_programming/saddle_point.cu-77-92 (1)
77-92: Assertion message inconsistent with code logic.Line 83 states "New size must be less than or equal to batch size" but the actual assertion uses strict less-than (
new_size < batch_size). If equality should be allowed, update the condition; otherwise, fix the message.🔧 Suggested fix
cuopt_assert(new_size > 0, "New size must be greater than 0"); - cuopt_assert(new_size < batch_size, "New size must be less than or equal to batch size"); + cuopt_assert(new_size <= batch_size, "New size must be less than or equal to batch size");Or if strict less-than is intended:
cuopt_assert(new_size > 0, "New size must be greater than 0"); - cuopt_assert(new_size < batch_size, "New size must be less than or equal to batch size"); + cuopt_assert(new_size < batch_size, "New size must be strictly less than batch size");cpp/src/mip/diversity/recombiners/sub_mip.cuh-73-84 (1)
73-84: Address or remove the TODO comment before merging.The TODO at line 73 indicates uncertainty about this change. Please either:
- Get confirmation from Akif and Alice that this approach is correct, then remove the TODO
- Document why this approach is appropriate
The change itself is sound—replacing hard-coded scaling constants with configurable hyper-parameters from
context.settings.hyper_paramsimproves maintainability and aligns with the PR's hyper-parameter consolidation effort.cpp/include/cuopt/linear_programming/solve.hpp-77-124 (1)
77-124: Tighten docs and fix typos in the new batch API. Line 80+ has typos ("accross", "thie") and the doc lacks a @return. Consider clarifying size constraints and error behavior. As per coding guidelines, public headers should clearly document new APIs.
📝 Suggested doc fixes
- * The only difference accross climbers will be one variable bound change.
+ * The only difference across climbers will be one variable bound change.
@@
- * Let the optimal objective value of thie problem be obj_down
+ * Let the optimal objective value of this problem be obj_down
@@
- */
+ *
+ * @return optimization_problem_solution_t<i_t, f_t> owning container for the solver solution
+ */
58-59: Fix typo in TODO comment.“constpexr” should be “constexpr”.
📝 Suggested fix
-// constpexr version of each (Stable2, Stable1, Methodical1, Fast1, Stable3...) +// constexpr version of each (Stable2, Stable1, Methodical1, Fast1, Stable3...)cpp/src/linear_programming/restart_strategy/localized_duality_gap_container.cu-107-120 (1)
107-120: Inconsistent assertion message inresize_context.Line 114 asserts
new_size < batch_sizebut the message says "less than or equal to batch size". Also, restrictingnew_size < batch_sizeprevents resizing to the same size, which may be a valid no-op.Suggested fix
- cuopt_assert(new_size < batch_size, "New size must be less than or equal to batch size"); + cuopt_assert(new_size <= batch_size, "New size must be less than or equal to batch size");cpp/include/cuopt/linear_programming/pdlp/solver_settings.hpp-120-133 (1)
120-133: Resolve the temporary batch-mode TODOs before release.The public header exposes setters with “TODO batch mode: tmp”, which suggests the API is provisional. Please either finalize the behavior or gate it behind an internal-only path and track a cleanup task to avoid shipping TODOs in the public interface.
cpp/src/linear_programming/step_size_strategy/adaptive_step_size_strategy.cu-154-165 (1)
154-165: Assertion condition for resize_context may be too restrictive.The assertion
new_size < batch_sizeon line 160 prevents resizing to the current batch size. If resizing to the same size is a no-op, it should be allowed. Consider changing to<=.🐛 Proposed fix
[[maybe_unused]] const auto batch_size = static_cast<i_t>(interaction_.size()); cuopt_assert(batch_size > 0, "Batch size must be greater than 0"); cuopt_assert(new_size > 0, "New size must be greater than 0"); - cuopt_assert(new_size < batch_size, "New size must be less than or equal to batch size"); + cuopt_assert(new_size <= batch_size, "New size must be less than or equal to batch size");cpp/src/linear_programming/termination_strategy/convergence_information.cu-249-270 (1)
249-270: Assertion condition in resize_context is too restrictive.Similar to the adaptive_step_size_strategy, the assertion
new_size < batch_sizeon line 255 prevents resizing to the current size. The comment says "less than or equal" but the code uses strict less-than.🐛 Proposed fix
[[maybe_unused]] const auto batch_size = static_cast<i_t>(primal_objective_.size()); cuopt_assert(batch_size > 0, "Batch size must be greater than 0"); cuopt_assert(new_size > 0, "New size must be greater than 0"); - cuopt_assert(new_size < batch_size, "New size must be less than or equal to batch size"); + cuopt_assert(new_size <= batch_size, "New size must be less than or equal to batch size");
🧹 Nitpick comments (26)
cpp/src/linear_programming/pdhg.cu (3)
97-109: Consider using structured bindings orstd::transformfor cleaner tuple unpacking.The loop manually unpacks each tuple element. While functional, this could be simplified using
std::transformwith a projection or structured bindings in a single pass.♻️ Optional refactor using std::transform
if (!new_bounds.empty()) { std::vector<i_t> idx(new_bounds.size()); std::vector<f_t> lower(new_bounds.size()); std::vector<f_t> upper(new_bounds.size()); - for (size_t i = 0; i < new_bounds.size(); ++i) { - idx[i] = std::get<0>(new_bounds[i]); - lower[i] = std::get<1>(new_bounds[i]); - upper[i] = std::get<2>(new_bounds[i]); - } + std::transform(new_bounds.begin(), new_bounds.end(), idx.begin(), + [](const auto& t) { return std::get<0>(t); }); + std::transform(new_bounds.begin(), new_bounds.end(), lower.begin(), + [](const auto& t) { return std::get<1>(t); }); + std::transform(new_bounds.begin(), new_bounds.end(), upper.begin(), + [](const auto& t) { return std::get<2>(t); }); raft::copy(new_bounds_idx_.data(), idx.data(), idx.size(), stream_view_); raft::copy(new_bounds_lower_.data(), lower.data(), lower.size(), stream_view_); raft::copy(new_bounds_upper_.data(), upper.data(), upper.size(), stream_view_); }
445-541: Consider templating to reduce duplication between batch and non-batch projection functors.There are 4 pairs of nearly identical functors (non-batch with pointer and batch with value parameter). Per coding guidelines, this duplication (8 functors with similar logic) could be reduced using a template approach.
♻️ Example: Unified template approach
template <typename f_t, bool IsBatch> struct primal_reflected_projection_impl { using f_t2 = typename type_2<f_t>::type; // For non-batch: step_size_ptr points to single scalar // For batch: step_size_ptr points to array, indexed by batch_idx const f_t* step_size_ptr; HDI f_t get_step_size([[maybe_unused]] int batch_idx) const { if constexpr (IsBatch) { return step_size_ptr[batch_idx]; } else { return *step_size_ptr; } } HDI f_t operator()(f_t current_primal, f_t objective, f_t Aty, f_t2 bounds, int batch_idx = 0) { const f_t step_size = get_step_size(batch_idx); cuopt_assert(step_size != f_t(0.0), "Scalar can't be 0"); const f_t next = current_primal - step_size * (objective - Aty); const f_t next_clamped = raft::max<f_t>(raft::min<f_t>(next, bounds.y), bounds.x); return f_t(2.0) * next_clamped - current_primal; } };
556-559: Indexing pattern is consistent but consider documenting the memory layout.The bulk operations use
batch_idx = idx % batch_sizeandvar_idx = idx / batch_size, indicating a batch-interleaved (column-major w.r.t. batch) memory layout. This enables coalesced access for the batch dimension. A brief comment documenting this layout choice would aid maintainability.+ // Memory layout: data[var_idx * batch_size + batch_idx] + // This interleaves batch elements for coalesced access when processing batches in parallel HDI void operator()(size_t idx) { const int batch_idx = idx % batch_size; const int var_idx = idx / batch_size;cpp/include/cuopt/linear_programming/constants.h (1)
55-55: New public API constant for batch PDLP strong branching.The new
CUOPT_MIP_BATCH_PDLP_STRONG_BRANCHINGconstant is added to support the batch PDLP feature. As per coding guidelines for public header files, consider adding a Doxygen comment describing this parameter's purpose and valid values (0 or 1 based on the solver_settings.cu registration).cpp/src/linear_programming/optimization_problem.cu (1)
731-732: Inconsistent type usage remains infindMaxAbs.Line 732 still returns bare
0.0for empty vectors, while the analogous case infindMinAbs(line 745) was updated tof_t(0.0). Consider updating for consistency:♻️ Suggested fix
auto findMaxAbs = [](const std::vector<f_t>& vec) -> f_t { - if (vec.empty()) { return 0.0; } + if (vec.empty()) { return f_t(0.0); }cpp/src/linear_programming/pdlp_climber_strategy.hpp (1)
19-21: Optional: Consider adding documentation forpdlp_climber_strategy_t.The struct stores an
original_indexused for batch-mode climber strategy handling. Adding a brief doc comment explaining its purpose would improve maintainability.Regarding templating with
i_t: While the struct is used alongside templated classes that parameterizei_t, the current non-templated design withintis used consistently throughout the codebase as a fixed type passed instd::vector<pdlp_climber_strategy_t>. Templating would require updating multiple usage sites across 7+ files. This could be considered if future requirements demand runtime selection of index type sizes, but is not necessary for current usage patterns.cpp/include/cuopt/linear_programming/mip/solver_settings.hpp (1)
86-102: Document new MIP PDLP settings and resolve the TODO.These new public members change the API surface; please add Doxygen comments (defaults/behavior) and either clear the TODO or track it in an issue. Also confirm ABI/backward‑compat expectations in release notes.
✍️ Suggested doc update
- i_t mip_batch_pdlp_strong_branching = 0; + /** Enable batch PDLP for strong branching (0 = disabled). */ + i_t mip_batch_pdlp_strong_branching = 0; bool log_to_console = true; @@ - // TODO check with Akif and Alice - pdlp_hyper_params::pdlp_hyper_params_t hyper_params; + /** PDLP hyper-parameters used by strong-branching batch PDLP runs. */ + pdlp_hyper_params::pdlp_hyper_params_t hyper_params;As per coding guidelines, public C++ headers should document new API fields and call out compatibility considerations.
cpp/src/linear_programming/swap_and_resize_helper.cuh (1)
86-96: Consider usingsize_tfor index parameters to match container size type.The
left_swap_indexandright_swap_indexparameters areint, buthost_vector.size()returnssize_t. This could cause issues with large vectors (though unlikely in practice). More importantly, the assertionleft_swap_index < right_swap_indexdoesn't validate that indices are non-negative.🔧 Suggested improvement
template <typename host_vector_t> -void host_vector_swap(host_vector_t& host_vector, int left_swap_index, int right_swap_index) +void host_vector_swap(host_vector_t& host_vector, size_t left_swap_index, size_t right_swap_index) { - cuopt_assert(left_swap_index < host_vector.size(), "Left swap index is out of bounds"); - cuopt_assert(right_swap_index < host_vector.size(), "Right swap index is out of bounds"); + cuopt_assert(left_swap_index < host_vector.size(), "Left swap index out of bounds"); + cuopt_assert(right_swap_index < host_vector.size(), "Right swap index out of bounds"); cuopt_assert(left_swap_index < right_swap_index, "Left swap index must be less than right swap index");cpp/src/linear_programming/saddle_point.cu (1)
57-75:primal_gradient_excluded from swap_context - verify this is intentional.The
swap_contextmethod swaps all batched buffers exceptprimal_gradient_. This is consistent with the constructor comment thatprimal_gradient_is only used in trust region restart mode which doesn't support batch mode. However, ifswap_contextis ever called in a context whereprimal_gradient_matters, this could lead to subtle bugs.Consider adding a comment here explaining why
primal_gradient_is excluded, mirroring the constructor comment.📝 Suggested documentation
matrix_swap(dual_gradient_, dual_size_, swap_pairs); matrix_swap(current_AtY_, primal_size_, swap_pairs); matrix_swap(next_AtY_, primal_size_, swap_pairs); + // Note: primal_gradient_ is not swapped as it's only used in trust region restart mode + // which does not support batch mode }cpp/include/cuopt/linear_programming/utilities/segmented_sum_handler.cuh (2)
38-51: Inconsistent stream parameter usage between helper methods.
segmented_sum_helperpassesstream_view_directly to CUB (lines 40, 50), whilesegmented_reduce_helperusesstream_view_.value()(lines 69, 71, 81). Both should work, but the inconsistency could cause confusion. Consider using the same approach in both methods.♻️ Suggested consistency fix
template <typename InputIteratorT, typename OutputIteratorT> void segmented_sum_helper(InputIteratorT input, OutputIteratorT output, i_t batch_size, i_t problem_size) { cub::DeviceSegmentedReduce::Sum( - nullptr, byte_needed_, input, output, batch_size, problem_size, stream_view_); + nullptr, byte_needed_, input, output, batch_size, problem_size, stream_view_.value()); - segmented_sum_storage_.resize(byte_needed_, stream_view_); + segmented_sum_storage_.resize(byte_needed_, stream_view_.value()); cub::DeviceSegmentedReduce::Sum(segmented_sum_storage_.data(), byte_needed_, input, output, batch_size, problem_size, - stream_view_); + stream_view_.value()); }Also applies to: 53-82
84-86: Consider making members private with accessors.The struct exposes
byte_needed_,segmented_sum_storage_, andstream_view_as public members. While this may be intentional for a utility struct, it could lead to accidental misuse. Consider making them private if external access isn't needed.cpp/src/linear_programming/utils.cuh (2)
91-98: Assertion followed by fallback behavior - clarify intent.The
batch_safe_divfunctor assertsb != 0but then returnsawhenb == 0anyway. In debug builds, the assertion will trigger; in release builds, it silently returnsa. This mixed approach may cause confusion about expected behavior.Consider either:
- Removing the assertion if returning
aon division-by-zero is the intended fallback- Keeping only the assertion if division-by-zero is truly an error
♻️ Option 1: Remove assertion if fallback is intended
template <typename f_t> struct batch_safe_div { HDI f_t operator()(f_t a, f_t b) { - cuopt_assert(b != f_t(0), "Division by zero"); return b != f_t(0) ? a / b : a; } };♻️ Option 2: Keep only assertion if it's an error
template <typename f_t> struct batch_safe_div { HDI f_t operator()(f_t a, f_t b) { cuopt_assert(b != f_t(0), "Division by zero"); - return b != f_t(0) ? a / b : a; + return a / b; } };
194-221: Address TODO: Usei_tinstead ofintfor index types.Both
batch_wrapped_iteratorandproblem_wrapped_iteratoruseintforproblem_size_, and there's already a TODO comment on line 219. Usingintcould cause issues with large problem sizes exceedingINT_MAX. Consider templating these oni_tfor consistency with the rest of the codebase.♻️ Suggested fix
-template <typename f_t> +template <typename i_t, typename f_t> struct batch_wrapped_iterator { - batch_wrapped_iterator(const f_t* problem_input, int problem_size) + batch_wrapped_iterator(const f_t* problem_input, i_t problem_size) : problem_input_(problem_input), problem_size_(problem_size) { } - HDI f_t operator()(int id) { return problem_input_[id / problem_size_]; } + HDI f_t operator()(i_t id) { return problem_input_[id / problem_size_]; } const f_t* problem_input_; - int problem_size_; + i_t problem_size_; }; -template <typename f_t> +template <typename i_t, typename f_t> struct problem_wrapped_iterator { - problem_wrapped_iterator(const f_t* problem_input, int problem_size) + problem_wrapped_iterator(const f_t* problem_input, i_t problem_size) : problem_input_(problem_input), problem_size_(problem_size) { } - HDI f_t operator()(int id) { return problem_input_[id % problem_size_]; } + HDI f_t operator()(i_t id) { return problem_input_[id % problem_size_]; } const f_t* problem_input_; - // TODO use i_t - int problem_size_; + i_t problem_size_; };cpp/src/linear_programming/restart_strategy/pdlp_restart_strategy.cuh (1)
388-389: Storing const references as class members poses lifetime risk.The members
climber_strategies_andhyper_params_are stored as const references. If the referenced objects are destroyed before thispdlp_restart_strategy_tinstance, dangling reference access will cause undefined behavior. This is particularly risky in a solver context where objects may be moved or destroyed during algorithm phase transitions.Consider storing by value or using
std::reference_wrapperwith explicit lifetime documentation:♻️ Suggested alternatives
- const std::vector<pdlp_climber_strategy_t>& climber_strategies_; - const pdlp_hyper_params::pdlp_hyper_params_t& hyper_params_; + std::vector<pdlp_climber_strategy_t> climber_strategies_; + pdlp_hyper_params::pdlp_hyper_params_t hyper_params_;Or if references are intentional for performance, add a comment documenting the lifetime requirement.
benchmarks/linear_programming/cuopt/run_pdlp.cu (1)
82-95: Missingelsebeforeif (mode == "Stable2")creates inconsistent flow.The function has an inconsistency:
Stable1check doesn't have anelse ifpattern like the others, butStable2does (line 86). While this doesn't affect correctness since early returns are used, it makes the code style inconsistent.♻️ Suggested fix for consistency
static cuopt::linear_programming::pdlp_solver_mode_t string_to_pdlp_solver_mode( const std::string& mode) { if (mode == "Stable1") return cuopt::linear_programming::pdlp_solver_mode_t::Stable1; - if (mode == "Stable2") + else if (mode == "Stable2") return cuopt::linear_programming::pdlp_solver_mode_t::Stable2; else if (mode == "Methodical1")cpp/include/cuopt/linear_programming/pdlp/pdlp_hyper_params.cuh (1)
12-56: Add Doxygen docs for this new public struct.This is a new public API type; please document its purpose and the key defaults/behaviors for consumers. As per coding guidelines, public headers should document new API surfaces.
cpp/src/mip/solve.cu (1)
61-61: Resolve the TODO before merge.Line 61 indicates uncertainty about parameter propagation. Please clarify in code or track via an issue.
Would you like me to draft an issue or propose the exact change?
cpp/src/linear_programming/restart_strategy/localized_duality_gap_container.cu (1)
52-59: Incorrect second argument tocudaMemsetAsync.
cudaMemsetAsynctakes a byte value (anint), not a floating-point value. Passingf_t(0.0)is implicitly converted toint, which happens to work for zeroing memory, but it's misleading and fragile. Use0directly.Suggested fix
RAFT_CUDA_TRY(cudaMemsetAsync(primal_solution_.data(), - f_t(0.0), + 0, sizeof(f_t) * primal_solution_.size(), handle_ptr->get_stream())); RAFT_CUDA_TRY(cudaMemsetAsync(dual_solution_.data(), - f_t(0.0), + 0, sizeof(f_t) * dual_solution_.size(), handle_ptr->get_stream()));cpp/src/dual_simplex/pseudo_costs.cpp (2)
175-180: Consider usingreserveforfraction_valuesto avoid reallocations.Since the size is known upfront (
fractional.size()), reserving capacity improves performance.Suggested fix
std::vector<f_t> fraction_values; + fraction_values.reserve(fractional.size()); for (i_t k = 0; k < fractional.size(); k++) { const i_t j = fractional[k]; fraction_values.push_back(original_root_soln_x[j]); }
186-196: TODO comment indicates incomplete infeasibility handling.The code notes that infeasible branches should count as "done" but this is not yet implemented. This could lead to misleading progress statistics in the log output.
Would you like me to help implement the infeasibility counting logic, or open an issue to track this TODO?
cpp/include/cuopt/linear_programming/pdlp/solver_settings.hpp (1)
192-200: Document the tighter infeasibility tolerances.The default infeasibility tolerances moved to 1e-10, which can materially change convergence behavior and runtime. Please update the public docs/release notes to reflect the new defaults and expected impact. Based on learnings, document numerical tolerances when defaults change.
cpp/src/linear_programming/solve.cu (2)
815-838: Memory estimation loop may not terminate efficiently for large batch sizes.The batch size reduction loop decrements by 1 each iteration, which can be slow for large batch sizes with significant memory pressure. Consider using binary search or proportional reduction.
♻️ Suggested improvement
if (memory_estimate > free_mem) { use_optimal_batch_size = true; // Decrement batch size iteratively until we find a batch size that fits - while (memory_max_batch_size > 1) { - const size_t memory_estimate = batch_pdlp_memory_estimator(problem, memory_max_batch_size); - if (memory_estimate <= free_mem) { break; } - memory_max_batch_size--; + // Use binary search for faster convergence + int low = 1, high = memory_max_batch_size; + while (low < high) { + int mid = low + (high - low + 1) / 2; + if (batch_pdlp_memory_estimator(problem, mid) <= free_mem) { + low = mid; + } else { + high = mid - 1; + } } + memory_max_batch_size = low; }
948-949: Misleading variable initialization foronly_upperflag.The variable
only_upperis set tofalse, but the comment suggests "Lower bounds can sometimes generate infeasible instances". The naming is confusing sinceonly_upper = falsemeans both upper and lower bounds are used, not "only upper". Consider renaming for clarity.cpp/src/linear_programming/step_size_strategy/adaptive_step_size_strategy.hpp (1)
129-130: Dangling reference risk with const reference members.Storing
climber_strategies_andhyper_params_as const references creates lifetime dependency on the caller. If the referenced objects are destroyed before this class, you'll have dangling references. This is acceptable if the ownership contract is documented, but consider adding a comment clarifying the expected lifetime.cpp/include/cuopt/linear_programming/pdlp/solver_solution.hpp (1)
128-147: Document/mitigate the breaking constructor changeThese public constructors now require rvalue vectors for termination data (and warm-start moves), which breaks existing callers. Consider adding deprecated overloads or a migration note in the public API docs/release notes to preserve compatibility guidance. As per coding guidelines, please document the API change/migration.
cpp/src/linear_programming/termination_strategy/termination_strategy.hpp (1)
34-44: Verify lifetime of climber_strategies_ reference
pdlp_termination_strategy_tstoresclimber_strategies_as aconst std::vector<...>&. If the caller passes a temporary or later resizes that vector, this becomes dangling. Please confirm the lifetime contract or store a copy/shared_ptr to make the ownership explicit.Also applies to: 191-192
// Empty constructor for when used in non batch mode
segmented_sum_handler_t() {}
Default constructor leaves stream_view_ uninitialized - potential undefined behavior.
The default constructor doesn't initialize stream_view_, but the helper methods use it. If a caller uses the default constructor and then calls segmented_sum_helper or segmented_reduce_helper, this will result in undefined behavior.
Consider either:
- Deleting the default constructor if it shouldn't be used
- Initializing to a valid default (e.g., rmm::cuda_stream_default)
- Adding runtime checks in the helper methods
🔧 Suggested fix - Option 1: Delete default constructor
- // Empty constructor for when used in non batch mode
- segmented_sum_handler_t() {}
+ // Delete default constructor - stream_view_ must be initialized
+ segmented_sum_handler_t() = delete;

🔧 Suggested fix - Option 2: Initialize to default stream
// Empty constructor for when used in non batch mode
- segmented_sum_handler_t() {}
+ segmented_sum_handler_t() : stream_view_(rmm::cuda_stream_default) {}
🤖 Prompt for AI Agents
In `@cpp/include/cuopt/linear_programming/utilities/segmented_sum_handler.cuh`
around lines 30 - 31, The default constructor segmented_sum_handler_t() leaves
member stream_view_ uninitialized causing undefined behavior when
segmented_sum_helper or segmented_reduce_helper are later called; fix by
initializing stream_view_ to a safe default (e.g., rmm::cuda_stream_default or
equivalent rmm::cuda_stream_view{}) in the default constructor (or alternatively
delete the default constructor if you want to forbid use), and/or add an
assert/check in segmented_sum_helper/segmented_reduce_helper that stream_view_
is valid before use; update the segmented_sum_handler_t() definition and any
constructors to ensure stream_view_ is always initialized.
: x(primal_size * current_batch_size, handle_ptr->get_stream()),
  y(dual_size * current_batch_size, handle_ptr->get_stream()),
  buffer_non_transpose_batch(0, handle_ptr->get_stream()),
Potential overflow in buffer sizing for large problems.
primal_size * current_batch_size and dual_size * current_batch_size are computed in int, which can overflow before the value is converted to size_t, leading to undersized allocations and memory corruption. Cast to size_t (and optionally add a guard) before multiplying.
🐛 Proposed fix
- : x(primal_size * current_batch_size, handle_ptr->get_stream()),
- y(dual_size * current_batch_size, handle_ptr->get_stream()),
+ : x(static_cast<size_t>(primal_size) * static_cast<size_t>(current_batch_size),
+ handle_ptr->get_stream()),
+ y(static_cast<size_t>(dual_size) * static_cast<size_t>(current_batch_size),
+   handle_ptr->get_stream()),
🤖 Prompt for AI Agents
In
`@cpp/src/linear_programming/optimal_batch_size_handler/optimal_batch_size_handler.cu`
around lines 25 - 27, The allocation sizes for x, y (and
buffer_non_transpose_batch init) use int multiplication (primal_size *
current_batch_size and dual_size * current_batch_size) which can overflow;
change the computation to cast operands to size_t before multiplying (e.g.,
(size_t)primal_size * (size_t)current_batch_size) and/or compute into a size_t
temp and add an overflow guard (check against SIZE_MAX or compare
current_batch_size > SIZE_MAX / primal_size) before constructing x, y or
buffer_non_transpose_batch; update any local variable types used for these sizes
to size_t and ensure the constructor calls for x, y, buffer_non_transpose_batch
use the size_t result.
int optimal_batch_size_handler(const optimization_problem_t<i_t, f_t>& op_problem,
                               int max_batch_size)
{
  if (max_batch_size == 1) return 1;

  // Try to quickly find what is the optimal batch size for the problem
  // We run the two most ran SpMMs for both A and A_T and compute "time / batch_size"
  // The one with the best ratio has the optimal batch size (since can solve most amount of work in
  // least time) To try to have something representative we run each SpMM 5 times and take the
  // average We do it for both A and A_T and take the sum since both will be run for each batch size

  // We start with batch size 128 (power two better for cuSparse) and try to improve by either
  // multitipling or dividing by 2 each time At max we take 5 steps of search

  constexpr int max_steps = 4;  // 4 because we already do one step for direction
  constexpr int initial_batch_size = 128;
  constexpr int benchmark_runs = 5;
  // Take the floor power of two
  // This ensures that we always start with a batch size that is a power of two or
  // initial_batch_size
  int current_batch_size =
    std::pow(2, std::floor(std::log2(std::min(initial_batch_size, max_batch_size))));
  int optimal_batch_size = current_batch_size;
Guard max_batch_size before log2/pow.
If max_batch_size <= 0, log2(0) is undefined and will propagate NaNs into current_batch_size. Add a precondition check to avoid UB.
🐛 Proposed fix
int optimal_batch_size_handler(const optimization_problem_t<i_t, f_t>& op_problem,
int max_batch_size)
{
+ cuopt_assert(max_batch_size >= 1, "max_batch_size must be >= 1");
  if (max_batch_size == 1) return 1;
🤖 Prompt for AI Agents
In
`@cpp/src/linear_programming/optimal_batch_size_handler/optimal_batch_size_handler.cu`
around lines 191 - 213, In optimal_batch_size_handler, guard against
non-positive max_batch_size before computing log2/pow: add a precondition like
"if (max_batch_size <= 0) return 1;" (or clamp max_batch_size = 1) near the top
of the function before computing current_batch_size so std::log2(std::min(...))
never receives 0 or negative values; this fixes potential NaNs/UB for
current_batch_size derived from initial_batch_size and max_batch_size.
cub::DeviceSegmentedReduce::Sum(
  nullptr,
  byte_needed,
  thrust::make_transform_iterator(reusable_device_scalar_1_.data(), power_two_func_t<f_t>{}),
  reusable_device_scalar_1_.data(),
  climber_strategies_.size(),
  primal_size_h_,
  stream_view_);
dot_product_bytes = std::max(dot_product_bytes, byte_needed);

cub::DeviceSegmentedReduce::Sum(
  nullptr,
  byte_needed,
  thrust::make_transform_iterator(reusable_device_scalar_1_.data(), power_two_func_t<f_t>{}),
  reusable_device_scalar_1_.data(),
  climber_strategies_.size(),
  dual_size_h_,
  stream_view_);
dot_product_bytes = std::max(dot_product_bytes, byte_needed);
🧩 Analysis chain
🏁 Script executed:
# First, let's view the actual code at the mentioned lines
cat -n cpp/src/linear_programming/restart_strategy/pdlp_restart_strategy.cu | sed -n '270,300p'Repository: NVIDIA/cuopt
Length of output: 1285
🏁 Script executed:
# Also check lines 1249-1257
cat -n cpp/src/linear_programming/restart_strategy/pdlp_restart_strategy.cu | sed -n '1240,1265p'Repository: NVIDIA/cuopt
Length of output: 1332
🏁 Script executed:
# Search for RAFT_CUDA_TRY usage in the codebase to understand the pattern
rg "RAFT_CUDA_TRY" -A 2 -B 2 | head -50Repository: NVIDIA/cuopt
Length of output: 4526
🏁 Script executed:
# Check if RAFT_CUDA_TRY is defined in headers
rg "define RAFT_CUDA_TRY\|#define.*CUDA_TRY" -A 3Repository: NVIDIA/cuopt
Length of output: 38
🏁 Script executed:
# Look for CUB error checking patterns in the file itself
rg "cub::" cpp/src/linear_programming/restart_strategy/pdlp_restart_strategy.cu -B 1 -A 1Repository: NVIDIA/cuopt
Length of output: 619
🏁 Script executed:
# Search for RAFT_CUDA_TRY definition
find . -name "*.h" -o -name "*.hpp" | xargs grep -l "RAFT_CUDA_TRY" | head -5Repository: NVIDIA/cuopt
Length of output: 204
🏁 Script executed:
# Look at the definition of RAFT_CUDA_TRY
rg "define.*RAFT_CUDA_TRY|RAFT_CUDA_TRY\s*=" -A 5 | head -30Repository: NVIDIA/cuopt
Length of output: 38
🏁 Script executed:
# Check how DeviceSegmentedReduce is typically used in other codebases or NVIDIA projects
rg "DeviceSegmentedReduce" -B 2 -A 2 | head -80Repository: NVIDIA/cuopt
Length of output: 7548
🏁 Script executed:
# Check imports/includes in the file to understand error handling mechanisms
head -50 cpp/src/linear_programming/restart_strategy/pdlp_restart_strategy.cuRepository: NVIDIA/cuopt
Length of output: 1482
🏁 Script executed:
# Check the actual definition of RAFT_CUDA_TRY by looking at cuda_utils.cuh
find . -name "cuda_utils.cuh" | xargs cat | grep -A 20 "RAFT_CUDA_TRY"Repository: NVIDIA/cuopt
Length of output: 38
🏁 Script executed:
# Check if CUB documentation or any wrapper indicates DeviceSegmentedReduce returns cudaError_t
rg "cudaError_t.*DeviceSegmentedReduce|DeviceSegmentedReduce.*cudaError_t"Repository: NVIDIA/cuopt
Length of output: 38
🏁 Script executed:
# Search for other similar patterns in the codebase - look for unwrapped CUB calls that might be considered acceptable
rg "cub::(Device|Block)" -B 1 -A 1 | grep -v "RAFT_CUDA_TRY\|RAFT_CUBLAS" | head -40Repository: NVIDIA/cuopt
Length of output: 3313
🏁 Script executed:
# Check the segmented_sum_handler to see if it has error checking (might be a wrapper class)
cat -n cpp/include/cuopt/linear_programming/utilities/segmented_sum_handler.cuh | head -100Repository: NVIDIA/cuopt
Length of output: 4119
Add CUDA error checks around CUB segmented reductions
cub::DeviceSegmentedReduce::Sum returns cudaError_t, but the return status is ignored in all three calls. Wrap these with RAFT_CUDA_TRY to ensure CUDA errors surface immediately rather than being silently dropped.
Proposed fix
- cub::DeviceSegmentedReduce::Sum(
+ RAFT_CUDA_TRY(cub::DeviceSegmentedReduce::Sum(
nullptr,
byte_needed,
thrust::make_transform_iterator(reusable_device_scalar_1_.data(), power_two_func_t<f_t>{}),
reusable_device_scalar_1_.data(),
climber_strategies_.size(),
primal_size_h_,
- stream_view_);
+ stream_view_));
- cub::DeviceSegmentedReduce::Sum(
+ RAFT_CUDA_TRY(cub::DeviceSegmentedReduce::Sum(
nullptr,
byte_needed,
thrust::make_transform_iterator(reusable_device_scalar_1_.data(), power_two_func_t<f_t>{}),
reusable_device_scalar_1_.data(),
climber_strategies_.size(),
dual_size_h_,
- stream_view_);
+ stream_view_));
- cub::DeviceSegmentedReduce::Sum(
+ RAFT_CUDA_TRY(cub::DeviceSegmentedReduce::Sum(
dot_product_storage.data(),
dot_product_bytes,
thrust::make_transform_iterator(tmp.data(), power_two_func_t<f_t>{}),
distance_moved.data(),
climber_strategies_.size(),
size_of_solutions_h,
- stream_view_);
+ stream_view_));🤖 Prompt for AI Agents
In `@cpp/src/linear_programming/restart_strategy/pdlp_restart_strategy.cu` around
lines 277 - 295, The calls to cub::DeviceSegmentedReduce::Sum in
pdlp_restart_strategy.cu (the three usages that pass nullptr, byte_needed,
thrust::make_transform_iterator(..., power_two_func_t<f_t>{}),
reusable_device_scalar_1_.data(), climber_strategies_.size(),
primal_size_h_/dual_size_h_, stream_view_) ignore the returned cudaError_t; wrap
each call with RAFT_CUDA_TRY so failures are checked and propagated (i.e.,
replace the bare cub::DeviceSegmentedReduce::Sum(...) invocations with
RAFT_CUDA_TRY(cub::DeviceSegmentedReduce::Sum(...)) for the segments computing
dot_product_bytes).
| template <typename i_t, typename f_t> | ||
| void matrix_swap(rmm::device_uvector<f_t>& matrix, | ||
| i_t vector_size, | ||
| const thrust::universal_host_pinned_vector<swap_pair_t<i_t>>& swap_pairs) | ||
| { | ||
| if (swap_pairs.empty()) { return; } | ||
|
|
||
| const i_t batch_size = matrix.size() / vector_size; | ||
| cuopt_assert(vector_size > 0, "Vector size must be greater than 0"); | ||
| cuopt_assert(batch_size > 0, "Batch size must be greater than 0"); | ||
|
|
||
| const size_t swap_count = swap_pairs.size(); | ||
| const size_t total_items = swap_count * static_cast<size_t>(vector_size); | ||
|
|
||
| auto counting = thrust::make_counting_iterator<size_t>(0); | ||
| auto left_index = thrust::make_transform_iterator( | ||
| counting, | ||
| matrix_swap_index_functor<i_t>{thrust::raw_pointer_cast(swap_pairs.data()), vector_size, true}); | ||
| auto right_index = thrust::make_transform_iterator( | ||
| counting, | ||
| matrix_swap_index_functor<i_t>{ | ||
| thrust::raw_pointer_cast(swap_pairs.data()), vector_size, false}); | ||
|
|
||
| auto left_perm = thrust::make_permutation_iterator(matrix.data(), left_index); | ||
| auto right_perm = thrust::make_permutation_iterator(matrix.data(), right_index); | ||
| auto in_zip = thrust::make_zip_iterator(left_perm, right_perm); | ||
| auto out_zip = thrust::make_zip_iterator(left_perm, right_perm); | ||
|
|
||
| cub::DeviceTransform::Transform( | ||
| in_zip, | ||
| out_zip, | ||
| total_items, | ||
| [] HD(thrust::tuple<f_t, f_t> values) -> thrust::tuple<f_t, f_t> { | ||
| return thrust::make_tuple(thrust::get<1>(values), thrust::get<0>(values)); | ||
| }, | ||
| matrix.stream()); | ||
| } |
🧩 Analysis chain
🌐 Web query:
CUB DeviceTransform::Transform in-place operations overlapping input output iterators
💡 Result:
cub::DeviceTransform::Transform does not document any guarantees about overlapping input/output ranges (unlike some other CUB algorithms that explicitly spell out when in-place is supported). The header shows it’s a device-wide parallel transform that takes random-access input iterator(s) and an output iterator, with no special “overlap” contract stated in the API. [1]
Because the operation is executed in parallel, you should assume the same rule as thrust::transform: only exact in-place is OK (i.e., output == input for a unary transform), but any partial overlap (e.g., output = input + k, or overlapping subranges) is not safe / effectively undefined due to read-after-write races. Thrust explicitly allows first == result but forbids overlapping ranges otherwise. [2]
Practical guidance
- Safe: d_out == d_in (true in-place, same base and same indexing).
- Unsafe: d_out points into [d_in, d_in + num_items) but is not equal to d_in (any shifted/partial overlap), or overlaps across multiple input ranges.
- If you need shifted/overlapping behavior, use a separate output buffer (or stage through a temp buffer) and then copy back.
Sources
- cub/device/device_transform.cuh program listing (shows signatures/dispatch, no overlap preconditions stated). [1]
- Thrust transform documentation precondition on overlap. [2]
🏁 Script executed:
# Find files related to swap_pairs and matrix_swap to understand construction and validation
rg -l "swap_pair" --type cpp --type cu --type cuhRepository: NVIDIA/cuopt
Length of output: 82
🏁 Script executed:
# Examine the swap_and_resize_helper.cuh file in detail
cat -n cpp/src/linear_programming/swap_and_resize_helper.cuhRepository: NVIDIA/cuopt
Length of output: 4089
🏁 Script executed:
rg "swap_pair" --type-list | head -20Repository: NVIDIA/cuopt
Length of output: 445
🏁 Script executed:
rg "swap_pairs" -A 3 -B 3Repository: NVIDIA/cuopt
Length of output: 39341
🏁 Script executed:
rg "matrix_swap" -A 5 -B 5Repository: NVIDIA/cuopt
Length of output: 9571
Use a temporary buffer or custom kernel for in-place matrix swap.
cub::DeviceTransform::Transform does not support partial overlap between input and output iterators. Here, in_zip and out_zip are created from the same iterator objects but access different memory locations via permutation (left_index and right_index compute different addresses). This violates CUB's precondition that only exact in-place (identical addresses with identical indexing) is safe.
Although calling code validates that pair.left < pair.right, this only ensures distinct indices within each pair. The documented API contract still forbids overlapping access patterns like this. Either:
- Allocate a temporary buffer, transform into it, then copy back (a sketch follows below), or
- Implement a custom CUDA kernel that explicitly handles in-place swaps with proper synchronization semantics.
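A rough sketch of the first option, assuming the helpers already defined in swap_and_resize_helper.cuh (swap_pair_t, matrix_swap_index_functor) are reused and that swap pairs are pairwise disjoint, as the calling code enforces. Names such as matrix_swap_staged are placeholders.

```cpp
#include <rmm/device_uvector.hpp>
#include <rmm/exec_policy.hpp>
#include <thrust/gather.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/transform_iterator.h>
#include <thrust/scatter.h>

template <typename i_t, typename f_t>
void matrix_swap_staged(rmm::device_uvector<f_t>& matrix,
                        i_t vector_size,
                        const thrust::universal_host_pinned_vector<swap_pair_t<i_t>>& swap_pairs)
{
  if (swap_pairs.empty()) { return; }

  const size_t total_items = swap_pairs.size() * static_cast<size_t>(vector_size);
  rmm::device_uvector<f_t> staged(2 * total_items, matrix.stream());

  auto counting  = thrust::make_counting_iterator<size_t>(0);
  auto left_idx  = thrust::make_transform_iterator(
    counting,
    matrix_swap_index_functor<i_t>{thrust::raw_pointer_cast(swap_pairs.data()), vector_size, true});
  auto right_idx = thrust::make_transform_iterator(
    counting,
    matrix_swap_index_functor<i_t>{thrust::raw_pointer_cast(swap_pairs.data()), vector_size, false});

  auto policy = rmm::exec_policy(matrix.stream());
  // Stage: first half holds the old right-side values, second half the old left-side values.
  thrust::gather(policy, right_idx, right_idx + total_items, matrix.data(), staged.data());
  thrust::gather(policy, left_idx, left_idx + total_items, matrix.data(), staged.data() + total_items);
  // Scatter back with the halves crossed: this performs the swap without any
  // overlapping read/write ranges inside a single parallel transform.
  thrust::scatter(policy, staged.data(), staged.data() + total_items, left_idx, matrix.data());
  thrust::scatter(
    policy, staged.data() + total_items, staged.data() + 2 * total_items, right_idx, matrix.data());
}
```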
🤖 Prompt for AI Agents
In `@cpp/src/linear_programming/swap_and_resize_helper.cuh` around lines 48 - 84,
matrix_swap currently calls cub::DeviceTransform with in_zip and out_zip built
from the same permutation iterators (left_index/right_index), which can produce
overlapping partial-addresses and violates CUB's no-overlap precondition; fix by
making the transform non-overlapping: allocate a temporary
rmm::device_uvector<f_t> temp(total_items) (or a temp buffer sized to
vector_size * swap_count), run DeviceTransform writing into temp (using
left_index/right_index-derived input and a linear output), then copy/scatter the
swapped values from temp back into matrix at the target permuted positions (or
implement a custom CUDA kernel that performs the swaps in-place safely); update
references in this function (matrix_swap, left_index, right_index, in_zip,
out_zip) accordingly and ensure matrix.stream() and proper synchronization are
preserved.
… reduce cost indexing, returning optimal climber results even if overall batch hit iteration/time limit (with test), corrected memory estimator with solution, use fast mod div in pdhg, correct row sense handling
Actionable comments posted: 11
🤖 Fix all issues with AI agents
In `@cpp/src/linear_programming/pdhg.cu`:
- Around line 767-775: The kernel launch of
refine_initial_primal_projection_kernel lacks CUDA error checking; after the
launch that uses grid_size, block_size and stream_view_, add a CUDA error check
(e.g., wrap a RAFT_CUDA_TRY(cudaPeekAtLastError()) or
CUDA_CHECK(cudaPeekAtLastError()) call) to catch launch failures and ensure
stream synchronization errors are surfaced; update the surrounding code where
refine_initial_primal_projection_kernel is invoked to call the chosen macro
immediately after the <<<...>>> invocation using the same stream_view_ context.
- Around line 827-843: The kernel launch of
refine_primal_projection_major_batch_kernel currently has no CUDA error
checking; after the launch add a CUDA error check (e.g., call cudaGetLastError()
and wrap it with your project’s CUDA_CHECK macro) and also check the stream
(synchronize or cudaStreamQuery on stream_view_ as your codebase prefers) to
surface async launch failures; reference the
refine_primal_projection_major_batch_kernel launch and the stream_view_ used for
the call when adding the checks.
- Line 193: The cuopt_assert call using the condition new_size < batch_size has
a mismatched message; update the assertion to make the message and condition
consistent by either changing the condition to new_size <= batch_size or
changing the message text to "New size must be less than batch size"; locate the
cuopt_assert call that references new_size and batch_size (cuopt_assert(new_size
< batch_size, "...")) and apply one of the two fixes so the condition and the
string match.
- Around line 931-944: After launching refine_primal_projection_batch_kernel,
add CUDA error checking and stream synchronization to catch launch failures:
immediately call the project's CUDA error-check macro or use cudaGetLastError()
and check the return, and also synchronize or check the associated stream
(stream_view_) to surface asynchronous errors; place this right after the kernel
launch that passes new_bounds_idx_, new_bounds_lower_, new_bounds_upper_,
current_saddle_point_state_.get_primal_solution(),
problem_ptr->objective_coefficients,
current_saddle_point_state_.get_current_AtY(), primal_step_size, and
reflected_primal_ to ensure any launch or async execution errors are reported.
- Line 309: The code currently reads the hard-coded global
deterministic_batch_pdlp (defined in pdlp_constants.hpp) when choosing
CUSPARSE_SPMM_CSR_ALG3 vs ALG2 in pdhg.cu; make this behavior
runtime-configurable by adding a boolean field (e.g., deterministic_batch_pdlp)
to the pdlp_solver_settings_t struct, give it the existing default (true), and
thread that setting through all places that construct or receive solver settings
so the PDHG code can read settings->deterministic_batch_pdlp instead of the
global. Update any factory/constructor/parsing logic that builds
pdlp_solver_settings_t to accept/configure the new flag, and replace references
in pdhg.cu (the SpMM algorithm selection sites) and any other files that
currently reference the global to use the settings field. Ensure backward
compatibility by initializing the new field to the previous default when
settings are not explicitly provided.
In `@cpp/src/linear_programming/solve.cu`:
- Around line 956-958: The code currently leaves only_upper=false, which uses
lower bounds that can create undetected infeasible instances; change only_upper
to true (so batch_size = fractional.size()) until the infeasibility-detection
work is completed, and add a clear inline comment at the only_upper declaration
referencing the known detection limitation and the TODO to re-enable lower
bounds once detect_infeasible_instances (or the partial implementation) is
finished; update any logic that depends on batch_size or lower-bound paths
(symbols: only_upper, batch_size, fractional) to behave correctly under
only_upper=true.
- Around line 807-812: Decide and implement behavior for PDLP concerning
pre-solving/scaling: either run on the original unpresolved/unscaled problem by
skipping any presolve/scale steps or add a clear boolean flag (e.g.,
run_on_unscaled or apply_presolve_scale) and ensure PDLP's entry point checks
that flag (refer to variables
primal_dual_init/primal_weight_init/use_optimal_batch_size to find the PDLP init
area) and document the choice in comments; make iteration_limit configurable by
replacing the constexpr int iteration_limit with a configurable parameter read
from a settings struct or function argument (keep the current 100000 as the
default) and ensure the solver returns a partial solution plus a status/flag
when the iteration limit is hit so callers can detect and handle incomplete
results.
- Around line 742-800: The function batch_pdlp_memory_estimator contains an
orphaned trailing comment and a typo in a comment: inside
batch_pdlp_memory_estimator(...) remove the stray comment "Data from saddle
point state" immediately before the return statement, and correct the
misspelling "convergeance" to "convergence" in the comment that currently reads
"Data for the convergeance information"; no code logic changes are needed—only
update those two comment lines within batch_pdlp_memory_estimator.
In `@cpp/src/linear_programming/termination_strategy/convergence_information.cu`:
- Around line 262-269: The assertion in convergence_information_t<i_t,
f_t>::resize_context incorrectly forbids new_size == batch_size while the
message says "less than or equal"; modify the check to allow no-op resizes by
changing the condition to permit equality (i.e., assert new_size <= batch_size)
or, if you prefer to keep strictness, update the assertion message to remove "or
equal"; update the assertion referencing primal_objective_.size() and the
cuopt_assert that currently checks new_size < batch_size so it matches the
intended behavior.
In `@cpp/src/linear_programming/utils.cuh`:
- Around line 241-248: The call to cub::DeviceTransform::Transform (invoked with
problem_wrap_container(op_problem.constraint_lower_bounds/upper_bounds),
combined_bounds, combine_finite_abs_bounds<f_t>(),
op_problem.handle_ptr->get_stream()) can fail and its return status must be
checked; update this call to capture the returned cub::Result (or cudaError_t as
appropriate for your CUB wrapper) and verify it with the project's CUDA
error-checking macro/function (e.g., CUDA_CHECK or equivalent), logging or
propagating the error if the call indicates failure so that failures in
DeviceTransform are detected and handled.
In `@cpp/tests/linear_programming/pdlp_test.cu`:
- Around line 326-357: The test currently sets solver_settings.hyper_params
flags after constructing
cuopt::linear_programming::detail::pdlp_solver_t<int,double> (symbol:
pdlp_solver_t), so the solver instance never observes those toggles; move each
modification of solver_settings.hyper_params
(update_step_size_on_initial_solution, update_primal_weight_on_initial_solution)
to occur before constructing the corresponding pdlp_solver_t instance and
creating the timer_t, then construct the solver, run_solver, synchronize, assert
EXPECT_NEAR, and finally reset the hyper_params flags afterward as currently
done.
🧹 Nitpick comments (10)
cpp/src/linear_programming/termination_strategy/convergence_information.cu (1)
336-349: Clean up duplicate assert and fix the size-mismatch message.Line 337’s message references
primal_size_h_while the check usesdual_size_h_, and thel2_primal_residual_assert is duplicated. This is low-risk but improves correctness and readability.🧹 Suggested cleanup
- cuopt_assert( - primal_residual_.size() == dual_size_h_ * climber_strategies_.size(), - "primal_residual_ size must be equal to primal_size_h_ * climber_strategies_.size()"); + cuopt_assert( + primal_residual_.size() == dual_size_h_ * climber_strategies_.size(), + "primal_residual_ size must be equal to dual_size_h_ * climber_strategies_.size()"); cuopt_assert(primal_iterate.size() == primal_size_h_ * climber_strategies_.size(), "primal_iterate size must be equal to primal_size_h_ * climber_strategies_.size()"); cuopt_assert(dual_residual_.size() == primal_size_h_ * climber_strategies_.size(), "dual_residual_ size must be equal to primal_size_h_ * climber_strategies_.size()"); cuopt_assert(dual_iterate.size() == dual_size_h_ * climber_strategies_.size(), "dual_iterate size must be equal to dual_size_h_ * climber_strategies_.size()"); cuopt_assert(l2_primal_residual_.size() == climber_strategies_.size(), "l2_primal_residual_ size must be equal to climber_strategies_.size()"); - cuopt_assert(l2_primal_residual_.size() == climber_strategies_.size(), - "l2_primal_residual_ size must be equal to climber_strategies_.size()"); cuopt_assert(l2_dual_residual_.size() == climber_strategies_.size(), "l2_dual_residual_ size must be equal to climber_strategies_.size()");cpp/src/linear_programming/solve.cu (2)
847-848: Remove unused type alias.The type alias
f_t2is defined but never used in the function.🧹 Proposed cleanup
- using f_t2 = typename type_2<f_t>::type; - // If need warm start, solve the LP alone
834-839: Consider binary search for optimal batch size.The current linear search for the right batch size (decrementing by 1) could be slow when memory is very constrained. Consider using binary search for better performance.
⚡ Proposed optimization
- while (memory_max_batch_size > 1) { - const size_t memory_estimate = - batch_pdlp_memory_estimator(problem, memory_max_batch_size, max_batch_size); - if (memory_estimate <= free_mem) { break; } - memory_max_batch_size--; - } + int left = 1, right = memory_max_batch_size; + while (left < right) { + int mid = left + (right - left + 1) / 2; + const size_t memory_estimate = + batch_pdlp_memory_estimator(problem, mid, max_batch_size); + if (memory_estimate <= free_mem) { + left = mid; + } else { + right = mid - 1; + } + } + memory_max_batch_size = left;cpp/src/linear_programming/pdhg.cu (2)
447-668: Significant code duplication in projection functors.The batch and non-batch versions of projection functors (
primal_reflected_major_projection,dual_reflected_major_projection, etc.) have nearly identical logic, differing only in whether step size is a pointer or parameter. Similarly, the bulk operation functors duplicate the projection logic.Consider templating these functors or using a unified implementation to reduce duplication and maintenance burden.
Based on learnings, as per coding guidelines: refactor code duplication in solver components (3+ occurrences) into shared utilities.
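One possible shape for such a unification, illustrative only: the real functors carry more state and the names below are made up. The idea is to hide "step size is a pointer vs. a plain value" behind one small source type so the projection logic is written once.

```cpp
template <typename f_t, bool batch_mode>
struct step_size_source_t {
  const f_t* per_climber;  // read in batch mode (one step size per climber)
  f_t scalar;              // read in single-problem mode

  __host__ __device__ f_t operator()(int climber_idx) const
  {
    if constexpr (batch_mode) {
      return per_climber[climber_idx];
    } else {
      return scalar;
    }
  }
};

// The projection is then written once and instantiated for both paths.
template <typename f_t, bool batch_mode>
struct primal_projection_sketch_t {
  step_size_source_t<f_t, batch_mode> step_size;

  __host__ __device__ f_t operator()(f_t x, f_t gradient, int climber_idx) const
  {
    return x - step_size(climber_idx) * gradient;  // bound clamping omitted for brevity
  }
};
```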
556-577: Verify memory coalescing in batch bulk operations.The bulk operation at Lines 560-577 uses the indexing pattern:
batch_idx = idx % batch_size var_idx = idx / batch_sizeThis results in accesses to
primal_solution[var_idx * batch_size + batch_idx],objective_coefficients[var_idx], andcurrent_AtY[var_idx * batch_size + batch_idx].For optimal coalescing, consecutive threads should access consecutive memory addresses. With the current pattern, threads in a warp access different batch indices of the same variable, which may lead to strided access patterns depending on batch size. Consider whether a different data layout (e.g., batch-major storage) would improve memory coalescing.
As per coding guidelines: detect inefficient GPU kernel memory access patterns; optimize for coalesced memory access.
cpp/src/linear_programming/utils.cuh (3)
91-98: Inconsistent assertion with fallback behavior.The assertion fires when
b == 0, but execution continues and returnsaanyway. In debug builds this creates confusing behavior (assertion fails but code proceeds), and in release builds the assertion may be stripped, silently masking potential bugs.Consider either:
- Making the assertion the only guard (remove the conditional return), or
- Removing the assertion and documenting that returning
aon zero divisor is intentional behavior.Option 1: Assertion-only (fail fast)
template <typename f_t> struct batch_safe_div { HDI f_t operator()(f_t a, f_t b) { cuopt_assert(b != f_t(0), "Division by zero"); - return b != f_t(0) ? a / b : a; + return a / b; } };Option 2: Document intentional fallback (remove assertion)
template <typename f_t> struct batch_safe_div { + // Returns a / b, or a unchanged if b is zero (safe fallback for batch operations) HDI f_t operator()(f_t a, f_t b) { - cuopt_assert(b != f_t(0), "Division by zero"); return b != f_t(0) ? a / b : a; } };
194-205: Integer type limits problem scalability.Using
intforproblem_size_and the index parameter limits batch sizes to ~2.1 billion elements. For large-scale LP problems with millions of variables across many climbers, this could overflow. The TODO comment (line 203) acknowledges this.Consider templating on the index type to match the solver's
i_tparameter, ensuring consistency across problem transformations as per coding guidelines.♻️ Suggested refactor
-template <typename f_t> +template <typename i_t, typename f_t> struct problem_wrapped_iterator { - problem_wrapped_iterator(const f_t* problem_input, int problem_size) + problem_wrapped_iterator(const f_t* problem_input, i_t problem_size) : problem_input_(problem_input), problem_size_(problem_size) { } - HDI f_t operator()(int id) { return problem_input_[id % problem_size_]; } + HDI f_t operator()(i_t id) { return problem_input_[id % problem_size_]; } const f_t* problem_input_; - // TODO use i_t - int problem_size_; + i_t problem_size_; };
207-213: Counting iterator type should match index type.
thrust::make_counting_iterator(0)produces anintiterator. Ifproblem_wrapped_iteratoris updated to usei_t, this function should also be templated on the index type and usethrust::make_counting_iterator(i_t(0))for consistency.♻️ Suggested refactor (if adopting i_t)
-template <typename f_t> +template <typename i_t, typename f_t> static inline auto problem_wrap_container(const rmm::device_uvector<f_t>& in) { - return thrust::make_transform_iterator(thrust::make_counting_iterator(0), - problem_wrapped_iterator<f_t>(in.data(), in.size())); + return thrust::make_transform_iterator(thrust::make_counting_iterator(i_t(0)), + problem_wrapped_iterator<i_t, f_t>(in.data(), in.size())); }cpp/tests/linear_programming/pdlp_test.cu (2)
602-604: Remove the duplicateset_initial_dual_solutioncall.
It’s redundant and can be confusing if the setter manages ownership/state.🧹 Proposed cleanup
- solver3.set_initial_dual_solution(d_initial_dual); solver3.set_initial_dual_solution(d_initial_dual);
1412-1573: Consider gating the 1000-batch stress test if CI time is tight.
If runtime becomes an issue, marking it as slow or parameterizing the batch size could help.
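A cheap way to gate it, assuming an opt-in environment variable (CUOPT_RUN_STRESS_TESTS is hypothetical, not an existing variable, and the test name below is a placeholder):

```cpp
#include <gtest/gtest.h>

#include <cstdlib>

// Hypothetical gating pattern for the 1000-batch stress test.
TEST(BatchPdlpStress, ThousandClimbers)
{
  if (std::getenv("CUOPT_RUN_STRESS_TESTS") == nullptr) {
    GTEST_SKIP() << "Set CUOPT_RUN_STRESS_TESTS=1 to run the 1000-batch stress test";
  }
  // ... existing body of the batch PDLP stress test ...
}
```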
| cusparse_view_.batch_dual_solutions, | ||
| reusable_device_scalar_value_0_.data(), | ||
| cusparse_view_.batch_current_AtYs, | ||
| (deterministic_batch_pdlp) ? CUSPARSE_SPMM_CSR_ALG3 : CUSPARSE_SPMM_CSR_ALG2, |
🧩 Analysis chain
🏁 Script executed:
rg -nC5 'deterministic_batch_pdlp' --type=cpp --type=cuRepository: NVIDIA/cuopt
Length of output: 82
🏁 Script executed:
rg -nC5 'deterministic_batch_pdlp'Repository: NVIDIA/cuopt
Length of output: 15847
🏁 Script executed:
# Check if there's any solver settings or parameters class that might expose deterministic_batch_pdlp
rg -n 'class.*[Ss]etting|class.*[Cc]onfig|class.*[Pp]arameter' cpp/src/linear_programming | head -20Repository: NVIDIA/cuopt
Length of output: 258
🏁 Script executed:
# Find the solver_settings header file
find . -name "*solver_settings*" -type fRepository: NVIDIA/cuopt
Length of output: 779
🏁 Script executed:
# Check the PDLP solver settings header for deterministic_batch_pdlp
cat -n cpp/include/cuopt/linear_programming/pdlp/solver_settings.hpp | head -100Repository: NVIDIA/cuopt
Length of output: 4124
🏁 Script executed:
# Continue reading the PDLP solver settings header - search for deterministic or batch
rg -n 'deterministic|batch' cpp/include/cuopt/linear_programming/pdlp/solver_settings.hppRepository: NVIDIA/cuopt
Length of output: 275
🏁 Script executed:
# Check the context around the batch-related TODO comments and cudss_deterministic
cat -n cpp/include/cuopt/linear_programming/pdlp/solver_settings.hpp | sed -n '115,260p'Repository: NVIDIA/cuopt
Length of output: 7260
🏁 Script executed:
# Check if deterministic_batch_pdlp is ever modified at runtime
rg -n 'deterministic_batch_pdlp\s*=' --type=cpp --type=cuRepository: NVIDIA/cuopt
Length of output: 82
🏁 Script executed:
# Check if deterministic_batch_pdlp is ever modified at runtime
rg -n 'deterministic_batch_pdlp\s*='Repository: NVIDIA/cuopt
Length of output: 149
Make deterministic_batch_pdlp runtime-configurable through solver settings.
The flag deterministic_batch_pdlp is currently a hard-coded static global (initialized to true in pdlp_constants.hpp line 37) that controls algorithm selection for batch SpMM operations across multiple locations (lines 309, 341 in pdhg.cu and elsewhere). This flag must be exposed as a configurable parameter in pdlp_solver_settings_t to allow users to trade off between determinism (ALG3) and performance (ALG2) at runtime, consistent with the PR's goal of making hyperparameters configurable.
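A minimal sketch of the direction suggested here; the real pdlp_solver_settings_t has many more members, so the struct below is only a placeholder showing the new field and how the SpMM call sites would read it:

```cpp
#include <cusparse.h>

// Placeholder settings type: only the proposed field is shown.
struct batch_pdlp_settings_sketch_t {
  bool deterministic_batch_pdlp = true;  // keep today's default (deterministic ALG3)
};

// At the SpMM call sites in pdhg.cu, the global from pdlp_constants.hpp would be
// replaced by a read of the runtime setting:
inline cusparseSpMMAlg_t pick_spmm_alg(const batch_pdlp_settings_sketch_t& settings)
{
  return settings.deterministic_batch_pdlp ? CUSPARSE_SPMM_CSR_ALG3 : CUSPARSE_SPMM_CSR_ALG2;
}
```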
🤖 Prompt for AI Agents
In `@cpp/src/linear_programming/pdhg.cu` at line 309, The code currently reads the
hard-coded global deterministic_batch_pdlp (defined in pdlp_constants.hpp) when
choosing CUSPARSE_SPMM_CSR_ALG3 vs ALG2 in pdhg.cu; make this behavior
runtime-configurable by adding a boolean field (e.g., deterministic_batch_pdlp)
to the pdlp_solver_settings_t struct, give it the existing default (true), and
thread that setting through all places that construct or receive solver settings
so the PDHG code can read settings->deterministic_batch_pdlp instead of the
global. Update any factory/constructor/parsing logic that builds
pdlp_solver_settings_t to accept/configure the new flag, and replace references
in pdhg.cu (the SpMM algorithm selection sites) and any other files that
currently reference the global to use the settings field. Ensure backward
compatibility by initializing the new field to the previous default when
settings are not explicitly provided.
cpp/src/linear_programming/pdhg.cu
Outdated
| refine_initial_primal_projection_kernel<<<grid_size, block_size, 0, stream_view_>>>( | ||
| (i_t)climber_strategies_.size(), | ||
| problem_ptr->n_variables, | ||
| raft::device_span<const i_t>(new_bounds_idx_.data(), new_bounds_idx_.size()), | ||
| raft::device_span<const f_t>(new_bounds_lower_.data(), new_bounds_lower_.size()), | ||
| raft::device_span<const f_t>(new_bounds_upper_.data(), new_bounds_upper_.size()), | ||
| raft::device_span<f_t>(current_saddle_point_state_.get_primal_solution().data(), | ||
| current_saddle_point_state_.get_primal_solution().size())); | ||
| } |
Missing CUDA error checking after kernel launch.
The kernel launch for refine_initial_primal_projection_kernel lacks error checking. Per coding guidelines, every CUDA kernel launch must have error checking with CUDA_CHECK or RAFT_CUDA_TRY(cudaPeekAtLastError()).
🐛 Proposed fix
refine_initial_primal_projection_kernel<<<grid_size, block_size, 0, stream_view_>>>(
(i_t)climber_strategies_.size(),
problem_ptr->n_variables,
raft::device_span<const i_t>(new_bounds_idx_.data(), new_bounds_idx_.size()),
raft::device_span<const f_t>(new_bounds_lower_.data(), new_bounds_lower_.size()),
raft::device_span<const f_t>(new_bounds_upper_.data(), new_bounds_upper_.size()),
raft::device_span<f_t>(current_saddle_point_state_.get_primal_solution().data(),
current_saddle_point_state_.get_primal_solution().size()));
+ RAFT_CUDA_TRY(cudaPeekAtLastError());As per coding guidelines.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| refine_initial_primal_projection_kernel<<<grid_size, block_size, 0, stream_view_>>>( | |
| (i_t)climber_strategies_.size(), | |
| problem_ptr->n_variables, | |
| raft::device_span<const i_t>(new_bounds_idx_.data(), new_bounds_idx_.size()), | |
| raft::device_span<const f_t>(new_bounds_lower_.data(), new_bounds_lower_.size()), | |
| raft::device_span<const f_t>(new_bounds_upper_.data(), new_bounds_upper_.size()), | |
| raft::device_span<f_t>(current_saddle_point_state_.get_primal_solution().data(), | |
| current_saddle_point_state_.get_primal_solution().size())); | |
| } | |
| refine_initial_primal_projection_kernel<<<grid_size, block_size, 0, stream_view_>>>( | |
| (i_t)climber_strategies_.size(), | |
| problem_ptr->n_variables, | |
| raft::device_span<const i_t>(new_bounds_idx_.data(), new_bounds_idx_.size()), | |
| raft::device_span<const f_t>(new_bounds_lower_.data(), new_bounds_lower_.size()), | |
| raft::device_span<const f_t>(new_bounds_upper_.data(), new_bounds_upper_.size()), | |
| raft::device_span<f_t>(current_saddle_point_state_.get_primal_solution().data(), | |
| current_saddle_point_state_.get_primal_solution().size())); | |
| RAFT_CUDA_TRY(cudaPeekAtLastError()); | |
| } |
🤖 Prompt for AI Agents
In `@cpp/src/linear_programming/pdhg.cu` around lines 767 - 775, The kernel launch
of refine_initial_primal_projection_kernel lacks CUDA error checking; after the
launch that uses grid_size, block_size and stream_view_, add a CUDA error check
(e.g., wrap a RAFT_CUDA_TRY(cudaPeekAtLastError()) or
CUDA_CHECK(cudaPeekAtLastError()) call) to catch launch failures and ensure
stream synchronization errors are surfaced; update the surrounding code where
refine_initial_primal_projection_kernel is invoked to call the chosen macro
immediately after the <<<...>>> invocation using the same stream_view_ context.
cpp/src/linear_programming/pdhg.cu
Outdated
| refine_primal_projection_batch_kernel<<<grid_size, block_size, 0, stream_view_>>>( | ||
| (i_t)climber_strategies_.size(), | ||
| problem_ptr->n_variables, | ||
| raft::device_span<const i_t>(new_bounds_idx_.data(), new_bounds_idx_.size()), | ||
| raft::device_span<const f_t>(new_bounds_lower_.data(), new_bounds_lower_.size()), | ||
| raft::device_span<const f_t>(new_bounds_upper_.data(), new_bounds_upper_.size()), | ||
| raft::device_span<const f_t>(current_saddle_point_state_.get_primal_solution().data(), | ||
| current_saddle_point_state_.get_primal_solution().size()), | ||
| raft::device_span<const f_t>(problem_ptr->objective_coefficients.data(), | ||
| problem_ptr->objective_coefficients.size()), | ||
| raft::device_span<const f_t>(current_saddle_point_state_.get_current_AtY().data(), | ||
| current_saddle_point_state_.get_current_AtY().size()), | ||
| raft::device_span<const f_t>(primal_step_size.data(), primal_step_size.size()), | ||
| raft::device_span<f_t>(reflected_primal_.data(), reflected_primal_.size())); |
Missing CUDA error checking after kernel launch.
The kernel launch for refine_primal_projection_batch_kernel lacks error checking. Per coding guidelines, every CUDA kernel launch must have error checking.
🐛 Proposed fix
refine_primal_projection_batch_kernel<<<grid_size, block_size, 0, stream_view_>>>(
(i_t)climber_strategies_.size(),
problem_ptr->n_variables,
raft::device_span<const i_t>(new_bounds_idx_.data(), new_bounds_idx_.size()),
raft::device_span<const f_t>(new_bounds_lower_.data(), new_bounds_lower_.size()),
raft::device_span<const f_t>(new_bounds_upper_.data(), new_bounds_upper_.size()),
raft::device_span<const f_t>(current_saddle_point_state_.get_primal_solution().data(),
current_saddle_point_state_.get_primal_solution().size()),
raft::device_span<const f_t>(problem_ptr->objective_coefficients.data(),
problem_ptr->objective_coefficients.size()),
raft::device_span<const f_t>(current_saddle_point_state_.get_current_AtY().data(),
current_saddle_point_state_.get_current_AtY().size()),
raft::device_span<const f_t>(primal_step_size.data(), primal_step_size.size()),
raft::device_span<f_t>(reflected_primal_.data(), reflected_primal_.size()));
+ RAFT_CUDA_TRY(cudaPeekAtLastError());
}As per coding guidelines.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| refine_primal_projection_batch_kernel<<<grid_size, block_size, 0, stream_view_>>>( | |
| (i_t)climber_strategies_.size(), | |
| problem_ptr->n_variables, | |
| raft::device_span<const i_t>(new_bounds_idx_.data(), new_bounds_idx_.size()), | |
| raft::device_span<const f_t>(new_bounds_lower_.data(), new_bounds_lower_.size()), | |
| raft::device_span<const f_t>(new_bounds_upper_.data(), new_bounds_upper_.size()), | |
| raft::device_span<const f_t>(current_saddle_point_state_.get_primal_solution().data(), | |
| current_saddle_point_state_.get_primal_solution().size()), | |
| raft::device_span<const f_t>(problem_ptr->objective_coefficients.data(), | |
| problem_ptr->objective_coefficients.size()), | |
| raft::device_span<const f_t>(current_saddle_point_state_.get_current_AtY().data(), | |
| current_saddle_point_state_.get_current_AtY().size()), | |
| raft::device_span<const f_t>(primal_step_size.data(), primal_step_size.size()), | |
| raft::device_span<f_t>(reflected_primal_.data(), reflected_primal_.size())); | |
| refine_primal_projection_batch_kernel<<<grid_size, block_size, 0, stream_view_>>>( | |
| (i_t)climber_strategies_.size(), | |
| problem_ptr->n_variables, | |
| raft::device_span<const i_t>(new_bounds_idx_.data(), new_bounds_idx_.size()), | |
| raft::device_span<const f_t>(new_bounds_lower_.data(), new_bounds_lower_.size()), | |
| raft::device_span<const f_t>(new_bounds_upper_.data(), new_bounds_upper_.size()), | |
| raft::device_span<const f_t>(current_saddle_point_state_.get_primal_solution().data(), | |
| current_saddle_point_state_.get_primal_solution().size()), | |
| raft::device_span<const f_t>(problem_ptr->objective_coefficients.data(), | |
| problem_ptr->objective_coefficients.size()), | |
| raft::device_span<const f_t>(current_saddle_point_state_.get_current_AtY().data(), | |
| current_saddle_point_state_.get_current_AtY().size()), | |
| raft::device_span<const f_t>(primal_step_size.data(), primal_step_size.size()), | |
| raft::device_span<f_t>(reflected_primal_.data(), reflected_primal_.size())); | |
| RAFT_CUDA_TRY(cudaPeekAtLastError()); |
🤖 Prompt for AI Agents
In `@cpp/src/linear_programming/pdhg.cu` around lines 931 - 944, After launching
refine_primal_projection_batch_kernel, add CUDA error checking and stream
synchronization to catch launch failures: immediately call the project's CUDA
error-check macro or use cudaGetLastError() and check the return, and also
synchronize or check the associated stream (stream_view_) to surface
asynchronous errors; place this right after the kernel launch that passes
new_bounds_idx_, new_bounds_lower_, new_bounds_upper_,
current_saddle_point_state_.get_primal_solution(),
problem_ptr->objective_coefficients,
current_saddle_point_state_.get_current_AtY(), primal_step_size, and
reflected_primal_ to ensure any launch or async execution errors are reported.
| bool primal_dual_init = true; | ||
| bool primal_weight_init = true; | ||
| bool use_optimal_batch_size = false; | ||
| constexpr int iteration_limit = 100000; | ||
| // Shouldn't we work on the unpresolved and/or unscaled problem for PDLP? | ||
| // Shouldn't we put an iteration limit? If yes what should we do with the partial solutions? |
🛠️ Refactor suggestion | 🟠 Major
Resolve TODO and make iteration_limit configurable.
Two issues:
-
Line 811: The TODO comment raises an important question about whether batch PDLP should operate on the unpresolved and/or unscaled problem. This should be resolved before merging.
-
Line 810: The iteration_limit = 100000 is hardcoded. Consider making this configurable via settings, or document why this specific value was chosen.
🤖 Prompt for AI Agents
In `@cpp/src/linear_programming/solve.cu` around lines 807 - 812, Decide and
implement behavior for PDLP concerning pre-solving/scaling: either run on the
original unpresolved/unscaled problem by skipping any presolve/scale steps or
add a clear boolean flag (e.g., run_on_unscaled or apply_presolve_scale) and
ensure PDLP's entry point checks that flag (refer to variables
primal_dual_init/primal_weight_init/use_optimal_batch_size to find the PDLP init
area) and document the choice in comments; make iteration_limit configurable by
replacing the constexpr int iteration_limit with a configurable parameter read
from a settings struct or function argument (keep the current 100000 as the
default) and ensure the solver returns a partial solution plus a status/flag
when the iteration limit is hit so callers can detect and handle incomplete
results.
| constexpr bool only_upper = false; | ||
| int batch_size = only_upper ? fractional.size() : fractional.size() * 2; | ||
|
|
Document or address the infeasibility detection limitation.
The comment indicates a known issue: "Lower bounds can sometimes generate infeasible instances that we struggle to detect." However, only_upper is set to false, meaning lower bounds are still being used. This could lead to incorrect results if infeasibility goes undetected.
Please either:
- Address the infeasibility detection issue (as mentioned in PR objectives, this is partially implemented)
- Document this limitation clearly in user-facing documentation
- Enable only_upper = true until the detection is reliable
🤖 Prompt for AI Agents
In `@cpp/src/linear_programming/solve.cu` around lines 956 - 958, The code
currently leaves only_upper=false, which uses lower bounds that can create
undetected infeasible instances; change only_upper to true (so batch_size =
fractional.size()) until the infeasibility-detection work is completed, and add
a clear inline comment at the only_upper declaration referencing the known
detection limitation and the TODO to re-enable lower bounds once
detect_infeasible_instances (or the partial implementation) is finished; update
any logic that depends on batch_size or lower-bound paths (symbols: only_upper,
batch_size, fractional) to behave correctly under only_upper=true.
| template <typename i_t, typename f_t> | ||
| void convergence_information_t<i_t, f_t>::resize_context(i_t new_size) | ||
| { | ||
| [[maybe_unused]] const auto batch_size = static_cast<i_t>(primal_objective_.size()); | ||
| cuopt_assert(batch_size > 0, "Batch size must be greater than 0"); | ||
| cuopt_assert(new_size > 0, "New size must be greater than 0"); | ||
| cuopt_assert(new_size < batch_size, "New size must be less than or equal to batch size"); | ||
|
|
Make resize_context accept a no-op resize (or fix the assertion message).
Line 268 currently rejects new_size == batch_size, but the error message says “less than or equal.” Either allow equality (no-op) or adjust the message to avoid confusion.
🔧 Suggested fix (allow no-op resize)
- cuopt_assert(new_size < batch_size, "New size must be less than or equal to batch size");
+ cuopt_assert(new_size <= batch_size, "New size must be less than or equal to batch size");
+ if (new_size == batch_size) { return; }📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| template <typename i_t, typename f_t> | |
| void convergence_information_t<i_t, f_t>::resize_context(i_t new_size) | |
| { | |
| [[maybe_unused]] const auto batch_size = static_cast<i_t>(primal_objective_.size()); | |
| cuopt_assert(batch_size > 0, "Batch size must be greater than 0"); | |
| cuopt_assert(new_size > 0, "New size must be greater than 0"); | |
| cuopt_assert(new_size < batch_size, "New size must be less than or equal to batch size"); | |
| template <typename i_t, typename f_t> | |
| void convergence_information_t<i_t, f_t>::resize_context(i_t new_size) | |
| { | |
| [[maybe_unused]] const auto batch_size = static_cast<i_t>(primal_objective_.size()); | |
| cuopt_assert(batch_size > 0, "Batch size must be greater than 0"); | |
| cuopt_assert(new_size > 0, "New size must be greater than 0"); | |
| cuopt_assert(new_size <= batch_size, "New size must be less than or equal to batch size"); | |
| if (new_size == batch_size) { return; } | |
🤖 Prompt for AI Agents
In `@cpp/src/linear_programming/termination_strategy/convergence_information.cu`
around lines 262 - 269, The assertion in convergence_information_t<i_t,
f_t>::resize_context incorrectly forbids new_size == batch_size while the
message says "less than or equal"; modify the check to allow no-op resizes by
changing the condition to permit equality (i.e., assert new_size <= batch_size)
or, if you prefer to keep strictness, update the assertion message to remove "or
equal"; update the assertion referencing primal_objective_.size() and the
cuopt_assert that currently checks new_size < batch_size so it matches the
intended behavior.
| cub::DeviceTransform::Transform( | ||
| cuda::std::make_tuple(problem_wrap_container(op_problem.constraint_lower_bounds), | ||
| problem_wrap_container(op_problem.constraint_upper_bounds)), | ||
| combined_bounds.data(), | ||
| combined_bounds.size(), | ||
| combine_finite_abs_bounds<f_t>(), | ||
| op_problem.handle_ptr->get_stream()); | ||
| } |
Missing CUDA error check on DeviceTransform.
cub::DeviceTransform::Transform can fail, but its return status isn't checked. Per coding guidelines, CUDA operations require error verification.
🔧 Suggested fix
- cub::DeviceTransform::Transform(
+ RAFT_CUDA_TRY(cub::DeviceTransform::Transform(
cuda::std::make_tuple(problem_wrap_container(op_problem.constraint_lower_bounds),
problem_wrap_container(op_problem.constraint_upper_bounds)),
combined_bounds.data(),
combined_bounds.size(),
combine_finite_abs_bounds<f_t>(),
- op_problem.handle_ptr->get_stream());
+ op_problem.handle_ptr->get_stream()));📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| cub::DeviceTransform::Transform( | |
| cuda::std::make_tuple(problem_wrap_container(op_problem.constraint_lower_bounds), | |
| problem_wrap_container(op_problem.constraint_upper_bounds)), | |
| combined_bounds.data(), | |
| combined_bounds.size(), | |
| combine_finite_abs_bounds<f_t>(), | |
| op_problem.handle_ptr->get_stream()); | |
| } | |
| RAFT_CUDA_TRY(cub::DeviceTransform::Transform( | |
| cuda::std::make_tuple(problem_wrap_container(op_problem.constraint_lower_bounds), | |
| problem_wrap_container(op_problem.constraint_upper_bounds)), | |
| combined_bounds.data(), | |
| combined_bounds.size(), | |
| combine_finite_abs_bounds<f_t>(), | |
| op_problem.handle_ptr->get_stream())); |
🤖 Prompt for AI Agents
In `@cpp/src/linear_programming/utils.cuh` around lines 241 - 248, The call to
cub::DeviceTransform::Transform (invoked with
problem_wrap_container(op_problem.constraint_lower_bounds/upper_bounds),
combined_bounds, combine_finite_abs_bounds<f_t>(),
op_problem.handle_ptr->get_stream()) can fail and its return status must be
checked; update this call to capture the returned cub::Result (or cudaError_t as
appropriate for your CUB wrapper) and verify it with the project's CUDA
error-checking macro/function (e.g., CUDA_CHECK or equivalent), logging or
propagating the error if the call indicates failure so that failures in
DeviceTransform are detected and handled.
| { | ||
| cuopt::linear_programming::detail::pdlp_solver_t<int, double> solver(problem, solver_settings); | ||
| auto pdlp_timer = timer_t(solver_settings.time_limit); | ||
| cuopt::linear_programming::pdlp_hyper_params::update_step_size_on_initial_solution = true; | ||
| solver_settings.hyper_params.update_step_size_on_initial_solution = true; | ||
| solver.run_solver(pdlp_timer); | ||
| RAFT_CUDA_TRY(cudaStreamSynchronize(handle_.get_stream())); | ||
| EXPECT_NEAR(initial_step_size_afiro, solver.get_step_size_h(), factor_tolerance); | ||
| EXPECT_NEAR(initial_primal_weight_afiro, solver.get_primal_weight_h(), factor_tolerance); | ||
| cuopt::linear_programming::pdlp_hyper_params::update_step_size_on_initial_solution = false; | ||
| EXPECT_NEAR(initial_step_size_afiro, solver.get_step_size_h(0), factor_tolerance); | ||
| EXPECT_NEAR(initial_primal_weight_afiro, solver.get_primal_weight_h(0), factor_tolerance); | ||
| solver_settings.hyper_params.update_step_size_on_initial_solution = false; | ||
| } | ||
| { | ||
| cuopt::linear_programming::detail::pdlp_solver_t<int, double> solver(problem, solver_settings); | ||
| auto pdlp_timer = timer_t(solver_settings.time_limit); | ||
| cuopt::linear_programming::pdlp_hyper_params::update_primal_weight_on_initial_solution = true; | ||
| solver_settings.hyper_params.update_primal_weight_on_initial_solution = true; | ||
| solver.run_solver(pdlp_timer); | ||
| RAFT_CUDA_TRY(cudaStreamSynchronize(handle_.get_stream())); | ||
| EXPECT_NEAR(initial_step_size_afiro, solver.get_step_size_h(), factor_tolerance); | ||
| EXPECT_NEAR(initial_primal_weight_afiro, solver.get_primal_weight_h(), factor_tolerance); | ||
| cuopt::linear_programming::pdlp_hyper_params::update_primal_weight_on_initial_solution = false; | ||
| EXPECT_NEAR(initial_step_size_afiro, solver.get_step_size_h(0), factor_tolerance); | ||
| EXPECT_NEAR(initial_primal_weight_afiro, solver.get_primal_weight_h(0), factor_tolerance); | ||
| solver_settings.hyper_params.update_primal_weight_on_initial_solution = false; | ||
| } | ||
| { | ||
| cuopt::linear_programming::detail::pdlp_solver_t<int, double> solver(problem, solver_settings); | ||
| auto pdlp_timer = timer_t(solver_settings.time_limit); | ||
| cuopt::linear_programming::pdlp_hyper_params::update_primal_weight_on_initial_solution = true; | ||
| cuopt::linear_programming::pdlp_hyper_params::update_step_size_on_initial_solution = true; | ||
| solver_settings.hyper_params.update_primal_weight_on_initial_solution = true; | ||
| solver_settings.hyper_params.update_step_size_on_initial_solution = true; | ||
| solver.run_solver(pdlp_timer); | ||
| RAFT_CUDA_TRY(cudaStreamSynchronize(handle_.get_stream())); | ||
| EXPECT_NEAR(initial_step_size_afiro, solver.get_step_size_h(), factor_tolerance); | ||
| EXPECT_NEAR(initial_primal_weight_afiro, solver.get_primal_weight_h(), factor_tolerance); | ||
| cuopt::linear_programming::pdlp_hyper_params::update_primal_weight_on_initial_solution = false; | ||
| cuopt::linear_programming::pdlp_hyper_params::update_step_size_on_initial_solution = false; | ||
| EXPECT_NEAR(initial_step_size_afiro, solver.get_step_size_h(0), factor_tolerance); | ||
| EXPECT_NEAR(initial_primal_weight_afiro, solver.get_primal_weight_h(0), factor_tolerance); | ||
| solver_settings.hyper_params.update_primal_weight_on_initial_solution = false; | ||
| solver_settings.hyper_params.update_step_size_on_initial_solution = false; | ||
| } |
Move hyper_params toggles before solver construction.
The flags are set after pdlp_solver_t is created, so the solver may never see the new values and the test won’t actually validate the intended behavior.
🔧 Proposed fix
- {
- cuopt::linear_programming::detail::pdlp_solver_t<int, double> solver(problem, solver_settings);
- auto pdlp_timer = timer_t(solver_settings.time_limit);
- solver_settings.hyper_params.update_step_size_on_initial_solution = true;
+ {
+ solver_settings.hyper_params.update_step_size_on_initial_solution = true;
+ cuopt::linear_programming::detail::pdlp_solver_t<int, double> solver(problem, solver_settings);
+ auto pdlp_timer = timer_t(solver_settings.time_limit);
solver.run_solver(pdlp_timer);
RAFT_CUDA_TRY(cudaStreamSynchronize(handle_.get_stream()));
EXPECT_NEAR(initial_step_size_afiro, solver.get_step_size_h(0), factor_tolerance);
EXPECT_NEAR(initial_primal_weight_afiro, solver.get_primal_weight_h(0), factor_tolerance);
solver_settings.hyper_params.update_step_size_on_initial_solution = false;
}
- {
- cuopt::linear_programming::detail::pdlp_solver_t<int, double> solver(problem, solver_settings);
- auto pdlp_timer = timer_t(solver_settings.time_limit);
- solver_settings.hyper_params.update_primal_weight_on_initial_solution = true;
+ {
+ solver_settings.hyper_params.update_primal_weight_on_initial_solution = true;
+ cuopt::linear_programming::detail::pdlp_solver_t<int, double> solver(problem, solver_settings);
+ auto pdlp_timer = timer_t(solver_settings.time_limit);
solver.run_solver(pdlp_timer);
RAFT_CUDA_TRY(cudaStreamSynchronize(handle_.get_stream()));
EXPECT_NEAR(initial_step_size_afiro, solver.get_step_size_h(0), factor_tolerance);
EXPECT_NEAR(initial_primal_weight_afiro, solver.get_primal_weight_h(0), factor_tolerance);
solver_settings.hyper_params.update_primal_weight_on_initial_solution = false;
}
- {
- cuopt::linear_programming::detail::pdlp_solver_t<int, double> solver(problem, solver_settings);
- auto pdlp_timer = timer_t(solver_settings.time_limit);
- solver_settings.hyper_params.update_primal_weight_on_initial_solution = true;
- solver_settings.hyper_params.update_step_size_on_initial_solution = true;
+ {
+ solver_settings.hyper_params.update_primal_weight_on_initial_solution = true;
+ solver_settings.hyper_params.update_step_size_on_initial_solution = true;
+ cuopt::linear_programming::detail::pdlp_solver_t<int, double> solver(problem, solver_settings);
+ auto pdlp_timer = timer_t(solver_settings.time_limit);
solver.run_solver(pdlp_timer);
RAFT_CUDA_TRY(cudaStreamSynchronize(handle_.get_stream()));
EXPECT_NEAR(initial_step_size_afiro, solver.get_step_size_h(0), factor_tolerance);
EXPECT_NEAR(initial_primal_weight_afiro, solver.get_primal_weight_h(0), factor_tolerance);
solver_settings.hyper_params.update_primal_weight_on_initial_solution = false;
solver_settings.hyper_params.update_step_size_on_initial_solution = false;
}🤖 Prompt for AI Agents
In `@cpp/tests/linear_programming/pdlp_test.cu` around lines 326 - 357, The test
currently sets solver_settings.hyper_params flags after constructing
cuopt::linear_programming::detail::pdlp_solver_t<int,double> (symbol:
pdlp_solver_t), so the solver instance never observes those toggles; move each
modification of solver_settings.hyper_params
(update_step_size_on_initial_solution, update_primal_weight_on_initial_solution)
to occur before constructing the corresponding pdlp_solver_t instance and
creating the timer_t, then construct the solver, run_solver, synchronize, assert
EXPECT_NEAR, and finally reset the hyper_params flags afterward as currently
done.
…ut of bound, reenable initial solution checking if not in batch pdlp mode
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
cpp/include/cuopt/linear_programming/pdlp/solver_solution.hpp (1)
95-147: Update Doxygen for batch-aware constructors and per-id getters.Docs still describe singular termination data and the new overload at Lines 139-147 lacks a Doxygen block. Please document the batch vector semantics and the meaning/bounds of the
idparameter in the per-id accessors.As per coding guidelines, for public header files (C++ API): Check if new public functions/classes have documentation comments (Doxygen format) and verify parameter descriptions match actual types/behavior.
Also applies to: 198-213
cpp/src/linear_programming/solver_solution.cu (1)
48-89: Enforce termination vector size invariants to prevent OOB access.Accessors index
termination_stats_based ontermination_status_.size(). If the vectors differ in size or are empty, this can go out-of-bounds. Add a size/non-empty validation in the vector-based constructors (or a shared validator) so accessor checks remain safe.🛠️ Suggested guard (apply to both vector constructors)
: primal_solution_(std::move(final_primal_solution)), dual_solution_(std::move(final_dual_solution)), reduced_cost_(std::move(final_reduced_cost)), pdlp_warm_start_data_(std::move(warm_start_data)), objective_name_(objective_name), var_names_(std::move(var_names)), row_names_(std::move(row_names)), termination_stats_(std::move(termination_stats)), termination_status_(std::move(termination_status)), error_status_(cuopt::logic_error("", cuopt::error_type_t::Success)) { + cuopt_expects(termination_stats_.size() == termination_status_.size() && + !termination_stats_.empty(), + error_type_t::ValidationError, + "termination_stats and termination_status must have the same non-zero size"); }Also applies to: 321-332, 387-392
🧹 Nitpick comments (2)
cpp/src/linear_programming/solve.cu (1)
862-865: Consider using consistent stream for device_uvector copies. The warm start device_uvectors are constructed using the source solution's stream. While this likely works since warm_start_settings uses the same handle, explicitly using the stream variable from line 814 would make the intent clearer and ensure consistency.
♻️ Optional clarification
  if (primal_dual_init) {
-   initial_primal = rmm::device_uvector<double>(
-     original_solution.get_primal_solution(), original_solution.get_primal_solution().stream());
-   initial_dual = rmm::device_uvector<double>(original_solution.get_dual_solution(),
-                                              original_solution.get_dual_solution().stream());
+   initial_primal = rmm::device_uvector<double>(
+     original_solution.get_primal_solution(), stream);
+   initial_dual = rmm::device_uvector<double>(
+     original_solution.get_dual_solution(), stream);
    initial_step_size = original_solution.get_pdlp_warm_start_data().initial_step_size_;
  }
cpp/include/cuopt/linear_programming/pdlp/solver_solution.hpp (1)
243-259: Consider const overloads to avoid copies and unintended mutation.
get_additional_termination_informations() const returns by value, and there's no const overload for get_terminations_status(). A const& return for both would avoid copies and accidental mutation in read-only contexts.
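A hedged sketch of what the const-qualified accessors could look like; only the accessor and member names come from the comment above, and the element types are placeholders.

```cpp
// Sketch only: return by const reference so read-only callers avoid copies
// and cannot mutate the stored vectors. Element types are placeholders.
const std::vector<termination_stats_t>& get_additional_termination_informations() const
{
  return termination_stats_;
}

const std::vector<termination_status_t>& get_terminations_status() const
{
  return termination_status_;
}
```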
…roblem from public facing api, default init mip_batch_pdlp to 0
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@cpp/src/dual_simplex/pseudo_costs.cpp`:
- Around line 263-264: The batch_pdlp_solve call is missing the configured
settings and thus uses defaults; update the call to pass the existing settings
variable (batch_pdlp_solve(..., fractional, fraction_values, settings)) but
first adjust the solver time limit in settings for strong-branching use (e.g.,
reduce settings.time_limit or set a short per-branch timeout) so the PDLP solver
runs with the intended hyper-parameters; locate the call to batch_pdlp_solve and
the nearby settings variable to apply this change.
🧹 Nitpick comments (1)
cpp/src/dual_simplex/pseudo_costs.cpp (1)
145-146: Use template type i_t instead of int for consistency. Variables m and n are declared as int but should use the template type i_t for consistency with the rest of the codebase.
📝 Proposed fix
  cuopt::mps_parser::mps_data_model_t<i_t, f_t> mps_model;
- int m = user_problem.num_rows;
- int n = user_problem.num_cols;
+ i_t m = user_problem.num_rows;
+ i_t n = user_problem.num_cols;
const auto mps_model = simplex_problem_to_mps_data_model(original_problem);
const auto solutions = batch_pdlp_solve(original_problem.handle_ptr, mps_model, fractional, fraction_values);
🧩 Analysis chain
🏁 Script executed:
# Search for batch_pdlp_solve definition and all call sites
rg -n 'batch_pdlp_solve' --type cpp --type hpp -A 2 -B 1Repository: NVIDIA/cuopt
Length of output: 83
🏁 Script executed:
# Also search for the function definition in header files
fd . --name "*.hpp" --name "*.h" | xargs rg -l 'batch_pdlp_solve' | head -5Repository: NVIDIA/cuopt
Length of output: 468
🏁 Script executed:
# Get context around lines 263-264 in pseudo_costs.cpp
sed -n '250,280p' cpp/src/dual_simplex/pseudo_costs.cpp | cat -nRepository: NVIDIA/cuopt
Length of output: 1641
🏁 Script executed:
# Find batch_pdlp_solve signature in solve.hpp
cat cpp/include/cuopt/linear_programming/solve.hpp | grep -A 20 'batch_pdlp_solve'Repository: NVIDIA/cuopt
Length of output: 987
🏁 Script executed:
# Check all call sites of batch_pdlp_solve in solve.cu
rg -n 'batch_pdlp_solve' cpp/src/linear_programming/solve.cu -A 1 -B 1Repository: NVIDIA/cuopt
Length of output: 486
🏁 Script executed:
# Check all call sites in pdlp_test.cu
rg -n 'batch_pdlp_solve' cpp/tests/linear_programming/pdlp_test.cu -A 1 -B 1Repository: NVIDIA/cuopt
Length of output: 297
🏁 Script executed:
# Get full context around the batch_pdlp_solve call to see what 'settings' is available
sed -n '240,320p' cpp/src/dual_simplex/pseudo_costs.cpp | cat -nRepository: NVIDIA/cuopt
Length of output: 4069
🏁 Script executed:
# Check what 'settings' contains - look for settings type and hyper_params usage
rg -n 'settings\.' cpp/src/dual_simplex/pseudo_costs.cpp | head -20Repository: NVIDIA/cuopt
Length of output: 1371
Pass configured settings to batch_pdlp_solve call.
The call to batch_pdlp_solve at line 264 omits the settings parameter and will use the default pdlp_solver_settings_t<i_t, f_t>{}. This ignores configured hyper-parameters that should influence the solver behavior during strong branching. The settings variable is already in scope and should be passed with appropriate time-limit adjustment.
🔧 Proposed fix
const auto mps_model = simplex_problem_to_mps_data_model(original_problem);
- const auto solutions = batch_pdlp_solve(original_problem.handle_ptr, mps_model, fractional, fraction_values);
+ // Create PDLP settings from simplex settings
+ cuopt::linear_programming::pdlp_solver_settings_t<i_t, f_t> pdlp_settings;
+ pdlp_settings.hyper_params = settings.hyper_params;
+ pdlp_settings.time_limit = std::max(0.0, settings.time_limit - toc(start_time));
+ const auto solutions = batch_pdlp_solve(original_problem.handle_ptr, mps_model, fractional, fraction_values, pdlp_settings);
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
  const auto mps_model = simplex_problem_to_mps_data_model(original_problem);
- const auto solutions = batch_pdlp_solve(original_problem.handle_ptr, mps_model, fractional, fraction_values);
+ // Create PDLP settings from simplex settings
+ cuopt::linear_programming::pdlp_solver_settings_t<i_t, f_t> pdlp_settings;
+ pdlp_settings.hyper_params = settings.hyper_params;
+ pdlp_settings.time_limit = std::max(0.0, settings.time_limit - toc(start_time));
+ const auto solutions = batch_pdlp_solve(original_problem.handle_ptr, mps_model, fractional, fraction_values, pdlp_settings);
🤖 Prompt for AI Agents
In `@cpp/src/dual_simplex/pseudo_costs.cpp` around lines 263 - 264, The
batch_pdlp_solve call is missing the configured settings and thus uses defaults;
update the call to pass the existing settings variable (batch_pdlp_solve(...,
fractional, fraction_values, settings)) but first adjust the solver time limit
in settings for strong-branching use (e.g., reduce settings.time_limit or set a
short per-branch timeout) so the PDLP solver runs with the intended
hyper-parameters; locate the call to batch_pdlp_solve and the nearby settings
variable to apply this change.
…usparse macro, call spmm preprocess only after a certain cuda version, optimal batch size risking 0 evaluation
…ays be equal, use bulk instead of a custom kernel so that it's more efficient, limits the risk of out-of-bounds access, and always uses a good thread count
Actionable comments posted: 3
🤖 Fix all issues with AI agents
In `@cpp/src/linear_programming/cusparse_view.cu`:
- Around line 236-262: The call to cusparseSpMM_preprocess in
my_cusparsespmm_preprocess must be guarded by a runtime dlsym check like SpMV;
add a typedef cusparseSpMM_preprocess_sig matching cusparseSpMM_preprocess's
signature, use dynamic_load_runtime::function<cusparseSpMM_preprocess_sig> to
lookup the symbol (handle missing symbol by returning/throwing similarly to
my_cusparsespmmv_preprocess), then invoke the resolved function instead of
calling cusparseSpMM_preprocess directly and wrap the call with
RAFT_CUSPARSE_TRY (keep the existing cusparseSetStream(handle, stream) before
the call).
In
`@cpp/src/linear_programming/optimal_batch_size_handler/optimal_batch_size_handler.cu`:
- Around line 40-46: The assignments of size_t current_batch_size to int
variables (col_primal, ld_primal, col_dual, ld_dual) can silently truncate large
values; update the code to either keep these variables as a size-safe type or
validate-and-cast before use: add an explicit runtime check/assert that
current_batch_size <= INT_MAX (or the max required by the cuSPARSE API) and then
use a checked cast (e.g., static_cast<int>) when assigning to
col_primal/ld_primal/col_dual/ld_dual, or change their declarations from int to
a wider unsigned type that matches primal_size/dual_size and only convert when
calling cuSPARSE. Ensure you reference and modify the variables
current_batch_size, col_primal, ld_primal, col_dual, and ld_dual (and any call
sites that pass them into cuSPARSE descriptors) so no silent narrowing occurs.
- Around line 244-250: The batch-size doubling and sum expressions can overflow
32-bit ints; change multiplications to use a wider intermediate (e.g., int64_t /
long long) before clamping back to int: for the right_node assignment (currently
computing right_node = std::min(current_batch_size * 2, max_batch_size)) and for
the sum expression used elsewhere (the (current_batch_size * 2) +
current_batch_size calculation), compute the doubled/added value in an int64_t
variable (e.g., int64_t doubled = static_cast<int64_t>(current_batch_size) * 2;
int64_t sum = doubled + static_cast<int64_t>(current_batch_size);), then use
std::min with static_cast<int>(...) to clamp against max_batch_size and assign
to right_node or the target variable; apply the same pattern for the other
occurrence referenced in the review to avoid overflow during the multiplication
step.
🧹 Nitpick comments (3)
cpp/src/linear_programming/pdhg.cu (1)
774-782: Consider adding CUDA error check after CUB device operation.
cub::DeviceFor::Bulk launches a kernel internally. While CUB operations are generally reliable, adding RAFT_CUDA_TRY(cudaPeekAtLastError()) after the call would align with the project's error-checking guidelines and catch any launch failures.
As per coding guidelines: "Every CUDA kernel launch and memory operation must have error checking."
🔧 Suggested addition
  cub::DeviceFor::Bulk(climber_strategies_.size(),
                       refine_initial_primal_projection_bulk_op<i_t, f_t>{
                         make_span(new_bounds_idx_),
                         make_span(new_bounds_lower_),
                         make_span(new_bounds_upper_),
                         make_span(current_saddle_point_state_.get_primal_solution()),
                         problem_ptr->n_variables},
                       stream_view_.value());
+ RAFT_CUDA_TRY(cudaPeekAtLastError());
278-286: Clarify batch semantics for solve time. In batch mode this writes/reads a single solve time across all climbers. Consider storing per-id solve times (or documenting that this is a global solve time) to avoid misleading per-solution stats.
Also applies to: 298-304
375-418: Avoid exposing mutable vectors that can break size invariants. Returning non-const refs lets callers resize and violate the termination_stats_ / termination_status_ size invariant. Consider returning const& / span and add explicit setters that enforce size equality (or document and enforce "no resize").
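If setters are preferred over exposing the vectors, a sketch along these lines would keep the invariant enforced in one place; set_termination_data and the element types are hypothetical, only the member names and the cuopt_expects pattern come from the surrounding review.

```cpp
// Hypothetical setter: accepts both vectors together so their sizes can be
// validated once, instead of handing out mutable references.
void set_termination_data(std::vector<stats_t> stats, std::vector<status_t> status)
{
  cuopt_expects(stats.size() == status.size() && !stats.empty(),
                error_type_t::ValidationError,
                "termination_stats and termination_status must have the same non-zero size");
  termination_stats_  = std::move(stats);
  termination_status_ = std::move(status);
}
```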
// TODO add proper checking
#if CUDA_VER_12_4_UP
template <typename T,
          typename std::enable_if_t<std::is_same_v<T, float> || std::is_same_v<T, double>>*>
void my_cusparsespmm_preprocess(cusparseHandle_t handle,
                                cusparseOperation_t opA,
                                cusparseOperation_t opB,
                                const T* alpha,
                                const cusparseSpMatDescr_t matA,
                                const cusparseDnMatDescr_t matB,
                                const T* beta,
                                const cusparseDnMatDescr_t matC,
                                cusparseSpMMAlg_t alg,
                                void* externalBuffer,
                                cudaStream_t stream)
{
  auto constexpr float_type = []() constexpr {
    if constexpr (std::is_same_v<T, float>) {
      return CUDA_R_32F;
    } else if constexpr (std::is_same_v<T, double>) {
      return CUDA_R_64F;
    }
  }();
  CUSPARSE_CHECK(cusparseSetStream(handle, stream));
  RAFT_CUSPARSE_TRY(cusparseSpMM_preprocess(
    handle, opA, opB, alpha, matA, matB, beta, matC, float_type, alg, externalBuffer));
}
🧩 Analysis chain
🏁 Script executed:
find . -type f -name "cusparse_view.cu" | head -5Repository: NVIDIA/cuopt
Length of output: 141
🏁 Script executed:
# Once we find the file, let's examine the relevant lines
git ls-files | grep -E "(cusparse|linear_programming)" | head -20Repository: NVIDIA/cuopt
Length of output: 1217
🏁 Script executed:
wc -l ./cpp/src/linear_programming/cusparse_view.cuRepository: NVIDIA/cuopt
Length of output: 105
🏁 Script executed:
sed -n '230,270p' ./cpp/src/linear_programming/cusparse_view.cuRepository: NVIDIA/cuopt
Length of output: 1680
🏁 Script executed:
# Search for SpMV implementation to understand the guard pattern being referenced
rg -A 15 "my_cusparsespmmv_preprocess|dlsym|dynamic_load_runtime" ./cpp/src/linear_programming/cusparse_view.cu | head -50Repository: NVIDIA/cuopt
Length of output: 2339
🏁 Script executed:
# Check for CUDA version requirements in the file
rg -B 5 -A 5 "CUDA_VER_12_4|cusparseSpMM_preprocess" ./cpp/src/linear_programming/cusparse_view.cuRepository: NVIDIA/cuopt
Length of output: 3035
🏁 Script executed:
# Search for cusparseSpMM_preprocess_sig definition
rg "cusparseSpMM_preprocess_sig" ./cpp/src/linear_programming/cusparse_view.cuRepository: NVIDIA/cuopt
Length of output: 38
🏁 Script executed:
# Check where my_cusparsespmm_preprocess is actually called to understand impact
rg -B 3 -A 3 "my_cusparsespmm_preprocess\(" ./cpp/src/linear_programming/cusparse_view.cu | head -40Repository: NVIDIA/cuopt
Length of output: 1785
🏁 Script executed:
# Check the actual signature of cusparseSpMM_preprocess in CUDA headers or documentation
# to understand what signature typedef would be needed
rg "cusparseSpMV_preprocess_sig|using cusparse_sig" ./cpp/src/linear_programming/cusparse_view.cu -A 5 | head -30Repository: NVIDIA/cuopt
Length of output: 834
🏁 Script executed:
# Verify the pattern: check if similar runtime guards exist elsewhere in the codebase
rg -B 2 "dynamic_load_runtime::function" ./cpp/src/linear_programming/cusparse_view.cu | head -20Repository: NVIDIA/cuopt
Length of output: 296
Add runtime symbol guard to cusparseSpMM_preprocess to match SpMV's dlsym protection.
The my_cusparsespmm_preprocess function (lines 242–260) directly calls cusparseSpMM_preprocess without runtime symbol verification. If the binary runs against a pre-12.4 cuSPARSE, the symbol cannot be resolved and the call fails. The my_cusparsespmmv_preprocess (SpMV) implementation correctly uses dynamic_load_runtime::function with runtime dlsym checking; SpMM needs the same guard.
Add a cusparseSpMM_preprocess_sig typedef matching the function's arguments and wrap the call with the dlsym guard pattern already demonstrated in SpMV.
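For illustration, a minimal version of such a guard using plain dlsym is sketched below; the codebase's own dynamic_load_runtime helper used for the SpMV path is the preferred mechanism, and the exact descriptor const-ness of cusparseSpMM_preprocess varies across CUDA versions, so treat the typedef as an assumption.

```cpp
#include <dlfcn.h>
#include <cusparse.h>

// Function-pointer type mirroring cusparseSpMM_preprocess (descriptor const-ness
// differs between CUDA versions; this is a sketch, not the exact typedef).
using cusparseSpMM_preprocess_fn = cusparseStatus_t (*)(cusparseHandle_t,
                                                        cusparseOperation_t,
                                                        cusparseOperation_t,
                                                        const void*,
                                                        cusparseSpMatDescr_t,
                                                        cusparseDnMatDescr_t,
                                                        const void*,
                                                        cusparseDnMatDescr_t,
                                                        cudaDataType,
                                                        cusparseSpMMAlg_t,
                                                        void*);

inline cusparseSpMM_preprocess_fn try_load_spmm_preprocess()
{
  // Returns nullptr when the loaded cuSPARSE does not export the symbol,
  // letting the caller skip the optional preprocessing step instead of failing.
  return reinterpret_cast<cusparseSpMM_preprocess_fn>(
    dlsym(RTLD_DEFAULT, "cusparseSpMM_preprocess"));
}
```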
🤖 Prompt for AI Agents
In `@cpp/src/linear_programming/cusparse_view.cu` around lines 236 - 262, The call
to cusparseSpMM_preprocess in my_cusparsespmm_preprocess must be guarded by a
runtime dlsym check like SpMV; add a typedef cusparseSpMM_preprocess_sig
matching cusparseSpMM_preprocess's signature, use
dynamic_load_runtime::function<cusparseSpMM_preprocess_sig> to lookup the symbol
(handle missing symbol by returning/throwing similarly to
my_cusparsespmmv_preprocess), then invoke the resolved function instead of
calling cusparseSpMM_preprocess directly and wrap the call with
RAFT_CUSPARSE_TRY (keep the existing cusparseSetStream(handle, stream) before
the call).
int rows_primal = primal_size;
int col_primal = current_batch_size;
int ld_primal = current_batch_size;

int rows_dual = dual_size;
int col_dual = current_batch_size;
int ld_dual = current_batch_size;
Narrowing conversion from size_t to int may truncate large batch sizes.
current_batch_size is size_t but is assigned to int variables col_primal and ld_primal. If current_batch_size > INT_MAX, this truncates silently, causing incorrect cuSPARSE descriptors and potential memory corruption.
🐛 Proposed fix
- int rows_primal = primal_size;
- int col_primal = current_batch_size;
- int ld_primal = current_batch_size;
-
- int rows_dual = dual_size;
- int col_dual = current_batch_size;
- int ld_dual = current_batch_size;
+ int64_t rows_primal = primal_size;
+ int64_t col_primal = current_batch_size;
+ int64_t ld_primal = current_batch_size;
+
+ int64_t rows_dual = dual_size;
+ int64_t col_dual = current_batch_size;
+ int64_t ld_dual = current_batch_size;Alternatively, add an assertion to validate that current_batch_size fits in the type expected by the cuSPARSE API.
🤖 Prompt for AI Agents
In
`@cpp/src/linear_programming/optimal_batch_size_handler/optimal_batch_size_handler.cu`
around lines 40 - 46, The assignments of size_t current_batch_size to int
variables (col_primal, ld_primal, col_dual, ld_dual) can silently truncate large
values; update the code to either keep these variables as a size-safe type or
validate-and-cast before use: add an explicit runtime check/assert that
current_batch_size <= INT_MAX (or the max required by the cuSPARSE API) and then
use a checked cast (e.g., static_cast<int>) when assigning to
col_primal/ld_primal/col_dual/ld_dual, or change their declarations from int to
a wider unsigned type that matches primal_size/dual_size and only convert when
calling cuSPARSE. Ensure you reference and modify the variables
current_batch_size, col_primal, ld_primal, col_dual, and ld_dual (and any call
sites that pass them into cuSPARSE descriptors) so no silent narrowing occurs.
// Sync before starting anything to make sure everything is done
RAFT_CUDA_TRY(cudaStreamSynchronize(stream_view));

// Evaluate current, left and right nodes to pick a direction

const int left_node  = std::max(1, current_batch_size / 2);
const int right_node = std::min(current_batch_size * 2, max_batch_size);
🧩 Analysis chain
🏁 Script executed:
# First, locate the file
find . -name "optimal_batch_size_handler.cu" -type fRepository: NVIDIA/cuopt
Length of output: 141
🏁 Script executed:
# Search for call sites and function definition
rg -n "optimal_batch_size_handler" --type cpp --type cu -A 3 -B 3Repository: NVIDIA/cuopt
Length of output: 82
🏁 Script executed:
# Look at the file structure and function signatures
rg -n "max_batch_size" -B 5 -A 10 --type cpp --type cu | head -100Repository: NVIDIA/cuopt
Length of output: 82
🏁 Script executed:
# Read the file to examine the code at specified lines
cat -n ./cpp/src/linear_programming/optimal_batch_size_handler/optimal_batch_size_handler.cu | sed -n '240,260p'Repository: NVIDIA/cuopt
Length of output: 1303
🏁 Script executed:
# Also check the other lines mentioned
cat -n ./cpp/src/linear_programming/optimal_batch_size_handler/optimal_batch_size_handler.cu | sed -n '280,295p'Repository: NVIDIA/cuopt
Length of output: 767
🏁 Script executed:
# Check lines around 321, 365, 400-401
cat -n ./cpp/src/linear_programming/optimal_batch_size_handler/optimal_batch_size_handler.cu | sed -n '315,330p'Repository: NVIDIA/cuopt
Length of output: 903
🏁 Script executed:
# Check line 365
cat -n ./cpp/src/linear_programming/optimal_batch_size_handler/optimal_batch_size_handler.cu | sed -n '360,375p'Repository: NVIDIA/cuopt
Length of output: 912
🏁 Script executed:
# Check lines around 400-401
cat -n ./cpp/src/linear_programming/optimal_batch_size_handler/optimal_batch_size_handler.cu | sed -n '395,410p'Repository: NVIDIA/cuopt
Length of output: 855
🏁 Script executed:
# Find the function definition to understand the parameter types
cat -n ./cpp/src/linear_programming/optimal_batch_size_handler/optimal_batch_size_handler.cu | head -50Repository: NVIDIA/cuopt
Length of output: 2238
🏁 Script executed:
# Search for calls to optimal_batch_size_handler without type restrictions
grep -rn "optimal_batch_size_handler" --include="*.cu" --include="*.cpp" --include="*.hpp" -A 3 -B 1Repository: NVIDIA/cuopt
Length of output: 4071
🏁 Script executed:
# Find memory_max_batch_size definition and usage
grep -rn "memory_max_batch_size" --include="*.cu" --include="*.cpp" --include="*.hpp" -B 2 -A 2Repository: NVIDIA/cuopt
Length of output: 1709
🏁 Script executed:
# Check if there's any validation or constraint on max_batch_size in the function
cat -n ./cpp/src/linear_programming/optimal_batch_size_handler/optimal_batch_size_handler.cu | sed -n '196,230p'Repository: NVIDIA/cuopt
Length of output: 1971
🏁 Script executed:
# Look for any constants or limits related to batch size
grep -rn "INT_MAX\|INT_MIN\|batch_size.*limit\|max.*batch" --include="*.hpp" --include="*.h" | grep -i batch | head -20Repository: NVIDIA/cuopt
Length of output: 195
Verify integer overflow protection for batch size calculations.
On lines 250 and 365, current_batch_size * 2 computes in int and can overflow if current_batch_size is large. While the result is capped by max_batch_size, the overflow happens before the comparison. Similar risk exists on line 321 where (current_batch_size * 2) + current_batch_size can overflow during the multiplication step.
However, line 287 performs division (current_batch_size / 2), not multiplication, and line 400-401 doesn't involve doubling ((current_batch_size / 2) + current_batch_size), so these don't have the same overflow risk.
To fix: Use wider intermediate types for all multiplications involving batch sizes:
const int left_node = std::max(1, current_batch_size / 2);
- const int right_node = std::min(current_batch_size * 2, max_batch_size);
+ const int right_node = std::min(static_cast<int>(static_cast<int64_t>(current_batch_size) * 2),
+                                  max_batch_size);
Apply similar fixes to lines 321 and 365.
🤖 Prompt for AI Agents
In
`@cpp/src/linear_programming/optimal_batch_size_handler/optimal_batch_size_handler.cu`
around lines 244 - 250, The batch-size doubling and sum expressions can overflow
32-bit ints; change multiplications to use a wider intermediate (e.g., int64_t /
long long) before clamping back to int: for the right_node assignment (currently
computing right_node = std::min(current_batch_size * 2, max_batch_size)) and for
the sum expression used elsewhere (the (current_batch_size * 2) +
current_batch_size calculation), compute the doubled/added value in an int64_t
variable (e.g., int64_t doubled = static_cast<int64_t>(current_batch_size) * 2;
int64_t sum = doubled + static_cast<int64_t>(current_batch_size);), then use
std::min with static_cast<int>(...) to clamp against max_batch_size and assign
to right_node or the target variable; apply the same pattern for the other
occurrence referenced in the review to avoid overflow during the multiplication
step.
…assert size comment, changed from non-const static to constexpr, fix potential segfault in non-concurrent mode, correction use f_t instead of double, protect against wrong size in matrix swap and optimal batch size handler, guarded against SegmentedReduce buffer allocation issues
Actionable comments posted: 9
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
cpp/src/linear_programming/saddle_point.cu (1)
94-107: Add validation or documentation clarifyingcopy()batch semantics.The
copy()method copies onlyprimal_size_anddual_size_elements, while all other state buffers are allocated asbatch_size * dimension. Unlikeprimal_gradient_(which has an explicit comment stating it does not support batch mode),copy()lacks documentation clarifying whether it intentionally copies only a single problem's solution or if this is a batch-aware design oversight.Since
swap_context()andresize_context()properly handle batch data, add either:
- A validation check asserting
batch_size == 1before copying, or- Documentation explicitly stating that
copy()only operates on single-batch problems.
🤖 Fix all issues with AI agents
In `@cpp/src/linear_programming/cusparse_view.cu`:
- Line 261: Replace the inconsistent use of CUSPARSE_CHECK with the project's
standard RAFT_CUSPARSE_TRY macro for the cusparseSetStream call: change the call
that currently invokes CUSPARSE_CHECK(cusparseSetStream(handle, stream)) to use
RAFT_CUSPARSE_TRY(cusparseSetStream(handle, stream)) so error handling matches
the rest of the file and uses the same macro semantics for the symbols
cusparseSetStream, handle, and stream.
In `@cpp/src/linear_programming/pdhg.cu`:
- Around line 170-177: The include for utilities/copy_helpers.hpp (which
provides make_span) is currently guarded by CUPDLP_DEBUG_MODE causing
compilation failures in release; remove the debug-only guard so make_span is
available in all builds, ensuring calls like make_span(...) passed into
pdhg_swap_bounds_kernel (and any other kernel launches) compile; update the
include that provides make_span to be unconditional (no CUPDLP_DEBUG_MODE
`#if/`#ifdef) so symbols like make_span are defined for pdhg_swap_bounds_kernel
and related uses.
- Around line 774-781: The CUB bulk launches (cub::DeviceFor::Bulk calls)
currently lack CUDA error checks; wrap each call with RAFT_CUDA_TRY so failures
are surfaced. Specifically, replace each raw call like
cub::DeviceFor::Bulk(climber_strategies_.size(),
refine_initial_primal_projection_bulk_op<i_t, f_t>{...}, stream_view_.value())
(and the similar calls using other bulk op functors at the other locations) with
RAFT_CUDA_TRY(cub::DeviceFor::Bulk(...)); keeping the same arguments (e.g.,
climber_strategies_.size(), the respective refine_*_bulk_op functor instances,
and stream_view_.value()) so the return cudaError_t is checked.
In `@cpp/src/linear_programming/solve.cu`:
- Around line 728-743: The loop that decrements memory_max_batch_size using
batch_pdlp_memory_estimator can exit at 1 while the estimated memory still
exceeds free_mem, risking OOM; after the while loop (or when setting
use_optimal_batch_size) recompute or recheck
batch_pdlp_memory_estimator(problem, memory_max_batch_size, max_batch_size)
against free_mem and if it still exceeds free_mem, fail fast with a clear
error/exception or return an error code (include context like
memory_max_batch_size, free_mem and the failing estimate) instead of proceeding;
update the logic around use_optimal_batch_size and memory_max_batch_size to
ensure callers see the failure.
In
`@cpp/src/linear_programming/step_size_strategy/adaptive_step_size_strategy.cu`:
- Around line 483-516: The three calls to cub::DeviceSegmentedReduce::Sum (the
calls that use tuple_multiplies<f_t> and power_two_func_t<f_t> and the
temporaries dot_product_storage/dot_product_bytes on stream_view_) are missing
CUDA error checking; capture the cudaError_t returned by each
DeviceSegmentedReduce::Sum call and pass it to the project's error-checking
macro (e.g., CUDA_CHECK) or explicitly check and log/fail on non-success, so
each segmented-reduce invocation is validated (include a short context message
identifying the call when reporting the error).
In `@cpp/src/linear_programming/swap_and_resize_helper.cuh`:
- Around line 78-85: cub::DeviceTransform::Transform returns a cudaError_t and
its return value must be checked; wrap the call to
cub::DeviceTransform::Transform (the lambda swap transform invocation) with the
project's CUDA error macro (e.g. RAFT_CUDA_TRY or CUDA_CHECK) to surface launch
failures and propagate errors, i.e. call
RAFT_CUDA_TRY(cub::DeviceTransform::Transform(...)) or equivalent around the
existing call and keep passing matrix.stream() and the same
in_zip/out_zip/total_items and lambda as before.
In `@cpp/src/linear_programming/termination_strategy/convergence_information.cu`:
- Around line 371-385: The two CUB calls (cub::DeviceSegmentedReduce::Sum and
cub::DeviceTransform::Transform) in the batch residual path ignore their return
codes; capture their returned status and check it using the project's CUDA
error-checking macro (e.g., RAFT_CUDA_TRY or CUDA_TRY) or by testing the
returned cudaError_t and handling/logging on failure. Specifically, wrap the
call to cub::DeviceSegmentedReduce::Sum(...) (using dot_product_storage_,
dot_product_bytes_, thrust::make_transform_iterator(primal_residual_.data(),
power_two_func_t<f_t>{}), l2_primal_residual_.data(),
climber_strategies_.size(), dual_size_h_, stream_view_) and the call to
cub::DeviceTransform::Transform(...) (using l2_primal_residual_.data(),
l2_primal_residual_.data(), l2_primal_residual_.size(), the sqrt lambda,
stream_view_) with the error check so any CUDA/CUB error is propagated or logged
consistently.
- Around line 137-188: The sizing calls to cub::DeviceSegmentedReduce::Sum (the
nullptr temp-storage queries using variables like byte_needed, dual_dot_,
power_two_func_t, tuple_multiplies, primal_objective_, stream_view_, and
dot_product_bytes_) are not checking the returned cudaError_t; assign each
call's return to a cudaError_t (e.g., cudaError_t err =
cub::DeviceSegmentedReduce::Sum(...)) and immediately validate it with the
project error macro (CUDA_CHECK(err) or equivalent) after every sizing
invocation to ensure failures are caught during the segmented-reduce sizing
phase.
- Around line 766-793: Wrap each CUB call in the dual objective computation (the
two cub::DeviceSegmentedReduce::Sum calls and the
cub::DeviceTransform::Transform call) with CUDA error checking: capture the
returned cudaError_t from each call and call the project's CUDA error-handling
utility (or check for != cudaSuccess and log/return) using the same context
variables (dot_product_storage_, dot_product_bytes_,
thrust::make_transform_iterator(...), dual_dot_.data(), primal_slack_.data(),
sum_primal_slack_.data(), dual_objective_.data(), dual_objective_.size(),
stream_view_) so any failure is detected and reported immediately; ensure you do
this in the block that computes the dual objective so failures in segmented
reduce or transform are handled.
🧹 Nitpick comments (13)
cpp/src/linear_programming/saddle_point.cu (2)
23-38: Missing validation forbatch_sizeparameter.The constructor validates
primal_size > 0anddual_size > 0but does not validatebatch_size > 0. While zero-sized allocations may be handled by RMM, it would be more consistent to explicitly validate this parameter, especially sinceswap_contextandresize_contextassertbatch_size > 0.🔧 Suggested fix
{ EXE_CUOPT_EXPECTS(primal_size > 0, "Size of the primal problem must be larger than 0"); EXE_CUOPT_EXPECTS(dual_size > 0, "Size of the dual problem must be larger than 0"); + EXE_CUOPT_EXPECTS(batch_size > 0, "Batch size must be larger than 0");
77-92: Implementation looks correct; minor clarification suggestion.The resize logic correctly shrinks all batched buffers while preserving the stream associations. The strict inequality
new_size < batch_sizeis appropriate for the use case of removing terminated climbers.Consider clarifying the assertion message to indicate it refers to the current batch size:
📝 Optional clarification
- cuopt_assert(new_size < batch_size, "New size must be less than batch size"); + cuopt_assert(new_size < batch_size, "New size must be less than current batch size");cpp/src/linear_programming/optimal_batch_size_handler/optimal_batch_size_handler.cu (3)
130-143: Incorrect cast of buffer data to(f_t*).The SpMM buffer is a
void*opaque workspace returned by cuSPARSE, not typed data. Casting it to(f_t*)is semantically incorrect. While this may work because cuSPARSE interprets the pointer asvoid*anyway, the cast is misleading and could cause issues if the buffer is used elsewhere expectingf_t*.♻️ Suggested fix
RAFT_CUSPARSE_TRY(raft::sparse::detail::cusparsespmm( handle_ptr->get_cusparse_handle(), CUSPARSE_OPERATION_NON_TRANSPOSE, CUSPARSE_OPERATION_NON_TRANSPOSE, alpha.data(), A, x_descr, beta.data(), y_descr, (deterministic_batch_pdlp) ? CUSPARSE_SPMM_CSR_ALG3 : CUSPARSE_SPMM_CSR_ALG2, - (f_t*)buffer_non_transpose_batch.data(), + buffer_non_transpose_batch.data(), stream_view)); RAFT_CUSPARSE_TRY(raft::sparse::detail::cusparsespmm( handle_ptr->get_cusparse_handle(), CUSPARSE_OPERATION_NON_TRANSPOSE, CUSPARSE_OPERATION_NON_TRANSPOSE, alpha.data(), A_T, y_descr, beta.data(), x_descr, (deterministic_batch_pdlp) ? CUSPARSE_SPMM_CSR_ALG3 : CUSPARSE_SPMM_CSR_ALG2, - (f_t*)buffer_transpose_batch.data(), + buffer_transpose_batch.data(), stream_view));
221-221: Unused variablestream_view.The variable
stream_viewis declared on line 221 but is only used once on line 246 for synchronization. Consider inlining the call or removing the variable if the handle's stream can be used directly.♻️ Suggested fix
- rmm::cuda_stream_view stream_view = op_problem.get_handle_ptr()->get_stream(); detail::problem_t<i_t, f_t> problem(op_problem); // ... (cusparse view creation) ... // Sync before starting anything to make sure everything is done - RAFT_CUDA_TRY(cudaStreamSynchronize(stream_view)); + RAFT_CUDA_TRY(cudaStreamSynchronize(op_problem.get_handle_ptr()->get_stream()));
425-435: Unreachable code block is intentional but may trigger compiler warnings.The
elseblock at line 426-431 handles the case wherecurrent_ratiois best, and it returns. The code at lines 433-434 is never reached. While this is good defensive programming, some compilers may warn about unreachable code.Consider whether to keep the defensive assert or remove it to avoid warnings, depending on your CI/compiler warning policy.
cpp/src/linear_programming/cusparse_view.cu (3)
339-396: Batch mode dense matrix descriptors created without error checking on data validity.The batch descriptor creation (lines 343-396) uses raw data pointers from
current_saddle_point_stateand input buffers. While there's an assert for_reflected_primal_solutionat line 363, other pointers likeget_dual_solution().data(),get_current_AtY().data()etc. are used directly without validation. Consider adding similar guards for critical pointers, especially since batch mode is new functionality.Also note the typo on line 339: "mdoe" should be "mode".
✏️ Fix typo
- // TODO batch mdoe: convert those to RAII views + // TODO batch mode: convert those to RAII views
481-512: Batch-mode buffer allocation lacks error checking after resize.The
buffer_transpose_batch_row_row_andbuffer_non_transpose_batch_row_row_buffers are resized (lines 495-496, 510-511) but there's no subsequent check that the allocation succeeded before using.data()in the preprocessing calls (lines 570, 582). While RMM typically throws on allocation failure, consider adding explicit size validation for robustness in this new batch path.
559-584: Missing CUDA error check aftermy_cusparsespmm_preprocesscalls in batch mode.The preprocessing calls inside the
if (batch_mode_)block (lines 560-571, 572-583) don't have explicit error checking after the function calls. Whilemy_cusparsespmm_preprocessusesRAFT_CUSPARSE_TRYinternally, it's good practice to verify the stream state after a sequence of GPU operations, especially in new batch-mode code paths.cpp/src/linear_programming/step_size_strategy/adaptive_step_size_strategy.cu (2)
154-165:resize_contextassertion may be overly restrictive.Line 160 asserts
new_size < batch_size, which prevents resizing to the same size. This might be intentional, but typically resize operations allownew_size <= batch_size(no-op when equal). If resizing to current size should be valid, consider:✏️ Allow no-op resize
- cuopt_assert(new_size < batch_size, "New size must be less than batch size"); + cuopt_assert(new_size <= batch_size, "New size must be less than or equal to batch size"); + if (new_size == batch_size) { return; } // No-op
167-268: Kernel asserts batch mode not supported, but batch path exists elsewhere.The
compute_step_sizes_from_movement_and_interactionkernel (lines 176-177) asserts thatprimal_weight.size() == 1, meaning it doesn't support batch mode. However, thecompute_step_sizesfunction (line 329) that calls this kernel also asserts!batch_mode_.This is internally consistent, but the kernel assertion message could be more informative about why batch is unsupported here (presumably because per-climber adaptive step sizing isn't implemented yet).
✏️ Clarify assertion message
- cuopt_assert(step_size_strategy_view.primal_weight.size() == 1, - "compute_step_sizes_from_movement_and_interaction not supported in batch"); + cuopt_assert(step_size_strategy_view.primal_weight.size() == 1, + "Per-climber adaptive step size computation not yet implemented in batch mode");cpp/src/linear_programming/restart_strategy/pdlp_restart_strategy.cu (3)
980-985: Redundant assignment in loop.The
weighted_average_solution_.iterations_since_last_restart_ = 0assignment is set multiple times inside the loop but only needs to be set once outside.♻️ Suggested fix
// TODO later batch mode: remove if you have per climber restart + weighted_average_solution_.iterations_since_last_restart_ = 0; for (size_t i = 0; i < climber_strategies_.size(); ++i) { - weighted_average_solution_.iterations_since_last_restart_ = 0; last_trial_fixed_point_error_[i] = std::numeric_limits<f_t>::infinity(); }
995-1003: Consider adding bounds validation for swap indices.Unlike the similar
swap_contextinconvergence_information.cuwhich validatespair.left < batch_sizeandpair.right < batch_size, this function skips bounds checking before accessing vector elements.♻️ Suggested validation
template <typename i_t, typename f_t> void pdlp_restart_strategy_t<i_t, f_t>::swap_context( const thrust::universal_host_pinned_vector<swap_pair_t<i_t>>& swap_pairs) { if (swap_pairs.empty()) { return; } last_restart_duality_gap_.swap_context(swap_pairs); cuopt_assert(!fixed_point_error_.empty(), "Fixed point error must not be empty"); + const auto batch_size = static_cast<i_t>(fixed_point_error_.size()); for (const auto& pair : swap_pairs) { + cuopt_assert(pair.left < batch_size && pair.right < batch_size, "Swap index out of bounds"); host_vector_swap(fixed_point_error_, pair.left, pair.right);
709-717: Consider parallel execution for large batch sizes.The TODO comment is valid - if batch sizes grow large,
std::reducewithout an execution policy may become a bottleneck. For now, this is acceptable, but consider usingstd::reduce(std::execution::par_unseq, ...)if batch sizes exceed a threshold.
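A minimal sketch of what that threshold-gated switch could look like, assuming the per-climber errors live in a host-side std::vector; the function name and cutoff are illustrative only.

```cpp
#include <cstddef>
#include <execution>
#include <numeric>
#include <vector>

// Sketch: use a parallel reduction only past some batch-size threshold,
// since the sequential std::reduce is fine for small climber counts.
double sum_fixed_point_error(const std::vector<double>& per_climber_error)
{
  constexpr std::size_t kParallelThreshold = 1 << 12;  // illustrative cutoff
  if (per_climber_error.size() < kParallelThreshold) {
    return std::reduce(per_climber_error.begin(), per_climber_error.end(), 0.0);
  }
  return std::reduce(std::execution::par_unseq,
                     per_climber_error.begin(), per_climber_error.end(), 0.0);
}
```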
    return CUDA_R_64F;
  }
}();
CUSPARSE_CHECK(cusparseSetStream(handle, stream));
Inconsistent error-checking macro: use RAFT_CUSPARSE_TRY instead of CUSPARSE_CHECK.
Line 261 uses CUSPARSE_CHECK while the rest of the file consistently uses RAFT_CUSPARSE_TRY. This inconsistency could lead to different error handling behavior.
🔧 Proposed fix
- CUSPARSE_CHECK(cusparseSetStream(handle, stream));
+ RAFT_CUSPARSE_TRY(cusparseSetStream(handle, stream));
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
- CUSPARSE_CHECK(cusparseSetStream(handle, stream));
+ RAFT_CUSPARSE_TRY(cusparseSetStream(handle, stream));
🤖 Prompt for AI Agents
In `@cpp/src/linear_programming/cusparse_view.cu` at line 261, Replace the
inconsistent use of CUSPARSE_CHECK with the project's standard RAFT_CUSPARSE_TRY
macro for the cusparseSetStream call: change the call that currently invokes
CUSPARSE_CHECK(cusparseSetStream(handle, stream)) to use
RAFT_CUSPARSE_TRY(cusparseSetStream(handle, stream)) so error handling matches
the rest of the file and uses the same macro semantics for the symbols
cusparseSetStream, handle, and stream.
const auto [grid_size, block_size] =
  kernel_config_from_batch_size(static_cast<i_t>(swap_pairs.size()));
pdhg_swap_bounds_kernel<i_t, f_t>
  <<<grid_size, block_size, 0, stream_view_>>>(thrust::raw_pointer_cast(swap_pairs.data()),
                                               static_cast<i_t>(swap_pairs.size()),
                                               make_span(new_bounds_idx_),
                                               make_span(new_bounds_lower_),
                                               make_span(new_bounds_upper_));
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Search for make_span definition and declarations
echo "=== Searching for make_span definition ==="
rg -n "make_span" cpp/src -g'*.{cu,cpp,cuh,hpp,h}' --max-count=20
echo ""
echo "=== Checking pdhg.cu includes ==="
head -50 cpp/src/linear_programming/pdhg.cu | grep -n "#include"
echo ""
echo "=== Checking lines around 170-177 in pdhg.cu with context ==="
sed -n '1,50p' cpp/src/linear_programming/pdhg.cu | grep -n "CUPDLP_DEBUG_MODE\|copy_helpers"Repository: NVIDIA/cuopt
Length of output: 49128
🏁 Script executed:
#!/bin/bash
# Get the exact context around lines 170-177
echo "=== Code context at lines 165-180 ==="
sed -n '165,180p' cpp/src/linear_programming/pdhg.cu | cat -n
echo ""
echo "=== Checking if make_span is defined in copy_helpers.hpp ==="
rg -n "make_span" cpp/src/utilities/copy_helpers.hppRepository: NVIDIA/cuopt
Length of output: 1713
🏁 Script executed:
#!/bin/bash
# Check all includes in pdhg.cu to understand the full picture
echo "=== All includes in pdhg.cu ==="
grep -n "#include" cpp/src/linear_programming/pdhg.cu | head -30
echo ""
echo "=== Check for CUPDLP_DEBUG_MODE guards around copy_helpers include ==="
grep -B2 -A2 "copy_helpers" cpp/src/linear_programming/pdhg.cuRepository: NVIDIA/cuopt
Length of output: 1006
Ensure make_span is available in non-debug builds.
make_span(...) is used at lines 175-177 outside debug-only blocks, but utilities/copy_helpers.hpp (where it's defined) is only included when CUPDLP_DEBUG_MODE is enabled. In release builds, this causes a compilation error. Remove the debug guard on the include.
Proposed fix
-#ifdef CUPDLP_DEBUG_MODE
-#include <utilities/copy_helpers.hpp>
-#endif
+#include <utilities/copy_helpers.hpp>🤖 Prompt for AI Agents
In `@cpp/src/linear_programming/pdhg.cu` around lines 170 - 177, The include for
utilities/copy_helpers.hpp (which provides make_span) is currently guarded by
CUPDLP_DEBUG_MODE causing compilation failures in release; remove the debug-only
guard so make_span is available in all builds, ensuring calls like
make_span(...) passed into pdhg_swap_bounds_kernel (and any other kernel
launches) compile; update the include that provides make_span to be
unconditional (no CUPDLP_DEBUG_MODE `#if/`#ifdef) so symbols like make_span are
defined for pdhg_swap_bounds_kernel and related uses.
cub::DeviceFor::Bulk(climber_strategies_.size(),
                     refine_initial_primal_projection_bulk_op<i_t, f_t>{
                       make_span(new_bounds_idx_),
                       make_span(new_bounds_lower_),
                       make_span(new_bounds_upper_),
                       make_span(current_saddle_point_state_.get_primal_solution()),
                       problem_ptr->n_variables},
                     stream_view_.value());
🧩 Analysis chain
🏁 Script executed:
# First, let's check the actual code in pdhg.cu at the mentioned line ranges
fd "pdhg.cu" -t f
Repository: NVIDIA/cuopt
Length of output: 90
🌐 Web query:
cub::DeviceFor::Bulk return type cudaError_t CUB CUDA error checking
💡 Result:
cub::DeviceFor::Bulk(...) returns cudaError_t (just like other CUB device-scope algorithms). You should treat it as a CUDA runtime call: check the return immediately, and (if you need to catch asynchronous kernel failures) add a synchronize + last-error check. [1], [5]
What the return value means
- The returned
cudaError_ttypically reflects immediate/launch-time issues (bad args, invalid device state, etc.). [3] - Because GPU work is asynchronous, runtime kernel failures may only surface later unless you explicitly synchronize. [3]
Recommended error-checking pattern
`#include` <cub/device/device_for.cuh>
`#include` <cuda_runtime.h>
`#define` CUDA_TRY(call) do { \
cudaError_t _e = (call); \
if (_e != cudaSuccess) return _e; \
} while(0)
cudaError_t run(int n, MyOp op, cudaStream_t s)
{
// If using the temp-storage overload:
void* d_temp = nullptr;
size_t temp_bytes = 0;
CUDA_TRY(cub::DeviceFor::Bulk(d_temp, temp_bytes, n, op, s)); // size query [1]
CUDA_TRY(cudaMalloc(&d_temp, temp_bytes));
CUDA_TRY(cub::DeviceFor::Bulk(d_temp, temp_bytes, n, op, s)); // launch [1]
// Catch async failures (debug / when correctness matters):
CUDA_TRY(cudaGetLastError()); // last launch error reporting guidance [3]
CUDA_TRY(cudaDeviceSynchronize()); // forces async errors to surface [3]
cudaFree(d_temp);
return cudaSuccess;
}
Notes:
cub::DeviceFor::Bulkhas both overloads (with and without temp storage), and the temp-storage form uses the standard CUB “nullptrto querytemp_storage_bytes” convention. [1]- CUB device-scope algorithms “always return
cudaError_t” by design. [5]
🏁 Script executed:
# Check what CUB operations are already being used with error checking in the codebase
rg "RAFT_CUDA_TRY.*cub::" -A 2 -B 2
Repository: NVIDIA/cuopt
Length of output: 2013
🏁 Script executed:
# Look for existing patterns of DeviceFor or DeviceTransform usage
rg "DeviceFor::|DeviceTransform" -A 2 -B 2
Repository: NVIDIA/cuopt
Length of output: 50368
🏁 Script executed:
# Get line count to understand file size
wc -l cpp/src/linear_programming/pdhg.cu
Repository: NVIDIA/cuopt
Length of output: 95
🏁 Script executed:
# Extract the specific line ranges mentioned in the review
echo "=== Lines 774-781 ==="
sed -n '774,781p' cpp/src/linear_programming/pdhg.cu
echo "=== Lines 813-824 ==="
sed -n '813,824p' cpp/src/linear_programming/pdhg.cu
echo "=== Lines 873-884 ==="
sed -n '873,884p' cpp/src/linear_programming/pdhg.cu
echo "=== Lines 920-930 ==="
sed -n '920,930p' cpp/src/linear_programming/pdhg.cu
echo "=== Lines 981-990 ==="
sed -n '981,990p' cpp/src/linear_programming/pdhg.cu
Repository: NVIDIA/cuopt
Length of output: 3556
🏁 Script executed:
# Search for RAFT_CUDA_TRY usage with DeviceFor or DeviceTransform in pdhg.cu specifically
rg "RAFT_CUDA_TRY.*Device(For|Transform)" cpp/src/linear_programming/pdhg.cu
Repository: NVIDIA/cuopt
Length of output: 38
Add CUDA error checks for CUB bulk operations.
The cub::DeviceFor::Bulk calls at lines 774–781, 813–824, 873–884, 920–930, and 981–990 lack error checking despite returning cudaError_t. Wrap them with RAFT_CUDA_TRY to surface launch failures, consistent with error-handling patterns used for CUB operations elsewhere in the codebase.
🐛 Example fix (apply similarly to all five instances)
- cub::DeviceFor::Bulk(climber_strategies_.size(),
- refine_initial_primal_projection_bulk_op<i_t, f_t>{
- make_span(new_bounds_idx_),
- make_span(new_bounds_lower_),
- make_span(new_bounds_upper_),
- make_span(current_saddle_point_state_.get_primal_solution()),
- problem_ptr->n_variables},
- stream_view_.value());
+ RAFT_CUDA_TRY(cub::DeviceFor::Bulk(climber_strategies_.size(),
+ refine_initial_primal_projection_bulk_op<i_t, f_t>{
+ make_span(new_bounds_idx_),
+ make_span(new_bounds_lower_),
+ make_span(new_bounds_upper_),
+ make_span(current_saddle_point_state_.get_primal_solution()),
+ problem_ptr->n_variables},
+ stream_view_.value()));
Per coding guidelines: Every CUDA kernel launch and memory operation must have error checking with CUDA_CHECK or equivalent verification.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
- cub::DeviceFor::Bulk(climber_strategies_.size(),
-                      refine_initial_primal_projection_bulk_op<i_t, f_t>{
-                        make_span(new_bounds_idx_),
-                        make_span(new_bounds_lower_),
-                        make_span(new_bounds_upper_),
-                        make_span(current_saddle_point_state_.get_primal_solution()),
-                        problem_ptr->n_variables},
-                      stream_view_.value());
+ RAFT_CUDA_TRY(cub::DeviceFor::Bulk(climber_strategies_.size(),
+                                    refine_initial_primal_projection_bulk_op<i_t, f_t>{
+                                      make_span(new_bounds_idx_),
+                                      make_span(new_bounds_lower_),
+                                      make_span(new_bounds_upper_),
+                                      make_span(current_saddle_point_state_.get_primal_solution()),
+                                      problem_ptr->n_variables},
+                                    stream_view_.value()));
🤖 Prompt for AI Agents
In `@cpp/src/linear_programming/pdhg.cu` around lines 774 - 781, The CUB bulk
launches (cub::DeviceFor::Bulk calls) currently lack CUDA error checks; wrap
each call with RAFT_CUDA_TRY so failures are surfaced. Specifically, replace
each raw call like cub::DeviceFor::Bulk(climber_strategies_.size(),
refine_initial_primal_projection_bulk_op<i_t, f_t>{...}, stream_view_.value())
(and the similar calls using other bulk op functors at the other locations) with
RAFT_CUDA_TRY(cub::DeviceFor::Bulk(...)); keeping the same arguments (e.g.,
climber_strategies_.size(), the respective refine_*_bulk_op functor instances,
and stream_view_.value()) so the return cudaError_t is checked.
// Check if we don't hit the limit using max_batch_size
const size_t memory_estimate =
  batch_pdlp_memory_estimator(problem, max_batch_size, max_batch_size);
size_t free_mem, total_mem;
RAFT_CUDA_TRY(cudaMemGetInfo(&free_mem, &total_mem));

if (memory_estimate > free_mem) {
  use_optimal_batch_size = true;
  // Decrement batch size iteratively until we find a batch size that fits
  while (memory_max_batch_size > 1) {
    const size_t memory_estimate =
      batch_pdlp_memory_estimator(problem, memory_max_batch_size, max_batch_size);
    if (memory_estimate <= free_mem) { break; }
    memory_max_batch_size--;
  }
}
Handle the case where even batch_size=1 exceeds available memory.
The loop stops at 1 even if the estimate still exceeds free memory, which can lead to OOM later. Consider a hard failure with a clear error when no batch size fits.
🐛 Proposed fix
if (memory_estimate > free_mem) {
use_optimal_batch_size = true;
// Decrement batch size iteratively until we find a batch size that fits
while (memory_max_batch_size > 1) {
const size_t memory_estimate =
batch_pdlp_memory_estimator(problem, memory_max_batch_size, max_batch_size);
if (memory_estimate <= free_mem) { break; }
memory_max_batch_size--;
}
+ const size_t min_estimate =
+ batch_pdlp_memory_estimator(problem, memory_max_batch_size, max_batch_size);
+ cuopt_expects(min_estimate <= free_mem,
+ error_type_t::RuntimeError,
+ "Insufficient GPU memory for batch PDLP (even batch_size=1)");
}
🤖 Prompt for AI Agents
In `@cpp/src/linear_programming/solve.cu` around lines 728 - 743, The loop that
decrements memory_max_batch_size using batch_pdlp_memory_estimator can exit at 1
while the estimated memory still exceeds free_mem, risking OOM; after the while
loop (or when setting use_optimal_batch_size) recompute or recheck
batch_pdlp_memory_estimator(problem, memory_max_batch_size, max_batch_size)
against free_mem and if it still exceeds free_mem, fail fast with a clear
error/exception or return an error code (include context like
memory_max_batch_size, free_mem and the failing estimate) instead of proceeding;
update the logic around use_optimal_batch_size and memory_max_batch_size to
ensure callers see the failure.
} else {
  // TODO later batch mode: remove this once you want to do per climber restart
  cub::DeviceSegmentedReduce::Sum(
    dot_product_storage.data(),
    dot_product_bytes,
    thrust::make_transform_iterator(
      thrust::make_zip_iterator(tmp_primal.data(),
                                current_saddle_point_state.get_delta_primal().data()),
      tuple_multiplies<f_t>{}),
    interaction_.data(),
    climber_strategies_.size(),
    primal_size_,
    stream_view_);

  cub::DeviceSegmentedReduce::Sum(
    dot_product_storage.data(),
    dot_product_bytes,
    thrust::make_transform_iterator(current_saddle_point_state.get_delta_primal().data(),
                                    power_two_func_t<f_t>{}),
    norm_squared_delta_primal_.data(),
    climber_strategies_.size(),
    primal_size_,
    stream_view_);

  cub::DeviceSegmentedReduce::Sum(
    dot_product_storage.data(),
    dot_product_bytes,
    thrust::make_transform_iterator(current_saddle_point_state.get_delta_dual().data(),
                                    power_two_func_t<f_t>{}),
    norm_squared_delta_dual_.data(),
    climber_strategies_.size(),
    dual_size_,
    stream_view_);
}
Missing CUDA error checks after cub::DeviceSegmentedReduce::Sum calls.
The three segmented reduce calls (lines 485-495, 497-505, 507-515) don't have error checking after them. CUB functions can fail due to insufficient temporary storage or other runtime issues. Add error checking for robustness.
🔧 Add error checking
cub::DeviceSegmentedReduce::Sum(
dot_product_storage.data(),
dot_product_bytes,
thrust::make_transform_iterator(...),
interaction_.data(),
climber_strategies_.size(),
primal_size_,
stream_view_);
+ RAFT_CUDA_TRY(cudaPeekAtLastError());
As per coding guidelines: "Every CUDA kernel launch and memory operation must have error checking with CUDA_CHECK or equivalent verification."
🤖 Prompt for AI Agents
In `@cpp/src/linear_programming/step_size_strategy/adaptive_step_size_strategy.cu`
around lines 483 - 516, The three calls to cub::DeviceSegmentedReduce::Sum (the
calls that use tuple_multiplies<f_t> and power_two_func_t<f_t> and the
temporaries dot_product_storage/dot_product_bytes on stream_view_) are missing
CUDA error checking; capture the cudaError_t returned by each
DeviceSegmentedReduce::Sum call and pass it to the project's error-checking
macro (e.g., CUDA_CHECK) or explicitly check and log/fail on non-success, so
each segmented-reduce invocation is validated (include a short context message
identifying the call when reporting the error).
cub::DeviceTransform::Transform(
  in_zip,
  out_zip,
  total_items,
  [] HD(thrust::tuple<f_t, f_t> values) -> thrust::tuple<f_t, f_t> {
    return thrust::make_tuple(thrust::get<1>(values), thrust::get<0>(values));
  },
  matrix.stream());
Check CUB DeviceTransform return status.
cub::DeviceTransform::Transform returns cudaError_t; it should be wrapped in RAFT_CUDA_TRY (or equivalent) to surface launch failures.
🐛 Proposed fix
- cub::DeviceTransform::Transform(
+ RAFT_CUDA_TRY(cub::DeviceTransform::Transform(
in_zip,
out_zip,
total_items,
[] HD(thrust::tuple<f_t, f_t> values) -> thrust::tuple<f_t, f_t> {
return thrust::make_tuple(thrust::get<1>(values), thrust::get<0>(values));
},
- matrix.stream());
+ matrix.stream()));
What is the return type of cub::DeviceTransform::Transform and the recommended way to check errors?
🤖 Prompt for AI Agents
In `@cpp/src/linear_programming/swap_and_resize_helper.cuh` around lines 78 - 85,
cub::DeviceTransform::Transform returns a cudaError_t and its return value must
be checked; wrap the call to cub::DeviceTransform::Transform (the lambda swap
transform invocation) with the project's CUDA error macro (e.g. RAFT_CUDA_TRY or
CUDA_CHECK) to surface launch failures and propagate errors, i.e. call
RAFT_CUDA_TRY(cub::DeviceTransform::Transform(...)) or equivalent around the
existing call and keep passing matrix.stream() and the same
in_zip/out_zip/total_items and lambda as before.
cub::DeviceSegmentedReduce::Sum(
  nullptr,
  byte_needed,
  thrust::make_transform_iterator(dual_dot_.data(), power_two_func_t<f_t>{}),
  dual_dot_.data(),
  climber_strategies_.size(),
  dual_size,
  stream_view_);
dot_product_bytes_ = std::max(dot_product_bytes_, byte_needed);

cub::DeviceSegmentedReduce::Sum(
  nullptr,
  byte_needed,
  thrust::make_transform_iterator(dual_dot_.data(), power_two_func_t<f_t>{}),
  dual_dot_.data(),
  climber_strategies_.size(),
  primal_size,
  stream_view_);
dot_product_bytes_ = std::max(dot_product_bytes_, byte_needed);

cub::DeviceSegmentedReduce::Sum(
  nullptr,
  byte_needed,
  thrust::make_transform_iterator(thrust::make_zip_iterator(dual_dot_.data(), dual_dot_.data()),
                                  tuple_multiplies<f_t>{}),
  dual_dot_.data(),
  climber_strategies_.size(),
  primal_size,
  stream_view_);
dot_product_bytes_ = std::max(dot_product_bytes_, byte_needed);

cub::DeviceSegmentedReduce::Sum(nullptr,
                                byte_needed,
                                dual_dot_.data(),
                                dual_dot_.data(),
                                climber_strategies_.size(),
                                dual_size_h_,
                                stream_view_);
dot_product_bytes_ = std::max(dot_product_bytes_, byte_needed);

cub::DeviceSegmentedReduce::Sum(
  nullptr,
  dot_product_bytes_,
  thrust::make_transform_iterator(
    thrust::make_zip_iterator(dual_dot_.data(),
                              problem_wrap_container(problem_ptr->objective_coefficients)),
    tuple_multiplies<f_t>{}),
  primal_objective_.data(),
  climber_strategies_.size(),
  primal_size_h_,
  stream_view_);
dot_product_bytes_ = std::max(dot_product_bytes_, byte_needed);
Add CUDA error checks around CUB segmented reduction sizing calls.
The multiple cub::DeviceSegmentedReduce::Sum calls used to determine dot_product_bytes_ ignore the returned cudaError_t. While these are sizing queries (passing nullptr for temp storage), error checking is still recommended.
🐛 Suggested fix pattern
- cub::DeviceSegmentedReduce::Sum(
+ RAFT_CUDA_TRY(cub::DeviceSegmentedReduce::Sum(
nullptr,
byte_needed,
thrust::make_transform_iterator(dual_dot_.data(), power_two_func_t<f_t>{}),
dual_dot_.data(),
climber_strategies_.size(),
dual_size,
- stream_view_);
+ stream_view_));

Apply this pattern to all sizing calls in lines 137-188.
As per coding guidelines: "Every CUDA kernel launch and memory operation must have error checking with CUDA_CHECK or equivalent verification."
🤖 Prompt for AI Agents
In `@cpp/src/linear_programming/termination_strategy/convergence_information.cu`
around lines 137 - 188, The sizing calls to cub::DeviceSegmentedReduce::Sum (the
nullptr temp-storage queries using variables like byte_needed, dual_dot_,
power_two_func_t, tuple_multiplies, primal_objective_, stream_view_, and
dot_product_bytes_) are not checking the returned cudaError_t; assign each
call's return to a cudaError_t (e.g., cudaError_t err =
cub::DeviceSegmentedReduce::Sum(...)) and immediately validate it with the
project error macro (CUDA_CHECK(err) or equivalent) after every sizing
invocation to ensure failures are caught during the segmented-reduce sizing
phase.
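For context, the sizing-query-then-max pattern looks roughly like the sketch below. It assumes fixed-length segments and uses the standard offsets-based `DeviceSegmentedReduce::Sum` overload; `max_segmented_sum_bytes` is a hypothetical helper, and `CUDA_CHECK` is the stand-in macro from the earlier sketch, not RAFT_CUDA_TRY itself.

```cpp
#include <cub/cub.cuh>
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>

#include <algorithm>
#include <cstddef>

// Sizing-query pattern: each nullptr query reports the scratch bytes a later
// reduction will need; keeping the max lets one buffer serve all of them.
size_t max_segmented_sum_bytes(const float* d_in,
                               float* d_out,
                               int num_segments,
                               int seg_len,
                               cudaStream_t stream)
{
  // Segment boundaries [0, seg_len, 2*seg_len, ..., num_segments*seg_len].
  thrust::host_vector<int> h_offsets(num_segments + 1);
  for (int i = 0; i <= num_segments; ++i) h_offsets[i] = i * seg_len;
  thrust::device_vector<int> d_offsets = h_offsets;

  std::size_t max_bytes = 0;
  std::size_t bytes     = 0;
  CUDA_CHECK(cub::DeviceSegmentedReduce::Sum(nullptr, bytes, d_in, d_out,
                                             num_segments,
                                             d_offsets.data().get(),
                                             d_offsets.data().get() + 1,
                                             stream));
  max_bytes = std::max(max_bytes, bytes);
  // Additional sizing queries for other reductions would update max_bytes the
  // same way before a single allocation of max_bytes is made.
  return max_bytes;
}
```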
cub::DeviceSegmentedReduce::Sum(
  dot_product_storage_.data(),
  dot_product_bytes_,
  thrust::make_transform_iterator(primal_residual_.data(), power_two_func_t<f_t>{}),
  l2_primal_residual_.data(),
  climber_strategies_.size(),
  dual_size_h_,
  stream_view_);
cub::DeviceTransform::Transform(
  l2_primal_residual_.data(),
  l2_primal_residual_.data(),
  l2_primal_residual_.size(),
  [] HD(f_t x) { return raft::sqrt(x); },
  stream_view_);
}
Add CUDA error checks around CUB calls in batch mode residual computation.
The cub::DeviceSegmentedReduce::Sum and cub::DeviceTransform::Transform calls ignore returned error codes.
🐛 Suggested fix
} else {
- cub::DeviceSegmentedReduce::Sum(
+ RAFT_CUDA_TRY(cub::DeviceSegmentedReduce::Sum(
dot_product_storage_.data(),
dot_product_bytes_,
thrust::make_transform_iterator(primal_residual_.data(), power_two_func_t<f_t>{}),
l2_primal_residual_.data(),
climber_strategies_.size(),
dual_size_h_,
- stream_view_);
- cub::DeviceTransform::Transform(
+ stream_view_));
+ RAFT_CUDA_TRY(cub::DeviceTransform::Transform(
l2_primal_residual_.data(),
l2_primal_residual_.data(),
l2_primal_residual_.size(),
[] HD(f_t x) { return raft::sqrt(x); },
- stream_view_);
+ stream_view_));
}
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
RAFT_CUDA_TRY(cub::DeviceSegmentedReduce::Sum(
  dot_product_storage_.data(),
  dot_product_bytes_,
  thrust::make_transform_iterator(primal_residual_.data(), power_two_func_t<f_t>{}),
  l2_primal_residual_.data(),
  climber_strategies_.size(),
  dual_size_h_,
  stream_view_));
RAFT_CUDA_TRY(cub::DeviceTransform::Transform(
  l2_primal_residual_.data(),
  l2_primal_residual_.data(),
  l2_primal_residual_.size(),
  [] HD(f_t x) { return raft::sqrt(x); },
  stream_view_));
}
🤖 Prompt for AI Agents
In `@cpp/src/linear_programming/termination_strategy/convergence_information.cu`
around lines 371 - 385, The two CUB calls (cub::DeviceSegmentedReduce::Sum and
cub::DeviceTransform::Transform) in the batch residual path ignore their return
codes; capture their returned status and check it using the project's CUDA
error-checking macro (e.g., RAFT_CUDA_TRY or CUDA_TRY) or by testing the
returned cudaError_t and handling/logging on failure. Specifically, wrap the
call to cub::DeviceSegmentedReduce::Sum(...) (using dot_product_storage_,
dot_product_bytes_, thrust::make_transform_iterator(primal_residual_.data(),
power_two_func_t<f_t>{}), l2_primal_residual_.data(),
climber_strategies_.size(), dual_size_h_, stream_view_) and the call to
cub::DeviceTransform::Transform(...) (using l2_primal_residual_.data(),
l2_primal_residual_.data(), l2_primal_residual_.size(), the sqrt lambda,
stream_view_) with the error check so any CUDA/CUB error is propagated or logged
consistently.
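For illustration, an equivalent per-climber L2-norm computation can be sketched with Thrust: a segmented sum of squares followed by a square root, the same shape as the DeviceSegmentedReduce + DeviceTransform pair in the context above. Everything below (`per_climber_l2_norm`, `to_segment`, `square`, `sqrt_op`) is hypothetical, assumes equal-length segments, and is not the code under review.

```cpp
#include <thrust/device_vector.h>
#include <thrust/execution_policy.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/discard_iterator.h>
#include <thrust/iterator/transform_iterator.h>
#include <thrust/reduce.h>
#include <thrust/transform.h>

#include <cmath>

struct to_segment {  // map flat element index to its climber index
  int seg_len;
  __host__ __device__ int operator()(int i) const { return i / seg_len; }
};
struct square {
  __host__ __device__ float operator()(float x) const { return x * x; }
};
struct sqrt_op {
  __host__ __device__ float operator()(float x) const { return sqrtf(x); }
};

// norms must hold at least num_climbers entries.
void per_climber_l2_norm(const thrust::device_vector<float>& residual,
                         thrust::device_vector<float>& norms,
                         int num_climbers,
                         int seg_len)
{
  auto keys = thrust::make_transform_iterator(thrust::make_counting_iterator(0),
                                              to_segment{seg_len});
  auto vals = thrust::make_transform_iterator(residual.begin(), square{});
  // Segmented sum of squares: one partial sum per climber.
  thrust::reduce_by_key(thrust::device,
                        keys, keys + num_climbers * seg_len, vals,
                        thrust::make_discard_iterator(),
                        norms.begin());
  // Take the square root of each per-climber sum in place.
  thrust::transform(thrust::device, norms.begin(), norms.begin() + num_climbers,
                    norms.begin(), sqrt_op{});
}
```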
} else {
  cub::DeviceSegmentedReduce::Sum(
    dot_product_storage_.data(),
    dot_product_bytes_,
    thrust::make_transform_iterator(
      thrust::make_zip_iterator(dual_slack.data(), primal_solution.data()),
      tuple_multiplies<f_t>{}),
    dual_dot_.data(),
    climber_strategies_.size(),
    primal_size_h_,
    stream_view_);

  cub::DeviceSegmentedReduce::Sum(dot_product_storage_.data(),
                                  dot_product_bytes_,
                                  primal_slack_.data(),
                                  sum_primal_slack_.data(),
                                  climber_strategies_.size(),
                                  dual_size_h_,
                                  stream_view_);
}

const f_t sum = dual_dot_.value(stream_view_) + sum_primal_slack_.value(stream_view_);
dual_objective_.set_value_async(sum, stream_view_);
cub::DeviceTransform::Transform(
  cuda::std::make_tuple(dual_dot_.data(), sum_primal_slack_.data()),
  dual_objective_.data(),
  dual_objective_.size(),
  cuda::std::plus<>{},
  stream_view_);
}
Add CUDA error checks around CUB calls in dual objective computation.
The batch mode cub::DeviceSegmentedReduce::Sum and cub::DeviceTransform::Transform calls should have error checking.
🐛 Suggested fix
} else {
- cub::DeviceSegmentedReduce::Sum(
+ RAFT_CUDA_TRY(cub::DeviceSegmentedReduce::Sum(
dot_product_storage_.data(),
dot_product_bytes_,
thrust::make_transform_iterator(
thrust::make_zip_iterator(dual_slack.data(), primal_solution.data()),
tuple_multiplies<f_t>{}),
dual_dot_.data(),
climber_strategies_.size(),
primal_size_h_,
- stream_view_);
+ stream_view_));
- cub::DeviceSegmentedReduce::Sum(dot_product_storage_.data(),
+ RAFT_CUDA_TRY(cub::DeviceSegmentedReduce::Sum(dot_product_storage_.data(),
dot_product_bytes_,
primal_slack_.data(),
sum_primal_slack_.data(),
climber_strategies_.size(),
dual_size_h_,
- stream_view_);
+ stream_view_));
}

🤖 Prompt for AI Agents
In `@cpp/src/linear_programming/termination_strategy/convergence_information.cu`
around lines 766 - 793, Wrap each CUB call in the dual objective computation
(the two cub::DeviceSegmentedReduce::Sum calls and the
cub::DeviceTransform::Transform call) with CUDA error checking: capture the
returned cudaError_t from each call and call the project's CUDA error-handling
utility (or check for != cudaSuccess and log/return) using the same context
variables (dot_product_storage_, dot_product_bytes_,
thrust::make_transform_iterator(...), dual_dot_.data(), primal_slack_.data(),
sum_primal_slack_.data(), dual_objective_.data(), dual_objective_.size(),
stream_view_) so any failure is detected and reported immediately; ensure you do
this in the block that computes the dual objective so failures in segmented
reduce or transform are handled.
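For illustration, the device-side combination of the two per-climber partial sums is just an elementwise add. The Thrust sketch below mirrors the `cuda::std::plus` transform shown in the context, with no host round-trip per value; `combine_dual_objective` is a hypothetical name, not code from this PR.

```cpp
#include <thrust/device_vector.h>
#include <thrust/execution_policy.h>
#include <thrust/functional.h>
#include <thrust/transform.h>

// dual_objective[i] = dual_dot[i] + sum_primal_slack[i], computed on device.
void combine_dual_objective(const thrust::device_vector<double>& dual_dot,
                            const thrust::device_vector<double>& sum_primal_slack,
                            thrust::device_vector<double>& dual_objective,
                            cudaStream_t stream)
{
  thrust::transform(thrust::cuda::par.on(stream),
                    dual_dot.begin(), dual_dot.end(),
                    sum_primal_slack.begin(),
                    dual_objective.begin(),
                    thrust::plus<double>{});
}
```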
Closes #791
This PR allows PDLP to solve a batch of problems. Currently, the only supported difference between the problems in a batch is that a single variable bound may differ per climber.
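Conceptually (this is an illustrative sketch, not the cuOpt API), a batch with per-climber bound overrides is equivalent to solving K copies of the same LP whose bounds differ in a single entry:

```cpp
#include <vector>

// Hypothetical types for illustration only.
struct bound_override {
  int    variable;      // index of the variable whose bound differs
  double lower, upper;  // replacement bounds for this climber
};

// Expand a shared lower-bound vector into one vector per climber, each with
// that climber's single override applied; the batch solve is semantically
// equivalent to solving the LP once per resulting bound set.
std::vector<std::vector<double>> per_climber_lower_bounds(
  const std::vector<double>& base_lower,
  const std::vector<bound_override>& climbers)
{
  std::vector<std::vector<double>> out;
  out.reserve(climbers.size());
  for (const auto& c : climbers) {
    auto lb        = base_lower;  // start from the shared problem
    lb[c.variable] = c.lower;     // apply this climber's override
    out.push_back(std::move(lb));
  }
  return out;
}
```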
This PR also includes:
This PR is not ready to be merged:
Summary by CodeRabbit
Release Notes
New Features
Improvements