Conversation

@itsibitzi
Contributor

No description provided.

@itsibitzi itsibitzi marked this pull request as ready for review January 16, 2026 00:33
@itsibitzi itsibitzi requested a review from a team as a code owner January 16, 2026 00:33
Copilot AI review requested due to automatic review settings January 16, 2026 00:33
Contributor

Copilot AI left a comment

Pull request overview

This PR introduces HRIBLT (Hierarchical Rateless Bloom Lookup Tables), a novel set reconciliation algorithm that computes the symmetric difference between sets with data transfer proportional to the size of the difference rather than the overall set size. The implementation includes encoding and decoding sessions, comprehensive test coverage, and documentation.

Changes:

  • Added a new hriblt crate implementing a rateless set reconciliation algorithm using XOR-based hash functions
  • Includes core algorithm implementation with encoding/decoding sessions, error types, and public API traits
  • Added comprehensive test suite with performance benchmarking and documentation including README and usage guides

Reviewed changes

Copilot reviewed 11 out of 12 changed files in this pull request and generated 8 comments.

| File | Description |
| --- | --- |
| crates/hriblt/Cargo.toml | Package definition for the new hriblt crate |
| crates/hriblt/README.md | Usage documentation with example code |
| crates/hriblt/src/lib.rs | Core algorithm implementation with hash functions and utilities |
| crates/hriblt/src/error.rs | Error types for set reconciliation operations |
| crates/hriblt/src/encoding_session.rs | Encoding session for creating coded symbols from sets |
| crates/hriblt/src/decoding_session.rs | Decoding session for recovering differences from coded symbols |
| crates/hriblt/src/decoded_value.rs | Enum representing decoded additions or deletions |
| crates/hriblt/src/coded_symbol.rs | Coded symbol structure for invertible bloom filter |
| crates/hriblt/docs/sizing.md | Guide for sizing HRIBLT sessions |
| crates/hriblt/docs/hashing_functions.md | Documentation on hash function implementation |
| crates/hriblt/docs/assets/coded-symbol-multiplier.png | Chart showing coded symbol efficiency |
| Cargo.toml | Updated workspace member formatting |
Comments suppressed due to low confidence (2)

crates/hriblt/src/encoding_session.rs:64

  • Spelling error: "te vector" should be "the vector".
    /// Returns an error if the split is out of range or if the length of te vector

crates/hriblt/docs/hashing_functions.md:11

  • Grammatical error: "the hashes produces" should be "the hashes produced".
We provide a `DefaultHashFunctions` type which is a wrapper around the `DefaultHasher` type provided by the Rust standard library. Though the seed for this function is fixed, it should be noted that the hashes produces by this type are *not* guarenteed to be stable across different versions of the Rust standard library. As such, you should not use this type for any situation where clients might potentially be running on a binary built with an unspecified version of Rust.



When using HRIBLT in production systems it is important to consider the stability of your hash functions.

We provide a `DefaultHashFunctions` type which is a wrapper around the `DefaultHasher` type provided by the Rust standard library. Though the seed for this function is fixed, it should be noted that the hashes produces by this type are *not* guarenteed to be stable across different versions of the Rust standard library. As such, you should not use this type for any situation where clients might potentially be running on a binary built with an unspecified version of Rust.
Copilot AI Jan 16, 2026

Spelling error: "guarenteed" should be "guaranteed".

This issue also appears in the following locations of the same file:

  • line 11
Suggested change
We provide a `DefaultHashFunctions` type which is a wrapper around the `DefaultHasher` type provided by the Rust standard library. Though the seed for this function is fixed, it should be noted that the hashes produces by this type are *not* guarenteed to be stable across different versions of the Rust standard library. As such, you should not use this type for any situation where clients might potentially be running on a binary built with an unspecified version of Rust.
We provide a `DefaultHashFunctions` type which is a wrapper around the `DefaultHasher` type provided by the Rust standard library. Though the seed for this function is fixed, it should be noted that the hashes produces by this type are *not* guaranteed to be stable across different versions of the Rust standard library. As such, you should not use this type for any situation where clients might potentially be running on a binary built with an unspecified version of Rust.


/// Create a EncodingSession from a vector of coded symbols.
///
/// Panics if the split is out of range or if the length of te vector
Copilot AI Jan 16, 2026

Spelling error: "te vector" should be "the vector".

This issue also appears in the following locations of the same file:

  • line 64

if split < range.start || split > range.end {
    return Err(SetReconciliationError::SplitOutOfRange);
}
if coded_symbols.len() > range.len() {
Copilot AI Jan 16, 2026

Inconsistent validation logic: line 53 uses assert_eq!(coded_symbols.len(), range.len()) to ensure exact equality, but line 75 only checks if coded_symbols.len() > range.len(). The validation in try_from_coded_symbols should check for inequality (!=) rather than just greater-than (>) to match the behavior of from_coded_symbols and properly reject cases where the vector is too short.

Suggested change
if coded_symbols.len() > range.len() {
if coded_symbols.len() != range.len() {
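To make the fix concrete, here is a minimal, self-contained sketch of the stricter validation. The types are stand-ins, and `LengthMismatch` is an invented variant name for illustration, not necessarily the crate's real error variant:

```rust
use std::ops::Range;

#[derive(Debug, PartialEq)]
enum SetReconciliationError {
    SplitOutOfRange,
    LengthMismatch, // hypothetical variant name for illustration
}

// Toy stand-in for the fallible constructor: the point is that it
// should mirror `from_coded_symbols` by rejecting *any* length
// mismatch, not only vectors that are too long.
fn try_from_coded_symbols(
    coded_symbols: &[u64],
    range: Range<usize>,
    split: usize,
) -> Result<(), SetReconciliationError> {
    if split < range.start || split > range.end {
        return Err(SetReconciliationError::SplitOutOfRange);
    }
    if coded_symbols.len() != range.len() {
        return Err(SetReconciliationError::LengthMismatch);
    }
    Ok(())
}

fn main() {
    // Too short and too long are both rejected with `!=`.
    assert!(try_from_coded_symbols(&[0; 3], 0..4, 2).is_err());
    assert!(try_from_coded_symbols(&[0; 5], 0..4, 2).is_err());
    assert!(try_from_coded_symbols(&[0; 4], 0..4, 2).is_ok());
}
```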

Comment on lines +93 to +94
pub fn extend(&mut self, entities: impl Iterator<Item = T>) {
    for entity in entities {
Copilot AI Jan 16, 2026

The extend method should accept impl IntoIterator<Item = T> instead of impl Iterator<Item = T> to follow Rust's standard library conventions. This would allow callers to pass ranges, vectors, or other collections directly without needing to call .into_iter() first. The current signature forces unnecessary verbosity (e.g., requiring .iter().cloned() in line 225 of lib.rs).

Suggested change
pub fn extend(&mut self, entities: impl Iterator<Item = T>) {
    for entity in entities {
pub fn extend(&mut self, entities: impl IntoIterator<Item = T>) {
    for entity in entities.into_iter() {
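A standalone sketch (toy `Session` type, not the crate's actual API) of why the `IntoIterator` bound is the more ergonomic choice: callers can pass ranges, vectors, or iterators directly, with no explicit `.into_iter()`:

```rust
// Minimal toy type illustrating the suggested signature.
struct Session<T> {
    items: Vec<T>,
}

impl<T> Session<T> {
    fn new() -> Self {
        Session { items: Vec::new() }
    }

    // `impl IntoIterator` matches std conventions (cf. `Extend::extend`),
    // so any iterable works at the call site.
    fn extend(&mut self, entities: impl IntoIterator<Item = T>) {
        for entity in entities {
            self.items.push(entity);
        }
    }
}

fn main() {
    let mut s = Session::new();
    s.extend(0u64..3);      // a range works directly
    s.extend(vec![10, 11]); // so does a Vec, no `.into_iter()` needed
    assert_eq!(s.items, [0, 1, 2, 10, 11]);
}
```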


This library has a trait, `HashFunctions` which is used to create the hashes required to place your symbol into the range of coded symbols.

The following documentation provides more details on this trait in particular. How and why this is done is explained in the `overview.md` documentation.
Copilot AI Jan 16, 2026

Documentation reference to a non-existent file: The documentation mentions "How and why this is done is explained in the overview.md documentation" but no such file exists in the docs directory. This reference should either be removed or the file should be created.

Suggested change
The following documentation provides more details on this trait in particular. How and why this is done is explained in the `overview.md` documentation.
The following documentation provides more details on this trait in particular.

let mut bits = Stats::default();
let mut encoding_time = Stats::default();
let mut decoding_time = Stats::default();
let mut deocding_time_fast = Stats::default();
Copilot AI Jan 16, 2026

Spelling error: "deocding_time_fast" should be "decoding_time_fast" (typo in variable name).

/// Note: we want an odd number of hash functions, so that collapsing the stream to a single coded symbol
/// (or small number of coded symbols) won't erase the value information.
///
/// The second return value indicates whether the entry should be stored negated.
Copilot AI Jan 16, 2026

Incorrect or outdated documentation: The comment states "The second return value indicates whether the entry should be stored negated" but the function only returns a single usize value, not a tuple. This documentation should be removed or corrected.

Suggested change
/// The second return value indicates whether the entry should be stored negated.


Create two encoding sessions, one containing your data, and another containing the counter-party's data. This counterparty data might have been sent to you over a network, for example.

The following example attempts to reconcile the differences between two sets of `u64` integers, and is done from the perspective of "Bob", who has recieved some symbols from "Alice".
Copilot AI Jan 16, 2026

Spelling error: "recieved" should be "received".

Suggested change
The following example attempts to reconcile the differences between two sets of `u64` integers, and is done from the perspective of "Bob", who has recieved some symbols from "Alice".
The following example attempts to reconcile the differences between two sets of `u64` integers, and is done from the perspective of "Bob", who has received some symbols from "Alice".


## Coded Symbol Multiplier

The number of coded symbols required to find the difference between two sets is proportional to the difference between the two sets. The following chart shows the relationship between the number of coded symbols required to decode HRIBLT and the size of the diff. Note that the size of the base set (before diffs were added) was fixed.
Collaborator

> Note that the size of the base set (before diffs were added) was fixed.

I don't quite understand what that sentence means :)

When decoding, only the size of the diff matters, i.e. you don't necessarily need to build a base set or two sets that you actually diff. You just need to create a random diff set (i.e. with insertions and deletions).


`y = len(coded_symbols) / diff_size`

![Coded symbol multiplier](./assets/coded-symbol-multiplier.png)
Collaborator

We also need to visualize the standard deviation (or maybe percentiles like 90%-ile, 99%-ile). Max doesn't really make sense, since in theory it is infinite.

Those are almost more important than the average.

Collaborator

mmm. I couldn't find how you generated this file... We should/must merge that code.


/// Checks whether this coded symbol is pure, i.e., whether it represents a single entity
/// A pure coded symbol must satisfy the following conditions:
/// - The 1-bit counter must be 1 or -1 (which are both represented by the bit being set)
Collaborator

-> by the least significant bit being set

/// Checks whether this coded symbol is pure, i.e., whether it represents a single entity
/// A pure coded symbol must satisfy the following conditions:
/// - The 1-bit counter must be 1 or -1 (which are both represented by the bit being set)
/// - The checksum must match the checksum of the value.
Collaborator

must match the absolute value of the checksum of the value (since the sign tells you whether it is an insertion or deletion).

/// A pure coded symbol must satisfy the following conditions:
/// - The 1-bit counter must be 1 or -1 (which are both represented by the bit being set)
/// - The checksum must match the checksum of the value.
/// - The indices of the value must match the index of this coded symbol.
Collaborator

(exactly) one of the indices of the value must match the index of this coded symbol.
Note: in theory it would be possible to also accept coded symbols which got added an odd number of times to the same bucket.
But this only makes a difference when the total number of buckets is <= 32, which is not really a performance-critical case.
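Pulling the review refinements together, the pure-symbol test can be sketched as follows. This is a heavily simplified toy, not the crate's code: the field and helper names are invented, the real crate uses a 1-bit counter rather than a signed integer, and `contains` checks "at least one index" where the review suggests "exactly one":

```rust
// Toy coded symbol with invented field names.
struct CodedSymbol {
    count: i64,    // stand-in for the 1-bit counter (±1 when pure)
    checksum: i64, // signed: the sign distinguishes insertion/deletion
    value: u64,
}

// Stand-in checksum: a multiplicative hash, purely for illustration.
fn checksum_of(value: u64) -> i64 {
    (value.wrapping_mul(0x9E37_79B9_7F4A_7C15) >> 32) as i64
}

// Stand-in for the five index hash functions.
fn indices_of(value: u64, stream_len: usize) -> [usize; 5] {
    core::array::from_fn(|k| (value.rotate_left(13 * k as u32) as usize) % stream_len)
}

// The three conditions from the doc comment, amended per the review:
// counter is ±1, checksum matches *up to sign*, and one of the value's
// indices equals this symbol's own index.
fn is_pure(sym: &CodedSymbol, index: usize, stream_len: usize) -> bool {
    sym.count.abs() == 1
        && sym.checksum.abs() == checksum_of(sym.value).abs()
        && indices_of(sym.value, stream_len).contains(&index)
}

fn main() {
    let v = 42u64;
    let idxs = indices_of(v, 64);
    let pure = CodedSymbol { count: 1, checksum: checksum_of(v), value: v };
    assert!(is_pure(&pure, idxs[0], 64));
    // A counter of 2 (two entities collided) is never pure.
    let mixed = CodedSymbol { count: 2, checksum: checksum_of(v), value: v };
    assert!(!is_pure(&mixed, idxs[0], 64));
}
```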

if stream_len > 32 && i % 32 != 0 {
    let seed = i % 4;
    let j = index_for_seed(state, value, stream_len, seed as u32);
    if i == j { 1 } else { 0 }
Collaborator

Should this be written differently in Rust? (same couple lines below)
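One idiomatic answer to the question above, as a small sketch: Rust's bool-to-integer conversion expresses the match test without the branch expression.

```rust
// `u64::from(bool)` (or `usize::from`) is the idiomatic way to turn a
// comparison into 0/1, replacing `if i == j { 1 } else { 0 }`.
fn match_count(i: usize, j: usize) -> u64 {
    u64::from(i == j)
}

fn main() {
    assert_eq!(match_count(3, 3), 1);
    assert_eq!(match_count(3, 4), 0);
}
```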

pub struct DecodingSession<T: Encodable, H: HashFunctions<T>> {
/// The encoded stream of hashes.
/// All recovered coded symbols have been removed from this stream.
/// If decoded failed, then one can simply append more data and continue decoding.
Collaborator

@aneubeck aneubeck Jan 16, 2026

Suggested change
/// If decoded failed, then one can simply append more data and continue decoding.
/// If decoding fails, then one can simply append more data and continue decoding.


/// For statistical purposes: this number informs how many bits of the
/// checksum were required to identify pure coded symbols.
pub(crate) required_bits: usize,
Collaborator

Wondering whether we still want/need this...
Essentially, the smaller this number, the more robust the procedure.

/// The encoded stream of hashes.
/// All recovered coded symbols have been removed from this stream.
/// If decoded failed, then one can simply append more data and continue decoding.
encoded: EncodingSession<T, H>,
Collaborator

We need to add somewhere that all coded symbols must be computed with the same seed.


/// Create a EncodingSession from a vector of coded symbols.
///
/// Returns an error if the split is out of range or if the length of te vector
Collaborator

Suggested change
/// Returns an error if the split is out of range or if the length of te vector
/// Returns an error if the split is out of range or if the length of the vector

//! The core algorithm is based on the idea of set similarity sketching where pure hashes are identified
//! by testing whether one of the 5 hash functions would map the candidate value back to the index
//! of the value. Taking advantage of this necessary condition reduces the number of necessary checksum
//! bits by a bit.
Collaborator

"a bit"... i.e. one bit?
This is ambiguous ;)

//! This algorithm has similar properties to the "Practical Rateless Set Reconciliation" algorithm.
//! Main differences are:
//! * We only use 5 hash functions instead of log(n).
//! * As a result we only require 5 independent hash functions instead of log(n) many.
Collaborator

actually I don't recall whether they need log(n) !independent! hash functions...
It might be that independence is not required in their scheme?

//! and a stable math library must be used!)
//! * Encoding/decoding is faster due to the fixed number of hash functions and the simpler operations.
//! * Since we have a fixed number of hash functions, we can utilize the coded symbol index as
//! an additional condition. In fact, we need to compute just a single hash function (on average).
Collaborator

Suggested change
//! an additional condition. In fact, we need to compute just a single hash function (on average).
//! an additional condition. In fact, we need to compute just a single hash function (on average) for this check.

}
}

fn indices<T>(
Collaborator

should this function be moved into a separate file, since it isn't public?

Collaborator

maybe some index.rs or bucket_index.rs file?

/// (or small number of coded symbols) won't erase the value information.
///
/// The second return value indicates whether the entry should be stored negated.
fn index_for_seed<T>(
Collaborator

same for these...

/// - Fewer (ideally 3) hash functions with larger partitions lead to a higher chance to
/// find a pure value in the stream, i.e. the stream can be decoded with fewer coded symbols.
///
/// After testing various schemes, I settled for this one which uses 4 equally sized partitions,
Collaborator

I => ...

/// Note: we want an odd number of hash functions, so that collapsing the stream to a single coded symbol
/// (or small number of coded symbols) won't erase the value information.
///
/// The second return value indicates whether the entry should be stored negated.
Collaborator

remove this last sentence, since there isn't a second return value anymore ;)


Create two encoding sessions, one containing your data, and another containing the counter-party's data. This counterparty data might have been sent to you over a network, for example.

The following example attempts to reconcile the differences between two sets of `u64` integers, and is done from the perspective of "Bob", who has recieved some symbols from "Alice".
Collaborator

Suggested change
The following example attempts to reconcile the differences between two sets of `u64` integers, and is done from the perspective of "Bob", who has recieved some symbols from "Alice".
The following example attempts to reconcile the differences between such two sets of `u64` integers, and is done from the perspective of "Bob", who has received some symbols from "Alice".

hriblt = "0.1"
```

Create two encoding sessions, one containing your data, and another containing the counter-party's data. This counterparty data might have been sent to you over a network, for example.
Collaborator

replace your and counter-party with Alice and Bob from the start...


diff.sort();

assert_eq!(diff, [0, 1, 2, 3, 4, 11, 12, 13, 14, 15]);
Collaborator

You want to show here that it ALSO tells you whether a value was Inserted or Removed.
Maybe use the sign of the value for demonstration purposes?


assert_eq!(diff, [0, 1, 2, 3, 4, 11, 12, 13, 14, 15]);

```
Collaborator

I think we also want an example showing the hierarchical aspect of this.
I.e. both sides compute an encoding session of size 1024 and convert it into a hierarchy. Then, one side transfers 128 blocks at a time until decoding succeeds.

keywords = ["set-reconciliation", "sync", "algorithm", "probabilistic"]
categories = ["algorithms", "data-structures", "mathematics", "science"]

[dependencies]
Collaborator

we also want some benchmarks to show throughput


If the value you're inserting into the encoding session is a high entropy random value, such as a cryptographic hash digest, you can recycle the bytes in that value to produce the coded symbol indexing hashes, instead of hashing that value again. This results in a constant-factor speed up.

For example if you were trying to find the difference between two sets of documents, instead of each coded symbol being the whole document it could instead just be a SHA1 hash of the document content. Since each SHA1 digest has 20 bytes of high entropy bits, instead of hashing this value five times again to produce the five coded symbol indices we can simply slice out five `u32` values from the digest itself.
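A self-contained sketch of the byte-recycling idea described above. The function name and layout are illustrative assumptions, not the crate's API: a 20-byte SHA-1 digest is simply split into five little-endian `u32` words, each reduced to a coded-symbol index.

```rust
// Hedged sketch: recycle a high-entropy 20-byte digest into five
// index hashes instead of re-hashing the value five times.
fn indices_from_digest(digest: &[u8; 20], stream_len: usize) -> [usize; 5] {
    let mut out = [0usize; 5];
    // 20 bytes / 4 bytes per u32 = exactly five index words.
    for (k, chunk) in digest.chunks_exact(4).enumerate() {
        let word = u32::from_le_bytes(chunk.try_into().unwrap());
        out[k] = word as usize % stream_len;
    }
    out
}

fn main() {
    let digest = [7u8; 20]; // stand-in for a real SHA-1 digest
    let idx = indices_from_digest(&digest, 1024);
    // Every derived index falls inside the coded-symbol range.
    assert!(idx.iter().all(|&i| i < 1024));
}
```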
Collaborator

I think we should have a feature flag which provides the SHA1 functionality for you...

@@ -0,0 +1,400 @@
//! This module implements a set reconciliation algorithm using XOR-based hashes.
//!
//! The core algorithm is based on the idea of set similarity sketching where pure hashes are identified
Collaborator

I don't think we ever talk about set similarity sketching anywhere else.
Dropping this term here is therefore confusing IMO.
(I'm also not quite sure what the relation is except that both procedures use some kind of hashing and an array :) )
