Conversation

@itsibitzi
Contributor

No description provided.

@itsibitzi itsibitzi marked this pull request as ready for review January 16, 2026 00:33
@itsibitzi itsibitzi requested a review from a team as a code owner January 16, 2026 00:33
Copilot AI review requested due to automatic review settings January 16, 2026 00:33
Contributor

Copilot AI left a comment

Pull request overview

This PR introduces HRIBLT (Hierarchical Rateless Bloom Lookup Tables), a novel set reconciliation algorithm that computes the symmetric difference between sets with data transfer proportional to the size of the difference rather than the overall set size. The implementation includes encoding and decoding sessions, comprehensive test coverage, and documentation.

Changes:

  • Added a new hriblt crate implementing a rateless set reconciliation algorithm using XOR-based hash functions
  • Includes core algorithm implementation with encoding/decoding sessions, error types, and public API traits
  • Added comprehensive test suite with performance benchmarking and documentation including README and usage guides

Reviewed changes

Copilot reviewed 11 out of 12 changed files in this pull request and generated 8 comments.

| File | Description |
| --- | --- |
| crates/hriblt/Cargo.toml | Package definition for the new hriblt crate |
| crates/hriblt/README.md | Usage documentation with example code |
| crates/hriblt/src/lib.rs | Core algorithm implementation with hash functions and utilities |
| crates/hriblt/src/error.rs | Error types for set reconciliation operations |
| crates/hriblt/src/encoding_session.rs | Encoding session for creating coded symbols from sets |
| crates/hriblt/src/decoding_session.rs | Decoding session for recovering differences from coded symbols |
| crates/hriblt/src/decoded_value.rs | Enum representing decoded additions or deletions |
| crates/hriblt/src/coded_symbol.rs | Coded symbol structure for invertible bloom filter |
| crates/hriblt/docs/sizing.md | Guide for sizing HRIBLT sessions |
| crates/hriblt/docs/hashing_functions.md | Documentation on hash function implementation |
| crates/hriblt/docs/assets/coded-symbol-multiplier.png | Chart showing coded symbol efficiency |
| Cargo.toml | Updated workspace member formatting |
Comments suppressed due to low confidence (2)

crates/hriblt/src/encoding_session.rs:64

  • Spelling error: "te vector" should be "the vector".
    /// Returns an error if the split is out of range or if the length of te vector

crates/hriblt/docs/hashing_functions.md:11

  • Grammatical error: "the hashes produces" should be "the hashes produced".
We provide a `DefaultHashFunctions` type which is a wrapper around the `DefaultHasher` type provided by the Rust standard library. Though the seed for this function is fixed, it should be noted that the hashes produces by this type are *not* guarenteed to be stable across different versions of the Rust standard library. As such, you should not use this type for any situation where clients might potentially be running on a binary built with an unspecified version of Rust.



When using HRIBLT in production systems it is important to consider the stability of your hash functions.

We provide a `DefaultHashFunctions` type which is a wrapper around the `DefaultHasher` type provided by the Rust standard library. Though the seed for this function is fixed, it should be noted that the hashes produces by this type are *not* guarenteed to be stable across different versions of the Rust standard library. As such, you should not use this type for any situation where clients might potentially be running on a binary built with an unspecified version of Rust.
Copilot AI Jan 16, 2026

Spelling error: "guarenteed" should be "guaranteed".

This issue also appears in the following locations of the same file:

  • line 11
Suggested change
We provide a `DefaultHashFunctions` type which is a wrapper around the `DefaultHasher` type provided by the Rust standard library. Though the seed for this function is fixed, it should be noted that the hashes produces by this type are *not* guarenteed to be stable across different versions of the Rust standard library. As such, you should not use this type for any situation where clients might potentially be running on a binary built with an unspecified version of Rust.
We provide a `DefaultHashFunctions` type which is a wrapper around the `DefaultHasher` type provided by the Rust standard library. Though the seed for this function is fixed, it should be noted that the hashes produces by this type are *not* guaranteed to be stable across different versions of the Rust standard library. As such, you should not use this type for any situation where clients might potentially be running on a binary built with an unspecified version of Rust.


/// Create a EncodingSession from a vector of coded symbols.
///
/// Panics if the split is out of range or if the length of te vector
Copilot AI Jan 16, 2026

Spelling error: "te vector" should be "the vector".

This issue also appears in the following locations of the same file:

  • line 64

if split < range.start || split > range.end {
    return Err(SetReconciliationError::SplitOutOfRange);
}
if coded_symbols.len() > range.len() {
Copilot AI Jan 16, 2026

Inconsistent validation logic: line 53 uses assert_eq!(coded_symbols.len(), range.len()) to ensure exact equality, but line 75 only checks if coded_symbols.len() > range.len(). The validation in try_from_coded_symbols should check for inequality (!=) rather than just greater-than (>) to match the behavior of from_coded_symbols and properly reject cases where the vector is too short.

Suggested change
if coded_symbols.len() > range.len() {
if coded_symbols.len() != range.len() {
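To make the fix concrete, here is a minimal, self-contained sketch of the stricter validation. The types are stand-ins, and `LengthMismatch` is an invented variant name for illustration, not necessarily the crate's real error variant:

```rust
use std::ops::Range;

#[derive(Debug, PartialEq)]
enum SetReconciliationError {
    SplitOutOfRange,
    LengthMismatch, // hypothetical variant name for illustration
}

// Toy stand-in for the fallible constructor: the point is that it
// should mirror `from_coded_symbols` by rejecting *any* length
// mismatch, not only vectors that are too long.
fn try_from_coded_symbols(
    coded_symbols: &[u64],
    range: Range<usize>,
    split: usize,
) -> Result<(), SetReconciliationError> {
    if split < range.start || split > range.end {
        return Err(SetReconciliationError::SplitOutOfRange);
    }
    if coded_symbols.len() != range.len() {
        return Err(SetReconciliationError::LengthMismatch);
    }
    Ok(())
}

fn main() {
    // Too short and too long are both rejected with `!=`.
    assert!(try_from_coded_symbols(&[0; 3], 0..4, 2).is_err());
    assert!(try_from_coded_symbols(&[0; 5], 0..4, 2).is_err());
    assert!(try_from_coded_symbols(&[0; 4], 0..4, 2).is_ok());
}
```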

Comment on lines +93 to +94
pub fn extend(&mut self, entities: impl Iterator<Item = T>) {
    for entity in entities {
Copilot AI Jan 16, 2026

The extend method should accept impl IntoIterator<Item = T> instead of impl Iterator<Item = T> to follow Rust's standard library conventions. This would allow callers to pass ranges, vectors, or other collections directly without needing to call .into_iter() first. The current signature forces unnecessary verbosity (e.g., requiring .iter().cloned() in line 225 of lib.rs).

Suggested change
pub fn extend(&mut self, entities: impl Iterator<Item = T>) {
    for entity in entities {
pub fn extend(&mut self, entities: impl IntoIterator<Item = T>) {
    for entity in entities.into_iter() {
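A standalone sketch (toy `Session` type, not the crate's actual API) of why the `IntoIterator` bound is the more ergonomic choice: callers can pass ranges, vectors, or iterators directly, with no explicit `.into_iter()`:

```rust
// Minimal toy type illustrating the suggested signature.
struct Session<T> {
    items: Vec<T>,
}

impl<T> Session<T> {
    fn new() -> Self {
        Session { items: Vec::new() }
    }

    // `impl IntoIterator` matches std conventions (cf. `Extend::extend`),
    // so any iterable works at the call site.
    fn extend(&mut self, entities: impl IntoIterator<Item = T>) {
        for entity in entities {
            self.items.push(entity);
        }
    }
}

fn main() {
    let mut s = Session::new();
    s.extend(0u64..3);      // a range works directly
    s.extend(vec![10, 11]); // so does a Vec, no `.into_iter()` needed
    assert_eq!(s.items, [0, 1, 2, 10, 11]);
}
```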


This library has a trait, `HashFunctions` which is used to create the hashes required to place your symbol into the range of coded symbols.

The following documentation provides more details on this trait in particular. How and why this is done is explained in the `overview.md` documentation.
Copilot AI Jan 16, 2026

Documentation reference to a non-existent file: The documentation mentions "How and why this is done is explained in the overview.md documentation" but no such file exists in the docs directory. This reference should either be removed or the file should be created.

Suggested change
The following documentation provides more details on this trait in particular. How and why this is done is explained in the `overview.md` documentation.
The following documentation provides more details on this trait in particular.

let mut bits = Stats::default();
let mut encoding_time = Stats::default();
let mut decoding_time = Stats::default();
let mut deocding_time_fast = Stats::default();
Copilot AI Jan 16, 2026

Spelling error: "deocding_time_fast" should be "decoding_time_fast" (typo in variable name).

/// Note: we want an odd number of hash functions, so that collapsing the stream to a single coded symbol
/// (or small number of coded symbols) won't erase the value information.
///
/// The second return value indicates whether the entry should be stored negated.
Copilot AI Jan 16, 2026

Incorrect or outdated documentation: The comment states "The second return value indicates whether the entry should be stored negated" but the function only returns a single usize value, not a tuple. This documentation should be removed or corrected.

Suggested change
/// The second return value indicates whether the entry should be stored negated.


Create two encoding sessions, one containing your data, and another containing the counter-party's data. This counterparty data might have been sent to you over a network, for example.

The following example attempts to reconcile the differences between two sets of `u64` integers, and is done from the perspective of "Bob", who has recieved some symbols from "Alice".
Copilot AI Jan 16, 2026

Spelling error: "recieved" should be "received".

Suggested change
The following example attempts to reconcile the differences between two sets of `u64` integers, and is done from the perspective of "Bob", who has recieved some symbols from "Alice".
The following example attempts to reconcile the differences between two sets of `u64` integers, and is done from the perspective of "Bob", who has received some symbols from "Alice".


## Coded Symbol Multiplier

The number of coded symbols required to find the difference between two sets is proportional to the difference between the two sets. The following chart shows the relationship between the number of coded symbols required to decode HRIBLT and the size of the diff. Note that the size of the base set (before diffs were added) was fixed.
Collaborator

> Note that the size of the base set (before diffs were added) was fixed.

I don't quite understand what that sentence means :)

When decoding, only the size of the diff matters, i.e. you don't necessarily need to build a base set or two sets that you actually diff. You just need to create a random diff set (i.e. with insertions and deletions).


`y = len(coded_symbols) / diff_size`

![Coded symbol multiplier](./assets/coded-symbol-multiplier.png)
Collaborator

We also need to visualize the standard deviation (or maybe percentiles like 90%-ile, 99%-ile). Max doesn't really make sense, since in theory it is infinite.

Those are almost more important than the average.

Collaborator

mmm. I couldn't find how you generated this file... We should/must merge that code.


/// Checks whether this coded symbol is pure, i.e., whether it represents a single entity
/// A pure coded symbol must satisfy the following conditions:
/// - The 1-bit counter must be 1 or -1 (which are both represented by the bit being set)
Collaborator

-> by the least significant bit being set

/// Checks whether this coded symbol is pure, i.e., whether it represents a single entity
/// A pure coded symbol must satisfy the following conditions:
/// - The 1-bit counter must be 1 or -1 (which are both represented by the bit being set)
/// - The checksum must match the checksum of the value.
Collaborator

must match the absolute value of the checksum of the value (since the sign tells you whether it is an insertion or deletion).

/// A pure coded symbol must satisfy the following conditions:
/// - The 1-bit counter must be 1 or -1 (which are both represented by the bit being set)
/// - The checksum must match the checksum of the value.
/// - The indices of the value must match the index of this coded symbol.
Collaborator

(exactly) one of the indices of the value must match the index of this coded symbol.
Note: in theory it would be possible to also accept coded symbols which got added an odd number of times to the same bucket.
But this only makes a difference when the total number of buckets is <= 32, which is not really a performance-critical case.
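Pulling the review refinements together, the pure-symbol test can be sketched as follows. This is a heavily simplified toy, not the crate's code: the field and helper names are invented, the real crate uses a 1-bit counter rather than a signed integer, and `contains` checks "at least one index" where the review suggests "exactly one":

```rust
// Toy coded symbol with invented field names.
struct CodedSymbol {
    count: i64,    // stand-in for the 1-bit counter (±1 when pure)
    checksum: i64, // signed: the sign distinguishes insertion/deletion
    value: u64,
}

// Stand-in checksum: a multiplicative hash, purely for illustration.
fn checksum_of(value: u64) -> i64 {
    (value.wrapping_mul(0x9E37_79B9_7F4A_7C15) >> 32) as i64
}

// Stand-in for the five index hash functions.
fn indices_of(value: u64, stream_len: usize) -> [usize; 5] {
    core::array::from_fn(|k| (value.rotate_left(13 * k as u32) as usize) % stream_len)
}

// The three conditions from the doc comment, amended per the review:
// counter is ±1, checksum matches *up to sign*, and one of the value's
// indices equals this symbol's own index.
fn is_pure(sym: &CodedSymbol, index: usize, stream_len: usize) -> bool {
    sym.count.abs() == 1
        && sym.checksum.abs() == checksum_of(sym.value).abs()
        && indices_of(sym.value, stream_len).contains(&index)
}

fn main() {
    let v = 42u64;
    let idxs = indices_of(v, 64);
    let pure = CodedSymbol { count: 1, checksum: checksum_of(v), value: v };
    assert!(is_pure(&pure, idxs[0], 64));
    // A counter of 2 (two entities collided) is never pure.
    let mixed = CodedSymbol { count: 2, checksum: checksum_of(v), value: v };
    assert!(!is_pure(&mixed, idxs[0], 64));
}
```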

if stream_len > 32 && i % 32 != 0 {
    let seed = i % 4;
    let j = index_for_seed(state, value, stream_len, seed as u32);
    if i == j { 1 } else { 0 }
Collaborator

Should this be written differently in Rust? (same couple lines below)
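One idiomatic answer to the question above, as a small sketch: Rust's bool-to-integer conversion expresses the match test without the branch expression.

```rust
// `u64::from(bool)` (or `usize::from`) is the idiomatic way to turn a
// comparison into 0/1, replacing `if i == j { 1 } else { 0 }`.
fn match_count(i: usize, j: usize) -> u64 {
    u64::from(i == j)
}

fn main() {
    assert_eq!(match_count(3, 3), 1);
    assert_eq!(match_count(3, 4), 0);
}
```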

pub struct DecodingSession<T: Encodable, H: HashFunctions<T>> {
/// The encoded stream of hashes.
/// All recovered coded symbols have been removed from this stream.
/// If decoded failed, then one can simply append more data and continue decoding.
Collaborator

@aneubeck aneubeck Jan 16, 2026

Suggested change
/// If decoded failed, then one can simply append more data and continue decoding.
/// If decoding fails, then one can simply append more data and continue decoding.


/// For statistical purposes: this number informs how many bits of the
/// checksum were required to identify pure coded symbols.
pub(crate) required_bits: usize,
Collaborator

Wondering whether we still want/need this...
Essentially, the smaller this number, the more robust the procedure.

/// The encoded stream of hashes.
/// All recovered coded symbols have been removed from this stream.
/// If decoded failed, then one can simply append more data and continue decoding.
encoded: EncodingSession<T, H>,
Collaborator

We need to add somewhere that all coded symbols must be computed with the same seed.


/// Create a EncodingSession from a vector of coded symbols.
///
/// Returns an error if the split is out of range or if the length of te vector
Collaborator

Suggested change
/// Returns an error if the split is out of range or if the length of te vector
/// Returns an error if the split is out of range or if the length of the vector

//! The core algorithm is based on the idea of set similarity sketching where pure hashes are identified
//! by testing whether one of the 5 hash functions would map the candidate value back to the index
//! of the value. Taking advantage of this necessary condition reduces the number of necessary checksum
//! bits by a bit.
Collaborator

"a bit"... i.e. one bit?
This is ambiguous ;)

//! This algorithm has similar properties to the "Practical Rateless Set Reconciliation" algorithm.
//! Main differences are:
//! * We only use 5 hash functions instead of log(n).
//! * As a result we only require 5 independent hash functions instead of log(n) many.
Collaborator

actually I don't recall whether they need log(n) !independent! hash functions...
It might be that independence is not required in their scheme?

//! and a stable math library must be used!)
//! * Encoding/decoding is faster due to the fixed number of hash functions and the simpler operations.
//! * Since we have a fixed number of hash functions, we can utilize the coded symbol index as
//! an additional condition. In fact, we need to compute just a single hash function (on average).
Collaborator

Suggested change
//! an additional condition. In fact, we need to compute just a single hash function (on average).
//! an additional condition. In fact, we need to compute just a single hash function (on average) for this check.

}
}

fn indices<T>(
Collaborator

should this function be moved into a separate file, since it isn't public?

Collaborator

maybe some index.rs or bucket_index.rs file?

/// (or small number of coded symbols) won't erase the value information.
///
/// The second return value indicates whether the entry should be stored negated.
fn index_for_seed<T>(
Collaborator

same for these...

/// - Fewer (ideally 3) hash functions with larger partitions lead to a higher chance to
/// find a pure value in the stream, i.e. the stream can be decoded with fewer coded symbols.
///
/// After testing various schemes, I settled for this one which uses 4 equally sized partitions,
Collaborator

I => ...

/// Note: we want an odd number of hash functions, so that collapsing the stream to a single coded symbol
/// (or small number of coded symbols) won't erase the value information.
///
/// The second return value indicates whether the entry should be stored negated.
Collaborator

remove this last sentence, since there isn't a second return value anymore ;)


Create two encoding sessions, one containing your data, and another containing the counter-party's data. This counterparty data might have been sent to you over a network, for example.

The following example attempts to reconcile the differences between two sets of `u64` integers, and is done from the perspective of "Bob", who has recieved some symbols from "Alice".
Collaborator

Suggested change
The following example attempts to reconcile the differences between two sets of `u64` integers, and is done from the perspective of "Bob", who has recieved some symbols from "Alice".
The following example attempts to reconcile the differences between such two sets of `u64` integers, and is done from the perspective of "Bob", who has received some symbols from "Alice".

hriblt = "0.1"
```

Create two encoding sessions, one containing your data, and another containing the counter-party's data. This counterparty data might have been sent to you over a network, for example.
Collaborator

replace your and counter-party with Alice and Bob from the start...


diff.sort();

assert_eq!(diff, [0, 1, 2, 3, 4, 11, 12, 13, 14, 15]);
Collaborator

You want to show here that it ALSO tells you whether a value was Inserted or Removed.
Maybe use the sign of the value for demonstration purposes?


assert_eq!(diff, [0, 1, 2, 3, 4, 11, 12, 13, 14, 15]);

```
Collaborator

I think we also want an example showing the hierarchical aspect of this.
I.e. both sides compute an encoding session of size 1024 and convert it into a hierarchy. Then, one side transfers 128 blocks at a time until decoding succeeds.

keywords = ["set-reconciliation", "sync", "algorithm", "probabilistic"]
categories = ["algorithms", "data-structures", "mathematics", "science"]

[dependencies]
Collaborator

we also want some benchmarks to show throughput


If the value you're inserting into the encoding session is a high entropy random value, such as a cryptographic hash digest, you can recycle the bytes in that value to produce the coded symbol indexing hashes, instead of hashing that value again. This results in a constant-factor speed up.

For example if you were trying to find the difference between two sets of documents, instead of each coded symbol being the whole document it could instead just be a SHA1 hash of the document content. Since each SHA1 digest has 20 bytes of high entropy bits, instead of hashing this value five times again to produce the five coded symbol indices we can simply slice out five `u32` values from the digest itself.
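A self-contained sketch of the byte-recycling idea described above. The function name and layout are illustrative assumptions, not the crate's API: a 20-byte SHA-1 digest is simply split into five little-endian `u32` words, each reduced to a coded-symbol index.

```rust
// Hedged sketch: recycle a high-entropy 20-byte digest into five
// index hashes instead of re-hashing the value five times.
fn indices_from_digest(digest: &[u8; 20], stream_len: usize) -> [usize; 5] {
    let mut out = [0usize; 5];
    // 20 bytes / 4 bytes per u32 = exactly five index words.
    for (k, chunk) in digest.chunks_exact(4).enumerate() {
        let word = u32::from_le_bytes(chunk.try_into().unwrap());
        out[k] = word as usize % stream_len;
    }
    out
}

fn main() {
    let digest = [7u8; 20]; // stand-in for a real SHA-1 digest
    let idx = indices_from_digest(&digest, 1024);
    // Every derived index falls inside the coded-symbol range.
    assert!(idx.iter().all(|&i| i < 1024));
}
```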
Collaborator

I think we should have a feature flag which provides the SHA1 functionality for you...

@@ -0,0 +1,400 @@
//! This module implements a set reconciliation algorithm using XOR-based hashes.
//!
//! The core algorithm is based on the idea of set similarity sketching where pure hashes are identified
Collaborator

I don't think we ever talk about set similarity sketching anywhere else.
Dropping this term here is therefore confusing IMO.
(I'm also not quite sure what the relation is except that both procedures use some kind of hashing and an array :) )
