PYTHON-5683: Spike: Investigate using Rust for Extension Modules #2695

Draft
aclark4life wants to merge 1 commit into mongodb:PYTHON-5683 from aclark4life:PYTHON-5683

Conversation

@aclark4life
Contributor

@aclark4life aclark4life commented Feb 4, 2026

PYTHON-5683

Changes in this PR

  • Implement comprehensive Rust BSON encoder/decoder
  • Add Evergreen CI configuration and test scripts
  • Add GitHub Actions workflow for Rust testing
  • Add runtime selection via PYMONGO_USE_RUST environment variable
  • Add performance benchmarking suite
  • Update build system to support Rust extension
  • Add documentation for Rust extension usage and testing
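
A minimal sketch of what runtime selection through PYMONGO_USE_RUST could look like. The helper name and the set of accepted values are assumptions for illustration; the PR description does not say which values enable the Rust extension.

```python
import os

# Hypothetical set of values treated as "enabled"; the PR does not list them.
_TRUTHY = {"1", "true", "yes", "on"}


def use_rust_extension(environ=None):
    """Return True when PYMONGO_USE_RUST asks for the Rust extension."""
    env = os.environ if environ is None else environ
    return env.get("PYMONGO_USE_RUST", "").strip().lower() in _TRUTHY
```

Passing a mapping instead of reading os.environ directly keeps the check easy to exercise in tests.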

Test Plan

Rely on existing bson tests

Checklist

Checklist for Author

  • Did you update the changelog (if necessary)?
  • Is there test coverage?
  • Is any followup work tracked in a JIRA ticket? If so, add link(s).

Checklist for Reviewer

  • Does the title of the PR reference a JIRA Ticket?
  • Do you fully understand the implementation? (Would you be comfortable explaining how this code works to someone else?)
  • Is all relevant documentation (README or docstring) updated?

@aclark4life aclark4life changed the base branch from master to PYTHON-5683 February 4, 2026 21:01
@aclark4life aclark4life force-pushed the PYTHON-5683 branch 18 times, most recently from 493ee02 to e970962 February 5, 2026 03:09
- Implement comprehensive Rust BSON encoder/decoder
- Add Evergreen CI configuration and test scripts
- Add GitHub Actions workflow for Rust testing
- Add runtime selection via PYMONGO_USE_RUST environment variable
- Add performance benchmarking suite
- Update build system to support Rust extension
- Add documentation for Rust extension usage and testing

Fix Rust extension to respect document_class in CodecOptions

- Extract document_class from codec_options and call it to create document instances
- Fixes test_custom_class test
- Test pass rate: 70/88 (80%)

Fix Rust extension to respect tzinfo in CodecOptions for datetime decoding

- Extract tzinfo from codec_options and convert datetime to that timezone using astimezone()
- Matches C extension behavior
- Fixes test_local_datetime test
- Test pass rate: 71/88 (81%)

Add BSON validation to Rust extension

- Validate minimum size (5 bytes)
- Validate size field matches actual data length
- Validate document ends with null terminator (0x00)
- Validate no extra bytes after document
- Fixes test_basic_validation test
- Test pass rate: 72/88 (82%)
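
The four checks listed above can be sketched in Python as follows. This is an illustration, not the Rust code from the PR: the function name is hypothetical, and ValueError stands in for whatever exception the extension actually raises.

```python
import struct


def validate_bson_document(data: bytes) -> None:
    """Raise ValueError if data is not framed as exactly one BSON document."""
    # Smallest valid document: 4-byte length plus the trailing 0x00.
    if len(data) < 5:
        raise ValueError("BSON document must be at least 5 bytes")
    # The first four bytes hold the total size as a little-endian int32.
    (declared_size,) = struct.unpack_from("<i", data, 0)
    if declared_size < 5:
        raise ValueError("declared size is too small")
    if declared_size > len(data):
        raise ValueError("declared size exceeds the available data")
    if declared_size < len(data):
        raise ValueError("extra bytes after the document")
    if data[-1] != 0:
        raise ValueError("document is not null-terminated")
```

These checks only cover the outer framing; element-level validation still happens during decoding.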

Update README to reflect comprehensive Rust extension implementation

- Changed 'Current Limitations' to 'Current Status'
- Listed all implemented BSON types and features
- Updated test pass rate to 82% (72/88 tests)
- Clarified remaining work items
- Removed outdated limitation about missing BSON types

Fix unknown BSON type error message format to match C extension

- Parse BSON data to extract field name when unknown type is encountered
- Use last non-numeric field name before unknown type (handles nested structures)
- Matches C extension error format: 'Detected unknown BSON type b'\xNN' for fieldname 'foo''
- Fixes test_unknown_type test
- Test pass rate: 73/88 (83%)

Update GitHub Actions workflow to test Rust extension capabilities instead of limitations

- Changed 'Test Rust extension limitations' to 'Test Rust extension with complex types'
- Now tests that ObjectId, DateTime, and Decimal128 work correctly
- Removed outdated tests expecting TypeError for these types
- Reflects comprehensive BSON implementation in Rust extension

Add UUID representation validation to Rust extension

- Check uuid_representation from codec_options when encoding native UUID
- Raise ValueError if uuid_representation is UNSPECIFIED (0)
- Use appropriate Binary subtype based on uuid_representation value
- Matches C extension behavior and error message
- Fixes test_uuid and test_decode_all_defaults tests
- Test pass rate: 75/88 (85%)

Add buffer protocol support for decode in Rust extension

- Support memoryview, array.array, and mmap objects
- Try multiple methods: extract, __bytes__, tobytes(), read()
- Handles all buffer protocol objects that C extension supports
- Fixes test_decode_buffer_protocol test
- Test pass rate: 76/88 (86%)
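
For comparison, the same normalization can be sketched in pure Python with memoryview, which accepts any object that implements the buffer protocol. This is a different technique from the multi-method fallback described above, and the helper name is hypothetical.

```python
import array


def as_bytes(data) -> bytes:
    """Copy any buffer-protocol object (memoryview, array.array, mmap) to bytes."""
    if isinstance(data, bytes):
        return data
    # memoryview() raises TypeError for objects without buffer support.
    with memoryview(data) as view:
        return view.tobytes()


raw = as_bytes(array.array("B", [5, 0, 0, 0, 0]))
```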

Fix UUID representation values in Rust extension

- Correct UUID representation values: PYTHON_LEGACY=3, STANDARD=4, JAVA_LEGACY=5, CSHARP_LEGACY=6
- STANDARD now correctly uses Binary subtype 4 instead of 3
- UUID decoding now works correctly with all representations
- Fixes test_decode_all_kwarg test
- Test pass rate: 77/88 (88%)
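
The mapping from uuid_representation to Binary subtype described in these two UUID commits can be sketched like this. The constants mirror the values listed above; the function name and error text are illustrative, not copied from the extension.

```python
# Representation values as listed in the commit message above.
UNSPECIFIED = 0
PYTHON_LEGACY = 3
STANDARD = 4
JAVA_LEGACY = 5
CSHARP_LEGACY = 6

OLD_UUID_SUBTYPE = 3  # Binary subtype shared by the legacy representations
UUID_SUBTYPE = 4      # Binary subtype used by STANDARD


def binary_subtype_for(uuid_representation: int) -> int:
    """Return the Binary subtype to use when encoding a native UUID."""
    if uuid_representation == UNSPECIFIED:
        # Illustrative message; the real text should match the C extension.
        raise ValueError("cannot encode native UUID with UNSPECIFIED")
    if uuid_representation == STANDARD:
        return UUID_SUBTYPE
    if uuid_representation in (PYTHON_LEGACY, JAVA_LEGACY, CSHARP_LEGACY):
        return OLD_UUID_SUBTYPE
    raise ValueError(f"unknown uuid_representation: {uuid_representation}")
```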

Add DBPointer support to Rust extension

- DBPointer is a deprecated BSON type that decodes to DBRef
- Parse DbPointer Debug output to extract namespace and ObjectId
- Fixes test_dbpointer test
- Test pass rate: 78/88 (89%)

Add DatetimeMS support to Rust extension

- Check datetime_conversion from codec_options
- Return DatetimeMS objects when datetime_conversion=DATETIME_MS (value 3)
- Validate that DatetimeMS.__int__() returns integer, not float
- Fixes test_class_conversions test
- Test pass rate: 80/88 (91%)

Add datetime clamping support to Rust extension

- Implement DATETIME_CLAMP mode (value 2) to clamp out-of-range values
- Clamp to Python datetime range: -62135596800000 to 253402300799999 ms
- Raise OverflowError for out-of-range values in DATETIME_AUTO mode
- Fixes test_clamping, test_tz_clamping_local, test_tz_clamping_non_hashable, test_tz_clamping_utc tests
- Test pass rate: 84/88 (95%)
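
A small Python sketch of the clamping arithmetic, using the bounds quoted above. The function names are hypothetical, and the result is a naive datetime for simplicity.

```python
from datetime import datetime, timedelta

EPOCH_NAIVE = datetime(1970, 1, 1)
MIN_MILLIS = -62135596800000  # 0001-01-01 00:00:00
MAX_MILLIS = 253402300799999  # 9999-12-31 23:59:59.999


def clamp_millis(millis: int) -> int:
    """Clamp a BSON datetime value to the range Python's datetime can hold."""
    return max(MIN_MILLIS, min(MAX_MILLIS, millis))


def millis_to_datetime_clamped(millis: int) -> datetime:
    """Convert milliseconds since the epoch to a naive datetime, clamping first."""
    return EPOCH_NAIVE + timedelta(milliseconds=clamp_millis(millis))
```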

Add DATETIME_AUTO support to Rust extension

- DATETIME_AUTO (value 4) returns DatetimeMS for out-of-range values
- Default to DATETIME_AUTO when no datetime_conversion specified
- Fixes test_datetime_auto test
- Test pass rate: 85/88 (97%)

Add InvalidBSON error for extremely out-of-range datetime values

- Raise InvalidBSON with helpful error message for values beyond ±2^52
- Include suggestion to use DATETIME_AUTO mode
- Fixes test_millis_from_datetime_ms test
- Test pass rate: 86/88 (98%)

Fix timezone clamping with non-UTC timezones

- Track original millis value before clamping
- Handle OverflowError during astimezone() by checking if datetime is at min or max
- Return datetime.min or datetime.max with target tzinfo when overflow occurs
- Fixes test_tz_clamping_non_utc test
- Test pass rate: 87/88 (99%)
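
The overflow handling can be illustrated in Python: converting a value at the edge of the datetime range to a non-UTC timezone can push it out of range, in which case astimezone() raises OverflowError. The helper name below is hypothetical, and the year check is a simplification of "is the datetime at min or max".

```python
from datetime import datetime, timedelta, timezone


def to_timezone_or_clamp(dt_utc: datetime, tzinfo) -> datetime:
    """Convert an aware UTC datetime to tzinfo, clamping if the shift overflows."""
    try:
        return dt_utc.astimezone(tzinfo)
    except OverflowError:
        # Overflow only happens next to datetime.min or datetime.max, so the
        # year tells us which bound was crossed.
        if dt_utc.year == datetime.min.year:
            return datetime.min.replace(tzinfo=tzinfo)
        return datetime.max.replace(tzinfo=tzinfo)


east = timezone(timedelta(hours=1))
clamped = to_timezone_or_clamp(datetime.max.replace(tzinfo=timezone.utc), east)
```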

Implement unicode_decode_error_handler support

- When a UTF-8 error is detected and unicode_decode_error_handler is not 'strict', fall back to the Python implementation
- Python implementation correctly handles all error handlers (replace, backslashreplace, surrogateescape, ignore)
- Saves reference to Python _bson_to_dict implementation before it gets overridden by Rust extension
- Removed unused decode_bson_with_utf8_handler function
- Fixes test_unicode_decode_error_handler test
- Test pass rate: 88/88 (100%)
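
The fallback idea, reduced to a single string: try a strict decode first and only take the slower, handler-aware path when that fails. In this sketch bytes.decode stands in for both the Rust fast path and the Python _bson_to_dict fallback, and the function name is hypothetical.

```python
def decode_utf8(raw: bytes, handler: str = "strict") -> str:
    """Decode raw as UTF-8, applying handler only after a strict decode fails."""
    try:
        return raw.decode("utf-8")  # stand-in for the fast path
    except UnicodeDecodeError:
        if handler == "strict":
            raise
        # Stand-in for the fallback, which supports every error handler.
        return raw.decode("utf-8", errors=handler)
```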

Update README with 100% test pass rate

- All 88 tests now passing
- Complete codec_options support implemented
- Datetime clamping and unicode error handlers working
- Ready for performance benchmarking

Add performance benchmarking and analysis

- Created comprehensive benchmark suite comparing C vs Rust extensions
- Added profiling script to identify bottlenecks
- Added micro-benchmarks for specific document types
- Updated README.md with performance analysis and recommendations

Optimize Rust extension: fast-path for common types and efficient _id handling

- Added fast-path that checks int/str/float/bool/None FIRST before expensive module lookups
- Moved _type_marker check before UUID/datetime/regex checks
- Optimized _id field handling to avoid creating new document and copying all fields
- Simplified mapping item processing

Implemented datetime_to_millis() function in Rust that:

- Extracts datetime components (year, month, day, hour, minute, second, microsecond)
- Checks for timezone offset using utcoffset() method
- Uses Python's calendar.timegm() for accurate epoch calculation
- Adjusts for timezone offset
- Converts to milliseconds
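
The same steps expressed in Python, as a reference for what the Rust function is described as computing. This is a sketch and not the Rust source.

```python
import calendar
from datetime import datetime, timedelta, timezone


def datetime_to_millis(dt: datetime) -> int:
    """Convert a datetime to integer milliseconds since the Unix epoch."""
    offset = dt.utcoffset()
    if offset is not None:
        # Shift the wall-clock fields to UTC before computing the epoch value.
        dt = dt - offset
    # timegm() interprets the time tuple as UTC; microseconds are truncated.
    return calendar.timegm(dt.timetuple()) * 1000 + dt.microsecond // 1000


one_am_plus_one = datetime(1970, 1, 1, 1, 0, tzinfo=timezone(timedelta(hours=1)))
millis = datetime_to_millis(one_am_plus_one)  # 01:00 at UTC+1 is the epoch
```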

Add type object caching to avoid repeated module imports

- UUID class from uuid module
- datetime class from datetime module
- Pattern class from re module

Fixed mypy errors by changing type ignore comments from attr-defined to union-attr:

- bson/__init__.py: Fixed union-attr errors for _cbson and _rbson module attributes
- tools/fail_if_no_c.py: Removed unnecessary type ignore (C extension is built)
- tools/clean.py: Removed unnecessary type ignore (C extension is built)

Also fixed typing issues in performance test files:

- test/performance/benchmark_bson.py: Added type annotations for function signatures
- test/performance/micro_benchmark.py: Added explicit type annotations for dict literals

Fix test_default_exports by cleaning up spec variable

- The 'spec' variable used during module initialization was being left in the
  module namespace, causing test_default_exports.py to fail. Added 'del spec'
  to clean up the variable after use.

Fix shellcheck warning in bson/_rbson/build.sh

- Changed trap command to use single quotes instead of double quotes to ensure
  the referenced variable expands when the trap is executed (on EXIT) rather
  than when the trap is set.

Fix Windows path handling in bson/_rbson/build.sh

- Changed the Python script to receive paths as command-line arguments via
  sys.argv instead of embedding them in the script string. This ensures proper
  path handling on Windows where paths like 'C:\...' would be mangled when
  embedded in shell strings.
- Also use pathlib.Path consistently for cross-platform path handling.

Remove emoji from Python print to fix Windows encoding error

- Removed emoji characters (✅ ❌) from Python print statements to avoid
  UnicodeEncodeError on Windows where the default encoding (cp1252) doesn't
  support these characters.

Remove all emoji characters from Python print statements

- Removed emoji characters (✓, ✅, ❌, ✗) from all Python print statements
  in shell scripts and GitHub workflows to fix UnicodeEncodeError on Windows
  where the default encoding (cp1252) doesn't support these characters.

Fix cross-compatibility test to preserve extension modules

- When reloading the bson module to switch from C to Rust extension, we need
  to preserve the extension modules (_cbson and _rbson) in sys.modules.
  Otherwise, when bson is re-imported, it can't find the already-loaded
  extensions and falls back to C.

    - Save references to _cbson and _rbson before clearing sys.modules
    - Only clear bson modules that aren't the extensions
    - Restore the extension modules before re-importing bson
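
A generic sketch of that reload pattern. The helper name is hypothetical and the usage line uses a small standard-library package rather than bson, so it runs without the extensions installed.

```python
import importlib
import sys


def reload_package_preserving(package: str, preserved: tuple):
    """Re-import package from scratch while keeping selected submodules loaded."""
    keep = {name for name in preserved if name in sys.modules}
    for name in list(sys.modules):
        in_package = name == package or name.startswith(package + ".")
        # Preserved entries are skipped, so they never leave sys.modules.
        if in_package and name not in keep:
            del sys.modules[name]
    return importlib.import_module(package)


importlib.import_module("wsgiref.util")
fresh = reload_package_preserving("wsgiref", ("wsgiref.util",))
```

In the PR's setting, the call would pass "bson" and the two extension module names.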

Fix extension module reloading to reuse already-loaded modules

- Modified bson module initialization to check if extension modules (_cbson,
  _rbson) are already loaded in sys.modules before creating new instances.
  This allows the module to be reloaded with different settings (e.g.,
  PYMONGO_USE_RUST) without losing access to already-loaded extensions.

  - Check sys.modules for 'bson._cbson' and 'bson._rbson' before using
    importlib.util.module_from_spec()
  - Reuse existing module instances when available
  - Renamed 'spec' to '_spec' to avoid namespace pollution
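
The "check sys.modules first" step might look like the following. The helper is hypothetical and uses importlib.util.find_spec for lookup; how bson/__init__.py actually locates its extension specs is not shown in this log.

```python
import importlib.util
import sys


def load_or_reuse(name: str):
    """Return an already-loaded module if present, otherwise load it from its spec."""
    existing = sys.modules.get(name)
    if existing is not None:
        return existing
    _spec = importlib.util.find_spec(name)
    if _spec is None or _spec.loader is None:
        return None  # module is not available
    module = importlib.util.module_from_spec(_spec)
    # Register before executing so later lookups reuse this instance.
    sys.modules[name] = module
    _spec.loader.exec_module(module)
    return module


colorsys_mod = load_or_reuse("colorsys")
```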

Fix benchmark script to preserve extension modules when reloading

- Modified benchmark_bson.py to preserve extension modules (_cbson, _rbson)
  when reloading the bson module to switch implementations. This is the same
  fix applied to the cross-compatibility test.

  - Save references to _cbson and _rbson before clearing sys.modules
  - Only clear bson modules that aren't the extensions
  - Restore extension modules before re-importing bson

Optimize Rust BSON encoding with PyDict and PyList fast paths

- Added fast-path optimizations for the most common Python types to reduce
  overhead from Python API calls:

  - PyDict fast path in python_mapping_to_bson_doc():

     - Iterate directly over PyDict items instead of calling items() method
     - Avoids creating intermediate list of tuples
     - Pre-allocate vector with known capacity
     - Added extract_dict_item() helper for dict-specific extraction

  - PyList and PyTuple fast paths in handle_remaining_python_types():

     - Check for PyList/PyTuple before generic sequence extraction
     - Use direct iteration with pre-allocated capacity
     - Avoids expensive extract::<Vec<Bound<'_, PyAny>>>() call

  - Also fixed micro_benchmark.py to preserve extension modules when reloading

Added profile_nested.py to identify specific performance bottlenecks in:

- Nested dictionaries (3 and 5 levels deep)
- Wide dictionaries (10 keys)
- Lists of dictionaries
- Lists of integers

Implement direct BSON byte writing for major performance improvement

- Major architectural change: Instead of building intermediate bson::Document
  structures and then serializing them, we now write BSON bytes directly.

  - Added write_document_bytes() - writes BSON documents directly to bytes
  - Added write_element() - writes individual BSON elements with type-specific encoding
  - Added write_array_bytes() and write_tuple_bytes() - direct array encoding
  - Added helper functions: write_cstring(), write_string(), write_bson_value()
  - Modified _dict_to_bson() to use the new direct byte writing approach
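
To illustrate the direct-writing approach, here is a compact Python sketch that appends BSON bytes to a single buffer and back-patches each document length. The function names echo the ones listed above, but the bodies are illustrative, cover only a handful of types, and are not the Rust implementation.

```python
import struct


def write_cstring(buf: bytearray, s: str) -> None:
    buf.extend(s.encode("utf-8") + b"\x00")


def write_string(buf: bytearray, s: str) -> None:
    raw = s.encode("utf-8")
    # The length prefix counts the trailing null byte.
    buf.extend(struct.pack("<i", len(raw) + 1) + raw + b"\x00")


def write_element(buf: bytearray, key: str, value) -> None:
    def header(type_byte: int) -> None:
        buf.append(type_byte)
        write_cstring(buf, key)

    if value is None:
        header(0x0A)
    elif isinstance(value, bool):  # must precede int: bool subclasses int
        header(0x08)
        buf.append(1 if value else 0)
    elif isinstance(value, int):
        if -(2**31) <= value < 2**31:
            header(0x10)
            buf.extend(struct.pack("<i", value))
        else:
            header(0x12)
            buf.extend(struct.pack("<q", value))
    elif isinstance(value, float):
        header(0x01)
        buf.extend(struct.pack("<d", value))
    elif isinstance(value, str):
        header(0x02)
        write_string(buf, value)
    elif isinstance(value, dict):
        header(0x03)
        write_document_bytes(buf, value)
    elif isinstance(value, (list, tuple)):
        header(0x04)
        # Arrays are documents keyed by "0", "1", ...
        write_document_bytes(buf, {str(i): v for i, v in enumerate(value)})
    else:
        raise TypeError(f"unsupported type in this sketch: {type(value)!r}")


def write_document_bytes(buf: bytearray, doc: dict) -> None:
    start = len(buf)
    buf.extend(b"\x00\x00\x00\x00")  # placeholder for the length
    for key, value in doc.items():
        write_element(buf, key, value)
    buf.append(0)
    struct.pack_into("<i", buf, start, len(buf) - start)


out = bytearray()
write_document_bytes(out, {"a": 1})
```

Writing into one growing buffer avoids building an intermediate document tree, which is the motivation stated in the commit.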

Implement direct BSON byte reading for improved decode performance

- Added direct BSON-to-Python decoding that reads bytes directly without
  the intermediate Document structure.

  - Added read_document_from_bytes() - reads BSON documents directly from bytes
  - Added read_bson_value() - reads individual BSON values
  - Added read_array_from_bytes() - reads BSON arrays
  - Modified _bson_to_dict() to use the new direct byte reading approach
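
A matching Python sketch for the read side. It handles only a few simple types and raises a hypothetical UnsupportedBSONType for everything else; it also skips bounds checking, so it assumes well-formed input.

```python
import struct


class UnsupportedBSONType(Exception):
    """Hypothetical error for types this sketch does not decode."""


def read_cstring(data: bytes, pos: int):
    end = data.index(b"\x00", pos)
    return data[pos:end].decode("utf-8"), end + 1


def read_bson_value(data: bytes, pos: int, type_byte: int):
    if type_byte == 0x10:
        return struct.unpack_from("<i", data, pos)[0], pos + 4
    if type_byte == 0x12:
        return struct.unpack_from("<q", data, pos)[0], pos + 8
    if type_byte == 0x01:
        return struct.unpack_from("<d", data, pos)[0], pos + 8
    if type_byte == 0x02:
        (n,) = struct.unpack_from("<i", data, pos)  # n includes the trailing null
        return data[pos + 4 : pos + 4 + n - 1].decode("utf-8"), pos + 4 + n
    if type_byte == 0x08:
        return data[pos] != 0, pos + 1
    if type_byte == 0x0A:
        return None, pos
    if type_byte == 0x03:
        return read_document_from_bytes(data, pos)
    if type_byte == 0x04:
        doc, new_pos = read_document_from_bytes(data, pos)
        return list(doc.values()), new_pos
    raise UnsupportedBSONType(hex(type_byte))


def read_document_from_bytes(data: bytes, pos: int = 0):
    """Return (dict, position after the document)."""
    (size,) = struct.unpack_from("<i", data, pos)
    end = pos + size - 1  # index of the trailing null byte
    pos += 4
    out = {}
    while pos < end:
        type_byte = data[pos]
        key, pos = read_cstring(data, pos + 1)
        out[key], pos = read_bson_value(data, pos, type_byte)
    return out, end + 1


decoded, _ = read_document_from_bytes(b"\x0c\x00\x00\x00\x10a\x00\x01\x00\x00\x00\x00")
```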

Fix mypy type errors in profile_nested.py

- Added 'Any' type annotation to the 'doc' variable to handle different
  document types being assigned to the same variable. This fixes the
  dict-item type incompatibility errors.

Fix BSON decode fallback for unsupported types

- Fixed the direct BSON decoder to properly fall back to the Document-based
  approach when encountering unsupported BSON types (ObjectId, Binary, DateTime,
  Regex, etc.).

  - Modified read_bson_value() to return an error for unsupported types instead
    of trying to parse them incorrectly
  - Updated _bson_to_dict() to catch 'Unsupported BSON type' errors and fall
    back to Document::from_reader() for the entire document
  - This ensures correctness for all BSON types while maintaining performance
    for common types (int, string, bool, null, dict, list)

Fix mypy type error in profile_decode.py

- Added type annotation 'dict[str, dict[str, Any]]' to the 'docs' variable
  to handle different document structures with varying value types.

Add Rust comparison tests to perf_test.py and async_perf_test.py

- Added new test classes that compare C vs Rust BSON implementations:

  - RustSimpleIntEncodingTest/DecodingTest - Simple integer documents
  - RustMixedTypesEncodingTest - Documents with mixed types
  - RustNestedEncodingTest - Nested documents
  - RustListEncodingTest - Documents with lists

Fix TestLongLongToString test when C extension is not available

- The test was failing with AttributeError when _cbson was None (e.g.,
  when using the Rust extension). Added a check to skip the test if
  _cbson is None or not imported, since _test_long_long_to_str() is a
  C-specific test function.

Update CI workflow to use new integrated performance tests

- Updated .github/workflows/test-rust.yml to use the new integrated
  performance tests in perf_test.py instead of the removed
  benchmark_bson.py file.

  - TestRustSimpleIntEncodingC/Rust
  - TestRustMixedTypesEncodingC/Rust

Fix division by zero in performance test tearDown

- Added protection against division by zero when calculating
  megabytes_per_sec in the tearDown method. This can occur on Windows
  or when operations are extremely fast and the median time rounds to 0.
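
A minimal sketch of such a guard. The function name and the choice to report infinity for a zero median are assumptions; the commit message does not say what value the test reports in that case.

```python
def megabytes_per_sec(data_size_bytes: int, median_seconds: float) -> float:
    """Compute throughput in MB/s, guarding against a zero median time."""
    if median_seconds <= 0:
        # Assumption: treat an unmeasurably fast run as infinite throughput.
        return float("inf")
    return data_size_bytes / median_seconds / 1_000_000
```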