PYTHON-5683: Spike: Investigate using Rust for Extension Modules by aclark4life · Pull Request #2695 · mongodb/mongo-python-driver

aclark4life · 2026-02-04T20:58:04Z

PYTHON-5683

Changes in this PR

Implement comprehensive Rust BSON encoder/decoder
Add Evergreen CI configuration and test scripts
Add GitHub Actions workflow for Rust testing
Add runtime selection via PYMONGO_USE_RUST environment variable
Add performance benchmarking suite
Update build system to support Rust extension
Add documentation for Rust extension usage and testing
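
A minimal sketch of how runtime selection through `PYMONGO_USE_RUST` could be wired up. The helper name `select_bson_backend`, the set of accepted values, and the fallback order (Rust, then C, then pure Python) are assumptions for illustration and are not taken from this PR.

```python
import os
from typing import Mapping, Optional


def select_bson_backend(
    rust_available: bool,
    c_available: bool,
    environ: Optional[Mapping[str, str]] = None,
) -> str:
    """Return "rust", "c", or "python" (hypothetical helper, not the PR's API)."""
    env = os.environ if environ is None else environ
    # Assumption: these values count as opting in to the Rust extension.
    flag = env.get("PYMONGO_USE_RUST", "").strip().lower()
    wants_rust = flag in {"1", "true", "yes"}
    if wants_rust and rust_available:
        return "rust"
    # Assumption: without the opt-in, the C extension stays the default.
    if c_available:
        return "c"
    return "python"
```

Passing the environment as a parameter keeps the decision testable without mutating `os.environ`.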

Test Plan

Rely on existing bson tests

Checklist

Checklist for Author

Did you update the changelog (if necessary)?
Is there test coverage?
Is any followup work tracked in a JIRA ticket? If so, add link(s).

Checklist for Reviewer

Does the title of the PR reference a JIRA Ticket?
Do you fully understand the implementation? (Would you be comfortable explaining how this code works to someone else?)
Is all relevant documentation (README or docstring) updated?

.github/workflows/test-rust.yml

- Implement comprehensive Rust BSON encoder/decoder
- Add Evergreen CI configuration and test scripts
- Add GitHub Actions workflow for Rust testing
- Add runtime selection via PYMONGO_USE_RUST environment variable
- Add performance benchmarking suite
- Update build system to support Rust extension
- Add documentation for Rust extension usage and testing

Fix Rust extension to respect document_class in CodecOptions
- Extract document_class from codec_options and call it to create document instances
- Fixes test_custom_class test
- Test pass rate: 70/88 (80%)

Fix Rust extension to respect tzinfo in CodecOptions for datetime decoding
- Extract tzinfo from codec_options and convert datetime to that timezone using astimezone()
- Matches C extension behavior
- Fixes test_local_datetime test
- Test pass rate: 71/88 (81%)

Add BSON validation to Rust extension
- Validate minimum size (5 bytes)
- Validate size field matches actual data length
- Validate document ends with null terminator (0x00)
- Validate no extra bytes after document
- Fixes test_basic_validation test
- Test pass rate: 72/88 (82%)

Update README to reflect comprehensive Rust extension implementation
- Changed 'Current Limitations' to 'Current Status'
- Listed all implemented BSON types and features
- Updated test pass rate to 82% (72/88 tests)
- Clarified remaining work items
- Removed outdated limitation about missing BSON types

Fix unknown BSON type error message format to match C extension
- Parse BSON data to extract field name when unknown type is encountered
- Use last non-numeric field name before unknown type (handles nested structures)
- Matches C extension error format: 'Detected unknown BSON type b'\xNN' for fieldname 'foo''
- Fixes test_unknown_type test
- Test pass rate: 73/88 (83%)

Update GitHub Actions workflow to test Rust extension capabilities instead of limitations
- Changed 'Test Rust extension limitations' to 'Test Rust extension with complex types'
- Now tests that ObjectId, DateTime, and Decimal128 work correctly
- Removed outdated tests expecting TypeError for these types
- Reflects comprehensive BSON implementation in Rust extension

Add UUID representation validation to Rust extension
- Check uuid_representation from codec_options when encoding native UUID
- Raise ValueError if uuid_representation is UNSPECIFIED (0)
- Use appropriate Binary subtype based on uuid_representation value
- Matches C extension behavior and error message
- Fixes test_uuid and test_decode_all_defaults tests
- Test pass rate: 75/88 (85%)

Add buffer protocol support for decode in Rust extension
- Support memoryview, array.array, and mmap objects
- Try multiple methods: extract, __bytes__, tobytes(), read()
- Handles all buffer protocol objects that C extension supports
- Fixes test_decode_buffer_protocol test
- Test pass rate: 76/88 (86%)

Fix UUID representation values in Rust extension
- Correct UUID representation values: PYTHON_LEGACY=3, STANDARD=4, JAVA_LEGACY=5, CSHARP_LEGACY=6
- STANDARD now correctly uses Binary subtype 4 instead of 3
- UUID decoding now works correctly with all representations
- Fixes test_decode_all_kwarg test
- Test pass rate: 77/88 (88%)

Add DBPointer support to Rust extension
- DBPointer is deprecated BSON type that decodes to DBRef
- Parse DbPointer Debug output to extract namespace and ObjectId
- Fixes test_dbpointer test
- Test pass rate: 78/88 (89%)

Add DatetimeMS support to Rust extension
- Check datetime_conversion from codec_options
- Return DatetimeMS objects when datetime_conversion=DATETIME_MS (value 3)
- Validate that DatetimeMS.__int__() returns integer, not float
- Fixes test_class_conversions test
- Test pass rate: 80/88 (91%)

Add datetime clamping support to Rust extension
- Implement DATETIME_CLAMP mode (value 2) to clamp out-of-range values
- Clamp to Python datetime range: -62135596800000 to 253402300799999 ms
- Raise OverflowError for out-of-range values in DATETIME_AUTO mode
- Fixes test_clamping, test_tz_clamping_local, test_tz_clamping_non_hashable, test_tz_clamping_utc tests
- Test pass rate: 84/88 (95%)

Add DATETIME_AUTO support to Rust extension
- DATETIME_AUTO (value 4) returns DatetimeMS for out-of-range values
- Default to DATETIME_AUTO when no datetime_conversion specified
- Fixes test_datetime_auto test
- Test pass rate: 85/88 (97%)

Add InvalidBSON error for extremely out-of-range datetime values
- Raise InvalidBSON with helpful error message for values beyond ±2^52
- Include suggestion to use DATETIME_AUTO mode
- Fixes test_millis_from_datetime_ms test
- Test pass rate: 86/88 (98%)

Fix timezone clamping with non-UTC timezones
- Track original millis value before clamping
- Handle OverflowError during astimezone() by checking if datetime is at min or max
- Return datetime.min or datetime.max with target tzinfo when overflow occurs
- Fixes test_tz_clamping_non_utc test
- Test pass rate: 87/88 (99%)

Implement unicode_decode_error_handler support
- When UTF-8 error is detected and unicode_decode_error_handler is not 'strict', fall back to Python implementation
- Python implementation correctly handles all error handlers (replace, backslashreplace, surrogateescape, ignore)
- Saves reference to Python _bson_to_dict implementation before it gets overridden by Rust extension
- Removed unused decode_bson_with_utf8_handler function
- Fixes test_unicode_decode_error_handler test
- Test pass rate: 88/88 (100%)

Update README with 100% test pass rate
- All 88 tests now passing
- Complete codec_options support implemented
- Datetime clamping and unicode error handlers working
- Ready for performance benchmarking

Add performance benchmarking and analysis
- Created comprehensive benchmark suite comparing C vs Rust extensions
- Added profiling script to identify bottlenecks
- Added micro-benchmarks for specific document types
- Updated README.md with performance analysis and recommendations

Optimize Rust extension: fast-path for common types and efficient _id handling
- Added fast-path that checks int/str/float/bool/None FIRST before expensive module lookups
- Moved _type_marker check before UUID/datetime/regex checks
- Optimized _id field handling to avoid creating new document and copying all fields
- Simplified mapping item processing

Implemented datetime_to_millis() function in Rust that:
- Extracts datetime components (year, month, day, hour, minute, second, microsecond)
- Checks for timezone offset using utcoffset() method
- Uses Python's calendar.timegm() for accurate epoch calculation
- Adjusts for timezone offset
- Converts to milliseconds

Add type object caching to avoid repeated module imports
- UUID class from uuid module
- datetime class from datetime module
- Pattern class from re module

Fixed mypy errors by changing type ignore comments from attr-defined to union-attr:
- bson/__init__.py: Fixed union-attr errors for _cbson and _rbson module attributes
- tools/fail_if_no_c.py: Removed unnecessary type ignore (C extension is built)
- tools/clean.py: Removed unnecessary type ignore (C extension is built)

Also fixed typing issues in performance test files:
- test/performance/benchmark_bson.py: Added type annotations for function signatures
- test/performance/micro_benchmark.py: Added explicit type annotations for dict literals

Fix test_default_exports by cleaning up spec variable
- The 'spec' variable used during module initialization was being left in the module namespace, causing test_default_exports.py to fail. Added 'del spec' to clean up the variable after use.

Fix shellcheck warning in bson/_rbson/build.sh
- Changed trap command to use single quotes instead of double quotes to ensure \ expands when the trap is executed (on EXIT) rather than when the trap is set.

Fix Windows path handling in bson/_rbson/build.sh
- Changed the Python script to receive paths as command-line arguments via sys.argv instead of embedding them in the script string. This ensures proper path handling on Windows where paths like 'C:\...' would be mangled when embedded in shell strings.
- Also use pathlib.Path consistently for cross-platform path handling.

Remove emoji from Python print to fix Windows encoding error
- Removed emoji characters (✅ ❌) from Python print statements to avoid UnicodeEncodeError on Windows where the default encoding (cp1252) doesn't support these characters.

Remove all emoji characters from Python print statements
- Removed emoji characters (✓, ✅, ❌, ✗) from all Python print statements in shell scripts and GitHub workflows to fix UnicodeEncodeError on Windows where the default encoding (cp1252) doesn't support these characters.

Fix cross-compatibility test to preserve extension modules
- When reloading the bson module to switch from C to Rust extension, we need to preserve the extension modules (_cbson and _rbson) in sys.modules. Otherwise, when bson is re-imported, it can't find the already-loaded extensions and falls back to C.
- Save references to _cbson and _rbson before clearing sys.modules
- Only clear bson modules that aren't the extensions
- Restore the extension modules before re-importing bson

Fix extension module reloading to reuse already-loaded modules
- Modified bson module initialization to check if extension modules (_cbson, _rbson) are already loaded in sys.modules before creating new instances. This allows the module to be reloaded with different settings (e.g., PYMONGO_USE_RUST) without losing access to already-loaded extensions.
- Check sys.modules for 'bson._cbson' and 'bson._rbson' before using importlib.util.module_from_spec()
- Reuse existing module instances when available
- Renamed 'spec' to '_spec' to avoid namespace pollution

Fix benchmark script to preserve extension modules when reloading
- Modified benchmark_bson.py to preserve extension modules (_cbson, _rbson) when reloading the bson module to switch implementations. This is the same fix applied to the cross-compatibility test.
- Save references to _cbson and _rbson before clearing sys.modules
- Only clear bson modules that aren't the extensions
- Restore extension modules before re-importing bson

Optimize Rust BSON encoding with PyDict and PyList fast paths
- Added fast-path optimizations for the most common Python types to reduce overhead from Python API calls:
- PyDict fast path in python_mapping_to_bson_doc():
  - Iterate directly over PyDict items instead of calling items() method
  - Avoids creating intermediate list of tuples
  - Pre-allocate vector with known capacity
  - Added extract_dict_item() helper for dict-specific extraction
- PyList and PyTuple fast paths in handle_remaining_python_types():
  - Check for PyList/PyTuple before generic sequence extraction
  - Use direct iteration with pre-allocated capacity
  - Avoids expensive extract::<Vec<Bound<'_, PyAny>>>() call
- Also fixed micro_benchmark.py to preserve extension modules when reloading

Added profile_nested.py to identify specific performance bottlenecks in:
- Nested dictionaries (3 and 5 levels deep)
- Wide dictionaries (10 keys)
- Lists of dictionaries
- Lists of integers

Implement direct BSON byte writing for major performance improvement
- Major architectural change: Instead of building intermediate bson::Document structures and then serializing them, we now write BSON bytes directly.
- Added write_document_bytes(): writes BSON documents directly to bytes
- Added write_element(): writes individual BSON elements with type-specific encoding
- Added write_array_bytes() and write_tuple_bytes(): direct array encoding
- Added helper functions: write_cstring(), write_string(), write_bson_value()
- Modified _dict_to_bson() to use the new direct byte writing approach

Implement direct BSON byte reading for improved decode performance
- Added direct BSON-to-Python decoding that reads bytes directly without the intermediate Document structure.
- Added read_document_from_bytes(): reads BSON documents directly from bytes
- Added read_bson_value(): reads individual BSON values
- Added read_array_from_bytes(): reads BSON arrays
- Modified _bson_to_dict() to use the new direct byte reading approach

Fix mypy type errors in profile_nested.py
- Added 'Any' type annotation to the 'doc' variable to handle different document types being assigned to the same variable. This fixes the dict-item type incompatibility errors.

Fix BSON decode fallback for unsupported types
- Fixed the direct BSON decoder to properly fall back to the Document-based approach when encountering unsupported BSON types (ObjectId, Binary, DateTime, Regex, etc.).
- Modified read_bson_value() to return an error for unsupported types instead of trying to parse them incorrectly
- Updated _bson_to_dict() to catch 'Unsupported BSON type' errors and fall back to Document::from_reader() for the entire document
- This ensures correctness for all BSON types while maintaining performance for common types (int, string, bool, null, dict, list)

Fix mypy type error in profile_decode.py
- Added type annotation 'dict[str, dict[str, Any]]' to the 'docs' variable to handle different document structures with varying value types.

Add Rust comparison tests to perf_test.py and async_perf_test.py
- Added new test classes that compare C vs Rust BSON implementations:
  - RustSimpleIntEncodingTest/DecodingTest: Simple integer documents
  - RustMixedTypesEncodingTest: Documents with mixed types
  - RustNestedEncodingTest: Nested documents
  - RustListEncodingTest: Documents with lists

Fix TestLongLongToString test when C extension is not available
- The test was failing with AttributeError when _cbson was None (e.g., when using the Rust extension). Added a check to skip the test if _cbson is None or not imported, since _test_long_long_to_str() is a C-specific test function.
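
The four framing checks listed under "Add BSON validation to Rust extension" can be sketched in Python as follows. This is an illustration under stated assumptions, not the Rust code from the PR: the function name `validate_bson_document` is hypothetical, and it raises `ValueError` where the extension would raise its own error type.

```python
import struct


def validate_bson_document(data: bytes) -> None:
    """Raise ValueError unless data frames exactly one BSON document.

    Hypothetical helper: only the outer framing is checked, not the elements.
    """
    # An empty document is a 4-byte length plus the 0x00 terminator.
    if len(data) < 5:
        raise ValueError("not enough data for a BSON document")
    # The first four bytes hold the total size as a little-endian int32.
    (declared_size,) = struct.unpack_from("<i", data, 0)
    if declared_size < 5 or declared_size > len(data):
        raise ValueError("declared size does not match the data length")
    if declared_size < len(data):
        raise ValueError("extra bytes after the document")
    # The last byte of every document is the null terminator.
    if data[-1] != 0:
        raise ValueError("document is not null-terminated")
```

For example, `validate_bson_document(b"\x05\x00\x00\x00\x00")` accepts the empty document, while appending one more byte to that input triggers the extra-bytes error.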
Update CI workflow to use new integrated performance tests
- Updated .github/workflows/test-rust.yml to use the new integrated performance tests in perf_test.py instead of the removed benchmark_bson.py file.
- TestRustSimpleIntEncodingC/Rust
- TestRustMixedTypesEncodingC/Rust

Fix division by zero in performance test tearDown
- Added protection against division by zero when calculating megabytes_per_sec in the tearDown method. This can occur on Windows or when operations are extremely fast and the median time rounds to 0.
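
A minimal sketch of the kind of guard the division-by-zero fix describes. The function signature, the choice to report `inf` for a zero median, and the 1 MB = 1,000,000 bytes convention are assumptions for illustration; the PR text does not say what value tearDown reports in that case.

```python
from statistics import median
from typing import Sequence


def megabytes_per_sec(data_size_bytes: int, timings: Sequence[float]) -> float:
    """Throughput in MB/s based on the median timing (hypothetical helper)."""
    if not timings:
        # No samples were collected, so there is nothing to report.
        return 0.0
    median_seconds = median(timings)
    # A coarse timer can report 0.0 for very fast operations.
    if median_seconds <= 0:
        # Assumption: treat an unmeasurably fast run as unbounded throughput.
        return float("inf")
    return data_size_bytes / median_seconds / 1_000_000
```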

github-advanced-security bot found potential problems Feb 4, 2026

.github/workflows/test-rust.yml: Fixed

aclark4life changed the base branch from master to PYTHON-5683 February 4, 2026 21:01

aclark4life force-pushed the PYTHON-5683 branch 18 times, most recently from 493ee02 to e970962 Compare February 5, 2026 03:09

aclark4life force-pushed the PYTHON-5683 branch from e970962 to 8e34e92 Compare February 5, 2026 03:16

aclark4life wants to merge 1 commit into mongodb:PYTHON-5683 from aclark4life:PYTHON-5683
