@jamesfredley commented Jan 28, 2026

This PR significantly improves invokedynamic performance in Grails 7 / Groovy 4 by disabling the global SwitchPoint guard by default and clearing call site caches when a metaclass changes. Related issues:

https://issues.apache.org/jira/browse/GROOVY-10307
apache/grails-core#15293

Using https://github.com/jamesfredley/grails7-performance-regression as a test application

"10K/10K" below means groovy.indy.optimize.threshold=10000 / groovy.indy.fallback.threshold=10000.

Key Results

| Configuration | Runtime (4GB) | Startup |
| --- | --- | --- |
| No-Indy | 60ms | ~14s |
| Optimized 10K/10K | 139ms | ~11s |
| Baseline Indy | 267ms | ~14s |

Problem Statement

Root Cause Analysis

The performance regression in the Grails 7 test application was traced to the global SwitchPoint guard in Groovy's invokedynamic implementation:

  1. Global SwitchPoint Invalidation: When ANY metaclass changes anywhere in the application, ALL call sites are invalidated
  2. Grails Framework Behavior: Grails frequently modifies metaclasses during request processing (controllers, services, domain classes)
  3. Cascade Effect: Each metaclass change triggers mass invalidation, forcing all call sites to re-resolve their targets
  4. Guard Failures: The constant invalidation causes guards to fail repeatedly, preventing the JIT from optimizing method handle chains
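
To make the cascade concrete, here is a minimal sketch of one shared SwitchPoint guarding several call sites, using only java.lang.invoke. The class and method names (GlobalSwitchPointDemo, newGuardedSite, metaClassChanged) are hypothetical and this is not Groovy's actual code; it only shows that a single invalidation flips every site guarded by the shared SwitchPoint onto its fallback path.

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;
import java.lang.invoke.MutableCallSite;
import java.lang.invoke.SwitchPoint;

/** Illustrative only: hypothetical names, not Groovy's indy implementation. */
final class GlobalSwitchPointDemo {
    // One SwitchPoint shared by every call site, acting as a global "no metaclass changed" flag.
    static volatile SwitchPoint globalSwitchPoint = new SwitchPoint();

    static String fastPath() { return "fast"; }
    static String slowPath() { return "slow"; }

    /** Builds a call site whose fast path stays valid only while the shared SwitchPoint is valid. */
    static MutableCallSite newGuardedSite() {
        try {
            MethodHandles.Lookup lookup = MethodHandles.lookup();
            MethodType type = MethodType.methodType(String.class);
            MethodHandle fast = lookup.findStatic(GlobalSwitchPointDemo.class, "fastPath", type);
            MethodHandle slow = lookup.findStatic(GlobalSwitchPointDemo.class, "slowPath", type);
            return new MutableCallSite(globalSwitchPoint.guardWithTest(fast, slow));
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Simulates "ANY metaclass changed": invalidate the shared guard and install a fresh one. */
    static void metaClassChanged() {
        SwitchPoint old = globalSwitchPoint;
        globalSwitchPoint = new SwitchPoint();
        SwitchPoint.invalidateAll(new SwitchPoint[] { old });
    }

    static String call(MutableCallSite site) {
        try {
            return (String) site.dynamicInvoker().invokeExact();
        } catch (Throwable t) {
            throw new IllegalStateException(t);
        }
    }

    public static void main(String[] args) {
        MutableCallSite a = newGuardedSite();
        MutableCallSite b = newGuardedSite();
        System.out.println(call(a) + " " + call(b)); // prints "fast fast"
        metaClassChanged();                           // one change, unrelated to either site
        System.out.println(call(a) + " " + call(b)); // prints "slow slow"
    }
}
```

In this sketch both sites stay on the slow path until they are re-linked, which is the per-site cost that gets multiplied across an application when metaclass changes are frequent.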

Optimizations Applied

  1. Disabled Global SwitchPoint Guard

    • Removed the global SwitchPoint that caused ALL call sites to invalidate when ANY metaclass changed
    • Individual call sites now manage their own invalidation via cache clearing
    • Can be re-enabled with: -Dgroovy.indy.switchpoint.guard=true
  2. Call Site Cache Invalidation

    • Implemented call site registration mechanism (ALL_CALL_SITES set with weak references)
    • When metaclass changes, all registered call sites have their caches cleared
    • Targets reset to default (fromCache) path, ensuring metaclass changes are visible
    • Garbage-collected call sites are automatically removed
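
The registration idea can be sketched roughly as follows. This is a simplified stand-in, not the code in IndyInterface.java or CacheableCallSite.java: DemoCallSite, its string-valued "handles", and onMetaClassChange are hypothetical, and only the ALL_CALL_SITES name and the clearCache() / fromCache concepts come from this PR. A weak set built on WeakHashMap lets garbage-collected call sites drop out on their own.

```java
import java.util.Collections;
import java.util.Map;
import java.util.Set;
import java.util.WeakHashMap;
import java.util.concurrent.ConcurrentHashMap;

/** Illustrative only: a simplified model of call site registration and cache clearing. */
final class CallSiteRegistryDemo {

    /** Hypothetical stand-in for a call site with a per-site cache and a current target. */
    static final class DemoCallSite {
        private final Map<String, String> cache = new ConcurrentHashMap<>();
        private volatile String target = "fromCache"; // default (cache lookup) path

        /** Caches a resolved handle for a receiver class and makes it the current target. */
        void optimize(String receiverClass, String handle) {
            cache.put(receiverClass, handle);
            target = handle;
        }

        /** Drops cached handles and resets the target to the default path. */
        void clearCache() {
            cache.clear();
            target = "fromCache";
        }

        String target() { return target; }
        int cacheSize() { return cache.size(); }
    }

    // Weakly referenced set: entries disappear once a call site is no longer strongly reachable.
    static final Set<DemoCallSite> ALL_CALL_SITES = Collections.synchronizedSet(
            Collections.newSetFromMap(new WeakHashMap<DemoCallSite, Boolean>()));

    static DemoCallSite register(DemoCallSite site) {
        ALL_CALL_SITES.add(site);
        return site;
    }

    /** Called when any metaclass changes: clear every registered call site's cache. */
    static void onMetaClassChange() {
        synchronized (ALL_CALL_SITES) { // iteration over a synchronizedSet needs manual locking
            for (DemoCallSite site : ALL_CALL_SITES) {
                site.clearCache();
            }
        }
    }

    public static void main(String[] args) {
        DemoCallSite site = register(new DemoCallSite());
        site.optimize("java.lang.String", "String#length");
        System.out.println(site.target()); // prints "String#length"
        onMetaClassChange();
        System.out.println(site.target()); // prints "fromCache"
    }
}
```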

Optimizations Removed (No Measurable Benefit)

Testing showed these additional optimizations provide no measurable performance benefit while adding complexity:

  1. Monomorphic Fast-Path (REMOVED)

    • Added volatile fields (latestClassName/latestMethodHandleWrapper) to CacheableCallSite
    • Skips synchronization and map lookups for call sites that consistently see the same types
    • Removal Reason: No measurable improvement in Grails 7 benchmark; added complexity
  2. Optimized Cache Operations (REMOVED)

    • Added get() method for cache lookup without automatic put
    • Added putIfAbsent() for thread-safe cache population
    • Removal Reason: No measurable improvement; existing getAndPut() sufficient
  3. Pre-Guard Method Handle Storage (REMOVED)

    • Selector stores the method handle before argument type guards are applied (handleBeforeArgGuards)
    • Enables potential future optimizations for direct invocation
    • Removal Reason: Future optimization not implemented; added unused code

What Was NOT Changed

  • Default thresholds remain 10,000/10,000 (same as baseline GROOVY_4_0_X)
  • All guards remain intact - guards are NOT bypassed, only the global SwitchPoint is disabled
  • Cache semantics unchanged - LRU eviction, soft references, same default size (4)
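
For readers unfamiliar with the cache semantics mentioned above, this is a generic sketch of LRU eviction with a maximum size of 4, built on LinkedHashMap in access order. It is not Groovy's cache class: LruCacheDemo is a hypothetical name, and soft references are deliberately left out to keep the example short.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative only: a plain LRU map; Groovy's real cache also uses soft references. */
final class LruCacheDemo<K, V> extends LinkedHashMap<K, V> {
    private static final long serialVersionUID = 1L;
    private final int maxSize;

    LruCacheDemo(int maxSize) {
        super(16, 0.75f, true); // accessOrder=true: iteration order is least recently used first
        this.maxSize = maxSize;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > maxSize; // evict the least recently used entry once over capacity
    }

    public static void main(String[] args) {
        LruCacheDemo<String, String> cache = new LruCacheDemo<>(4);
        cache.put("A", "handleA");
        cache.put("B", "handleB");
        cache.put("C", "handleC");
        cache.put("D", "handleD");
        cache.get("A");            // touch A so that B becomes the least recently used entry
        cache.put("E", "handleE"); // exceeds capacity, so B is evicted
        System.out.println(cache.keySet()); // prints "[C, D, A, E]"
    }
}
```

With a per-call-site size of 4, a call site that sees five or more receiver types in rotation will keep evicting and re-resolving, which is one reason cache hit rate appears under the future optimization areas.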

Performance Results

Test Environment

  • Java: Amazon Corretto JDK 25.0.2
  • OS: Windows 11
  • Test Application: Grails 7 performance benchmark
  • Data Set: 5 Companies, 29 Departments, 350 Employees, 102 Projects, 998 Tasks, 349 Milestones, 20 Skills
  • Test Method: 5 warmup + 50 test iterations per run, 5 runs per configuration

Runtime Performance Comparison (4GB Heap - Recommended)

| Configuration | Min | Max | Avg | Stability |
| --- | --- | --- | --- | --- |
| No-Indy | 55ms | ~85ms | 60ms | Excellent |
| Optimized 10K/10K | 128ms | ~216ms | 139ms | Good |
| Baseline Indy | 224ms | ~336ms | 267ms | Good |

Startup Time Comparison

| Configuration | Startup (4GB) | Notes |
| --- | --- | --- |
| No-Indy | ~14s | Consistent |
| Baseline Indy | ~14s | Similar to no-indy |
| Optimized 10K/10K | ~11s | Fastest startup |

Visual Performance Summary (4GB Heap)

Runtime Performance - Lower is Better
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
No-Indy:        ███ 60ms
Opt 10K/10K:    ███████ 139ms         
Baseline Indy:  █████████████ 267ms
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Startup Time - Lower is Better
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Opt 10K/10K:    ███████ 11s           
No-Indy:        ████████████ 14s
Baseline Indy:  ████████████ 14s
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Configuration Test Matrix

All tests were performed on January 30, 2026 with JDK 25.0.2 (Corretto) and a 4GB heap (-Xmx4g -Xms2g). Runtime values are stabilized averages (runs 3-5).

| Configuration | Startup | Runtime (Avg) | Min | Max |
| --- | --- | --- | --- | --- |
| No-Indy | ~14s | 60ms | 55ms | 85ms |
| Baseline Indy | ~14s | 267ms | 224ms | 336ms |
| Optimized 10K/10K | ~11s | 139ms | 128ms | 216ms |

Failed Optimization Attempts

These approaches were tried but caused test failures. They are documented here to prevent repeating the same mistakes.

Failed Attempt 1: Skip Metaclass Guards for Cache Hits

What it tried: Store method handle before guards, use unguarded handle for cache hits.

Why it failed: Call sites can be polymorphic (called with multiple receiver types). The unguarded handles contain explicitCastArguments with embedded type casts that throw ClassCastException when different types pass through.

Failed Attempt 2: Version-Based Metaclass Guards

What it tried: Use metaclass version instead of identity for guards, skip guards for "cache hits."

Why it failed: Same polymorphic call site issue - skipping guards for cache hits is fundamentally unsafe.

Failed Attempt 3: Skip Argument Guards After Warmup

What it tried: After warmup period, use handles without argument type guards.

Why it failed: Call sites can receive different argument types over their lifetime. Skipping argument guards causes ClassCastException or IncompatibleClassChangeError.

Key Learning

Any optimization that bypasses guards based on cache state is unsafe because:

  1. The cache key (receiver Class) identifies which handle to use
  2. But the call site itself can receive different types over its lifetime
  3. Guards in the method handle chain ensure proper fallback when types don't match
  4. The JVM's explicitCastArguments will throw if types don't match

Safe optimizations must:

  1. Keep all guards intact
  2. Focus on reducing guard overhead, not bypassing guards
  3. Reduce cache misses rather than optimize the miss path unsafely
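
The failure mode behind all three failed attempts can be reproduced with plain java.lang.invoke. The sketch below is not taken from Selector.java: GuardDemo and its helpers are hypothetical, and String.length() stands in for a resolved Groovy method. It shows that a handle adapted with explicitCastArguments throws ClassCastException for an unexpected receiver, while the same handle wrapped in guardWithTest falls back safely.

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;

/** Illustrative only: why a cached handle cannot be invoked without its type guard. */
final class GuardDemo {
    /** Stand-in for the slow path that would re-resolve the method for a new receiver type. */
    static int fallback(Object receiver) { return -1; }

    /** String.length() adapted to (Object)int; the embedded cast fails for non-String receivers. */
    static MethodHandle unguarded() {
        try {
            MethodHandle length = MethodHandles.lookup()
                    .findVirtual(String.class, "length", MethodType.methodType(int.class));
            return MethodHandles.explicitCastArguments(
                    length, MethodType.methodType(int.class, Object.class));
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
    }

    /** The same handle behind a receiver type guard, with a fallback for every other type. */
    static MethodHandle guarded() {
        try {
            MethodHandles.Lookup lookup = MethodHandles.lookup();
            MethodHandle isString = lookup
                    .findVirtual(Class.class, "isInstance",
                            MethodType.methodType(boolean.class, Object.class))
                    .bindTo(String.class);
            MethodHandle slow = lookup.findStatic(GuardDemo.class, "fallback",
                    MethodType.methodType(int.class, Object.class));
            return MethodHandles.guardWithTest(isString, unguarded(), slow);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
    }

    static int invoke(MethodHandle handle, Object receiver) {
        try {
            return (int) handle.invokeExact(receiver);
        } catch (RuntimeException | Error e) {
            throw e; // let ClassCastException propagate unchanged
        } catch (Throwable t) {
            throw new IllegalStateException(t);
        }
    }

    public static void main(String[] args) {
        System.out.println(invoke(unguarded(), "abc")); // prints 3
        System.out.println(invoke(guarded(), 42));      // prints -1 (guard fails, fallback runs)
        try {
            invoke(unguarded(), 42);                     // no guard: the embedded cast fails
        } catch (ClassCastException expected) {
            System.out.println("ClassCastException without guard");
        }
    }
}
```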

Remaining Performance Gap

The optimized indy is still ~2.3x slower than non-indy (139ms vs 60ms). The remaining gap comes from:

  1. Guard evaluation overhead - Every call must check type guards
  2. Method handle chain traversal - Composed handles have inherent overhead
  3. Cache lookup cost - Cache lookup overhead on each call
  4. Fallback path cost - Cache misses trigger full method resolution

Potential Future Optimization Areas

  1. Reduce guard evaluation cost - Make guards cheaper (not skip them)
  2. Improve cache hit rate - Larger cache, better eviction policy
  3. Reduce method handle chain depth - Fewer composed handles
  4. JIT-friendly patterns - Ensure method handle chains can be inlined
  5. Profile-guided optimization - Use runtime profiles to specialize hot paths

Configuration Reference

System Properties

| Property | Default | Description |
| --- | --- | --- |
| groovy.indy.optimize.threshold | 10000 | Hit count before setting guarded target on call site |
| groovy.indy.fallback.threshold | 10000 | Fallback count before resetting to cache lookup path |
| groovy.indy.callsite.cache.size | 4 | Size of LRU cache per call site |
| groovy.indy.switchpoint.guard | false | Enable global SwitchPoint guard (NOT recommended) |
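
As a rough illustration of how these properties and defaults fit together, the sketch below reads them with standard JDK calls (Integer.getInteger, Boolean.getBoolean). Groovy itself goes through SystemUtil (see the getBooleanSafe call in the diff), so the IndyConfigDemo class here is a hypothetical stand-in; only the property names and defaults come from the table.

```java
/** Illustrative only: reads the indy tuning properties with the defaults listed in this PR. */
final class IndyConfigDemo {
    static int optimizeThreshold() {
        return Integer.getInteger("groovy.indy.optimize.threshold", 10_000);
    }

    static int fallbackThreshold() {
        return Integer.getInteger("groovy.indy.fallback.threshold", 10_000);
    }

    static int callSiteCacheSize() {
        return Integer.getInteger("groovy.indy.callsite.cache.size", 4);
    }

    /** False unless the property is set to "true" (case-insensitive). */
    static boolean switchPointGuardEnabled() {
        return Boolean.getBoolean("groovy.indy.switchpoint.guard");
    }

    public static void main(String[] args) {
        System.out.println("optimize.threshold  = " + optimizeThreshold());
        System.out.println("fallback.threshold  = " + fallbackThreshold());
        System.out.println("callsite.cache.size = " + callSiteCacheSize());
        System.out.println("switchpoint.guard   = " + switchPointGuardEnabled());
    }
}
```

Running it with, for example, -Dgroovy.indy.switchpoint.guard=true -Dgroovy.indy.callsite.cache.size=8 on the java command line changes the last two values.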

Files Modified

Source Files

| File | Lines | Changes |
| --- | --- | --- |
| Selector.java | +16 | Disabled SwitchPoint guard by default |
| IndyInterface.java | +41 | Call site registration, cache invalidation on metaclass change |
| CacheableCallSite.java | +17 | Added clearCache() method |
| Total | +69 | Minimal changes for maximum impact |

This commit significantly improves invokedynamic performance in
Grails 7 / Groovy 4 by implementing several key optimizations:

1. **Disabled Global SwitchPoint Guard**: Removed the global SwitchPoint
   that caused ALL call sites to invalidate when ANY metaclass changed.
   In Grails and similar frameworks with frequent metaclass modifications,
   this was causing massive guard failures and performance degradation.
   Individual call sites now manage their own invalidation via cache clearing.
   Can be re-enabled with: -Dgroovy.indy.switchpoint.guard=true

2. **Monomorphic Fast-Path**: Added volatile fields (latestClassName/
   latestMethodHandleWrapper) to CacheableCallSite to skip synchronization
   and map lookups for call sites that consistently see the same types,
   which is the common case in practice.

3. **Call Site Cache Invalidation**: Added call site registration and
   cache clearing mechanism. When metaclass changes, all registered call
   sites have their caches cleared and targets reset to the default
   (fromCache) path, ensuring metaclass changes are visible.

4. **Optimized Cache Operations**: Added get() and putIfAbsent() methods
   to CacheableCallSite for more efficient cache access patterns, and
   clearCache() for metaclass change invalidation.

5. **Pre-Guard Method Handle Storage**: Selector now stores the method
   handle before argument type guards are applied (handleBeforeArgGuards),
   enabling potential future optimizations for direct invocation.

The thresholds can be tuned via system properties:
- groovy.indy.optimize.threshold (default: 10000)
- groovy.indy.fallback.threshold (default: 10000)
- groovy.indy.callsite.cache.size (default: 4)
- groovy.indy.switchpoint.guard (default: false)
@jamesfredley
Author

I do not believe the additional (10) failure was caused by these changes, since this commit already passed earlier today jamesfredley@510762d, but I do not have permission to rerun the job.

@eric-milles requested a review from blackdrag on January 29, 2026 01:53
Remove monomorphic fast-path, optimized cache ops, and pre-guard method
handle storage. Testing showed these provide no measurable benefit in
Grails 7 benchmarks while adding complexity.

The minimal optimization (disabled SwitchPoint guard + cache invalidation)
achieves the same performance improvement with 75% less code (69 lines vs
277 lines).

Removed:
- Monomorphic fast-path (latestClassName/latestMethodHandleWrapper fields)
- Optimized cache operations (get/putIfAbsent methods)
- Pre-guard method handle storage (handleBeforeArgGuards, directMethodHandle)

Retained:
- Disabled global SwitchPoint guard by default
- Cache invalidation on metaclass changes
- Call site registration for invalidation tracking
@jamesfredley
Author

Optimizations Retained

  1. Disabled Global SwitchPoint Guard (Selector.java)

    • Removed the global SwitchPoint that caused ALL call sites to invalidate when ANY metaclass changed
    • Can be re-enabled with: -Dgroovy.indy.switchpoint.guard=true
  2. Call Site Cache Invalidation (IndyInterface.java, CacheableCallSite.java)

    • Implemented call site registration mechanism (ALL_CALL_SITES set with weak references)
    • When metaclass changes, all registered call sites have their caches cleared
    • Targets reset to default (fromCache) path, ensuring metaclass changes are visible

Optimizations Removed (No Measurable Benefit)

  1. Monomorphic Fast-Path - latestClassName/latestMethodHandleWrapper volatile fields
  2. Optimized Cache Operations - get(), putIfAbsent() methods
  3. Pre-Guard Method Handle Storage - handleBeforeArgGuards, directMethodHandle fields

Testing showed these additional optimizations provide no measurable performance benefit in the Grails 7 benchmark while adding complexity.

// sufficient, combined with cache invalidation on metaclass changes.
//
// If you need strict metaclass change detection, set groovy.indy.switchpoint.guard=true
if (SystemUtil.getBooleanSafe("groovy.indy.switchpoint.guard")) {
Contributor

The one question I have about this is whether this check ought to be inverted. That is, should the default behaviour be the old way, which we know is less performant but more battle-tested, or this way? Could Grails simply turn this flag on by default, thus solving their problem, without impacting other users?

Also, as a nitpicky thing, I reckon this string ought to be a public static final field so that we could add some tests around it.

Author

If this PR's approach is viable but does not prove to be generally valuable for all/most code run with Indy on, then I think the inverse would be fine for Grails, and we could set that flag in default generated applications.

Contributor

This is part of why I'd like to run this branch against the current benchmarks on CI. That would likely provide some confidence as to whether this was "safe" to merge without impacting anyone's expectations about how the language behaves.

@jonnybot0
Contributor

I'm also well outside my expertise here, but so long as we're doing this, I'd be curious to see how this runs against the performance tests we have, compared to baseline.
https://ci-builds.apache.org/job/Groovy/job/Groovy%204%20performance%20tests/ has a Jenkins build that runs these. That's no good for running against a community repository, but perhaps we could push this up to a different branch and run it. Little outside my usual wheelhouse, so I'd like the nod from @paulk-asert, @blackdrag, or someone else on the PMC before I just run it willy-nilly.

I tried running ./gradlew :performance:performanceTests on my local machine with Java 17 on this branch:

Groovy 4.0.7 Average 376.26ms ± 64.42ms 
Groovy current Average 376.87ms ± 63.75ms (0.16% slower)
Groovy 3.0.14 Average 443.74ms ± 131.81ms (17.93% slower)
Groovy 2.5.20 Average 451.41ms ± 138.82ms (19.97% slower)

That seems comparable, so my assumption is this hasn't hurt the most basic performance measures we do. Main reason I'd want CI confirmation is to make sure it wasn't a quirk of my machine.
