@MichalStrehovsky MichalStrehovsky commented Jan 16, 2026

Still very much WIP, don't look at the implementation. This works just well enough that I can do perf measurements.

This is changing interface dispatch from this shape:

mov         rcx,rbx  ; this
mov         r8d,edi  ; arg1
mov         edx,esi  ; arg2
lea         r11,[__InterfaceDispatchCell_repro_IFace__Call_repro_Program__DoCallFixed (07FF693C155B0h)]  
call        qword ptr [r11]  

To this shape:

lea         r11,[__InterfaceDispatchCell_repro_IFace__Call_repro_Program__DoCallFixed (07FF6A64D6700h)]  
mov         rcx,rbx  
call        RhpResolveInterfaceMethodFast (07FF6A6424E60h)  
mov         rcx,rbx  /// TODO: why does JIT consider rcx clobbered?
mov         r8d,edi  
mov         edx,esi  
call        rax  

This is the same dispatch shape that we recently added for CFG (Control Flow Guard).

This also changes the contents of the dispatch cell. The dispatch cell is now two pointers: a cached this pointer and a cached target method address. The cell is currently also prefixed by a pointer to the interface MethodTable and the slot number, which are needed to compute the dispatch when nothing is cached. This information could potentially be stored out-of-line, which would let the cache live in .bss and the interface MethodTable+slot description live in read-only data. This part is not done yet.

On the first dispatch, we call the slow resolution helper, which decomposes the dispatch cell into MethodTable+slot, computes the result of the lookup, and stores it in the dispatch cell itself. Later dispatches that hit the value cached in the cell are the fastest, monomorphic case.

If we later see dispatches with a different kind of this, we cache them in a global hashtable. The key of the global hashtable is the this MethodTable address and the dispatch cell address. We use this pair as the key instead of interface MethodTable+slot+this MethodTable because it's faster to hash/compare and touches less memory.

Because the contents/shape of the dispatch cell is now fixed, we can inline the monomorphic case in the invoke sequence. This is also not done yet.
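
To make the caching scheme above concrete, here is a rough managed model of it. This is only a sketch, not the PR's implementation: DispatchCell, Resolver and SlowLookups are hypothetical names, Type stands in for a MethodTable address, MethodInfo stands in for a target method address, and the slot is modeled as an index into the reflection interface map. It also assumes the cell's cached this value is compared by type, and it ignores the atomicity a real runtime needs when updating two pointers.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Reflection;

// Hypothetical model of a dispatch cell: a prefix describing the call,
// followed by the two cached values.
sealed class DispatchCell
{
    // Prefix: needed to compute the dispatch when nothing is cached.
    public readonly Type InterfaceType;
    public readonly int Slot;

    // Cache: the type of this last seen and the target resolved for it.
    public Type? CachedThisType;
    public MethodInfo? CachedTarget;

    public DispatchCell(Type interfaceType, int slot)
    {
        InterfaceType = interfaceType;
        Slot = slot;
    }
}

static class Resolver
{
    // Global cache keyed by (this type, dispatch cell) rather than
    // (interface type, slot, this type). The cell is compared by reference,
    // which models "dispatch cell address".
    static readonly Dictionary<(Type, DispatchCell), MethodInfo> s_cache = new();

    // Counts full lookups; for illustration only.
    public static int SlowLookups;

    public static MethodInfo Resolve(object obj, DispatchCell cell)
    {
        Type thisType = obj.GetType();

        // Monomorphic fast path: the cell already holds the answer.
        // This is the check that could be inlined at the call site.
        if (cell.CachedThisType == thisType)
            return cell.CachedTarget!;

        // Polymorphic path: consult the global cache.
        if (s_cache.TryGetValue((thisType, cell), out MethodInfo? target))
            return target;

        // Slow path: decompose the cell into interface type + slot
        // and do the real lookup.
        SlowLookups++;
        target = thisType.GetInterfaceMap(cell.InterfaceType).TargetMethods[cell.Slot];

        if (cell.CachedThisType is null)
        {
            // First dispatch through this cell: cache in the cell itself.
            cell.CachedThisType = thisType;
            cell.CachedTarget = target;
        }
        else
        {
            // A different kind of this: cache in the global table.
            s_cache[(thisType, cell)] = target;
        }
        return target;
    }
}
```

In this model, resolving repeatedly through one cell with the same object type performs a single slow lookup. A second object type adds one more slow lookup and is then served from the global table, while the cell keeps the first type it saw.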

Cc @dotnet/ilc-contrib

@MichalStrehovsky (Member, Author) commented:

Test program:

// Licensed to the .NET Foundation under one or more agreements.
// The .NET Foundation licenses this file to you under the MIT license.

using System;
using System.Diagnostics;
using System.Runtime.CompilerServices;

class Program
{
    static void Main()
    {
        for (int it = 0; it < 6; it++)
        {
            IFace c0 = new C0();
            var sw = Stopwatch.StartNew();
            for (int i = 0; i < 10_000_000; i++)
            {
                DoCallFixed(c0, 1, 2);
            }
            Console.WriteLine(sw.ElapsedMilliseconds);

            IFace[] faces = [
                new C0(), new C1(), new C2(), new C3(),
                new C4(), new C5(), new C6(), new C7(),
                new C8(), new C9(), new CA(), new CB(),
                new CC(), new CD(), new CE(), new CF(),
            ];

            sw = Stopwatch.StartNew();
            for (int i = 0; i < 10_000_000; i++)
            {
                DoCall2(faces, 1, 2);
            }
            Console.WriteLine(sw.ElapsedMilliseconds);

            sw = Stopwatch.StartNew();
            for (int i = 0; i < 10_000_000; i++)
            {
                DoCall4(faces, 1, 2);
            }
            Console.WriteLine(sw.ElapsedMilliseconds);

            sw = Stopwatch.StartNew();
            for (int i = 0; i < 10_000_000; i++)
            {
                DoCall8(faces, 1, 2);
            }
            Console.WriteLine(sw.ElapsedMilliseconds);

            sw = Stopwatch.StartNew();
            for (int i = 0; i < 10_000_000; i++)
            {
                DoCall16(faces, 1, 2);
            }
            Console.WriteLine(sw.ElapsedMilliseconds);

            for (int i = 0; i < faces.Length; i++)
            {
                DoCallFixed(faces[i], 1, 2);
            }

            sw = Stopwatch.StartNew();
            IFace cf = new CF();
            for (int i = 0; i < 10_000_000; i++)
            {
                DoCallFixed(cf, 1, 2);
            }
            Console.WriteLine(sw.ElapsedMilliseconds);

            Console.WriteLine("---------------------------");
        }
    }



    [MethodImpl(MethodImplOptions.NoInlining)]
    //[MethodImpl(MethodImplOptions.AggressiveInlining)]
    static void DoCallFixed(IFace i, int x, int y)
    {
        i.Call(x, y);
        i.Call(x, y);
        i.Call(x, y);
        i.Call(x, y);
        i.Call(x, y);
        i.Call(x, y);
        i.Call(x, y);
        i.Call(x, y);
    }

    [MethodImpl(MethodImplOptions.NoInlining)]
    //[MethodImpl(MethodImplOptions.AggressiveInlining)]
    static void DoCall2(IFace[] i, int x, int y)
    {
        i[0].Call(x, y);
        i[1].Call(x, y);
        i[0].Call(x, y);
        i[1].Call(x, y);
        i[0].Call(x, y);
        i[1].Call(x, y);
        i[0].Call(x, y);
        i[1].Call(x, y);
    }

    [MethodImpl(MethodImplOptions.NoInlining)]
    //[MethodImpl(MethodImplOptions.AggressiveInlining)]
    static void DoCall4(IFace[] i, int x, int y)
    {
        i[0].Call(x, y);
        i[1].Call(x, y);
        i[2].Call(x, y);
        i[3].Call(x, y);
        i[0].Call(x, y);
        i[1].Call(x, y);
        i[2].Call(x, y);
        i[3].Call(x, y);
    }

    [MethodImpl(MethodImplOptions.NoInlining)]
    //[MethodImpl(MethodImplOptions.AggressiveInlining)]
    static void DoCall8(IFace[] i, int x, int y)
    {
        i[0].Call(x, y);
        i[1].Call(x, y);
        i[2].Call(x, y);
        i[3].Call(x, y);
        i[4].Call(x, y);
        i[5].Call(x, y);
        i[6].Call(x, y);
        i[7].Call(x, y);
    }

    [MethodImpl(MethodImplOptions.NoInlining)]
    //[MethodImpl(MethodImplOptions.AggressiveInlining)]
    static void DoCall16(IFace[] i, int x, int y)
    {
        i[0].Call(x, y);
        i[1].Call(x, y);
        i[2].Call(x, y);
        i[3].Call(x, y);
        i[4].Call(x, y);
        i[5].Call(x, y);
        i[6].Call(x, y);
        i[7].Call(x, y);
        i[8].Call(x, y);
        i[9].Call(x, y);
        i[10].Call(x, y);
        i[11].Call(x, y);
        i[12].Call(x, y);
        i[13].Call(x, y);
        i[14].Call(x, y);
        i[15].Call(x, y);
    }
}

interface IFace
{
    int Call(int x, int y);
}

class C0 : IFace { public int Call(int x, int y) => x + y; }
class C1 : IFace { public int Call(int x, int y) => x + y; }
class C2 : IFace { public int Call(int x, int y) => x + y; }
class C3 : IFace { public int Call(int x, int y) => x + y; }
class C4 : IFace { public int Call(int x, int y) => x + y; }
class C5 : IFace { public int Call(int x, int y) => x + y; }
class C6 : IFace { public int Call(int x, int y) => x + y; }
class C7 : IFace { public int Call(int x, int y) => x + y; }
class C8 : IFace { public int Call(int x, int y) => x + y; }
class C9 : IFace { public int Call(int x, int y) => x + y; }
class CA : IFace { public int Call(int x, int y) => x + y; }
class CB : IFace { public int Call(int x, int y) => x + y; }
class CC : IFace { public int Call(int x, int y) => x + y; }
class CD : IFace { public int Call(int x, int y) => x + y; }
class CE : IFace { public int Call(int x, int y) => x + y; }
class CF : IFace { public int Call(int x, int y) => x + y; }

Main:

84
88
108
122
406
206
---------------------------
83
88
103
120
426
206
---------------------------
83
89
104
121
426
206
---------------------------
83
89
104
121
443
207
---------------------------
84
89
103
131
491
203
---------------------------
82
88
101
121
418
209
---------------------------

PR:

138
155
172
183
360
186
---------------------------
126
157
170
185
360
185
---------------------------
127
156
172
182
360
187
---------------------------
127
161
172
183
360
187
---------------------------
126
156
174
191
370
193
---------------------------
132
158
176
187
367
190
---------------------------

As expected, right now this is a regression when the level of polymorphism is small. We start to get wins when the old cache size grows beyond a dozen entries or so.
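
For anyone who wants to compare the two runs column by column, the raw output can be averaged with a small helper like the one below. It is only a convenience sketch: DispatchResults, MainBranch, PullRequest and ColumnMeans are made-up names, the numbers are copied from the two outputs above, and the column labels are inferred from the order of the Console.WriteLine calls in the test program.

```csharp
using System;
using System.Linq;

static class DispatchResults
{
    // One label per Console.WriteLine in the test program, in order.
    public static readonly string[] Labels =
    {
        "DoCallFixed(C0)", "DoCall2", "DoCall4", "DoCall8", "DoCall16", "DoCallFixed(CF)",
    };

    // Elapsed milliseconds from the "Main:" output, one row per iteration.
    public static readonly int[][] MainBranch =
    {
        new[] { 84, 88, 108, 122, 406, 206 },
        new[] { 83, 88, 103, 120, 426, 206 },
        new[] { 83, 89, 104, 121, 426, 206 },
        new[] { 83, 89, 104, 121, 443, 207 },
        new[] { 84, 89, 103, 131, 491, 203 },
        new[] { 82, 88, 101, 121, 418, 209 },
    };

    // Elapsed milliseconds from the "PR:" output, one row per iteration.
    public static readonly int[][] PullRequest =
    {
        new[] { 138, 155, 172, 183, 360, 186 },
        new[] { 126, 157, 170, 185, 360, 185 },
        new[] { 127, 156, 172, 182, 360, 187 },
        new[] { 127, 161, 172, 183, 360, 187 },
        new[] { 126, 156, 174, 191, 370, 193 },
        new[] { 132, 158, 176, 187, 367, 190 },
    };

    // Mean of each column across all iterations.
    public static double[] ColumnMeans(int[][] runs) =>
        Enumerable.Range(0, runs[0].Length)
                  .Select(c => runs.Average(r => r[c]))
                  .ToArray();

    // Prints one line per benchmark with both means and the PR/main ratio.
    public static void Print()
    {
        double[] main = ColumnMeans(MainBranch);
        double[] pr = ColumnMeans(PullRequest);
        for (int i = 0; i < Labels.Length; i++)
        {
            Console.WriteLine(
                $"{Labels[i],-16} main {main[i],6:F1} ms   PR {pr[i],6:F1} ms   ratio {pr[i] / main[i]:F2}");
        }
    }
}
```

A ratio above 1 means the PR is slower for that case. With the numbers above, only the DoCall16 column and the final DoCallFixed(CF) column come out below 1.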

@jakobbotsch (Member) commented:

mov rcx,rbx /// TODO: why does JIT consider rcx clobbered?

It's probably not that we consider it clobbered. Rather, LSRA (the JIT's linear scan register allocator) has decided that the this value is homed in rbx, and hence it must be moved to rcx at the ABI boundary. We don't have a mechanism that can say that the value is present in both rbx and rcx after the first call.
