WIP: Hack to optimise based on known sizes #276
base: snmalloc1
If the caller knows the sizes, and knows the allocation is thread local, then we can make some significant optimisations. This is a brief hack to show where would need changing.
On x86 the code would look something like: [...] and [...]
This is not intended for merging, but I have pushed it so we can adapt the idea later.
# Build a redirection layer for all sizes that are a multiple of
# 16bytes up to 1024.
add_executable(generate src/redirect/generate.cc)
Is it possible to differentiate between executables for the build host and executables for the target when we're cross-compiling?
I don't know.
Needing to build a C++ program on the host to generate things that we compile for the target is a bit painful for cross-compiling (do we even guarantee that the sizes will be the same if, for example, we're compiling on a 32-bit system for a CHERI target?).
It's also going to be annoying to integrate into a libc build system.
Currently, all platforms and configurations that we build would use the same file. However, we can use this program to build a collection of headers that encode the parameters of interest, e.g. of the form
generated_[MIN_ALLOC_BITS]_[INTERMEDIATE_BITS].cc
Currently, everything would use exactly the same thing:
generated_4_2.cc
But for CHERI, we might want to increase the minimum allocation size. Getting either parameter wrong in the header would still "work"; however, we might
- have more entry points than necessary, or
- allocate a larger object than necessary.
Perhaps we just check the file in once we have finished experimenting. While we are experimenting, I think having this generated is good, as it means we are less likely to make mistakes.
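To make the naming scheme above concrete, here is a minimal sketch of what such a generator could emit. It is not the code in src/redirect/generate.cc: the helper names and the malloc_size_N entry-point names are hypothetical, and it assumes the step between sizes is 1 << MIN_ALLOC_BITS (16 bytes for the current value of 4), matching the "multiple of 16 bytes up to 1024" comment in the CMake hunk.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>

// Hypothetical helper: file name that encodes the two parameters,
// following the generated_[MIN_ALLOC_BITS]_[INTERMEDIATE_BITS].cc form.
std::string generated_file_name(
  std::size_t min_alloc_bits, std::size_t intermediate_bits)
{
  std::ostringstream name;
  name << "generated_" << min_alloc_bits << "_" << intermediate_bits << ".cc";
  return name.str();
}

// Hypothetical helper: emit one DEFINE_MALLOC_SIZE line for every size that
// is a multiple of (1 << min_alloc_bits), up to and including max_size.
// The entry-point names (malloc_size_N) are made up for illustration.
std::string generate_entry_points(
  std::size_t min_alloc_bits, std::size_t max_size)
{
  assert(min_alloc_bits < sizeof(std::size_t) * 8);
  const std::size_t step = std::size_t{1} << min_alloc_bits;

  std::ostringstream out;
  for (std::size_t size = step; size <= max_size; size += step)
  {
    out << "DEFINE_MALLOC_SIZE(malloc_size_" << size << ", " << size << ")\n";
  }
  return out.str();
}
```

With this shape, raising the minimum allocation size for CHERI would only change the arguments passed to the generator, and the resulting file name would record which parameters it was built for.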
Note that we pass s as a template parameter, so the code will be specialised for this size, even if it ends up being a medium or large alloc.
#define DEFINE_MALLOC_SIZE(a, s) \
  extern "C" void* a() \
  { \
    return snmalloc::ThreadAlloc::get_noncachable()->template alloc<s>(); \
  }
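A self-contained toy version of the same pattern may help show why the template parameter matters. Everything here is an assumption for illustration: ToyAlloc, toy_thread_alloc, the DEFINE_TOY_MALLOC_SIZE macro, and the small/medium thresholds are not snmalloc's, and std::malloc stands in for the real allocation paths. The point is only that the size class is a compile-time constant in each generated entry point, so the compiler can drop the branches that do not apply.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

enum class SizeKind { Small, Medium, Large };

// Hypothetical thresholds, chosen only for this sketch.
constexpr std::size_t SMALL_MAX = 1024;
constexpr std::size_t MEDIUM_MAX = std::size_t{1} << 20;

constexpr SizeKind classify(std::size_t size)
{
  return size <= SMALL_MAX ? SizeKind::Small :
    size <= MEDIUM_MAX     ? SizeKind::Medium :
                             SizeKind::Large;
}

struct ToyAlloc
{
  std::size_t small_allocs = 0;
  std::size_t medium_allocs = 0;
  std::size_t large_allocs = 0;

  // size is a template parameter, so kind is a compile-time constant and
  // each instantiation keeps only the branch for its own size class.
  template<std::size_t size>
  void* alloc()
  {
    constexpr SizeKind kind = classify(size);
    if (kind == SizeKind::Small)
      ++small_allocs;
    else if (kind == SizeKind::Medium)
      ++medium_allocs;
    else
      ++large_allocs;

    void* p = std::malloc(size); // stand-in for the real allocation path
    assert(p != nullptr); // the sketch does not handle allocation failure
    return p;
  }
};

// One allocator per thread, standing in for the thread-local allocator.
inline ToyAlloc& toy_thread_alloc()
{
  thread_local ToyAlloc alloc;
  return alloc;
}

// Same shape as DEFINE_MALLOC_SIZE: one C entry point per known size.
#define DEFINE_TOY_MALLOC_SIZE(name, size) \
  extern "C" void* name() \
  { \
    return toy_thread_alloc().alloc<size>(); \
  }

DEFINE_TOY_MALLOC_SIZE(toy_malloc_size_32, 32)
DEFINE_TOY_MALLOC_SIZE(toy_malloc_size_4096, 4096)
```

A caller that knows it wants 32 bytes would call toy_malloc_size_32() directly, with no size argument and no runtime size-class lookup.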
SNMALLOC_EXPORT
void SNMALLOC_NAME_MANGLE(free_local_small)(void* ptr)
{
  if (Alloc::small_local_dealloc(ptr))
I think I'd feel slightly better if this if(!fast) slow dance were inside Alloc rather than having it export the fast and slow paths separately? Unless there's some reason to want to expose this for inlining?
It has to at least be in ThreadAlloc, as we don't want to take the TLS lookup on the fast path. So we need that in scope. I think we probably need a better refactor if we actually want to support this design. This is more a hack to get some codegen, and see if the use case makes sense.
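For reference, one possible shape for keeping the fast/slow split behind a single entry point, while still avoiding the TLS lookup on the fast path, is sketched below. This is a toy model, not the refactor itself: ToySlab, ToyObject, and ToyThreadAlloc are hypothetical, the per-object slab pointer is a simplification, and the rule that the fast path fails on the last live object is an assumption made so the slow path has something to do.

```cpp
#include <cassert>
#include <cstddef>

// Toy slab: the fast path succeeds unless this free would empty the slab.
struct ToySlab
{
  std::size_t live = 0; // objects still allocated from this slab

  bool dealloc_fast()
  {
    if (live > 1)
    {
      --live;
      return true;
    }
    return false;
  }
};

// Toy object carrying a pointer to its slab (hypothetical layout).
struct ToyObject
{
  ToySlab* slab;
};

struct ToyThreadAlloc
{
  std::size_t slow_calls = 0;

  // The thread-local lookup; only the slow path pays for it.
  static ToyThreadAlloc& get()
  {
    thread_local ToyThreadAlloc alloc;
    return alloc;
  }

  // Fast path: static, uses only state reachable from the pointer.
  static bool small_local_dealloc(ToyObject* p)
  {
    return p->slab->dealloc_fast();
  }

  // Slow path: needs the allocator, so it is a member function.
  void small_local_dealloc_slow(ToyObject* p)
  {
    p->slab->live = 0; // stand-in for returning the slab to the allocator
    ++slow_calls;
  }

  // Single exported entry point that hides the fast/slow split.
  static void free_local_small(ToyObject* p)
  {
    assert(p != nullptr);
    if (small_local_dealloc(p))
      return;
    get().small_local_dealloc_slow(p);
  }
};
```

Because free_local_small sits next to get(), the thread-local access stays out of the common case, and callers no longer see two separate exports.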
  return (likely(slab->dealloc_fast(super, p)));
}

SNMALLOC_FAST_PATH void small_local_dealloc_slow(void* p)
Still SNMALLOC_FAST_PATH despite _slow?
Typo
Fix Align to be not a log size.