Implement some of MSVC's intrinsics, for ImportC.#16372
just-harry wants to merge 77 commits into dlang:master
| *remainder = cast(int) (dividend % divisor); | ||
| return cast(int) (dividend / divisor); | ||
| } | ||
| else |
There was a problem hiding this comment.
Is it really necessary to provide an asm implementation here? ditto throughout
There was a problem hiding this comment.
For _div64 specifically, yes, as compilers will generate the 64-bit encoding of div, which won't cause a #DE error on overflow of the 32-bit quotient, so a change of behaviour.
For everything else, generally yeah, either because it can't be represented without asm, or to ensure that the generated code behaves identically to MSVC's code, or for performance where it's relevant (like 128-bit multiplication).
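To make the overflow concern concrete, here is a minimal C sketch (not the PR's code) of the portable path quoted in the diff above. The names `div64_portable` and `div64_quotient_fits` are hypothetical. The helper flags the inputs whose quotient does not fit in 32 bits, which is the case where a 32-bit `idiv` would raise #DE while the plain C version only narrows the result.

```c
#include <limits.h>
#include <stdbool.h>

/* Hypothetical name; mirrors the non-asm fallback shown in the diff. */
static int div64_portable(long long dividend, int divisor, int *remainder)
{
    *remainder = (int)(dividend % divisor);
    /* An out-of-range quotient is narrowed here (implementation-defined);
       nothing traps, unlike a 32-bit hardware divide. */
    return (int)(dividend / divisor);
}

/* Hypothetical helper: true when the quotient is representable as int. */
static bool div64_quotient_fits(long long dividend, int divisor)
{
    if (divisor == 0)
        return false;
    if (dividend == LLONG_MIN && divisor == -1)
        return false; /* the division itself would overflow long long */
    long long quotient = dividend / divisor;
    return quotient >= INT_MIN && quotient <= INT_MAX;
}
```

A caller that wants MSVC-like strictness without asm could check `div64_quotient_fits` first and decide how to fail; that policy is left open here.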
| { | ||
| if (__ctfe) | ||
| { | ||
| /* This is an amalgamation of core.int128.divmod and core.int128.neg. */ |
There was a problem hiding this comment.
why not simply forward to that implementation?
There was a problem hiding this comment.
I think that would rely on linking with DRuntime, which would rule out BetterC compatibility.
compiler/src/dmd/cparse.d
Outdated
| auto s = new AST.Import(Loc.initial, null, Id.importc_builtins, null, false); | ||
| wrap.push(s); | ||
| wrap.push(new AST.Import(Loc.initial, null, Id.importc_builtins, null, false)); | ||
| wrap.push(new AST.Import(Loc.initial, null, Id.importc_msvc_builtins, null, false)); |
There was a problem hiding this comment.
Could MSVC builtins be publicly imported by importc builtins instead? That would remove the need to alter the compiler.
There was a problem hiding this comment.
Aye, they can be; good shout. (Though, the __importc_msvc_builtins.d file had to be renamed to __builtins_msvc.d, to match the module name.)
I've flipped this back to a draft PR as currently the functions in the |
| @@ -0,0 +1,4 @@ | |||
| ImportC, when targeting the Microsoft C runtime, supports a subset of the intrinsics recognised by the MSVC compiler. | |||
There was a problem hiding this comment.
The documentation in the opening comment in the PR is far superior to this readme. Merge it in here.
There was a problem hiding this comment.
I think I've copied the right part—better now?
||
| static inline unsigned char __readgsbyte(unsigned int Offset) | ||
| { | ||
| return *(unsigned char __seg_gs *) (_import_c_msvc_ptr_int) Offset; |
There was a problem hiding this comment.
ImportC does not recognize __seg_gs or __seg_fs, so I don't see how this can work
There was a problem hiding this comment.
I thought that GDC delegated to GCC for C files, but I was mistaken. I've replaced this with inline asm in D, as is already done for DMD.
compiler/src/dmd/cparse.d
Outdated
| if (idx.length > 2 && idx[0] == '_' && idx[1] == '_') // leading double underscore | ||
| importBuiltins = true; // probably one of those compiler extensions | ||
| if (idx.length > 1 && idx[0] == '_') // leading underscore | ||
| importBuiltins = true; // maybe one of those compiler extensions |
There was a problem hiding this comment.
I'm a little concerned about nearly always importing 10,000 lines of code for Windows compilations.
There was a problem hiding this comment.
Mhm, it's not great. How about what I've pushed now?
We have a lazily-initialised StringTable of all the MSVC intrinsic names that we have implemented.
If we're targeting the Microsoft C runtime, and we come across an identifier with a leading underscore, then we check if that identifier is a known MSVC intrinsic, and if it is we import the MSVC builtins and we stop checking future identifiers.
That way, the cost of the import is paid only when one of the intrinsics is actually going to be used.
The downside there is we have to remember to update that StringTable if more MSVC intrinsics are implemented, and we do pay the minor cost of hashing and looking up identifiers with leading underscores.
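The lookup described above can be sketched in C roughly as follows. This is an illustration only: the PR uses the compiler's StringTable (a hash table), whereas this sketch swaps in a sorted array with `bsearch`, and the names `known_intrinsics`, `msvc_builtins_needed`, and `note_identifier` are made up for the example.

```c
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Illustrative subset only; must stay sorted in strcmp order for bsearch. */
static const char *const known_intrinsics[] = {
    "__cpuid", "__debugbreak", "__umulh", "_div64", "_rotl", "_umul128",
};

static int compare_names(const void *key, const void *element)
{
    return strcmp((const char *)key, *(const char *const *)element);
}

/* Set once the first known intrinsic name has been seen. */
static bool msvc_builtins_needed = false;

/* Called for each identifier; returns whether the import is needed. */
static bool note_identifier(const char *id)
{
    if (msvc_builtins_needed)
        return true;              /* already decided: stop checking */
    if (id[0] != '_')
        return false;             /* cheap filter before any lookup */
    size_t count = sizeof known_intrinsics / sizeof known_intrinsics[0];
    if (bsearch(id, known_intrinsics, count,
                sizeof known_intrinsics[0], compare_names) != NULL)
        msvc_builtins_needed = true;
    return msvc_builtins_needed;
}
```

Identifiers without a leading underscore never reach the table, and once one intrinsic is found no further lookups happen, which matches the cost profile described in the comment.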
This is still a draft as I'm unable to get the compiler to actually generate code for the
Presently, neither the
This is effectively how
unsigned long long UnsignedMultiply128 (unsigned long long Multiplier, unsigned long long Multiplicand, unsigned long long *HighProduct);
#define UnsignedMultiply128 _umul128
unsigned long long UnsignedMultiply128 (unsigned long long Multiplier, unsigned long long Multiplicand, unsigned long long *HighProduct);
which will expand to
But even if the Windows headers didn't do that, we still need code to be generated for (
Right now I'm stuck, because I've yet to figure out a good way to get the
The best I've come up with thus far is very hacky, and would require the same changes to be made in LDC and GDC (unless I'm mistaken), which is: in
In short, my question is: is there an existing way to make it so the
Any suggestions would be appreciated.
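For readers unfamiliar with the intrinsic behind `UnsignedMultiply128`: `_umul128` returns the low 64 bits of a 64x64-bit product and stores the high 64 bits through the pointer argument. The following is a portable C sketch of that contract under the hypothetical name `portable_umul128`; it is not the PR's implementation, which uses asm or compiler builtins where available.

```c
#include <stdint.h>

/* Hypothetical stand-in with the same contract as _umul128:
   returns the low 64 bits, stores the high 64 bits in *high. */
static uint64_t portable_umul128(uint64_t a, uint64_t b, uint64_t *high)
{
    uint64_t a_lo = (uint32_t)a, a_hi = a >> 32;
    uint64_t b_lo = (uint32_t)b, b_hi = b >> 32;

    /* Four 32x32 -> 64 partial products. */
    uint64_t p0 = a_lo * b_lo;
    uint64_t p1 = a_lo * b_hi;
    uint64_t p2 = a_hi * b_lo;
    uint64_t p3 = a_hi * b_hi;

    /* Sum of the three terms that land in bits 32..63; at most three
       32-bit values are added, so this cannot overflow 64 bits. */
    uint64_t mid = (p0 >> 32) + (uint32_t)p1 + (uint32_t)p2;

    *high = p3 + (p1 >> 32) + (p2 >> 32) + (mid >> 32);
    return (mid << 32) | (uint32_t)p0;
}
```

Because the function only exists if code is generated for it, a header that renames a declaration onto it (as in the snippet above) still needs the definition to be emitted somewhere, which is the linking problem this comment describes.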
I suggest having the user do the #include line as the first line in his C code. Or have the user pass it in with the dmd switch that passes switches to the C preprocessor. Will that work?
If that works for you, it works for me.
#include <importc_msvc_builtins.h>
#include <windows.h>
I've added some code to supply
Though, the issue of these builtins only working when the
Is there an existing way to force codegen for an imported module, when the
Okay; a separate PR adding a mechanism that this PR can use to force codegen for the
Via something like this in
if (token.value == TOK._import) // import declaration extension
{
    auto a = parseImport();
    if (a && a.length)
/*+*/ {
/*+*/     auto imp = (*a)[0].isImport();
/*+*/     imp.forceCodegen = imp.id == Id.builtins_msvc;
          symbols.append(a);
/*+*/ }
    return;
}
| #elif defined(__GNUC__) | ||
| #define __assume(expression) do {if (!(expression)) {__builtin_unreachable();}} while (0) | ||
| #else |
There was a problem hiding this comment.
In case this is meant to be public, FYI I think importC would discard this macro, whereas the other definitions (clang) are simple enough to be exposed as templates.
There was a problem hiding this comment.
Also __assume is supposed to be an optimiser hint, not an assert, which is what this implementation does.
There was a problem hiding this comment.
(unreachable is a compile-time barrier, not an assert)
I'll take your word for the clang and !clang!gnu paths.
There was a problem hiding this comment.
> (unreachable is a compile-time barrier, not an assert)
I know that. assert is (often) defined as do { if (!cond) { /* print stuff*/ ; __builtin_unreachable();}} while (0). This is almost identical (modulo the print stuff).
Assume should be #define assume(x) x, i.e. a no-op.
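The two positions in this exchange can be put side by side. A minimal sketch with hypothetical macro names (`ASSUME_UNREACHABLE`, `ASSUME_EVALUATE_ONLY`), assuming a GCC- or Clang-compatible compiler for the first variant; the second is close to the suggested `#define assume(x) x`, with a cast to void added only to silence unused-value warnings.

```c
/* Variant under review: if the condition is false, the path is marked
   unreachable, so reaching it at run time is undefined behaviour. */
#if defined(__GNUC__) || defined(__clang__)
#define ASSUME_UNREACHABLE(expression) \
    do { if (!(expression)) { __builtin_unreachable(); } } while (0)
#else
#define ASSUME_UNREACHABLE(expression) ((void)0)
#endif

/* Suggested alternative: the expression is evaluated and discarded,
   so a false condition has no further effect. */
#define ASSUME_EVALUATE_ONLY(expression) ((void)(expression))

static unsigned low_three_bits_a(unsigned value)
{
    ASSUME_UNREACHABLE(value < 64u);
    return value & 7u;
}

static unsigned low_three_bits_b(unsigned value)
{
    ASSUME_EVALUATE_ONLY(value < 64u);
    return value & 7u;
}
```

When the assumption holds, both functions return the same value; they differ only in what the compiler is allowed to do when it does not.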
Did you try converting these functions to templates?
| private void llvm_arm_hint(int) @safe pure nothrow @nogc; | ||
| } | ||
| } | ||
| else version (GNU) |
There was a problem hiding this comment.
Thanks for taking the time to consider both ldc and gdc here, it's really appreciated
I was just wondering - GCC doesn't support MSVC. So are these GNU version blocks dead on Mingw targets?
There was a problem hiding this comment.
> Thanks for taking the time to consider both ldc and gdc here, it's really appreciated
:)
> So are these GNU version blocks dead on Mingw targets?
I think so. Unless GDC defines CRuntime_Microsoft when targeting MinGW?
I don't really know the story as far as GDC's support of Windows goes. So, I just implemented the GDC versions to save some other poor sod the hassle of doing it later on.
I don't think there's anything preventing this implementation from working when targeting MinGW or Cygwin, if that's wanted?
I did, but no dice, unfortunately: even with all the affected functions being templates, the simplest possible usage still fails to link:
// windows_h_c.c:
#include <importc_msvc_builtins.h>
#include <windows.h>
// windows_h.d:
pragma(lib, "MinCore");
import windows_h_c;
Just have the user add the #include line. |
Perhaps I'm misunderstanding you, but is that not what the current implementation is doing? There, the user |
This PR implements all but three of the MSVC intrinsics listed here: https://web.archive.org/web/20240412171516/https://learn.microsoft.com/en-ie/cpp/intrinsics/alphabetical-listing-of-intrinsic-functions?view=msvc-170, and a handful of undocumented intrinsics.
The three unimplemented intrinsics are _AddressOfReturnAddress, __getcallerseflags, and _ReturnAddress, because they need compiler support for their implementation.
The implemented intrinsics are listed in this expando.
__assume, __umulh, __mulh, _umul128, _mul128, __emul, __emulu, _div128, _udiv128, _div64, _udiv64, __cpuid, __cpuidex,
_cvt_ftoi_fast, _cvt_ftoll_fast, _cvt_ftoui_fast, _cvt_ftoull_fast, _cvt_dtoi_fast, _cvt_dtoll_fast, _cvt_dtoui_fast, _cvt_dtoull_fast,
_cvt_ftoi_sat, _cvt_ftoll_sat, _cvt_ftoui_sat, _cvt_ftoull_sat, _cvt_dtoi_sat, _cvt_dtoll_sat, _cvt_dtoui_sat, _cvt_dtoull_sat,
_cvt_ftoi_sent, _cvt_ftoll_sent, _cvt_ftoui_sent, _cvt_ftoull_sent, _cvt_dtoi_sent, _cvt_dtoui_sent, _cvt_dtoll_sent, _cvt_dtoull_sent,
__readgsbyte, __readgsword, __readgsdword, __readgsqword, __writegsbyte, __writegsword, __writegsdword, __writegsqword,
__addgsbyte, __addgsword, __addgsdword, __addgsqword, __incgsbyte, __incgsword, __incgsdword, __incgsqword,
__readfsbyte, __readfsword, __readfsdword, __readfsqword, __writefsbyte, __writefsword, __writefsdword, __writefsqword,
__addfsbyte, __addfsword, __addfsdword, __incfsbyte, __incfsword, __incfsdword,
__debugbreak, __fastfail, __faststorefence, _disable, _enable,
_interlockedadd [Undocumented], _interlockedadd64 [Undocumented], _InterlockedAdd, _InterlockedAdd_acq, _InterlockedAdd_rel, _InterlockedAdd_nf, _InterlockedAdd64, _InterlockedAdd64_acq, _InterlockedAdd64_rel, _InterlockedAdd64_nf, _InterlockedAddLargeStatistic,
_InterlockedAnd, _InterlockedAnd8, _InterlockedAnd16, _interlockedand64 [Undocumented], _InterlockedAnd_acq, _InterlockedAnd_rel, _InterlockedAnd_nf, _InterlockedAnd8_acq, _InterlockedAnd8_rel, _InterlockedAnd8_nf, _InterlockedAnd16_acq, _InterlockedAnd16_rel, _InterlockedAnd16_nf, _InterlockedAnd64_acq, _InterlockedAnd64_rel, _InterlockedAnd64_nf, _InterlockedAnd64, _InterlockedAnd_np, _InterlockedAnd8_np, _InterlockedAnd16_np, _InterlockedAnd64_np, _InterlockedAnd64_HLEAcquire, _InterlockedAnd64_HLERelease, _InterlockedAnd_HLEAcquire, _InterlockedAnd_HLERelease,
_interlockedbittestandreset, _interlockedbittestandreset64, _interlockedbittestandreset_HLEAcquire, _interlockedbittestandreset_HLERelease, _interlockedbittestandreset64_HLEAcquire, _interlockedbittestandreset64_HLERelease, _interlockedbittestandreset_acq, _interlockedbittestandreset_rel, _interlockedbittestandreset_nf, _interlockedbittestandreset64_acq, _interlockedbittestandreset64_rel, _interlockedbittestandreset64_nf,
_interlockedbittestandset, _interlockedbittestandset64, _interlockedbittestandset_HLEAcquire, _interlockedbittestandset_HLERelease, _interlockedbittestandset64_HLEAcquire, _interlockedbittestandset64_HLERelease, _interlockedbittestandset_acq, _interlockedbittestandset_rel, _interlockedbittestandset_nf, _interlockedbittestandset64_acq, _interlockedbittestandset64_rel, _interlockedbittestandset64_nf,
_InterlockedCompareExchange, _InterlockedCompareExchange8, _InterlockedCompareExchange16, _InterlockedCompareExchange64, _InterlockedCompareExchange_HLEAcquire, _InterlockedCompareExchange_HLERelease, _InterlockedCompareExchange64_HLEAcquire, _InterlockedCompareExchange64_HLERelease, _InterlockedCompareExchange_np, _InterlockedCompareExchange16_np, _InterlockedCompareExchange64_np, _InterlockedCompareExchange_acq, _InterlockedCompareExchange_rel, _InterlockedCompareExchange_nf, _InterlockedCompareExchange8_acq, _InterlockedCompareExchange8_rel, _InterlockedCompareExchange8_nf, _InterlockedCompareExchange16_acq, _InterlockedCompareExchange16_rel, _InterlockedCompareExchange16_nf, _InterlockedCompareExchange64_acq, _InterlockedCompareExchange64_rel, _InterlockedCompareExchange64_nf,
_InterlockedCompareExchange128, _InterlockedCompareExchange128_np, _InterlockedCompareExchange128_acq, _InterlockedCompareExchange128_rel, _InterlockedCompareExchange128_nf,
_InterlockedCompareExchangePointer, _InterlockedCompareExchangePointer_HLEAcquire, _InterlockedCompareExchangePointer_HLERelease, _InterlockedCompareExchangePointer_np, _InterlockedCompareExchangePointer_acq, _InterlockedCompareExchangePointer_rel, _InterlockedCompareExchangePointer_nf,
_InterlockedDecrement, _InterlockedDecrement16, _interlockeddecrement64, _InterlockedDecrement64, _InterlockedDecrement_acq, _InterlockedDecrement_rel, _InterlockedDecrement_nf, _InterlockedDecrement16_acq, _InterlockedDecrement16_rel, _InterlockedDecrement16_nf, _InterlockedDecrement64_acq, _InterlockedDecrement64_rel, _InterlockedDecrement64_nf,
_InterlockedExchange, _InterlockedExchange8, _InterlockedExchange16, _interlockedexchange64, _InterlockedExchange64, _InterlockedExchange_HLEAcquire, _InterlockedExchange_HLERelease, _InterlockedExchange64_HLEAcquire, _InterlockedExchange64_HLERelease, _InterlockedExchange_acq, _InterlockedExchange_rel, _InterlockedExchange_nf, _InterlockedExchange8_acq, _InterlockedExchange8_rel, _InterlockedExchange8_nf, _InterlockedExchange16_acq, _InterlockedExchange16_rel, _InterlockedExchange16_nf, _InterlockedExchange64_acq, _InterlockedExchange64_rel, _InterlockedExchange64_nf,
_InterlockedExchangeAdd, _InterlockedExchangeAdd8, _InterlockedExchangeAdd16, _interlockedexchangeadd64 [Undocumented], _InterlockedExchangeAdd64, _InterlockedExchangeAdd_HLEAcquire, _InterlockedExchangeAdd_HLERelease, _InterlockedExchangeAdd64_HLEAcquire, _InterlockedExchangeAdd64_HLERelease, _InterlockedExchangeAdd_acq, _InterlockedExchangeAdd_rel, _InterlockedExchangeAdd_nf, _InterlockedExchangeAdd8_acq, _InterlockedExchangeAdd8_rel, _InterlockedExchangeAdd8_nf, _InterlockedExchangeAdd16_acq, _InterlockedExchangeAdd16_rel, _InterlockedExchangeAdd16_nf, _InterlockedExchangeAdd64_acq, _InterlockedExchangeAdd64_rel, _InterlockedExchangeAdd64_nf,
_InterlockedExchangePointer, _InterlockedExchangePointer_HLEAcquire, _InterlockedExchangePointer_HLERelease, _InterlockedExchangePointer_acq, _InterlockedExchangePointer_rel, _InterlockedExchangePointer_nf,
_InterlockedIncrement, _InterlockedIncrement16, _interlockedincrement64 [Undocumented], _InterlockedIncrement64, _InterlockedIncrement_acq, _InterlockedIncrement_rel, _InterlockedIncrement_nf, _InterlockedIncrement16_acq, _InterlockedIncrement16_rel, _InterlockedIncrement16_nf, _InterlockedIncrement64_acq, _InterlockedIncrement64_rel, _InterlockedIncrement64_nf,
_InterlockedOr, _InterlockedOr8, _InterlockedOr16, _interlockedor64 [Undocumented], _InterlockedOr_acq, _InterlockedOr_rel, _InterlockedOr_nf, _InterlockedOr8_acq, _InterlockedOr8_rel, _InterlockedOr8_nf, _InterlockedOr16_acq, _InterlockedOr16_rel, _InterlockedOr16_nf, _InterlockedOr64_acq, _InterlockedOr64_rel, _InterlockedOr64_nf, _InterlockedOr64, _InterlockedOr_np, _InterlockedOr8_np, _InterlockedOr16_np, _InterlockedOr64_np, _InterlockedOr64_HLEAcquire, _InterlockedOr64_HLERelease, _InterlockedOr_HLEAcquire, _InterlockedOr_HLERelease,
_InterlockedXor, _InterlockedXor8, _InterlockedXor16, _interlockedxor64 [Undocumented], _InterlockedXor_acq, _InterlockedXor_rel, _InterlockedXor_nf, _InterlockedXor8_acq, _InterlockedXor8_rel, _InterlockedXor8_nf, _InterlockedXor16_acq, _InterlockedXor16_rel, _InterlockedXor16_nf, _InterlockedXor64_acq, _InterlockedXor64_rel, _InterlockedXor64_nf, _InterlockedXor64, _InterlockedXor_np, _InterlockedXor8_np, _InterlockedXor16_np, _InterlockedXor64_np, _InterlockedXor64_HLEAcquire, _InterlockedXor64_HLERelease, _InterlockedXor_HLEAcquire, _InterlockedXor_HLERelease,
__inbyte, __inword, __indword, __outbyte, __outword, __outdword, __inbytestring, __inwordstring, __indwordstring, __outbytestring, __outwordstring, __outdwordstring,
__int2c, __invlpg, __lidt, __ll_lshift, __ll_rshift, __ull_rshift, __lzcnt16, __lzcnt, __lzcnt64,
_mm_cvtsi64x_ss, _mm_cvtss_si64x, _mm_cvttss_si64x, _mm_extract_si64, _mm_extracti_si64, _mm_insert_si64, _mm_inserti_si64, _mm_stream_sd, _mm_stream_ss, _mm_stream_si64x,
__movsb, __movsw, __movsd, __movsq, __noop, __nop, __popcnt16, __popcnt, __popcnt64, __rdtsc, __rdtscp,
__readcr0, __readcr2, __readcr3, __readcr4, __readcr8, __readdr, __readeflags, __readmsr, __readpmc,
__segmentlimit, __shiftleft128, __shiftright128, __sidt, __stosb, __stosw, __stosd, __stosq,
__svm_clgi, __svm_invlpga, __svm_skinit, __svm_stgi, __svm_vmload, __svm_vmrun, __svm_vmsave, __ud2,
__vmx_off, __vmx_on, __vmx_vmclear, __vmx_vmlaunch, __vmx_vmptrld, __vmx_vmptrst, __vmx_vmread, __vmx_vmresume, __vmx_vmwrite,
__wbinvd, __writecr0, __writecr2, __writecr3, __writecr4, __writecr8, __writedr, __writeeflags, __writemsr,
_ReadBarrier, _WriteBarrier, _ReadWriteBarrier,
_BitScanForward, _BitScanReverse, _BitScanForward64, _BitScanReverse64,
_bittest, _bittestandcomplement, _bittestandreset, _bittestandset, _bittest64, _bittestandcomplement64, _bittestandreset64, _bittestandset64,
_byteswap_uint64, _byteswap_ulong, _byteswap_ushort,
_lrotr, _lrotl, _rotr, _rotl, _rotr64, _rotl64, _rotr16, _rotl16, _rotr8, _rotl8
These implementations aim to be as compatible with the MSVC intrinsics as is possible, adhering to Hyrum's Law.
Separate implementations are provided for DMD, LDC, and GDC, and for x86, x86-64, AArch64, and ARM.
Almost all the intrinsics are implemented in D, except for __assume, which is a C macro.
Every intrinsic, where possible, has a CTFE-compatible code-path.
It all compiles with or without DIP1000 being enabled.
Care has been taken to ensure that none of the implementations rely on DRuntime, so that these work in BetterC.
Regarding the _cvt_ family of functions: by default MSVC will generate code that uses SSE2 instructions, even for 32-bit targets, which means that for 32-bit targets the _cvt_ functions will use SSE2.
This is contrary to DMD's usual behaviour of using x87 for 32-bit Windows.
For their reimplementations, I've used SSE2 anyway for 32-bit targets for DMD, as doing otherwise would constitute a change in behaviour, as x87 FP-exceptions are different from SIMD FP-exceptions (and, I think the SSE2 and x87 versions return different results).
The oldest of the _cvt_ functions was introduced in May 2021, so I think it's almost certain that any code using them will be targeting at least SSE2 anyway.
Additionally, I've written a program that tests that the _cvt_ implementations return identical results to the MSVC implementations for all float values, and for ~402,653,184 double values (except for _cvt_ftoi_fast and _cvt_ftoi_sent on 32-bit targets, as they cause an internal compiler error in MSVC); it also tests that ctfeX86RoundLongToFloat and ctfeX86RoundFloatToLong produce the same results as the hardware.
It relies on Phobos and MSVC, so I don't really know what to do with it other than link to it here: https://github.com/just-harry/float-fuzzing-for-msvc-intrinsics
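The "all float values" approach works because a float has only 2^32 bit patterns, so two implementations can be compared exhaustively. A rough C sketch of that idea follows; it is not the linked test program, and `sat_reference` and `sat_candidate` are hypothetical stand-ins (two ways of writing a saturating float-to-int32 conversion), not the PR's or MSVC's `_cvt_` functions.

```c
#include <stdint.h>
#include <string.h>

/* Reinterpret a 32-bit pattern as a float without aliasing issues. */
static float float_from_bits(uint32_t bits)
{
    float value;
    memcpy(&value, &bits, sizeof value);
    return value;
}

/* Stand-in reference: saturating conversion done in float arithmetic. */
static int32_t sat_reference(float x)
{
    if (x != x) return 0;                          /* NaN */
    if (x >= 2147483648.0f) return INT32_MAX;
    if (x <= -2147483648.0f) return INT32_MIN;
    return (int32_t)x;
}

/* Stand-in candidate: same intent, done via a double intermediate. */
static int32_t sat_candidate(float x)
{
    double d = x;
    if (d != d) return 0;                          /* NaN */
    if (d >= 2147483648.0) return INT32_MAX;
    if (d <= -2147483648.0) return INT32_MIN;
    return (int32_t)d;
}

/* Count disagreements over an inclusive range of bit patterns.
   A 64-bit loop counter avoids wrap-around when last_bits is UINT32_MAX. */
static uint64_t count_mismatches(uint32_t first_bits, uint32_t last_bits)
{
    uint64_t mismatches = 0;
    for (uint64_t bits = first_bits; bits <= last_bits; ++bits) {
        float x = float_from_bits((uint32_t)bits);
        if (sat_reference(x) != sat_candidate(x))
            ++mismatches;
    }
    return mismatches;
}
```

Calling `count_mismatches(0, UINT32_MAX)` would cover every float; doubles have 2^64 patterns, which is presumably why only a sample of them is tested.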
A few of the intrinsics can be used only in kernel-mode, so unittests have been omitted for them, as I don't think we have any infrastructure in place for testing in kernel-mode.
Some of the intrinsics terminate the program, so their unittests have been wrapped in a version (none); others rely on specific compiler optimisations, so they too have been wrapped in a version (none).
I've split the intrinsics up into a few dozen commits to try and alleviate the whole 13,000-lines-of-code-all-at-once thing.
(I staged them after-the-fact, so a few braces may be out-of-place in the intermediate commits.)
The intrinsics have been placed in their own files, separate from the existing ImportC builtin files, to try and avoid crowding the builtin files.
Currently, the ImportC builtins are imported conditionally, based on some heuristics.
One of those heuristics is if any identifier beginning with two underscores is used – I've changed that one to instead trigger on a single leading underscore as many of MSVC's intrinsics begin with only one underscore.
One notable header that can be successfully included by ImportC, with these intrinsics implemented, is windows.h.
P.S. If this is outside the purview of ImportC: that's fine, I'll just publish this as a library instead.