NEW API: Fixed shift, Rotation #1164
base: master
Conversation
We already have a similar approach for
, which makes me wonder if it's a good idea. Anyway, I like your idea; further comments inline.
Could you also add sse2's _mm_slli_epi32 and _mm_srai_epi16?
You provided a generic implementation, so other architectures shouldn't be impacted, but I'll still have a look and check if some specific instructions are available.
@@ -34,6 +34,11 @@ namespace xsimd
{ return x << y; },
self, other);
}
template <int shift, class A, class T, class /*=typename std::enable_if<std::is_integral<T>::value, void>::type*/>
Can the shift be negative? If not, we should use an unsigned type for I.
@@ -183,6 +193,12 @@ namespace xsimd
constexpr auto N = std::numeric_limits<T>::digits;
return (self << other) | (self >> (N - other));
}
template <int count, class A, class T>
Same here, we tend to use size_t for this kind of parameter.
include/xsimd/arch/xsimd_avx2.hpp
template <class A, class T, class = typename std::enable_if<std::is_integral<T>::value, void>::type>
XSIMD_INLINE batch<T, A> bitwise_lshift(batch<T, A> const& self, batch<T, A> const& other, requires_arch<avx2>) noexcept
{
    XSIMD_IF_CONSTEXPR(sizeof(T) == 4)
    {
        return _mm256_sllv_epi32(self, other);
    }
#if XSIMD_WITH_AVX512VL
I'm not totally fine with this kind of macro guard. The instruction used should depend on the batch architecture (A) and not on macros. We could have someone wanting an avx2 kernel within an avx512 context, and I have not yet thought about how to handle instruction sets that extend previous-generation instructions. Could you postpone that particular change?
Not saying it's incorrect, it's just that I need to think a bit about the pattern.
I was thinking of using SSEVL and AVXVL as new archs & register types. Then from AVX512 we forward to those. We also change make_sized_batch_t to return those instead of plain sse, avx. This is useful to me as I use custom-sized batches even with AVX512 to avoid padding. It should make things faster, especially in sse, as it removes vzeroupper if I am not mistaken.
@@ -223,7 +223,6 @@ namespace xsimd
}
}
}
?
My mistake, spurious change. I added an API there and then deleted it.
@@ -244,6 +244,17 @@
#define XSIMD_WITH_AVX512DQ 0
#endif

/**
This should be in an independent commit, preferably with a test case that catches the scenario we missed.
1. Adds the API bitwise_[l|r]shift<N>(...) and rot[l|r]<N>(...)
2. Updates the test to use the API
3. Updates documentation

Force-pushed from 4ab1d66 to ee37f7b
I implemented xoshiro (vectorized); with the current API it looks like this:

and with AVX2 achieves 393.06 M samples/s.

The new API is:

and with AVX2 achieves 423.76 M samples/s.

With this change causing a 7% speed increase, xoshiro is the fastest RNG among the ones I tested. Otherwise, PCG64 beats it.