
Conversation

Contributor

@DiamonDinoia DiamonDinoia commented Aug 29, 2025

  1. adds the API bitwise_[l|r]shift(...) and rot[l|r](...)
  2. updates the test to use the API
  3. fixes documentation

I implemented xoshiro (vectorized); with the current API it looks like this:

    const auto result = xsimd::rotl(m_state[0] + m_state[3], 23) + m_state[0];
    const auto t = m_state[1] << 17;

    m_state[2] ^= m_state[0];
    m_state[3] ^= m_state[1];
    m_state[1] ^= m_state[2];
    m_state[0] ^= m_state[3];

    m_state[2] ^= t;

    m_state[3] = xsimd::rotl(m_state[3], 45);

    return result;

and with AVX2 achieves 393.06 M samples/s
The new API is:

    const auto result = xsimd::rotl<23>(m_state[0] + m_state[3]) + m_state[0];
    const auto t = xsimd::bitwise_lshift<17>(m_state[1]);

    m_state[2] ^= m_state[0];
    m_state[3] ^= m_state[1];
    m_state[1] ^= m_state[2];
    m_state[0] ^= m_state[3];

    m_state[2] ^= t;

    m_state[3] = xsimd::rotl<45>(m_state[3]);

    return result;

and with AVX2 achieves 423.76 M samples/s

This change yields a 7% speed increase, which makes xoshiro the fastest RNG among the ones I tested. Without it, PCG64 beats it.
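The gain comes from the shift amount being a template parameter rather than a runtime value, so the amount can be validated at compile time and backends can emit immediate-form instructions. A minimal scalar stand-in for the fixed-shift entry point (the name mirrors the PR, but this is not the actual xsimd signature):

```cpp
#include <cassert>
#include <cstdint>
#include <limits>
#include <type_traits>

// Scalar sketch of a fixed-shift API; illustrative only, not xsimd code.
template <int Shift, class T,
          class = typename std::enable_if<std::is_integral<T>::value>::type>
constexpr T bitwise_lshift(T x) noexcept
{
    // Because Shift is a compile-time constant, out-of-range values are
    // rejected at compile time instead of being undefined at run time.
    static_assert(Shift >= 0 && Shift < std::numeric_limits<T>::digits,
                  "shift amount out of range");
    return static_cast<T>(x << Shift);
}
```

An out-of-range call such as `bitwise_lshift<64>(std::uint64_t(1))` then fails to compile rather than silently misbehaving.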

@DiamonDinoia DiamonDinoia changed the title NEW:API Fixed shift, Rotation NEW API: Fixed shift, Rotation Aug 29, 2025
@serge-sans-paille
Contributor

We already have a similar approach for rotate_left and rotate_right, slide_left and slide_right, so that looks good.
We also have insert that has a different API:

    template <class T, class A, size_t I>
    XSIMD_INLINE batch<T, A> insert(batch<T, A> const& x, T val, index<I> pos) noexcept

which makes me wonder if it's a good idea.

Anyway I like your idea, further comments inline.
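The two styles can be contrasted in a self-contained sketch (dummy `get` helpers for illustration, not xsimd code): `insert()` passes the constant through a tag object, while the proposed shifts take it as an explicit template parameter.

```cpp
#include <cassert>
#include <cstddef>

// Minimal stand-in for xsimd's index<I> tag type.
template <std::size_t I>
struct index {};

// Tag-argument style, as in insert(): the constant rides in a tag object.
template <class T, std::size_t N, std::size_t I>
T get(T const (&arr)[N], index<I>) noexcept
{
    static_assert(I < N, "out of range");
    return arr[I];
}

// Explicit-template-parameter style, as proposed for the shifts.
template <std::size_t I, class T, std::size_t N>
T get(T const (&arr)[N]) noexcept
{
    static_assert(I < N, "out of range");
    return arr[I];
}
```

Call sites read `get(arr, index<2>{})` in the first style and `get<2>(arr)` in the second; both resolve the constant at compile time, so the difference is purely one of API convention.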

Contributor

@serge-sans-paille serge-sans-paille left a comment


Could you also add SSE2's _mm_slli_epi32 and _mm_srai_epi16?
You provided a generic implementation, so other architectures shouldn't be impacted, but I'll still have a look and check whether specific instructions are available.
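For reference, the two SSE2 intrinsics mentioned could slot into the sizeof-based dispatch roughly like this (the wrapper names are hypothetical; only the intrinsics are real):

```cpp
#include <cassert>
#include <cstdint>
#include <emmintrin.h> // SSE2

// Hypothetical wrappers sketching the SSE2 specializations.
inline __m128i lshift_epi32(__m128i x, int count) noexcept
{
    return _mm_slli_epi32(x, count); // logical left shift, 32-bit lanes
}

inline __m128i arith_rshift_epi16(__m128i x, int count) noexcept
{
    return _mm_srai_epi16(x, count); // arithmetic right shift, 16-bit lanes
}
```

Note that _mm_srai_epi16 is an arithmetic shift (it replicates the sign bit), so it would back the signed-integer right-shift path, not the logical one.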

@@ -34,6 +34,11 @@ namespace xsimd
{ return x << y; },
self, other);
}
template <int shift, class A, class T, class /*=typename std::enable_if<std::is_integral<T>::value, void>::type*/>
Contributor


Can the shift be negative? If not, we should use an unsigned type for the parameter.

@@ -183,6 +193,12 @@ namespace xsimd
constexpr auto N = std::numeric_limits<T>::digits;
return (self << other) | (self >> (N - other));
}
template <int count, class A, class T>
Contributor


Same here: we tend to use size_t for this kind of parameter.
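A size_t parameter also lets the zero-count edge case be handled at compile time: in the runtime formula above, other == 0 makes `self >> (N - other)` a full-width shift, which is undefined for scalar integers. A scalar sketch of the size_t variant (illustrative, not xsimd's implementation):

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Scalar sketch of a size_t-parameterized rotl. The Count % N reduction
// assumes an unsigned T so that N is the full (power-of-two) bit width,
// and the ternary avoids the undefined full-width shift when the reduced
// count is zero.
template <std::size_t Count, class T>
constexpr T rotl(T x) noexcept
{
    constexpr std::size_t N = std::numeric_limits<T>::digits;
    static_assert((N & (N - 1)) == 0, "width must be a power of two");
    return (Count % N) == 0
        ? x
        : static_cast<T>((x << (Count % N)) | (x >> (N - Count % N)));
}
```

With a compile-time Count, all of this folds away; the runtime batch version has to pay for the guard (or document the precondition) on every call.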

    template <class A, class T, class = typename std::enable_if<std::is_integral<T>::value, void>::type>
    XSIMD_INLINE batch<T, A> bitwise_lshift(batch<T, A> const& self, batch<T, A> const& other, requires_arch<avx2>) noexcept
    {
        XSIMD_IF_CONSTEXPR(sizeof(T) == 4)
        {
            return _mm256_sllv_epi32(self, other);
        }
    #if XSIMD_WITH_AVX512VL
Contributor


I'm not totally fine with this kind of macro guard. The instruction used should depend on the batch architecture (A), not on macros. Someone might want an avx2 kernel within an avx512 context, and I have not yet thought about how to handle instruction sets that extend previous-generation instructions. Could you postpone that particular change?

I'm not saying it's incorrect; it's just that I need to think a bit about the pattern.
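The alternative being pointed at is to let overload resolution on the arch tag pick the instruction, so preprocessor state never decides which kernel runs. A minimal sketch with dummy tags (not xsimd's real arch types):

```cpp
#include <cassert>
#include <string>

// Dummy architecture tags standing in for xsimd's arch hierarchy.
struct avx2 {};
struct avx512 {};

template <class A>
struct requires_arch {};

// Each overload is selected by the tag the caller passes, so an avx2
// kernel can still be requested even when AVX-512 support is compiled in.
inline std::string kernel(requires_arch<avx2>) { return "avx2 path"; }
inline std::string kernel(requires_arch<avx512>) { return "avx512 path"; }
```

With macro guards, by contrast, the avx2 overload itself changes meaning depending on how the translation unit was compiled, which is the pattern the reviewer wants to avoid.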

Contributor Author

@DiamonDinoia DiamonDinoia Aug 30, 2025


I was thinking of adding SSEVL and AVXVL as new archs & register types, then forwarding to those from AVX512. We would also change make_sized_batch_t to return those instead of plain sse/avx. This is useful to me since I use custom-sized batches even with AVX512 to avoid padding. It should make things faster, especially in SSE, as it removes vzeroupper if I am not mistaken.

@@ -223,7 +223,6 @@ namespace xsimd
}
}
}

Contributor


?

Contributor Author


My mistake, a spurious change. I added an API there and then deleted it.

@@ -244,6 +244,17 @@
#define XSIMD_WITH_AVX512DQ 0
#endif

/**
Contributor


This should be in an independent commit, preferably with a test case that catches the scenario we missed.

1. Adds the API bitwise_[l|r]shift<N>(...) and rot[l|r]<N>(...)
2. Updates the test to use the API
3. Updates documentation