vectoreyes/
lib.rs

1#![deny(missing_docs)]
2#![allow(unsafe_op_in_unsafe_fn)]
3//! VectorEyes is a (almost entirely) safe and cross-platform wrapper library around vectorized
4//! operations.
5//!
6//! While a normal `add` CPU instruction will add two numbers together, a
7//! [SIMD/Vectorized](https://en.wikipedia.org/wiki/Single_instruction,_multiple_data) `add`
8//! instruction will perform multiple additions from the same instruction. This will amortize the
9//! per-instruction cost (e.g. of the CPU decoding the instruction) across all the additions of the
10//! single instruction. This can provide large speed boosts on many platforms.
11//!
12//! Unfortunately, using these operations require using per-platform unsafe intrinsics. To make
13//! this easier, VectorEyes provide safe functions which will function identically on all
14//! platforms.
15//!
16//! The core of this crate is vector types (such as [`U64x2`]). You can think of vectors as arrays
17//! with some extra SIMD operations on top.
18//!
19//! Just like arrays vectors have an element type ([`u64`] in the example above), and an element
20//! count, frequently referred to as _lanes_ (2 in the above example).
21//!
22//! In fact, you can freely convert between arrays and vectors!
23//!
24//! ```
25//! # use vectoreyes::*;
26//! // These two represent the same thing.
27//! let vector_form = U64x2::from([123_u64, 456_u64]);
28//! let array_form: [u64; 2] = vector_form.into();
29//! ```
30//!
31//! However, the vector form has _special SIMD powers_! These two functions perform the same
32//! operation, but the SIMD variant may[^may_be_faster] take better advantage of the CPU hardware.
33//!
34//! [^may_be_faster]: As always, only a Sith deals in absolutes. The Rust compiler can, in some
35//! cases, employ _autovectorization_ to compile code which doesn't use SIMD operations into code
36//! which uses SIMD instructions. Unfortunately, the compiler can't always autovectorize the way we
37//! want it to, which is why VectorEyes exists!
38//!
39//! While normal _bog-standard_ arrays don't implement the `+` operator, our vector types do!
40//! Adding two vectors together performs pairwise addition, using (for the vector backends) a
41//! single CPU instruction!
42//!
43//! ```
44//! # use vectoreyes::*;
45//! fn double_without_simd(arr: [u64; 2]) -> [u64; 2] {
46//!     [arr[0] + arr[0], arr[1] + arr[1]]
47//! }
48//! fn double_with_simd(arr: U64x2) -> U64x2 {
49//!     arr + arr
50//! }
51//! assert_eq!(
52//!     U64x2::from(double_without_simd([1, 2])),
53//!     double_with_simd(U64x2::from([1, 2])),
54//! );
55//! ```
56//!
57//! The documentation for every method on a vector (e.g. [`I64x2::and_not`]) lists the equivalent
58//! scalar code, as well as information on how the operation is implemented on each backend.
59//!
60//! # Vector Sizes
61//! There aren't vector types for every conceivable `(type, element count)` pair. Instead, we have
62//! vector types that correspond to the vector registers that many CPUs have. Because these
63//! registers are 128- or 256-bits wide, we choose vector types which also have this size. For
64//! example, there's a [`U64x2`] type and a [`U32x4`] type, since both are 128-bits wide. But
65//! there's no `U32x2` type, because that'd only be 64-bits wide.
66//!
67//!
68//! # Backends
69//! VectorEyes chooses what backend to execute vector operations with at compile-time.
70//!
71//! ## AVX2
72//! x86-64 CPUs that support the `AVX`, `AVX2`, `SSE4.1`, `AES`, `SSE4.2`, and
73//! `PCLMULQDQ` features will use the `AVX2` backend.
74//!
75//! ## Neon
76//! This is available on aarch64/arm64 machines with `neon` and `aes` features.
77//!
78//! ## Scalar
79//! This is a fallback implementation that works on all CPUs. It's not
80//! particularly performant.
81//!
82//! # Cargo Configuration
83//! If using VectorEyes from the `swanky` repo, all this configuration has already been done for
84//! you!
85//! ## Native CPU Setup
86//! Compile on the machine that you'll be running your code on, and add the
87//! following to your `.cargo/config` file:
88//! ```toml
89//! [build]
90//! rustflags = ["-C", "target-cpu=native", "--cfg=vectoreyes-target-cpu-native"]
91//! rustdocflags = ["-C", "target-cpu=native", "--cfg=vectoreyes-target-cpu-native"]
92//! ```
93//! ## Specific CPU Selection
94//! If you want to compile for some specific CPU, add the following to your
95//! `.cargo/config` file:
96//! ```toml
97//! [build]
98//! rustflags = ["-C", "target-cpu=TARGET", "--cfg=vectoreyes-target-cpu=\"TARGET\""]
99//! rustdocflags = ["-C", "target-cpu=TARGET", "--cfg=vectoreyes-target-cpu=\"TARGET\""]
100//! ```
101//! ## Maximal Compatibility
102//! If you do not put any of the above in your `.cargo/config` file,
103//! `vectoreyes` will always use its `scalar` backend, which does not use vector
104//! instructions.
105//!
106//! # Limitations
107//! VectorEyes was designed around the AVX2 backend. For example, shuffle operations tend to be
108//! constrained to 128-bit lanes because that's how the Intel intrinsics are constrained. As a
109//! result, while code that uses VectorEyes might be optimal for an Intel platform, it might not be
110//! optimal for an ARM platform with different intrinsics. (This is a limitation, generally, with
111//! cross-platform SIMD libraries like VectorEyes.)
112//!
113//! In addition, many SIMD intrinsics are currently not wrapped in VectorEyes.
114
115use std::ops::*;
116
117/// What backend will be used when targeting the current CPU?
118#[non_exhaustive]
119#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
120pub enum VectorBackend {
121    /// The fallback scalar backend (doesn't use vector instructions).
122    Scalar,
123    /// A vector backend targeting [AVX2](https://en.wikipedia.org/wiki/Advanced_Vector_Extensions#Advanced_Vector_Extensions_2).
124    Avx2,
125    /// A vector backend targeting [ARM Neon](https://developer.arm.com/Architectures/Neon).
126    Neon,
127}
128
129/// The vector backend that this process is using.
130pub const VECTOR_BACKEND: VectorBackend = current_vector_backend();
131
132/// Panic if the current binary uses features unsupported by the current CPU.
133///
134/// `vectoreyes` uses compile-time flags to select which backend to use and which CPU features to
135/// require. If this backend is used on an unsupported CPU, it will result in an "Illegal
136/// instruction" error (technically, _all_ Rust code--not even just `vectoreyes` code--may result in
137/// undefined behavior if run on a CPU that doesn't support the compile-time selected feature
138/// flags).
139///
140/// It would be advisable to call this in the `main()` function of executables to try to catch
141/// these errors early.
142pub fn assert_cpu_features() {
143    vector_backend_check_cpu()
144}
145
146/// A scalar that can live in the lane of a vector.
147pub trait Scalar:
148    'static
149    + std::fmt::Debug
150    + num_traits::PrimInt
151    + num_traits::WrappingAdd
152    + num_traits::WrappingSub
153    + num_traits::WrappingMul
154    + subtle::ConstantTimeEq
155    + subtle::ConditionallySelectable
156{
157    /// A scalar of the same width as this scalar, but signed.
158    type Signed: Scalar;
159    /// A scalar of the same width as this scalar, but unsigned.
160    type Unsigned: Scalar;
161
162    /// A scalar of the same sign as this scalar, but with width 8.
163    type SameSign8: Scalar<Signed = i8, Unsigned = u8>;
164    /// A scalar of the same sign as this scalar, but with width 16.
165    type SameSign16: Scalar<Signed = i16, Unsigned = u16>;
166    /// A scalar of the same sign as this scalar, but with width 32.
167    type SameSign32: Scalar<Signed = i32, Unsigned = u32>;
168    /// A scalar of the same sign as this scalar, but with width 64.
169    type SameSign64: Scalar<Signed = i64, Unsigned = u64>;
170}
171macro_rules! scalar_impls {
172    ($(($s:ty, $u:ty)),*) => {$(
173        impl Scalar for $s {
174            type Signed = $s;
175            type Unsigned = $u;
176
177            type SameSign8 = i8;
178            type SameSign16 = i16;
179            type SameSign32 = i32;
180            type SameSign64 = i64;
181        }
182        impl Scalar for $u {
183            type Signed = $s;
184            type Unsigned = $u;
185
186            type SameSign8 = u8;
187            type SameSign16 = u16;
188            type SameSign32 = u32;
189            type SameSign64 = u64;
190        }
191    )*};
192}
193scalar_impls!((i64, u64), (i32, u32), (i16, u16), (i8, u8));
194/// A vector equivalent to `[T; Self::Lanes]`.
195///
196/// # Representation
197/// This type should have the same size as `[T; Self::Lanes]`, though it may have increased
198/// alignment requirements.
199///
200/// # Effects of signedness on shift operations
201/// When `Scalar` is _signed_, shift operations are signed shifts. When `Scalar` is _unsigned_,
202/// shift operations are unsigned shifts.
203///
204/// ## Example
205/// A signed shift right will add the sign bit
206/// ```
207/// # use vectoreyes::*;
208/// assert_eq!(
209///     U64x2::from([0xffffffffffffffff, 0x2]) >> 1,
210///     U64x2::from([0x7fffffffffffffff, 0x1]),
211/// );
212/// assert_eq!(
213///     // Because the sign bit of 0xffffffffffffffff is 1, shifting right will cause a 1 to be
214///     // inserted which, in this case, results in the same 0xffffffffffffffff value.
215///     U64x2::from(I64x2::from(U64x2::from([0xffffffffffffffff, 0x2])) >> 1),
216///     U64x2::from([0xffffffffffffffff, 0x1]),
217/// );
218/// ```
219pub trait SimdBase:
220    'static
221    + Sized
222    + Clone
223    + Copy
224    + Sync
225    + Send
226    + std::fmt::Debug
227    + PartialEq
228    + Eq
229    + Default
230    + bytemuck::Pod
231    + bytemuck::Zeroable
232    + BitXor
233    + BitXorAssign
234    + BitOr
235    + BitOrAssign
236    + BitAnd
237    + BitAndAssign
238    + AddAssign
239    + Add<Output = Self>
240    + SubAssign
241    + Sub<Output = Self>
242    + ShlAssign<u64>
243    + Shl<u64, Output = Self>
244    + ShrAssign<u64>
245    + Shr<u64, Output = Self>
246    + ShlAssign<Self>
247    + Shl<Self, Output = Self>
248    + ShrAssign<Self>
249    + Shr<Self, Output = Self>
250    + subtle::ConstantTimeEq
251    + subtle::ConditionallySelectable
252    + AsRef<[Self::Scalar]>
253    + AsMut<[Self::Scalar]>
254{
255    /// The number of elements of this vector.
256    const LANES: usize;
257
258    /// The equivalent array type of this vector.
259    type Array: 'static
260        + Sized
261        + Clone
262        + Copy
263        + Sync
264        + Send
265        + std::fmt::Debug
266        + bytemuck::Pod
267        + bytemuck::Zeroable
268        + PartialEq
269        + Eq
270        + Default
271        + std::hash::Hash
272        + AsRef<[Self::Scalar]>
273        + From<Self>
274        + Into<Self>;
275
276    /// The scalar that this value holds.
277    type Scalar: Scalar;
278    /// The signed version of this vector.
279    type Signed: SimdBase<Scalar = <<Self as SimdBase>::Scalar as Scalar>::Signed>
280        + From<Self>
281        + Into<Self>;
282    /// The unsigned version of this vector.
283    type Unsigned: SimdBase<Scalar = <<Self as SimdBase>::Scalar as Scalar>::Unsigned>
284        + From<Self>
285        + Into<Self>;
286
287    /// A vector where every element is zero.
288    const ZERO: Self;
289    /// Is `self == Self::ZERO`?
290    ///
291    /// # Example
292    /// ```
293    /// # use vectoreyes::*;
294    /// assert!(U32x4::from([0, 0, 0, 0]).is_zero());
295    /// assert!(!U32x4::from([1, 0, 0, 0]).is_zero());
296    /// ```
297    fn is_zero(&self) -> bool;
298
299    /// Create a new vector by setting element 0 to `value`, and the rest of the elements to `0`.
300    ///
301    /// # Example
302    /// ```
303    /// # use vectoreyes::*;
304    /// assert_eq!(U32x4::from([64, 0, 0, 0]), U32x4::set_lo(64));
305    /// ````
306    fn set_lo(value: Self::Scalar) -> Self;
307
308    /// Create a new vector by setting every element to `value`.
309    ///
310    /// # Example
311    /// ```
312    /// # use vectoreyes::*;
313    /// assert_eq!(U32x4::from([64, 64, 64, 64]), U32x4::broadcast(64));
314    /// ````
315    fn broadcast(value: Self::Scalar) -> Self;
316
317    /// A vector of `[Self::Scalar; 128 / (8 * std::mem::size_of::<Self::Scalar>())]`
318    type BroadcastLoInput: SimdBase<Scalar = Self::Scalar>;
319    /// Create a vector by setting every element to element 0 of `of`.
320    ///
321    /// # Example
322    /// ```
323    /// # use vectoreyes::*;
324    /// assert_eq!(U32x4::from([1, 1, 1, 1]), U32x4::broadcast_lo(U32x4::from([1, 2, 3, 4])));
325    /// ````
326    fn broadcast_lo(of: Self::BroadcastLoInput) -> Self;
327
328    /// Get the `I`-th element of this vector.
329    ///
330    /// # Example
331    /// ```
332    /// # use vectoreyes::*;
333    /// let v = U32x4::from([1, 2, 3, 4]);
334    /// assert_eq!(v.extract::<0>(), 1);
335    /// assert_eq!(v.extract::<1>(), 2);
336    /// assert_eq!(v.extract::<2>(), 3);
337    /// assert_eq!(v.extract::<3>(), 4);
338    /// ````
339    fn extract<const I: usize>(&self) -> Self::Scalar;
340
341    /// Convert the vector to an array.
342    #[inline(always)]
343    fn as_array(&self) -> Self::Array {
344        (*self).into()
345    }
346
347    /// Shift each element left by `BITS`.
348    ///
349    /// # Example
350    /// ```
351    /// # use vectoreyes::*;
352    /// assert_eq!(U32x4::from([1, 2, 3, 4]).shift_left::<1>(), U32x4::from([2, 4, 6, 8]));
353    /// ````
354    fn shift_left<const BITS: usize>(&self) -> Self;
355    /// Shift each element right by `BITS`.
356    /// # Effects of Signedness
357    /// When `T` is _signed_, this will shift in sign bits, as opposed to zeroes.
358    ///
359    /// # Example
360    /// ```
361    /// # use vectoreyes::*;
362    /// assert_eq!(U32x4::from([1, 2, 3, 4]).shift_right::<1>(), U32x4::from([0, 1, 1, 2]));
363    /// assert_eq!(I32x4::from([-1, -2, -3, -4]).shift_right::<1>(), I32x4::from([-1, -1, -2, -2]));
364    /// ````
365    fn shift_right<const BITS: usize>(&self) -> Self;
366
367    /// Compute `self & (! other)`.
368    ///
369    /// # Example
370    /// ```
371    /// # use vectoreyes::*;
372    /// assert_eq!(
373    ///     U64x2::from([0b11, 0b00]).and_not(U64x2::from([0b10, 0b10])),
374    ///     U64x2::from([0b01, 0b00]),
375    /// );
376    /// ````
377    fn and_not(&self, other: Self) -> Self;
378
379    /// Create a vector where each element is all 1's if the elements are equal, and all 0's otherwise.
380    ///
381    /// # Example
382    /// ```
383    /// # use vectoreyes::*;
384    /// assert_eq!(
385    ///     U64x2::from([1, 2]).cmp_eq(U64x2::from([1, 3])),
386    ///     U64x2::from([0xffffffffffffffff, 0]),
387    /// );
388    /// ````
389    fn cmp_eq(&self, other: Self) -> Self;
390    /// Create a vector where each element is all 1's if the element of `self` is greater than the
391    /// corresponding element of `other`, and all 0's otherwise.
392    ///
393    /// # Example
394    /// ```
395    /// # use vectoreyes::*;
396    /// assert_eq!(
397    ///     U64x2::from([1, 28]).cmp_gt(U64x2::from([1, 3])),
398    ///     U64x2::from([0, 0xffffffffffffffff]),
399    /// );
400    /// ````
401    fn cmp_gt(&self, other: Self) -> Self;
402
403    /// Interleave the elements of the low half of `self` and `other`.
404    ///
405    /// # Example
406    /// ```
407    /// # use vectoreyes::*;
408    /// assert_eq!(
409    ///     U32x4::from([101, 102, 103, 104]).unpack_lo(U32x4::from([201, 202, 203, 204])),
410    ///     U32x4::from([101, 201, 102, 202]),
411    /// );
412    /// ````
413    fn unpack_lo(&self, other: Self) -> Self;
414    /// Interleave the elements of the high half of `self` and `other`.
415    ///
416    /// # Example
417    /// ```
418    /// # use vectoreyes::*;
419    /// assert_eq!(
420    ///     U32x4::from([101, 102, 103, 104]).unpack_hi(U32x4::from([201, 202, 203, 204])),
421    ///     U32x4::from([103, 203, 104, 204]),
422    /// );
423    /// ````
424    fn unpack_hi(&self, other: Self) -> Self;
425
426    /// Make a vector consisting of the maximum elements of `self` and other.
427    ///
428    /// # Example
429    /// ```
430    /// # use vectoreyes::*;
431    /// assert_eq!(
432    ///     U32x4::from([1, 2, 3, 4]).max(U32x4::from([0, 9, 0, 0])),
433    ///     U32x4::from([1, 9, 3, 4]),
434    /// );
435    /// ````
436    fn max(&self, other: Self) -> Self;
437    /// Make a vector consisting of the minimum elements of `self` and other.
438    ///
439    /// # Example
440    /// ```
441    /// # use vectoreyes::*;
442    /// assert_eq!(
443    ///     U32x4::from([1, 2, 3, 4]).min(U32x4::from([0, 9, 0, 0])),
444    ///     U32x4::from([0, 2, 0, 0]),
445    /// );
446    /// ````
447    fn min(&self, other: Self) -> Self;
448}
449
450/// A vector supporting the gather operation (indexing into an array using indices from a vector).
451pub trait SimdBaseGatherable<IV: SimdBase>: SimdBase {
452    /// Construct a vector by accessing values at `base + indices[i]`.
453    ///
454    /// # Safety
455    /// This operation is safe if `std::ptr::read(base.add(indices[i]))` is safe for all `i`.
456    ///
457    /// # Example
458    /// ```
459    /// # use vectoreyes::*;
460    /// let arr: Vec<i32> = (0..=1024).map(|x| x + 1).collect();
461    /// let out = unsafe {
462    ///     // SAFETY: All the indices are within bounds.
463    ///     I32x4::gather(arr.as_ptr(), U64x4::from([32, 647, 827, 920]))
464    /// };
465    /// assert_eq!(out, I32x4::from([33, 648, 828, 921]));
466    /// ```
467    unsafe fn gather(base: *const Self::Scalar, indices: IV) -> Self;
468    /// Construct a vector by accessing values at `base + indices[i]`, if the mask's MSB is set.
469    /// Else return `src[i]`.
470    ///
471    /// # Safety
472    /// This operation is safe if `std::ptr::read(base.add(indices[i]))` is safe for all `i`.
473    ///
474    /// # Example
475    /// ```
476    /// # use vectoreyes::*;
477    /// let arr: Vec<i32> = (0..=1024).map(|x| x + 1).collect();
478    /// let out = unsafe {
479    ///     // SAFETY: All the indices are within bounds.
480    ///     I32x4::gather_masked(
481    ///         arr.as_ptr(),
482    ///         U64x4::from([32, 647, 827, 920]),
483    ///         I32x4::from([-1, -1, 0, 0]),
484    ///         I32x4::from([1, 2, 3, 4]),
485    ///     )
486    /// };
487    /// assert_eq!(out, I32x4::from([33, 648, 3, 4]));
488    /// ```
489    unsafe fn gather_masked(base: *const Self::Scalar, indices: IV, mask: Self, src: Self) -> Self;
490}
491
492/// A vector containing 4 lanes.
493pub trait SimdBase4x: SimdBase {
494    /// If `Bi` is true, then that lane will be filled by `if_true`. Otherwise the lane
495    /// will be filled from `self`.
496    ///
497    /// # Example
498    /// ```
499    /// # use vectoreyes::*;
500    /// assert_eq!(
501    ///     U64x4::from([11, 12, 13, 14])
502    ///         .blend::<true, true, true, false>(U64x4::from([21, 22, 23, 24])),
503    ///     U64x4::from([11, 22, 23, 24]),
504    /// );
505    /// ````
506    fn blend<const B3: bool, const B2: bool, const B1: bool, const B0: bool>(
507        &self,
508        if_true: Self,
509    ) -> Self;
510}
511
512/// A vector containing 8 lanes.
513pub trait SimdBase8x: SimdBase {
514    /// If `Bi` is true, then that lane will be filled by `if_true`. Otherwise the lane
515    /// will be filled from `self`.
516    ///
517    /// # Example
518    /// ```
519    /// # use vectoreyes::*;
520    /// assert_eq!(
521    ///     U32x8::from([11, 12, 13, 14, 15, 16, 17, 18])
522    ///         .blend::<true, true, true, false, false, true, true, false>(
523    ///             U32x8::from([21, 22, 23, 24, 25, 26, 27, 28])),
524    ///     U32x8::from([11, 22, 23, 14, 15, 26, 27, 28]),
525    /// );
526    /// ````
527    fn blend<
528        const B7: bool,
529        const B6: bool,
530        const B5: bool,
531        const B4: bool,
532        const B3: bool,
533        const B2: bool,
534        const B1: bool,
535        const B0: bool,
536    >(
537        &self,
538        if_true: Self,
539    ) -> Self;
540}
541
542/// A vector supporting saturating arithmetic on each entry.
543///
544/// Saturating operations clamp their outputs to the scalar's maximum or minimum value on
545/// overflow/underflow.
546pub trait SimdSaturatingArithmetic: SimdBase {
547    /// Pairwise add vectors. On overflow, the entry's value goes to the maximum scalar value.
548    fn saturating_add(&self, other: Self) -> Self;
549    /// Pairwise add vectors. On overflow, the entry's value goes to the minimum scalar value.
550    fn saturating_sub(&self, other: Self) -> Self;
551}
552
553/// A vector containing 8-bit values.
554pub trait SimdBase8: SimdBase + SimdSaturatingArithmetic
555where
556    Self::Scalar: Scalar<Unsigned = u8, Signed = i8>,
557{
558    /// Split the vector into groups of 16 bytes. Within each group, shift the _entire_ bytes left
559    /// by `AMOUNT`.
560    ///
561    /// # Example
562    /// ```
563    /// # use vectoreyes::*;
564    /// assert_eq!(
565    ///     U8x16::from([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16]).shift_bytes_left::<1>(),
566    ///     U8x16::from([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]),
567    /// );
568    /// ```
569    fn shift_bytes_left<const AMOUNT: usize>(&self) -> Self;
570    /// Split the vector into groups of 16 bytes. Within each group, shift the _entire_ bytes right
571    /// by `AMOUNT`.
572    ///
573    /// # Example
574    /// ```
575    /// # use vectoreyes::*;
576    /// assert_eq!(
577    ///     U8x16::from([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16]).shift_bytes_right::<1>(),
578    ///     U8x16::from([2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 0]),
579    /// );
580    /// ```
581    fn shift_bytes_right<const AMOUNT: usize>(&self) -> Self;
582    /// Get the sign/most significant bits of the elements of the vector.
583    ///
584    /// # Example
585    /// ```
586    /// # use vectoreyes::*;
587    /// assert_eq!(
588    ///     (U8x16::from([0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1]) << 7).most_significant_bits(),
589    ///     0b1111001001010000,
590    /// );
591    /// ```
592    fn most_significant_bits(&self) -> u32;
593}
594
595/// A vector containing 16-bit values.
596pub trait SimdBase16: SimdBase + SimdSaturatingArithmetic
597where
598    Self::Scalar: Scalar<Unsigned = u16, Signed = i16>,
599{
600    /// Shuffle within the lower 64-bits of each 128-bit subvector.
601    ///
602    /// # Example
603    /// ```
604    /// # use vectoreyes::*;
605    /// assert_eq!(
606    ///     U16x16::from([
607    ///         0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15
608    ///     ]).shuffle_lo::<0, 1, 1, 3>(),
609    ///     U16x16::from([
610    ///         3, 1, 1, 0, 4, 5, 6, 7, 11, 9, 9, 8, 12, 13, 14, 15
611    ///     ]),
612    /// );
613    /// ```
614    fn shuffle_lo<const I3: usize, const I2: usize, const I1: usize, const I0: usize>(
615        &self,
616    ) -> Self;
617    /// Shuffle within the upper 64-bits of each 128-bit subvector.
618    ///
619    /// # Example
620    /// ```
621    /// # use vectoreyes::*;
622    /// assert_eq!(
623    ///     U16x16::from([
624    ///         0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15
625    ///     ]).shuffle_hi::<0, 1, 1, 3>(),
626    ///     U16x16::from([
627    ///         0, 1, 2, 3, 7, 5, 5, 4, 8, 9, 10, 11, 15, 13, 13, 12
628    ///     ]),
629    /// );
630    /// ```
631    fn shuffle_hi<const I3: usize, const I2: usize, const I1: usize, const I0: usize>(
632        &self,
633    ) -> Self;
634}
635
636/// A vector containing 32-bit values.
637pub trait SimdBase32: SimdBase
638where
639    Self::Scalar: Scalar<Unsigned = u32, Signed = i32>,
640{
641    /// Shuffle within 128-bit subvector.
642    ///
643    /// # Example
644    /// ```
645    /// # use vectoreyes::*;
646    /// assert_eq!(
647    ///     U32x8::from([
648    ///         0, 1, 2, 3, 4, 5, 6, 7
649    ///     ]).shuffle::<0, 1, 1, 3>(),
650    ///     U32x8::from([
651    ///         3, 1, 1, 0, 7, 5, 5, 4
652    ///     ]),
653    /// );
654    /// ```
655    fn shuffle<const I3: usize, const I2: usize, const I1: usize, const I0: usize>(&self) -> Self;
656}
657
658/// A vector containing 64-bit values.
659pub trait SimdBase64: SimdBase
660where
661    Self::Scalar: Scalar<Unsigned = u64, Signed = i64>,
662{
663    /// Zero out the upper-32 bits of each word, and then perform pairwise multiplication.
664    ///
665    /// # Example
666    /// ```
667    /// # use vectoreyes::*;
668    /// assert_eq!(
669    ///     U64x4::from([6, 7, 8, 9]).mul_lo(U64x4::from([1, 2, 3, 4])),
670    ///     U64x4::from([6, 14, 24, 36]),
671    /// );
672    /// assert_eq!(
673    ///     U64x4::from([6, 7, 8, 9]).mul_lo(
674    ///         U64x4::from([1, 2, 3, 4]) | U64x4::broadcast(u64::MAX << 32)
675    ///     ),
676    ///     U64x4::from([6, 14, 24, 36]),
677    /// );
678    /// ```
679    fn mul_lo(&self, other: Self) -> Self;
680}
681
682/// A vector containing 4 64-bit values.
683pub trait SimdBase4x64: SimdBase64 + SimdBase4x
684where
685    Self::Scalar: Scalar<Unsigned = u64, Signed = i64>,
686{
687    /// Shuffle the 64-bit values.
688    ///
689    /// # Example
690    /// ```
691    /// # use vectoreyes::*;
692    /// assert_eq!(
693    ///     U64x4::from([0, 1, 2, 3]).shuffle::<0, 1, 1, 3>(),
694    ///     U64x4::from([3, 1, 1, 0]),
695    /// );
696    /// ```
697    fn shuffle<const I3: usize, const I2: usize, const I1: usize, const I0: usize>(&self) -> Self;
698}
699
700// TODO: deprecate the uses of from() everywhere and use traits/functions that make it obvious which
701// casts are free and which aren't.
702
703/// Lossily cast a vector by {zero,sign}-extending its values.
704pub trait ExtendingCast<T: SimdBase>: SimdBase {
705    /// Cast from one vector to another by sign or zero extending the values from the source until it
706    /// fills the destination.
707    ///
708    /// The lowest-index values in `t` are kept. Any values which don't fit are discarded.
709    ///
710    /// # Example
711    /// ```
712    /// # use vectoreyes::*;
713    /// assert_eq!(
714    ///     U64x2::extending_cast_from(U32x4::from([1, 2, 3, 4])),
715    ///     U64x2::from([1, 2]),
716    /// );
717    /// ```
718    fn extending_cast_from(t: T) -> Self;
719}
720
721/// A [`Scalar`] type which has a vector type of length `N`.
722///
723/// See [`Simd`] for how this trait is used.
724pub trait HasVector<const N: usize>: Scalar {
725    /// The vector of `[Self; N]`.
726    type Vector: SimdBase<Scalar = Self>;
727}
728
729/// An alternative way of naming SIMD types.
730///
731/// This allows for functions to be written which are generic in the type or length of a vector.
732///
733/// # Example
734/// ```
735/// # use vectoreyes::*;
736/// type MyVector = Simd<u8, 16>; // The same as U8x16.
737///
738/// fn my_length_generic_code<const N: usize>(x: Simd<u32, N>, y: Simd<u32, N>) -> Simd<u32, N>
739///     where u32: HasVector<N>
740/// {
741///     x + x + y
742/// }
743/// ```
744pub type Simd<T, const N: usize> = <T as HasVector<N>>::Vector;
745
746/// An AES block cipher, suitable for encryption.
747///
748/// This cipher can be used for encryption. Decryption operations are handled in the subtrait
749/// [`AesBlockCipherDecrypt`].
750pub trait AesBlockCipher: 'static + Clone + Sync + Send {
751    /// The type of the AES key.
752    type Key: 'static + Clone + Sync + Send;
753
754    /// Running `encrypt_many` with this many blocks will typically result in good
755    /// performance.
756    const BLOCK_COUNT_HINT: usize;
757
758    /// Run the AES key schedule operation with a given key.
759    fn new_with_key(key: Self::Key) -> Self;
760
761    /// Encrypt a single 128-bit AES block.
762    #[inline(always)]
763    fn encrypt(&self, block: U8x16) -> U8x16 {
764        self.encrypt_many([block])[0]
765    }
766    /// Encrypt an array of `N` 128-bit AES blocks using ECB mode.
767    fn encrypt_many<const N: usize>(&self, blocks: [U8x16; N]) -> [U8x16; N]
768    where
769        array_utils::ArrayUnrolledOps: array_utils::UnrollableArraySize<N>;
770}
771
772/// An AES block cipher, suitable for encryption and decryption.
773pub trait AesBlockCipherDecrypt: AesBlockCipher {
774    /// Decrypt a single 128-bit AES block.
775    #[inline(always)]
776    fn decrypt(&self, block: U8x16) -> U8x16 {
777        self.decrypt_many([block])[0]
778    }
779    /// Decrypt an array of `N` 128-bit AES blocks using ECB mode.
780    fn decrypt_many<const N: usize>(&self, blocks: [U8x16; N]) -> [U8x16; N]
781    where
782        array_utils::ArrayUnrolledOps: array_utils::UnrollableArraySize<N>;
783}
784
785pub mod array_utils;
786pub(crate) mod utils;
787
788// We want to allow `which_lane * 0 + 0` expressions.
789// These also allow for simpler generated code. For example, sometimes we have code which looks
790// like:
791//    let x: {{ty}};
792//    x as u8
793// When {{ty}} _is_ u8, this cast isn't neccessary. But it's simpler to always insert it in the
794// generated code.
795#[allow(
796    clippy::identity_op,
797    clippy::erasing_op,
798    clippy::unnecessary_cast,
799    clippy::useless_conversion
800)]
801// intel intrinsics have many arguments
802#[allow(clippy::too_many_arguments)]
803// our compressed code doesn't have newlines
804#[allow(clippy::suspicious_else_formatting)]
805// You can't put inline(always) without a closure
806#[allow(clippy::redundant_closure)]
807// These two lints let us have extra parentheses in the generated source (which makes generation
808// easier).
809#[allow(unused_parens)]
810#[allow(clippy::needless_borrow)]
811// </the two lints>
812mod generated;
813pub use generated::implementation::*;