diff --git a/Doc/DesignPrinciples.html b/Doc/DesignPrinciples.html new file mode 100644 index 00000000..3e521826 --- /dev/null +++ b/Doc/DesignPrinciples.html @@ -0,0 +1,68 @@ + +Images/Title.png +

+Back to main page +

+

Design principles

+

+

+ +

+

Choice of C++ instead of C

+

+The original version of DFPSR was written in the C language, in an attempt to keep it simple and have a stable ABI for dynamic linking with almost any language. +However, being a graphics API rather than a graphics engine, offering only the pre-existing image filters through dynamic linking would have made very boring graphics. +So dynamic linking was not an option, and once all source code is compiled with the same version of the same language, it no longer matters whether the ABI is stable. +Any mistake in the handling of resources when writing whole applications in C would lead to memory corruption, so development was painfully slow when trying to throw together a quick prototype. +The library needed a language that allowed writing both realtime image filters with pointers, and the high level abstractions needed to apply bound checks on those pointers in debug mode. +

+While not perfect, C++ 2014 filled most of the requirements, by having access to both low level optimization and high level abstractions. +With dsr::SafePointer, there were no more random memory crashes of unknown origin, because bound checked pointers caught data corruption directly where it happened in debug mode. +With RAII, memory leaks no longer happened from having allocations with no pointers to them, only from applications deciding to keep adding things to collections without ever removing them. +The SIMD abstraction used operator overloading instead of named functions for everything, which made complex math expressions a lot more readable. + +

+

+ +

+

Relying on a language standard instead of a single compiler implementation

+

+What if the library had used a language with a single compiler implementation to fully specify the behavior? +If the interpretation of your code requires reading the source code of a specific compiler to understand what it is supposed to do, that code will die with the compiler. +Once no compiler that defines the behavior still runs on modern systems, the source code must stand on its own and avoid undefined behavior in the code itself. +

+

+ +

+

Bundling transpilers with projects

+

+Relying on a transpiler that is bundled with the project is however fine, because you can generate code that avoids undefined behavior and save the output in case the transpiler can no longer be executed. +This documentation was generated using a transpiler, so that reading it does not rely on the ability to parse HTML. +

+

+

+ diff --git a/Doc/Dictionary.html b/Doc/Dictionary.html new file mode 100644 index 00000000..68d07f17 --- /dev/null +++ b/Doc/Dictionary.html @@ -0,0 +1,119 @@ + +Images/Title.png +

+Back to main page +

+

Dictionary

+

+To avoid confusion when reading the code comments and documentation, some contemporary programming words will have to be explained in this dictionary. + +

+

+ +

+

Acronyms

+

+While one should mostly avoid using abbreviations in source code, writing davidForsgrenPiuvasSoftwareRenderer:: instead of dsr:: in headers would take too much space and not help to explain the purpose of the code. + +

+ABI | Application Binary Interface. +How a library is linked and called after the source code has been compiled, which is a binary representation. + +

+API | Application Programming Interface. +How a library is called in the source code, which can be checked by the programming language's type system. +A wrapper to call code written in a different language is an API, because the ABI has been wrapped in pre-declarations explaining the meaning of the bits in the calling language's type system. + +

+AD | Anno Domini (in the year of the Lord), used for years after the birth of Jesus Christ in the Gregorian calendar, which is the calendar used to name versions of the C++ programming language. +

+C | A programming language created by Dennis Ritchie in the 1970s AD. +Not to be confused with the C# language. +

+C++ | A programming language created by Bjarne Stroustrup in the 1980s AD. Inherited most of the core design from C for easy porting while adding new features. + +

+DSR | David's (Forsgren Piuva's) software renderer, the shorter acronym for this library and the surrounding framework. + +

+DFPSR | David Forsgren Piuva's software renderer, the longer acronym for this library and the surrounding framework. + +

+HTML | HyperText Markup Language. A text format adapted for reading on computers instead of in print, containing links and not dictating any page layout. Best known as the default format for websites. +

+LOD | Level Of Detail. Often referring to a 3D model's polygon count, but can also apply to other types of assets. + +

+MIP | Multum in parvo. A term used for texture pyramids because there are multiple image resolutions stored in the same texture. Used to avoid noisy undersampling of textures that are seen from far away. + +

+RAII | Resource Acquisition Is Initialization, an important design principle in C++ that prevents resource leaks by letting types acquire resources in constructors and release them automatically in destructors. +

+SIMD | Single Instruction Multiple Data. +A power-efficient hardware design that packs arrays of elements into the same register and processes them at the same time using multiple lanes. +It makes the machine code faster, both by increasing the number of elements processed per instruction and by reducing the pipeline latency of a single instruction when using smaller types. +

+

+ +

+

Abbreviations

+

+Doc | Documentation + +

+Gen | Generator, a function or program used to create something automatically. + +

+

+ +

+

Expressions

+

+Array | A contiguous and indexed block of memory with alignment for each element to allow random access. +dsr::Array creates a heap allocation based on a size given at runtime during construction. +dsr::FixedArray does not make any heap allocation, but can not change its size at runtime. +

+Et cetera | Latin expression that is often used to indicate that a list of examples is not exhaustive, roughly meaning "and all the rest". + +

+List | A collection that can be populated with elements at runtime, where indices to the elements are treated as something temporary. +dsr::List is an array-based list, and is simply called List to save some typing, because there is no linked list among the framework's generic collections. +

+Vector | An array of numbers, used for multi-dimensional math or SIMD vectorization. + +

+Vectorize | To use vectors in order to solve a math problem. +Instead of treating one million pixels as independent objects with a heap allocation for each pixel, we vectorize images into memory aligned arrays and loop over them using chunks of shorter SIMD vectors. + +

+Scalar | A single number, as opposed to a collection of multiple numbers. +

+ diff --git a/Doc/Manual.html b/Doc/Manual.html index 08dbc704..0551dd31 100644 --- a/Doc/Manual.html +++ b/Doc/Manual.html @@ -81,6 +81,9 @@ Choosing storage

+SIMD vectorization +

+ Image processing

@@ -108,11 +111,17 @@

Future maintenance

-Preserving software for future maintainers +Dictionary for commonly used terms +

+ +Design principles

Porting to new processors and operating systems

+ +Preserving software for future maintainers +

diff --git a/Doc/Text/DesignPrinciples.txt b/Doc/Text/DesignPrinciples.txt new file mode 100644 index 00000000..ad68c541 --- /dev/null +++ b/Doc/Text/DesignPrinciples.txt @@ -0,0 +1,35 @@ +<- Manual.html | Back to main page + +Title: Design principles + +--- + +Title2: Choice of C++ instead of C + +The original version of DFPSR was written in the C language, in an attempt to keep it simple and have a stable ABI for dynamic linking with almost any language. +However, being a graphics API rather than a graphics engine, offering only the pre-existing image filters through dynamic linking would have made very boring graphics. +So dynamic linking was not an option, and once all source code is compiled with the same version of the same language, it no longer matters whether the ABI is stable. +Any mistake in the handling of resources when writing whole applications in C would lead to memory corruption, so development was painfully slow when trying to throw together a quick prototype. +The library needed a language that allowed writing both realtime image filters with pointers, and the high level abstractions needed to apply bound checks on those pointers in debug mode. + +While not perfect, C++ 2014 filled most of the requirements, by having access to both low level optimization and high level abstractions. +With dsr::SafePointer, there were no more random memory crashes of unknown origin, because bound checked pointers caught data corruption directly where it happened in debug mode. +With RAII, memory leaks no longer happened from having allocations with no pointers to them, only from applications deciding to keep adding things to collections without ever removing them. +The SIMD abstraction used operator overloading instead of named functions for everything, which made complex math expressions a lot more readable. + +--- + +Title2: Relying on a language standard instead of a single compiler implementation + +What if the library had used a language with a single compiler implementation to fully specify the behavior? 
+If the interpretation of your code requires reading the source code of a specific compiler to understand what it is supposed to do, that code will die with the compiler. +Once no compiler that defines the behavior still runs on modern systems, the source code must stand on its own and avoid undefined behavior in the code itself. + +--- + +Title2: Bundling transpilers with projects + +Relying on a transpiler that is bundled with the project is however fine, because you can generate code that avoids undefined behavior and save the output in case the transpiler can no longer be executed. +This documentation was generated using a transpiler, so that reading it does not rely on the ability to parse HTML. + +--- diff --git a/Doc/Text/Dictionary.txt b/Doc/Text/Dictionary.txt new file mode 100644 index 00000000..bc06a0cb --- /dev/null +++ b/Doc/Text/Dictionary.txt @@ -0,0 +1,69 @@ +<- Manual.html | Back to main page + +Title: Dictionary + +To avoid confusion when reading the code comments and documentation, some contemporary programming words will have to be explained in this dictionary. + +--- + +Title2: Acronyms + +While one should mostly avoid using abbreviations in source code, writing davidForsgrenPiuvasSoftwareRenderer:: instead of dsr:: in headers would take too much space and not help to explain the purpose of the code. + +ABI | Application Binary Interface. +How a library is linked and called after the source code has been compiled, which is a binary representation. + +API | Application Programming Interface. +How a library is called in the source code, which can be checked by the programming language's type system. +A wrapper to call code written in a different language is an API, because the ABI has been wrapped in pre-declarations explaining the meaning of the bits in the calling language's type system. 
+ +AD | Anno Domini (in the year of the Lord), used for years after the birth of Jesus Christ in the Gregorian calendar, which is the calendar used to name versions of the C++ programming language. + +C | A programming language created by Dennis Ritchie in the 1970s AD. +Not to be confused with the C# language. + +C++ | A programming language created by Bjarne Stroustrup in the 1980s AD. Inherited most of the core design from C for easy porting while adding new features. + +DSR | David's (Forsgren Piuva's) software renderer, the shorter acronym for this library and the surrounding framework. + +DFPSR | David Forsgren Piuva's software renderer, the longer acronym for this library and the surrounding framework. + +HTML | HyperText Markup Language. A text format adapted for reading on computers instead of in print, containing links and not dictating any page layout. Best known as the default format for websites. + +LOD | Level Of Detail. Often referring to a 3D model's polygon count, but can also apply to other types of assets. + +MIP | Multum in parvo. A term used for texture pyramids because there are multiple image resolutions stored in the same texture. Used to avoid noisy undersampling of textures that are seen from far away. + +RAII | Resource Acquisition Is Initialization, an important design principle in C++ that prevents resource leaks by letting types acquire resources in constructors and release them automatically in destructors. + +SIMD | Single Instruction Multiple Data. +A power-efficient hardware design that packs arrays of elements into the same register and processes them at the same time using multiple lanes. +It makes the machine code faster, both by increasing the number of elements processed per instruction and by reducing the pipeline latency of a single instruction when using smaller types. + +--- + +Title2: Abbreviations + +Doc | Documentation + +Gen | Generator, a function or program used to create something automatically. 
+ +--- + +Title2: Expressions + +Array | A contiguous and indexed block of memory with alignment for each element to allow random access. +dsr::Array creates a heap allocation based on a size given at runtime during construction. +dsr::FixedArray does not make any heap allocation, but can not change its size at runtime. + +Et cetera | Latin expression that is often used to indicate that a list of examples is not exhaustive, roughly meaning "and all the rest". + +List | A collection that can be populated with elements at runtime, where indices to the elements are treated as something temporary. +dsr::List is an array-based list, and is simply called List to save some typing, because there is no linked list among the framework's generic collections. + +Vector | An array of numbers, used for multi-dimensional math or SIMD vectorization. + +Vectorize | To use vectors in order to solve a math problem. +Instead of treating one million pixels as independent objects with a heap allocation for each pixel, we vectorize images into memory aligned arrays and loop over them using chunks of shorter SIMD vectors. + +Scalar | A single number, as opposed to a collection of multiple numbers. 
diff --git a/Doc/Text/Manual.txt b/Doc/Text/Manual.txt index 4c6a0290..9fef13a8 100644 --- a/Doc/Text/Manual.txt +++ b/Doc/Text/Manual.txt @@ -49,6 +49,9 @@ Title2: Techniques * <- ChoosingStorage.html | Choosing storage +* +<- Vectorization.html | SIMD vectorization + * <- ImageProcessing.html | Image processing @@ -74,9 +77,15 @@ Title2: Technical details Title2: Future maintenance * -<- Preserving.html | Preserving software for future maintainers +<- Dictionary.html | Dictionary for commonly used terms + +* +<- DesignPrinciples.html | Design principles * <- Porting.html | Porting to new processors and operating systems +* +<- Preserving.html | Preserving software for future maintainers + --- diff --git a/Doc/Text/Vectorization.txt b/Doc/Text/Vectorization.txt new file mode 100644 index 00000000..5f27d3da --- /dev/null +++ b/Doc/Text/Vectorization.txt @@ -0,0 +1,205 @@ +<- Manual.html | Back to main page + +Title: Vectorization + +One of the most important features of this framework is the ability to use vectors that are independent of processor architecture. +Instead of having to manually use intrinsic functions directly, the library generates the intrinsic function calls using inlined functions bound to operands. + +All lanes in the vectors are indexed by memory addresses, so lane zero is always first in memory, while reinterpret casting between SIMD vectors works as if reinterpret casting pointers to bytes in memory. + +--- + +Title2: The headers + +* +Source/DFPSR/base/simd.h + +The Source/DFPSR/base/simd.h header is where the SIMD vectors are implemented. + +* +Source/DFPSR/base/noSimd.h + +To simplify using the same math functions with and without vectorization, the same functionality is offered to basic scalar types in Source/DFPSR/base/noSimd.h. 
+When writing a template function that works for both scalar and vector types, one can import Source/DFPSR/base/noSimd.h in the header that defines the template functions without polluting the header with everything inside of simd.h. +Then the caller can import simd.h from within a cpp file and call the template function using the portable vector types. + +--- + +Title2: 128-bit vector types + +The most deterministic and beginner friendly way of SIMD vectorization is to use the fixed size 128-bit vectors. +Then you do not need to test the algorithm with multiple vector widths. +It will just work with the same determinism as using scalar operations directly in C++. + +U8x16 is a vector of unsigned integers with 16 lanes of 8 bits each. + +U16x8 is a vector of unsigned integers with 8 lanes of 16 bits each. + +U32x4 is a vector of unsigned integers with 4 lanes of 32 bits each. + +I32x4 is a vector of signed integers with 4 lanes of 32 bits each. +Note that overflow of signed integers is undefined behavior in C++, which gives compilers room for better optimization. + +F32x4 is a vector of floating-point numbers with 4 lanes of 32 bits each. +(float, float, float, float) +Even if the computer follows a standard for storing floating-point values, rounding of intermediate calculations can differ between compilers and hardware. +Use integers if you need absolute determinism regarding precision. + +--- + +Title2: Variable width vector types + +When targeting hardware that has wider vectors than 128 bits, it is recommended to use the variable width SIMD vectors. +Note that some SIMD hardware has a special read-only register for reading the vector's width at runtime instead of compiling for different targets, so that the same compiled program can use the same instructions for different vector widths. + +For SSE2 and ARM NEON, both X and F vectors are 128 bits wide. +These are usually enabled by default, because the extensions are not optional in modern hardware. 
+ +For AVX, F vectors are 256 bits wide and X vectors are 128 bits wide. +AVX has to be enabled using a flag specific to the compiler and will make the executable incompatible with processors that do not have the AVX extension. +AVX only makes the F32xF type wider, so if you are not using it, you can skip compiling for AVX and only compile for SSE2 and AVX2. +If you are not targeting old processors, you can start with AVX as the minimum requirement. + +For AVX2, both X and F vectors are 256 bits wide. +AVX2 has to be enabled using a flag specific to the compiler and will make the executable incompatible with processors that do not have the AVX2 extension. +For programs compiled with AVX2, it is advisable to also release a version of the program that does not require AVX2. + +--- + +Title2: X vector types + +An X vector has the largest bit width that is available for both floating-point and integer types. + +* +U8xX is a vector of 8-bit unsigned integers filling up the X vector width. + +* +U16xX is a vector of 16-bit unsigned integers filling up the X vector width. + +* +U32xX is a vector of 32-bit unsigned integers filling up the X vector width. + +* +I32xX is a vector of 32-bit signed integers filling up the X vector width. + +* +F32xX is a vector of 32-bit floats filling up the X vector width, which can be smaller than F32xF when X and F vectors are not of the same size. + +--- + +Title2: F vector types + +An F vector has the largest bit width that is available for floating-point types alone, which can be wider than the X vector. +When using the F vector width, all operations in the loop must use floating-point values. +If you are mixing in integers, you have to fall back to the X width for the entire loop. + +* +F32xF is a vector of 32-bit floats filling up the F vector width. + +--- + +Title2: Reading and writing data + +The SIMD vectors are designed to be used together with dsr::Buffer and dsr::SafePointer. 
+dsr::Buffer guarantees that the start of the buffer is aligned with both the widest SIMD vector and the widest cache line among all CPU cores. +dsr::SafePointer makes it easy to catch out-of-bound errors in debug mode, even when using gather instructions. + +--- + +Title2: readAligned + +readAligned is a static member function of the SIMD vectors, so you call it using the vector type as a namespace before the call. +If you, for example, want to load an X vector of 16-bit unsigned integers, you call 'U16xX::readAligned(firstElementPointer, "Reading my vector")' to get U16xX as the result. +The "data" pointer should point to the first element in the array of elements to load, and must be memory aligned with the full size of the vector. +The "methodName" string should be an ASCII string literal, which is only used in debug mode and optimized away in release mode. + +U8x? U8x?::readAligned(dsr::SafePointer data, const char* methodName) + +U16x? U16x?::readAligned(dsr::SafePointer data, const char* methodName) + +U32x? U32x?::readAligned(dsr::SafePointer data, const char* methodName) + +I32x? I32x?::readAligned(dsr::SafePointer data, const char* methodName) + +F32x? F32x?::readAligned(dsr::SafePointer data, const char* methodName) + +--- + +Title2: writeAligned + +writeAligned is a regular member function of the SIMD vectors, so you call it by treating the vector as an object. +If you have a SIMD vector called myVector containing a result that you want to store in memory, you call 'myVector.writeAligned(firstElementPointer, "Writing my vector");' to store it. +The "data" pointer should point to the first element in the array of elements to write, and must be memory aligned with the full size of the vector. +The "methodName" string should be an ASCII string literal, which is only used in debug mode and optimized away in release mode. 
+void U8x?::writeAligned(dsr::SafePointer data, const char* methodName) const + +void U16x?::writeAligned(dsr::SafePointer data, const char* methodName) const + +void U32x?::writeAligned(dsr::SafePointer data, const char* methodName) const + +void I32x?::writeAligned(dsr::SafePointer data, const char* methodName) const + +void F32x?::writeAligned(dsr::SafePointer data, const char* methodName) const + +--- + +Title2: Gather functions + +The gather_U32, gather_I32 and gather_F32 functions take a SafePointer and a vector of 32-bit unsigned integer element offsets. +For each element offset in the offset vector, the element at the pointer plus the offset is returned at the corresponding lane in the result. + +U32x? gather_U32(dsr::SafePointer data, const U32x? &elementOffset) + +I32x? gather_I32(dsr::SafePointer data, const U32x? &elementOffset) + +F32x? gather_F32(dsr::SafePointer data, const U32x? &elementOffset) + +--- + +Title2: Constructing SIMD vectors from multiple scalars + +The easiest way to construct a SIMD vector is to construct it from its elements, such as 'F32x4(0.3f, 12.5f, 10.0f, 1.0f)'. + +--- + +Title2: Constructing SIMD vectors from a single scalar + +Every SIMD vector can also be constructed from a uniform scalar, because some processors have special SIMD instructions for this. +So 'U32x4(25)' is the same as writing 'U32x4(25, 25, 25, 25)'. + +--- + +Title2: Constructing SIMD vectors from gradients + +While not hardware accelerated, the static member function createGradient can be used to construct a SIMD vector from the value of the first element and how much to add for each following element. +So by writing 'F32xX::createGradient(0.5f, 1.0f)', you get F32x4(0.5f, 1.5f, 2.5f, 3.5f), F32x8(0.5f, 1.5f, 2.5f, 3.5f, 4.5f, 5.5f, 6.5f, 7.5f)... depending on the length of the type. + +Vectorizing sampling locations will sometimes require initializing a SIMD vector to a gradient and then iterating it in strides. 
+ +--- + +Title2: Math operations + +On top of the functions defined in noSimd.h, most of the standard math operations that apply to scalars are also defined for SIMD vectors, such as addition (a + b), subtraction (a - b, -a) and multiplication (a * b). +The math operations are performed as pointwise operations. +(a, b, c, d...) op (x, y, z, w...) = (a op x, b op y, c op z, d op w...) + +32-bit integer multiplication is not supported directly in hardware on some processors, but it will fall back on multiple scalar multiplications when that happens, to give correct results. + +Integer division is not supported for SIMD vectors, because almost nobody uses integer division in optimized code and SIMD hardware rarely supports it. +For multiplying and dividing by powers of two, you can instead pre-calculate the base-two logarithm of the value to multiply or divide by. +Shift left on an unsigned integer to multiply. +Shift right on an unsigned integer to divide. + +Bit shifting is not supported for signed integers, because that would be undefined behavior. +Some hardware architectures offer signed bit shifting in a way that divides and multiplies signed integers correctly, but these might not even exist for scalar operations. + +For optimal performance, use bitShiftLeftImmediate instead of << and bitShiftRightImmediate instead of >> when you know the offset amount at compile time. +Instead of 'myVector << 5', write 'bitShiftLeftImmediate<5>(myVector)'. +Instead of 'myVector >> 12', write 'bitShiftRightImmediate<12>(myVector)'. +Otherwise the hardware abstraction may have to fall back on repeated scalar operations even though bit shifting intrinsic functions are available for immediate offsets. + +SIMD vectors with lanes of unsigned integers also support bitwise and (a & b), bitwise inclusive or (a | b), bitwise exclusive or (a ^ b) and bitwise negation (~a). 
+ +--- diff --git a/Doc/Vectorization.html b/Doc/Vectorization.html new file mode 100644 index 00000000..21c3b01a --- /dev/null +++ b/Doc/Vectorization.html @@ -0,0 +1,295 @@ + +Images/Title.png +

+Back to main page +

+

Vectorization

+

+One of the most important features of this framework is the ability to use vectors that are independent of processor architecture. +Instead of having to use intrinsic functions manually, the library generates the intrinsic function calls through inlined functions and operator overloading. +

+All lanes in the vectors are indexed by memory addresses, so lane zero is always first in memory, while reinterpret casting between SIMD vectors works as if reinterpret casting pointers to bytes in memory. + +

+

+ +

+

The headers

+

+ +Source/DFPSR/base/simd.h + +

+The Source/DFPSR/base/simd.h header is where the SIMD vectors are implemented. + +

+ +Source/DFPSR/base/noSimd.h + +

+To simplify using the same math functions with and without vectorization, the same functionality is offered to basic scalar types in Source/DFPSR/base/noSimd.h. +When writing a template function that works for both scalar and vector types, one can import Source/DFPSR/base/noSimd.h in the header that defines the template functions without polluting the header with everything inside of simd.h. +Then the caller can import simd.h from within a cpp file and call the template function using the portable vector types. + +

+

+ +

+

128-bit vector types

+

+The most deterministic and beginner friendly way of SIMD vectorization is to use the fixed size 128-bit vectors. +Then you do not need to test the algorithm with multiple vector widths. +It will just work with the same determinism as using scalar operations directly in C++. +

+U8x16 is a vector of unsigned integers with 16 lanes of 8 bits each. +

+U16x8 is a vector of unsigned integers with 8 lanes of 16 bits each. +

+U32x4 is a vector of unsigned integers with 4 lanes of 32 bits each. +

+I32x4 is a vector of signed integers with 4 lanes of 32 bits each. +Note that overflow of signed integers is undefined behavior in C++, which gives compilers room for better optimization. +

+F32x4 is a vector of floating-point numbers with 4 lanes of 32 bits each. +(float, float, float, float) +Even if the computer follows a standard for storing floating-point values, rounding of intermediate calculations can differ between compilers and hardware. +Use integers if you need absolute determinism regarding precision. +

+

+ +

+

Variable width vector types

+

+When targeting hardware that has wider vectors than 128 bits, it is recommended to use the variable width SIMD vectors. +Note that some SIMD hardware has a special read-only register for reading the vector's width at runtime instead of compiling for different targets, so that the same compiled program can use the same instructions for different vector widths. +

+For SSE2 and ARM NEON, both X and F vectors are 128 bits wide. +These are usually enabled by default, because the extensions are not optional in modern hardware. + +

+For AVX, F vectors are 256 bits wide and X vectors are 128 bits wide. +AVX has to be enabled using a flag specific to the compiler and will make the executable incompatible with processors that do not have the AVX extension. +AVX only makes the F32xF type wider, so if you are not using it, you can skip compiling for AVX and only compile for SSE2 and AVX2. +If you are not targeting old processors, you can start with AVX as the minimum requirement. +

+For AVX2, both X and F vectors are 256 bits wide. +AVX2 has to be enabled using a flag specific to the compiler and will make the executable incompatible with processors that do not have the AVX2 extension. +For programs compiled with AVX2, it is advisable to also release a version of the program that does not require AVX2. +
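+As a rough illustration of the compiler-specific flags meant above, these are the GCC/Clang options (the file names are hypothetical, and MSVC uses /arch:AVX and /arch:AVX2 instead):

```shell
# Baseline build: SSE2 is implied by the x86-64 ABI, so no flag is needed.
g++ -O2 main.cpp -o myProgram_sse2

# AVX build: only runs on processors with the AVX extension.
g++ -O2 -mavx main.cpp -o myProgram_avx

# AVX2 build: ship together with a non-AVX2 build for older processors.
g++ -O2 -mavx2 main.cpp -o myProgram_avx2
```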

+

+ +

+

X vector types

+

+An X vector has the largest bit width that is available for both floating-point and integer types. + +

+ +U8xX is a vector of 8-bit unsigned integers filling up the X vector width. + +

+ +U16xX is a vector of 16-bit unsigned integers filling up the X vector width. + +

+ +U32xX is a vector of 32-bit unsigned integers filling up the X vector width. + +

+ +I32xX is a vector of 32-bit signed integers filling up the X vector width. + +

+ +F32xX is a vector of 32-bit floats filling up the X vector width, which can be smaller than F32xF when X and F vectors are not of the same size. + +

+

+ +

+

F vector types

+

+An F vector has the largest bit width that is available for floating-point types alone, which can be wider than the X vector. +When using the F vector width, all operations in the loop must use floating-point values. +If you are mixing in integers, you have to fall back to the X width for the entire loop. +

+ +F32xF is a vector of 32-bit floats filling up the F vector width. + +

+

+ +

+

Reading and writing data

+

+The SIMD vectors are designed to be used together with dsr::Buffer and dsr::SafePointer. +dsr::Buffer guarantees that the start of the buffer is aligned with both the widest SIMD vector and the widest cache line among all CPU cores. +dsr::SafePointer makes it easy to catch out-of-bound errors in debug mode, even when using gather instructions. + +

+

+ +

+

readAligned

+

+readAligned is a static member function of the SIMD vectors, so you call it using the vector type as a namespace before the call. +If you, for example, want to load an X vector of 16-bit unsigned integers, you call 'U16xX::readAligned(firstElementPointer, "Reading my vector")' to get U16xX as the result. +The "data" pointer should point to the first element in the array of elements to load, and must be memory aligned with the full size of the vector. +The "methodName" string should be an ASCII string literal, which is only used in debug mode and optimized away in release mode. +

+U8x? U8x?::readAligned(dsr::SafePointer data, const char* methodName) +

+U16x? U16x?::readAligned(dsr::SafePointer data, const char* methodName) +

+U32x? U8x32::readAligned(dsr::SafePointer data, const char* methodName) + +

+I32x? U8x32::readAligned(dsr::SafePointer data, const char* methodName) + +

+F32x? U8x32::readAligned(dsr::SafePointer data, const char* methodName) + +
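The semantics of an aligned read can be modeled in plain scalar C++: copy one full vector worth of consecutive elements from an aligned address into lanes. This sketch is illustrative only; the lane count of 8 is an assumption, and the real readAligned also gets bound checks from dsr::SafePointer in debug mode.

```cpp
#include <cstdint>
#include <cassert>

// Scalar model of U16xX::readAligned, assuming an X width of 8 lanes.
struct U16x8Model {
	uint16_t lanes[8];
	static U16x8Model readAligned(const uint16_t* data) {
		// The address must be aligned with the full vector size (16 bytes here).
		assert(reinterpret_cast<uintptr_t>(data) % (8 * sizeof(uint16_t)) == 0);
		U16x8Model result;
		for (int i = 0; i < 8; i++) { result.lanes[i] = data[i]; }
		return result;
	}
};
```

Because dsr::Buffer aligns allocations with the widest SIMD vector, pointers to the start of a buffer always satisfy the alignment requirement modeled by the assert above.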

+

+ +

+

writeAligned

+

+ +writeAligned is a regular member function of the SIMD vectors, so you call it by treating the vector as an object. +If you have a SIMD vector called myVector containing a result that you want to store in memory, you call 'myVector.writeAligned(firstElementPointer, "Writing my vector");' to store it. +The "data" pointer should point to the first element in the array of elements to write, and must be memory aligned with the full size of the vector. +The "methodName" string should be an ASCII string literal, which is only used in debug mode and optimized away in release mode. + +

+void U8x?::writeAligned(dsr::SafePointer data, const char* methodName) const + +

+void U16x?::writeAligned(dsr::SafePointer data, const char* methodName) const + +

+void U32x?::writeAligned(dsr::SafePointer data, const char* methodName) const + +

+void I32x?::writeAligned(dsr::SafePointer data, const char* methodName) const + +

+void F32x?::writeAligned(dsr::SafePointer data, const char* methodName) const + +
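Putting readAligned and writeAligned together, a typical vectorized loop reads one full vector per iteration, operates on it, and writes the result back. The loop below models that pattern in plain scalar C++ (an assumed 8-lane width for 16-bit integers, and an element count that is a multiple of the lane count).

```cpp
#include <cstdint>

// Scalar model of a SIMD loop: read one vector, add a uniform value, write it back.
// Assumes elementCount is a multiple of the lane count, which dsr::Buffer's
// alignment padding makes easy to guarantee in the real library.
void addUniformModel(uint16_t* data, int elementCount, uint16_t value) {
	const int laneCount = 8; // Model of the X vector width for 16-bit lanes.
	for (int i = 0; i < elementCount; i += laneCount) {
		// Models U16xX::readAligned, vector addition, and writeAligned.
		for (int lane = 0; lane < laneCount; lane++) {
			data[i + lane] = uint16_t(data[i + lane] + value);
		}
	}
}
```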

+

+ +

+

Gather functions

+

+The gather_U32, gather_I32 and gather_F32 functions take a SafePointer and a vector of 32-bit unsigned integer element offsets. +For each element offset in the offset vector, the element at the pointer plus the offset is returned at the corresponding lane in the result. + +

+U32x? gather_U32(dsr::SafePointer data, const U32x? &elementOffset) + +

+I32x? gather_I32(dsr::SafePointer data, const U32x? &elementOffset) + +

+F32x? gather_F32(dsr::SafePointer data, const U32x? &elementOffset) + +
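The gather semantics can be stated precisely with a scalar model: each result lane is the element found at the base pointer plus that lane's offset. This sketch assumes a lane count of 4; the real gather functions use SafePointer to catch out-of-bound offsets in debug mode.

```cpp
#include <cstdint>

// Scalar model of gather_U32 with an assumed lane count of 4:
// result[lane] = data[elementOffset[lane]] for every lane.
void gatherU32Model(const uint32_t* data, const uint32_t elementOffset[4], uint32_t result[4]) {
	for (int lane = 0; lane < 4; lane++) {
		result[lane] = data[elementOffset[lane]];
	}
}
```

Gathers are useful for vectorized table lookups, where each lane fetches from a different location in the same buffer.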

+

+ +

+

Constructing SIMD vectors from multiple scalars

+

+The easiest way to construct a SIMD vector is to construct it from its elements, such as 'F32x4(0.3f, 12.5f, 10.0f, 1.0f)'. + +

+

+ +

+

Constructing SIMD vectors from a single scalar

+

+ +Every SIMD vector can also be constructed from a uniform scalar, because some processors have special SIMD broadcast instructions for this. +So 'U32x4(25)' is the same as writing 'U32x4(25, 25, 25, 25)'. + +

+

+ +

+

Constructing SIMD vectors from gradients

+

+ +While not hardware accelerated, the static member function createGradient can be used to construct a SIMD vector using the value of the first element and how much to add for each following element. +So by writing 'F32xX::createGradient(0.5f, 1.0f)', you get F32x4(0.5f, 1.5f, 2.5f, 3.5f), F32x8(0.5f, 1.5f, 2.5f, 3.5f, 4.5f, 5.5f, 6.5f, 7.5f)... depending on the length of the type. + +

+ +Vectorizing sampling locations will sometimes require initializing a SIMD vector to a gradient and then iterating it in strides. + +
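Modeled in plain scalar C++, the pattern above becomes: initialize the lanes to a gradient, then add one full stride (lane count times increment) to every lane per iteration. The 4-lane width here is an assumption for illustration.

```cpp
// Scalar model of F32xX::createGradient followed by stride iteration,
// assuming 4 lanes. Each loop iteration advances all lanes by one full stride.
void fillRampModel(float* output, int elementCount, float first, float increment) {
	const int laneCount = 4;
	float lanes[4];
	for (int lane = 0; lane < laneCount; lane++) {
		lanes[lane] = first + increment * float(lane); // Models createGradient.
	}
	for (int i = 0; i < elementCount; i += laneCount) {
		for (int lane = 0; lane < laneCount; lane++) {
			output[i + lane] = lanes[lane];
			lanes[lane] += increment * float(laneCount); // Stride to the next vector.
		}
	}
}
```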

+

+ +

+

Math operations

+

+ +On top of the functions defined in noSimd.h, most of the standard math operations that apply to scalars are also defined for SIMD vectors, such as addition (a + b), subtraction (a - b, -a) and multiplication (a * b). +The math operations are performed element-wise, lane by lane. +(a, b, c, d...) op (x, y, z, w...) = (a op x, b op y, c op z, d op w...) + +
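The element-wise rule above can be written out as a scalar model, here for multiplication with an assumed 4-lane vector:

```cpp
// Element-wise semantics of SIMD operators, modeled with 4 lanes:
// (a0, a1, a2, a3) * (b0, b1, b2, b3) = (a0*b0, a1*b1, a2*b2, a3*b3)
void multiplyModel(const int a[4], const int b[4], int result[4]) {
	for (int lane = 0; lane < 4; lane++) {
		result[lane] = a[lane] * b[lane];
	}
}
```

No lane ever reads from another lane in these operations, which is what lets the hardware execute all lanes in parallel.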

+ +32-bit integer multiplication is not supported directly in hardware on some processors, but the library will fall back on multiple scalar multiplications in that case to give correct results. + +

+ +Integer division is not supported for SIMD vectors, because it is rarely used in optimized code and rarely supported by SIMD hardware. +For multiplying and dividing by powers of two, you can instead pre-calculate the base-two logarithm of the value to multiply or divide by. +Shift left on an unsigned integer to multiply. +Shift right on an unsigned integer to divide. + +
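The power-of-two trick is easy to state in scalar code: the base-two logarithm of the factor becomes the shift amount. Shown here for a factor of 8, whose base-two logarithm is 3:

```cpp
#include <cstdint>

// Pre-calculating the base-two logarithm lets shifts replace multiplication
// and division by powers of two on unsigned integers.
uint32_t multiplyBy8(uint32_t x) { return x << 3; } // Same as x * 8.
uint32_t divideBy8(uint32_t x)   { return x >> 3; } // Same as x / 8, rounding down.
```

The same shifts applied lane by lane on an unsigned SIMD vector give the vectorized equivalent.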

+ +Bit shifting is not supported for signed integers, because that would be undefined behavior in C++. +Some hardware architectures offer signed bit shifts that divide and multiply signed integers correctly, but such instructions might not even exist for scalar operations. + +

+ +For optimal performance, use bitShiftLeftImmediate instead of << and bitShiftRightImmediate instead of >> when you know the offset amount at compile time. +Instead of 'myVector << 5', write 'bitShiftLeftImmediate<5>(myVector)'. +Instead of 'myVector >> 12', write 'bitShiftRightImmediate<12>(myVector)'. +Otherwise the hardware abstraction may have to fall back on repeated scalar operations even though bit shifting intrinsic functions are available for immediate offsets. + +
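The reason the template form helps is that the offset becomes a compile-time constant, which can map directly to a shift-by-immediate instruction. A scalar model of the template pattern (the real bitShiftLeftImmediate operates on whole vectors, not single scalars):

```cpp
#include <cstdint>

// Scalar model of the immediate-shift template pattern: the offset is a
// compile-time constant, so the compiler can emit a shift-by-immediate
// instruction instead of loading the amount into a register at runtime.
template <int OFFSET>
uint32_t shiftLeftImmediateModel(uint32_t x) {
	static_assert(OFFSET >= 0 && OFFSET < 32, "Shift offset out of range");
	return x << OFFSET;
}
```

The static_assert also rejects out-of-range offsets at compile time, which would otherwise be undefined behavior.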

+SIMD vectors with lanes of unsigned integers also support bitwise and (a & b), bitwise inclusive or (a | b), bitwise exclusive or (a ^ b), bitwise negation (~a). + +

+

+

+ diff --git a/Source/test/tests/SimdTest.cpp b/Source/test/tests/SimdTest.cpp index 8ea9ce93..8f0de750 100755 --- a/Source/test/tests/SimdTest.cpp +++ b/Source/test/tests/SimdTest.cpp @@ -3,6 +3,8 @@ #include "../../DFPSR/base/simd.h" #include "../../DFPSR/base/endian.h" +// TODO: Write tests for the abs function in noSimd.h, using SIMD vectors. +// Implement the abs function directly to override the template function when hardware is available for the vector type. // TODO: Set up a test where SIMD is disabled to force using the reference implementation. // TODO: Keep the reference implementation alongside the SIMD types during brute-force testing with millions of random inputs.