Snigdha2005/Large_Integer_Multiplication

Configurable Hardware for Large Integer Multiplication

Problem Statement
Generic and Configurable Large Integer Multiplication Architecture for FPGA Implementation

Large integer multiplication is a fundamental operation in domains such as cryptography (e.g., RSA, ECC), scientific computing, and digital signal processing, where both speed and precision are critical. However, standard multiplication techniques and general-purpose processors (CPUs/GPUs) become inefficient as operand sizes increase, particularly under energy and latency constraints.

Implementing large integer multiplication in hardware, especially on FPGAs, presents several challenges: limited resources, latency that grows with operand size, and the rigid, non-configurable nature of many existing multiplier architectures. There is therefore a pressing need for a generic, scalable, and configurable hardware multiplier that can adapt to varying operand sizes (e.g., 64-bit to 4096-bit), reuse basic arithmetic units efficiently (such as 32-bit multipliers), and balance performance with resource usage.

This project aims to design and implement such a large integer multiplier on FPGA, featuring support for runtime or compile-time configurable input sizes, pipelined modular arithmetic blocks (e.g., Schoolbook or Karatsuba), and optimized accumulation and scheduling of partial products. The architecture focuses on minimal control overhead, low register usage, and efficient resource utilization, making it well-suited for scalable hardware acceleration on modern FPGA platforms.

Related Work
Efficient large integer multiplication has been explored through various hardware techniques due to its importance in cryptographic and scientific applications. Redundant binary addition trees have been shown to offer constant-time accumulation and regular VLSI layout, matching Wallace tree speeds while maintaining layout simplicity [1]. For cryptographic algorithms like Saber, lightweight accelerators with algorithm-specific mapping have significantly improved area-delay product (ADP) [2].

Montgomery multiplication remains a standard for modular arithmetic, though Karatsuba and FFT-based methods are gaining traction in hardware for larger operands [3]. Hybrid designs combining Karatsuba and Comba multiplication show latency improvements for operand sizes above 512 bits [4]. Ultra-large multiplication (million-bit scale) using Schönhage–Strassen and NTT has also been realized on FPGAs with custom memory hierarchies and FFT-oriented processing units [5].

Folded architectures have been proposed to reduce DSP usage while maintaining high throughput, offering better scalability for FPGA platforms [6]. Additionally, comparative studies highlight the trade-offs between multiplication algorithms on FPGA in terms of complexity, latency, and resource utilization [7].

Approaches

  1. Vectorized Schoolbook Multiplier with Batch Scheduling

    1. A schoolbook multiplication architecture using direct x*y multiplication for vectorized operands, supporting lane-based parallelism and batch-wise scheduling for different operand blocks.
    2. Parameters: TotalOperandSize (bit-width of each operand), VectorLength (number of operands), NumLanes (parallel lanes per batch), BaseMultiplierBits (width of base multiplier).
    3. Inputs: Entire vector operands X and Y of size VectorLength × TotalOperandSize each.
    4. Critical path: base multiplication + accumulation of partial products across lanes per batch.
    5. Disadvantage: No support for dynamic operand sizing or streaming; full operand inputs required at once; increased I/O footprint due to complete operand/result loading.
    6. Advantage: Compile-time fixed size allows for tight resource optimization; vector lanes enable parallelism; control logic remains simple due to static scheduling.
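The lane and batch scheduling is a hardware concern, but the arithmetic this approach performs can be sketched in a few lines of Python. This is a software model, not the RTL; `BASE_BITS` stands in for BaseMultiplierBits, and the default sizes are illustrative.

```python
# Minimal software model of the vectorized schoolbook approach (a sketch of
# the arithmetic, not the RTL): each operand is split into BASE_BITS-wide
# words (cf. BaseMultiplierBits), every word pair yields a partial product
# that is shifted into position and accumulated, and the same kernel is
# applied across a vector of operand pairs (cf. VectorLength).

BASE_BITS = 32
MASK = (1 << BASE_BITS) - 1

def to_words(x, n_words):
    """Split integer x into n_words little-endian BASE_BITS-wide words."""
    return [(x >> (BASE_BITS * i)) & MASK for i in range(n_words)]

def schoolbook_mult(x, y, total_bits=128):
    """Schoolbook product via base-word partial products (cf. TotalOperandSize)."""
    n = total_bits // BASE_BITS
    xw, yw = to_words(x, n), to_words(y, n)
    acc = 0
    for i in range(n):
        for j in range(n):           # in hardware, lanes/batches share this work
            acc += (xw[i] * yw[j]) << (BASE_BITS * (i + j))
    return acc

def schoolbook_vector(xs, ys, total_bits=128):
    """Apply the same kernel to a whole vector of operand pairs."""
    return [schoolbook_mult(x, y, total_bits) for x, y in zip(xs, ys)]
```

The double loop makes the quadratic partial-product count explicit; the hardware's static schedule simply distributes these iterations across lanes and batches.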
  2. Vectorized Recursive Karatsuba Multiplier

    1. Recursive Karatsuba multiplication with unrolled stages using batch-scheduled base multipliers per recursion level. Parameters control lane width, vector length, and recursion base size.
    2. Parameters: TotalOperandSize (bit-width of each operand), VectorLength (number of operands), NumLanes (parallel lanes per batch), BaseMultiplierBits (width of base multiplier).
    3. Inputs: Entire vector operands X and Y of size VectorLength × TotalOperandSize each.
    4. Critical Path: Multiple recursion levels involving parallel base multipliers and adders with inter-level accumulation and recombination.
    5. Advantage: Optimized for performance at large operand sizes; high parallelism via vector lanes; low latency due to full unrolling.
    6. Disadvantage: High resource usage; no streaming support; operand size fixed at compile time; full operand inputs required at once, increasing memory bandwidth demand.
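The recursion this approach unrolls can be modelled compactly in software. A minimal sketch, assuming a 32-bit base-multiplier cutoff (cf. BaseMultiplierBits); the RTL's unrolled, batch-scheduled version differs in scheduling, not in the arithmetic shown here.

```python
# Minimal software model of the recursive Karatsuba scheme: three half-size
# products (Z0, Z1, Z2) per level, recombined with shifts. The 32-bit cutoff
# stands in for BaseMultiplierBits.

BASE_BITS = 32

def karatsuba(x, y, bits):
    if bits <= BASE_BITS:
        return x * y                         # base multiplier
    half = bits // 2
    mask = (1 << half) - 1
    x_lo, x_hi = x & mask, x >> half
    y_lo, y_hi = y & mask, y >> half
    z0 = karatsuba(x_lo, y_lo, half)                              # low
    z2 = karatsuba(x_hi, y_hi, half)                              # high
    z1 = karatsuba(x_lo + x_hi, y_lo + y_hi, half + 1) - z0 - z2  # middle
    return (z2 << (2 * half)) + (z1 << half) + z0
```

Each level replaces four half-size products with three, which is where the reduction in base multiplications (and the extra add/sub recombination logic) comes from.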
  3. Vectorized Schönhage–Strassen Multiplier (NTT)

    1. Implements Schönhage–Strassen algorithm using 32-bit modular NTT units vectorized across user-defined lanes. Optimized for binary polynomial multiplication.
    2. Parameters: VectorLength (number of operands), NumLanes (parallel lanes per batch), numpointntt (number of NTT points)
    3. Inputs: Entire vector operands X and Y of size VectorLength × 32 each.
    4. Critical Path: NTT computation and point-wise modular multiplication followed by inverse NTT and recombination.
    5. Disadvantage: High resource usage; no streaming or dynamic sizing; operand size compile-time fixed (32-bit only); complete operand inputs required at once; lower performance for small operand sizes.
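A toy software model of the NTT route follows. The prime 998244353 (NTT-friendly, 119·2^23 + 1, primitive root 3) and 8-bit coefficients are illustrative choices, not necessarily what the RTL uses; `num_points` (cf. numpointntt) must be a power of two, and each operand must fit in num_points // 2 coefficients so the convolution does not wrap and coefficients stay below the modulus.

```python
# Toy model of the Schönhage-Strassen-style route: split operands into small
# coefficients, convolve them via forward NTTs and point-wise products, apply
# the inverse NTT, then recombine with shifts.

MOD = 998244353     # illustrative NTT-friendly prime, primitive root 3
ROOT = 3
COEFF_BITS = 8      # illustrative coefficient width

def ntt(a, invert=False):
    """In-place iterative number-theoretic transform over Z_MOD."""
    n = len(a)
    # bit-reversal permutation
    j = 0
    for i in range(1, n):
        bit = n >> 1
        while j & bit:
            j ^= bit
            bit >>= 1
        j |= bit
        if i < j:
            a[i], a[j] = a[j], a[i]
    # Cooley-Tukey butterflies
    length = 2
    while length <= n:
        w = pow(ROOT, (MOD - 1) // length, MOD)
        if invert:
            w = pow(w, MOD - 2, MOD)
        for start in range(0, n, length):
            wn = 1
            for k in range(start, start + length // 2):
                u = a[k]
                v = a[k + length // 2] * wn % MOD
                a[k] = (u + v) % MOD
                a[k + length // 2] = (u - v) % MOD
                wn = wn * w % MOD
        length <<= 1
    if invert:
        n_inv = pow(n, MOD - 2, MOD)
        for i in range(n):
            a[i] = a[i] * n_inv % MOD
    return a

def ntt_mult(x, y, num_points=64):
    """Multiply x and y: forward NTTs, point-wise products, inverse NTT."""
    mask = (1 << COEFF_BITS) - 1
    ax = [(x >> (COEFF_BITS * i)) & mask for i in range(num_points)]
    ay = [(y >> (COEFF_BITS * i)) & mask for i in range(num_points)]
    fz = [u * v % MOD for u, v in zip(ntt(ax), ntt(ay))]  # point-wise multiply
    z = ntt(fz, invert=True)
    # recombination: coefficients are exact, so a weighted sum rebuilds the product
    return sum(c << (COEFF_BITS * i) for i, c in enumerate(z))
```

The point-wise multiply stage is where the hardware's vectorized 32-bit modular units operate; the forward and inverse transforms dominate the critical path, matching the description above.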
  4. Grouped Base Multiplier (2-Word Blocks) with 1-Word (x, y) Inputs

    1. A schoolbook-style grouped multiplier that computes the product of two operands using two-word blocks. Each clock cycle inputs a single word from each operand (X and Y). Internally, the architecture buffers and reuses previously loaded words to compute multiple partial products across cycles.
    2. Parameters: Size (total operand length in words), base_bits (bit-width of each word)
    3. Inputs: One word from each operand X and Y per clock cycle (each of base_bits width)
    4. Critical Path: Final accumulation of partial products into full result
    5. Advantages: Supports streaming input; low I/O bandwidth per cycle; resource utilization optimized for the compile-time size.
    6. Disadvantages: Compile-time fixed operand size; requires duplicated word storage; idle cycles due to 2-word base block.
    7. Timing degrades as input size increases.
    8. Frequency drops from ~261 MHz (for 64-bit operands) to ~20 MHz (for 4096-bit operands).
    9. LUT and register usage grows significantly with input size due to more control and accumulation logic.
    10. 64-bit base width improves throughput (fewer clock cycles) but uses more DSPs and IOBs.
    11. Power increases slightly, but not drastically — timing is the main concern.
  5. Grouped Base Multiplier (1-Word Block) with 1-Word Inputs

    1. A basic and compact multiplier that computes a partial product using a single base multiplier operating on one word from each operand (X and Y) per cycle.
    2. Parameters: Size (operand length in words), base_bits (bit-width per word)
    3. Inputs: One base_bits-wide word from each operand X and Y per cycle
    4. Critical Path: Final result accumulation
    5. Advantages: Enables streaming input; resource-efficient for fixed operand sizes; low per-cycle I/O bandwidth.
    6. Disadvantages: Operand size fixed at compile time; duplicated word storage in internal A and B registers; added latency due to sequential word loading.
    7. Lower resource usage: Fewer DSPs and LUTs than the 2-group design, especially for smaller input sizes.
    8. Longer execution time (more clock cycles), particularly as size grows — due to the sequential accumulation bottleneck.
    9. Timing analysis shows the design cannot meet timing constraints at scale.
    10. Good power efficiency — reduced logic leads to slightly lower total and static power.
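The word-serial schedule described above can be sketched as a software model (not the RTL): on cycle k the words x[k] and y[k] arrive and are buffered, and exactly the partial products x[i]·y[j] with max(i, j) = k become computable. `BASE_BITS` is an assumed default for base_bits.

```python
# Sketch of the one-word-per-cycle streaming schedule: each cycle's arriving
# words, combined with the buffered earlier words (the internal A and B
# registers), enable the partial products x[i]*y[j] with max(i, j) == k.

BASE_BITS = 32
MASK = (1 << BASE_BITS) - 1

def to_words(v, n_words):
    return [(v >> (BASE_BITS * i)) & MASK for i in range(n_words)]

def streaming_mult(x_words, y_words):
    a_buf, b_buf = [], []        # models the duplicated word storage
    acc = 0
    for k, (xw, yw) in enumerate(zip(x_words, y_words)):
        a_buf.append(xw)
        b_buf.append(yw)
        for i in range(k + 1):   # cross products newly enabled by y[k]
            acc += (a_buf[i] * b_buf[k]) << (BASE_BITS * (i + k))
        for j in range(k):       # cross products newly enabled by x[k]
            acc += (a_buf[k] * b_buf[j]) << (BASE_BITS * (k + j))
    return acc
```

Every index pair (i, j) is visited exactly once across the cycles, which is why the final accumulation, not the scheduling, dominates the critical path.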
  6. Karatsuba Multiplier (Full Operand Width)

    1. This architecture implements the Karatsuba multiplication algorithm over the entire operand width using a single-level split. At the base level, it uses standard schoolbook multipliers for computing the partial products. Only one word block from each operand (X and Y) is provided as input per clock cycle. The design internally manages operand word reuse and accumulation.
    2. Parameters: Size (total operand length in words), base_bits (bit-width per word)
    3. Inputs: One base_bits-wide word from each operand X and Y per clock cycle
    4. Critical Path: Accumulation and recombination of recursive Karatsuba subterms (Z0, Z1, Z2) including subtraction and addition dependencies across recursion levels
    5. Advantages: Supports streaming input; improves multiplication depth efficiency for large operand sizes; reduces total number of base multiplications compared to schoolbook.
    6. Disadvantages: Operand size fixed at compile time; duplicated operand word storage in A and B registers; increased resource usage due to extra add/sub logic for recursive recombination.
    7. Very high resource usage: LUTs, registers, and especially DSPs scale rapidly due to recursive call overhead and parallel sub-multipliers.
    8. Timing holds up well up to size 1024, after which WNS drops sharply (approaching zero).
    9. Power consumption increases significantly with size — Karatsuba is logic-heavy.
    10. Greatest benefit: Lower clock cycles compared to other methods.
  7. Grouped Base Multiplier (1-Word Block) with 2-Word Inputs (x1, x2, y1, y2)

    1. A compromise architecture that receives 2-word inputs but uses a single-word base multiplier for each pair. Executes x1×y1 and buffers x2, y2 for future cycles.
    2. Parameters: Size (operand length in words), base_bits (bit-width per word)
    3. Inputs: Two base_bits-wide words from each operand X and Y per clock cycle (x1, x2, y1, y2)
    4. Critical Path: Final accumulation of delayed partial products across buffered cycles
    5. Advantages: Supports input streaming; reduces multiplier complexity with single-word base unit; resource-efficient for fixed-size operands.
    6. Disadvantages: Operand size fixed at compile time; duplicated word storage in internal registers; underutilizes available input bandwidth per cycle due to serialized base multiplier.
    7. Increased BRAM use reduces read/write contention and improves memory bandwidth.
    8. Despite that, performance (clock cycles) is still limited by single accumulator bottleneck.
    9. Timing degrades with size: negative WNS at operand sizes ≥ 1024 bits with an 11 ns clock period.
    10. IOB count is higher due to wider BRAM interfacing and output access (162/322 bonded IOBs).
    11. Power usage is modestly higher than 2-BRAM setup, mainly due to more BRAMs and IOBs.
  8. Generic Multiplier with Alternating Dual-BRAM Inputs

    1. This architecture uses 4 BRAMs in total — 2 for operand X and 2 for operand Y. For each operand, the BRAMs are accessed using alternating indices, which enables continuous data flow and avoids the two-cycle read latency typically associated with single BRAM access.
    2. In each clock cycle, the design receives interface_bits-wide word blocks from the BRAMs. These are grouped and fed into the base multiplier, which operates on inputs of base_bits width. It is assumed that base_bits is a multiple of interface_bits, allowing seamless grouping and scheduling.
    3. The BRAM depth parameter defines the total number of words storable per BRAM and indirectly determines the maximum supported operand size, which is interface_bits × bram_depth.
    4. The module accepts a size input at runtime and supports multiplication of any operand size up to interface_bits × bram_depth bits.
    5. Parameters: interface_bits (I/O word width), base_bits (base multiplier width), bram_depth (words per BRAM)
    6. Inputs: Runtime input size (in bits), with words loaded via single-port BRAMs (two per operand)
    7. Critical Path: Final accumulation of partial products from base multiplier outputs, including grouping and alignment logic
    8. Advantages: Supports dynamic operand sizes; separates interface and computation widths; maintains constant resource profile across sizes; avoids BRAM read latency through alternating access.
    9. Disadvantages: Higher relative resource usage for small operands; optimal only when size ≈ interface_bits × bram_depth; requires duplicated register buffers; lower max frequency due to flexible scheduling and runtime sizing logic.
    10. Good balance between resource usage and timing for smaller configurations (e.g., 32×32 with depth 4).
    11. WNS drops significantly for larger depths (64), indicating timing issues as more data is stored.
    12. Registers and LUTs scale moderately with input size; DSP usage rises quickly with width.
    13. Static power remains stable, but total power increases with size and depth.
    14. Only 2–4 BRAMs used across configs – efficient memory use.
  9. Parallel Traditional Base Multiplier with If-Else Logic

    1. This multiplier uses 4 BRAMs (2 for operand X and 2 for operand Y) and assumes that the base multiplier width (base_bits) is equal to the interface bit-width (interface_bits).
    2. The multiplication is performed using a traditional shift-and-add method, implemented via if-else logic:
      For each bit of operand B, if the bit is 1, operand A is left-shifted by that bit position (i.e., multiplied by 2^bit_position) and accumulated. This operation is replicated in parallel for multiple word blocks, allowing word-level parallelism.
    3. The module accepts single-word inputs per clock cycle (i.e., one block of operand A and one of operand B, each of base_bits width), and internally generates partial products using the parallel shift-and-add units. These partial products are then accumulated in subsequent cycles to compute the full result.
    4. Parameters: base_bits (operand block size), bram_depth (number of blocks per BRAM)
    5. Inputs: Runtime input size (in bits)
    6. Critical Path: Bitwise conditional shift and accumulation logic within parallel shift-and-add units
    7. Advantages: Enables parallelism in base multiplier; traditional shift-add logic is faster than full multiplication; supports dynamic operand sizes; constant resource usage across sizes; avoids BRAM read latency; achieves higher max frequency due to regular control flow.
    8. Disadvantages: Higher resource usage for small or irregular operand sizes; not size-optimized for operand-specific configurations.
    9. No DSP usage: Entirely LUT-based multiplication, which explains very high LUT utilization, especially for 64×64 (over 100K LUTs).
    10. WNS and timing still acceptable, but large logic usage may stress placement and routing, limiting scalability.
    11. Power consumption increases sharply with size due to heavy combinational logic.
    12. Frequency remains fairly high, showing that parallelism is effective despite high resource use.
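The shift-and-add kernel described in this approach can be sketched directly. This is a sequential software model; the hardware generates the per-bit terms in parallel rather than looping.

```python
# Sketch of the if-else shift-and-add base multiplier: for every set bit of
# operand B, operand A shifted left by that bit position is accumulated.

def shift_add_mult(a, b, base_bits=32):
    acc = 0
    for pos in range(base_bits):
        if (b >> pos) & 1:       # "if the bit is 1"
            acc += a << pos      # add A * 2^pos
    return acc
```

Because each conditional term is pure shift-and-add logic, the whole multiplier maps to LUTs with no DSPs, which is consistent with the very high LUT counts reported above.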
  10. Generic Parallel x*y Multiplier with 4 BRAMs

    1. This architecture uses 4 BRAMs — two each for operands X and Y. It is a generic, parameterized multiplier that instantiates multiple base x * y multipliers operating in parallel. Each multiplier processes one word block of base_bits width from each operand.
    2. Operand words are fetched alternately from BRAMs to hide read latency, enabling a steady data stream to the multipliers. The number of partial products computed per cycle depends on the degree of parallelism (defined by the number of base multipliers instantiated).
    3. Each base multiplier takes base_bits-wide inputs and computes a partial product. The results are scheduled for accumulation based on their respective shift amounts (position in final product).
    4. Parameters: base_bits (bit-width of each word block), bram_depth (depth of operand BRAMs)
    5. Inputs: Runtime input size (in bits), with BRAM-based word-level streaming
    6. Critical Path: Operand word routing and queuing to parallel base multipliers, including alignment and accumulation scheduling
    7. Advantages: Supports high parallelism; handles dynamic operand sizes; maintains constant resource usage across sizes; hides BRAM latency via alternating access.
    8. Disadvantages: Requires for-loops for managing all base multipliers; additional queues and control logic for operand alignment; higher resource usage; duplicated register storage for buffered words.
    9. Balanced resource usage: Compared to the modified schoolbook method, this achieves lower LUT usage and controlled DSP utilization while retaining high throughput.
    10. Good timing: WNS remains comfortably positive even for 128×128, with no critical timing issues.
    11. Scales better: Resource growth is linear and manageable — more suitable for large operand sizes.
    12. Efficient power profile: Power rises with size but more slowly compared to full-logic versions.
  11. Generic x*y Multiplier with 4 BRAMs and No Intermediate Registers

    1. This design uses 4 BRAMs — two for each input operand (X and Y) — and directly streams operand word blocks into a base multiplier without storing them in intermediate registers.
    2. Operand words are read alternately from dual BRAMs to hide the 2-cycle BRAM read latency, and are fed directly to the x*y base multiplier in real time. No pipeline or intermediate storage is used — the multiplication is initiated as soon as the operands are available.
    3. Parameters: base_bits (word width per BRAM access), bram_depth (number of word blocks per operand)
    4. Inputs: Runtime input size (in bits), streamed from BRAM
    5. Critical Path: Final accumulation of partial products from real-time base multiplier outputs
    6. Advantages: Fewer clock cycles due to immediate operand use; no duplicated word storage; lower overall resource utilization compared to other generic multiplier designs; supports dynamic operand sizes.
    7. Disadvantages: Higher relative resource usage for small or irregular operand sizes; no operand-specific resource optimization.
    8. Lower register usage: Registers are minimal compared to other methods, due to no intermediate storage.
    9. Good BRAM and DSP usage: Efficient use of memory blocks and DSPs per operation.
    10. Timing is stable: WNS is positive for all sizes; timing is not critical, though less slack than other methods.
    11. Power and frequency: Slightly better power efficiency than methods with register banks; frequency remains above 110 MHz even at 128×128.
  12. Generic Multiplier Using 2 BRAMs (X and Y), Parameterized Interface and Base Widths

    1. This architecture uses only 2 BRAMs — one for each operand (X and Y). Each BRAM is read sequentially to stream operand blocks of width interface_bits into the datapath. The actual multiplication is carried out by a parameterized base multiplier of width base_bits.
    2. Operand blocks are read from the BRAMs and appropriately buffered (if needed) to build up base multiplier inputs of width base_bits. The system assumes that base_bits is a multiple of interface_bits so operand fragments can be concatenated or sliced deterministically.
    3. Parameters: interface_bits, base_bits, bram_depth — define BRAM access width, base multiplier word size, and operand depth.
    4. Inputs: size — operand size provided at runtime to determine number of BRAM accesses and shifts required.
    5. Critical Path: Final out accumulation — dominant delay comes from accumulation of all partial products after multiplication.
    6. Advantages: High operating frequency due to minimal BRAM contention and smaller datapath; lower BRAM usage; runtime-configured sizing allows flexibility across operand widths; scalable and portable for multiple applications.
    7. Disadvantages: Higher number of clock cycles due to serial operand fetch and base operand formation; requires duplicate registers for temporary operand storage; control logic is more complex for size alignment and shift tracking; resource underutilization for small operands.
    8. Low BRAM usage: Only 1–2 BRAMs even for 128×64 configuration.
    9. Efficient resource profile: LUTs, registers, and DSPs are all modest, especially for 32×32 and 64×32.
    10. Positive WNS across all tests: Indicates good timing closure with adequate slack, even without aggressive pipelining.
    11. Frequency holds reasonably well (140+ MHz for 64×64, ~125 MHz for 128×64), making it practical for mid-speed designs.
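The interface/base width decoupling in this approach can be modelled as below: interface_bits-wide chunks streamed from each BRAM are concatenated into base_bits-wide multiplier inputs, assuming (as the design does) that base_bits is a multiple of interface_bits. The parameter defaults are illustrative, not taken from the reports.

```python
# Sketch of the 2-BRAM design's datapath: stream interface-width chunks,
# concatenate them into base-multiplier words, then accumulate shifted
# word products into the final result.

def group_chunks(chunks, interface_bits, base_bits):
    """Concatenate interface-width chunks into base-multiplier words."""
    ratio = base_bits // interface_bits
    words = []
    for i in range(0, len(chunks), ratio):
        w = 0
        for j, c in enumerate(chunks[i:i + ratio]):
            w |= c << (interface_bits * j)   # little-endian concatenation
        words.append(w)
    return words

def bram_mult(x, y, size_bits, interface_bits=16, base_bits=32):
    """Runtime-sized multiply built from base_bits-wide word products."""
    mask = (1 << interface_bits) - 1
    n = -(-size_bits // interface_bits)      # ceil: chunks per operand
    xc = [(x >> (interface_bits * i)) & mask for i in range(n)]
    yc = [(y >> (interface_bits * i)) & mask for i in range(n)]
    xw = group_chunks(xc, interface_bits, base_bits)
    yw = group_chunks(yc, interface_bits, base_bits)
    acc = 0
    for i, a in enumerate(xw):
        for j, b in enumerate(yw):
            acc += (a * b) << (base_bits * (i + j))
    return acc
```

The `size_bits` argument plays the role of the runtime size input: it alone determines how many BRAM accesses and base-word products are issued.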
  13. 4-BRAM Generic Multiplier with Karatsuba Base Multiplier (Configurable Levels)

    1. This multiplier architecture uses 4 BRAMs: 2 each for operand X and Y, alternating indices to minimize read latency. The data fetched from each BRAM is interface_bits wide per cycle.
    2. Input operands are sliced into interface_bits-sized chunks and fed into the recursive Karatsuba structure, where each level performs splits, recursive calls, and combines partial products.
    3. Parameters: base_bits, interface_bits, max_karatsuba_levels, bram_depth — configure base multiplier resolution, input bandwidth, depth of Karatsuba recursion, and BRAM capacity.
    4. Inputs: size — defines operand width dynamically, controlling how many BRAM words are fetched and processed.
    5. Critical Path: Final out accumulation — latency dominated by partial product summation and recombination after recursive multiplications.
    6. Advantages: Faster multiplication through subquadratic Karatsuba logic; recursive structure handles larger operands efficiently; supports dynamic operand sizing at runtime; leverages parallelism across blocks and recursive calls.
    7. Disadvantages: Higher resource utilization due to recursive decomposition and control logic; lower operating frequency observed from complex datapath and deeper logic levels; requires careful management of recursion depth to balance performance and usage.
    8. LUT utilization is significantly higher compared to standard base multipliers (e.g., 6042 LUTs for 64×64), due to Karatsuba’s logic overhead.
    9. DSP usage is minimal (only 3–17), indicating most computation is LUT-based.
    10. WNS becomes negative at 128×32 for an 11 ns clock period, suggesting timing closure issues at higher bit widths.
    11. Frequencies drop with larger sizes (down to ~85 MHz for 128×32), limiting throughput potential.
  14. Karatsuba Multiplier (Interface-Width Inputs Directly from BRAM)

    1. This design performs Karatsuba multiplication on full-width operands, with X and Y input blocks of width interface_bits each. The operands are fetched from 2 BRAMs, with one word per operand provided per cycle.
    2. The Karatsuba recursion proceeds with each level consuming full interface-width chunks from BRAM. The number of levels is configurable via max_karatsuba_levels, and multiplication continues until the base level where schoolbook (x*y) multipliers are used.
    3. Parameters: interface_bits, max_karatsuba_levels, bram_depth — define input word size, recursion depth of Karatsuba, and number of operand blocks.
    4. Inputs: size — total operand bit-width that determines how many BRAM words are used and how deep the recursion proceeds.
    5. Critical Path: Final out accumulation — dominated by partial product summation and Karatsuba recombination delays.
    6. Advantages: Efficient for large operands; configurable recursion depth optimizes the speed vs. resource trade-off.
    7. Disadvantages: Increased clock cycles; complex control for recursion and operand splitting; lower frequency at higher recursion levels due to deeper combinational logic.
    8. LUT usage is very high (8K+), showing that the recursive structure heavily relies on combinational logic.
    9. DSP usage increases significantly with operand size, reaching 108 for 64×64 – indicating less DSP savings than partial Karatsuba.
    10. Frequency remains reasonably stable (115–120 MHz), but still lower than simpler multiplier designs.
    11. Registers usage is high (up to ~5K), implying large pipelining or temporary value storage overhead.
  15. Shift-Based BRAM Accumulating Generic Multiplier

    1. This design performs multiplication of large-width operands by breaking them into smaller base_mult-bit blocks. It utilizes BRAMs for storing operand chunks and accumulates partial products using a shift-indexed accumulation strategy, where results of base multiplications are stored in BRAM at addresses equal to the sum of their operand indices.
    2. Input Partitioning: The full-width operands A and B are partitioned into smaller blocks of width base_mult bits. These blocks are stored in four BRAMs (a1, a2, b1, b2) in an interleaved fashion to allow dual operand access (via even/odd indexed reads).
    3. Block Multiplication: A generic_base_multiplier module performs multiplication of 2 operand blocks per cycle, based on control signals a_in and b_in, which select between even and odd block pairs.
    4. Partial Product Accumulation:
      1. Each partial product is shifted based on the sum of its block indices (i.e., i+j) and written to a BRAM (final_out) at address i+j.
      2. A read-modify-write mechanism is used: the existing value at address i+j is read from BRAM, the new product is added to it, and the result is written back.
      3. This accumulation continues for all block pairs, effectively performing multi-block multiplication via convolution-like accumulation.
    5. Final Output Generation:
      1. After all partial products are accumulated, the BRAM content is read sequentially.
      2. The result is then streamed out base_mult bits at a time, with carry propagated and accumulated using a pipelined adder tree logic to reconstruct the full-width output.
    6. Parameters: base_bits (word width per BRAM access), bram_depth (number of word blocks per operand)
    7. Inputs: Runtime input size (in bits), streamed from BRAM
    8. Critical Path: The critical path of 5.297 ns spans from the BRAM read (DEVICE_8SERIES.NO_BMM_INFO.SP.WIDE_PRIM36) to the pipeline register mult_pipe_reg[54]/D, involving 5 DSP blocks, 2 LUT6s, and totals 5.472 ns delay (logic: 3.004 ns, routing: 2.468 ns) with 5.703 ns slack under an 11.000 ns clock period.
    9. Advantages: Reduces logic complexity by shifting partial product accumulation into BRAM; scales efficiently with operand size due to the memory-based accumulation model; avoids full-width temporary buffers by storing intermediate sums indexed by shift depth; enables consistent output bandwidth (base_mult bits per cycle) after accumulation, supporting streaming applications; minimal logic duplication due to dual-port BRAM reuse; architecture-friendly for FPGAs with abundant BRAM resources; parametrizable design, portable to any board.
    10. Disadvantages: High latency due to full quadratic accumulation cycles (O(n²)) before output begins; read-modify-write operations on BRAM introduce sequential bottlenecks and limit clock frequency; partial sum hazard management requires complex control logic; sequential shift-indexed accumulation leads to underutilization of available parallelism; BRAM bandwidth constraints prevent multiple partial products from being accumulated simultaneously; carry propagation handled only at the final output stage, introducing additional post-processing delay; tightly coupled memory access and multiplier scheduling make timing closure harder for large operand sizes.
    11. Doubling the base size roughly quarters the latency: halving the number of words per operand quarters the number of word-pair partial products. Latency increases linearly with input size and inversely with base multiplier size.
    12. LUT and DSP usage grows rapidly (roughly quadratically) with base size; base 256 is too resource-intensive.
    13. Frequency decreases with larger bases; base 128 is the highest safe base size, with good speed and no critical timing issues.
| Base | Pros | Cons | Usage |
|------|------|------|-------|
| 32 | Very low area, high frequency | High latency | Small FPGAs, low-power |
| 64 | Balanced latency and area | Moderate DSP usage | Embedded devices with DSP |
| 128 | Best latency-area trade-off | Timing close to critical | High-speed apps on mid-range FPGAs |
| 256 | Lowest latency | Lower frequency, more area | Not practical except on very high-end FPGAs with deep pipelining |
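The shift-indexed accumulation strategy of this approach can be sketched in software: the product of block i of A and block j of B is added (read-modify-write) into an accumulator array at address i + j, standing in for the final_out BRAM, and carries beyond the block width are resolved only while streaming the result out. `BASE_MULT` is an assumed default block width.

```python
# Sketch of the shift-indexed BRAM accumulation and the carry-propagating
# output stage of the shift-based generic multiplier.

BASE_MULT = 32
MASK = (1 << BASE_MULT) - 1

def to_blocks(v, n_blocks):
    return [(v >> (BASE_MULT * i)) & MASK for i in range(n_blocks)]

def shift_indexed_mult(a_blocks, b_blocks):
    n = len(a_blocks)
    bram = [0] * (2 * n)                 # models the final_out BRAM
    for i in range(n):
        for j in range(n):
            bram[i + j] += a_blocks[i] * b_blocks[j]   # read-modify-write
    # output stage: stream BASE_MULT bits per address with carry propagation
    result, carry = 0, 0
    for addr in range(2 * n):
        total = bram[addr] + carry
        result |= (total & MASK) << (BASE_MULT * addr)
        carry = total >> BASE_MULT
    return result | (carry << (BASE_MULT * 2 * n))   # final carry (zero in range)
```

Storing sums at address i + j is exactly the convolution-like accumulation described above; deferring carries to the output stage is what keeps the per-cycle accumulation a simple read-add-write.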

Results
Latency Trends:

  • Karatsuba (Recursive & Hybrid) showed low latency at small sizes, but latency scaled poorly due to recursion depth and overhead.
  • Parallel Traditional Multiplier offered very low latency but required massive area and was limited by BRAM depth.
  • Managed Memory Access Design had moderate latency with good scalability, striking a balance between control and resource efficiency.
  • Generic Multiplier with Accumulation BRAM scaled better for large operand sizes, keeping latency low by minimizing intermediate storage overhead.
  • Multi-base Parallel Designs had competitive latency while offering flexibility with moderate resource overhead.

Frequency & Timing Closure:

  • Base-32 and Base-64 Generic Multiplier Designs achieved high frequencies (up to ~198 MHz), meeting timing comfortably.
  • Base-128 and Base-256 Designs (especially recursive or deep BRAM ones) often failed timing, especially with large interface widths or parallelism.
  • Karatsuba Fully Recursive Designs had poor WNS at scale (e.g., 0.077 ns at size 1024), barely meeting or failing timing.

Resource Utilization:

  • Traditional Karatsuba and Parallel Multiplier Variants consumed extremely high LUTs and DSPs (e.g., >100K LUTs for 64x64 parallel).
  • Generic Multiplier + Accumulation BRAM and Managed Access Designs had modest to moderate resource use.
  • Fully Recursive Karatsuba used less DSP but increased LUT usage due to logical duplication.

Power Consumption:

  • Closely tied to resource usage and switching activity.
  • Karatsuba with minimal intermediate storage consumed more dynamic power due to frequent memory access.
  • Traditional + Parallel methods consumed the most static power due to high area occupancy.

BRAM Usage:

  • Designs with accumulation (e.g., 2/4 BRAM groupings) used BRAM efficiently for large-scale multiplications.
  • Karatsuba and Traditional Parallel relied less on BRAM, more on registers and logic.
  • Managed Memory Designs offered the best BRAM-to-performance balance.

Scalability & Suitability:

  • Base-128 Generic Multiplier (BRAM accumulation based) was the best overall trade-off: good frequency, acceptable resource use, and low latency.
  • Parallel Traditional Multiplier is ideal when area is not constrained but high throughput is critical.
  • Karatsuba is optimal for reducing DSP usage, especially on resource-limited FPGAs.
  • Generic with Memory Management is suitable for controlled latency with minimal area overhead.

Applications
This generic multiplier design, optimized for large-integer multiplication using parameterized base widths and efficient BRAM streaming, has a wide range of technical applications in cryptography, signal processing, and hardware acceleration. The design is especially suited for scenarios requiring high-throughput, dynamically configurable multiplication across varying operand sizes.

  1. Public-Key Cryptography (RSA, ECC, Post-Quantum)

    1. Enables multiplication of 1024, 2048, and 4096-bit integers essential for RSA encryption/decryption.

    2. Forms the core compute block for ECC scalar multipliers over prime fields and binary fields.

    3. Supports large-number arithmetic in lattice-based and code-based post-quantum cryptographic algorithms.

  2. Modular Multiplication Units

    1. Implements Montgomery multiplication units for efficient modular exponentiation in RSA or ECC.

    2. Facilitates Barrett reduction for fast modular operations in hardware cryptographic cores.

    3. Capable of operand reuse and latency hiding due to BRAM-based operand block streaming.

  3. Finite Field Arithmetic

    1. Supports field multiplications in GF(p) and GF(2^m) used in cryptosystems like ECDSA, Curve25519, and others.

    2. Enables high-speed polynomial arithmetic in error correction codes such as Reed–Solomon and BCH codes.

    3. Configurable base and interface widths allow precise alignment with field element sizes.

  4. Blockchain and Zero-Knowledge Proof Systems (ZKPs)

    1. Provides reusable, scalable multiplication hardware for zk-SNARK/zk-STARK circuits over large fields.

    2. Efficiently handles field arithmetic for Merkle-tree hashing, Poseidon/SHA arithmetic, and circuit evaluations.

    3. Suitable for offloading proof-generation workloads onto FPGA-based zk accelerators.

  5. Digital Signal Processing (DSP) and Polynomial Arithmetic

    1. Acts as a high-precision multiplier in convolution, correlation, and transform pipelines over large integer or fixed-point domains.

    2. Supports polynomial evaluation and multiplication for cryptographic primitives like NTRU or Ring-LWE.

    3. Adaptable for large operand FFT butterfly units where integer or modular multiplication is required.
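The Ring-LWE-style polynomial products mentioned above reduce to negacyclic convolution in Z_q[x]/(x^n + 1). A schoolbook Python sketch (illustrative only; production accelerators typically use an NTT instead) shows the sign-flipped wraparound that distinguishes it from plain convolution:

```python
def negacyclic_mul(a: list[int], b: list[int], q: int) -> list[int]:
    """Schoolbook polynomial multiplication in Z_q[x]/(x^n + 1):
    coefficients wrapping past degree n-1 re-enter with a sign flip,
    because x^n == -1 in this quotient ring."""
    n = len(a)
    c = [0] * n
    for i in range(n):
        for j in range(n):
            k = i + j
            if k < n:
                c[k] = (c[k] + a[i] * b[j]) % q
            else:
                c[k - n] = (c[k - n] - a[i] * b[j]) % q  # x^n wraps to -1
    return c
```

The n^2 coefficient products here are the same base-width multiplies the integer datapath provides, which is why one configurable multiplier core can serve both the integer and the polynomial use cases.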

  6. Scientific Computing and Arbitrary-Precision Arithmetic

    1. Enables large integer multiplication in symbolic algebra engines and numerical solvers deployed on FPGA.

    2. Useful for hardware counterparts of arbitrary-precision libraries such as GMP and MPFR.

    3. Dynamically scales without significant area overhead, ideal for reconfigurable computing platforms.

Future Scope
Future research can focus on automated design-space exploration to dynamically select optimal configurations (e.g., base size, recursion depth, BRAM mapping) at runtime based on workload characteristics. Further improvements can include hardware-software co-design techniques for better integration with CPUs or embedded processors, and support for pipelined vector multiplication to accelerate multi-operand arithmetic.

On the system level, evaluating the multipliers within larger arithmetic processing units or AI accelerators would help assess their impact on end-to-end performance, particularly in DSP, cryptography, and machine learning applications. Finally, integrating power-aware dynamic reconfiguration could lead to intelligent energy-saving modes tailored to runtime demands.

Conclusion
This work presents a comprehensive exploration of dynamically reconfigurable multiplier architectures, incorporating both traditional and advanced methods such as Schoolbook, Karatsuba, and their hybrid combinations. By systematically analyzing multiple configurations across parameters like base size, recursion depth, BRAM usage, and accumulation strategies, the designs were evaluated in terms of performance, area, power consumption, and latency. The results clearly demonstrate the trade-offs between resource utilization and computational efficiency, highlighting the effectiveness of hybrid approaches and memory-aware optimizations. The flexibility achieved through reconfigurability enables support for a wide range of operand sizes and application needs, making the proposed architectures highly suitable for embedded systems, DSP, and hardware accelerators.

Final Codes and Results
Google Sheets link: Large Integer Multiplication Results
Github Link: https://github.com/Snigdha2005/Large_Integer_Multiplication

References

[1] T. Matsuyama and A. Ohta, “High-Speed VLSI Multiplication Algorithm with a Redundant Binary Addition Tree,” IEEE Transactions on Computers, 1980.

[2] Y. Liu et al., “Hardware-Implemented Lightweight Accelerator for Large Integer Polynomial Multiplication,” IEEE Letters, 2021.

[3] Z. Shen et al., “Hardware Complexity of Modular Multiplication and Exponentiation,” IEEE Transactions on Computers, 2019.

[4] R. Kumar et al., “Evaluation of Large Integer Multiplication Methods on Hardware,” Journal of Hardware Systems, 2022.

[5] M. E. Kai et al., “Evaluating the Hardware Performance of a Million-bit Multiplier,” IEEE International Symposium on Field-Programmable Custom Computing Machines, 2020.

[6] L. Wang et al., “Folded Integer Multiplication for FPGAs,” Proceedings of the ACM/SIGDA FPGA Symposium, 2020.

[7] A. Setiawan and T. S. Gunawan, “Comparison of Multiplication Algorithms Based on FPGA,” International Journal of Advanced Computer Science and Applications (IJACSA), 2019.
