MCAmericanEngine APIs - 2023.2 English

In this library, American option pricing with Monte Carlo simulation is provided as the API MCAmericanEngine(). However, because the design uses external memory (DDR/HBM) and should avoid being placed and routed across SLRs, the American option engine supports two modes:

Single-API version: one API runs the whole American option pricing flow.
Three-API version: three APIs/kernels are provided and connected on the host side to compose the overall design.

The boundary between them is external memory access. For the calibration process, two APIs are provided. Calibration steps 1 and 2 are wrapped as one kernel, MCAmericanEnginePreSamples. Steps 3 and 4 compose another kernel, MCAmericanEngineCalibrate. The pricing process is a third kernel, MCAmericanEnginePricing. Because the pricing process is a separate kernel, data can no longer be exchanged between the calibration and pricing processes through BRAM. The implementation therefore uses DDR/HBM to store the coefficient data.

With the three kernels, a kernel-level pipeline can be used to shorten the overall execution time. However, a kernel-level pipeline requires complex scheduling in the host code. Figure 64 illustrates how the three kernels are connected to form a complete system. Price data and \(B\) matrix data are the outputs of kernel MCAmericanEnginePreSamples. For each timestep, the price data for every path (4096 paths by default) and 9 elements of \(B\) matrix data need to be saved to DDR or HBM memory.
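As a rough illustration of the storage this implies, the sketch below adds up the bytes written per run. The 8-byte (double-precision) element size, the 100-timestep count, and the absence of any padding or alignment are assumptions made for the example, not values taken from the library; `presample_buffer_bytes` is a hypothetical helper.

```python
def presample_buffer_bytes(timesteps, paths=4096, b_elems=9, bytes_per_value=8):
    """Return (price_bytes, b_bytes) for the pre-sampling outputs.

    Assumes one price value per path per timestep and 9 stored
    elements of B per timestep, each bytes_per_value wide (assumed 8).
    """
    price_bytes = timesteps * paths * bytes_per_value
    b_bytes = timesteps * b_elems * bytes_per_value
    return price_bytes, b_bytes


# Hypothetical run with 100 timesteps and the default 4096 paths.
price_bytes, b_bytes = presample_buffer_bytes(timesteps=100)
print(price_bytes, b_bytes)  # 3276800 7200
```

Under these assumptions the price data dominates the footprint, while the \(B\) data stays small because only 9 values are kept per timestep.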

McAmericanEngine Vitis project architecture on FPGA

Kernel 1, MCAmericanEngineCalibrate, reads price data \(y\) and matrix data \(B\) from external memory and writes the coefficients to DDR/HBM. The last kernel, MCAmericanEnginePricing, reads the coefficient data from DDR/HBM and saves the final output, the optimal exercise price, to DDR/HBM.
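To see why a kernel-level pipeline can shorten execution time, the toy model below compares running the three kernels back to back with overlapping them across several independent jobs. The stage durations, the job count, and the assumption that each kernel handles one job at a time in order are illustrative only; this is not the library's host code.

```python
def sequential_makespan(stage_times, jobs):
    """Total time when each job runs all stages before the next job starts."""
    return jobs * sum(stage_times)


def pipelined_makespan(stage_times, jobs):
    """Total time when a stage may start a job as soon as the stage is free
    and the previous stage has finished that job."""
    finish = [0] * len(stage_times)  # finish time of the latest job per stage
    for _ in range(jobs):
        ready = 0  # time at which this job's input for the next stage is ready
        for s, t in enumerate(stage_times):
            start = max(ready, finish[s])
            finish[s] = start + t
            ready = finish[s]
    return finish[-1]


# Hypothetical durations for PreSamples, Calibrate, Pricing (arbitrary units).
stages = [3, 2, 1]
print(sequential_makespan(stages, 4))  # 24
print(pipelined_makespan(stages, 4))   # 15
```

In this model the pipelined time is bounded by the slowest stage, which is one reason the host-side schedule matters.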

Hint

Why are only 9 elements of matrix \(B\) stored in DDR/HBM?

The matrix \(A\) is \(4096 \times 4\) for each timestep when the path number is 4096 (default). The size of its transpose \(A^T\) is \(4 \times 4096\). So, the size of matrix \(B = A^T A\) is \(4 \times 4\). However, some elements of \(B\) are the same, so 9 values can represent all 16 elements. More precisely, assuming

\[\begin{split}
A^T = \begin{bmatrix} 1 & 1 & \cdots & 1 & \cdots & 1 \\ S_0 & S_1 & \cdots & S_t & \cdots & S_T \\ S_0^2 & S_1^2 & \cdots & S_t^2 & \cdots & S_T^2 \\ E_0 & E_1 & \cdots & E_t & \cdots & E_T \end{bmatrix}, \quad
A = \begin{bmatrix} 1 & S_0 & S_0^2 & E_0 \\ 1 & S_1 & S_1^2 & E_1 \\ \vdots & \vdots & \vdots & \vdots \\ 1 & S_t & S_t^2 & E_t \\ \vdots & \vdots & \vdots & \vdots \\ 1 & S_T & S_T^2 & E_T \end{bmatrix} \\ \\
\Rightarrow B = A^T A = \begin{bmatrix} \sum(1) & \sum(S_i) & \sum(S_i^2) & \sum(E_i) \\ \sum(S_i) & \sum(S_i^2) & \sum(S_i^3) & \sum(S_iE_i) \\ \sum(S_i^2) & \sum(S_i^3) & \sum(S_i^4) & \sum(S_i^2E_i) \\ \sum(E_i) & \sum(S_iE_i) & \sum(S_i^2E_i) & \sum(E_i^2) \end{bmatrix}
\end{split}\]

It is evident that some elements are the same. After removing duplicated elements, the following 9 elements of \(B\) are stored to DDR/HBM each timestep:

\[\begin{split}B_{save} = \begin{bmatrix} \sum(1) \\ \sum(S_i) \\ \sum(S_i^2) & \sum(S_i^3) & \sum(S_i^4) \\ \sum(E_i) & \sum(S_iE_i) & \sum(S_i^2E_i) & \sum(E_i^2) \end{bmatrix}\end{split}\]
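The reduction from 16 to 9 values can be checked with a small plain-Python sketch. It packs the 9 sums in the order they appear in \(B_{save}\) and rebuilds the full \(4 \times 4\) matrix from them. The function names and the flat storage order are assumptions for illustration, and the sample \((S_i, E_i)\) pairs are made-up integers; the actual memory layout used by the kernels may differ.

```python
def pack_b(samples):
    """samples: list of (S, E) pairs for one timestep.
    Returns the 9 distinct sums of B = A^T A, in B_save order."""
    n = len(samples)                             # sum(1)
    s1 = sum(s for s, _ in samples)              # sum(S)
    s2 = sum(s ** 2 for s, _ in samples)         # sum(S^2)
    s3 = sum(s ** 3 for s, _ in samples)         # sum(S^3)
    s4 = sum(s ** 4 for s, _ in samples)         # sum(S^4)
    e1 = sum(e for _, e in samples)              # sum(E)
    se = sum(s * e for s, e in samples)          # sum(S*E)
    s2e = sum(s ** 2 * e for s, e in samples)    # sum(S^2*E)
    e2 = sum(e ** 2 for _, e in samples)         # sum(E^2)
    return [n, s1, s2, s3, s4, e1, se, s2e, e2]


def unpack_b(packed):
    """Rebuild the full 4x4 matrix B from the 9 packed sums."""
    n, s1, s2, s3, s4, e1, se, s2e, e2 = packed
    return [
        [n, s1, s2, e1],
        [s1, s2, s3, se],
        [s2, s3, s4, s2e],
        [e1, se, s2e, e2],
    ]


def full_b(samples):
    """Reference: compute B = A^T A directly, with rows of A = [1, S, S^2, E]."""
    rows = [[1, s, s * s, e] for s, e in samples]
    return [[sum(r[i] * r[j] for r in rows) for j in range(4)] for i in range(4)]


samples = [(1, 2), (2, 0), (3, 1)]  # made-up (S_i, E_i) values
assert unpack_b(pack_b(samples)) == full_b(samples)
```

The rebuilt matrix matches the directly computed one because \(B\) is symmetric and \(\sum(S_i^2)\) appears both on the diagonal and in the first row and column.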

Caution

The architecture illustrated above is only an example design. In practice, multiple kernels, each with a different unroll number (UN), may be deployed. The number of kernels that can be instantiated in a design depends on the resources and size of the FPGA.