Implementation - 2023.2 English

If you go into the details of the hardware VM implementation, you may find that even the basic version is significantly different from the one in Oniguruma, let alone the performance-optimized one. Thus, this section is especially for developers who want to add more OPs to the VM themselves, or who are particularly interested in our design.

The first thing you need to conquer is the software compiler. Once you fully understand a specific OP in Oniguruma, you have to translate it into the corresponding instruction in the format accepted by the hardware VM. The 64-bit instruction format used for communication between the software compiler and the hardware VM is shown below:

Figure: Instruction Format
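
As a concrete illustration, the sketch below packs one OP into a single 64-bit word. The field names, widths, and bit positions here are assumptions for demonstration only; the authoritative layout is the one shown in the figure and implemented in L1/src/sw/xf_re_compile.c.

    // Illustrative sketch only: field names, widths and positions are assumed,
    // not the actual instruction layout of the hardware VM.
    #include <cstdint>

    static inline uint64_t pack_instruction(uint64_t opcode,   // assumed 8-bit OP code
                                            uint64_t flags,    // assumed 8-bit mode flags
                                            uint64_t operand,  // assumed 16-bit operand (e.g., a character)
                                            uint64_t address)  // assumed 32-bit absolute OP address
    {
        return ((opcode  & 0xFFULL)       << 56) |
               ((flags   & 0xFFULL)       << 48) |
               ((operand & 0xFFFFULL)     << 32) |
                (address & 0xFFFFFFFFULL);
    }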

Then, if the OP you want to add involves a jump/push operation on the OP address, the absolute address must be resolved in the first while-loop of the software compiler source, so that the calculated address can be placed into the instruction list later. The remaining information related to this OP and the calculated address should then be packed into one instruction in the second while-loop. With that, the software compiler part is done.

Location of the source of the software compiler: L1/src/sw/xf_re_compile.c
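
The two-pass structure can be sketched roughly as follows. This is not the actual code of xf_re_compile.c: the Op struct, the function name compile_ops, and the bit layout are hypothetical, but the split between address resolution (first loop) and instruction packing (second loop) mirrors the description above.

    // Conceptual two-pass sketch (not the actual code in xf_re_compile.c).
    #include <cstdint>
    #include <unordered_map>
    #include <vector>

    struct Op {                 // hypothetical in-memory form of one Oniguruma OP
        int      id;            // OP identifier
        int      type;          // OP type (jump, push, character match, ...)
        uint64_t operand;       // OP-specific payload
        int      target_id;     // id of the jump/push target, or -1 if none
    };

    std::vector<uint64_t> compile_ops(const std::vector<Op>& ops) {
        // Pass 1 (first while-loop): assign one 64-bit slot per OP and record
        // its absolute address so jump/push targets can be resolved.
        std::unordered_map<int, uint64_t> addr_of;
        uint64_t addr = 0;
        for (const Op& op : ops) addr_of[op.id] = addr++;

        // Pass 2 (second while-loop): pack the OP type, its operand and the
        // pre-calculated target address into one instruction.
        std::vector<uint64_t> inst;
        for (const Op& op : ops) {
            uint64_t target = (op.target_id >= 0) ? addr_of[op.target_id] : 0;
            inst.push_back(((op.type    & 0xFFULL)   << 56) |
                           ((op.operand & 0xFFFFULL) << 32) |
                            (target     & 0xFFFFFFFFULL));
        }
        return inst;
    }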

Finally, add the corresponding logic to the hardware VM based on your understanding of the OP and test it accordingly. Once the tests pass, you can start optimizing the implementation, which is extremely challenging and tricky.
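
A minimal sketch of where the new logic would sit in the VM's decode/execute step is shown below, assuming a simple switch on the opcode field. The opcode values (0x10, 0x20), the field positions, and the state variables are placeholders rather than the actual hardware VM code; use the VM source as the reference when wiring in a new OP.

    // Hypothetical decode/execute step of the hardware VM (HLS-style C++).
    #include <cstdint>

    void vm_step(const uint64_t* inst_list, uint64_t& pc, uint32_t& str_ptr) {
        uint64_t inst    = inst_list[pc];
        uint8_t  opcode  = inst >> 56;              // assumed opcode field
        uint32_t address = inst & 0xFFFFFFFFULL;    // assumed pre-calculated address

        switch (opcode) {
        case 0x10:             // placeholder: an existing jump-style OP
            pc = address;      //   use the address resolved by the software compiler
            break;
        case 0x20:             // placeholder: the OP you are adding
            str_ptr++;         //   e.g., consume one input character
            pc++;
            break;
        default:               // all other OPs simply fall through to the next one
            pc++;
            break;
        }
    }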

Let us introduce what we have done so far to optimize the hardware VM. Hopefully it will inspire you to some extent.

  1. Simplify the internal logic of each newly added OP as much as we can.
  2. Merge a newly added OP into another one if possible, so that they share the same logic.
  3. Offload runtime calculations to the software compiler for pre-calculation whenever possible to improve runtime performance.
  4. Separate the data flow from the control flow, and use pre-fetch and post-store operations to improve memory-access efficiency.
  5. Resolve the read-and-write dependency on on-chip RAMs by caching intermediate data in registers to avoid unnecessary accesses (see the sketch after this list).
  6. Execute a predicted (2nd) instruction in each iteration to accelerate processing under specific circumstances. (The performance-optimized version executes 2 instructions every 3 cycles.)
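
The sketch below illustrates item 5, assuming the offset buffer (or a similar structure) is an on-chip RAM updated in a pipelined loop: the most recent write is kept in registers, so a read in the next iteration does not stall on the RAM's read-after-write latency. The function name, array names, and the update itself are illustrative, not the actual VM code.

    // Conceptual register-forwarding sketch for an on-chip RAM update loop.
    #include <cstdint>

    #define DEPTH 1024

    void accumulate_offsets(uint32_t ram[DEPTH], const uint32_t addr_in[],
                            const uint32_t data_in[], int n) {
        uint32_t last_addr = 0xFFFFFFFFU;  // register copy of the most recent write
        uint32_t last_data = 0;

        for (int i = 0; i < n; i++) {
    #pragma HLS pipeline II=1
    // Safe only because the registers below forward the true dependency.
    #pragma HLS dependence variable=ram type=inter dependent=false
            uint32_t addr = addr_in[i];
            // Forward from the registers instead of re-reading the RAM when the
            // previous iteration wrote the same address (read-after-write hazard).
            uint32_t cur = (addr == last_addr) ? last_data : ram[addr];
            uint32_t nxt = cur + data_in[i];   // placeholder update
            ram[addr] = nxt;
            last_addr = addr;
            last_data = nxt;
        }
    }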

Note

For the following scenarios, the predicted (2nd) instruction will not be executed (a conceptual hazard check is sketched after this list):

  1. The 1st and 2nd instructions would read/write the internal stack simultaneously
  2. The OP for the 2nd instruction is any_char_star, pop_to_mark, or mem_start_push
  3. A jump on the OP address happens in the 1st instruction
  4. The 1st and 2nd instructions would read/write the offset buffer simultaneously
  5. The pointer for the input string moves in the 1st instruction and the 2nd instruction goes into an OP which needs character comparison
  6. The 1st and 2nd instructions would write the offset buffer simultaneously
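
These conditions can be summarized as a conceptual hazard check, sketched below. The DecodedInst fields are hypothetical decode flags, not the actual signals inside the hardware VM; the check simply restates the six scenarios above.

    // Conceptual dual-issue hazard check derived from the scenarios above.
    struct DecodedInst {
        bool touches_stack;       // reads or writes the internal stack
        bool touches_offset_buf;  // reads or writes the offset buffer
        bool writes_offset_buf;   // writes the offset buffer
        bool jumps;               // changes the OP address
        bool moves_str_ptr;       // advances the input-string pointer
        bool needs_char_cmp;      // OP compares a character from the input string
        bool is_serial_op;        // any_char_star, pop_to_mark or mem_start_push
    };

    bool can_execute_second(const DecodedInst& i1, const DecodedInst& i2) {
        if (i1.touches_stack && i2.touches_stack)           return false; // scenario 1
        if (i2.is_serial_op)                                return false; // scenario 2
        if (i1.jumps)                                       return false; // scenario 3
        if (i1.touches_offset_buf && i2.touches_offset_buf) return false; // scenario 4
        if (i1.moves_str_ptr && i2.needs_char_cmp)          return false; // scenario 5
        if (i1.writes_offset_buf && i2.writes_offset_buf)   return false; // scenario 6
        return true;
    }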