AIE-ML Memory Tile Overview and Features

Versal Adaptive SoC AIE-ML Architecture Manual (AM020)

Document ID: AM020
Release Date: 2023-11-10
Revision: 1.2 English

The AIE-ML architecture introduces the AIE-ML memory tile to significantly increase the on-chip memory inside the AIE-ML array. The memory tile reduces the use of PL resources (LUTs, block RAMs, and URAMs) in ML applications. It is similar to the AIE-ML tile but has no AIE-ML processor or program memory. The AIE-ML memory tile contains a high-density (512 KB), high-bandwidth memory and an integrated DMA that accesses local memory and neighboring memories. The memory tile has only vertical streaming interfaces (no cascade or horizontal streams). A subset of DMA channels can directly access memory in the nearest neighboring memory tiles to the east and west. The following figure shows the AIE-ML memory tile architecture.

Figure 1. AIE-ML Memory Tile Architecture

The memory tile has the following functional blocks. Each is either the same as or similar to the equivalent block in the AIE-ML tile:

  • Memory
  • DMA
  • Locks
  • AXI4-Stream switch
  • Memory-mapped AXI4 switch
  • Control, debug, and trace
  • Events and event broadcast

The following is a list of AIE-ML memory tile features:

  • Memory
    • 512 KB of ECC-protected memory arranged into 16 banks (each 128 bits wide and 2k words deep)
    • The memory banks in the AIE-ML memory tile are initialized to zero at boot and reset
    • Supports up to 30 GB/s read and 30 GB/s write in parallel per memory tile
  • DMA
    • Memory to stream DMA (MM2S) with six channels
      • 6 x 32-bit stream interfaces
      • 6 x 128-bit memory interfaces
      • 5D tensor address generation (including iteration-offset)
      • Supports compression and the insertion of zero padding into stream data
      • Access memory and locks in east/west neighboring tiles (channels 0–3)
      • Supports a task queue and task-complete-tokens; queue depth is four tasks per channel (see Task-Completion-Tokens for more information)
    • Stream to memory DMA (S2MM) with six channels
      • 6 x 32-bit stream interfaces
      • 6 x 128-bit memory interfaces
      • 5D tensor address generation (including iteration-offset)
      • Supports out-of-order packet transfer, finish-on-TLAST, and decompression
      • Access memory and locks in east/west neighboring tiles (channels 0–3)
      • Supports a task queue and task-complete-tokens; queue depth is four tasks per channel (see Task-Completion-Tokens for more information)
    • Buffer descriptors (BD)
      • 48 shared BDs
      • Each channel can access 24 BDs and each BD can be accessed by six channels
  • Stream Switch
    • Shares the same design as the AIE-ML tile stream switch, with 17 master and 18 slave ports
    • North and south ports, but no east or west streams
    • Trace and control ports
  • Lock Module
    • Contains 64 semaphore locks, each with a 6-bit unsigned state; accessible from the DMA channels of neighboring AIE-ML memory tiles
  • Additional control and status registers
    • Events, event actions, event broadcast, combo events
    • Task-complete-tokens logic (see Task-Completion-Tokens for more information)
  • Configuration/debug interconnect (memory-mapped AXI4)
    • 1 MB address space per tile
    • Write bandwidth improvement and stream control-packet support
  • Debug and Trace
    • Similar to that in the AIE-ML tile
    • Event trace stream; four performance counters and a 64-bit tile timer
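
Several figures in the feature list can be cross-checked with simple arithmetic. The following Python sketch is illustrative only and is not part of any AMD tool or API. All constant and function names are hypothetical. It assumes that "2k words" means 2048 words and that "512 KB" means 512 x 1024 bytes.

```python
# Consistency check of the memory tile figures quoted in the feature list.
# All names below are hypothetical; the values come from the list itself.

BANKS = 16
BANK_WIDTH_BITS = 128
BANK_DEPTH_WORDS = 2 * 1024      # "2k words", read here as 2048 (assumption)

MM2S_CHANNELS = 6
S2MM_CHANNELS = 6
SHARED_BDS = 48
BDS_PER_CHANNEL = 24
CHANNELS_PER_BD = 6

LOCK_STATE_BITS = 6


def memory_bytes():
    """Total capacity: banks x words per bank x bytes per word."""
    return BANKS * BANK_DEPTH_WORDS * (BANK_WIDTH_BITS // 8)


def bd_pairs_counted_by_channel():
    """(channel, BD) access pairs, counted from the channel side."""
    return (MM2S_CHANNELS + S2MM_CHANNELS) * BDS_PER_CHANNEL


def bd_pairs_counted_by_bd():
    """(channel, BD) access pairs, counted from the BD side."""
    return SHARED_BDS * CHANNELS_PER_BD


def max_lock_value():
    """Largest value that fits in an unsigned lock state of the given width."""
    return (1 << LOCK_STATE_BITS) - 1


if __name__ == "__main__":
    print("capacity (KiB):", memory_bytes() // 1024)
    print("BD access pairs:", bd_pairs_counted_by_channel(), bd_pairs_counted_by_bd())
    print("max lock value:", max_lock_value())
```

Under these assumptions the bank geometry gives exactly 512 KB, the two ways of counting channel-to-BD access agree (12 channels x 24 BDs equals 48 BDs x 6 channels), and a 6-bit unsigned lock state can hold values from 0 to 63.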