Image Sensor Processing multistream pipeline

The ISP multistream pipeline lets the user process input from multiple streams using a single ISP instance. The current multistream pipeline processes four streams in round-robin order, with input type XF_16UC1 and output type XF_8UC3 (RGB). After color conversion from the RGB to the YUV color space, the output type is XF_16UC1 (YUYV).

This ISP pipeline includes the following 19 modules:

Extract Exposure Frames: The Extract Exposure Frames module returns the Short Exposure Frame and Long Exposure Frame from the input frame using the Digital overlap parameter.
HDR Merge: The HDR Merge module generates a High Dynamic Range (HDR) image from a set of frames captured with different exposures. Image sensors usually have limited dynamic range, so it is difficult to obtain an HDR image from a single capture. The sensor therefore captures frames with different exposure times, and HDR Merge combines those exposure frames into one HDR frame.
HDR Decompand: This module decompands (decompresses) piecewise linear (PWL) companded data. Companding is performed in image sensors that cannot transmit data at a high bit width. The decompanding module supports Bayer raw data with 4 knee point PWL mapping, and equations are provided for 12-bit to 16-bit conversion.
RGBIR to Bayer (RGBIR): This module converts the input image with R, G, B, IR pixel data into a standard Bayer pattern image along with a full IR data image.
Auto Exposure Compensation (AEC): This module automatically attempts to correct the exposure level of the captured image and also improves the contrast of the image.
Black Level Correction (BLC): This module corrects the black and white levels of the overall image. An uncorrected black level whitens the dark regions of the image and causes a perceived loss of overall contrast.
Bad Pixel Correction (BPC): This module removes defective (bad) pixels that an image sensor produces because of manufacturing faults or variations in pixel voltage levels caused by temperature or exposure.
Degamma: This module linearizes the input from the sensor to facilitate ISP processing that operates in the linear domain.
Lens Shading Correction (LSC): This module corrects the darkening toward the edge of the image caused by camera lens limitations. This darkening effect is also known as vignetting.
Gain Control: This module improves the overall brightness of the image.
Demosaicing: This module reconstructs RGB pixels from the input Bayer image (RGGB, BGGR, GRBG, GBRG).
Auto White Balance (AWB): This module improves color balance of the image by using image statistics.
Color Correction Matrix (CCM): This module converts the input image color format to output image color format using the Color Correction Matrix provided by the user (CCM_TYPE).
Quantization & Dithering (QnD): This module is a tone-mapper that dithers the input image using the Floyd-Steinberg dithering method. This method is commonly used by image manipulation software; for example, when an image is converted to GIF format, each pixel intensity value is quantized to 8 bits, i.e. 256 colors.
Global Tone Mapping (GTM): This module is a tone-mapper that reduces the dynamic range from higher range to display range using tone mapping.
Local Tone Mapping (LTM): This module is a tone-mapper that takes pixel neighbor statistics into account and produces images with more contrast and brightness.
Gamma Correction: This module improves the overall brightness of the image.
3DLUT: The 3D LUT module operates on three independent parameters. This drastically increases the number of mapped indexes to value pairs. For example, a combination of 3 individual 1D LUTs can map 2^n * 3 values where n is the bit depth, whereas a 3D LUT processing 3 channels will have 2^n * 2^n * 2^n possible values.
Color Space Conversion (CSC): The CSC module converts an RGB image to a YUV422 (YUYV) image for HDMI display. RGB2YUYV computes a Y value for every pixel and U and V values for alternating pixels.
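
The RGB-to-YUYV step described for the CSC module can be illustrated with a small host-side sketch. This is not the library's implementation: the full-range BT.601 coefficients, the helper names `clamp_u8` and `rgb_row_to_yuyv`, and the packing (Y in the low byte, U or V in the high byte of each 16-bit word) are assumptions made for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct Rgb {
    uint8_t r, g, b;
};

// Round to nearest and clamp to the 8-bit range [0, 255].
static uint8_t clamp_u8(double v) {
    return static_cast<uint8_t>(std::min(255.0, std::max(0.0, v + 0.5)));
}

// Convert one row of RGB pixels to YUYV, one 16-bit word per pixel.
// Assumed packing: low byte = Y; high byte = U on even pixels, V on odd pixels.
// Assumed coefficients: full-range BT.601.
std::vector<uint16_t> rgb_row_to_yuyv(const std::vector<Rgb>& row) {
    assert(row.size() % 2 == 0); // chroma is stored per pixel pair
    std::vector<uint16_t> out(row.size());
    for (std::size_t x = 0; x < row.size(); ++x) {
        const Rgb& p = row[x];
        const uint8_t y = clamp_u8(0.299 * p.r + 0.587 * p.g + 0.114 * p.b);
        const uint8_t u = clamp_u8(-0.168736 * p.r - 0.331264 * p.g + 0.5 * p.b + 128.0);
        const uint8_t v = clamp_u8(0.5 * p.r - 0.418688 * p.g - 0.081312 * p.b + 128.0);
        const uint8_t chroma = (x % 2 == 0) ? u : v; // U and V alternate
        out[x] = static_cast<uint16_t>(y | (chroma << 8));
    }
    return out;
}
```

Every pixel keeps its own Y value while U and V are shared across a pixel pair, which is why a three-channel 8-bit image fits into a single-channel 16-bit (XF_16UC1) output.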

ISP multistream Diagram

Parameter Descriptions

Table 233 Runtime parameter
Parameter	Description
dcp_params_16to12	Parameters to convert the input image bit depth from 16-bit to 12-bit.
dcp_params_12to16	Parameters to convert the input image bit depth from 12-bit to 16-bit.
R_IR_C1_wgts	5x5 Weights to calculate R at IR location for constellation1.
R_IR_C2_wgts	5x5 Weights to calculate R at IR location for constellation2.
B_at_R_wgts	5x5 Weights to calculate B at R location.
IR_at_R_wgts	3x3 Weights to calculate IR at R location.
IR_at_B_wgts	3x3 Weights to calculate IR at B location.
sub_wgts	Weights to perform weighted subtraction of the IR image from the RGB image. sub_wgts[0] -> G pixel, sub_wgts[1] -> R pixel, sub_wgts[2] -> B pixel, sub_wgts[3] -> calculated B pixel.
wr_hls	Lookup table for weight values. The weights LUT is computed on the host and passed as an input to the function.
array_params	Per-stream parameters packed into one array for the multistream pipeline.
gamma_lut	Lookup table for gamma values. The first 256 values are R, the next 256 values are G, and the last 256 values are B.
dgam_params	Array containing the upper limit, slope, and intercept of the linear equations for the red, green, and blue channels.
c1	Controls how much detail is retained in bright areas during tone mapping.
c2	Efficiency factor, ranges from 0.5 to 1 based on output device dynamic range.
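
The gamma_lut layout described above (256 R values, then 256 G values, then 256 B values) can be filled on the host along these lines. This is a minimal sketch: the function name `make_gamma_lut` and the power-law curve are assumptions for illustration, and only the 768-entry layout comes from the table.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <vector>

// Build one stream's 768-entry gamma table:
// entries [0, 256) are R, [256, 512) are G, [512, 768) are B.
// The curve out = 255 * (in / 255)^(1 / gamma) is an assumed, common choice.
std::vector<uint8_t> make_gamma_lut(double gamma_r, double gamma_g, double gamma_b) {
    assert(gamma_r > 0 && gamma_g > 0 && gamma_b > 0);
    const double gammas[3] = {gamma_r, gamma_g, gamma_b};
    std::vector<uint8_t> lut(256 * 3);
    for (int c = 0; c < 3; ++c) {
        for (int i = 0; i < 256; ++i) {
            const double v = std::pow(i / 255.0, 1.0 / gammas[c]) * 255.0;
            lut[c * 256 + i] = static_cast<uint8_t>(std::lround(v));
        }
    }
    return lut;
}
```

One such table would be copied into each row of `gamma_lut[NUM_STREAMS][256 * 3]` before the buffer is written to the device.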

Table 234 Compile time parameter
Parameter	Description
XF_HEIGHT	Maximum height of input and output image.
XF_WIDTH	Maximum width of input and output image.
XF_SRC_T	Input pixel type. Supported pixel width is 16.
NUM_STREAMS	Total number of streams.
STRM1_ROWS	Maximum number of rows to be processed for stream 1 in one burst.
STRM2_ROWS	Maximum number of rows to be processed for stream 2 in one burst.
STRM3_ROWS	Maximum number of rows to be processed for stream 3 in one burst.
STRM4_ROWS	Maximum number of rows to be processed for stream 4 in one burst.
NUM_SLICES	Number of slices processed in each stream.
BLOCK_WIDTH	Maximum block width the image is divided into. This can be any positive integer greater than or equal to 32 and less than input image width.
BLOCK_HEIGHT	Maximum block height the image is divided into. This can be any positive integer greater than or equal to 32 and less than input image height.
XF_NPPC	Number of pixels processed per cycle.
NO_EXPS	Number of exposure frames to be merged in the module.
W_B_SIZE	W_B_SIZE is used to define the array size for storing the weight values for wr_hls. W_B_SIZE should be 2^bit depth.
FILTERSIZE1	Filter size for RGB pixels.
FILTERSIZE2	Filter size for IR pixels.
DGAMMA_KP	Configurable number of knee points in degamma.
SQLUTDIM	Squared value of maximum dimension of input LUT.
LUTDIM	33x33 dimension of input LUT.
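
The relationships between several of these compile-time parameters can be written down as checks. The sketch below uses hypothetical placeholder values, not the library's defaults, and C++ constants in place of the library's configuration macros; the block-size check is done against the maximum image size for simplicity.

```cpp
#include <cstddef>

// All values are hypothetical placeholders chosen for illustration.
constexpr int kBitDepth = 12;                // assumed input bit depth
constexpr int kWBSize = 1 << kBitDepth;      // W_B_SIZE: 2^bit depth
constexpr int kNoExps = 2;                   // NO_EXPS
constexpr int kNppc = 1;                     // XF_NPPC
constexpr int kLutDim = 33;                  // LUTDIM
constexpr int kSqLutDim = kLutDim * kLutDim; // SQLUTDIM: LUTDIM squared
constexpr int kWidth = 1920;                 // XF_WIDTH
constexpr int kHeight = 1080;                // XF_HEIGHT
constexpr int kBlockWidth = 32;              // BLOCK_WIDTH
constexpr int kBlockHeight = 32;             // BLOCK_HEIGHT

// Block sizes must be at least 32 and smaller than the image.
static_assert(kBlockWidth >= 32 && kBlockWidth < kWidth, "invalid BLOCK_WIDTH");
static_assert(kBlockHeight >= 32 && kBlockHeight < kHeight, "invalid BLOCK_HEIGHT");

// Per-stream length of wr_hls, following the NO_EXPS * XF_NPPC * W_B_SIZE
// dimension used in the kernel signature.
constexpr std::size_t wr_hls_entries(int no_exps, int nppc, int w_b_size) {
    return static_cast<std::size_t>(no_exps) * nppc * w_b_size;
}
```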

Table 235 Descriptions of array_params
Parameter	Description
rgain	To configure gain value for the red channel.
bgain	To configure gain value for the blue channel.
ggain	To configure gain value for the green channel.
pawb	Percentage of top and bottom pixels that are ignored while computing min and max, to improve quality.
bayer_p	The Bayer format of the RAW input image.
black_level	Black level value to adjust overall brightness of the image.
height	The number of rows in the image or height of the image.
width	The number of columns in the image or width of the image.
blk_height	Actual block height.
blk_width	Actual block width.
lut_dim	Dimension of input LUT.
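
One way to fill a row of array_params on the host is sketched below. The kernel code reads the height from index 6 and the width from index 7; the remaining indices are assumed to follow the order of the table above, and the `StreamConfig` struct and `pack_array_params` helper are hypothetical names used only for this illustration.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Indices 6 (height) and 7 (width) match the kernel code.
// The other indices are ASSUMED to follow the table order.
enum ArrayParamIndex : int {
    kRGain = 0, kBGain, kGGain, kPawb, kBayerPattern, kBlackLevel,
    kHeight, kWidth, kBlkHeight, kBlkWidth, kLutDim, kNumArrayParams
};
static_assert(kNumArrayParams == 11, "array_params has 11 entries per stream");

// Hypothetical host-side description of one stream.
struct StreamConfig {
    uint16_t rgain, bgain, ggain, pawb, bayer_p, black_level;
    uint16_t height, width, blk_height, blk_width, lut_dim;
};

// Pack one stream's settings into the flat layout the kernel expects.
std::array<uint16_t, kNumArrayParams> pack_array_params(const StreamConfig& cfg) {
    assert(cfg.height > 0 && cfg.width > 0);
    std::array<uint16_t, kNumArrayParams> row{};
    row[kRGain] = cfg.rgain;
    row[kBGain] = cfg.bgain;
    row[kGGain] = cfg.ggain;
    row[kPawb] = cfg.pawb;
    row[kBayerPattern] = cfg.bayer_p;
    row[kBlackLevel] = cfg.black_level;
    row[kHeight] = cfg.height;
    row[kWidth] = cfg.width;
    row[kBlkHeight] = cfg.blk_height;
    row[kBlkWidth] = cfg.blk_width;
    row[kLutDim] = cfg.lut_dim;
    return row;
}
```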

Table 236 Compile time flags
Parameter	Description
USE_HDR_FUSION	Flag to enable or disable HDR fusion module.
USE_GTM	Flag to enable or disable GTM module.
USE_LTM	Flag to enable or disable LTM module.
USE_QND	Flag to enable or disable QND module.
USE_RGBIR	Flag to enable or disable RGBIR module.
USE_3DLUT	Flag to enable or disable 3DLUT module.
USE_DEGAMMA	Flag to enable or disable Degamma module.
USE_AEC	Flag to enable or disable AEC module.

The following example demonstrates the top-level ISP pipeline:

void ISPPipeline_accel(ap_uint<INPUT_PTR_WIDTH>* img_inp1,
               ap_uint<INPUT_PTR_WIDTH>* img_inp2,
               ap_uint<INPUT_PTR_WIDTH>* img_inp3,
               ap_uint<INPUT_PTR_WIDTH>* img_inp4,
               ap_uint<OUTPUT_PTR_WIDTH>* img_out1,
               ap_uint<OUTPUT_PTR_WIDTH>* img_out2,
               ap_uint<OUTPUT_PTR_WIDTH>* img_out3,
               ap_uint<OUTPUT_PTR_WIDTH>* img_out4,
               ap_uint<OUTPUT_PTR_WIDTH>* img_out_ir1,
               ap_uint<OUTPUT_PTR_WIDTH>* img_out_ir2,
               ap_uint<OUTPUT_PTR_WIDTH>* img_out_ir3,
               ap_uint<OUTPUT_PTR_WIDTH>* img_out_ir4,
               short wr_hls[NUM_STREAMS][NO_EXPS * XF_NPPC * W_B_SIZE],
               int dcp_params_12to16[NUM_STREAMS][3][4][3],
               char R_IR_C1_wgts[NUM_STREAMS][25],
               char R_IR_C2_wgts[NUM_STREAMS][25],
               char B_at_R_wgts[NUM_STREAMS][25],
               char IR_at_R_wgts[NUM_STREAMS][9],
               char IR_at_B_wgts[NUM_STREAMS][9],
               char sub_wgts[NUM_STREAMS][4],
               ap_ufixed<32, 18> dgam_params[NUM_STREAMS][3][DGAMMA_KP][3],
               float c1[NUM_STREAMS],
               float c2[NUM_STREAMS],
               unsigned short array_params[NUM_STREAMS][11],
               unsigned char gamma_lut[NUM_STREAMS][256 * 3],
               ap_uint<LUT_PTR_WIDTH>* lut1,
               ap_uint<LUT_PTR_WIDTH>* lut2,
               ap_uint<LUT_PTR_WIDTH>* lut3,
               ap_uint<LUT_PTR_WIDTH>* lut4) {
// clang-format off
#pragma HLS INTERFACE m_axi     port=img_inp1             offset=slave bundle=gmem1
#pragma HLS INTERFACE m_axi     port=img_inp2             offset=slave bundle=gmem2
#pragma HLS INTERFACE m_axi     port=img_inp3             offset=slave bundle=gmem3
#pragma HLS INTERFACE m_axi     port=img_inp4             offset=slave bundle=gmem4
#pragma HLS INTERFACE m_axi     port=img_out1             offset=slave bundle=gmem5
#pragma HLS INTERFACE m_axi     port=img_out2             offset=slave bundle=gmem6
#pragma HLS INTERFACE m_axi     port=img_out3             offset=slave bundle=gmem7
#pragma HLS INTERFACE m_axi     port=img_out4             offset=slave bundle=gmem8

#pragma HLS INTERFACE m_axi     port=img_out_ir1          offset=slave bundle=gmem9
#pragma HLS INTERFACE m_axi     port=img_out_ir2          offset=slave bundle=gmem10
#pragma HLS INTERFACE m_axi     port=img_out_ir3          offset=slave bundle=gmem11
#pragma HLS INTERFACE m_axi     port=img_out_ir4          offset=slave bundle=gmem12
#pragma HLS INTERFACE m_axi     port=wr_hls               offset=slave bundle=gmem13
#pragma HLS INTERFACE m_axi     port=dcp_params_12to16    offset=slave bundle=gmem14
#pragma HLS INTERFACE m_axi     port=R_IR_C1_wgts         offset=slave bundle=gmem15
#pragma HLS INTERFACE m_axi     port=R_IR_C2_wgts         offset=slave bundle=gmem16
#pragma HLS INTERFACE m_axi     port=B_at_R_wgts          offset=slave bundle=gmem17
#pragma HLS INTERFACE m_axi     port=IR_at_R_wgts         offset=slave bundle=gmem18
#pragma HLS INTERFACE m_axi     port=IR_at_B_wgts         offset=slave bundle=gmem19
#pragma HLS INTERFACE m_axi     port=sub_wgts             offset=slave bundle=gmem20
#pragma HLS INTERFACE m_axi     port=dgam_params          offset=slave bundle=gmem21
#pragma HLS INTERFACE m_axi     port=c1                   offset=slave bundle=gmem22
#pragma HLS INTERFACE m_axi     port=c2                   offset=slave bundle=gmem23
#pragma HLS INTERFACE m_axi     port=array_params         offset=slave bundle=gmem24
#pragma HLS INTERFACE m_axi     port=gamma_lut            offset=slave bundle=gmem25
#pragma HLS INTERFACE m_axi     port=lut1                 offset=slave bundle=gmem26
#pragma HLS INTERFACE m_axi     port=lut2                 offset=slave bundle=gmem27
#pragma HLS INTERFACE m_axi     port=lut3                 offset=slave bundle=gmem28
#pragma HLS INTERFACE m_axi     port=lut4                 offset=slave bundle=gmem29
   // clang-format on

   struct ispparams_config params[NUM_STREAMS];

   uint32_t tot_rows = 0;
   int rem_rows[NUM_STREAMS];
   static short wr_hls_tmp[NUM_STREAMS][NO_EXPS * XF_NPPC * W_B_SIZE];
   static unsigned char gamma_lut_tmp[NUM_STREAMS][256 * 3];
   static float c1_tmp[NUM_STREAMS], c2_tmp[NUM_STREAMS];
   static ap_ufixed<32, 18> dgam_params_tmp[NUM_STREAMS][3][DGAMMA_KP][3];
   static int dcp_params_12to16_tmp[NUM_STREAMS][3][4][3];
   static char R_IR_C1_wgts_tmp[NUM_STREAMS][25], R_IR_C2_wgts_tmp[NUM_STREAMS][25],
               B_at_R_wgts_tmp[NUM_STREAMS][25], IR_at_R_wgts_tmp[NUM_STREAMS][9],
               IR_at_B_wgts_tmp[NUM_STREAMS][9], sub_wgts_tmp[NUM_STREAMS][4];

   unsigned short height_arr[NUM_STREAMS], width_arr[NUM_STREAMS];
   constexpr int dg_parms_c1 = 3;
   constexpr int dg_parms_c2 = 3;
   constexpr int dcp_parms1 = 3;
   constexpr int dcp_parms2 = 4;
   constexpr int dcp_parms3 = 3;
DEGAMMA_PARAMS_LOOP:
   for (int n = 0; n < NUM_STREAMS; n++) {
// clang-format off
#pragma HLS LOOP_TRIPCOUNT min=NUM_STREAMS max=NUM_STREAMS
      // clang-format on

      for (int i = 0; i < dg_parms_c1; i++) {
// clang-format off
#pragma HLS LOOP_TRIPCOUNT min=dg_parms_c1 max=dg_parms_c1
        // clang-format on
        for(int j=0; j<DGAMMA_KP; j++){
// clang-format off
#pragma HLS LOOP_TRIPCOUNT min=DGAMMA_KP max=DGAMMA_KP
          // clang-format on
          for(int k=0; k<dg_parms_c2; k++){
// clang-format off
#pragma HLS LOOP_TRIPCOUNT min=dg_parms_c2 max=dg_parms_c2
            // clang-format on
            dgam_params_tmp[n][i][j][k] = dgam_params[n][i][j][k];
            }
         }
        }
    }

DECOMPAND_PARAMS_LOOP:
   for(int n=0; n<NUM_STREAMS; n++){
// clang-format off
#pragma HLS LOOP_TRIPCOUNT min=NUM_STREAMS max=NUM_STREAMS
      // clang-format on

      for (int i = 0; i < dcp_parms1; i++) {
// clang-format off
#pragma HLS LOOP_TRIPCOUNT min=dcp_parms1 max=dcp_parms1
        // clang-format on
        for(int j=0; j<dcp_parms2; j++){
// clang-format off
#pragma HLS LOOP_TRIPCOUNT min=dcp_parms2 max=dcp_parms2
          // clang-format on
          for(int k=0; k<dcp_parms3; k++){
// clang-format off
#pragma HLS LOOP_TRIPCOUNT min=dcp_parms3 max=dcp_parms3
            // clang-format on
            dcp_params_12to16_tmp[n][i][j][k] = dcp_params_12to16[n][i][j][k];
          }
        }
      }
   }


C1_C2_INIT_LOOP:
   for(int i=0; i < NUM_STREAMS; i++){
// clang-format off
#pragma HLS LOOP_TRIPCOUNT min=NUM_STREAMS max=NUM_STREAMS
      // clang-format on
      c1_tmp[i]=c1[i];
      c2_tmp[i]=c2[i];

}
   constexpr int R_B_count=25, IR_count=9, sub_count=4;

RGBIR_INIT_LOOP_1:
   for(int n=0; n < NUM_STREAMS; n++){

   // clang-format off
   #pragma HLS LOOP_TRIPCOUNT min=NUM_STREAMS max=NUM_STREAMS
      // clang-format on

      for (int i = 0; i < R_B_count; i++) {
// clang-format off
#pragma HLS LOOP_TRIPCOUNT min=R_B_count max=R_B_count
      // clang-format on

      R_IR_C1_wgts_tmp[n][i] = R_IR_C1_wgts[n][i];
      R_IR_C2_wgts_tmp[n][i] = R_IR_C2_wgts[n][i];
      B_at_R_wgts_tmp[n][i]  = B_at_R_wgts[n][i];
      }
   }

RGBIR_INIT_LOOP_2:
  for(int n=0; n < NUM_STREAMS; n++){

// clang-format off
#pragma HLS LOOP_TRIPCOUNT min=NUM_STREAMS max=NUM_STREAMS
   // clang-format on

    for (int i = 0; i < IR_count; i++) {
// clang-format off
#pragma HLS LOOP_TRIPCOUNT min=IR_count max=IR_count
      // clang-format on

      IR_at_R_wgts_tmp[n][i] = IR_at_R_wgts[n][i];
      IR_at_B_wgts_tmp[n][i] = IR_at_B_wgts[n][i];
    }
  }

RGBIR_INIT_LOOP_3:
   for(int n=0; n < NUM_STREAMS; n++){

// clang-format off
#pragma HLS LOOP_TRIPCOUNT min=NUM_STREAMS max=NUM_STREAMS
      // clang-format on

      for (int i = 0; i < sub_count; i++) {
// clang-format off
#pragma HLS LOOP_TRIPCOUNT min=sub_count max=sub_count
        // clang-format on

        sub_wgts_tmp[n][i] = sub_wgts[n][i];
      }
   }

ARRAY_PARAMS_LOOP:
   for (int i = 0; i < NUM_STREAMS; i++) {
// clang-format off
#pragma HLS LOOP_TRIPCOUNT min=1 max=NUM_STREAMS
      // clang-format on

      height_arr[i] = array_params[i][6];
      width_arr[i] = array_params[i][7];
      height_arr[i] = height_arr[i] * RD_MULT;
      tot_rows = tot_rows + height_arr[i];
      rem_rows[i] = height_arr[i];
   }
   constexpr int glut_TC = 256 * 3;

GAMMA_LUT_LOOP:
   for (int n = 0; n < NUM_STREAMS; n++) {
// clang-format off
#pragma HLS LOOP_TRIPCOUNT min=NUM_STREAMS max=NUM_STREAMS
      // clang-format on
      for(int i=0; i < glut_TC; i++){
// clang-format off
#pragma HLS LOOP_TRIPCOUNT min=glut_TC max=glut_TC
        // clang-format on

        gamma_lut_tmp[n][i] = gamma_lut[n][i];

      }
   }

WR_HLS_INIT_LOOP:
   for(int n =0; n < NUM_STREAMS; n++) {
// clang-format off
#pragma HLS LOOP_TRIPCOUNT min=NUM_STREAMS max=NUM_STREAMS
   // clang-format on
      for (int k = 0; k < XF_NPPC; k++) {
// clang-format off
#pragma HLS LOOP_TRIPCOUNT min=XF_NPPC max=XF_NPPC
        // clang-format on
        for (int i = 0; i < NO_EXPS; i++) {
// clang-format off
#pragma HLS LOOP_TRIPCOUNT min=NO_EXPS max=NO_EXPS
          // clang-format on
          for (int j = 0; j < (W_B_SIZE); j++) {
// clang-format off
#pragma HLS LOOP_TRIPCOUNT min=W_B_SIZE max=W_B_SIZE
            // clang-format on
            wr_hls_tmp[n][(i + k * NO_EXPS) * W_B_SIZE + j] = wr_hls[n][(i + k * NO_EXPS) * W_B_SIZE + j];
          }
        }
      }
   }

   const uint16_t pt[NUM_STREAMS] = {STRM1_ROWS, STRM2_ROWS, STRM3_ROWS, STRM4_ROWS};
   uint16_t max = STRM1_ROWS;
   for (int i = 1; i < NUM_STREAMS; i++) {
      if (pt[i] > max) max = pt[i];
   }

   const uint16_t TC = tot_rows / max;
   uint32_t addrbound, wr_addrbound, num_rows;

   int strm_id = 0, stream_idx = 0, slice_idx = 0;
   bool eof_awb[NUM_STREAMS] = {0};
   bool eof_tm[NUM_STREAMS] = {0};
   bool eof_aec[NUM_STREAMS] = {0};

   uint32_t rd_offset1 = 0, rd_offset2 = 0, rd_offset3 = 0, rd_offset4 = 0;
   uint32_t wr_offset1 = 0, wr_offset2 = 0, wr_offset3 = 0, wr_offset4 = 0;

TOTAL_ROWS_LOOP:
   for (int r = 0; r < tot_rows;) {
// clang-format off
#pragma HLS LOOP_TRIPCOUNT min=(XF_HEIGHT/STRM_HEIGHT)*NUM_STREAMS max=(XF_HEIGHT/STRM_HEIGHT)*NUM_STREAMS
      // clang-format on

// Compute no.of rows to process
     if (rem_rows[stream_idx] / RD_MULT > pt[stream_idx]) { // Check number for remaining rows of 1 interleaved image
       num_rows = pt[stream_idx];
       eof_awb[stream_idx] = 0; // 1 interleaved image/stream is not done
       eof_tm[stream_idx] = 0;
       eof_aec[stream_idx] = 0;
     } else {
       num_rows = rem_rows[stream_idx] / RD_MULT;
       eof_awb[stream_idx] = 1; // 1 interleaved image/stream done
       eof_tm[stream_idx] = 1;
       eof_aec[stream_idx] = 1;
     }

     strm_id = stream_idx;

     if (stream_idx == 0 && num_rows > 0) {
     Streampipeline(img_inp1 + rd_offset1, img_out1 + wr_offset1, img_out_ir1 + wr_offset1, lut1, num_rows,
                   height_arr[stream_idx], width_arr[stream_idx], STRM1_ROWS, dgam_params_tmp, hist0_awb,
                   hist1_awb, igain_0, igain_1, flag_awb, eof_awb, array_params, gamma_lut_tmp, wr_hls_tmp,
                   R_IR_C1_wgts_tmp, R_IR_C2_wgts_tmp, B_at_R_wgts_tmp, IR_at_R_wgts_tmp, IR_at_B_wgts_tmp,
                   sub_wgts_tmp, dcp_params_12to16_tmp, hist0_aec, hist1_aec, flag_aec, eof_aec, omin_r, omax_r,
                   omin_w, omax_w, mean1, mean2, L_max1, L_max2, L_min1, L_min2, c1_tmp, c2_tmp, flag_tm,
                   eof_tm, stream_idx, slice_idx);
     rd_offset1 += (RD_MULT * num_rows * ((width_arr[stream_idx] + RD_ADD) >> XF_BITSHIFT(XF_NPPC))) / 4;
     wr_offset1 += (num_rows * (width_arr[stream_idx] >> XF_BITSHIFT(XF_NPPC))) / 4;

     } else if (stream_idx == 1 && num_rows > 0) {
     Streampipeline(img_inp2 + rd_offset2, img_out2 + wr_offset2, img_out_ir2 + wr_offset2, lut2, num_rows,
                   height_arr[stream_idx], width_arr[stream_idx], STRM2_ROWS, dgam_params_tmp, hist0_awb,
                   hist1_awb, igain_0, igain_1, flag_awb, eof_awb, array_params, gamma_lut_tmp, wr_hls_tmp,
                   R_IR_C1_wgts_tmp, R_IR_C2_wgts_tmp, B_at_R_wgts_tmp, IR_at_R_wgts_tmp, IR_at_B_wgts_tmp,
                   sub_wgts_tmp, dcp_params_12to16_tmp, hist0_aec, hist1_aec, flag_aec, eof_aec, omin_r, omax_r,
                   omin_w, omax_w, mean1, mean2, L_max1, L_max2, L_min1, L_min2, c1_tmp, c2_tmp, flag_tm,
                   eof_tm, stream_idx, slice_idx);

     rd_offset2 += (RD_MULT * num_rows * ((width_arr[stream_idx] + RD_ADD) >> XF_BITSHIFT(XF_NPPC))) / 4;
     wr_offset2 += (num_rows * (width_arr[stream_idx] >> XF_BITSHIFT(XF_NPPC))) / 4;

     } else if (stream_idx == 2 && num_rows > 0) {
     Streampipeline(img_inp3 + rd_offset3, img_out3 + wr_offset3, img_out_ir3 + wr_offset3, lut3, num_rows,
                   height_arr[stream_idx], width_arr[stream_idx], STRM3_ROWS, dgam_params_tmp, hist0_awb,
                   hist1_awb, igain_0, igain_1, flag_awb, eof_awb, array_params, gamma_lut_tmp, wr_hls_tmp,
                   R_IR_C1_wgts_tmp, R_IR_C2_wgts_tmp, B_at_R_wgts_tmp, IR_at_R_wgts_tmp, IR_at_B_wgts_tmp,
                   sub_wgts_tmp, dcp_params_12to16_tmp, hist0_aec, hist1_aec, flag_aec, eof_aec, omin_r, omax_r,
                   omin_w, omax_w, mean1, mean2, L_max1, L_max2, L_min1, L_min2, c1_tmp, c2_tmp, flag_tm,
                   eof_tm, stream_idx, slice_idx);
     rd_offset3 += (RD_MULT * num_rows * ((width_arr[stream_idx] + RD_ADD) >> XF_BITSHIFT(XF_NPPC))) / 4;
     wr_offset3 += (num_rows * (width_arr[stream_idx] >> XF_BITSHIFT(XF_NPPC))) / 4;

     } else if (stream_idx == 3 && num_rows > 0) {
     Streampipeline(img_inp4 + rd_offset4, img_out4 + wr_offset4, img_out_ir4 + wr_offset4, lut4, num_rows,
                   height_arr[stream_idx], width_arr[stream_idx], STRM4_ROWS, dgam_params_tmp, hist0_awb,
                   hist1_awb, igain_0, igain_1, flag_awb, eof_awb, array_params, gamma_lut_tmp, wr_hls_tmp,
                   R_IR_C1_wgts_tmp, R_IR_C2_wgts_tmp, B_at_R_wgts_tmp, IR_at_R_wgts_tmp, IR_at_B_wgts_tmp,
                   sub_wgts_tmp, dcp_params_12to16_tmp, hist0_aec, hist1_aec, flag_aec, eof_aec, omin_r, omax_r,
                   omin_w, omax_w, mean1, mean2, L_max1, L_max2, L_min1, L_min2, c1_tmp, c2_tmp, flag_tm,
                   eof_tm, stream_idx, slice_idx);

     rd_offset4 += (RD_MULT * num_rows * ((width_arr[stream_idx] + RD_ADD) >> XF_BITSHIFT(XF_NPPC))) / 4;
     wr_offset4 += (num_rows * (width_arr[stream_idx] >> XF_BITSHIFT(XF_NPPC))) / 4;
     }
     // Update remaining rows to process
     rem_rows[stream_idx] = rem_rows[stream_idx] - num_rows * RD_MULT;

     // Next stream selection
     if (stream_idx == NUM_STREAMS - 1) {
       stream_idx = 0;
       slice_idx++;

     } else {
       stream_idx++;
     }

     // Update total rows to process
     r += num_rows * RD_MULT;
   } // TOTAL_ROWS_LOOP

 return;

}
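
The scheduling performed by TOTAL_ROWS_LOOP can be easier to follow as a standalone host-side simulation. The sketch below reproduces only the round-robin slice selection, under the assumption that RD_MULT is 1 (no interleaved exposure frames); the `Slice` struct and `round_robin_schedule` function are hypothetical names, not part of the library.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// One call to the per-stream pipeline: which stream, which pass, how many rows.
struct Slice {
    int stream;
    int slice;
    uint32_t rows;
};

// Simulate the order in which row slices would be processed.
// heights[i] is the image height of stream i; burst_rows[i] plays the role
// of STRMi_ROWS, the maximum number of rows processed in one burst.
std::vector<Slice> round_robin_schedule(const std::vector<uint32_t>& heights,
                                        const std::vector<uint32_t>& burst_rows) {
    assert(!heights.empty() && heights.size() == burst_rows.size());
    for (uint32_t b : burst_rows) assert(b > 0); // a zero burst makes no progress

    std::vector<uint32_t> rem(heights); // rows still to process per stream
    uint32_t total = 0;
    for (uint32_t h : heights) total += h;

    std::vector<Slice> order;
    const int last = static_cast<int>(heights.size()) - 1;
    int stream = 0, slice = 0;
    for (uint32_t done = 0; done < total;) {
        // Take a full burst, or whatever is left of this stream's frame.
        const uint32_t rows = (rem[stream] > burst_rows[stream]) ? burst_rows[stream] : rem[stream];
        if (rows > 0) order.push_back({stream, slice, rows}); // finished streams are skipped
        rem[stream] -= rows;
        done += rows;
        // Move to the next stream; a new pass starts after the last stream.
        if (stream == last) {
            stream = 0;
            ++slice;
        } else {
            ++stream;
        }
    }
    return order;
}
```

For two streams of 5 and 3 rows with a burst of 2 rows each, the order is stream 0 (2 rows), stream 1 (2 rows), stream 0 (2 rows), stream 1 (1 row), stream 0 (1 row): streams with fewer rows finish early and are skipped while the others drain.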

Create and Launch kernel in the testbench:

The histogram needs two frames to populate and to produce a correct auto-exposure result. Auto white balance, GTM, and the other tone-mapping functions each need one extra frame to compute their parameters and then apply them to obtain a correct image. The example below runs four iterations because the AEC, AWB, and LTM modules are selected.

// Create a kernel:
OCL_CHECK(err, cl::Kernel kernel(program, "ISPPipeline_accel", &err));

for (int i = 0; i < 4; i++) {

  OCL_CHECK(err, q.enqueueWriteBuffer(buffer_inVec_Weights,  // buffer on the FPGA
                                    CL_TRUE,                 // blocking call
                                    0,                       // buffer offset in bytes
                                    vec_weight_size_bytes,   // Size in bytes
                                    wr_hls));

  OCL_CHECK(err, q.enqueueWriteBuffer(buffer_decompand_params,  // buffer on the FPGA
                                    CL_TRUE,                  // blocking call
                                    0,                        // buffer offset in bytes
                                    dcp_params_in_size_bytes, // Size in bytes
                                    dcp_params_12to16));

   OCL_CHECK(err, q.enqueueWriteBuffer(buffer_R_IR_C1,        // buffer on the FPGA
                                    CL_TRUE,               // blocking call
                                    0,                     // buffer offset in bytes
                                    filter1_in_size_bytes, // Size in bytes
                                    R_IR_C1_wgts));
  OCL_CHECK(err, q.enqueueWriteBuffer(buffer_R_IR_C2,        // buffer on the FPGA
                                    CL_TRUE,               // blocking call
                                    0,                     // buffer offset in bytes
                                    filter1_in_size_bytes, // Size in bytes
                                    R_IR_C2_wgts));

  OCL_CHECK(err, q.enqueueWriteBuffer(buffer_B_at_R,         // buffer on the FPGA
                                    CL_TRUE,               // blocking call
                                    0,                     // buffer offset in bytes
                                    filter1_in_size_bytes, // Size in bytes
                                    B_at_R_wgts));

  OCL_CHECK(err, q.enqueueWriteBuffer(buffer_IR_at_R,        // buffer on the FPGA
                                    CL_TRUE,               // blocking call
                                    0,                     // buffer offset in bytes
                                    filter2_in_size_bytes, // Size in bytes
                                    IR_at_R_wgts));

  OCL_CHECK(err, q.enqueueWriteBuffer(buffer_IR_at_B,        // buffer on the FPGA
                                    CL_TRUE,               // blocking call
                                    0,                     // buffer offset in bytes
                                    filter2_in_size_bytes, // Size in bytes
                                    IR_at_B_wgts));

  OCL_CHECK(err, q.enqueueWriteBuffer(buffer_sub_wgts,        // buffer on the FPGA
                                    CL_TRUE,                // blocking call
                                    0,                      // buffer offset in bytes
                                    sub_wgts_in_size_bytes, // Size in bytes
                                    sub_wgts));
  OCL_CHECK(err, q.enqueueWriteBuffer(buffer_dgam_params,        // buffer on the FPGA
                                    CL_TRUE,                   // blocking call
                                    0,                         // buffer offset in bytes
                                    dgam_params_in_size_bytes, // Size in bytes
                                    dgam_params));

  OCL_CHECK(err, q.enqueueWriteBuffer(buffer_c1,     // buffer on the FPGA
                                    CL_TRUE,       // blocking call
                                    0,             // buffer offset in bytes
                                    c1_size_bytes, // Size in bytes
                                    c1));
  OCL_CHECK(err, q.enqueueWriteBuffer(buffer_c2,     // buffer on the FPGA
                                    CL_TRUE,       // blocking call
                                    0,             // buffer offset in bytes
                                    c2_size_bytes, // Size in bytes
                                    c2));

  OCL_CHECK(err, q.enqueueWriteBuffer(buffer_array,     // buffer on the FPGA
                                    CL_TRUE,            // blocking call
                                    0,                  // buffer offset in bytes
                                    array_size_bytes,   // Size in bytes
                                    array_params));

  OCL_CHECK(err, q.enqueueWriteBuffer(buffer_inVec,      // buffer on the FPGA
                                    CL_TRUE,             // blocking call
                                    0,                   // buffer offset in bytes
                                    vec_in_size_bytes,   // Size in bytes
                                    gamma_lut));

  OCL_CHECK(err, q.enqueueWriteBuffer(buffer_inLut1,     // buffer on the FPGA
                                    CL_TRUE,           // blocking call
                                    0,                 // buffer offset in bytes
                                    lut_in_size_bytes, // Size in bytes
                                    casted_lut1,       // Pointer to the data to copy
                                    nullptr));

  OCL_CHECK(err, q.enqueueWriteBuffer(buffer_inLut2,     // buffer on the FPGA
                                    CL_TRUE,           // blocking call
                                    0,                 // buffer offset in bytes
                                    lut_in_size_bytes, // Size in bytes
                                    casted_lut2,       // Pointer to the data to copy
                                    nullptr));

  OCL_CHECK(err, q.enqueueWriteBuffer(buffer_inLut3,     // buffer on the FPGA
                                    CL_TRUE,           // blocking call
                                    0,                 // buffer offset in bytes
                                    lut_in_size_bytes, // Size in bytes
                                    casted_lut3,       // Pointer to the data to copy
                                    nullptr));

  OCL_CHECK(err, q.enqueueWriteBuffer(buffer_inLut4,     // buffer on the FPGA
                                    CL_TRUE,           // blocking call
                                    0,                 // buffer offset in bytes
                                    lut_in_size_bytes, // Size in bytes
                                    casted_lut4,       // Pointer to the data to copy
                                    nullptr));


  if(HDR_FUSION) {
    OCL_CHECK(err, q.enqueueWriteBuffer(buffer_inImage1, CL_TRUE, 0, image_in_size_bytes, interleaved_img1.data));
    OCL_CHECK(err, q.enqueueWriteBuffer(buffer_inImage2, CL_TRUE, 0, image_in_size_bytes, interleaved_img2.data));
    OCL_CHECK(err, q.enqueueWriteBuffer(buffer_inImage3, CL_TRUE, 0, image_in_size_bytes, interleaved_img3.data));
    OCL_CHECK(err, q.enqueueWriteBuffer(buffer_inImage4, CL_TRUE, 0, image_in_size_bytes, interleaved_img4.data));

  }
  else {
    OCL_CHECK(err, q.enqueueWriteBuffer(buffer_inImage1, CL_TRUE, 0, image_in_size_bytes, out_img1_12bit.data));
    OCL_CHECK(err, q.enqueueWriteBuffer(buffer_inImage2, CL_TRUE, 0, image_in_size_bytes, out_img1_12bit.data));
    OCL_CHECK(err, q.enqueueWriteBuffer(buffer_inImage3, CL_TRUE, 0, image_in_size_bytes, out_img1_12bit.data));
    OCL_CHECK(err, q.enqueueWriteBuffer(buffer_inImage4, CL_TRUE, 0, image_in_size_bytes, out_img1_12bit.data));
  }

  // Profiling Objects
  cl_ulong start = 0;
  cl_ulong end = 0;
  double diff_prof = 0.0f;
  cl::Event event_sp;

  // Launch the kernel
  OCL_CHECK(err, err = q.enqueueTask(kernel, NULL, &event_sp));
  clWaitForEvents(1, (const cl_event*)&event_sp);

  event_sp.getProfilingInfo(CL_PROFILING_COMMAND_START, &start);
  event_sp.getProfilingInfo(CL_PROFILING_COMMAND_END, &end);
  diff_prof = end - start;
  std::cout << (diff_prof / 1000000) << "ms" << std::endl;
  // Copying Device result data to Host memory
  q.enqueueReadBuffer(buffer_outImage1, CL_TRUE, 0, image_out_size_bytes, out_img1.data);
  q.enqueueReadBuffer(buffer_outImage2, CL_TRUE, 0, image_out_size_bytes, out_img2.data);
  q.enqueueReadBuffer(buffer_outImage3, CL_TRUE, 0, image_out_size_bytes, out_img3.data);
  q.enqueueReadBuffer(buffer_outImage4, CL_TRUE, 0, image_out_size_bytes, out_img4.data);

  if (USE_RGBIR) {
    q.enqueueReadBuffer(buffer_IRoutImage1, CL_TRUE, 0, image_out_ir_size_bytes, out_img_ir1.data);
    q.enqueueReadBuffer(buffer_IRoutImage2, CL_TRUE, 0, image_out_ir_size_bytes, out_img_ir2.data);
    q.enqueueReadBuffer(buffer_IRoutImage3, CL_TRUE, 0, image_out_ir_size_bytes, out_img_ir3.data);
    q.enqueueReadBuffer(buffer_IRoutImage4, CL_TRUE, 0, image_out_ir_size_bytes, out_img_ir4.data);
  }
}
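
The reason the testbench loops over several iterations is that statistics measured on one frame are applied to the next. The toy model below illustrates that one-frame delay; it is not the library's AEC, AWB, or tone-mapping code, and the `DelayedGainStage` class is a hypothetical name used only here.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Toy stage: scales a frame toward a target mean, but only with the gain
// measured on the PREVIOUS frame, mimicking a one-frame statistics delay.
class DelayedGainStage {
  public:
    explicit DelayedGainStage(double target_mean) : target_mean_(target_mean) {
        assert(target_mean > 0);
    }

    std::vector<double> process(const std::vector<double>& frame) {
        assert(!frame.empty());
        std::vector<double> out(frame.size());
        for (std::size_t i = 0; i < frame.size(); ++i) {
            out[i] = frame[i] * gain_; // gain comes from the previous call
        }
        const double mean = std::accumulate(frame.begin(), frame.end(), 0.0) / frame.size();
        if (mean > 0) gain_ = target_mean_ / mean; // statistics for the next frame
        return out;
    }

  private:
    double target_mean_;
    double gain_ = 1.0; // default gain used for the very first frame
};
```

Feeding the same frame twice shows the effect: the first output is unchanged because no statistics exist yet, and only the second output is corrected. Chaining several such stages is why more than two iterations can be needed before the output settles.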

Resource Utilization

The following table summarizes the resource utilization of the ISP multistream pipeline, generated using the Vitis HLS 2023.1 tool for the ZCU102 board.

Table 237 ISP multistream Resource Utilization Summary
Operating Mode	Operating Frequency (MHz)	BRAM	DSP	CLB Registers	CLB LUT
1 Pixel	150	209.5	325	60142	63718

Performance Estimate

The following table summarizes the performance of the ISP multistream pipeline in 1-pixel mode, generated using the Vitis HLS 2023.1 tool for the ZCU102 board.

The estimated average latency is obtained by running the accel for 4 iterations. The input to the accel is a 12-bit non-linearized full-HD (1920x1080) image.

Table 238 ISP multistream Performance Estimate Summary
Operating Mode	Average Latency Estimate (ms)
1 pixel operation (150 MHz)	62.742
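
As a rough sanity check, the reported latency can be converted into clock cycles and cycles per pixel. The sketch below assumes that the 62.742 ms figure covers one pass over all four full-HD streams, which the table does not state explicitly; the helper names are hypothetical.

```cpp
#include <cassert>

// Convert a latency in milliseconds to clock cycles at the given frequency.
double latency_to_cycles(double latency_ms, double clock_mhz) {
    assert(latency_ms >= 0 && clock_mhz > 0);
    return latency_ms * clock_mhz * 1000.0; // ms * MHz = thousands of cycles
}

// Average cycles spent per input pixel across all streams.
double cycles_per_pixel(double cycles, int streams, int width, int height) {
    assert(streams > 0 && width > 0 && height > 0);
    return cycles / (static_cast<double>(streams) * width * height);
}
```

With 62.742 ms at 150 MHz this gives about 9.41 million cycles, or roughly 1.13 cycles per pixel for four 1920x1080 frames, which is in line with processing one pixel per cycle plus some overhead.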
