To balance the latency by adding pipeline stages, add the stage to the control path and not the data path. The data path includes wider buses, which increases the number of flip-flop and register resources used.
For example, if you have a 128-bit data path, 2 stages of registers, and a requirement of 5 cycles of latency, inserting 3 register stages results in an extra 3 x 128 = 384 flip-flops. Alternatively, you can use registers to control logic to enable the data path. Use 5 stages of single-bit registers to control the enable signal of datapath flip-flops and multicycle path timing exceptions accordingly.