One good approach could be to base the architecture on the TPU v1 described in [1]. There are also open-source accelerators you could draw inspiration from, for example [2][3]. If you want to avoid hand-coding the RTL yourself, you could look into tools that automatically map OpenCL kernels to an FPGA accelerator architecture, or use a project like [3], which provides pre-designed architectures for multiple FPGAs.
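The heart of the TPU v1 design in [1] is a weight-stationary systolic array: weights sit in a grid of multiply-accumulate cells, activations stream in from the left (skewed by one cycle per row), and partial sums flow downward and exit the bottom edge. Before committing to RTL, it can help to model the dataflow in software. Below is a minimal cycle-level sketch in Python (my own illustration, not code from any of the linked projects); the function name and register layout are just assumptions for the example.

```python
import numpy as np

def systolic_matmul(X, W):
    """Cycle-level model of a weight-stationary systolic array computing X @ W.

    W (K x N) is pre-loaded, one weight per processing element (PE).
    Rows of X (M x K) stream in from the left edge, skewed by one cycle
    per array row; partial sums flow downward and exit the bottom edge.
    """
    M, K = X.shape
    K2, N = W.shape
    assert K == K2
    a_reg = np.zeros((K, N))  # activation held in PE (k, j) this cycle
    p_reg = np.zeros((K, N))  # partial sum leaving PE (k, j) this cycle
    Y = np.zeros((M, N))
    for t in range(M + K + N - 2):  # enough cycles to drain the pipeline
        new_a = np.zeros((K, N))
        new_p = np.zeros((K, N))
        for k in range(K):
            for j in range(N):
                # Activation input: from the left neighbour, or from
                # memory at the array edge (skewed by row index k).
                if j == 0:
                    m = t - k
                    a_in = X[m, k] if 0 <= m < M else 0.0
                else:
                    a_in = a_reg[k, j - 1]
                # Partial-sum input: from the PE above, or zero at the top.
                p_in = p_reg[k - 1, j] if k > 0 else 0.0
                new_a[k, j] = a_in
                new_p[k, j] = p_in + a_in * W[k, j]
        a_reg, p_reg = new_a, new_p
        # Completed sums exit the bottom row, one output column per cycle.
        for j in range(N):
            m = t - (K - 1) - j
            if 0 <= m < M:
                Y[m, j] = p_reg[K - 1, j]
    return Y
```

Modeling the skewed input timing and drain latency like this makes the RTL control logic (when to feed each row, when each result is valid) much easier to get right before you write any Verilog.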
[1] https://arxiv.org/abs/1704.04760
[2] https://github.com/jofrfu/tinyTPU
[3] https://github.com/tensil-ai/tensil