One good approach could be to base the architecture on the TPU v1 described in [1]. There are also open-source accelerators you could draw inspiration from, for example [2][3]. If you want to avoid hand-coding the RTL yourself, you could look into tools that automatically map OpenCL kernels to an FPGA accelerator architecture, or use a project like [3], which provides pre-designed architectures for multiple FPGAs.
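The heart of the TPU v1 design in [1] is a weight-stationary systolic array: weights sit in a grid of multiply-accumulate cells, activations stream in from the left (skewed by one cycle per row), and partial sums flow downward and exit the bottom edge. Before committing to RTL, it can help to model the dataflow in software. Below is a minimal cycle-level sketch in Python (my own illustration, not code from any of the linked projects); the function name and register layout are just assumptions for the example.

```python
import numpy as np

def systolic_matmul(X, W):
    """Cycle-level model of a weight-stationary systolic array computing X @ W.

    W (K x N) is pre-loaded, one weight per processing element (PE).
    Rows of X (M x K) stream in from the left edge, skewed by one cycle
    per array row; partial sums flow downward and exit the bottom edge.
    """
    M, K = X.shape
    K2, N = W.shape
    assert K == K2
    a_reg = np.zeros((K, N))  # activation held in PE (k, j) this cycle
    p_reg = np.zeros((K, N))  # partial sum leaving PE (k, j) this cycle
    Y = np.zeros((M, N))
    for t in range(M + K + N - 2):  # enough cycles to drain the pipeline
        new_a = np.zeros((K, N))
        new_p = np.zeros((K, N))
        for k in range(K):
            for j in range(N):
                # Activation input: from the left neighbour, or from
                # memory at the array edge (skewed by row index k).
                if j == 0:
                    m = t - k
                    a_in = X[m, k] if 0 <= m < M else 0.0
                else:
                    a_in = a_reg[k, j - 1]
                # Partial-sum input: from the PE above, or zero at the top.
                p_in = p_reg[k - 1, j] if k > 0 else 0.0
                new_a[k, j] = a_in
                new_p[k, j] = p_in + a_in * W[k, j]
        a_reg, p_reg = new_a, new_p
        # Completed sums exit the bottom row, one output column per cycle.
        for j in range(N):
            m = t - (K - 1) - j
            if 0 <= m < M:
                Y[m, j] = p_reg[K - 1, j]
    return Y
```

Modeling the skewed input timing and drain latency like this makes the RTL control logic (when to feed each row, when each result is valid) much easier to get right before you write any Verilog.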
[1] https://arxiv.org/abs/1704.04760
[2] https://github.com/jofrfu/tinyTPU
[3] https://github.com/tensil-ai/tensil