We had a general discussion on this, especially about async. behaviour: it is always a question of whether you want an implementation close to the real thing or one that is optimal for the FPGA...
You are right, async. implementations are not optimal for FPGAs, but the async. reset mentioned works fine on the Spartan 3E used on the Replay; it just costs slightly more FPGA resources. Btw. the reset is not really asynchronous on the Replay: its timing is also derived from the clock (this is usually the case anyway if you get it e.g. from a DLL or similar block)... On the other hand, there are only a few clock domains and everything is gated, which IMHO helps much, much more for proper timing closure (in particular it helps the tools and reduces the number of interfaces between clock domains, which are hard to cover properly with checks).
On the Replay the T65 core runs within a pure ~33MHz clock domain (divided from the DDR RAM clock) and I don't see a speed issue - so it should easily run with a 16MHz or 8MHz clock as mentioned above as well. I also did an STA for my cores; the async. reset is not an issue.
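To put rough numbers on the frequency argument, here is a minimal Python sketch of the period arithmetic. It assumes, purely for illustration, that the slowest path in the design just meets timing at 33 MHz; the real figure has to come from the STA report. The names `period_ns`, `slack_ns` and `worst_path_ns` are made up for this example.

```python
def period_ns(freq_mhz: float) -> float:
    """Clock period in nanoseconds for a frequency given in MHz."""
    return 1000.0 / freq_mhz


# Assumption (not a measured value): the slowest register-to-register
# path needs exactly one 33 MHz period, i.e. it just meets timing there.
worst_path_ns = period_ns(33.0)


def slack_ns(freq_mhz: float, path_ns: float = worst_path_ns) -> float:
    """Setup slack: available clock period minus the path delay."""
    return period_ns(freq_mhz) - path_ns


if __name__ == "__main__":
    for f in (33.0, 16.0, 8.0):
        print(f"{f:5.1f} MHz: period {period_ns(f):6.1f} ns, "
              f"slack {slack_ns(f):6.1f} ns")
```

Under that assumption a 16MHz clock leaves about one extra 33MHz period of margin and an 8MHz clock roughly three, so lowering the clock only relaxes the setup requirement.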
Most people fail to set up proper clock and I/O timing constraints for their design and just take some tool defaults, which means they simply don't know why something fails, as the tools have nothing to check against. It is simply not enough to take the code, put it in a design and download it to the FPGA, like people are used to from microcontrollers. FPGA design is much more than that. Thus, most of my work is done without any FPGA hardware at all, just a workstation (ok, I don't count the hours afterwards playing some games to check the stability of the setup).
The problem can also reside outside the CPU core, especially on the bus to the peripherals and memories. Or it is a simple integration issue of the CPU core - e.g. some control signals are not used correctly, for instance together with memories. That is where I had most of my issues: connecting properly to the DRAM controller, which had nothing to do with the T65, although this guy got stuck all the time. You need a proper simulation and analysis setup, then you will find the errors and/or bottlenecks of the design. There you can also check everything you suspect could be an issue.
The T65 core has grown over time and has been modified by several people, so many things in the code are certainly not optimal. One day it will really make sense to clean up the whole thing, but for now it is working fine and I think it is not worth spending the effort on it in this phase - I'd say it is better to use it, test it thoroughly and fix the remaining issues. Then it will be a very good "golden model" for a "fresh" implementation later on...
/WoS