Significant Overhaul of the Interpreter's Timing Model #2235

Draft
wants to merge 307 commits into
base: master

Jaklyy
Contributor

@Jaklyy Jaklyy commented Dec 13, 2024

Heavily reworks the ARM9 & ARM7 timing models to greatly improve accuracy (and slaughter performance).
Builds upon my work in #2125 and uses the excellent cache implementation found in #1955, so those two should probably be merged first. (Hopefully building this PR upon those two doesn't cause any stupid or weird issues with git...? Fingers crossed?)

Implements:

  1. Cache streaming
  2. Write buffer
  3. Bus cycle rounding
  4. Main RAM contention
  5. Improvements to certain instruction timings
  6. Memory stage cycles are now distinguished from the execute stage
  7. Interlocks
  8. Improvements to memory access timings
  9. Minor improvements to DMA timings
  10. ARM9 now only stops for DMA when accessing the bus
  11. Fixes ExMemCnt having the incorrect default state (at least for direct boot; non-direct boot state shouldn't matter...?). Also prevents software from toggling certain bits.
  12. Removes a few non-existent cp15 cache commands
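
To illustrate item 3 (bus cycle rounding), here is a minimal sketch, not code from this PR. It assumes the CPU clock is a power-of-two multiple of the bus clock (the ARM9 runs at twice the bus clock on the NDS), so a bus access can only start on a bus-cycle boundary. The helper name `RoundUpToBusCycle` and the `clockShift` parameter are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: round a CPU-cycle timestamp up to the next bus-cycle
// boundary. clockShift is log2(CPU clock / bus clock); a value of 1 assumes
// the CPU runs at twice the bus clock.
constexpr std::uint64_t RoundUpToBusCycle(std::uint64_t cpuCycles, unsigned clockShift)
{
    const std::uint64_t mask = (std::uint64_t{1} << clockShift) - 1;
    return (cpuCycles + mask) & ~mask;
}

// Already-aligned timestamps are unchanged; others move to the next boundary.
static_assert(RoundUpToBusCycle(8, 1) == 8, "aligned timestamp stays put");
static_assert(RoundUpToBusCycle(9, 1) == 10, "odd timestamp rounds up at 2x");
static_assert(RoundUpToBusCycle(9, 2) == 12, "rounds to a multiple of 4 at 4x");
```

The mask form only works for power-of-two ratios, which is the assumption stated above.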

Known Issues:

  1. JIT is completely broken and will most likely need a significant amount of effort to work again.
  2. Write Buffer is very approximate; it needs a lot more work to really be accurate...
  3. There are actually two different types of interlock; this PR treats all interlocks as identical, which is wrong.
  4. Most DSi stuff has either not been implemented or not been extensively tested yet.
  5. There are probably oodles of regressions, freezes, and crashes I have yet to spot.
  6. Main RAM DMA Timings are slightly worse for long DMAs.
  7. The interpreter runs at roughly half its previous speed. This is unfortunately just a consequence of chasing high levels of accuracy, and unlikely to be fixed.
  8. ARM7 DMA has yet to be touched.
  9. Full ExMemCnt defaults have yet to be validated; all I know for sure is that bit 15 should be set by default. (TwilightMenu++ relies on this to boot).
  10. Write buffer also uses a shortcut of sorts. It doesn't actually use and increment the address value passed via the FIFO (should be the same as how hw does it?). I'm not entirely sure why, but it caused issues.
  11. Nothing is included in savestates yet, so they may be a little broken.
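
For readers unfamiliar with the interlocks mentioned in known issue 3, one generic way to model a single, uniform kind of interlock is to remember when each register's value becomes available and stall any reader until then. This is a sketch under that assumption, not the PR's implementation; `InterlockTracker` and its members are hypothetical names, and the latency values are left to the caller.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical model: each of the 16 registers remembers the cycle at which
// its value becomes available to later instructions.
struct InterlockTracker
{
    std::array<std::uint64_t, 16> readyAt{}; // all registers ready at cycle 0

    // Record that an instruction issued at `now` delivers `reg` after
    // `latency` further cycles.
    void MarkPending(unsigned reg, std::uint64_t now, std::uint64_t latency)
    {
        readyAt[reg] = now + latency;
    }

    // Cycle at which an instruction reading `reg` can proceed: `now` if the
    // value is already available, otherwise the later ready time (a stall).
    std::uint64_t Resolve(unsigned reg, std::uint64_t now) const
    {
        return std::max(now, readyAt[reg]);
    }
};
```

Distinguishing more than one kind of interlock would need extra state per register, which this sketch deliberately leaves out.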

Jaklyy added 30 commits June 7, 2024 23:46
also remove no longer needed variable
remove some checks for interlock that im pretty sure can't trigger
not implemented for direct boot
I believe this also applies to other loads as well, but currently untested.
need to verify if they apply to all store instructions
@@ -171,20 +219,48 @@ class ARM
 u32 DataRegion;
 s32 DataCycles;

-u32 R[16]; // heh
+alignas(64) u32 R[16]; // heh
Contributor

Did you mean u64 here?

Contributor Author

the alignas? no. i explicitly meant to align it to a host cacheline, which should be 64 bytes. doing so in a few places seemed to give a noticeable performance boost (though maybe that was just luck?)
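
A small sketch of the point being made here: `alignas` changes only where the array starts in memory, not the type or size of its elements. The struct name `AlignedRegs` and the helper are hypothetical, and 64 is the assumed host cache line size from the comment above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

struct AlignedRegs
{
    alignas(64) std::uint32_t R[16]; // 64 is an assumed host cache line size
};

// The elements are still 32-bit; only the alignment requirement changed.
static_assert(sizeof(AlignedRegs::R) == 16 * sizeof(std::uint32_t), "element type unchanged");
static_assert(alignof(AlignedRegs) == 64, "struct inherits the member's alignment");

// Returns true when this instance's array starts on a 64-byte boundary.
inline bool StartsOnCacheLine(const AlignedRegs& regs)
{
    return reinterpret_cast<std::uintptr_t>(regs.R) % 64 == 0;
}
```

With 4-byte elements the 16 registers occupy exactly 64 bytes, so when the assumption about the line size holds they fit in a single cache line.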

Contributor

@JesseTG, Dec 14, 2024

In that case, I think you might want std::hardware_destructive_interference_size or std::hardware_constructive_interference_size, so that you don't need to hardcode the cacheline size.

Contributor Author

what's the difference between the two?

Contributor

From the linked reference page:

  1. std::hardware_destructive_interference_size: Minimum offset between two objects to avoid false sharing. Guaranteed to be at least alignof(std::max_align_t)
  2. std::hardware_constructive_interference_size: Maximum size of contiguous memory to promote true sharing. Guaranteed to be at least alignof(std::max_align_t)

It has details and examples.
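
As a rough sketch of the suggestion, the hardcoded 64 could be replaced by the standard constant where the standard library provides it, falling back to 64 otherwise. The names `kCacheLine` and `PortableRegs` are hypothetical, and the fallback value is an assumption rather than something the standard guarantees.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <new>

// Use the library constant when the feature-test macro says it exists;
// otherwise fall back to an assumed 64-byte cache line.
#ifdef __cpp_lib_hardware_interference_size
constexpr std::size_t kCacheLine = std::hardware_destructive_interference_size;
#else
constexpr std::size_t kCacheLine = 64; // assumed fallback
#endif

struct PortableRegs
{
    alignas(kCacheLine) std::uint32_t R[16];
};

static_assert(alignof(PortableRegs) == kCacheLine, "array starts on a cache line boundary");
```

The constant is a compile-time value chosen for the target, not a runtime query of the host CPU, so it may still differ from the actual line size of the machine running the binary.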

oh no that was covering up SO many bugs hhhhsdfghhg
caused innumerable issues
will need a more comprehensive rewrite later
this should fix something?
the hack is to make arm9 dma contention work with prior improvements to synchronization
2 participants