[AIEX] Schedule SWP epilogue with "free" instructions #247

Draft
wants to merge 7 commits into
base: aie-public

Collaborator

@andcarminati andcarminati commented Dec 10, 2024

This PR adds support for Epilogue scheduling. To enable that, it also adds:

  • Support for top-down scheduling with explicit emission cycle.
  • Top-down logic for AIEMachineScheduler.

I recommend reviewing the pull request in the order of commits, although some of them are closely related, so I plan to combine them in the future.

Ongoing work related to EmitFixedSUnits: we currently add all WAR and RAW dependencies between the top-insert and the rest. However, the bot-insert handling can be changed to use bot register events as well.

for (const auto &Bundle : Bundles) {
for (MachineInstr *SrcMI : Bundle.getInstrs()) {
for (unsigned OpNum = 0; OpNum < SrcMI->getNumOperands(); OpNum++) {
unsigned SrcClass = SrcMI->getDesc().getSchedClass();
Collaborator Author

nit: const.

Collaborator

nit: invariant, can be declared one loop level up.
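
A minimal sketch of the hoisting suggested above, using hypothetical stand-ins (`InstrStub`, `getOperandCycle`, `maxOperandCycle`) rather than the real MachineInstr and itinerary APIs. The scheduling class depends only on the instruction, so it can be read once per instruction instead of once per operand:

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for MachineInstr: only the fields the loop needs.
struct InstrStub {
  unsigned SchedClass;
  unsigned NumOperands;
};

// Hypothetical itinerary lookup; returns -1 when no cycle is known.
int getOperandCycle(unsigned SchedClass, unsigned OpNum) {
  if (SchedClass == 0)
    return -1;
  return static_cast<int>(SchedClass * 10 + OpNum);
}

int maxOperandCycle(const std::vector<InstrStub> &Instrs) {
  int Max = 0;
  for (const InstrStub &MI : Instrs) {
    // Invariant across operands, so it is computed once per instruction.
    const unsigned SrcClass = MI.SchedClass;
    for (unsigned OpNum = 0; OpNum < MI.NumOperands; ++OpNum) {
      const int Latency = getOperandCycle(SrcClass, OpNum);
      assert(Latency >= 0 && "operand cycle must be known");
      if (Latency > Max)
        Max = Latency;
    }
  }
  return Max;
}
```

The latency formula here is invented purely so the sketch can be exercised; only the placement of `SrcClass` is the point.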

std::optional<unsigned> OptSrcCycle =
InstrItins->getOperandCycle(SrcClass, OpNum);
assert(OptSrcCycle);
int Latency = *OptSrcCycle;
Collaborator Author

nit: const.

@@ -170,6 +175,11 @@ class AIEPostRASchedStrategy : public PostGenericScheduler {
// After scheduling a block, fill in nops, apply bundling, etc.
void commitBlockSchedule(MachineBasicBlock *BB);

// This function returns true when it is possible to continue
// with top-down without entering in loop because all remaining instructions
Collaborator Author

nit: infinite loop.

Collaborator

Referring to an earlier commit, you say 'all remaining instructions'. Implementing it that way rather than focusing on the N=1 case with a delayslot would improve readability.

// We want to insert above it.
return std::lower_bound(IsTopNode ? begin() : bottom(),
IsTopNode ? top() : end(), *EmissionCycle,
HasGreaterOrLessOrEqEmissionCycle);
Collaborator

yeah just split in two lower_bound calls.
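
One way the split could look, as a sketch only. It assumes (this is not stated in the PR) that the top part of the sequence is sorted by ascending emission cycle and the bottom part by descending emission cycle; `Cycles`, `findTopInsertPos` and `findBotInsertPos` are hypothetical names. Each zone gets its own `std::lower_bound` call with a plain comparator, instead of one call whose range and comparator both depend on `IsTopNode`:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

using Cycles = std::vector<int>;

// Top zone, assumed sorted ascending: first entry whose cycle is >= Cycle.
Cycles::const_iterator findTopInsertPos(const Cycles &C, std::size_t TopEnd,
                                        int Cycle) {
  assert(TopEnd <= C.size());
  return std::lower_bound(C.begin(),
                          C.begin() + static_cast<std::ptrdiff_t>(TopEnd),
                          Cycle,
                          [](int Elem, int Value) { return Elem < Value; });
}

// Bottom zone, assumed sorted descending: first entry whose cycle is <= Cycle.
Cycles::const_iterator findBotInsertPos(const Cycles &C, std::size_t BotBegin,
                                        int Cycle) {
  assert(BotBegin <= C.size());
  return std::lower_bound(C.begin() + static_cast<std::ptrdiff_t>(BotBegin),
                          C.end(), Cycle,
                          [](int Elem, int Value) { return Elem > Value; });
}
```

With the two zones separated, each comparator states its ordering directly and the caller picks the function based on the zone.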

IsPreRA(IsPreRA), SchedZone(SchedBoundary::BotQID, "Zone") {}
IsPreRA(IsPreRA),
SchedZone(IsTopDown ? SchedBoundary::TopQID : SchedBoundary::BotQID,
"Zone") {}
Collaborator

It looks as if "Zone" could be more descriptive

const BlockState &LBS = getBlockState(Loop);

// Epilogues of pipelined loops should emit the bundles swp epilog.
// in a dedicated exit. If there isn't one, spawn a new block,
Collaborator

nit: emit the bundles of the swp epilog in a dedicated exit.

if (getBlockState(S).Kind == BlockType::Loop) {
getBlockState(L).Kind = BlockType::Epilogue;
}
});
Collaborator

@martien-de-jong martien-de-jong Dec 10, 2024

as discussed, perhaps something along the lines of: if (!any_of(predecessors, IsLoop)) BS.Kind = BlockType::Regular;
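
A self-contained sketch of that suggestion, with hypothetical stand-ins (`Block`, `demoteIfNoLoopPredecessor`) in place of the real block-state types. It uses `std::any_of` over the predecessors and resets the kind to Regular when none of them is a loop:

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

enum class BlockType { Regular, Loop, Epilogue };

// Hypothetical stand-in for a block plus its BlockState.
struct Block {
  BlockType Kind = BlockType::Regular;
  std::vector<const Block *> Preds;
};

// A block only keeps a non-regular kind if at least one predecessor is a loop.
void demoteIfNoLoopPredecessor(Block &B) {
  const bool HasLoopPred =
      std::any_of(B.Preds.begin(), B.Preds.end(), [](const Block *P) {
        assert(P && "null predecessor");
        return P->Kind == BlockType::Loop;
      });
  if (!HasLoopPred)
    B.Kind = BlockType::Regular;
}
```

Whether this check belongs where the Epilogue kind is assigned or in a later cleanup step is a design choice the sketch does not settle.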

ArrayRef<MachineBundle> TopFixedBundles;
ArrayRef<MachineBundle> TopFixedBundles =
RegionBegin == BB->begin() ? ArrayRef<MachineBundle>(BS.TopInsert)
: ArrayRef<MachineBundle>();
Collaborator

Check: TopFixedBundles was empty before, triggering no further action.

const int DeltaCycles = CurrCycle - BotReadyCycle;
return FixedSU == &SU && DeltaCycles >= MinDelta;
if (Zone.isTop()) {
return FixedSU == &SU && CurrCycle == TopReadyCycle;
Collaborator

nit: early return false on FixedSU != &SU
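
A sketch of the early-return shape, assuming the bottom-up branch keeps the existing `DeltaCycles >= MinDelta` check. `SUnitStub`, `ZoneStub` and `isFixedSUReady` are hypothetical stand-ins for the real scheduler types:

```cpp
#include <cassert>

// Hypothetical stand-ins for SUnit and SchedBoundary.
struct SUnitStub {
  int TopReadyCycle;
  int BotReadyCycle;
};
struct ZoneStub {
  bool IsTop;
  int CurrCycle;
};

bool isFixedSUReady(const SUnitStub *FixedSU, const SUnitStub &SU,
                    const ZoneStub &Zone, int MinDelta) {
  // Identity check first, so neither branch has to repeat it.
  if (FixedSU != &SU)
    return false;
  if (Zone.IsTop)
    return Zone.CurrCycle == SU.TopReadyCycle;
  const int DeltaCycles = Zone.CurrCycle - SU.BotReadyCycle;
  return DeltaCycles >= MinDelta;
}
```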

@@ -639,7 +640,7 @@ void AIEPostRASchedStrategy::commitBlockSchedule(MachineBasicBlock *BB) {

// Safety margin, swp epilogue
// Note that the prologue is handled in a different way. See enterMBB.
Collaborator

Comment is out of date, we now only handle the safety margin here.

Collaborator

Curious: Can we have both a safety margin and a top-fixed region? If not, can we assert it doesn't happen?

Collaborator Author

@andcarminati andcarminati Dec 11, 2024

We cannot have both! If we need to supply a safety margin for an swp loop, it means the schedule is incorrect. We cannot calculate the safety margin for an swp loop without triggering this assert:

llc: ../llvm/lib/Target/AIE/AIEInterBlockScheduling.cpp:903: auto llvm::AIE::InterBlockScheduling::getCyclesToRespectTiming(const llvm::AIE::BlockState &, const llvm::AIE::BlockState &)::(anonymous class)::operator()(const llvm::AIE::Region &) const: Assertion `R.top_fixed_instrs().empty() && "SWP epilogue already emitted?"' failed.
+ /scratch/llvm-aie/build-public-mem/bin/FileCheck /scratch/llvm-aie/llvm/test/CodeGen/AIE/aie2/schedule/postpipeliner/add-store.mir

// This function returns false when the available queue is empty and there is a
// single instruction in the pending queue that has a delay slot. Continuing
// with a top-down approach in this scenario would lead to an infinite loop,
// since instructions with delay slots are never available for the top zones.
Collaborator

nit: I think the last observation is more important. In fact, progress is blocked if no instruction in the pending queue can become available in top-down. The fact that currently only delay slot instructions apply, and that we can only have one delay slot instruction in a region, is a coincidence.

Collaborator

I'm wondering if we should remove the instruction from the pending queue altogether. Basically never have it in the queues of the Top zone.

// single instruction in the pending queue that has a delay slot. Continuing
// with a top-down approach in this scenario would lead to an infinite loop,
// since instructions with delay slots are never available for the top zones.
bool AIEPostRASchedStrategy::canContinueTopDown() {
Collaborator

nit: I would invert the logic value, e.g. mustSwitchToBottomUp

@@ -570,6 +595,10 @@ bool AIEPostRASchedStrategy::isAvailableNode(SUnit &SU, SchedBoundary &Zone,
if (isFixedSU(SU, !Zone.isTop()))
return false;

// Instruction with delay slot should bever be scheduled in top-down.
Collaborator

nit: Instructions

// Instruction with delay slot should bever be scheduled in top-down.
if (Zone.isTop() && SU.getInstr()->hasDelaySlot())
return false;

Collaborator

Nit: It would be nice to have a named predicate for this, like 'doesNotProgress' that is used both here and in the logic to switch to BottomUp
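
A sketch of how the shared predicate might be factored, combining this suggestion with the earlier ones about generalizing beyond a single delay slot instruction and inverting the return value. `InstrStub` and the queue vectors are hypothetical stand-ins, and the predicate body encodes only the delay slot rule from the diff:

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical stand-in for an SUnit's instruction.
struct InstrStub {
  bool HasDelaySlot;
};

// True if the instruction can never become available in the given zone.
bool doesNotProgress(const InstrStub &I, bool IsTopZone) {
  return IsTopZone && I.HasDelaySlot;
}

// Top-down is stuck when nothing is available and every pending
// instruction is one that can never become available in the top zone.
bool mustSwitchToBottomUp(const std::vector<InstrStub> &Available,
                          const std::vector<InstrStub> &Pending) {
  if (!Available.empty() || Pending.empty())
    return false;
  return std::all_of(Pending.begin(), Pending.end(), [](const InstrStub &I) {
    return doesNotProgress(I, /*IsTopZone=*/true);
  });
}
```

Because the switch condition is phrased over all pending instructions, it no longer relies on there being exactly one delay slot instruction per region.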

unsigned getMaxSrcOperandLatency(const MachineInstr &MI) const {
unsigned MaxLatency = 0;
for (const MachineOperand &MO : MI.all_uses()) {
if (!MO.isReg() || !MO.isUse())
Collaborator

Can one of all_uses() be anything else?


// First, create SUnits for all "fixed" instructions
// Those will be chained from/to the EntrySU/ExitSU to ensure they are
// placed in the correct cycle. The scheduler will enforce that these fixed
// SUnits get placed exactly at their depth (for the Top zone) or height
// (for the Bot zone).
SUnit *Pred = &DAG->EntrySU;
for (MachineInstr &MI : CurRegion.top_fixed_instrs()) {
Collaborator

Perhaps mention that we iterate over bundles

@@ -359,6 +442,65 @@ class EmitFixedSUnits : public ScheduleDAGMutation {
AIE::maxLatency(&MI, *TII, *ItinData, /*IncludeStages=*/true));
FixedDepSU->addPred(Dep, /*Required=*/true);
}

// We only need to focus on top-fixed instructions when there is an Epilog
Collaborator

Nit: Epilogue

RAT.computeAvailabilityCycles(LoopTimedBundles, /*PastTheEndCycles*/ true);

auto IsNotTopFixedSU = [Scheduler](const SUnit &SU) {
return !Scheduler->isFixedSU(SU, true);
Collaborator

nit: /*IsTop=*/true ?

const MachineInstr &MI = *FixedSU.getInstr();
if (const unsigned Latency = RAT.getMaxSrcOperandLatency(MI)) {
SDep Dep(&FixedSU, SDep::Artificial);
int latency =
Collaborator

nit: Latency

}
// Otherwise, the loop is the fallthrough predecessor by construction
for (auto *Pred : MBB.predecessors()) {
if (Pred->isLayoutSuccessor(&MBB)) {
Collaborator

Hmm. 'By Construction' only holds true for the InterBlock construction.

const BlockState &BS =
Scheduler->getInterBlock().getBlockState(DAG->getBB());
const Region &CurRegion = BS.getCurrentRegion();
RegAvailabilityTracker RAT{ItinData, TRI};
Collaborator

Love the name 🤣

Collaborator Author

It is funny indeed.... ;-)

// separate mutator, doing so could be costly, as it would prevent the
// creation of multiple edges from EntrySU to each free instruction that
// depends on both timed regions (TopFixed and LoopTimed).
RAT.computeAvailabilityCycles(LoopTimedBundles, /*PastTheEndCycles*/ true);
Collaborator Author

Fact: we don't need a full sweep in LoopTimedBundles.

@@ -314,20 +315,102 @@ class RegionEndEdges : public ScheduleDAGMutation {
/// "fixed" SUnits.
class EmitFixedSUnits : public ScheduleDAGMutation {
public:
struct RegAvailabilityTracker {
Collaborator

I would even go a bit further and compute a "register event view". As a first conservative version:

  • Every instruction would create "read" events in its first cycle for every input operand
  • and "write" events in its last cycle (determined from maxLatency()) for every output operand

This would allow us to use the view for all cases:

  • Deps between a non-pipelined loop and its epilogue
  • Deps between top fixed and free instructions
  • Deps between free and bottom fixed instructions
  • etc.

What do you think? This could later become the base of timing-aware live ranges if we ever do register allocation after scheduling.
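
To make the proposed conservative view concrete, here is a minimal sketch. All names (`RegEvent`, `InstrStub`, `buildRegisterEventView`) are hypothetical, and it assumes the "last cycle" of an instruction is its issue cycle plus its maximum latency, which the comment above does not pin down:

```cpp
#include <cassert>
#include <vector>

enum class EventKind { Read, Write };

struct RegEvent {
  unsigned Reg;
  int Cycle;
  EventKind Kind;
};

// Hypothetical stand-in for a scheduled instruction.
struct InstrStub {
  int IssueCycle;
  int MaxLatency; // Assumed to come from something like maxLatency().
  std::vector<unsigned> Uses;
  std::vector<unsigned> Defs;
};

std::vector<RegEvent>
buildRegisterEventView(const std::vector<InstrStub> &Instrs) {
  std::vector<RegEvent> Events;
  for (const InstrStub &I : Instrs) {
    assert(I.MaxLatency >= 0 && "latency must be non-negative");
    // Conservative: every input is read in the first cycle.
    for (unsigned R : I.Uses)
      Events.push_back({R, I.IssueCycle, EventKind::Read});
    // Conservative: every output is written in the last cycle (assumption:
    // issue cycle + maximum latency).
    for (unsigned R : I.Defs)
      Events.push_back({R, I.IssueCycle + I.MaxLatency, EventKind::Write});
  }
  return Events;
}
```

Dependencies between any two timed regions could then be derived by comparing events on the same register, without caring which kind of region produced them.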

Collaborator

The downside would be: it cannot be called RAT anymore. REV is still quite cool though.

Collaborator Author

Hmm, interesting. But I don't see the point of using maxLatency() here. For example, a post-increment load produces two outputs in different cycles, so I think it will be too pessimistic.

Collaborator Author

@andcarminati andcarminati Dec 12, 2024

Hi @gboss, I believe we should adopt a KISS approach for this REV, given our current requirements. Specifically, we need to have a clear understanding of:

  • The last cycle in which a register is defined - top and bot part.
  • The first cycle in which a register is read - top and bot part.

With this information, we can replace the findEarliestRef method by directly connecting free SUs to ExitSU. Although this may slightly increase the cost for isAvailableNode due to a higher number of pending SUs, it will simplify the process of comparing all free instructions against the BotFixed bundles to identify the first reference.

I propose creating this event view as a separate class, outside of the subtarget, so that we can extend it as needed in the future.

What are your thoughts?
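
A rough sketch of the two pieces of information listed above, kept deliberately small. `RegAccess`, `RegTimingView` and `summarizeAccesses` are hypothetical names; a real version would presumably keep separate views for the top and bot parts:

```cpp
#include <algorithm>
#include <cassert>
#include <map>
#include <vector>

// Hypothetical record of one register access at a known cycle.
struct RegAccess {
  unsigned Reg;
  int Cycle;
  bool IsDef;
};

// Per register: first cycle it is read and last cycle it is defined.
struct RegTimingView {
  std::map<unsigned, int> FirstRead;
  std::map<unsigned, int> LastDef;
};

RegTimingView summarizeAccesses(const std::vector<RegAccess> &Accesses) {
  RegTimingView View;
  for (const RegAccess &A : Accesses) {
    if (A.IsDef) {
      auto Ins = View.LastDef.insert({A.Reg, A.Cycle});
      if (!Ins.second) // Already present: keep the latest definition.
        Ins.first->second = std::max(Ins.first->second, A.Cycle);
    } else {
      auto Ins = View.FirstRead.insert({A.Reg, A.Cycle});
      if (!Ins.second) // Already present: keep the earliest read.
        Ins.first->second = std::min(Ins.first->second, A.Cycle);
    }
  }
  return View;
}
```

Keeping the view as a standalone class with plain maps makes it easy to extend later, for example with per-operand cycles instead of a single conservative latency.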

@@ -48,6 +48,10 @@ getSingleBlockLoopMBBs(const MachineFunction &MF);
/// Check if this block is a single block loop.
bool isSingleMBBLoop(const MachineBasicBlock *MBB);

/// Considering that MBB has a single predecessor that is a loop
/// and also layout predecessor, return it.
Collaborator

Actually, it can return a layout predecessor that is not a loop, a unique predecessor that is not a loop or a null pointer (which is also not a loop)

Collaborator Author

This was just a refactoring. Any free use is dangerous, so I put this comment. Maybe we should name it as getLayoutPredecessor and then we don't care if it is a loop or not. In this case, it is not a loop utils function anymore.... Any suggestion?

Collaborator Author

Discussed offline: it is enough to assert isSingleMBBLoop.

for (const MachineOperand &MO : MI.all_uses()) {
if (!MO.isReg() || !MO.isUse())
continue;
for (MCRegAliasIterator Ali(MO.getReg(), TRI, true); Ali.isValid();
Collaborator Author

@andcarminati andcarminati Dec 11, 2024

Don't use the alias iterator here: we populated RegisterToCycle with all aliases, so we are creating false alias cases, like bmh0 aliasing to bml0. Even better would be to not populate RegisterToCycle with aliases and use the alias iterator here.

This commit prepares to schedule top fixed bundles. We also create dedicated
loop exits early, handling new blocks along with their corresponding block states.
If we have TopFixed instructions, we start top-down and switch
to bottom-up once we have filled as many of the slots related
to those instructions as possible. Special care is needed for
instructions with delay slots and for bottom-fixed instructions.
@andcarminati andcarminati force-pushed the andreu.swp.epilogue.scheduling branch from 4776f1d to cc37cc3 Compare December 12, 2024 09:20
// into account MaxLatency.
for (SUnit &FixedSU : make_filter_range(DAG->SUnits, IsTopFixedSU)) {
const MachineInstr &MI = *FixedSU.getInstr();
if (const unsigned Latency = RAT.getMaxSrcOperandLatency(MI)) {
Collaborator Author

@andcarminati andcarminati Dec 12, 2024

This if was wrongly copy pasted. We need just maxLatency here.

@andcarminati andcarminati force-pushed the andreu.swp.epilogue.scheduling branch from cc37cc3 to 4168aba Compare December 12, 2024 16:05
@andcarminati andcarminati force-pushed the andreu.swp.epilogue.scheduling branch from 4168aba to c940c2e Compare December 12, 2024 16:11
@Xilinx Xilinx deleted a comment from martien-de-jong Dec 13, 2024
if (DedicatedExit == BB) {

// Trim excedent empty bundles.
while (BS.TopInsert.back().empty()) {
Collaborator

Note: With the latest changes to the post-pipeliner, it seems we can end up with a pipeline of 1 stage, essentially meaning the loop isn't pipelined. It's probably an oversight, and the loop should not have been considered as isPipelined(). But still, it makes the BS.TopInsert.back() code above crash.

I'd suggest understanding the root cause (you can check llvm/test/CodeGen/AIE/aie2/schedule/postpipeliner/crash.mir, which now crashes again 😆), and adding an assert in e.g. isPipelined() that if a loop is pipelined, it has non-empty top and bottom inserts.
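
Independently of the root cause, the trimming loop itself can be made safe against an empty container. A minimal sketch, assuming the loop body pops the trailing bundle (the diff excerpt does not show it); `Bundle` and `trimTrailingEmptyBundles` are hypothetical stand-ins:

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for MachineBundle: a list of instruction ids.
using Bundle = std::vector<int>;

// Calling back() on an empty vector is undefined behavior, so check
// for emptiness before inspecting the last bundle.
void trimTrailingEmptyBundles(std::vector<Bundle> &TopInsert) {
  while (!TopInsert.empty() && TopInsert.back().empty())
    TopInsert.pop_back();
}
```

This guard only avoids the crash; the suggested assert in isPipelined() would still be what catches the underlying inconsistency.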
