[AIEX] Schedule SWP epilogue with "free" instructions #247
base: aie-public
Branch updated from a905dec to 4776f1d
for (const auto &Bundle : Bundles) {
  for (MachineInstr *SrcMI : Bundle.getInstrs()) {
    for (unsigned OpNum = 0; OpNum < SrcMI->getNumOperands(); OpNum++) {
      unsigned SrcClass = SrcMI->getDesc().getSchedClass();
nit: const.
nit: invariant, can be declared one loop level up.
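Here is a minimal sketch of the two nits above. It uses simplified, hypothetical stand-ins (Instr and Itineraries) in place of the real MachineInstr and InstrItineraryData types. The scheduling class does not depend on OpNum, so the sketch computes it once per instruction. Values that are never reassigned are declared const.

```cpp
#include <cassert>
#include <optional>
#include <vector>

// Hypothetical stand-ins for MachineInstr / InstrItineraryData (not LLVM API).
struct Instr {
  unsigned SchedClass;
  unsigned NumOperands;
};

struct Itineraries {
  // OperandCycles[SchedClass][OpNum] -> cycle associated with that operand.
  std::vector<std::vector<unsigned>> OperandCycles;

  std::optional<unsigned> getOperandCycle(unsigned SchedClass,
                                          unsigned OpNum) const {
    if (SchedClass >= OperandCycles.size() ||
        OpNum >= OperandCycles[SchedClass].size())
      return std::nullopt;
    return OperandCycles[SchedClass][OpNum];
  }
};

// Returns the largest operand cycle over all instructions of all bundles.
int maxOperandCycle(const std::vector<std::vector<const Instr *>> &Bundles,
                    const Itineraries &Itins) {
  int Max = 0;
  for (const auto &Bundle : Bundles) {
    for (const Instr *SrcMI : Bundle) {
      // Does not depend on OpNum: hoisted one loop level up and made const.
      const unsigned SrcClass = SrcMI->SchedClass;
      for (unsigned OpNum = 0; OpNum < SrcMI->NumOperands; ++OpNum) {
        const std::optional<unsigned> OptSrcCycle =
            Itins.getOperandCycle(SrcClass, OpNum);
        assert(OptSrcCycle && "missing operand cycle");
        const int Latency = static_cast<int>(*OptSrcCycle);
        if (Latency > Max)
          Max = Latency;
      }
    }
  }
  return Max;
}
```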
      std::optional<unsigned> OptSrcCycle =
          InstrItins->getOperandCycle(SrcClass, OpNum);
      assert(OptSrcCycle);
      int Latency = *OptSrcCycle;
nit: const.
@@ -170,6 +175,11 @@ class AIEPostRASchedStrategy : public PostGenericScheduler {
  // After scheduling a block, fill in nops, apply bundling, etc.
  void commitBlockSchedule(MachineBasicBlock *BB);

  // This function returns true when it is possible to continue
  // with top-down without entering in loop because all remaining instructions
nit: infinite loop.
Referring to an earlier commit, you say 'all remaining instructions'. Implementing it that way, rather than focusing on the N=1 case with a delay slot, would improve readability.
// We want to insert above it.
return std::lower_bound(IsTopNode ? begin() : bottom(),
                        IsTopNode ? top() : end(), *EmissionCycle,
                        HasGreaterOrLessOrEqEmissionCycle);
Yeah, just split it into two lower_bound calls.
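Here is a sketch of the suggested split. It uses a hypothetical ZoneSeq that stores emission cycles, with the top zone in [0, TopEnd) and the bottom zone in [BotBegin, size()). Plain ascending order stands in for the PR's HasGreaterOrLessOrEqEmissionCycle comparator, whose exact ordering is not shown here. With the split, each branch names its own range instead of selecting both bounds with the same ternary.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical model: emission cycles of scheduled instructions. The top zone
// occupies [0, TopEnd) and the bottom zone [BotBegin, size()); each zone is
// assumed to be sorted by ascending emission cycle.
struct ZoneSeq {
  std::vector<int> Cycles;
  std::size_t TopEnd;
  std::size_t BotBegin;

  using ConstIt = std::vector<int>::const_iterator;

  // Position before which an instruction with EmissionCycle is inserted.
  ConstIt insertPoint(bool IsTopNode, int EmissionCycle) const {
    if (IsTopNode)
      return std::lower_bound(
          Cycles.begin(),
          Cycles.begin() + static_cast<std::ptrdiff_t>(TopEnd), EmissionCycle);
    return std::lower_bound(
        Cycles.begin() + static_cast<std::ptrdiff_t>(BotBegin), Cycles.end(),
        EmissionCycle);
  }
};
```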
      IsPreRA(IsPreRA), SchedZone(SchedBoundary::BotQID, "Zone") {}
      IsPreRA(IsPreRA),
      SchedZone(IsTopDown ? SchedBoundary::TopQID : SchedBoundary::BotQID,
                "Zone") {}
It looks as if "Zone" could be more descriptive
const BlockState &LBS = getBlockState(Loop);

// Epilogues of pipelined loops should emit the bundles swp epilog.
// in a dedicated exit. If there isn't one, spawn a new block,
nit: emit the bundles of the swp epilog in a dedicated exit.
    if (getBlockState(S).Kind == BlockType::Loop) {
      getBlockState(L).Kind = BlockType::Epilogue;
    }
  });
As discussed, perhaps something along the lines of: if (!any_of(predecessors, IsLoop)) BS.Kind = BlockType::Regular;
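Here is a sketch of the classification suggested above. It uses a simplified, hypothetical Block type instead of MachineBasicBlock and the PR's BlockState. Each non-loop block inspects its own predecessors, so a block that no loop reaches ends up Regular without relying on loops to mark their successors. This sketch also defaults blocks with a loop predecessor to Epilogue. That is an assumption about how the two cases would be combined, not something the thread states.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

enum class BlockType { Regular, Loop, Epilogue };

// Hypothetical, simplified block model; not the MachineBasicBlock API.
struct Block {
  BlockType Kind = BlockType::Regular;
  std::vector<const Block *> Preds;
};

// Classify a non-loop block from its own predecessors instead of pushing the
// Epilogue kind from each loop to its successors.
void classify(Block &B) {
  if (B.Kind == BlockType::Loop)
    return;
  auto IsLoop = [](const Block *P) { return P->Kind == BlockType::Loop; };
  B.Kind = std::any_of(B.Preds.begin(), B.Preds.end(), IsLoop)
               ? BlockType::Epilogue
               : BlockType::Regular;
}
```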
ArrayRef<MachineBundle> TopFixedBundles;
ArrayRef<MachineBundle> TopFixedBundles =
    RegionBegin == BB->begin() ? ArrayRef<MachineBundle>(BS.TopInsert)
                               : ArrayRef<MachineBundle>();
Check: TopFixedBundles was empty before, triggering no further action.
  const int DeltaCycles = CurrCycle - BotReadyCycle;
  return FixedSU == &SU && DeltaCycles >= MinDelta;
if (Zone.isTop()) {
  return FixedSU == &SU && CurrCycle == TopReadyCycle;
nit: early return false on FixedSU != &SU
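One way to apply the early-return nit is sketched below. FixedCheck and SUnit are hypothetical stand-ins that carry only the fields visible in the snippet (CurrCycle, TopReadyCycle, BotReadyCycle, MinDelta). The identity check runs once up front, so each zone's condition reads on its own.

```cpp
#include <cassert>

struct SUnit {};

// Hypothetical inputs; in the PR these come from the zone and the fixed SU.
struct FixedCheck {
  const SUnit *FixedSU;
  bool IsTop;
  int CurrCycle;
  int TopReadyCycle;
  int BotReadyCycle;
  int MinDelta;

  bool isReady(const SUnit &SU) const {
    // Early exit: only the fixed SU can satisfy the cycle conditions below.
    if (FixedSU != &SU)
      return false;
    if (IsTop)
      return CurrCycle == TopReadyCycle;
    return CurrCycle - BotReadyCycle >= MinDelta;
  }
};
```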
@@ -639,7 +640,7 @@ void AIEPostRASchedStrategy::commitBlockSchedule(MachineBasicBlock *BB) {

// Safety margin, swp epilogue
// Note that the prologue is handled in a different way. See enterMBB.
The comment is out of date: we now only handle the safety margin here.
Curious: Can we have both a safety margin and a top-fixed region? If not, can we assert it doesn't happen?
We cannot have both! If we need to supply a safety margin for an SWP loop, the schedule is incorrect. We cannot calculate the safety margin for an SWP loop without triggering this assert:
llc: ../llvm/lib/Target/AIE/AIEInterBlockScheduling.cpp:903: auto llvm::AIE::InterBlockScheduling::getCyclesToRespectTiming(const llvm::AIE::BlockState &, const llvm::AIE::BlockState &)::(anonymous class)::operator()(const llvm::AIE::Region &) const: Assertion `R.top_fixed_instrs().empty() && "SWP epilogue already emitted?"' failed.
+ /scratch/llvm-aie/build-public-mem/bin/FileCheck /scratch/llvm-aie/llvm/test/CodeGen/AIE/aie2/schedule/postpipeliner/add-store.mir
// This function returns false when the available queue is empty and there is a
// single instruction in the pending queue that has a delay slot. Continuing
// with a top-down approach in this scenario would lead to an infinite loop,
// since instructions with delay slots are never available for the top zones.
nit: I think the last observation is more important. In fact, progress is blocked if no instruction in the pending queue can become available top-down. The fact that currently only delay-slot instructions apply, and that we can only have one delay-slot instruction in a region, is a coincidence.
I'm wondering if we should remove the instruction from the pending queue altogether. Basically never have it in the queues of the Top zone.
// single instruction in the pending queue that has a delay slot. Continuing
// with a top-down approach in this scenario would lead to an infinite loop,
// since instructions with delay slots are never available for the top zones.
bool AIEPostRASchedStrategy::canContinueTopDown() {
nit: I would invert the logic, e.g. mustSwitchToBottomUp.
@@ -570,6 +595,10 @@ bool AIEPostRASchedStrategy::isAvailableNode(SUnit &SU, SchedBoundary &Zone,
  if (isFixedSU(SU, !Zone.isTop()))
    return false;

  // Instruction with delay slot should bever be scheduled in top-down.
nit: Instructions
  // Instruction with delay slot should bever be scheduled in top-down.
  if (Zone.isTop() && SU.getInstr()->hasDelaySlot())
    return false;

Nit: It would be nice to have a named predicate for this, like 'doesNotProgress', that is used both here and in the logic that switches to bottom-up.
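Here is a sketch of the named predicate suggested in this thread. It is combined with the inverted name proposed earlier (mustSwitchToBottomUp). SU, doesNotProgressTopDown and the vector-based queues are hypothetical simplifications. The only non-progressing case modeled is the delay-slot instruction from the code comment. The sketch also requires a non-empty pending queue, which is an assumption about when the switch should trigger.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical SU model: only the property that matters here.
struct SU {
  bool HasDelaySlot = false;
};

// True if this SU can never become available in the top zone. Currently
// this only applies to instructions with a delay slot.
bool doesNotProgressTopDown(const SU &S) { return S.HasDelaySlot; }

// Top-down is stuck when nothing is available and every pending SU is one
// that can never become available top-down.
bool mustSwitchToBottomUp(const std::vector<const SU *> &Available,
                          const std::vector<const SU *> &Pending) {
  return Available.empty() && !Pending.empty() &&
         std::all_of(Pending.begin(), Pending.end(),
                     [](const SU *S) { return doesNotProgressTopDown(*S); });
}
```

The same doesNotProgressTopDown predicate could also guard isAvailableNode for the top zone, so that both places share one definition of "no progress".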
unsigned getMaxSrcOperandLatency(const MachineInstr &MI) const {
  unsigned MaxLatency = 0;
  for (const MachineOperand &MO : MI.all_uses()) {
    if (!MO.isReg() || !MO.isUse())
Can one of all_uses() be anything else?

// First, create SUnits for all "fixed" instructions
// Those will be chained from/to the EntrySU/ExitSU to ensure they are
// placed in the correct cycle. The scheduler will enforce that these fixed
// SUnits get placed exactly at their depth (for the Top zone) or height
// (for the Bot zone).
SUnit *Pred = &DAG->EntrySU;
for (MachineInstr &MI : CurRegion.top_fixed_instrs()) {
Perhaps mention that we iterate over bundles
@@ -359,6 +442,65 @@ class EmitFixedSUnits : public ScheduleDAGMutation {
        AIE::maxLatency(&MI, *TII, *ItinData, /*IncludeStages=*/true));
    FixedDepSU->addPred(Dep, /*Required=*/true);
  }

  // We only need to focus on top-fixed instructions when there is an Epilog
Nit: Epilogue
RAT.computeAvailabilityCycles(LoopTimedBundles, /*PastTheEndCycles*/ true);

auto IsNotTopFixedSU = [Scheduler](const SUnit &SU) {
  return !Scheduler->isFixedSU(SU, true);
nit: /*IsTop=*/true?
const MachineInstr &MI = *FixedSU.getInstr();
if (const unsigned Latency = RAT.getMaxSrcOperandLatency(MI)) {
  SDep Dep(&FixedSU, SDep::Artificial);
  int latency =
nit: Latency
}
// Otherwise, the loop is the fallthrough predecessor by construction
for (auto *Pred : MBB.predecessors()) {
  if (Pred->isLayoutSuccessor(&MBB)) {
Hmm. 'By Construction' only holds true for the InterBlock construction.
const BlockState &BS =
    Scheduler->getInterBlock().getBlockState(DAG->getBB());
const Region &CurRegion = BS.getCurrentRegion();
RegAvailabilityTracker RAT{ItinData, TRI};
Love the name 🤣
It is funny indeed.... ;-)
// separate mutator, doing so could be costly, as it would prevent the
// creation of multiple edges from EntrySU to each free instruction that
// depends on both timed regions (TopFixed and LoopTimed).
RAT.computeAvailabilityCycles(LoopTimedBundles, /*PastTheEndCycles*/ true);
Fact: we don't need a full sweep in LoopTimedBundles.
@@ -314,20 +315,102 @@ class RegionEndEdges : public ScheduleDAGMutation {
/// "fixed" SUnits.
class EmitFixedSUnits : public ScheduleDAGMutation {
public:
  struct RegAvailabilityTracker {
I would even go a bit further and compute a "register event view". As a first conservative version:
- Every instruction would create "read" events in its first cycle for every input operand
- and "write" events in its last cycle (determined from maxLatency()) for every output operand.
This would allow us to use the view for all cases:
- Deps between a non-pipelined loop and its epilogue
- Deps between top fixed and free instructions
- Deps between free and bottom fixed instructions
- etc.
What do you think? This could later become the basis of timing-aware live ranges if we ever do register allocation after scheduling.
The downside would be: it cannot be called RAT anymore. REV is still quite cool though.
Hmm, interesting. But I don't see the point of using maxLatency() here. For example, a post-increment load produces two outputs in different cycles, so I think it would be too pessimistic.
Hi @gboss, I believe we should adopt a KISS approach for this REV, given our current requirements. Specifically, we need a clear understanding of:
- The last cycle in which a register is defined (top and bot part).
- The first cycle in which a register is read (top and bot part).
With this information, we can replace the findEarliestRef method by directly connecting free SUs to ExitSU. Although this may slightly increase the cost of isAvailableNode due to a higher number of pending SUs, it will simplify the process of comparing all free instructions against the BotFixed bundles to identify the first reference.
I propose creating this event view as a separate class, outside of the subtarget, so that we can extend it as needed in the future.
What are your thoughts?
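The KISS view described above can be sketched as a small standalone class. This is a hypothetical illustration, not code from the PR. TimedInstr, RegisterEventView, lastWrite and firstRead are made-up names, and registers are plain unsigned ids. The sketch assumes reads happen in the issue cycle. Each def carries its own latency instead of a single maxLatency(), so an instruction whose outputs are written in different cycles is not over-approximated.

```cpp
#include <cassert>
#include <map>
#include <optional>
#include <utility>
#include <vector>

// Hypothetical timed instruction: issue cycle, registers read at issue, and
// (register, write latency) pairs for its outputs.
struct TimedInstr {
  int Cycle;
  std::vector<unsigned> Uses;
  std::vector<std::pair<unsigned, int>> Defs;
};

// Keeps only the two facts requested above: the last cycle each register is
// written and the first cycle it is read.
class RegisterEventView {
public:
  explicit RegisterEventView(const std::vector<TimedInstr> &Instrs) {
    for (const TimedInstr &I : Instrs) {
      for (unsigned R : I.Uses) {
        auto [It, Inserted] = FirstRead.try_emplace(R, I.Cycle);
        if (!Inserted && I.Cycle < It->second)
          It->second = I.Cycle;
      }
      for (const auto &[R, Lat] : I.Defs) {
        const int WriteCycle = I.Cycle + Lat;
        auto [It, Inserted] = LastWrite.try_emplace(R, WriteCycle);
        if (!Inserted && WriteCycle > It->second)
          It->second = WriteCycle;
      }
    }
  }

  std::optional<int> lastWrite(unsigned Reg) const {
    return lookup(LastWrite, Reg);
  }
  std::optional<int> firstRead(unsigned Reg) const {
    return lookup(FirstRead, Reg);
  }

private:
  std::map<unsigned, int> FirstRead, LastWrite;

  static std::optional<int> lookup(const std::map<unsigned, int> &M,
                                   unsigned Reg) {
    auto It = M.find(Reg);
    if (It == M.end())
      return std::nullopt;
    return It->second;
  }
};
```

Building one view from the top-fixed bundles and another from the bottom-fixed bundles would give the "top and bot part" from the list above.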
@@ -48,6 +48,10 @@ getSingleBlockLoopMBBs(const MachineFunction &MF);
/// Check if this block is a single block loop.
bool isSingleMBBLoop(const MachineBasicBlock *MBB);

/// Considering that MBB has a single predecessor that is a loop
/// and also layout predecessor, return it.
Actually, it can return a layout predecessor that is not a loop, a unique predecessor that is not a loop or a null pointer (which is also not a loop)
This was just a refactoring. Any free use is dangerous, so I added this comment. Maybe we should name it getLayoutPredecessor, and then we don't care whether it is a loop or not. In that case, it is no longer a loop utility function... Any suggestions?
Discussed offline: it is enough to assert isSingleMBBLoop.
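Here is a sketch of what the agreed assertion could look like, using a hypothetical Block type in place of MachineBasicBlock. Following the description of the helper in this thread, it takes the unique predecessor if there is one and the layout predecessor otherwise. It then asserts that the result is a single-block loop, instead of only documenting that precondition in a comment. The name getLoopPredecessor is illustrative only.

```cpp
#include <cassert>
#include <vector>

// Hypothetical block model standing in for MachineBasicBlock.
struct Block {
  std::vector<Block *> Preds;
  Block *LayoutPred = nullptr;  // block laid out immediately before this one
  bool SingleBlockLoop = false; // stand-in for isSingleMBBLoop()
};

// Returns the loop preceding MBB: its unique predecessor or, failing that,
// its layout predecessor. Asserts that the result is a single-block loop.
Block *getLoopPredecessor(const Block &MBB) {
  Block *Pred = nullptr;
  if (MBB.Preds.size() == 1) {
    Pred = MBB.Preds.front();
  } else {
    for (Block *P : MBB.Preds) {
      if (P == MBB.LayoutPred)
        Pred = P;
    }
  }
  assert(Pred && Pred->SingleBlockLoop && "expected a single-block loop");
  return Pred;
}
```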
llvm/test/CodeGen/AIE/aie2/schedule/postpipeliner/load-add-store-renamed.mir
for (const MachineOperand &MO : MI.all_uses()) {
  if (!MO.isReg() || !MO.isUse())
    continue;
  for (MCRegAliasIterator Ali(MO.getReg(), TRI, true); Ali.isValid();
Don't use the alias iterator here: because we populated RegisterToCycle with all aliases, we are creating false alias cases, like bmh0 aliasing to bml0. Even better is to not populate RegisterToCycle with aliases and use the alias iterator here, similar to what is done for bottom-up.
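A toy alias table shows the difference between the two strategies. The register names follow the example in the comment. The assumption that bml0 and bmh0 are the two halves of a wider bm0 is illustrative, as are the string-keyed maps; MCRegAliasIterator is not used. Storing only the defined register and walking aliases at query time avoids the false overlap. Storing every alias up front and also querying through aliases reports one.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>

// Hypothetical alias table: "bm0" overlaps its halves "bml0" and "bmh0", and
// the halves do not overlap each other. Each register aliases itself,
// mirroring an alias iterator with IncludeSelf = true.
const std::map<std::string, std::set<std::string>> Aliases = {
    {"bm0", {"bm0", "bml0", "bmh0"}},
    {"bml0", {"bml0", "bm0"}},
    {"bmh0", {"bmh0", "bm0"}},
};

struct RegToCycle {
  std::map<std::string, int> Cycle;

  // Recommended: record only the register that is actually defined.
  void recordDef(const std::string &Reg, int C) {
    auto [It, Inserted] = Cycle.try_emplace(Reg, C);
    if (!Inserted && C > It->second)
      It->second = C;
  }

  // What the review warns against: also recording every alias at def time.
  void recordDefWithAliases(const std::string &Reg, int C) {
    for (const std::string &A : Aliases.at(Reg))
      recordDef(A, C);
  }

  // Latest recorded cycle of any register overlapping Reg, or -1 if none.
  int query(const std::string &Reg) const {
    int Max = -1;
    for (const std::string &A : Aliases.at(Reg)) {
      auto It = Cycle.find(A);
      if (It != Cycle.end() && It->second > Max)
        Max = It->second;
    }
    return Max;
  }
};
```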
Only for free instructions.
This commit prepares to schedule top fixed bundles. We also create dedicated loop exits early, handling new blocks along with their corresponding block states.
If we have TopFixed instructions, we start top-down and switch to bottom-up once we have filled as many of the slots related to those instructions as possible. Special care is needed for instructions with a delay slot and for bottom-fixed instructions.
Branch updated from 4776f1d to cc37cc3
// into account MaxLatency.
for (SUnit &FixedSU : make_filter_range(DAG->SUnits, IsTopFixedSU)) {
  const MachineInstr &MI = *FixedSU.getInstr();
  if (const unsigned Latency = RAT.getMaxSrcOperandLatency(MI)) {
This if was wrongly copy-pasted. We just need maxLatency here.
Branch updated from cc37cc3 to 4168aba
Also add all related dependencies to have safety margins.
Branch updated from 4168aba to c940c2e
if (DedicatedExit == BB) {

  // Trim excedent empty bundles.
  while (BS.TopInsert.back().empty()) {
Note: With the latest changes to the post-pipeliner, it seems we can end up with a pipeline of 1 stage, essentially meaning the loop isn't pipelined. It's probably an oversight, and the loop should not have been considered as isPipelined(). But still, it makes the BS.TopInsert.back() code above crash.
I'd suggest understanding the root cause (you can check llvm/test/CodeGen/AIE/aie2/schedule/postpipeliner/crash.mir, which now crashes again 😆), and adding an assert in e.g. isPipelined() that if a loop is pipelined, it has non-empty top and bottom inserts.
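Here is a minimal sketch of the two suggestions above, using a hypothetical Bundle type (a vector of instruction ids). Calling back() on an empty std::vector is undefined behavior. A trimming loop that may consume every bundle therefore needs an emptiness check, and that is one plausible cause of the crash. The second function expresses the proposed invariant that a pipelined loop has non-empty top and bottom inserts, which could back an assert in isPipelined().

```cpp
#include <cassert>
#include <vector>

using Bundle = std::vector<int>; // hypothetical: instruction ids in a bundle

// Trim trailing empty bundles without ever calling back() on an empty vector.
void trimTrailingEmptyBundles(std::vector<Bundle> &Bundles) {
  while (!Bundles.empty() && Bundles.back().empty())
    Bundles.pop_back();
}

// Hypothetical invariant check: if a loop is considered pipelined, both its
// top and bottom inserts must be non-empty.
bool isPipelinedConsistent(bool IsPipelined,
                           const std::vector<Bundle> &TopInsert,
                           const std::vector<Bundle> &BotInsert) {
  return !IsPipelined || (!TopInsert.empty() && !BotInsert.empty());
}
```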
This PR adds support for Epilogue scheduling. In this way, it also adds:
I recommend reviewing the pull request in the order of commits, although some of them are closely related, so I plan to combine them in the future.
Ongoing work related to EmitFixedSUnits: we currently add all WAR and RAW dependencies related to the top-insert and the rest. However, the bot-insert handling could also be changed to use bot register events.