The new InstrSchedModel is easier to use than the instruction
itineraries. It will be used to model instruction latency and throughput
in modern Intel microarchitectures like Sandy Bridge.
InstrSchedModel should be able to coexist with instruction itinerary
classes, but for cleanliness we should switch the Atom processor model
to the new InstrSchedModel as well.
llvm-svn: 177122
See the Mips16ISetLowering.cpp patch to see a use of this.
For now now the extra code in Mips16ISetLowering.cpp is a nop but is
used for test purposes. Mips32 registers are setup and then removed and
then the Mips16 registers are setup.
Normally you need to add register classes and then call
computeRegisterProperties.
llvm-svn: 177120
This is a generic function (derived from PEI); moving it into
MachineFrameInfo eliminates a current redundancy between the ARM and AArch64
backends, and will allow it to be used by the PowerPC target code.
No functionality change intended.
llvm-svn: 177111
Add the current PEI register scavenger as a parameter to the
processFunctionBeforeFrameFinalized callback.
This change is necessary in order to allow the PowerPC target code to
set the register scavenger frame index after the save-area offset
adjustments performed by processFunctionBeforeFrameFinalized. Only
after these adjustments have been made is it possible to estimate
the size of the stack frame.
llvm-svn: 177108
Make requiresFrameIndexScavenging return true, and create virtual registers in
the spilling code instead of using the register scavenger directly. This makes
the target-level code simpler, and importantly, delays the scavenging until
after callee-saved register processing (which will be important for later
changes).
Also cleans up trackLivenessAfterRegAlloc (makes it inline in the header with
the other related functions). This makes it clear that it always returns true.
No functionality change intended.
llvm-svn: 177107
We used to add a spill slot for the register scavenger whenever the function
has a frame pointer. This is unnecessarily conservative: We may need the spill
slot for dynamic stack allocations, and functions with dynamic stack
allocations always have a FP, but we might also have a FP for other reasons
(such as the user explicitly disabling frame-pointer elimination), and we don't
necessarily need a spill slot for those functions.
The structsinregs test needed adjustment because it disables FP elimination.
llvm-svn: 177106
By terrible I mean we store/load from the stack.
This matters on PAQp8 in _Z5trainPsS_ii (which is inlined into Mixer::update)
where we decide to vectorize a loop with a VF of 8 resulting in a 25%
degradation on a cortex-a8.
LV: Found an estimated cost of 2 for VF 8 For instruction: icmp slt i32
LV: Found an estimated cost of 2 for VF 8 For instruction: select i1, i32, i32
The bug that tracks the CodeGen part is PR14868.
radar://13403975
llvm-svn: 177105
I don't think that it is otherwise clear how the overlapping offsets
are processed into distinct spill slots. Comment that this is done
in processFunctionBeforeFrameFinalized.
llvm-svn: 177094
This doesn't reset all of the target options within the TargetOptions
object. This is because some of those are ABI-specific and must be determined if
it's okay to change those on the fly.
llvm-svn: 176986
Increase the cost of v8/v16-i8 to v8/v16-i32 casts and truncates as the backend
currently lowers those using stack accesses.
This was responsible for a significant degradation on
MultiSource/Benchmarks/Trimaran/enc-pc1/enc-pc1
where we vectorize one loop to a vector factor of 16. After this patch we select
a vector factor of 4 which will generate reasonable code.
unsigned char cle[32];
void test(short c) {
unsigned short compte;
for (compte = 0; compte <= 31; compte++) {
cle[compte] = cle[compte] ^ c;
}
}
radar://13220512
llvm-svn: 176898
Now that only the register-scavenger version of the CR spilling code remains,
we no longer need the Darwin R2 hack. Darwin can use R0 as a spare register in
any case where the System V ABI uses it (R0 is special architecturally, and so
is reserved under all common ABIs).
A few test cases needed to be updated to reflect the register-allocation changes.
llvm-svn: 176868
This removes the -disable-ppc[32|64]-regscavenger options; the code
that uses the register scavenger has been working well (and has been the default)
for some time, and we don't need options to enable the old (broken) CR spilling code.
llvm-svn: 176865
The strlen+memcmp was hidden in a call to StringRef::operator==. We check if
there are any null bytes in the string upfront so we can simplify the comparison
Small speedup when compiling code with many function calls.
llvm-svn: 176766
fold selectcc (selectcc x, y, a, b, cc), b, a, b, setne ->
selectcc x, y, a, b, cc
Reviewed-by: Christian König <christian.koenig@amd.com>
llvm-svn: 176700
Two changes:
1. Prefer SET* instructions when possible
2. Handle the CND*_INT case with floating-point args
Reviewed-by: Christian König <christian.koenig@amd.com>
llvm-svn: 176699
LegalizeDAG.cpp uses the value of the comparison operands when checking
the legality of BR_CC, so DAGCombiner should do the same.
v2:
- Expand more BR_CC value types for NVPTX
v3:
- Expand correct BR_CC value types for Hexagon, Mips, and XCore.
llvm-svn: 176694
This is certainly not the last word on scheduling for this target, but
right now this allows a few apps to run / finish with radeonsi, most
notably UT2004 / Lightsmark. They fail to compile some shaders with the
default scheduler because it ends up trying to spill registers, which
we don't support yet (and which is probably a bad idea in general for
performance if it can be avoided).
NOTE: This is a candidate for the Mesa stable branch.
Reviewed-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Tom Stellard <thomas.stellard@amd.com>
llvm-svn: 176687
That can usually be lowered efficiently and is common in sandybridge code.
It would be nice to do this in DAGCombiner but we can't insert arbitrary
BUILD_VECTORs this late.
Fixes PR15462.
llvm-svn: 176634
v2: update CMakeLists.txt as well
Signed-off-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Tom Stellard <thomas.stellard@amd.com>
llvm-svn: 176626
v2: fix R600 regressions
Signed-off-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Tom Stellard <thomas.stellard@amd.com>
llvm-svn: 176624
Just encode the type as target specific attribute.
Signed-off-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Tom Stellard <thomas.stellard@amd.com>
llvm-svn: 176622
- Phi nodes should be replaced/updated after lowering CMOV into branch
because 'mainMBB' updating operand in Phi node is changed.
- Add EFLAGS in livein before lowering the 2nd CMOV. It's necessary as
we will reuse the EFLAGS generated before the 1st lowered CMOV, which
won't clobber EFLAGS. However, we need explicitly specify that.
- '-attr=-cmov' test case are added.
llvm-svn: 176598
- Clear 'mayStore' flag when loading from the atomic variable before the
spin loop
- Clear kill flag from one use to multiple use in registers forming the
address to that atomic variable
- don't use a physical register as live-in register in BB (neither entry
nor landing pad.) by copying it into virtual register
(patch by Cameron Zwarich)
llvm-svn: 176538
This calling convention was added just to handle functions which return vector
of floats. The fix committed in r165585 solves the problem.
llvm-svn: 176530
This patch adds many more functions to the target library information.
All of the functions being added were discovered while doing the migration
of the simplify-libcalls attribute annotation functionality to the
functionattrs pass. As a part of that work the attribute annotation logic
will query TLI to determine if a function should be annotated or not.
Signed-off-by: Meador Inge <meadori@codesourcery.com>
llvm-svn: 176514
This is a skeleton for a pre-RA MachineInstr scheduler strategy. Currently
it only tries to expose more parallelism for ALU instructions (this also
makes the distribution of GPR channels more uniform and increases the
chances of ALU instructions to be packed together in a single VLIW group).
Also it tries to reduce clause switching by grouping instruction of the
same kind (ALU/FETCH/CF) together.
Vincent Lejeune:
- Support for VLIW4 Slot assignement
- Recomputation of ScheduleDAG to get more parallelism opportunities
Tom Stellard:
- Fix assertion failure when trying to determine an instruction's slot
based on its destination register's class
- Fix some compiler warnings
Vincent Lejeune: [v2]
- Remove recomputation of ScheduleDAG (will be provided in a later patch)
- Improve estimation of an ALU clause size so that heuristic does not emit cf
instructions at the wrong position.
- Make schedule heuristic smarter using SUnit Depth
- Take constant read limitations into account
Vincent Lejeune: [v3]
- Fix some uninitialized values in ConstPair
- Add asserts to ensure an ALU slot is always populated
llvm-svn: 176498
Maintaining CONST_COPY Instructions until Pre Emit may prevent some ifcvt case
and taking them in account for scheduling is difficult for no real benefit.
llvm-svn: 176488
Reviewed-by: Tom Stellard <thomas.stellard at amd.com>
mayLoad complexify scheduling and does not bring any usefull info
as the location is not writeable at all.
llvm-svn: 176486
one-byte NOPs. If the processor actually executes those NOPs, as it sometimes
does with aligned bundling, this can have a performance impact. From my
micro-benchmarks run on my one machine, a 15-byte NOP followed by twelve
one-byte NOPs is about 20% worse than a 15 followed by a 12. This patch
changes NOP emission to emit as many 15-byte (the maximum) as possible followed
by at most one shorter NOP.
llvm-svn: 176464
* Only apply divide bypass optimization when not optimizing for size.
* Fixed bug caused by constant for 0 value of type Int32,
used dividend type to generate the constant instead.
* For atom x86-64 apply the divide bypass to use 16-bit divides instead of
64-bit divides when operand values are small enough.
* Added lit tests for 64-bit divide bypass.
Patch by Tyler Nowicki!
llvm-svn: 176442
The VDUP instruction source register doesn't allow a non-constant lane
index, so make sure we don't construct a ARM::VDUPLANE node asking it to
do so.
rdar://13328063
http://llvm.org/bugs/show_bug.cgi?id=13963
llvm-svn: 176413
This matters for example in following matrix multiply:
int **mmult(int rows, int cols, int **m1, int **m2, int **m3) {
int i, j, k, val;
for (i=0; i<rows; i++) {
for (j=0; j<cols; j++) {
val = 0;
for (k=0; k<cols; k++) {
val += m1[i][k] * m2[k][j];
}
m3[i][j] = val;
}
}
return(m3);
}
Taken from the test-suite benchmark Shootout.
We estimate the cost of the multiply to be 2 while we generate 9 instructions
for it and end up being quite a bit slower than the scalar version (48% on my
machine).
Also, properly differentiate between avx1 and avx2. On avx-1 we still split the
vector into 2 128bits and handle the subvector muls like above with 9
instructions.
Only on avx-2 will we have a cost of 9 for v4i64.
I changed the test case in test/Transforms/LoopVectorize/X86/avx1.ll to use an
add instead of a mul because with a mul we now no longer vectorize. I did
verify that the mul would be indeed more expensive when vectorized with 3
kernels:
for (i ...)
r += a[i] * 3;
for (i ...)
m1[i] = m1[i] * 3; // This matches the test case in avx1.ll
and a matrix multiply.
In each case the vectorized version was considerably slower.
radar://13304919
llvm-svn: 176403
This patch eliminates the need to emit a constant move instruction when this
pattern is matched:
(select (setgt a, Constant), T, F)
The pattern above effectively turns into this:
(conditional-move (setlt a, Constant + 1), F, T)
llvm-svn: 176384
- ISD::SHL/SRL/SRA must have either both scalar or both vector operands
but TLI.getShiftAmountTy() so far only return scalar type. As a
result, backend logic assuming that breaks.
- Rename the original TLI.getShiftAmountTy() to
TLI.getScalarShiftAmountTy() and re-define TLI.getShiftAmountTy() to
return target-specificed scalar type or the same vector type as the
1st operand.
- Fix most TICG logic assuming TLI.getShiftAmountTy() a simple scalar
type.
llvm-svn: 176364
dispatch code. As far as I can tell the thumb2 code is behaving as expected.
I was able to compile and run the associated test case for both arm and thumb1.
rdar://13066352
llvm-svn: 176363
v2: based on Michels patch, but now allows copying of all registers sizes.
Signed-off-by: Michel Dänzer <michel.daenzer@amd.com>
Signed-off-by: Christian König <christian.koenig@amd.com>
llvm-svn: 176346