Commit Graph

686 Commits

Author SHA1 Message Date
Nick Lewycky 7a63c3b389 Revert r212572 "improve BasicAA CS-CS queries", it causes PR20303.
llvm-svn: 213024
2014-07-15 00:53:38 +00:00
Hal Finkel 8ae0f8d618 Improve BasicAA CS-CS queries
BasicAA contains knowledge of certain intrinsics, such as memcpy and memset,
and uses that information to form more-accurate answers to CallSite vs. Loc
ModRef queries. Unfortunately, it did not use this information when answering
CallSite vs. CallSite queries.

Generically, when an intrinsic takes one or more pointers and the intrinsic is
marked only to read/write from its arguments, the offset/size is unknown. As a
result, the generic code that answers CallSite vs. CallSite (and CallSite vs.
Loc) queries in AA uses UnknownSize when forming Locs from an intrinsic's
arguments. While BasicAA's CallSite vs. Loc override could use more-accurate
size information for some intrinsics, it did not do the same for CallSite vs.
CallSite queries.

This change refactors the intrinsic-specific logic in BasicAA into a generic AA
query function: getArgLocation, which is overridden by BasicAA to supply the
intrinsic-specific knowledge, and used by AA's generic implementation. This
allows the intrinsic-specific knowledge to be used by both CallSite vs. Loc and
CallSite vs. CallSite queries, and simplifies the BasicAA implementation.

Currently, only one function, Mac's memset_pattern16, is handled by BasicAA
(all the rest are intrinsics). As a side-effect of this refactoring, BasicAA's
getModRefBehavior override now also returns OnlyAccessesArgumentPointees for
this function (which is an improvement).

llvm-svn: 212572
2014-07-08 23:16:49 +00:00
Andrea Di Biagio c8e8bda58f [CostModel][x86] Improved cost model for alternate shuffles.
This patch:
 1) Improves the cost model for x86 alternate shuffles (originally
added at revision 211339);
 2) Teaches the Cost Model Analysis pass how to analyze alternate shuffles.

Alternate shuffles are a special kind of blend; on x86, we can often
easily lowered alternate shuffled into single blend
instruction (depending on the subtarget features).

The existing cost model didn't take into account subtarget features.
Also, it had a couple of "dead" entries for vector types that are never
legal (example: on x86 types v2i32 and v2f32 are not legal; those are
always either promoted or widened to 128-bit vector types).

The new x86 cost model takes into account what target features we have
before returning the shuffle cost (i.e. the number of instructions
after the blend is lowered/expanded).

This patch also teaches the Cost Model Analysis how to identify and analyze
alternate shuffles (i.e. 'SK_Alternate' shufflevector instructions):
 - added function 'isAlternateVectorMask';
 - added some logic to check if an instruction is a alternate shuffle and, in
   case, call the target specific TTI to get the corresponding shuffle cost;
 - added a test to verify the cost model analysis on alternate shuffles.

llvm-svn: 212296
2014-07-03 22:24:18 +00:00
Alp Toker d3d017cf00 Reduce verbiage of lit.local.cfg files
We can just split targets_to_build in one place and make it immutable.

llvm-svn: 210496
2014-06-09 22:42:55 +00:00
Tobias Grosser 40ac10085a ScalarEvolution: Derive element size from the type of the loaded element
Before, we where looking at the size of the pointer type that specifies the
location from which to load the element. This did not make any sense at all.

This change fixes a bug in the delinearization where we failed to delinerize
certain load instructions.

llvm-svn: 210435
2014-06-08 19:21:20 +00:00
Sebastian Pop a6e5860513 remove constant terms
The delinearization is needed only to remove the non linearity induced by
expressions involving multiplications of parameters and induction variables.
There is no problem in dealing with constant times parameters, or constant times
an induction variable.

For this reason, the current patch discards all constant terms and multipliers
before running the delinearization algorithm on the terms. The only thing
remaining in the term expressions are parameters and multiply expressions of
parameters: these simplified term expressions are passed to the array shape
recognizer that will not recognize constant dimensions anymore: these will be
recognized as different strides in parametric subscripts.

The only important special case of a constant dimension is the size of elements.
Instead of relying on the delinearization to infer the size of an element,
compute the element size from the base address type. This is a much more precise
way of computing the element size than before, as we would have mixed together
the size of an element with the strides of the innermost dimension.

llvm-svn: 209691
2014-05-27 22:41:45 +00:00
Dinesh Dwivedi c0e6703360 Adding testcase for PR18886.
Differential Revision: http://reviews.llvm.org/D3837

llvm-svn: 209645
2014-05-27 06:44:25 +00:00
Tim Northover 3b0846e8f7 AArch64/ARM64: move ARM64 into AArch64's place
This commit starts with a "git mv ARM64 AArch64" and continues out
from there, renaming the C++ classes, intrinsics, and other
target-local objects for consistency.

"ARM64" test directories are also moved, and tests that began their
life in ARM64 use an arm64 triple, those from AArch64 use an aarch64
triple. Both should be equivalent though.

This finishes the AArch64 merge, and everyone should feel free to
continue committing as normal now.

llvm-svn: 209577
2014-05-24 12:50:23 +00:00
Andrew Trick b429083aff Test case comments. Fix sloppiness.
llvm-svn: 209551
2014-05-23 20:46:21 +00:00
Andrew Trick 839e30b2c0 Fix and improve SCEV ComputeBackedgeTankCount.
This is a follow-up to r209358: PR19799: Indvars miscompile due to an
incorrect max backedge taken count from SCEV.

That fix was incomplete as pointed out by Arnold and Michael Z. The
code was also too confusing. It needed a careful rewrite with more
unit tests. This version will also happen to optimize more cases.

<rdar://17005101> PR19799: Indvars miscompile...

llvm-svn: 209545
2014-05-23 19:47:13 +00:00
Andrew Trick e255359b57 Fix a bug in SCEV's backedge taken count computation from my prior fix in Jan.
This has to do with the trip count computation for loops with multiple
exits, which is quite subtle. Most passes just ask for a single trip
count number, so we must be conservative assuming any exit could be
taken.  Normally, we rely on the "exact" trip count, which was
correctly given as "unknown". However, SCEV also gives a "max"
back-edge taken count. The loops max BE taken count is conservatively
a maximum over the max of each exit's non-exiting iterations
count. Note that some exit tests can be skipped so the max loop
back-edge taken count can actually exceed the max non-exiting
iterations for some exits. However, when we know the loop *latch*
cannot be skipped, we can directly use its max taken count
disregarding other exits. I previously took the minimum here without
checking whether the other exit could be skipped. The correct, and
simpler thing to do here is just to directly use the loop latch's max
non-exiting iterations as the loops max back-edge count.

In the problematic test case, the first loop exit had a max of zero
non-exiting iterations, but could be skipped. The loop latch was known
not to be skipped but had max of one non-exiting iteration. We
incorrectly claimed the loop back-edge could be taken zero times, when
it is actually taken one time.

Fixes Loop %for.body.i: <multiple exits> Unpredictable backedge-taken count.
Loop %for.body.i: max backedge-taken count is 1.

llvm-svn: 209358
2014-05-22 00:37:03 +00:00
Filipe Cabecinhas 7b12d773e3 Added tests for the cost of lowering VSELECT instructions.
llvm-svn: 209045
2014-05-16 22:47:58 +00:00
Alp Toker beaca19c7c Fix typos
llvm-svn: 208839
2014-05-15 01:52:21 +00:00
Adam Nemet 63e4b30f79 [Test] Trim unnecessary .c and .cpp from config.suffix in lit.local.cfg
Tested by comparing make check VERBOSE=1 before and after to make sure
no tests are missed.  (VERBOSE=1 prints the list of tests.)

Only one test :( remains where .cpp is required:

tools/llvm-cov/range_based_for.cpp:// RUN: llvm-cov range_based_for.cpp | FileCheck %s --check-prefix=STDOUT

The topic was discussed in this thread:
http://lists.cs.uiuc.edu/pipermail/llvm-commits/Week-of-Mon-20140428/214905.html

llvm-svn: 208621
2014-05-12 19:57:31 +00:00
Sebastian Pop b1a548f72d do not assert when delinearization fails
llvm-svn: 208615
2014-05-12 19:01:53 +00:00
Sebastian Pop 45dd14bac2 add testcase for r208237: do not collect undef terms
llvm-svn: 208347
2014-05-08 18:38:58 +00:00
Sebastian Pop 448712b1a6 split delinearization pass in 3 steps
To compute the dimensions of the array in a unique way, we split the
delinearization analysis in three steps:

- find parametric terms in all memory access functions
- compute the array dimensions from the set of terms
- compute the delinearized access functions for each dimension

The first step is executed on all the memory access functions such that we
gather all the patterns in which an array is accessed. The second step reduces
all this information in a unique description of the sizes of the array. The
third step is delinearizing each memory access function following the common
description of the shape of the array computed in step 2.

This rewrite of the delinearization pass also solves a problem we had with the
previous implementation: because the previous algorithm was by induction on the
structure of the SCEV, it would not correctly recognize the shape of the array
when the memory access was not following the nesting of the loops: for example,
see polly/test/ScopInfo/multidim_only_ivs_3d_reverse.ll

; void foo(long n, long m, long o, double A[n][m][o]) {
;
;   for (long i = 0; i < n; i++)
;     for (long j = 0; j < m; j++)
;       for (long k = 0; k < o; k++)
;         A[i][k][j] = 1.0;

Starting with this patch we no longer delinearize access functions that do not
contain parameters, for example in test/Analysis/DependenceAnalysis/GCD.ll

;;  for (long int i = 0; i < 100; i++)
;;    for (long int j = 0; j < 100; j++) {
;;      A[2*i - 4*j] = i;
;;      *B++ = A[6*i + 8*j];

these accesses will not be delinearized as the upper bound of the loops are
constants, and their access functions do not contain SCEVUnknown parameters.

llvm-svn: 208232
2014-05-07 18:01:20 +00:00
Benjamin Kramer 1625bfccbe TTI: Estimate @llvm.fmuladd cost as fmul + fadd when FMA's aren't legal on the target.
llvm-svn: 208115
2014-05-06 18:36:23 +00:00
Duncan P. N. Exon Smith c5a3139ebd Reapply "blockfreq: Approximate irreducible control flow"
This reverts commit r207287, reapplying r207286.

I'm hoping that declaring an explicit struct and instantiating
`addBlockEdges()` directly works around the GCC crash from r207286.
This is a lot more boilerplate, though.

llvm-svn: 207438
2014-04-28 20:02:29 +00:00
Benjamin Kramer ce4b3fee72 X86TTI: Adjust sdiv cost now that we can lower it on plain SSE2.
Includes a fix for a horrible typo that caused all SDIV costs to be
slightly off :)

llvm-svn: 207371
2014-04-27 18:47:54 +00:00
Benjamin Kramer 7c3722724b X86TTI: i16/i32 vector div with a constant (splat) divisor are reasonably cheap now.
Turn vectorization back on.

llvm-svn: 207320
2014-04-26 14:53:05 +00:00
Duncan P. N. Exon Smith 42292ceaa9 Revert "blockfreq: Approximate irreducible control flow"
This reverts commit r207286.  It causes an ICE on the
cmake-llvm-x86_64-linux buildbot [1]:

    llvm/lib/Analysis/BlockFrequencyInfo.cpp: In lambda function:
    llvm/lib/Analysis/BlockFrequencyInfo.cpp:182:1: internal compiler error: in get_expr_operands, at tree-ssa-operands.c:1035

[1]: http://bb.pgr.jp/builders/cmake-llvm-x86_64-linux/builds/12093/steps/build_llvm/logs/stdio

llvm-svn: 207287
2014-04-25 23:16:58 +00:00
Duncan P. N. Exon Smith 384d0e8ad4 blockfreq: Approximate irreducible control flow
Previously, irreducible backedges were ignored.  With this commit,
irreducible SCCs are discovered on the fly, and modelled as loops with
multiple headers.

This approximation specifies the headers of irreducible sub-SCCs as its
entry blocks and all nodes that are targets of a backedge within it
(excluding backedges within true sub-loops).  Block frequency
calculations act as if we insert a new block that intercepts all the
edges to the headers.  All backedges and entries to the irreducible SCC
point to this imaginary block.  This imaginary block has an edge (with
even probability) to each header block.

The result is now reasonable enough that I've added a number of
testcases for irreducible control flow.  I've outlined in
`BlockFrequencyInfoImpl.h` ways to improve the approximation.

<rdar://problem/14292693>

llvm-svn: 207286
2014-04-25 23:08:57 +00:00
Duncan P. N. Exon Smith cb7d29d30c blockfreq: Only one mass distribution per node
Remove the concepts of "forward" and "general" mass distributions, which
was wrong.  The split might have made sense in an early version of the
algorithm, but it's definitely wrong now.

<rdar://problem/14292693>

llvm-svn: 207195
2014-04-25 04:38:43 +00:00
Duncan P. N. Exon Smith 84408d1fda blockfreq: Use better branch weights in multiexit test
The branch weights were even before.  Make them different.

<rdar://problem/14292693>

llvm-svn: 207193
2014-04-25 04:38:37 +00:00
Duncan P. N. Exon Smith 58c8948a0c blockfreq: Clean up irreducible testcases
Strip irreducible testcases to pure control flow.  The function calls
made the branch weights more believable but cluttered it up a lot.
There isn't going to be any constant analysis here, so just use dumb
branch logic to clarify the important parts.

<rdar://problem/14292693>

llvm-svn: 207192
2014-04-25 04:38:35 +00:00
Duncan P. N. Exon Smith b3380ea60a blockfreq: Skip irreducible backedges inside functions
The branch that skips irreducible backedges was only active when
propagating mass at the top-level.  In particular, when propagating mass
through a loop recognized by `LoopInfo` with irreducible control flow
inside, irreducible backedges would not be skipped.

Not sure where that idea came from, but the result was that mass was
lost until after loop exit.  Added a testcase that covers this case.

llvm-svn: 206860
2014-04-22 03:31:53 +00:00
Duncan P. N. Exon Smith 10be9a8868 Reapply "blockfreq: Rewrite BlockFrequencyInfoImpl"
This reverts commit r206707, reapplying r206704.  The preceding commit
to CalcSpillWeights should have sorted out the failing buildbots.

<rdar://problem/14292693>

llvm-svn: 206766
2014-04-21 17:57:07 +00:00
Duncan P. N. Exon Smith e63327e967 Revert "blockfreq: Rewrite BlockFrequencyInfoImpl"
This reverts commit r206704, as expected.

llvm-svn: 206707
2014-04-19 22:46:00 +00:00
Duncan P. N. Exon Smith 875ddfac75 Reapply "blockfreq: Rewrite BlockFrequencyInfoImpl"
This reverts commit r206677, reapplying my BlockFrequencyInfo rewrite.

I've done a careful audit, added some asserts, and fixed a couple of
bugs (unfortunately, they were in unlikely code paths).  There's a small
chance that this will appease the failing bots [1][2].  (If so, great!)

If not, I have a follow-up commit ready that will temporarily add
-debug-only=block-freq to the two failing tests, allowing me to compare
the code path between what the failing bots and what my machines (and
the rest of the bots) are doing.  Once I've triggered those builds, I'll
revert both commits so the bots go green again.

[1]: http://bb.pgr.jp/builders/ninja-x64-msvc-RA-centos6/builds/1816
[2]: http://llvm-amd64.freebsd.your.org/b/builders/clang-i386-freebsd/builds/18445

<rdar://problem/14292693>

llvm-svn: 206704
2014-04-19 22:34:26 +00:00
Duncan P. N. Exon Smith 76b813619a Revert "blockfreq: Rewrite BlockFrequencyInfoImpl" (#2)
This reverts commit r206666, as planned.

Still stumped on why the bots are failing.  Sanitizer bots haven't
turned anything up.  If anyone can help me debug either of the failures
(referenced in r206666) I'll owe them a beer.  (In the meantime, I'll be
auditing my patch for undefined behaviour.)

llvm-svn: 206677
2014-04-19 00:42:46 +00:00
Duncan P. N. Exon Smith b3caf3646f Reapply "blockfreq: Rewrite BlockFrequencyInfoImpl" (#2)
This reverts commit r206628, reapplying r206622 (and r206626).

Two tests are failing only on buildbots [1][2]: i.e., I can't reproduce
on Darwin, and Chandler can't reproduce on Linux.  Asan and valgrind
don't tell us anything, but we're hoping the msan bot will catch it.

So, I'm applying this again to get more feedback from the bots.  I'll
leave it in long enough to trigger builds in at least the sanitizer
buildbots (it was failing for reasons unrelated to my commit last time
it was in), and hopefully a few others.... and then I expect to revert a
third time.

[1]: http://bb.pgr.jp/builders/ninja-x64-msvc-RA-centos6/builds/1816
[2]: http://llvm-amd64.freebsd.your.org/b/builders/clang-i386-freebsd/builds/18445

llvm-svn: 206666
2014-04-18 22:30:03 +00:00
Duncan P. N. Exon Smith 0842ff36a6 Revert "blockfreq: Rewrite BlockFrequencyInfoImpl" (#2)
This reverts commit r206622 and the MSVC fixup in r206626.

Apparently the remotely failing tests are still failing, despite my
attempt to fix the nondeterminism in r206621.

llvm-svn: 206628
2014-04-18 17:56:08 +00:00
Duncan P. N. Exon Smith f8361d127a Reapply "blockfreq: Rewrite BlockFrequencyInfoImpl"
This reverts commit r206556, effectively reapplying commit r206548 and
its fixups in r206549 and r206550.

In an intervening commit I've added target triples to the tests that
were failing remotely [1] (but passing locally).  I'm hoping the mystery
is solved?  I'll revert this again if the tests are still failing
remotely.

[1]: http://bb.pgr.jp/builders/ninja-x64-msvc-RA-centos6/builds/1816

llvm-svn: 206622
2014-04-18 17:22:25 +00:00
Chandler Carruth 18eadd9260 [LCG] Add support for building persistent and connected SCCs to the
LazyCallGraph. This is the start of the whole point of this different
abstraction, but it is just the initial bits. Here is a run-down of
what's going on here. I'm planning to incorporate some (or all) of this
into comments going forward, hopefully with better editing and wording.
=]

The crux of the problem with the traditional way of building SCCs is
that they are ephemeral. The new pass manager however really needs the
ability to associate analysis passes and results of analysis passes with
SCCs in order to expose these analysis passes to the SCC passes. Making
this work is kind-of the whole point of the new pass manager. =]

So, when we're building SCCs for the call graph, we actually want to
build persistent nodes that stick around and can be reasoned about
later. We'd also like the ability to walk the SCC graph in more complex
ways than just the traditional postorder traversal of the current CGSCC
walk. That means that in addition to being persistent, the SCCs need to
be connected into a useful graph structure.

However, we still want the SCCs to be formed lazily where possible.

These constraints are quite hard to satisfy with the SCC iterator. Also,
using that would bypass our ability to actually add data to the nodes of
the call graph to facilite implementing the Tarjan walk. So I've
re-implemented things in a more direct and embedded way. This
immediately makes it easy to get the persistence and connectivity
correct, and it also allows leveraging the existing nodes to simplify
the algorithm. I've worked somewhat to make this implementation more
closely follow the traditional paper's nomenclature and strategy,
although it is still a bit obtuse because it isn't recursive, using
an explicit stack and a tail call instead, and it is interruptable,
resuming each time we need another SCC.

The other tricky bit here, and what actually took almost all the time
and trials and errors I spent building this, is exactly *what* graph
structure to build for the SCCs. The naive thing to build is the call
graph in its newly acyclic form. I wrote about 4 versions of this which
did precisely this. Inevitably, when I experimented with them across
various use cases, they became incredibly awkward. It was all
implementable, but it felt like a complete wrong fit. Square peg, round
hole. There were two overriding aspects that pushed me in a different
direction:

1) We want to discover the SCC graph in a postorder fashion. That means
   the root node will be the *last* node we find. Using the call-SCC DAG
   as the graph structure of the SCCs results in an orphaned graph until
   we discover a root.

2) We will eventually want to walk the SCC graph in parallel, exploring
   distinct sub-graphs independently, and synchronizing at merge points.
   This again is not helped by the call-SCC DAG structure.

The structure which, quite surprisingly, ended up being completely
natural to use is the *inverse* of the call-SCC DAG. We add the leaf
SCCs to the graph as "roots", and have edges to the caller SCCs. Once
I switched to building this structure, everything just fell into place
elegantly.

Aside from general cleanups (there are FIXMEs and too few comments
overall) that are still needed, the other missing piece of this is
support for iterating across levels of the SCC graph. These will become
useful for implementing #2, but they aren't an immediate priority.

Once SCCs are in good shape, I'll be working on adding mutation support
for incremental updates and adding the pass manager that this analysis
enables.

llvm-svn: 206581
2014-04-18 10:50:32 +00:00
Duncan P. N. Exon Smith e576167df8 Revert "blockfreq: Rewrite BlockFrequencyInfoImpl"
This reverts commits r206548, r206549 and r206549.

There are some unit tests failing that aren't failing locally [1], so
reverting until I have time to investigate.

[1]: http://bb.pgr.jp/builders/ninja-x64-msvc-RA-centos6/builds/1816

llvm-svn: 206556
2014-04-18 02:17:43 +00:00
Duncan P. N. Exon Smith 12e68e1733 blockfreq: Rewrite BlockFrequencyInfoImpl
Rewrite the shared implementation of BlockFrequencyInfo and
MachineBlockFrequencyInfo entirely.

The old implementation had a fundamental flaw:  precision losses from
nested loops (or very wide branches) compounded past loop exits (and
convergence points).

The @nested_loops testcase at the end of
test/Analysis/BlockFrequencyAnalysis/basic.ll is motivating.  This
function has three nested loops, with branch weights in the loop headers
of 1:4000 (exit:continue).  The old analysis gives non-sensical results:

    Printing analysis 'Block Frequency Analysis' for function 'nested_loops':
    ---- Block Freqs ----
     entry = 1.0
     for.cond1.preheader = 1.00103
     for.cond4.preheader = 5.5222
     for.body6 = 18095.19995
     for.inc8 = 4.52264
     for.inc11 = 0.00109
     for.end13 = 0.0

The new analysis gives correct results:

    Printing analysis 'Block Frequency Analysis' for function 'nested_loops':
    block-frequency-info: nested_loops
     - entry: float = 1.0, int = 8
     - for.cond1.preheader: float = 4001.0, int = 32007
     - for.cond4.preheader: float = 16008001.0, int = 128064007
     - for.body6: float = 64048012001.0, int = 512384096007
     - for.inc8: float = 16008001.0, int = 128064007
     - for.inc11: float = 4001.0, int = 32007
     - for.end13: float = 1.0, int = 8

Most importantly, the frequency leaving each loop matches the frequency
entering it.

The new algorithm leverages BlockMass and PositiveFloat to maintain
precision, separates "probability mass distribution" from "loop
scaling", and uses dithering to eliminate probability mass loss.  I have
unit tests for these types out of tree, but it was decided in the review
to make the classes private to BlockFrequencyInfoImpl, and try to shrink
them (or remove them entirely) in follow-up commits.

The new algorithm should generally have a complexity advantage over the
old.  The previous algorithm was quadratic in the worst case.  The new
algorithm is still worst-case quadratic in the presence of irreducible
control flow, but it's linear without it.

The key difference between the old algorithm and the new is that control
flow within a loop is evaluated separately from control flow outside,
limiting propagation of precision problems and allowing loop scale to be
calculated independently of mass distribution.  Loops are visited
bottom-up, their loop scales are calculated, and they are replaced by
pseudo-nodes.  Mass is then distributed through the function, which is
now a DAG.  Finally, loops are revisited top-down to multiply through
the loop scales and the masses distributed to pseudo nodes.

There are some remaining flaws.

  - Irreducible control flow isn't modelled correctly.  LoopInfo and
    MachineLoopInfo ignore irreducible edges, so this algorithm will
    fail to scale accordingly.  There's a note in the class
    documentation about how to get closer.  See also the comments in
    test/Analysis/BlockFrequencyInfo/irreducible.ll.

  - Loop scale is limited to 4096 per loop (2^12) to avoid exhausting
    the 64-bit integer precision used downstream.

  - The "bias" calculation proposed on llvmdev is *not* incorporated
    here.  This will be added in a follow-up commit, once comments from
    this review have been handled.

llvm-svn: 206548
2014-04-18 01:57:45 +00:00
Akira Hatanaka 5638b89944 Fix a bug in which BranchProbabilityInfo wasn't setting branch weights of basic blocks inside loops correctly.
Previously, BranchProbabilityInfo::calcLoopBranchHeuristics would determine the weights of basic blocks inside loops even when it didn't have enough information to estimate the branch probabilities correctly. This patch fixes the function to exit early if it doesn't see any exit edges or back edges and let the later heuristics determine the weights.

This fixes PR18705 and <rdar://problem/15991090>.

Differential Revision: http://reviews.llvm.org/D3363

llvm-svn: 206194
2014-04-14 16:56:19 +00:00
Hal Finkel 56bf297e3a Don't assert in BasicTTI::getMemoryOpCost for non-simple types
BasicTTI::getMemoryOpCost must explicitly check for non-simple types; setting
AllowUnknown=true with TLI->getSimpleValueType is not sufficient because, for
example, non-power-of-two vector types return non-simple EVTs (not MVT::Other).

llvm-svn: 206150
2014-04-14 05:59:09 +00:00
Sebastian Pop b5b84e0963 in findGCD of multiply expr return the gcd
we used to return 1 instead of the gcd

llvm-svn: 205800
2014-04-08 21:21:05 +00:00
Hal Finkel de0b413ec0 [PowerPC] Adjust load/store costs in PPCTTI
This provides more realistic costs for the insert/extractelement instructions
(which are load/store pairs), accounts for the cheap unaligned Altivec load
sequence, and for unaligned VSX load/stores.

Bad news:
MultiSource/Applications/sgefa/sgefa - 35% slowdown (this will require more investigation)
SingleSource/Benchmarks/McGill/queens - 20% slowdown (we no longer vectorize this, but it was a constant store that was scalarized)
MultiSource/Benchmarks/FreeBench/pcompress2/pcompress2 - 2% slowdown

Good news:
SingleSource/Benchmarks/Shootout/ary3 - 54% speedup
SingleSource/Benchmarks/Shootout-C++/ary - 40% speedup
MultiSource/Benchmarks/Ptrdist/ks/ks - 35% speedup
MultiSource/Benchmarks/FreeBench/neural/neural - 30% speedup
MultiSource/Benchmarks/TSVC/Symbolics-flt/Symbolics-flt - 20% speedup

Unfortunately, estimating the costs of the stack-based scalarization sequences
is hard, and adjusting these costs is like a game of whac-a-mole :( I'll
revisit this again after we have better codegen for vector extloads and
truncstores and unaligned load/stores.

llvm-svn: 205658
2014-04-04 23:51:18 +00:00
Hal Finkel 6fd19ab35e Account for scalarization costs in BasicTTI::getMemoryOpCost for extending vector loads
When a vector type legalizes to a larger vector type, and the target does not
support the associated extending load (or truncating store), then legalization
will scalarize the load (or store) resulting in an associated scalarization
cost.  BasicTTI::getMemoryOpCost needs to account for this.

Between this, and r205487, PowerPC on the P7 with VSX enabled shows:

MultiSource/Benchmarks/PAQ8p/paq8p: 43% speedup
SingleSource/Benchmarks/BenchmarkGame/puzzle: 51% speedup
SingleSource/UnitTests/Vectorizer/gcc-loops 28% speedup

(some of these are new; some of these, such as PAQ8p, just reverse regressions
that VSX support would trigger)

llvm-svn: 205495
2014-04-03 00:53:59 +00:00
Hal Finkel 55312debee Fix multi-register costs in BasicTTI::getCastInstrCost
For an cast (extension, etc.), the currently logic predicts a low cost if the
associated operation (keyed on the destination type) is legal (or promoted).
This is not true when the number of values required to legalize the type is
changing. For example, <8 x i16> being sign extended by <8 x i32> is not
generically cheap on PPC with VSX, even though sign extension to v4i32 is
legal, because two output v4i32 values are required compared to the single
v8i16 input value, and without custom logic in the target, this conversion will
scalarize.

llvm-svn: 205487
2014-04-02 23:18:54 +00:00
Tim Northover 00ed9964c6 ARM64: initial backend import
This adds a second implementation of the AArch64 architecture to LLVM,
accessible in parallel via the "arm64" triple. The plan over the
coming weeks & months is to merge the two into a single backend,
during which time thorough code review should naturally occur.

Everything will be easier with the target in-tree though, hence this
commit.

llvm-svn: 205090
2014-03-29 10:18:08 +00:00
Arnold Schwaighofer 1a444489e9 PR15967 Fix in basicaa for faulty returning no alias.
This commit consist of two parts.
The first part fix the PR15967. The wrong conclusion was made when the MaxLookup
limit was reached. The fix introduce a out parameter (MaxLookupReached) to
DecomposeGEPExpression that the function aliasGEP can act upon.
The second part is introducing the constant MaxLookupSearchDepth to make sure
that DecomposeGEPExpression and GetUnderlyingObject use the same search depth.
This is a small cleanup to clarify the original algorithm.

Patch by Karl-Johan Karlsson!

llvm-svn: 204859
2014-03-26 21:30:19 +00:00
Benjamin Kramer e75eaca32f ScalarEvolution: Compute exit counts for loops with a power-of-2 step.
If we have a loop of the form
for (unsigned n = 0; n != (k & -32); n += 32) {}
then we know that n is always divisible by 32 and the loop must
terminate. Even if we have a condition where the loop counter will
overflow it'll always hold this invariant.

PR19183. Our loop vectorizer creates this pattern and it's also
occasionally formed by loop counters derived from pointers.

llvm-svn: 204728
2014-03-25 16:25:12 +00:00
Rafael Espindola f3336bc1d5 Reject alias to undefined symbols in the verifier.
On ELF and COFF an alias is just another name for a position in the file.
There is no way to refer to a position in another file, so an alias to
undefined is meaningless.

MachO currently doesn't support aliases. The spec has a N_INDR, which when
implemented will have a different set of restrictions. Adding support for
it shouldn't be harder than any other IR extension.

For now, having the IR represent what is actually possible with current
tools makes it easier to fix the design of GlobalAlias.

llvm-svn: 203705
2014-03-12 20:15:49 +00:00
Raul E. Silvera ce376c0fcb When analyzing vectors of element type that require legalization,
the legalization cost must be included to get an accurate
estimation of the total cost of the scalarized vector.
The inaccurate cost triggered unprofitable SLP vectorization on
32-bit X86.

Summary:
Include legalization overhead when computing scalarization cost

Reviewers: hfinkel, nadav

CC: chandlerc, rnk, llvm-commits

Differential Revision: http://llvm-reviews.chandlerc.com/D2992

llvm-svn: 203509
2014-03-10 22:59:13 +00:00
Matt Arsenault a236ea551c Teach lint about address spaces
llvm-svn: 203132
2014-03-06 17:33:55 +00:00
Sebastian Pop f05ba89bd3 add -da-delinearize runs and checks to MIV testcases
llvm-svn: 201869
2014-02-21 18:15:18 +00:00
Nico Rieck 5ba5226ab9 Add extra CHECK prefix to tests with explicit prefix
These tests mistakenly assume that CHECK is still available even if an
explicit prefix is specified.

llvm-svn: 201492
2014-02-16 13:28:15 +00:00
Nico Rieck 35a237d4ed Actually call FileCheck in tests
llvm-svn: 201491
2014-02-16 13:27:39 +00:00
Nico Rieck 7647178738 Fix broken CHECK lines
llvm-svn: 201479
2014-02-16 07:31:05 +00:00
Andrea Di Biagio b7882b3bd1 [Vectorizer] Add a new 'OperandValueKind' in TargetTransformInfo called
'OK_NonUniformConstValue' to identify operands which are constants but
not constant splats.

The cost model now allows returning 'OK_NonUniformConstValue'
for non splat operands that are instances of ConstantVector or
ConstantDataVector.

With this change, targets are now able to compute different costs
for instructions with non-uniform constant operands.
For example, On X86 the cost of a vector shift may vary depending on whether
the second operand is a uniform or non-uniform constant.

This patch applies the following changes:
 - The cost model computation now takes into account non-uniform constants;
 - The cost of vector shift instructions has been improved in
   X86TargetTransformInfo analysis pass;
 - BBVectorize, SLPVectorizer and LoopVectorize now know how to distinguish
   between non-uniform and uniform constant operands.

Added a new test to verify that the output of opt
'-cost-model -analyze' is valid in the following configurations: SSE2,
SSE4.1, AVX, AVX2.

llvm-svn: 201272
2014-02-12 23:43:47 +00:00
Craig Topper 522af29465 Test case I forgot to 'add' for r201126.
llvm-svn: 201207
2014-02-12 03:58:47 +00:00
Benjamin Kramer 5a188549ad ScalarEvolution: Analyze trip count of loops with a switch guarding the exit.
llvm-svn: 201159
2014-02-11 15:44:32 +00:00
Tim Northover f0e21616f3 X86: add costs for 64-bit vector ext/trunc & rebalance
The most important part of this is probably adding any cost at all for
operations like zext <8 x i8> to <8 x i32>. Before they were being
recorded as extremely costly (24, I believe) which made LLVM fall back
on a 4-wide vectorisation of a loop.

It also rebalances the values for sext, zext and trunc. Lacking any
other sane metric that might work across CPU microarchitectures I went
for instructions. This seems to be in reasonable accord with the rest
of the table (sitofp, ...) though no doubt at least one value is
sub-optimal for some bizarre reason.

Finally, separate AVX and AVX2 values are provided where appropriate.
The CodeGen is quite different in many cases.

rdar://problem/15981990

llvm-svn: 200928
2014-02-06 18:18:36 +00:00
Chandler Carruth bf71a34eb9 [PM] Add a new "lazy" call graph analysis pass for the new pass manager.
The primary motivation for this pass is to separate the call graph
analysis used by the new pass manager's CGSCC pass management from the
existing call graph analysis pass. That analysis pass is (somewhat
unfortunately) over-constrained by the existing CallGraphSCCPassManager
requirements. Those requirements make it *really* hard to cleanly layer
the needed functionality for the new pass manager on top of the existing
analysis.

However, there are also a bunch of things that the pass manager would
specifically benefit from doing differently from the existing call graph
analysis, and this new implementation tries to address several of them:

- Be lazy about scanning function definitions. The existing pass eagerly
  scans the entire module to build the initial graph. This new pass is
  significantly more lazy, and I plan to push this even further to
  maximize locality during CGSCC walks.
- Don't use a single synthetic node to partition functions with an
  indirect call from functions whose address is taken. This node creates
  a huge choke-point which would preclude good parallelization across
  the fanout of the SCC graph when we got to the point of looking at
  such changes to LLVM.
- Use a memory dense and lightweight representation of the call graph
  rather than value handles and tracking call instructions. This will
  require explicit update calls instead of some updates working
  transparently, but should end up being significantly more efficient.
  The explicit update calls ended up being needed in many cases for the
  existing call graph so we don't really lose anything.
- Doesn't explicitly model SCCs and thus doesn't provide an "identity"
  for an SCC which is stable across updates. This is essential for the
  new pass manager to work correctly.
- Only form the graph necessary for traversing all of the functions in
  an SCC friendly order. This is a much simpler graph structure and
  should be more memory dense. It does limit the ways in which it is
  appropriate to use this analysis. I wish I had a better name than
  "call graph". I've commented extensively this aspect.

This is still very much a WIP, in fact it is really just the initial
bits. But it is about the fourth version of the initial bits that I've
implemented with each of the others running into really frustrating
problms. This looks like it will actually work and I'd like to split the
actual complexity across commits for the sake of my reviewers. =] The
rest of the implementation along with lots of wiring will follow
somewhat more rapidly now that there is a good path forward.

Naturally, this doesn't impact any of the existing optimizer. This code
is specific to the new pass manager.

A bunch of thanks are deserved for the various folks that have helped
with the design of this, especially Nick Lewycky who actually sat with
me to go through the fundamentals of the final version here.

llvm-svn: 200903
2014-02-06 04:37:03 +00:00
Nick Lewycky 629199ccb3 Fix crasher introduced in r200203 and caught by a libc++ buildbot. Don't assume that getMulExpr returns a SCEVMulExpr, it may have simplified it to something else!
llvm-svn: 200210
2014-01-27 10:47:44 +00:00
Nick Lewycky 31eaca5513 Teach SCEV to handle more cases of 'and X, CST', specifically where CST is any number of contiguous 1 bits in a row, with any number of leading and trailing 0 bits.
Unfortunately, this in turn led to some lower quality SCEVs due to some different paths through expression simplification, so add getUDivExactExpr and use it. This fixes all instances of the problems that I found, but we can make that function smarter as necessary.

Merge test "xor-and.ll" into "and-xor.ll" since I needed to update it anyways. Test 'nsw-offset.ll' analyzes a little deeper, %n now gets a scev in terms of %no instead of a SCEVUnknown.

llvm-svn: 200203
2014-01-27 10:04:03 +00:00
Alp Toker cb40291100 Fix known typos
Sweep the codebase for common typos. Includes some changes to visible function
names that were misspelt.

llvm-svn: 200018
2014-01-24 17:20:08 +00:00
Arnold Schwaighofer e3ac099726 BasicAA: We need to check both access sizes when comparing a gep and an
underlying object of unknown size.

Fixes PR18460.

llvm-svn: 199351
2014-01-16 04:53:18 +00:00
Benjamin Kramer c10563d14e Fix broken CHECK lines.
llvm-svn: 199016
2014-01-11 21:06:00 +00:00
Stepan Dyatkovskiy 431993b57b Fixed old typo in ScalarEvolution, that caused wrong SCEVs zext operation.
Detailed description is here:
http://llvm.org/bugs/show_bug.cgi?id=18000#c16

For participation in bugfix process special thanks to David Wiberg.

llvm-svn: 198863
2014-01-09 12:26:12 +00:00
Arnold Schwaighofer 833a82ecde BasicAA: Use reachabilty instead of dominance for checking value equality in phi
cycles

This allows the value equality check to work even if we don't have a dominator
tree. Also add some more comments.

I was worried about compile time impacts and did not implement reachability but
used the dominance check in the initial patch. The trade-off was that the
dominator tree was required.
The llvm utility function isPotentiallyReachable cuts off the recursive search
after 32 visits. Testing did not show any compile time regressions showing my
worries unjustfied.

No compile time or performance regressions at O3 -flto -mavx on test-suite +
externals.

Addresses review comments from r198290.

llvm-svn: 198400
2014-01-03 05:47:03 +00:00
Arnold Schwaighofer 0d10a9d579 BasicAA: Fix value equality and phi cycles
When there are cycles in the value graph we have to be careful interpreting
"Value*" identity as "value" equivalence. We interpret the value of a phi node
as the value of its operands.
When we check for value equivalence now we make sure that the "Value*" dominates
all cycles (phis).

%0 = phi [%noaliasval, %addr2]
%l = load %ptr
%addr1 = gep @a, 0, %l
%addr2 = gep @a, 0, (%l + 1)
store %ptr ...

Before this patch we would return NoAlias for (%0, %addr1) which is wrong
because the value of the load is from different iterations of the loop.

Tested on x86_64 -mavx at O3 and O3 -flto with no performance or compile time
regressions.

PR18068
radar://15653794

llvm-svn: 198290
2014-01-02 03:31:36 +00:00
Matt Arsenault a8fe22baba Use correct size for address space in BasicAA.
The tests just hit this with a different sized
address space since I haven't figured out how
to use this to break it.

I thought I committed this a long time ago,
and I'm not sure why missing this hasn't caused
any problems.

llvm-svn: 194903
2013-11-16 00:36:43 +00:00
Sebastian Pop a1cc34b981 improve dependence analysis testcases
print the name of the function on which the dependence analysis is performed
such that changes to the testcase are easier to review.

llvm-svn: 194528
2013-11-12 22:47:30 +00:00
Sebastian Pop c62c679c1b delinearization of arrays
llvm-svn: 194527
2013-11-12 22:47:20 +00:00
Andrew Trick 34e2f0c4ea Rewrite SCEV's backedge taken count computation.
Patch by Michele Scandale!

Rewrite of the functions used to compute the backedge taken count of a
loop on LT and GT comparisons.

I decided to split the handling of LT and GT cases becasue the trick
"a > b == -a < -b" in some cases prevents the trip count computation
due to the multiplication by -1 on the two operands of the
comparison. This issue comes from the conservative computation of
value range of SCEVs: taking the negative SCEV of an expression that
have a small positive range (e.g. [0,31]), we would have a SCEV with a
fullset as value range.

Indeed, in the new rewritten function I tried to better handle the
maximum backedge taken count computation when MAX/MIN expression are
used to handle the cases where no entry guard is found.

Some test have been modified in order to check the new value correctly
(I manually check them and reasoning on possible overflow the new
values seem correct).

I finally added a new test case related to the multiplication by -1
issue on GT comparisons.

llvm-svn: 194116
2013-11-06 02:08:26 +00:00
Hal Finkel 4d94930bcb Consider (x == -1) unlikely in BranchProbabilityInfo
This adds another heuristic to BPI, similar to the existing heuristic that
considers (x == 0) unlikely to be true. As suggested in the PACT'98 paper by
Deitrich, Cheng, and Hwu, -1 is often used to indicate an invalid index, and
equality comparisons with -1 are also unlikely to succeed. Local
experimentation supports this hypothesis: This yields a 1-2% speedup in the
test-suite sqlite benchmark on the PPC A2 core, with no significant
regressions.

llvm-svn: 193855
2013-11-01 10:58:22 +00:00
Benjamin Kramer 6094f30da2 SCEV: Make the final add of an inbounds GEP nuw if we know that the index is positive.
We can't do this for the general case as saying a GEP with a negative index
doesn't have unsigned wrap isn't valid for negative indices.
  %gep = getelementptr inbounds i32* %p, i64 -1

But an inbounds GEP cannot run past the end of address space. So we check for
the very common case of a positive index and make GEPs derived from that NUW.
Together with Andy's recent non-unit stride work this lets us analyze loops
like

  void foo3(int *a, int *b) {
    for (; a < b; a++) {}
  }

PR12375, PR12376.

Differential Revision: http://llvm-reviews.chandlerc.com/D2033

llvm-svn: 193514
2013-10-28 07:30:06 +00:00
Shuxin Yang 2e1890e18b Revert r193251 : Use address-taken to disambiguate global variable and indirect memops.
llvm-svn: 193489
2013-10-27 03:08:44 +00:00
Benjamin Kramer 0ccab2d66c X86: Custom lower sext v16i8 to v16i16, and the corresponding truncate.
Also update the cost model.

llvm-svn: 193270
2013-10-23 21:06:07 +00:00
Shuxin Yang e4fb375995 Use address-taken to disambiguate global variable and indirect memops.
Major steps include:
 1). introduces a not-addr-taken bit-field in GlobalVariable
 2). GlobalOpt pass sets "not-address-taken" if it proves a global varirable 
    dosen't have its address taken.
 3). AA use this info for disambiguation. 

llvm-svn: 193251
2013-10-23 17:28:19 +00:00
Manman Ren a1ce9891ac Simplify testing case (Thanks Rafael for the testing case).
llvm-svn: 193177
2013-10-22 18:15:50 +00:00
Manman Ren f3850c0807 TBAA: fix PR17620.
We can have a struct type with a single field and the field does not start
with 0. In that case, we should correctly update the offset.

llvm-svn: 193137
2013-10-22 01:40:25 +00:00
Matt Arsenault be18b8a3ca Fix creating bitcasts between address spaces in SCEV.
The test before wasn't successfully testing this
since it was missing the datalayout piece to change
the size of the second address space.

llvm-svn: 193102
2013-10-21 18:41:10 +00:00
Andrew Trick 768b917dc8 SCEV should use NSW to get trip count for positive nonunit stride loops.
SCEV currently fails to compute loop counts for nonunit stride
loops. This comes up frequently. It prevents loop optimization and
forces vectorization to insert extra loop checks.

For example:
void foo(int n, int *x) {
 for (int i = 0; i < n; i += 3) {
   x[i] = i;
   x[i+1] = i+1;
   x[i+2] = i+2;
 }
}

We need to properly handle the case in which limit > INT_MAX-stride. In
the above case: n > INT_MAX-3. In this case the loop counter will step
beyond the limit and overflow at the same time. However, knowing that
signed integer overlow in undefined, we can assume the loop test
behavior is arbitrary after overflow. This obeys both C undefined
behavior rules, and the more strict LLVM poison value rules.

I'm finally fixing this in response to Hal Finkel's persistence.
The most probable reason that we never optimized this before is that
we were being careful to handle case where the developer expected a
side-effect free infinite loop relying on overflow:

for (int i = 0; i < n; i += s) {
  ++j;
}
return j;

If INT_MAX+1 is a multiple of s and n > INT_MAX-s, then we might
expect an infinite loop. However there are plenty of ways to achieve
this effect without relying on undefined behavior of signed overflow.

llvm-svn: 193015
2013-10-18 23:43:53 +00:00
Chandler Carruth ea56494625 Remove the very substantial, largely unmaintained legacy PGO
infrastructure.

This was essentially work toward PGO based on a design that had several
flaws, partially dating from a time when LLVM had a different
architecture, and with an effort to modernize it abandoned without being
completed. Since then, it has bitrotted for several years further. The
result is nearly unusable, and isn't helping any of the modern PGO
efforts. Instead, it is getting in the way, adding confusion about PGO
in LLVM and distracting everyone with maintenance on essentially dead
code. Removing it paves the way for modern efforts around PGO.

Among other effects, this removes the last of the runtime libraries from
LLVM. Those are being developed in the separate 'compiler-rt' project
now, with somewhat different licensing specifically more approriate for
runtimes.

llvm-svn: 191835
2013-10-02 15:42:23 +00:00
Matt Arsenault 21981a1a0d Use CHECK-LABEL
llvm-svn: 191713
2013-09-30 23:31:55 +00:00
Manman Ren 0ed04fc9ab TBAA: handle scalar TBAA format and struct-path aware TBAA format.
Remove the command line argument "struct-path-tbaa" since we should not depend
on command line argument to decide which format the IR file is using. Instead,
we check the first operand of the tbaa tag node, if it is a MDNode, we treat
it as struct-path aware TBAA format, otherwise, we treat it as scalar TBAA
format.

When clang starts to use struct-path aware TBAA format no matter whether
struct-path-tbaa is no, and we can auto-upgrade existing bc files, the support
for scalar TBAA format can be dropped.

Existing testing cases are updated to use the struct-path aware TBAA format.

llvm-svn: 191538
2013-09-27 18:34:27 +00:00
Yi Jiang 5c343de8d3 X86 horizontal vector reduction cost model
llvm-svn: 191021
2013-09-19 17:48:48 +00:00
Arnold Schwaighofer cae8735a54 Costmodel: Add support for horizontal vector reductions
Upcoming SLP vectorization improvements will want to be able to estimate costs
of horizontal reductions. Add infrastructure to support this.

We model reductions as a series of (shufflevector,add) tuples ultimately
followed by an extractelement. For example, for an add-reduction of <4 x float>
we could generate the following sequence:

 (v0, v1, v2, v3)
   \   \  /  /
     \  \  /
       +  +

 (v0+v2, v1+v3, undef, undef)
    \      /
 ((v0+v2) + (v1+v3), undef, undef)

 %rdx.shuf = shufflevector <4 x float> %rdx, <4 x float> undef,
                           <4 x i32> <i32 2, i32 3, i32 undef, i32 undef>
 %bin.rdx = fadd <4 x float> %rdx, %rdx.shuf
 %rdx.shuf7 = shufflevector <4 x float> %bin.rdx, <4 x float> undef,
                          <4 x i32> <i32 1, i32 undef, i32 undef, i32 undef>
 %bin.rdx8 = fadd <4 x float> %bin.rdx, %rdx.shuf7
 %r = extractelement <4 x float> %bin.rdx8, i32 0

This commit adds a cost model interface "getReductionCost(Opcode, Ty, Pairwise)"
that will allow clients to ask for the cost of such a reduction (as backends
might generate more efficient code than the cost of the individual instructions
summed up). This interface is excercised by the CostModel analysis pass which
looks for reduction patterns like the one above - starting at extractelements -
and if it sees a matching sequence will call the cost model interface.

We will also support a second form of pairwise reduction that is well supported
on common architectures (haddps, vpadd, faddp).

 (v0, v1, v2, v3)
  \   /    \  /
 (v0+v1, v2+v3, undef, undef)
    \     /
 ((v0+v1)+(v2+v3), undef, undef, undef)

  %rdx.shuf.0.0 = shufflevector <4 x float> %rdx, <4 x float> undef,
        <4 x i32> <i32 0, i32 2 , i32 undef, i32 undef>
  %rdx.shuf.0.1 = shufflevector <4 x float> %rdx, <4 x float> undef,
        <4 x i32> <i32 1, i32 3, i32 undef, i32 undef>
  %bin.rdx.0 = fadd <4 x float> %rdx.shuf.0.0, %rdx.shuf.0.1
  %rdx.shuf.1.0 = shufflevector <4 x float> %bin.rdx.0, <4 x float> undef,
        <4 x i32> <i32 0, i32 undef, i32 undef, i32 undef>
  %rdx.shuf.1.1 = shufflevector <4 x float> %bin.rdx.0, <4 x float> undef,
        <4 x i32> <i32 1, i32 undef, i32 undef, i32 undef>
  %bin.rdx.1 = fadd <4 x float> %rdx.shuf.1.0, %rdx.shuf.1.1
  %r = extractelement <4 x float> %bin.rdx.1, i32 0

llvm-svn: 190876
2013-09-17 18:06:50 +00:00
Matt Arsenault a90a18e0ea Teach ScalarEvolution about pointer address spaces
llvm-svn: 190425
2013-09-10 19:55:24 +00:00
Matt Arsenault 5faa669b66 Fix lint assert on integer vector division
llvm-svn: 189290
2013-08-26 23:29:33 +00:00
Bill Wendling a08bb49c61 FileCheck-ize tests.
llvm-svn: 188971
2013-08-22 00:51:19 +00:00
Daniel Dunbar 9efbedfd35 [tests] Cleanup initialization of test suffixes.
- Instead of setting the suffixes in a bunch of places, just set one master
   list in the top-level config. We now only modify the suffix list in a few
   suites that have one particular unique suffix (.ml, .mc, .yaml, .td, .py).

 - Aside from removing the need for a bunch of lit.local.cfg files, this enables
   4 tests that were inadvertently being skipped (one in
   Transforms/BranchFolding, a .s file each in DebugInfo/AArch64 and
   CodeGen/PowerPC, and one in CodeGen/SI which is now failing and has been
   XFAILED).

 - This commit also fixes a bunch of config files to use config.root instead of
   older copy-pasted code.

llvm-svn: 188513
2013-08-16 00:37:11 +00:00
Bill Wendling dc17270968 FileCheckize some of the testcases.
llvm-svn: 187756
2013-08-05 23:43:18 +00:00
Renato Golin 0178a25fc5 Fixes ARM LNT bot from SLP change in O3
This patch fixes the multiple breakages on ARM test-suite after the SLP
vectorizer was introduced by default on O3. The problem was an illegal
vector type on ARMTTI::getCmpSelInstrCost() <3 x i1> which is not simple.

The guard protects this code from breaking (cause of the problems) but
doesn't fix the issue that is generating the odd vector in the first
place, which also needs to be investigated.

llvm-svn: 187658
2013-08-02 17:10:04 +00:00
Stephen Lin 6dd347b39f Add newlines at end of test files, no functionality change
llvm-svn: 186263
2013-07-13 22:00:58 +00:00
Hal Finkel ec474f28e3 Add the nearbyint -> FNEARBYINT mapping to BasicTargetTransformInfo
This fixes an oversight that Intrinsic::nearbyint was not being mapped to
ISD::FNEARBYINT (thus fixing the over-optimistic cost we were assigning to
nearbyint calls for some targets).

llvm-svn: 185783
2013-07-08 03:24:07 +00:00
Nick Lewycky c2ec0725ce Extend 'readonly' and 'readnone' to work on function arguments as well as
functions. Make the function attributes pass add it to known library functions
and when it can deduce it.

llvm-svn: 185735
2013-07-06 00:29:58 +00:00
Jakob Stoklund Olesen 0b075103cd Minimize precision loss when computing cyclic probabilities.
Allow block frequencies to exceed 32 bits by using the new
BlockFrequency division function.

llvm-svn: 185236
2013-06-28 22:40:43 +00:00
Preston Briggs 6c286b6029 (no commit message)
llvm-svn: 185187
2013-06-28 18:44:48 +00:00
Nadav Rotem f9ecbcb835 CostModel: improve the cost model for load/store of non power-of-two types such as <3 x float>, which are popular in graphics.
llvm-svn: 185085
2013-06-27 17:52:04 +00:00
Jakob Stoklund Olesen 6e630d46d2 Print block frequencies in decimal form.
This is easier to read than the internal fixed-point representation.

If anybody knows the correct algorithm for converting fixed-point
numbers to base 10, feel free to fix it.

llvm-svn: 184881
2013-06-25 21:57:38 +00:00
Arnold Schwaighofer a04b9ef1e8 X86 cost model: Vectorizing integer division is a bad idea
radar://14057959

llvm-svn: 184872
2013-06-25 19:14:09 +00:00
Benjamin Kramer 866793109e BlockFrequency: Bump up the entry frequency a bit.
This is a band-aid to fix the most severe regressions we're seeing from basing
spill decisions on block frequencies, until we have a better solution.

llvm-svn: 184835
2013-06-25 13:34:40 +00:00
Benjamin Kramer bfb84d0bd6 Revert "BlockFrequency: Saturate at 1 instead of 0 when multiplying a frequency with a branch probability."
This reverts commit r184584. Breaks PPC selfhost.

llvm-svn: 184590
2013-06-21 20:20:27 +00:00
Benjamin Kramer bd0f107929 BlockFrequency: Saturate at 1 instead of 0 when multiplying a frequency with a branch probability.
Zero is used by BlockFrequencyInfo as a special "don't know" value. It also
causes a sink for frequencies as you can't ever get off a zero frequency with
more multiplies.

This recovers a 10% regression on MultiSource/Benchmarks/7zip. A zero frequency
was propagated into an inner loop causing excessive spilling.

PR16402.

llvm-svn: 184584
2013-06-21 19:30:05 +00:00
Andrew Trick 3c944ba2f0 Unit test for SCEV fix r182989, PR16130.
llvm-svn: 183017
2013-05-31 16:42:41 +00:00
Michael Kuperstein f3e663af39 Make BasicAliasAnalysis recognize the fact a noalias argument cannot alias another argument, even if the other argument is not itself marked noalias.
llvm-svn: 182755
2013-05-28 08:17:48 +00:00
Diego Novillo c63995394d Add a new function attribute 'cold' to functions.
Other than recognizing the attribute, the patch does little else.
It changes the branch probability analyzer so that edges into
blocks postdominated by a cold function are given low weight.

Added analysis and code generation tests.  Added documentation for the
new attribute.

llvm-svn: 182638
2013-05-24 12:26:52 +00:00
Tim Northover 7dbbc28f72 AArch64: use MCJIT by default and enable related tests.
This just enables some testing I'd missed after implementing MCJIT
support.

llvm-svn: 181215
2013-05-06 16:51:08 +00:00
Matt Arsenault c23753a53e Fix unchecked uses of DominatorTree in MemoryDependenceAnalysis.
Use unknown results for places where it would be needed

llvm-svn: 181176
2013-05-06 02:07:24 +00:00
Tobias Grosser a7ddc98206 RegionInfo: Do not crash if unreachable block is found
llvm-svn: 181025
2013-05-03 15:48:34 +00:00
Manman Ren 662ece49de TBAA: remove !tbaa from testing cases if not used.
This will make it easier to turn on struct-path aware TBAA since the metadata
format will change.

llvm-svn: 180743
2013-04-29 22:42:01 +00:00
Manman Ren 5c37106d65 Struct-path aware TBAA: change the format of TBAAStructType node.
We switch the order of offset and field type to make TBAAStructType node
(name, parent node, offset) similar to scalar TBAA node (name, parent node).
TypeIsImmutable is added to TBAAStructTag node.

llvm-svn: 180654
2013-04-27 00:26:11 +00:00
Arnold Schwaighofer 9881dcf2f2 ARM cost model: Integer div and rem is lowered to a function call
Reflect this in the cost model. I observed this in MiBench/consumer-lame.

radar://13354716

llvm-svn: 180576
2013-04-25 21:16:18 +00:00
Jim Grosbach 563983c8a3 Legalize vector truncates by parts rather than just splitting.
Rather than just splitting the input type and hoping for the best, apply
a bit more cleverness. Just splitting the types until the source is
legal often leads to an illegal result time, which is then widened and a
scalarization step is introduced which leads to truly horrible code
generation. With the loop vectorizer, these sorts of operations are much
more common, and so it's worth extra effort to do them well.

Add a legalization hook for the operands of a TRUNCATE node, which will
be encountered after the result type has been legalized, but if the
operand type is still illegal. If simple splitting of both types
ends up with the result type of each half still being legal, just
do that (v16i16 -> v16i8 on ARM, for example). If, however, that would
result in an illegal result type (v8i32 -> v8i8 on ARM, for example),
we can get more clever with power-two vectors. Specifically,
split the input type, but also widen the result element size, then
concatenate the halves and truncate again.  For example on ARM,
To perform a "%res = v8i8 trunc v8i32 %in" we transform to:
  %inlo = v4i32 extract_subvector %in, 0
  %inhi = v4i32 extract_subvector %in, 4
  %lo16 = v4i16 trunc v4i32 %inlo
  %hi16 = v4i16 trunc v4i32 %inhi
  %in16 = v8i16 concat_vectors v4i16 %lo16, v4i16 %hi16
  %res = v8i8 trunc v8i16 %in16

This allows instruction selection to generate three VMOVN instructions
instead of a sequences of moves, stores and loads.

Update the ARMTargetTransformInfo to take this improved legalization
into account.

Consider the simplified IR:

define <16 x i8> @test1(<16 x i32>* %ap) {
  %a = load <16 x i32>* %ap
  %tmp = trunc <16 x i32> %a to <16 x i8>
  ret <16 x i8> %tmp
}

define <8 x i8> @test2(<8 x i32>* %ap) {
  %a = load <8 x i32>* %ap
  %tmp = trunc <8 x i32> %a to <8 x i8>
  ret <8 x i8> %tmp
}

Previously, we would generate the truly hideous:
	.syntax unified
	.section	__TEXT,__text,regular,pure_instructions
	.globl	_test1
	.align	2
_test1:                                 @ @test1
@ BB#0:
	push	{r7}
	mov	r7, sp
	sub	sp, sp, #20
	bic	sp, sp, #7
	add	r1, r0, #48
	add	r2, r0, #32
	vld1.64	{d24, d25}, [r0:128]
	vld1.64	{d16, d17}, [r1:128]
	vld1.64	{d18, d19}, [r2:128]
	add	r1, r0, #16
	vmovn.i32	d22, q8
	vld1.64	{d16, d17}, [r1:128]
	vmovn.i32	d20, q9
	vmovn.i32	d18, q12
	vmov.u16	r0, d22[3]
	strb	r0, [sp, #15]
	vmov.u16	r0, d22[2]
	strb	r0, [sp, #14]
	vmov.u16	r0, d22[1]
	strb	r0, [sp, #13]
	vmov.u16	r0, d22[0]
	vmovn.i32	d16, q8
	strb	r0, [sp, #12]
	vmov.u16	r0, d20[3]
	strb	r0, [sp, #11]
	vmov.u16	r0, d20[2]
	strb	r0, [sp, #10]
	vmov.u16	r0, d20[1]
	strb	r0, [sp, #9]
	vmov.u16	r0, d20[0]
	strb	r0, [sp, #8]
	vmov.u16	r0, d18[3]
	strb	r0, [sp, #3]
	vmov.u16	r0, d18[2]
	strb	r0, [sp, #2]
	vmov.u16	r0, d18[1]
	strb	r0, [sp, #1]
	vmov.u16	r0, d18[0]
	strb	r0, [sp]
	vmov.u16	r0, d16[3]
	strb	r0, [sp, #7]
	vmov.u16	r0, d16[2]
	strb	r0, [sp, #6]
	vmov.u16	r0, d16[1]
	strb	r0, [sp, #5]
	vmov.u16	r0, d16[0]
	strb	r0, [sp, #4]
	vldmia	sp, {d16, d17}
	vmov	r0, r1, d16
	vmov	r2, r3, d17
	mov	sp, r7
	pop	{r7}
	bx	lr

	.globl	_test2
	.align	2
_test2:                                 @ @test2
@ BB#0:
	push	{r7}
	mov	r7, sp
	sub	sp, sp, #12
	bic	sp, sp, #7
	vld1.64	{d16, d17}, [r0:128]
	add	r0, r0, #16
	vld1.64	{d20, d21}, [r0:128]
	vmovn.i32	d18, q8
	vmov.u16	r0, d18[3]
	vmovn.i32	d16, q10
	strb	r0, [sp, #3]
	vmov.u16	r0, d18[2]
	strb	r0, [sp, #2]
	vmov.u16	r0, d18[1]
	strb	r0, [sp, #1]
	vmov.u16	r0, d18[0]
	strb	r0, [sp]
	vmov.u16	r0, d16[3]
	strb	r0, [sp, #7]
	vmov.u16	r0, d16[2]
	strb	r0, [sp, #6]
	vmov.u16	r0, d16[1]
	strb	r0, [sp, #5]
	vmov.u16	r0, d16[0]
	strb	r0, [sp, #4]
	ldm	sp, {r0, r1}
	mov	sp, r7
	pop	{r7}
	bx	lr

Now, however, we generate the much more straightforward:
	.syntax unified
	.section	__TEXT,__text,regular,pure_instructions
	.globl	_test1
	.align	2
_test1:                                 @ @test1
@ BB#0:
	add	r1, r0, #48
	add	r2, r0, #32
	vld1.64	{d20, d21}, [r0:128]
	vld1.64	{d16, d17}, [r1:128]
	add	r1, r0, #16
	vld1.64	{d18, d19}, [r2:128]
	vld1.64	{d22, d23}, [r1:128]
	vmovn.i32	d17, q8
	vmovn.i32	d16, q9
	vmovn.i32	d18, q10
	vmovn.i32	d19, q11
	vmovn.i16	d17, q8
	vmovn.i16	d16, q9
	vmov	r0, r1, d16
	vmov	r2, r3, d17
	bx	lr

	.globl	_test2
	.align	2
_test2:                                 @ @test2
@ BB#0:
	vld1.64	{d16, d17}, [r0:128]
	add	r0, r0, #16
	vld1.64	{d18, d19}, [r0:128]
	vmovn.i32	d16, q8
	vmovn.i32	d17, q9
	vmovn.i16	d16, q8
	vmov	r0, r1, d16
	bx	lr

llvm-svn: 179989
2013-04-21 23:47:41 +00:00
Arnold Schwaighofer c0c7ff4ac0 X86 cost model: Exit before calling getSimpleVT on non-simple VTs
getSimpleVT can only handle simple value types.

radar://13676022

llvm-svn: 179714
2013-04-17 20:04:53 +00:00
Nadav Rotem 87a0af6e1b CostModel: increase the default cost of supported floating point operations from 1 to two. Fixed a few tests that changes because now the cost of one insert + a vector operation on two doubles is lower than two scalar operations on doubles.
llvm-svn: 179413
2013-04-12 21:15:03 +00:00
Manman Ren 06a9d50a35 Aliasing rules for struct-path aware TBAA.
Added PathAliases to check if two struct-path tags can alias.
Added command line option -struct-path-tbaa.

llvm-svn: 179337
2013-04-11 23:24:18 +00:00
Arnold Schwaighofer f47d2d7f6b X86 cost model: Model cost for uitofp and sitofp on SSE2
The costs are overfitted so that I can still use the legalization factor.

For example the following kernel has about half the throughput vectorized than
unvectorized when compiled with SSE2. Before this patch we would vectorize it.

unsigned short A[1024];
double B[1024];
void f() {
  int i;
  for (i = 0; i < 1024; ++i) {
    B[i] = (double) A[i];
  }
}

radar://13599001

llvm-svn: 179033
2013-04-08 18:05:48 +00:00
Arnold Schwaighofer 995ce6c388 TargetLowering: Fix getTypeConversion handling of extended vector types
The code in getTypeConversion attempts to promote the element vector type
before it trys to split or widen the vector.
After it failed finding a legal vector type by promoting it would continue using
the promoted vector element type. Thereby missing legal splitted vector types.
For example the type v32i32 that has a legal split of 4 x v3i32 on x86/sse2
would be transformed to: v32i256 and from there on successively split to:
v16i256, v8i256, v1i256 and then finally ends up as an i64 type.
By resetting the vector element type to the original vector element type that
existed before the promotion the code will attempt to split the vector type to
smaller vector widths of the same type.

llvm-svn: 178999
2013-04-07 20:22:56 +00:00
Arnold Schwaighofer 44f902ed7d X86 cost model: Differentiate cost for vector shifts of constants
SSE2 has efficient support for shifts by a scalar. My previous change of making
shifts expensive did not take this into account marking all shifts as expensive.
This would prevent vectorization from happening where it is actually beneficial.

With this change we differentiate between shifts of constants and other shifts.

radar://13576547

llvm-svn: 178808
2013-04-04 23:26:24 +00:00
Arnold Schwaighofer e9b5016411 X86 cost model: Vector shifts are expensive in most cases
The default logic does not correctly identify costs of casts because they are
marked as custom on x86.

For some cases, where the shift amount is a scalar we would be able to generate
better code. Unfortunately, when this is the case the value (the splat) will get
hoisted out of the loop, thereby making it invisible to ISel.

radar://13130673
radar://13537826

llvm-svn: 178703
2013-04-03 21:46:05 +00:00
Benjamin Kramer 52ceb44331 X86TTI: Add accurate costs for itofp operations, based on the actual instruction counts.
llvm-svn: 178459
2013-04-01 10:23:49 +00:00
Andrew Trick 9093e15066 Fix SCEV forgetMemoizedResults should search and destroy backedge exprs.
Fixes PR15570: SEGV: SCEV back-edge info invalid after dead code removal.

Indvars creates a SCEV expression for the loop's back edge taken
count, then determines that the comparison is always true and
removes it.

When loop-unroll asks for the expression, it contains a NULL
SCEVUnknkown (as a CallbackVH).

forgetMemoizedResults should invalidate the loop back edges expression.

llvm-svn: 177986
2013-03-26 03:14:53 +00:00
Jyotsna Verma 5bd02926b4 Disable profiling tests for Hexagon since it doesn't support JIT.
llvm-svn: 177917
2013-03-25 21:15:11 +00:00
Manman Ren 0827e97700 Support in AAEvaluator to print alias queries of loads/stores with TBAA tags.
Add "evaluate-tbaa" to print alias queries of loads/stores. Alias queries
between pointers do not include TBAA tags.

Add testing case for "placement new". TBAA currently says NoAlias.

llvm-svn: 177772
2013-03-22 22:34:41 +00:00
Michael Liao 70dd7f999d Correct cost model for vector shift on AVX2
- After moving logic recognizing vector shift with scalar amount from
  DAG combining into DAG lowering, we declare to customize all vector
  shifts even vector shift on AVX is legal. As a result, the cost model
  needs special tuning to identify these legal cases.

llvm-svn: 177586
2013-03-20 22:01:10 +00:00
Nadav Rotem 0f1bc60d51 Optimize sext <4 x i8> and <4 x i16> to <4 x i64>.
Patch by Ahmad, Muhammad T <muhammad.t.ahmad@intel.com>

llvm-svn: 177421
2013-03-19 18:38:27 +00:00
Renato Golin 227eb6fc5f Improve long vector sext/zext lowering on ARM
The ARM backend currently has poor codegen for long sext/zext
operations, such as v8i8 -> v8i32. This patch addresses this
by performing a custom expansion in ARMISelLowering. It also
adds/changes the cost of such lowering in ARMTTI.

This partially addresses PR14867.

Patch by Pete Couperus

llvm-svn: 177380
2013-03-19 08:15:38 +00:00
Arnold Schwaighofer ae0052f114 ARM cost model: Make some vector integer to float casts cheaper
The default logic marks them as too expensive.

For example, before this patch we estimated:
  cost of 16 for instruction:   %r = uitofp <4 x i16> %v0 to <4 x float>

While this translates to:
  vmovl.u16 q8, d16
  vcvt.f32.u32  q8, q8

All other costs are left to the values assigned by the fallback logic. Theses
costs are mostly reasonable in the sense that they get progressively more
expensive as the instruction sequences emitted get longer.

radar://13445992

llvm-svn: 177334
2013-03-18 22:47:09 +00:00
Arnold Schwaighofer 6c9c3a8b99 ARM cost model: Correct cost for some cheap float to integer conversions
Fix cost of some "cheap" cast instructions. Before this patch we used to
estimate for example:
  cost of 16 for instruction:   %r = fptoui <4 x float> %v0 to <4 x i16>

While we would emit:
  vcvt.s32.f32  q8, q8
  vmovn.i32 d16, q8
  vuzp.8  d16, d17

All other costs are left to the values assigned by the fallback logic. Theses
costs are mostly reasonable in the sense that they get progressively more
expensive as the instruction sequences emitted get longer.

radar://13434072

llvm-svn: 177333
2013-03-18 22:47:06 +00:00
Arnold Schwaighofer 9d7a3827e4 ARM cost model: Fix costs for some vector selects
I was too pessimistic in r177105. Vector selects that fit into a legal register
type lower just fine. I was mislead by the code fragment that I was using. The
stores/loads that I saw in those cases came from lowering the conditional off
an address.

Changing the code fragment to:

%T0_3 = type <8 x i18>
%T1_3 = type <8 x i1>

define void @func_blend3(%T0_3* %loadaddr, %T0_3* %loadaddr2,
                         %T1_3* %blend, %T0_3* %storeaddr) {
  %v0 = load %T0_3* %loadaddr
  %v1 = load %T0_3* %loadaddr2
==> FROM:
  ;%c = load %T1_3* %blend
==> TO:
  %c = icmp slt %T0_3 %v0, %v1
==> USE:
  %r = select %T1_3 %c, %T0_3 %v0, %T0_3 %v1

  store %T0_3 %r, %T0_3* %storeaddr
  ret void
}

revealed this mistake.

radar://13403975

llvm-svn: 177170
2013-03-15 18:31:01 +00:00
Arnold Schwaighofer f5284ff61f ARM cost model: Fix cost of fptrunc and fpext instructions
A vector fptrunc and fpext simply gets split into scalar instructions.

radar://13192358

llvm-svn: 177159
2013-03-15 15:10:47 +00:00
Arnold Schwaighofer 8070b382ec ARM cost model: Increase cost of some vector selects we do terrible on
By terrible I mean we store/load from the stack.

This matters on PAQp8 in _Z5trainPsS_ii (which is inlined into Mixer::update)
where we decide to vectorize a loop with a VF of 8 resulting in a 25%
degradation on a cortex-a8.

LV: Found an estimated cost of 2 for VF 8 For instruction:   icmp slt i32
LV: Found an estimated cost of 2 for VF 8 For instruction:   select i1, i32, i32

The bug that tracks the CodeGen part is PR14868.

radar://13403975

llvm-svn: 177105
2013-03-14 19:17:02 +00:00
Arnold Schwaighofer 90774f3c8f ARM cost model: Increase the cost for vector casts that use the stack
Increase the cost of v8/v16-i8 to v8/v16-i32 casts and truncates as the backend
currently lowers those using stack accesses.

This was responsible for a significant degradation on
MultiSource/Benchmarks/Trimaran/enc-pc1/enc-pc1
where we vectorize one loop to a vector factor of 16. After this patch we select
a vector factor of 4 which will generate reasonable code.

unsigned char cle[32];

void test(short c) {
  unsigned short compte;
  for (compte = 0; compte <= 31; compte++) {
    cle[compte] = cle[compte] ^ c;
  }
}

radar://13220512

llvm-svn: 176898
2013-03-12 21:19:22 +00:00
Jan Wen Voung 6dc3076080 Revert the test moves from 176733. Use "REQUIRES: asserts" instead.
llvm-svn: 176873
2013-03-12 16:27:52 +00:00
Jan Wen Voung 7857a64909 Disable statistics on Release builds and move tests that depend on -stats.
Summary:
Statistics are still available in Release+Asserts (any +Asserts builds),
and stats can also be turned on with LLVM_ENABLE_STATS.

Move some of the FastISel stats that were moved under DEBUG()
back out of DEBUG(), since stats are disabled across the board now.

Many tests depend on grepping "-stats" output.  Move those into
a orig_dir/Stats/. so that they can be marked as unsupported
when building without statistics.

Differential Revision: http://llvm-reviews.chandlerc.com/D486

llvm-svn: 176733
2013-03-08 22:56:31 +00:00
Shuxin Yang 408bdad5b4 Memory Dependence Analysis (not mem-dep test) take advantage of "invariant.load" metadata.
The "invariant.load" metadata indicates the memory unit being accessed is immutable.
A load annotated with this metadata can be moved across any store.

As I am not sure if it is legal to move such loads across barrier/fence, this
change dose not allow such transformation.

rdar://11311484

Thank Arnold for code review.

llvm-svn: 176562
2013-03-06 17:48:48 +00:00
Arnold Schwaighofer 20ef54f4c1 X86 cost model: Adjust cost for custom lowered vector multiplies
This matters for example in following matrix multiply:

int **mmult(int rows, int cols, int **m1, int **m2, int **m3) {
  int i, j, k, val;
  for (i=0; i<rows; i++) {
    for (j=0; j<cols; j++) {
      val = 0;
      for (k=0; k<cols; k++) {
        val += m1[i][k] * m2[k][j];
      }
      m3[i][j] = val;
    }
  }
  return(m3);
}

Taken from the test-suite benchmark Shootout.

We estimate the cost of the multiply to be 2 while we generate 9 instructions
for it and end up being quite a bit slower than the scalar version (48% on my
machine).

Also, properly differentiate between avx1 and avx2. On avx-1 we still split the
vector into 2 128bits and handle the subvector muls like above with 9
instructions.
Only on avx-2 will we have a cost of 9 for v4i64.

I changed the test case in test/Transforms/LoopVectorize/X86/avx1.ll to use an
add instead of a mul because with a mul we now no longer vectorize. I did
verify that the mul would be indeed more expensive when vectorized with 3
kernels:

for (i ...)
   r += a[i] * 3;
for (i ...)
  m1[i] = m1[i] * 3; // This matches the test case in avx1.ll
and a matrix multiply.

In each case the vectorized version was considerably slower.

radar://13304919

llvm-svn: 176403
2013-03-02 04:02:52 +00:00
Benjamin Kramer f7cfac7a14 Cost model support for lowered math builtins.
We make the cost for calling libm functions extremely high as emitting the
calls is expensive and causes spills (on x86) so performance suffers. We still
vectorize important calls like ceilf and friends on SSE4.1. and fabs.

Differential Revision: http://llvm-reviews.chandlerc.com/D466

llvm-svn: 176287
2013-02-28 19:09:33 +00:00
Bill Wendling a032374ea0 Use references to attribute groups on the call/invoke instructions.
Listing all of the attributes for the callee of a call/invoke instruction is way
too much and makes the IR unreadable. Use references to attributes instead.

llvm-svn: 175877
2013-02-22 09:09:42 +00:00
Elena Demikhovsky 0ccdd1315b I optimized the following patterns:
sext <4 x i1> to <4 x i64>
 sext <4 x i8> to <4 x i64>
 sext <4 x i16> to <4 x i64>
 
I'm running Combine on SIGN_EXTEND_IN_REG and revert SEXT patterns:
 (sext_in_reg (v4i64 anyext (v4i32 x )), ExtraVT) -> (v4i64 sext (v4i32 sext_in_reg (v4i32 x , ExtraVT)))
 
 The sext_in_reg (v4i32 x) may be lowered to shl+sar operations.
 The "sar" does not exist on 64-bit operation, so lowering sext_in_reg (v4i64 x) has no vector solution.

I also added a cost of this operations to the AVX costs table.

llvm-svn: 175619
2013-02-20 12:42:54 +00:00
Bill Wendling 90bc19cd91 Modify the LLVM assembly output so that it uses references to represent function attributes.
This makes the LLVM assembly look better. E.g.:

     define void @foo() #0 { ret void }
     attributes #0 = { nounwind noinline ssp }

llvm-svn: 175605
2013-02-20 07:21:42 +00:00
Tim Northover 67d3c09332 AArch64: adjust tests which rely on a default JIT
Profiling tests *do* need a JIT. They'll pass if a cross-compiler targetting
AArch64 by default has been built, but fail if a native AArch64 compiler has
been build. Therefore XFAIL is inappropriate and we mark them unsupported.

ExecutionEngine tests are JIT by definition, they should also be unsupported.

Transforms/LICM only uses the interpreter to check the output is still sane
after optimisation. It can be switched to use an interpreter.

llvm-svn: 175433
2013-02-18 11:08:37 +00:00
Arnold Schwaighofer 89aef93841 ARM cost model: Add vector reverse shuffle costs
A reverse shuffle is lowered to a vrev and possibly a vext instruction (quad
word).

radar://13171406

llvm-svn: 174933
2013-02-12 02:40:39 +00:00
Bill Schmidt 62fe7a5b17 Refine fix to bug 15041.
Thanks to help from Nadav and Hal, I have a more reasonable (and even
correct!) approach.  This specifically penalizes the insertelement
and extractelement operations for the performance hit that will occur
on PowerPC processors.

llvm-svn: 174725
2013-02-08 18:19:17 +00:00
Arnold Schwaighofer 594fa2dc2b ARM cost model: Address computation in vector mem ops not free
Adds a function to target transform info to query for the cost of address
computation. The cost model analysis pass now also queries this interface.
The code in LoopVectorize adds the cost of address computation as part of the
memory instruction cost calculation. Only there, we know whether the instruction
will be scalarized or not.
Increase the penality for inserting in to D registers on swift. This becomes
necessary because we now always assume that address computation has a cost and
three is a closer value to the architecture.

radar://13097204

llvm-svn: 174713
2013-02-08 14:50:48 +00:00
Arnold Schwaighofer 213fced704 ARM cost model: Add costs for vector selects
Vector selects are cheap on NEON. They get lowered to a vbsl instruction.

radar://13158753

llvm-svn: 174631
2013-02-07 16:10:15 +00:00
Arnold Schwaighofer a804bbee9b ARM cost model: Cost for scalar integer casts and floating point conversions
Also adds some costs for vector integer float conversions.

llvm-svn: 174371
2013-02-05 14:05:55 +00:00
Arnold Schwaighofer 98f1012f9b ARM cost model: Penalize insertelement into D subregisters
Swift has a renaming dependency if we load into D subregisters. We don't have a
way of distinguishing between insertelement operations of values from loads and
other values. Therefore, we are pessimistic for now (The performance problem
showed up in example 14 of gcc-loops).

radar://13096933

llvm-svn: 174300
2013-02-04 02:52:05 +00:00
Hal Finkel 4e5ca9e578 Initial implementation of PPCTargetTransformInfo
This provides a place to add customized operation cost information and
control some other target-specific IR-level transformations.

The only non-trivial logic in this checkin assigns a higher cost to
unaligned loads and stores (covered by the included test case).

llvm-svn: 173520
2013-01-25 23:05:59 +00:00
Nadav Rotem b1615b1ac4 Make opt grab the triple from the module and use it to initialize the target machine.
llvm-svn: 171341
2013-01-01 08:00:32 +00:00
Dmitri Gribenko 56bf2e1830 Tests: rewrite 'opt ... %s' to 'opt ... < %s' so that opt does not emit a ModuleID
This is done to avoid odd test failures, like the one fixed in r171243.

llvm-svn: 171250
2012-12-30 02:33:22 +00:00
Dmitri Gribenko 10c4b4d249 Add a check to the test Analysis/ScalarEvolution/2010-09-03-RequiredTransitive.ll
This test did not test anything at all (except for opt crashing, but that was
not the reason why it was added).

llvm-svn: 171248
2012-12-30 01:42:34 +00:00