192 lines
9.2 KiB
ReStructuredText
192 lines
9.2 KiB
ReStructuredText
.. _autocounter:
|
|
|
|
AutoCounter: Profiling with Out-of-Band Performance Counter Collection
|
|
======================================================================
|
|
|
|
FireSim can provide visibility into a simulated CPU's architectural and
|
|
microarchitectural state over the course of execution through the use of counters. These
|
|
are similar to performance counters provided by processor vendors, and more general
|
|
counters provided by architectural simulators.
|
|
|
|
This functionality is provided by the AutoCounter feature (introduced in our `FirePerf
|
|
paper at ASPLOS 2020 <https://sagark.org/assets/pubs/fireperf-asplos2020.pdf>`_), and
|
|
can be used for profiling and debugging. Since AutoCounter injects counters only in
|
|
simulation (unlike target-level performance counters), these counters do not affect the
|
|
behavior of the simulated machine, no matter how often they are sampled.
|
|
|
|
Chisel Interface
|
|
----------------
|
|
|
|
AutoCounter enables the addition of ad-hoc counters using the ``PerfCounter`` object in
|
|
the `midas.targetutils` package. PerfCounters counters can be added in one of two modes:
|
|
|
|
1. `Accumulate`, using the standard ``PerfCounter.apply`` method. Here the annotated
|
|
UInt (1 or more bits) is added to a 64b accumulation register: the target is treated
|
|
as representing an N-bit UInt and will increment the counter by a value between [0,
|
|
2^n - 1] per cycle.
|
|
2. `Identity`, using the ``PerfCounter.identity`` method. Here the annotated UInt is
|
|
sampled directly. This can be used to annotate a sample with values are not
|
|
accumulator-like (e.g., a PC), and permits the user to define more complex
|
|
instrumentation logic in the target itself.
|
|
|
|
We give examples of using PerfCounter below:
|
|
|
|
.. code-block:: scala
|
|
|
|
// A standard boolean event. Increments by 1 or 0 every local clock cycle.
|
|
midas.targetutils.PerfCounter(en_clock, "gate_clock", "Core clock gated")
|
|
|
|
// A multibit example. If the core can retire three isntructions per cycle,
|
|
// encode this as a two-bit unit. Extra-width is OK but the encoding to the UInt
|
|
// (e.g., doing a pop count), must be done by the user.
|
|
midas.targetutils.PerfCounter(insns_ret, "iret", "Instructions retired")
|
|
|
|
// An identity value. Note: the pc here must be <= 64b wide.
|
|
midas.targetutils.PerfCounter.identity(pc, "pc", "The value of the program counter at the time of a sample")
|
|
|
|
See the `PerfCounter Scala API docs
|
|
<https://fires.im/firesim/latest/api/midas/targetutils/PerfCounter$.html>`_ for more
|
|
detail about the Chisel-side interface.
|
|
|
|
Enabling AutoCounter in Golden Gate
|
|
-----------------------------------
|
|
|
|
By default, annotated events are not synthesized into AutoCounters. To enable
|
|
AutoCounter when compiling a design, prepend the ``WithAutoCounter`` config to your
|
|
``PLATFORM_CONFIG``. During compilation, Golden Gate will print the signals it is
|
|
generating counters for.
|
|
|
|
Rocket Chip Cover Functions
|
|
---------------------------
|
|
|
|
The cover function is applied to various signals in the Rocket Chip generator repository
|
|
to mark points of interest (i.e., interesting signals) in the RTL. Tools are free to
|
|
provide their own implementation of this function to process these signals as they wish.
|
|
In FireSim, these functions can be used as a hook for automatic generation of counters.
|
|
|
|
Since cover functions are embedded throughout the code of Rocket Chip (and possibly
|
|
other code repositories), AutoCounter provides a filtering mechanism based on module
|
|
granularity. As such, only cover functions that appear within selected modules will
|
|
generate counters.
|
|
|
|
The filtered modules can be indicated using one of two methods:
|
|
|
|
1. An annotation attached to the module for which cover functions should be turned into
|
|
AutoCounters. The annotation requires a ``ModuleTarget`` which can be pointed to any
|
|
module in the design. Alternatively, the current module can be annotated as follows:
|
|
|
|
.. code-block:: scala
|
|
|
|
class SomeModule(implicit p: Parameters) extends Module
|
|
{
|
|
chisel3.experimental.annotate(AutoCounterCoverModuleAnnotation(
|
|
Module.currentModule.get.toTarget))
|
|
}
|
|
|
|
2. An input file with a list of module names. This input file is named
|
|
``autocounter-covermodules.txt``, and includes a list of module names separated by
|
|
new lines (no commas).
|
|
|
|
.. _autocounter-runtime-parameters:
|
|
|
|
AutoCounter Runtime Parameters
|
|
------------------------------
|
|
|
|
AutoCounter currently takes a single runtime configurable parameter, defined under the
|
|
``autocounter:`` section in the ``config_runtime.yaml`` file. The ``read_rate``
|
|
parameter defines the rate at which the counters should be read, and is measured in
|
|
target-cycles of the base target-clock (clock 0 produced by the ClockBridge). Hence, if
|
|
the read_rate is defined to be 100 and the tile frequency is 2x the base clock (ex.,
|
|
which may drive the uncore), the simulator will read and print the values of the
|
|
counters every 200 core-clock cycles. If the core-domain clock is the base clock, it
|
|
would do so every 100 cycles. By default, the read_rate is set to 0 cycles, which
|
|
disables AutoCounter.
|
|
|
|
.. code-block:: yaml
|
|
|
|
autocounter:
|
|
# read counters every 100 cycles
|
|
read_rate: 100
|
|
|
|
.. note::
|
|
|
|
AutoCounter is designed as a coarse-grained observability mechanism, as sampling
|
|
each counter requires two (blocking) MMIO reads (each read takes O(100) ns on EC2
|
|
F1). As a result sampling at intervals less than O(10000) cycles may adversely
|
|
affect simulation performance for large numbers of counters. If you intend on
|
|
reading counters at a finer granularity, consider using synthesizable printfs.
|
|
|
|
AutoCounter CSV Output Format
|
|
-----------------------------
|
|
|
|
AutoCounter output files are CSVs generated in the working directory where the simulator
|
|
was invoked (this applies to metasimulators too), with the default names
|
|
``AUTOCOUNTERFILE<i>.csv``, one per clock domain. The CSV output format is depicted
|
|
below, assuming a sampling period of ``N`` base clock cycles.
|
|
|
|
.. csv-table:: AutoCounter CSV Format
|
|
:file: autocounter-csv-format.csv
|
|
|
|
Column Notes:
|
|
|
|
1. Each column beyond the first two corresponds to a PerfCounter instance in the clock
|
|
domain.
|
|
2. Column 0 past the header corresponds to the base clock cycle of the sample.
|
|
3. The local_cycle counter (column 1) is implemented as an always enabled single-bit
|
|
event, and increments even when the target is under reset.
|
|
|
|
Row Notes:
|
|
|
|
1. Header row 0: autocounter csv format version, an integer.
|
|
2. Header row 1: clock domain information.
|
|
3. Header row 2: the label parameter provided to PerfCounter suffixed with the instance
|
|
path.
|
|
4. Header row 3: the description parameter provided to PerfCounter. Quoted.
|
|
5. Header row 4: the width of the field annotated in the target.
|
|
6. Header row 5: the width of the accumulation register. Not configurable, but makes it
|
|
clear when to expect rollover.
|
|
7. Header row 6: indicates the accumulation scheme. Can be "Identity" or "Accumulate".
|
|
8. Sample row 0: sampled values at the bitwidth of the accumulation register.
|
|
9. Sample row k: ditto above, k * N base cycles later
|
|
|
|
Using TracerV Trigger with AutoCounter
|
|
--------------------------------------
|
|
|
|
In order to collect AutoCounter results from only from a particular region of interest
|
|
in the simulation, AutoCounter has been integrated with TracerV triggers. See the
|
|
:ref:`tracerv-trigger` section for more information.
|
|
|
|
AutoCounter using Synthesizable Printfs
|
|
---------------------------------------
|
|
|
|
The AutoCounter transformation in Golden Gate includes an event-driven mode that uses
|
|
Synthesizable Printfs (see :ref:`printf-synthesis`) to export counter results `as they
|
|
are updated` rather than sampling them periodically with a dedicated Bridge. This mode
|
|
can be enabled by prepending the ``WithAutoCounterCoverPrintf`` config to your
|
|
``PLATFORM_CONFIG`` instead of ``WithAutoCounterCover``. Based on the selected event
|
|
mode the printfs will have the following runtime behavior:
|
|
|
|
- `Accumulate`: On a non-zero increment, the local cycle count and the new counter value
|
|
are printed. This produces a series of prints with monotonically increasingly values.
|
|
- `Identity`: On a transition of the annotated target, the local cycle count and the new
|
|
value are printed. Thus a target that transitions every cycle will produce printf
|
|
traffic every cycle.
|
|
|
|
This mode may be useful for temporally fine-grained observation of counters. The counter
|
|
values will be printed to the same output stream as other synthesizable printfs. This
|
|
mode uses considerably more FPGA resources per counter, and may consume considerable
|
|
amounts of DMA bandwidth (since it prints every cycle a counter increments), which may
|
|
adversly affect simulation performance (increased FMR).
|
|
|
|
Reset & Timing Considerations
|
|
-----------------------------
|
|
|
|
- Events and identity values provided while under local reset, or while the
|
|
``GlobalResetCondition`` asserted, are zero-ed out. Similarly, printfs that might
|
|
otherwise be active under a reset are masked out.
|
|
- The sampling period in slower clock domains is currently calculated using a truncating
|
|
division of the period in the base clock domain. Thus, when the base clock period can
|
|
not be cleanly divided, samples in the slower clock domain will gradually fall out of
|
|
phase with samples in the base clock domain. In all cases, the "local_cycle" column is
|
|
most accurate measure of sample time.
|