Manager Tasks
=============
This page outlines all of the tasks that the FireSim manager supports.
.. _firesim-managerinit:
``firesim managerinit``
-----------------------
This is a setup command that does the following:

- Backs up existing config files if they exist (``config_runtime.yaml``,
  ``config_build.yaml``, ``config_build_recipes.yaml``, and ``config_hwdb.yaml``).
- Replaces the default config files (``config_runtime.yaml``, ``config_build.yaml``,
  ``config_build_recipes.yaml``, and ``config_hwdb.yaml``) with clean example versions.

It then performs platform-specific initialization steps for the given ``--platform``:

.. tabs::

    .. tab:: ``f1``

        - Run ``aws configure`` and prompt for credentials.
        - Prompt the user for an email address and subscribe them to
          notifications for their own builds.
        - Set up the ``config_runtime.yaml`` and ``config_build.yaml``
          files with AWS run/build farm arguments.

    .. tab:: All other platforms

        This includes platforms such as ``xilinx_alveo_u200``,
        ``xilinx_alveo_u250``, ``xilinx_alveo_u280``, ``xilinx_vcu118``,
        and ``rhsresearch_nitefury_ii``.

        - Set up the ``config_runtime.yaml`` and ``config_build.yaml``
          files with externally provisioned run/build farm arguments.

You can re-run this whenever you want to get clean configuration files.
.. note::
For ``f1``, you can just hit Enter when prompted for ``aws configure`` credentials
and your email address, and both will keep your previously specified values.
If you run this command by accident and didn't mean to overwrite your configuration
files, you'll find backed-up versions in
``firesim/deploy/sample-backup-configs/backup*``.
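
To see which backups exist before restoring one by hand, a minimal sketch like the following could be used. The helper name ``list_config_backups`` and its argument are hypothetical; only the ``sample-backup-configs/backup*`` location comes from the text above.

```shell
# Hypothetical helper: list config backups left behind by `firesim managerinit`.
# $1 is the path to the firesim/deploy directory (defaults to ./firesim/deploy).
list_config_backups() {
    local deploy_dir="${1:-firesim/deploy}"
    local entry found=0
    for entry in "$deploy_dir"/sample-backup-configs/backup*; do
        # With no matches the glob stays literal, so check that the path exists.
        [ -e "$entry" ] || continue
        echo "$entry"
        found=1
    done
    if [ "$found" -ne 1 ]; then
        echo "no backups found under $deploy_dir" >&2
        return 1
    fi
}
```

Calling ``list_config_backups`` from the directory that contains ``firesim/`` prints one backup path per line, or returns a non-zero status if there are none.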
.. _firesim-buildbitstream:
``firesim buildbitstream``
--------------------------
This command builds a FireSim bitstream using a **Build Farm** from the Chisel RTL for
the configurations that you specify. The process of defining configurations to build is
explained in the documentation for :ref:`config-build` and :ref:`config-build-recipes`.
For each config, the build process entails:

.. tabs::

    .. tab:: F1

        #. [Locally] Run the elaboration process for your hardware
           configuration
        #. [Locally] FAME-1 transform the design with MIDAS
        #. [Locally] Attach simulation models (I/O widgets, memory model,
           etc.)
        #. [Locally] Emit Verilog to run through the FPGA Flow
        #. Use a build farm configuration to launch/use build hosts for
           each configuration you want to build
        #. [Local/Remote] Prep build hosts, copy generated Verilog for
           hardware configuration to build instance
        #. [Local or Remote] Run Vivado Synthesis and P&R for the
           configuration
        #. [Local/Remote] Copy back all output generated by Vivado
           including the final tar file
        #. [Local/AWS Infra] Submit the tar file to the AWS backend for
           conversion to an AFI
        #. [Local] Wait for the AFI to become available, then notify the
           user of completion by email

    .. tab:: XDMA-based On-Prem.

        #. [Locally] Run the elaboration process for your hardware
           configuration
        #. [Locally] FAME-1 transform the design with MIDAS
        #. [Locally] Attach simulation models (I/O widgets, memory model,
           etc.)
        #. [Locally] Emit Verilog to run through the FPGA Flow
        #. Use a build farm configuration to launch/use build hosts for
           each configuration you want to build
        #. [Local/Remote] Prep build hosts, copy generated Verilog for
           hardware configuration to build instance
        #. [Local or Remote] Run Vivado Synthesis and P&R for the
           configuration
        #. [Local/Remote] Copy back all output generated by Vivado
           (including ``bit`` bitstream)

    .. tab:: Vitis-based On-Prem.

        #. [Locally] Run the elaboration process for your hardware
           configuration
        #. [Locally] FAME-1 transform the design with MIDAS
        #. [Locally] Attach simulation models (I/O widgets, memory model,
           etc.)
        #. [Locally] Emit Verilog to run through the FPGA Flow
        #. Use a build farm configuration to launch/use build hosts for
           each configuration you want to build
        #. [Local/Remote] Prep build hosts, copy generated Verilog for
           hardware configuration to build instance
        #. [Local or Remote] Run Vitis Synthesis and P&R for the
           configuration
        #. [Local/Remote] Copy back all output generated by Vitis
           (including the ``bitstream_tar`` containing the ``xclbin``
           bitstream)

This process happens in parallel for all of the builds you specify. The command exits
when all builds have completed, and its exit code indicates whether all builds passed
or any build failed. On F1, you will also be notified as each individual build
completes.
.. note::
**It is highly recommended that you either run this command in a** ``screen`` **or
use** ``mosh`` **to access the manager instance. Builds will not finish if the
manager is killed due to ssh disconnection from the manager instance.**
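
One way to follow this recommendation is to start the build inside a detached ``screen`` session. The following is a minimal sketch, assuming GNU ``screen`` is installed on the manager instance. The function name, the default session name, and the ``SCREEN_CMD``/``FIRESIM_CMD`` variables are illustrative and not part of FireSim.

```shell
# Illustrative names: start_build_in_screen, SCREEN_CMD, FIRESIM_CMD.
# SCREEN_CMD and FIRESIM_CMD default to the real tools; override them for dry runs.
start_build_in_screen() {
    local session="${1:-firesim-build}"
    # -d -m starts the session detached; -S gives it a name to reattach to.
    "${SCREEN_CMD:-screen}" -d -m -S "$session" "${FIRESIM_CMD:-firesim}" buildbitstream
}
```

After calling ``start_build_in_screen``, the session can be reattached with ``screen -r firesim-build`` to check on progress, and the build keeps running if the SSH connection drops.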
When you run a build for a particular configuration, a directory named
``LAUNCHTIME-CONFIG_TRIPLET-BUILD_NAME`` is created in
``firesim/deploy/results-build/``. This directory will contain:

.. tabs::

    .. tab:: F1

        - ``AGFI_INFO``: Describes the state of the AFI being built,
          while the manager is running. Upon build completion, this
          contains the AGFI/AFI that was produced, along with its
          metadata.
        - ``cl_firesim``: This directory is essentially the Vivado
          project that built the FPGA image, in the state it was in when
          the Vivado build process completed. This contains reports,
          ``stdout`` from the build, and the final tar file produced by
          Vivado. This also contains a copy of the generated Verilog
          (``FireSim-generated.sv``) used to produce this build.

    .. tab:: XDMA-based On-Prem.

        The Vivado project collateral that built the FPGA image, in the
        state it was in when the Vivado build process completed. This
        contains reports, ``stdout`` from the build, and the final
        ``bitstream_tar`` bitstream/metadata file produced by Vivado. This
        also contains a copy of the generated Verilog
        (``FireSim-generated.sv``) used to produce this build.

    .. tab:: Vitis-based On-Prem.

        The Vitis project collateral that built the FPGA image, in the
        state it was in when the Vitis build process completed. This
        contains reports, ``stdout`` from the build, and the final
        ``bitstream_tar`` produced from the Vitis-generated ``xclbin``
        bitstream. This also contains a copy of the generated Verilog
        (``FireSim-generated.sv``) used to produce this build.

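
To locate the results directory for a given build, a small helper along these lines may be handy. It is only a sketch: the name ``latest_build_dir`` is hypothetical, it assumes that ``LAUNCHTIME`` prefixes sort chronologically as plain strings, and the ``*-BUILD_NAME`` pattern will also match other build names that end in the same suffix.

```shell
# Hypothetical helper: print the newest results-build directory for a build name.
# Assumes LAUNCHTIME prefixes sort chronologically as plain strings.
latest_build_dir() {
    local build_name="$1"
    local results_dir="${2:-firesim/deploy/results-build}"
    local dir latest=""
    for dir in "$results_dir"/*-"$build_name"; do
        [ -d "$dir" ] || continue
        latest="$dir"   # globs expand in sorted order, so the last match wins
    done
    [ -n "$latest" ] || return 1
    echo "$latest"
}
```

On F1, for example, ``cat "$(latest_build_dir BUILD_NAME)/AGFI_INFO"`` would then print the AFI information for the most recent build with that name.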
If this command is cancelled by a SIGINT, it will prompt for confirmation that you want
to terminate the build instances. If you respond in the affirmative, it will move
forward with the termination. If you do not want to have to confirm the termination
(e.g. you are using this command in a script), you can give the command the
``--forceterminate`` command line argument. For example, the following will terminate
all build instances in the build farm without prompting for confirmation if a SIGINT is
received:

.. code-block:: bash

    firesim buildbitstream --forceterminate

.. _firesim-builddriver:
``firesim builddriver``
-----------------------
For FPGA-based simulations (when ``metasimulation_enabled`` is ``false`` in
``config_runtime.yaml``), this command builds the host-side simulation driver
without requiring any simulation hosts to be launched or reachable. For complicated
designs, running this before ``firesim launchrunfarm`` can reduce the time FPGA hosts
spend idling while waiting for the driver build.
For metasimulations (when ``metasimulation_enabled`` is ``true`` in
``config_runtime.yaml``), this command will build the entire software simulator without
requiring any simulation hosts to be launched or reachable. This is useful, for example,
if you are using FireSim metasimulations as your primary simulation tool while
developing target RTL, since it allows you to run the Chisel build flow and iterate on
your design without launching/setting up extra machines to run simulations.
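
For FPGA-based simulations on F1, the ordering suggested above (build the driver first, then launch hosts) could be scripted roughly as follows. This is a sketch only: ``prebuild_then_launch`` and the ``FIRESIM_CMD`` dry-run variable are hypothetical names, and it assumes both commands return a non-zero exit status on failure.

```shell
# Hypothetical wrapper; FIRESIM_CMD exists only so the sequence can be dry-run.
prebuild_then_launch() {
    local fs="${FIRESIM_CMD:-firesim}"
    # Build the driver first, while no FPGA hosts are running yet.
    "$fs" builddriver || return 1
    # Only launch the run farm once the driver build has succeeded.
    "$fs" launchrunfarm
}
```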
.. _firesim-tar2afi:
``firesim tar2afi``
-------------------
.. note::
Can only be used for the F1 platform.
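This command can be used to run only steps 9 and 10 of the F1 build process listed
under :ref:`firesim-buildbitstream` (submitting the tar file to the AWS backend and
waiting for the AFI) for an aborted ``firesim buildbitstream`` run that has been
manually corrected. ``firesim tar2afi`` assumes that you have a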
``firesim/deploy/results-build/LAUNCHTIME-CONFIG_TRIPLET-BUILD_NAME/cl_firesim``
directory tree that can be submitted to the AWS backend for conversion to an AFI.
When using this command, you also need to provide the ``--launchtime LAUNCHTIME``
command line argument, specifying an already existing ``LAUNCHTIME``.
This command will run for the configurations specified in :ref:`config-build` and
:ref:`config-build-recipes`, as with :ref:`firesim-buildbitstream`. You will likely
want to comment out the build recipe names that already completed the
:ref:`firesim-buildbitstream` process successfully before running this command.
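
As a sketch of how this might be wrapped, the helper below checks that a ``cl_firesim`` directory exists for the given ``LAUNCHTIME`` before invoking ``firesim tar2afi``. The function name ``resume_afi_creation`` and the ``FIRESIM_CMD`` dry-run variable are hypothetical; the ``--launchtime`` flag and the directory layout are the ones described above.

```shell
# Hypothetical wrapper around `firesim tar2afi`; FIRESIM_CMD is for dry runs only.
resume_afi_creation() {
    local launchtime="$1"
    local results_dir="${2:-firesim/deploy/results-build}"
    local dir found=0
    # Check that at least one build from this launch left a cl_firesim tree.
    for dir in "$results_dir/$launchtime"-*/cl_firesim; do
        if [ -d "$dir" ]; then
            found=1
        fi
    done
    if [ "$found" -ne 1 ]; then
        echo "no cl_firesim directory for launchtime $launchtime" >&2
        return 1
    fi
    "${FIRESIM_CMD:-firesim}" tar2afi --launchtime "$launchtime"
}
```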
.. _firesim-shareagfi:
``firesim shareagfi``
---------------------
.. note::
Can only be used for the F1 platform.
This command allows you to share AGFIs that you have already built (that are listed in
:ref:`config-hwdb`) with other users. It will take the named hardware configurations
that you list in the ``agfis_to_share`` section of ``config_build.yaml``, grab the
respective AGFIs for each from ``config_hwdb.yaml``, and share them across all F1
regions with the users listed in the ``share_with_accounts`` section of
``config_build.yaml``. You can also specify ``public: public`` in
``share_with_accounts`` to make the AGFIs public.
You must own the AGFIs in order to do this -- this will NOT let you share AGFIs that
someone else owns and gave you access to.
.. _firesim-launchrunfarm:
``firesim launchrunfarm``
-------------------------
.. note::
Can only be used for the F1 platform.
This command launches a **Run Farm** on AWS EC2 on which you run simulations. Run farms
consist of a set of **run farm instances** that can be spawned on AWS EC2. The
``run_farm`` mapping in ``config_runtime.yaml`` determines the run farm used and its
configuration (see :ref:`config-runtime`). The ``base_recipe`` key/value pair specifies
the default set of arguments to use for a particular run farm type. To change the run
farm type, a new ``base_recipe`` file must be provided from ``deploy/run-farm-recipes``.
You can override the arguments given by a ``base_recipe`` by adding keys/values
to the ``recipe_arg_overrides`` mapping. These keys/values must follow the same mapping
structure as the ``args`` mapping. Overrides are applied recursively: each key/value
present in ``recipe_arg_overrides`` replaces the corresponding default argument given
by the ``base_recipe``. In the case of sequences, an overridden sequence completely
replaces the corresponding sequence in the default args.
An AWS EC2 run farm consists of AWS instances such as ``f1.16xlarge``, ``f1.4xlarge``,
``f1.2xlarge``, and ``m4.16xlarge``. Before you run the command, you define
the number of each that you want in the ``recipe_arg_overrides`` section of
``config_runtime.yaml`` or in the ``base_recipe`` itself.
A launched run farm is tagged with a ``run_farm_tag``, which is used to disambiguate
multiple parallel run farms; that is, you can have many run farms running, each running
a different experiment at the same time, each with its own unique ``run_farm_tag``. One
convenient feature to add to your AWS management panel is the column for
``fsimcluster``, which contains the ``run_farm_tag`` value. You can see how to do that
in the :ref:`fsimcluster-aws-panel` section.
The other options in the ``run_farm`` section (``run_instance_market``,
``spot_interruption_behavior``, and ``spot_max_price``) define *how* instances in the run
farm are launched. See :ref:`config-runtime` for more details on these and other
arguments.
**ERRATA**: One current requirement is that you must define a target config in the
``target_config`` section of ``config_runtime.yaml`` that does not require more
resources than the run farm you are trying to launch. Thus, you should also set up your
``target_config`` parameters before trying to launch the corresponding run farm. This
requirement will be removed in the future.
Once you set up your configuration and call ``firesim launchrunfarm``, the command will
launch the run farm. If all succeeds, you will see the command print out instance IDs
for the correct number/types of instances (you do not need to pay attention to these or
record them). If an error occurs, it will be printed to console.
.. warning::
On AWS EC2, once you run this command, your run farm will continue to run until you
call ``firesim terminaterunfarm``. This means you will be charged for the running
instances in your run farm until you call ``terminaterunfarm``. You are responsible
for ensuring that instances are only running when you want them to be by checking
the AWS EC2 Management Panel.
.. _firesim-terminaterunfarm:
``firesim terminaterunfarm``
----------------------------
.. note::
Can only be used for the F1 platform.
This command terminates some or all of the instances in the Run Farm defined in your
``config_runtime.yaml`` file by the ``run_farm`` ``base_recipe``, depending on the
command line arguments you supply.
By default, running ``firesim terminaterunfarm`` will terminate ALL instances with the
specified ``run_farm_tag``. When you run this command, it will prompt for confirmation
that you want to terminate the listed instances. If you respond in the affirmative, it
will move forward with the termination.
If you do not want to have to confirm the termination (e.g. you are using this command
in a script), you can give the command the ``--forceterminate`` command line argument.
For example, the following will TERMINATE ALL INSTANCES IN THE RUN FARM WITHOUT
PROMPTING FOR CONFIRMATION:

.. code-block:: bash

    firesim terminaterunfarm --forceterminate

The ``--terminatesome=INSTANCE_TYPE:COUNT`` flag additionally allows you to terminate
only some (``COUNT``) of the instances of a particular type (``INSTANCE_TYPE``) in a
particular Run Farm.
Here are some examples:

.. code-block:: bash

    # start with: 2 f1.16xlarges, 2 f1.2xlarges, 2 m4.16xlarges
    firesim terminaterunfarm --terminatesome=f1.16xlarge:1 --forceterminate
    # now we have: 1 f1.16xlarge, 2 f1.2xlarges, 2 m4.16xlarges

.. code-block:: bash

    # start with: 2 f1.16xlarges, 2 f1.2xlarges, 2 m4.16xlarges
    firesim terminaterunfarm --terminatesome=f1.16xlarge:1 --terminatesome=f1.2xlarge:2 --forceterminate
    # now we have: 1 f1.16xlarge, 0 f1.2xlarges, 2 m4.16xlarges

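
When the instance types and counts come from a script, the repeated ``--terminatesome`` flags can be assembled programmatically. The following bash sketch does this with a hypothetical ``terminate_some`` helper; ``FIRESIM_CMD`` is an illustrative dry-run variable, and the argument check is deliberately loose. It leaves out ``--forceterminate``, so the manager would still prompt for confirmation.

```shell
# Hypothetical helper (bash): terminate_some TYPE:COUNT [TYPE:COUNT ...]
terminate_some() {
    local spec
    local -a flags=()
    for spec in "$@"; do
        # Loosely check for INSTANCE_TYPE:COUNT, e.g. f1.2xlarge:2.
        case "$spec" in
            ?*:[0-9]*) flags+=("--terminatesome=$spec") ;;
            *) echo "bad spec: $spec" >&2; return 1 ;;
        esac
    done
    if [ "${#flags[@]}" -eq 0 ]; then
        echo "usage: terminate_some TYPE:COUNT [TYPE:COUNT ...]" >&2
        return 1
    fi
    "${FIRESIM_CMD:-firesim}" terminaterunfarm "${flags[@]}"
}
```

For instance, ``terminate_some f1.16xlarge:1 f1.2xlarge:2`` builds the same flags as the second example above, minus ``--forceterminate``.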
.. warning::
On AWS EC2, once you call ``launchrunfarm``, you will be charged for running
instances in your Run Farm until you call ``terminaterunfarm``. You are responsible
for ensuring that instances are only running when you want them to be by checking
the AWS EC2 Management Panel.
.. _firesim-infrasetup:
``firesim infrasetup``
----------------------
Once you have launched a Run Farm and set up all of your configuration options, the
``infrasetup`` command will build all components necessary to run the simulation and
deploy those components to the machines in the Run Farm. Here is a rough outline of what
the command does:

- Constructs the internal representation of your simulation. This is a tree of
  components in the simulation (simulated server blades, switches).
- For each type of server blade, rebuilds the software simulation driver by querying the
  bitstream metadata to get the build-quadruplet or using its override.
- For each type of switch in the simulation, generates the switch model binary.
- For each host instance in the Run Farm, collects information about all the resources
  necessary to run a simulation on that host instance, then copies files and flashes
  FPGAs with the required bitstream.

Details about setting up your simulation configuration can be found in
:ref:`config-runtime`.
**Once you run a simulation, you should re-run** ``firesim infrasetup`` **before
starting another one, even if it is the same exact simulation on the same Run Farm.**
You can see detailed output from an example run of ``infrasetup`` in the
:ref:`single-node-sim` and :ref:`cluster-sim` Getting Started Guides.
.. _firesim-boot:
``firesim boot``
----------------
Once you have run ``firesim infrasetup``, this command will actually start simulations.
It begins by launching all switches (if they exist in your simulation config), then
launches all server blade simulations. This simply launches simulations and then exits
-- it does not perform any monitoring.
This command is useful if you want to launch a simulation and then interact with
it by hand (i.e., by directly interacting with the console).
.. _firesim-kill:
``firesim kill``
----------------
Given a simulation configuration and simulations running on a Run Farm, this command
force-terminates all components of the simulation. Importantly, this does not allow any
outstanding changes to the filesystem in the simulated systems to be committed to the
disk image.
.. _firesim-runworkload:
``firesim runworkload``
-----------------------
This command is the standard tool that lets you launch simulations, monitor the progress
of workloads running on them, and collect results automatically when the workloads
complete. To call this command, you must have first called ``firesim infrasetup`` to
set up all required simulation infrastructure on the remote nodes.
This command will first create a directory in ``firesim/deploy/results-workload/`` named
``LAUNCH_TIME-WORKLOADNAME``, where results will be copied as simulations
complete. This command will then automatically call ``firesim boot`` to start
simulations. Then, it polls all the instances in the Run Farm every 10 seconds to
determine the state of the simulated system. If it notices that a simulation has
shut down (i.e., the simulation disappears from the output of ``screen -ls``), it will
automatically copy back all results from the simulation, as defined in the workload
configuration (see the :ref:`deprecated-defining-custom-workloads` section).
For non-networked simulations, it will wait for ALL simulations to complete (copying
back results as each workload completes), then exit.
For globally-cycle-accurate networked simulations, the global simulation will stop when
any single node powers off. Thus, for these simulations, ``runworkload`` will copy back
results from all nodes and force them to terminate by calling ``kill`` when ANY SINGLE
ONE of them shuts down cleanly.
A simulation shuts down cleanly when the workload running on the simulator calls
``poweroff``.
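
Putting the run-related tasks together, an unattended F1 experiment might be scripted along the following lines. This is a minimal sketch rather than an official workflow: ``run_experiment`` and ``FIRESIM_CMD`` are hypothetical names, it assumes each manager command returns a non-zero exit status on failure, and it does not handle interrupts such as SIGINT.

```shell
# Hypothetical end-to-end wrapper for an F1 run; FIRESIM_CMD is for dry runs only.
run_experiment() {
    local fs="${FIRESIM_CMD:-firesim}"
    local status=0
    # Launch hosts, set them up, and run the workload; remember any failure.
    { "$fs" launchrunfarm && "$fs" infrasetup && "$fs" runworkload; } || status=$?
    # Always attempt termination so instances do not keep accruing charges.
    "$fs" terminaterunfarm --forceterminate || status=$?
    return "$status"
}
```

Because ``terminaterunfarm --forceterminate`` runs without a confirmation prompt, it is still worth checking the AWS EC2 Management Panel afterwards to confirm that no instances were left running.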
.. _firesim-runcheck:
``firesim runcheck``
--------------------
This command is provided to let you debug configuration options without launching
instances. In addition to the output produced at command line/in the log, you will find
a PDF diagram of the topology you specify, annotated with information about the
workloads, hardware configurations, and abstract host mappings for each simulation (and
optionally, switch) in your design. These diagrams are located in
``firesim/deploy/generated-topology-diagrams/``, named after your topology.
Here is an example of such a diagram (click to expand/zoom; it will likely be illegible
without expanding):

.. figure:: runcheck_example.png
    :scale: 50 %
    :alt: Example diagram from running ``firesim runcheck``

    Example diagram for an 8-node cluster with one ToR switch

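
A quick way to check a configuration and then see which diagrams were produced is sketched below. The helper name ``runcheck_and_list_diagrams`` and the ``FIRESIM_CMD`` dry-run variable are hypothetical; the output directory is the one named above, relative to wherever your ``firesim`` checkout lives.

```shell
# Hypothetical helper: run the check, then list the generated topology diagrams.
runcheck_and_list_diagrams() {
    local deploy_dir="${1:-firesim/deploy}"
    "${FIRESIM_CMD:-firesim}" runcheck || return 1
    # Diagrams are named after the topology; list whatever is there.
    ls -1 "$deploy_dir/generated-topology-diagrams/"
}
```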
.. _firesim-enumeratefpgas:
``firesim enumeratefpgas``
--------------------------
.. note::
Can only be used for XDMA-based On-Premises platforms.
This command should be run once for each on-premises Run Farm you plan to use that
contains XDMA-based FPGAs. When run, the command will generate a file
(``/opt/firesim-db.json``) on each Run Farm Machine in the run farm that contains a
mapping from the FPGA ID used for JTAG programming to the PCIe ID used to run
simulations for each FPGA attached to the machine.
If you ever change the physical layout of a Run Farm Machine in your Run Farm (e.g.,
which PCIe slot the FPGAs are attached to), you will need to re-run this command.