# This code is part of Qiskit.
#
# (C) Copyright IBM 2017.
#
# This code is licensed under the Apache License, Version 2.0. You may
# obtain a copy of this license in the LICENSE.txt file in the root directory
# of this source tree or at http://www.apache.org/licenses/LICENSE-2.0.
#
# Any modifications or derivative works of this code must retain this
# copyright notice, and modified files need to carry a notice indicating
# that they have been altered from the originals.

Revise travis configuration, using cmake
* Revise the travis configuration to use `cmake` for the several
targets, and use "stages" instead of parallel jobs:
* define three stages that are executed if the previous one succeeds:
1. "linter and pure python test": executes the linter and a test
without compiling the binaries, with the idea of providing quick
feedback for PRs.
2. "test": launches the tests, including the compilation of binaries,
under GNU/Linux Python 3.6 and osx Python 3.6.
3. "deploy doc and pypi": for the stable branch, deploys the docs
to the landing page and, when using a specific commit message,
builds the GNU/Linux and osx wheels, uploading them to test.pypi.
* use yaml anchors and definitions to avoid repeating code (and to
work around travis limitations).
* Modify the `cmake` configuration to accommodate the stages flow:
* allow conditional creation of compilation and QA targets, mainly
to save time in some jobs.
* move the tests to `cmake/tests.cmake`.
* Update the tests:
* add a `requires_qe_access` decorator that retrieves QE_TOKEN and
QE_URL and appends them to the parameters in a unified manner.
* add an environment variable `SKIP_ONLINE_TESTS` that allows skipping
the tests that need network access.
* replace `TRAVIS_FORK_PULL_REQUEST` with the previous two
mechanisms, adding support for AppVeyor as well.
* fix a problem with matplotlib under osx headless, effectively
skipping `test_visualization.py` during the travis osx jobs.
* Move Sphinx to `requirements-dev.txt`.
2018-02-13 05:11:28 +08:00
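The `requires_qe_access` decorator and `SKIP_ONLINE_TESTS` switch described above can be sketched as follows; the credential-loading details are an assumption for illustration, not the actual implementation:

```python
import functools
import os
import unittest


def requires_qe_access(func):
    """Sketch: skip the test when online tests are disabled; otherwise
    retrieve QE_TOKEN and QE_URL and append them to the test's parameters."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        if os.getenv("SKIP_ONLINE_TESTS"):
            raise unittest.SkipTest("online tests disabled")
        # Unified credential injection (hypothetical mechanism).
        kwargs["qe_token"] = os.getenv("QE_TOKEN", "")
        kwargs["qe_url"] = os.getenv("QE_URL", "")
        return func(*args, **kwargs)

    return wrapper
```

The same decorator works on both Travis and AppVeyor because it only reads generic environment variables rather than CI-specific ones like `TRAVIS_FORK_PULL_REQUEST`.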
"The Qiskit Terra setup file."

import os

from setuptools import setup
Implement multithreaded stochastic swap in rust (#7658)
* Implement multithreaded stochastic swap in rust
This commit is a rewrite of the core swap trials functionality in the
StochasticSwap transpiler pass. Previously this core routine was written
using Cython (see #1789) which had great performance, but that
implementation was single threaded. The core of the stochastic swap
algorithm is by its nature well suited to parallel execution: it
attempts a number of random trials and then picks the best result
from all the trials and uses that for that layer. These trials can
easily be run in parallel as there is no data dependency between the
trials (there are shared inputs but read-only). As the algorithm
generally scales exponentially the speed up from running the trials in
parallel can offset this and improve the scaling of the pass. Running
the pass in parallel was previously tried in #4781 using Python
multiprocessing but the overhead of launching an additional process and
serializing the input arrays for each trial was significantly larger
than the speed gains. To run the algorithm efficiently in parallel
multithreading is needed to leverage shared memory on shared inputs.
This commit rewrites the cython routine using rust. This was done for
two reasons. The first is that rust's safety guarantees make dealing
with and writing parallel code much easier and safer. It's also
multiplatform because the rust language supports native threading
primitives in the language. The second is that parallel cython code
written with open-mp has limitations, mainly on windows. In
practice it was also difficult to write and maintain parallel cython
code as it has very strict requirements on python and c code
interactions. It was much faster and easier to port it to rust and the
performance for each iteration (outside of parallelism) is the same (in
some cases marginally faster) in rust. The implementation here reuses
the data structures that the previous cython implementation introduced
(mainly flattening all the terra objects into 1d or 2d numpy arrays for
efficient access from C).
The speedups from this PR can be significant, calling transpile() on a
400 qubit (with a depth of 10) QV model circuit targeting a 409 qubit heavy
hex coupling map goes from ~200 seconds with the single threaded cython
to ~60 seconds with this PR locally on a 32 core system. When transpiling
a 1000 qubit (also with a depth of 10) QV model circuit targeting a 1081
qubit heavy hex coupling map, the time goes from ~6500 seconds to ~720
seconds.
The tradeoff with this PR is for local qiskit-terra development a rust
compiler needs to be installed. This is made trivial using rustup
(https://rustup.rs/), but it is an additional burden and one that we
might not want to make. If so we can look at turning this PR into a
separate repository/package that qiskit-terra can depend on. The
tradeoff here is that we'll be adding friction to the api boundary
between the pass and the core swap trials interface. But, it does ease
the dependency on development for qiskit-terra.
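The trial structure described above (independent seeded trials over shared read-only input, keep the best result) can be sketched in Python. This is only an illustration of the pattern: `one_trial` is a made-up stand-in for the Rust routine, and in CPython the GIL limits true CPU parallelism in a way the Rust threads do not suffer from:

```python
import random
from concurrent.futures import ThreadPoolExecutor


def one_trial(seed, costs):
    """One illustrative random trial: score a random ordering of the
    shared, read-only cost list."""
    rng = random.Random(seed)
    order = list(range(len(costs)))
    rng.shuffle(order)
    total = sum(costs[i] * pos for pos, i in enumerate(order))
    return total, order


def best_of_trials(costs, num_trials=8):
    # Trials share `costs` read-only and have no data dependency on each
    # other, so they can run concurrently; the lowest total wins.
    with ThreadPoolExecutor() as pool:
        results = list(pool.map(lambda s: one_trial(s, costs), range(num_trials)))
    return min(results, key=lambda r: r[0])
```

Because the seeds are fixed and `pool.map` preserves input order, the selected best result is deterministic even though the trials run concurrently.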
* Sanitize packaging to support future modules
This commit fixes how we package the compiled rust module in
qiskit-terra. As a single rust project only gives us a single compiled
binary output we can't use the same scheme we did previously with cython
with a separate dynamic lib file for each module. This shifts us to
making the rust code build a `qiskit._accelerate` module and in that we
have submodules for everything we need from compiled code. For this PR
there is only one submodule, `stochastic_swap`, so for example the
parallel swap_trials routine can be imported from
`qiskit._accelerate.stochastic_swap.swap_trials`. In the future we can
have additional submodules for other pieces of compiled code in qiskit.
For example, the likely next candidate is the pauli expectation value
cython module, which we'll likely port to rust and also make parallel
(for sufficiently large number of qubits). In that case we'd add a new
submodule for that functionality.
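Under that scheme, consumers import the compiled routine from the `_accelerate` submodule. A hedged sketch, guarded so it also runs in environments where qiskit or its compiled extension is not available:

```python
# Import path described in the commit message above; the fallback is an
# illustrative convention, not part of the actual package.
try:
    from qiskit._accelerate.stochastic_swap import swap_trials
except ImportError:
    swap_trials = None  # extension not built/installed; caller must handle this
```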
* Adjust random normal distribution to use correct mean
This commit corrects the use of the normal distribution to have the mean
set to 1.0. Previously we were doing this out of band for each value by
adding 1 to the random value which wasn't necessary because we could
just generate it with a mean of 1.0.
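The underlying equivalence: sampling a normal with mean 0 and shifting by 1 gives exactly the same values as sampling with mean 1 directly. In stdlib Python terms:

```python
import random

rng = random.Random(42)
# Old approach: sample with mean 0.0 and add 1 out of band for each value.
shifted = [rng.gauss(0.0, 1.0) + 1.0 for _ in range(10_000)]

rng = random.Random(42)
# New approach: generate with mean 1.0 directly; same values, less code.
direct = [rng.gauss(1.0, 1.0) for _ in range(10_000)]
```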
* Remove unnecessary extra scope from locked read
This commit removes an unnecessary extra scope around the locked read for
where we store the best solution. The scope was previously there to
release the lock after we check if there is a solution or not. However
this wasn't actually needed as we can just do the check inline and the
lock will release after the condition block.
* Remove unnecessary explicit type from opt_edges variable
* Fix indices typo in NLayout constructor
Co-authored-by: Jake Lishman <jake@binhbar.com>
* Remove explicit lifetime annotation from swap_trials
Previously the swap_trials() function had an explicit lifetime
annotation `'p` which wasn't necessary because the compiler can
determine this on its own. Normally when dealing with numpy views and a
Python object (i.e. a GIL handle) we need a lifetime annotation to tell
the rust compiler the numpy view and the python gil handle will have the
same lifetime. But since swap_trials doesn't take a gil handle and
operates purely in rust we don't need this lifetime and the rust
compiler can deal with the lifetime of the numpy views on their own.
* Use sum() instead of fold()
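The same simplification expressed in Python terms: a manual fold that only accumulates addition is just `sum()`:

```python
from functools import reduce

values = [0.5, 1.25, 2.0]
# Manual fold (Rust's fold() analogue) versus the built-in reduction.
folded = reduce(lambda acc, x: acc + x, values, 0.0)
summed = sum(values)
```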
* Fix lint and add rust style and lint checks to CI
This commit fixes the python lint failures and also updates the ci
configuration for the lint job to also run rust's style and lint
enforcement.
* Fix returned layout mapping from NLayout
This commit fixes the output list from the `layout_mapping()`
method of `NLayout`. Previously, it would return the wrong indices;
it should be a list of virtual -> physical qubit pairs. This commit
corrects this error.
Co-authored-by: georgios-ts <45130028+georgios-ts@users.noreply.github.com>
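The corrected contract, illustrated with a toy layout; `ToyLayout` is a hypothetical stand-in for the Rust `NLayout` class, not its real API:

```python
class ToyLayout:
    """Toy stand-in for NLayout: stores virtual -> physical and the inverse."""

    def __init__(self, virt_to_phys):
        self.virt_to_phys = list(virt_to_phys)
        self.phys_to_virt = [0] * len(virt_to_phys)
        for virt, phys in enumerate(virt_to_phys):
            self.phys_to_virt[phys] = virt

    def layout_mapping(self):
        # The fix: return (virtual, physical) pairs, not the inverse table.
        return list(enumerate(self.virt_to_phys))
```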
* Tweak tox configuration to try and reliably build rust extension
* Make swap_trials parallelization configurable
This commit makes the parallelization of the swap_trials() configurable.
This is done in two ways. First, a new argument parallel_threshold is
added which takes an optional int which is the number of qubits to
switch between a parallel and serial version. The second is that it
takes into account the state of the QISKIT_IN_PARALLEL environment
variable. This variable is set to TRUE by parallel_map() when we're
running in a multiprocessing context. In those cases also running
stochastic swap in parallel will likely just cause too much load as
we're potentially oversubscribing work to the number of available CPUs.
So, if QISKIT_IN_PARALLEL is set to True we run swap_trials serially.
* Revert "Make swap_trials parallelization configurable"
This reverts commit 57790c84b03da10fd7296c57b38b54c5bccebf4c. That
commit attempted to solve some issues in test running, mainly around
multiple parallel dispatch causing excess load. But in practice it was
broken and caused more issues than it fixed. We'll investigate and add
control for the parallelization in a future commit separately after all
the tests are passing so we have a good baseline.
* Add docs to swap_trials() and remove unnecessary num_gates arg
* Fix race condition leading to non-deterministic behavior
Previously, in the case of circuits that had multiple best possible
depth == 1 solutions for a layer, there was a race condition in the fast
exit path between the threads which could lead to a non-deterministic
result even with a fixed seed. The output was always valid, but which
result was dependent on which parallel thread with an ideal solution
finished last and wrote to the locked best result last. This was causing
weird non-deterministic test failures for some tests because of #1794 as
the exact match result would change between runs. This could be a bigger
issue because user expectations are that with a fixed seed set on the
transpiler that the output circuit will be deterministically
reproducible.
To address this issue, this commit trades off some performance to
ensure we're always returning a deterministic result in this case. This
is accomplished by checking whether a depth == 1 solution has already
been found by another trial thread, and only acting (either exiting
early or updating the already found depth == 1 solution) if that
solution's trial number is less than this thread's trial number.
This does limit the effectiveness of the fast exit, but in practice it
should hopefully not affect the speed too much.
As part of this commit some tests are updated because the new
deterministic behavior is slightly different from the previous results
from the cython serial implementation. I manually verified that the
new output circuits are still valid (it also looks like the quality
of the results in some of those cases improved, but this is strictly
anecdotal and shouldn't be taken as a general trend with this PR).
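The determinism rule described above amounts to: among equally good solutions, the one from the lowest trial number always wins, regardless of thread completion order. A pure-Python sketch (names illustrative):

```python
def pick_deterministic(results):
    """results: iterable of (trial_number, depth, solution) tuples produced
    by threads in arbitrary completion order. Selecting by (depth, trial
    number) makes the winner independent of that order."""
    return min(results, key=lambda r: (r[1], r[0]))
```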
* Apply suggestions from code review
Co-authored-by: georgios-ts <45130028+georgios-ts@users.noreply.github.com>
* Fix compiler errors in previous commit
* Revert accidental commit of parallel reduction in compute_cost
This was only for local testing, to prove it was a bad idea, and was
accidentally included in the branch. We should not nest the parallel
execution like this.
* Eliminate short circuit for depth == 1 swap_trial() result
This commit eliminates the short circuit fast return in swap_trial()
when another trial thread has found an ideal solution. Trying to do this
in a parallel context is tricky to make deterministic because in cases
of >1 depth == 1 solutions there is an inherent race condition between
the threads for writing out their depth == 1 result to the shared
location. Different strategies were tried to make this reliably
deterministic but there was still a race condition. Since this was just a
performance optimization to avoid doing unnecessary work this commit
removes this step. Weighing improved performance against repeatability
in the output of the compiler, the reproducible results are more
important. After we've adopted a multithreaded stochastic swap we can
investigate adding this back as a potential future optimization.
* Add missing docstrings
* Add section to contributing on installing from source
* Make rust python classes pickleable
* Add rust compiler install to linux wheel jobs
* Try more tox changes to fix docs builds
* Revert "Eliminate short circuit for depth == 1 swap_trial() result"
This reverts commit c510764a770cb610661bdb3732337cd45ab587fd. The
removal there was premature and we had a fix for the non-determinism in
place, ignoring a typo which was preventing it from working.
Co-Authored-By: Georgios Tsilimigkounakis <45130028+georgios-ts@users.noreply.github.com>
* Fix submodule declaration and module attribute on rust classes
* Fix rust lint
* Fix docs job definition
* Disable multiprocessing parallelism in unit tests
This commit disables the multiprocessing based parallelism when running
unittest jobs in CI. We historically have defaulted the use of
multiprocessing in environments only where the "fork" start method is
available because this has the best performance and has no caveats
around how it is used by users (you don't need an
`if __name__ == "__main__"` guard). However, the use of the "fork"
method isn't always 100% reliable (see
https://bugs.python.org/issue40379), which we saw on Python 3.9 #6188.
In unittest CI (and tox) by default we use stestr which spawns (not using
fork) parallel workers to run tests in parallel. With this PR this means
in unittest we're now running multiple test runner subprocesses, which
are executing parallel dispatched code using multiprocessing's fork
start method, which is executing multithreaded rust code. These three layers
of nesting fairly reliably hang, as Python's fork doesn't seem to
be able to handle this many layers of nested parallelism. There are two
ways I've been able to fix this. The first is to change the start method
used by `parallel_map()` to either "spawn" or "forkserver"; neither of
these suffers from random hanging. However, doing this in the
unittest context causes significant overhead and slows down test
execution significantly. The other is to just disable the
multiprocessing, which fixes the hanging and doesn't impact runtime
performance significantly (and might actually help in CI, since we're not
oversubscribing the limited resources).
As I have not been able to reproduce `parallel_map()` hanging in
a standalone context with multithreaded stochastic swap this commit opts
for just disabling multiprocessing in CI and documenting the known issue
in the release notes as this is the simpler solution. It's unlikely that
users will nest parallel processes as it typically hurts performance
(and parallel_map() actively guards against it), we only did it in
testing previously because the tests which relied on it were a small
portion of the test suite (roughly 65 tests) and typically did not have
a significant impact on the total throughput of the test suite.
* Fix typo in azure pipelines config
* Remove unecessary extension compilation for image tests
* Add test script to explicitly verify parallel dispatch
In an earlier commit we disabled the use of parallel dispatch in
parallel_map() to avoid a bug in cpython associated with their fork()
based subprocess launch. Doing this works around the bug which was
reliably triggered by running multiprocessing in parallel subprocesses.
It also has the side benefit of providing a ~2x speed up for test suite
execution in CI. However, this meant we lost our test coverage in CI for
running parallel_map() with actual multiprocessing based parallel
dispatch. To ensure we don't inadvertently regress this code path
moving forward this commit adds a dedicated test script which runs a
simple transpilation in parallel and verifies that everything works as
expected with the default parallelism settings.
* Avoid multi-threading when run in a multiprocessing context
This commit adds a switch on running between a single threaded and a
multithreaded variant of the swap_trials loop based on whether the
QISKIT_IN_PARALLEL flag is set. If QISKIT_IN_PARALLEL is set to TRUE
this means the `parallel_map()` function is running in the outer python
context and we're running in multiprocessing already. This means we do
not want to be running in multiple threads generally as that will lead
to potential resource exhaustion by spawning n processes, each potentially
running with m threads where `n` is `min(num_phys_cpus, num_tasks)` and
`m` is num_logical_cpus (although only
`min(num_logical_cpus, num_trials)` will be active); on the typical
system there aren't enough cores to leverage both multiprocessing and
multithreading. However, in case a user does have such an environment
they can set the `QISKIT_FORCE_THREADS` env variable to `TRUE` which
will use threading regardless of the status of `QISKIT_IN_PARALLEL`.
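The switching logic described above can be sketched as a small helper; the function name is illustrative, but the two environment variables are the ones named in the commit:

```python
import os


def use_threads():
    """Decide between the multithreaded and serial swap_trials path.

    QISKIT_FORCE_THREADS=TRUE always opts in to threading; otherwise
    threading is disabled when QISKIT_IN_PARALLEL=TRUE, i.e. when
    parallel_map() is already running us inside a subprocess.
    """
    if os.getenv("QISKIT_FORCE_THREADS", "FALSE").upper() == "TRUE":
        return True
    return os.getenv("QISKIT_IN_PARALLEL", "FALSE").upper() != "TRUE"
```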
* Apply suggestions from code review
Co-authored-by: Jake Lishman <jake@binhbar.com>
* Minor fixes from review comments
This commit fixes some minor details found during code review. It
expands the section on building from source to explain how to build a
release optimized binary with editable mode, makes the QISKIT_PARALLEL
env variable usage consistent across all jobs, and adds a missing
shebang to the `install_rust.sh` script which is used to install rust in
the manylinux container environment.
* Simplify tox configuration
In earlier commits the tox configuration was changed to try and fix the
docs CI job by going to great effort to try and enforce that
setuptools-rust was installed in all situations, even before it was
actually needed. However, the problem with the docs ci job was unrelated
to the tox configuration and this reverts the configuration to something
that works with more versions of tox and setuptools-rust.
* Add missing pieces of cargo configuration
Co-authored-by: Jake Lishman <jake@binhbar.com>
Co-authored-by: georgios-ts <45130028+georgios-ts@users.noreply.github.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
2022-03-01 05:49:54 +08:00
from setuptools_rust import Binding, RustExtension

# Most of this configuration is managed by `pyproject.toml`. This only includes the extra bits to
# configure `setuptools-rust`, because we do a little dynamic trick with the debug setting, and we
# also want an explicit `setup.py` file to exist so we can manually call
#
# python setup.py build_rust --inplace --release
#
# to make optimised Rust components even for editable releases, which would otherwise be quite
# unergonomic to do.
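A hedged sketch of what the `setuptools-rust` wiring described in that comment might look like; the Cargo manifest path and the debug toggle are assumptions for illustration, not this file's exact contents:

```python
import os

from setuptools import setup
from setuptools_rust import Binding, RustExtension

setup(
    rust_extensions=[
        RustExtension(
            "qiskit._accelerate",  # compiled module name from the commit message
            "Cargo.toml",          # assumed manifest path
            binding=Binding.PyO3,
            # The "dynamic trick": build in release mode unless a debug build
            # is explicitly requested via the environment (assumed switch).
            debug=os.getenv("RUST_DEBUG") == "1",
        )
    ],
)
```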
Implement multithreaded stochastic swap in rust (#7658)
* Implement multithreaded stochastic swap in rust
This commit is a rewrite of the core swap trials functionality in the
StochasticSwap transpiler pass. Previously this core routine was written
using Cython (see #1789) which had great performance, but that
implementation was single threaded. The core of the stochastic swap
algorithm by it's nature is well suited to be executed in parallel, it
attempts a number of random trials and then picks the best result
from all the trials and uses that for that layer. These trials can
easily be run in parallel as there is no data dependency between the
trials (there are shared inputs but read-only). As the algorithm
generally scales exponentially the speed up from running the trials in
parallel can offset this and improve the scaling of the pass. Running
the pass in parallel was previously tried in #4781 using Python
multiprocessing but the overhead of launching an additional process and
serializing the input arrays for each trial was significantly larger
than the speed gains. To run the algorithm efficiently in parallel
multithreading is needed to leverage shared memory on shared inputs.
This commit rewrites the cython routine using rust. This was done for
two reasons. The first is that rust's safety guarantees make dealing
with and writing parallel code much easier and safer. It's also
multiplatform because the rust language supports native threading
primatives in language. The second is while writing parallel cython
code using open-mp there are limitations with it, mainly on windows. In
practice it was also difficult to write and maintain parallel cython
code as it has very strict requirements on python and c code
interactions. It was much faster and easier to port it to rust and the
performance for each iteration (outside of parallelism) is the same (in
some cases marginally faster) in rust. The implementation here reuses
the data structures that the previous cython implementation introduced
(mainly flattening all the terra objects into 1d or 2d numpy arrays for
efficient access from C).
The speedups from this PR can be significant, calling transpile() on a
400 qubit (with a depth of 10) QV model circuit targetting a 409 heavy
hex coupling map goes from ~200 seconds with the single threaded cython
to ~60 seconds with this PR locally on a 32 core system, When transpiling
a 1000 qubit (also with a depth of 10) QV model circuit targetting a 1081
qubit heavy hex coupling map goes from taking ~6500 seconds to ~720
seconds.
The tradeoff with this PR is for local qiskit-terra development a rust
compiler needs to be installed. This is made trivial using rustup
(https://rustup.rs/), but it is an additional burden and one that we
might not want to make. If so we can look at turning this PR into a
separate repository/package that qiskit-terra can depend on. The
tradeoff here is that we'll be adding friction to the api boundary
between the pass and the core swap trials interface. But, it does ease
the dependency on development for qiskit-terra.
* Sanitize packaging to support future modules
This commit fixes how we package the compiled rust module in
qiskit-terra. As a single rust project only gives us a single compiled
binary output we can't use the same scheme we did previously with cython
with a separate dynamic lib file for each module. This shifts us to
making the rust code build a `qiskit._accelerate` module and in that we
have submodules for everything we need from compiled code. For this PR
there is only one submodule, `stochastic_swap`, so for example the
parallel swap_trials routine can be imported from
`qiskit._accelerate.stochastic_swap.swap_trials`. In the future we can
have additional submodules for other pieces of compiled code in qiskit.
For example, the likely next candidate is the pauli expectation value
cython module, which we'll likely port to rust and also make parallel
(for sufficiently large number of qubits). In that case we'd add a new
submodule for that functionality.
* Adjust random normal distribution to use correct mean
This commit corrects the use of the normal distribution to have the mean
set to 1.0. Previously we were doing this out of band for each value by
adding 1 to the random value which wasn't necessary because we could
just generate it with a mean of 1.0.
* Remove unecessary extra scope from locked read
This commit removes an unecessary extra scope around the locked read for
where we store the best solution. The scope was previously there to
release the lock after we check if there is a solution or not. However
this wasn't actually needed as we can just do the check inline and the
lock will release after the condition block.
* Remove unecessary explicit type from opt_edges variable
* Fix indices typo in NLayout constructor
Co-authored-by: Jake Lishman <jake@binhbar.com>
* Remove explicit lifetime annotation from swap_trials
Previously the swap_trials() function had an explicit lifetime
annotation `'p` which wasn't necessary because the compiler can
determine this on it's own. Normally when dealing with numpy views and a
Python object (i.e. a GIL handle) we need a lifetime annotation to tell
the rust compiler the numpy view and the python gil handle will have the
same lifetime. But since swap_trials doesn't take a gil handle and
operates purely in rust we don't need this lifetime and the rust
compiler can deal with the lifetime of the numpy views on their own.
* Use sum() instead of fold()
* Fix lint and add rust style and lint checks to CI
This commit fixes the python lint failures and also updates the ci
configuration for the lint job to also run rust's style and lint
enforcement.
* Fix returned layout mapping from NLayout
This commit fixes the output list from the `layout_mapping()`
method of `NLayout`. Previously, it incorrectly would return the
wrong indices it should be a list of virtual -> physical to
qubit pairs. This commit corrects this error
Co-authored-by: georgios-ts <45130028+georgios-ts@users.noreply.github.com>
* Tweak tox configuration to try and reliably build rust extension
* Make swap_trials parallelization configurable
This commit makes the parallelization of the swap_trials() configurable.
This is dones in two ways, first a new argument parallel_threshold is
added which takes an optional int which is the number of qubits to
switch between a parallel and serial version. The second is that it
takes into account the the state of the QISKIT_IN_PARALLEL environment
variable. This variable is set to TRUE by parallel_map() when we're
running in a multiprocessing context. In those cases also running
stochastic swap in parallel will likely just cause too much load as
we're potentially oversubscribing work to the number of available CPUs.
So, if QISKIT_IN_PARALLEL is set to True we run swap_trials serially.
* Revert "Make swap_trials parallelization configurable"
This reverts commit 57790c84b03da10fd7296c57b38b54c5bccebf4c. That
commit attempted to sovle some issues in test running, mainly around
multiple parallel dispatch causing exceess load. But in practice it was
broken and caused more issues than it fixed. We'll investigate and add
control for the parallelization in a future commit separately after all
the tests are passing so we have a good baseline.
* Add docs to swap_trials() and remove unecessary num_gates arg
* Fix race condition leading to non-deterministic behavior
Previously, in the case of circuits that had multiple best possible
depth == 1 solutions for a layer, there was a race condition in the fast
exit path between the threads which could lead to a non-deterministic
result even with a fixed seed. The output was always valid, but which
result was dependent on which parallel thread with an ideal solution
finished last and wrote to the locked best result last. This was causing
weird non-deterministic test failures for some tests because of #1794 as
the exact match result would change between runs. This could be a bigger
issue because user expectations are that with a fixed seed set on the
transpiler that the output circuit will be deterministically
reproducible.
To address this is issue this commit trades off some performance to
ensure we're always returning a deterministic result in this case. This
is accomplished by updating/checking if a depth==1 solution has been
found in another trial thread we only act (so either exit early or
update the already found depth == 1 solution) if that solution already
found has a trial number that is less than this thread's trial number.
This does limit the effectiveness of the fast exit, but in practice it
should hopefully not effect the speed too much.
As part of this commit some tests are updated because the new
deterministic behavior is slightly different from the previous results
from the cython serial implementation. I manually verified that the
new output circuits are still valid (it also looks like the quality
of the results in some of those cases improved, but this is strictly
anecdotal and shouldn't be taken as a general trend with this PR).
* Apply suggestions from code review
Co-authored-by: georgios-ts <45130028+georgios-ts@users.noreply.github.com>
* Fix compiler errors in previous commit
* Revert accidental commit of parallel reduction in compute_cost
This was only a for local testing to prove it was a bad idea and was
accidently included in the branch. We should not nest the parallel
execution like this.
* Eliminate short circuit for depth == 1 swap_trial() result
This commit eliminates the short circuit fast return in swap_trial()
when another trial thread has found an ideal solution. Trying to do this
in a parallel context is tricky to make deterministic because in cases
of >1 depth == 1 solutions there is an inherent race condition between
the threads for writing out their depth == 1 result to the shared
location. Different strategies were tried to make this reliably
deterministic but there wa still a race condition. Since this was just a
performance optimization to avoid doing unnecessary work this commit
removes this step. Weighing improved performance against repeatability
in the output of the compiler, the reproducible results are more
important. After we've adopted a multithreaded stochastic swap we can
investigate adding this back as a potential future optimization.
* Add missing docstrings
* Add section to contributing on installing form source
* Make rust python classes pickleable
* Add rust compiler install to linux wheel jobs
* Try more tox changes to fix docs builds
* Revert "Eliminate short circuit for depth == 1 swap_trial() result"
This reverts commit c510764a770cb610661bdb3732337cd45ab587fd. The
removal there was premature and we had a fix for the non-determinism in
place, ignoring a typo which was preventing it from working.
Co-Authored-By: Georgios Tsilimigkounakis <45130028+georgios-ts@users.noreply.github.com>
* Fix submodule declaration and module attribute on rust classes
* Fix rust lint
* Fix docs job definition
* Disable multiprocessing parallelism in unit tests
This commit disables the multiprocessing based parallelism when running
unittest jobs in CI. We historically have defaulted the use of
multiprocessing in environments only where the "fork" start method is
available because this has the best performance and has no caveats
around how it is used by users (you don't need an
`if __name__ == "__main__"` guard). However, the use of the "fork"
method isn't always 100% reliable (see
https://bugs.python.org/issue40379), which we saw on Python 3.9 #6188.
In unittest CI (and tox) by default we use stestr which spawns (not using
fork) parallel workers to run tests in parallel. With this PR this means
in unittest we're now running multiple test runner subprocesses, which
are executing parallel dispatched code using multiprocessing's fork
start method, which is executing multithreaded rust code. These three
layers of nesting fairly reliably hang, as Python's fork doesn't seem to
be able to handle this many layers of nested parallelism. There are two
ways I've been able to fix this: the first is to change the start method
used by `parallel_map()` to either "spawn" or "forkserver", neither of
which suffers from random hanging. However, doing this in the unittest
context causes significant overhead and slows down test execution
significantly. The other is to just disable the multiprocessing, which
fixes the hanging and doesn't impact runtime performance significantly
(and might actually help in CI, since we're not oversubscribing the
limited resources).
As I have not been able to reproduce `parallel_map()` hanging in
a standalone context with multithreaded stochastic swap this commit opts
for just disabling multiprocessing in CI and documenting the known issue
in the release notes as this is the simpler solution. It's unlikely that
users will nest parallel processes as it typically hurts performance
(and parallel_map() actively guards against it), we only did it in
testing previously because the tests which relied on it were a small
portion of the test suite (roughly 65 tests) and typically did not have
a significant impact on the total throughput of the test suite.
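The CI switch described here boils down to an environment-variable guard in front of process-based dispatch. A minimal sketch, assuming a `QISKIT_PARALLEL` variable with the semantics described above (the guard's exact name and placement in Qiskit may differ):

```python
import os


def should_use_multiprocessing(num_tasks: int) -> bool:
    # Sketch of the kind of guard described above: CI sets an environment
    # variable to force serial execution, avoiding fork()-related hangs when
    # the test runner (stestr) already provides process-level parallelism.
    if os.getenv("QISKIT_PARALLEL", "").upper() == "FALSE":
        return False
    # Otherwise, only parallelize when there is more than one task.
    return num_tasks > 1
```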
* Fix typo in azure pipelines config
* Remove unnecessary extension compilation for image tests
* Add test script to explicitly verify parallel dispatch
In an earlier commit we disabled the use of parallel dispatch in
parallel_map() to avoid a bug in cpython associated with their fork()
based subprocess launch. Doing this works around the bug which was
reliably triggered by running multiprocessing in parallel subprocesses.
It also has the side benefit of providing a ~2x speed up for test suite
execution in CI. However, this meant we lost our test coverage in CI for
running parallel_map() with actual multiprocessing based parallel
dispatch. To ensure we don't inadvertently regress this code path
moving forward this commit adds a dedicated test script which runs a
simple transpilation in parallel and verifies that everything works as
expected with the default parallelism settings.
* Avoid multi-threading when run in a multiprocessing context
This commit adds a switch on running between a single threaded and a
multithreaded variant of the swap_trials loop based on whether the
QISKIT_IN_PARALLEL flag is set. If QISKIT_IN_PARALLEL is set to TRUE
this means the `parallel_map()` function is running in the outer python
context and we're running in multiprocessing already. This means we
generally do not want to run in multiple threads as well, as that would
lead to potential resource exhaustion by spawning `n` processes each
potentially running with `m` threads, where `n` is
`min(num_phys_cpus, num_tasks)` and `m` is `num_logical_cpus` (although
only `min(num_logical_cpus, num_trials)` will be active); on a typical
system there aren't enough cores to leverage both multiprocessing and
multithreading. However, in case a user does have such an environment,
they can set the `QISKIT_FORCE_THREADS` env variable to `TRUE` which
will use threading regardless of the status of `QISKIT_IN_PARALLEL`.
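The resulting decision logic can be sketched in a few lines. The environment variable names are taken from the message above; the thread-count function itself is a hypothetical stand-in for the real Rust-side logic:

```python
import os


def effective_thread_count(num_logical_cpus: int) -> int:
    # QISKIT_FORCE_THREADS wins: use threads even inside parallel_map().
    if os.getenv("QISKIT_FORCE_THREADS", "").upper() == "TRUE":
        return num_logical_cpus
    # Already inside a multiprocessing worker: stay single-threaded to
    # avoid n-processes-times-m-threads oversubscription.
    if os.getenv("QISKIT_IN_PARALLEL", "").upper() == "TRUE":
        return 1
    return num_logical_cpus
```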
* Apply suggestions from code review
Co-authored-by: Jake Lishman <jake@binhbar.com>
* Minor fixes from review comments
This commits fixes some minor details found during code review. It
expands the section on building from source to explain how to build a
release optimized binary with editable mode, makes the QISKIT_PARALLEL
env variable usage consistent across all jobs, and adds a missing
shebang to the `install_rust.sh` script, which is used to install rust in
the manylinux container environment.
* Simplify tox configuration
In earlier commits the tox configuration was changed to try and fix the
docs CI job by going to great effort to try and enforce that
setuptools-rust was installed in all situations, even before it was
actually needed. However, the problem with the docs ci job was unrelated
to the tox configuration and this reverts the configuration to something
that works with more versions of tox and setuptools-rust.
* Add missing pieces of cargo configuration
Co-authored-by: Jake Lishman <jake@binhbar.com>
Co-authored-by: georgios-ts <45130028+georgios-ts@users.noreply.github.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
2022-03-01 05:49:54 +08:00
Add Rust-based OpenQASM 2 converter (#9784)
* Add Rust-based OpenQASM 2 converter
This is a vendored version of qiskit-qasm2
(https://pypi.org/project/qiskit-qasm2), with this initial commit being
equivalent (barring some naming / documentation / testing conversions to
match Qiskit's style) to version 0.5.3 of that package.
This adds a new translation layer from OpenQASM 2 to Qiskit, which is
around an order of magnitude faster than the existing version in Python,
while being more type safe (in terms of disallowing invalid OpenQASM 2
programs rather than attempting to construct `QuantumCircuit`s that
are not correct) and more extensible.
The core logic is a hand-written lexer and parser combination written in
Rust, which emits a bytecode stream across the PyO3 boundary to a small
Python interpreter loop. The main bulk of the parsing logic is a simple
LL(1) recursive-descent algorithm, which delegates to a more specific
recursive Pratt-based algorithm for handling classical expressions.
Many of the design decisions made (including why the lexer is written by
hand) are because the project originally started life as a way for me to
learn about implementations of the different parts of a parser stack;
this is the principal reason there are very few external crates used.
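For readers unfamiliar with the approach: a Pratt parser resolves operator precedence with per-operator binding powers instead of one grammar rule per precedence level. A toy Python version of the idea (illustrative only; the real parser is in Rust and handles far more than integer arithmetic):

```python
import re

# One token per match: an integer literal or a single-character operator.
TOKEN = re.compile(r"\s*(\d+|[-+*/()])")


def tokenize(src):
    out, pos = [], 0
    while pos < len(src):
        m = TOKEN.match(src, pos)
        if not m:
            raise ValueError(f"bad input at {pos}")
        out.append(m.group(1))
        pos = m.end()
    return out


# Binding powers: higher binds tighter.
BINDING = {"+": 10, "-": 10, "*": 20, "/": 20}


def parse(tokens, min_bp=0):
    tok = tokens.pop(0)
    if tok == "(":
        lhs = parse(tokens, 0)
        tokens.pop(0)  # discard the closing ")"
    elif tok == "-":
        lhs = -parse(tokens, 30)  # unary minus binds tightly
    else:
        lhs = int(tok)
    # Keep consuming operators that bind more tightly than our caller's.
    while tokens and tokens[0] in BINDING and BINDING[tokens[0]] > min_bp:
        op = tokens.pop(0)
        rhs = parse(tokens, BINDING[op])
        lhs = {"+": lhs + rhs, "-": lhs - rhs,
               "*": lhs * rhs, "/": lhs / rhs}[op]
    return lhs
```

The strict `>` comparison gives left-associativity for same-precedence operators, which matches OpenQASM 2's arithmetic.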
There are a few inefficiencies in this implementation, for example:
- the string interner in the lexer allocates twice for each stored
string (but zero times for a lookup). It may be possible to
completely eliminate allocations when parsing a string (or a file if
it's read into memory as a whole), but realistically there's only a
fairly small number of different tokens seen in most OpenQASM 2
programs, so it shouldn't be too big a deal.
- the hand-off from Rust to Python transfers small objects frequently.
It might be more efficient to have a secondary buffered iterator in
Python space, transferring more bytecode instructions at a time and
letting Python resolve them. This form could also be made
asynchronous, since for the most part, the Rust components only need
to acquire the CPython GIL at the API boundary.
- there are too many points within the lexer that can return a failure
result that needs unwrapping at every site. Since there are no tokens
that can span multiple lines, it should be possible to refactor so
that almost all of the byte-getter and -peeker routines cannot return
error statuses, at the cost of the main lexer loop becoming
responsible for advancing the line buffer, and moving the non-ASCII
error handling into each token constructor.
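The interner behaviour in the first bullet can be illustrated with a toy Python version (illustrative only; the real interner is in Rust, where the two stores are two heap allocations):

```python
class Interner:
    """Toy string interner: maps each distinct string to a small integer
    id. Storing a new string writes it in two places (the dict and the
    list), mirroring the two-allocations-per-store behaviour mentioned
    above; looking up an existing string allocates nothing."""

    def __init__(self):
        self._ids = {}
        self._strings = []

    def intern(self, s: str) -> int:
        if s in self._ids:          # lookup path: no new storage
            return self._ids[s]
        ident = len(self._strings)
        self._ids[s] = ident        # first store
        self._strings.append(s)     # second store
        return ident

    def lookup(self, ident: int) -> str:
        return self._strings[ident]
```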
I'll probably keep playing with some of those in the `qiskit-qasm2`
package itself when I have free time, but at some point I needed to draw
the line and vendor the package. It's still ~10x faster than the
existing one:
In [1]: import qiskit.qasm2
...: prog = """
...: OPENQASM 2.0;
...: include "qelib1.inc";
...: qreg q[2];
...: """
...: prog += "rz(pi * 2) q[0];\ncx q[0], q[1];\n"*100_000
...: %timeit qiskit.qasm2.loads(prog)
...: %timeit qiskit.QuantumCircuit.from_qasm_str(prog)
2.26 s ± 39.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
22.5 s ± 106 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
`cx`-heavy programs like this one are actually the ones that the new
parser is (comparatively) slowest on, because the construction time of
`CXGate` is higher than most gates, and this dominates the execution
time for the Rust-based parser.
* Work around docs failure on Sphinx 5.3, Python 3.9
The version of Sphinx that we're constrained to use in the docs build
can't handle the `Unpack` operator, so as a temporary measure we can
just relax the type hint a little.
* Remove unused import
* Tweak documentation
* More specific PyO3 usage
* Use PathBuf directly for paths
* Format
* Freeze dataclass
* Use type-safe id types
This should have no impact on runtime or on memory usage, since each of
the new types has the same bit width and alignment as the `usize` values
they replace.
* Documentation tweaks
* Fix comments in lexer
* Fix lexing version number with separating comments
* Add test of pathological formatting
* Fixup release note
* Fix handling of u0 gate
* Credit reviewers
Co-authored-by: Luciano Bello <bel@zurich.ibm.com>
Co-authored-by: Kevin Hartman <kevin@hart.mn>
Co-authored-by: Eric Arellano <14852634+Eric-Arellano@users.noreply.github.com>
* Add test of invalid gate-body statements
* Refactor custom built-in gate definitions
The previous system was quite confusing, and required all accesses to
the global symbol table to know that the `Gate` symbol could be present
but overridable. This led to confusing logic, various bugs and
unnecessary constraints, such as it previously being (erroneously)
possible to provide re-definitions for any "built-in" gate.
Instead, we keep a separate store of instructions that may be redefined.
This allows the logic to be centralised in only the place responsible
for performing those overrides, and remains accessible for error-message
builders to query in order to provide better diagnostics.
* Credit Sasha
Co-authored-by: Alexander Ivrii <alexi@il.ibm.com>
* Credit Matthew
Co-authored-by: Matthew Treinish <mtreinish@kortar.org>
* Remove dependency on `lazy_static`
For a hashset of only 6 elements that is only checked once, there's not
really any point to pull in an extra dependency or use a hash set at
all.
* Update PyO3 version
---------
Co-authored-by: Luciano Bello <bel@zurich.ibm.com>
Co-authored-by: Kevin Hartman <kevin@hart.mn>
Co-authored-by: Eric Arellano <14852634+Eric-Arellano@users.noreply.github.com>
Co-authored-by: Alexander Ivrii <alexi@il.ibm.com>
Co-authored-by: Matthew Treinish <mtreinish@kortar.org>
2023-04-13 00:00:54 +08:00
# If RUST_DEBUG is set, force compiling in debug mode. Else, use the default behavior of whether
# it's an editable installation.
rust_debug = True if os.getenv("RUST_DEBUG") == "1" else None

setup(
    rust_extensions=[
        RustExtension(
            "qiskit._accelerate",
            "crates/accelerate/Cargo.toml",
            binding=Binding.PyO3,
            debug=rust_debug,
        ),
        RustExtension(
            "qiskit._qasm2",
            "crates/qasm2/Cargo.toml",
            binding=Binding.PyO3,
            debug=rust_debug,
        ),
        RustExtension(
            "qiskit._qasm3",
            "crates/qasm3/Cargo.toml",
            binding=Binding.PyO3,
            debug=rust_debug,
        ),
    ],
    options={"bdist_wheel": {"py_limited_api": "cp38"}},
)