Compare commits

...

90 Commits

Author SHA1 Message Date
Christina Floristean b7f35dce65 Use EvoformerFusedAttention directly to avoid all-zero bias term in column attention 2023-11-08 13:57:43 -05:00
Christina Floristean 5aa549583a Merge branch 'main' into deepspeed-evo-attention 2023-11-08 13:32:44 -05:00
Christina Floristean 099769d2ec Hotfix to correct seed type in inference 2023-11-06 13:11:04 -05:00
Jennifer Wei 2338b896c7
Merge pull request #365 from aqlaboratory/dependabot/github_actions/actions/setup-python-4
Bump actions/setup-python from 2 to 4
2023-10-30 14:04:51 -04:00
dependabot[bot] dcd809d9c2
Bump actions/setup-python from 2 to 4
Bumps [actions/setup-python](https://github.com/actions/setup-python) from 2 to 4.
- [Release notes](https://github.com/actions/setup-python/releases)
- [Commits](https://github.com/actions/setup-python/compare/v2...v4)

---
updated-dependencies:
- dependency-name: actions/setup-python
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
2023-10-30 14:55:31 +00:00
Jennifer Wei 51472e756a
Merge pull request #366 from aqlaboratory/dependabot/github_actions/actions/checkout-4
Bump actions/checkout from 2 to 4
2023-10-30 10:54:56 -04:00
dependabot[bot] 13728b1203
Bump actions/checkout from 2 to 4
Bumps [actions/checkout](https://github.com/actions/checkout) from 2 to 4.
- [Release notes](https://github.com/actions/checkout/releases)
- [Changelog](https://github.com/actions/checkout/blob/main/CHANGELOG.md)
- [Commits](https://github.com/actions/checkout/compare/v2...v4)

---
updated-dependencies:
- dependency-name: actions/checkout
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
2023-10-27 19:25:04 +00:00
Jennifer Wei f7dba95f0b
Merge pull request #360 from aqlaboratory/installation-updates
Merging installation updates and updates
2023-10-27 15:13:12 -04:00
Jennifer Wei e3716118cd Adds KMP_AFFINITY to conda environment. 2023-10-27 12:04:15 -07:00
Jennifer Wei 6df89c763f Merge branch 'installation-updates' of https://github.com/aqlaboratory/openfold into installation-updates 2023-10-27 11:33:01 -07:00
Jennifer Wei 13f0f6fe16 Adds libgcc and pytorch::pytorch to environment.yml 2023-10-27 11:30:53 -07:00
Sachin Kadyan 9e32781fd6 README: Small improvements, info regarding templates 2023-10-25 13:42:59 -05:00
jnwei f68a6c694b Update readme, also reset flash attention version to 0.2.1 in environment.yml 2023-10-25 08:51:58 -04:00
jnwei a5c69a79c6 Test docker image nvidia/11.7.1-cudnn8-devel-ubuntu18.04 2023-10-24 19:55:58 -04:00
jnwei e2bb3c4b90 Change docker image to use cudnn devel environment 2023-10-24 19:43:58 -04:00
jnwei 6fe34248b2 Test nvidia image 11.4.3-cudnn8-runtime-unbuntu18.04 2023-10-24 19:32:15 -04:00
jnwei 5efba4425a Fix export statement 2023-10-24 18:31:13 -04:00
jnwei 3817d94098 Update Dockerfile to add conda path to $LD_LIBRARY_PATH 2023-10-24 18:10:13 -04:00
jnwei a5a86d4323 update flash-attention-version 2023-10-24 18:07:32 -04:00
Jennifer Wei f06657fe8a Add git to environment.yml 2023-10-24 13:54:21 -07:00
Sachin Kadyan 2d4fe4f414 README: Update details about SoloSeq and inference-time embedding generation. 2023-10-24 13:07:14 -05:00
jnwei a90da39554 Test flash-attention v0.2.1 on docker CI 2023-10-24 13:50:32 -04:00
Sachin Kadyan 86b990d6ed Separate out input parsing code in `EmbeddingGenerator`
Bugfix: Corrected paths for just-in-time embedding generation
2023-10-24 12:30:25 -05:00
Sachin Kadyan 8185c30775 Just-in-time embedding generation for the SoloSeq model 2023-10-24 01:03:20 -05:00
Sachin Kadyan 4c8e37644e README: Add details about using the bulk embedding generation script 2023-10-23 20:45:56 -05:00
jnwei 5f5c8f2a5b Merge remote-tracking branch 'mattwthompson/ci' into installation-updates 2023-10-23 18:27:01 -04:00
Matthew W. Thompson 7666c80272 Use conventional "short hash" style 2023-10-23 16:13:16 -05:00
Matthew W. Thompson 582103505d Update name to `pytorch-lightning` 2023-10-23 16:01:20 -05:00
Matthew W. Thompson 736d668741 Add Dependabot for Actions 2023-10-23 15:53:34 -05:00
Matthew W. Thompson 32c11376d7 Update checkout action 2023-10-23 15:52:39 -05:00
Matthew W. Thompson f86d42f40e Switch to Miniforge3 2023-10-23 15:45:33 -05:00
Matthew W. Thompson 6bf5c8cea1 Update environment YAML 2023-10-23 15:37:36 -05:00
Sachin Kadyan 92835fd5e6 More cleaning of bulk embedding generation script 2023-10-23 15:14:55 -05:00
Sachin Kadyan 0026173e23 Cleaned up `precompute_embeddings.py`. 2023-10-22 21:48:53 -05:00
Sachin Kadyan bcc6d97b69 New script for generating ESM embeddings in bulk 2023-10-20 21:33:44 -05:00
Gustaf Ahdritz 0c20e3c989 Update BibTex 2023-10-20 15:56:10 -04:00
jnwei d6ae9f5894 Remove conda env config setting and update to README 2023-10-20 15:44:34 -04:00
Jennifer Wei b3a118fc83 Updates Dockerfile to not use OpenMM patches. 2023-10-20 09:20:35 -07:00
Jennifer Wei 2893fd934b Merge branch 'installation-updates' of https://github.com/aqlaboratory/openfold into installation-updates
Merging changes from different systems
2023-10-20 07:49:19 -07:00
Jennifer Wei 3e3f07c7f2 Updates installation instructions in the README. 2023-10-20 07:02:11 -07:00
Jennifer Wei fcba33580e - Moves python packages to conda installation
instead of pip
- Adds helper line to automatically prepend conda library to
  $LD_LIBRARY_PATH
2023-10-20 06:48:05 -07:00
Gustaf Ahdritz 2300f6720d
Update BibTex 2023-10-19 18:44:01 -04:00
Jennifer d77a8dabea Merge branch 'installation-updates' of https://github.com/aqlaboratory/openfold into installation-updates 2023-10-17 17:30:21 -04:00
Jennifer fb34a0cb62 Remove openmm patches 2023-10-17 17:10:24 -04:00
Jennifer Wei 705c26773d Updating $LD_LIBRARY_PATH to include conda environment library. 2023-10-17 09:10:49 -07:00
Jennifer Wei 4fde713c05 update installation scripts. 2023-10-17 08:22:46 -07:00
Jennifer Wei 7922bd57f1 Removes conda installation from installation script. 2023-10-16 10:22:43 -07:00
Sachin Kadyan 6381ddd6e9 README fixes for single sequence model 2023-10-11 15:41:15 -05:00
Sachin Kadyan e8de822e9b Improved UX: Automatically set the single seq mode flag 2023-10-10 16:33:07 -04:00
Sachin Kadyan c8c1239723 Separate out the seq mode configs from vanilla OF config 2023-10-10 16:33:07 -04:00
Sachin Kadyan b45a91ba5c Limit the MSA distillation clusters to 1 in seq mode 2023-10-10 16:33:07 -04:00
Sachin Kadyan 3f592307eb Reduce redundancy in seq embedding config presets 2023-10-10 16:33:07 -04:00
Gustaf Ahdritz 6aefa986a8 Fix typos 2023-10-10 16:33:07 -04:00
Sachin Kadyan 3c240cb3f2 Updated README: Running seqemb model inference 2023-10-10 16:33:07 -04:00
Sachin Kadyan 28334db382 Added test for no column attention Evoformer 2023-10-10 16:33:07 -04:00
Sachin Kadyan a7c0d0d178 Added sequence embedding mode test for `model`. 2023-10-10 16:33:07 -04:00
Sachin Kadyan 777d738a59 Added test for PreembeddingEmbedder 2023-10-10 16:33:07 -04:00
Sachin Kadyan 6012b9e1c1 Default value for --use_single_seq_mode arg 2023-10-10 16:33:07 -04:00
Sachin Kadyan 08ef6e9fb6 Add sequence embedding mode option to .core file parser 2023-10-10 16:33:07 -04:00
Sachin Kadyan 395a9f1ba8 Added flag in training script for using sequence embeddings 2023-10-10 16:33:07 -04:00
sachinkadyan7 f85d67f4f9 Bugfix for timings.json - now store timings per tag. 2023-10-10 16:33:07 -04:00
sachinkadyan7 9b114f28df Renamed `preembedding_embedder` to `input_embedder` 2023-10-10 16:33:07 -04:00
sachinkadyan7 5a8d2b78c1 Properly reading the embedding file 2023-10-10 16:33:07 -04:00
sachinkadyan7 36d5708cfd Added config presets for esm1b model inference 2023-10-10 16:33:07 -04:00
Sachin Kadyan 047e69af8d Fix for a bug in data_transforms which wouldn't allow creation of MSA mask if there is only input sequence in MSA.
- Set `max_msa_clusters=1` in model presets for allowing the input sequence to be a MSA cluster centre.
2023-10-10 16:33:07 -04:00
Sachin Kadyan 624b5aa698 Added switch for sequence embedding mode to the PDB file pipeline.
- In `seqemb_mode`, `process_pdb` loads sequence embedding for the PDB's protein, and a dummy MSA
2023-10-10 16:33:07 -04:00
Sachin Kadyan 299629903b Added documentation for some sequence embedding model changes. 2023-10-10 16:33:07 -04:00
Sachin Kadyan a83c6fcc3e Added inference model preset for seqemb mode. 2023-10-10 16:33:07 -04:00
Sachin Kadyan 8c94482aa0 Added a separate AlignmentRunner for handling seq_emb mode.
- In `seq_emb` mode, the AlignmentRunner works only on generating templates.
2023-10-10 16:33:07 -04:00
Sachin Kadyan 380947c429 Added the seq_emb features to the list of features to be processed by feature pipeline , if using seq_emb mode
- In `seq_emb` mode, add list of `seq_emb` features to `feature_names`
2023-10-10 16:33:07 -04:00
Sachin Kadyan f2540236b7 Changed the seq embedding tensor passed to the data pipeline to be a seqemb feature dictionary.
- `_process_seqemb_features` now returns a dictionary instead of a tensor.
2023-10-10 16:33:07 -04:00
Sachin Kadyan 43d0964536 Optimized type-changing of features from numpy to torch
- Bugfix: `torch` throws warnings when copying a tensor via initialization
- Added lambda to `.clone()` those tensors instead
2023-10-10 16:33:07 -04:00
Sachin Kadyan 7f84eebd48 [BUGFIX] Fix an import bug in `data_pipeline.py` 2023-10-10 16:33:07 -04:00
Sachin Kadyan a51f5fb585 Added training preset for sequence embedding finetuning training.
- Turn on `seqemb` mode in `data`, `model`, and `globals` config when using `seqemb` training preset.
- Set configuration options specific for finetuning in general.
2023-10-10 16:33:07 -04:00
Sachin Kadyan bbdaacfd17 Added training preset for sequence embedding initial training.
- Turn on `seqemb` mode in `data`, `model`, and `globals` config when using `seqemb` training preset.
2023-10-10 16:33:07 -04:00
Sachin Kadyan a6a467e09c Added passing of sequence embedding mode flag from `data_modules` to `data_pipeline` for training and inference pipelines.
- Passing the config.data.seqemb_mode.enabled flag to the FASTA, PDB, and MMCIF data pipelines.
2023-10-10 16:33:07 -04:00
Sachin Kadyan 2ba07feb88 Added switch in the MMCIF processing pipeline for using sequence embeddings when in `seqemb` mode.
- Use sequence embedding files when in `seqemb` mode.
- Make dummy MSA features for MMCIF when using `seqemb` mode.
2023-10-10 16:33:07 -04:00
Sachin Kadyan bc3ba06ef1 Added switching off of column attention in evoformer when using sequence embeddings.
- Added flag `no_column_attention` in evoformer config.
- Added check in `evoformer.py` to switch off `MSAColumnAttention` when the config flag `no_column_attention` is `True`.
2023-10-10 16:33:07 -04:00
Sachin Kadyan 6403401fb6 Added configuration options for the new PreembeddingEmbedder.
- Added `preembedding_embedder` config dictionary in `config`
- Added `preemb_dim_size` property in `config` for specifying single seq embedding size.
2023-10-10 16:33:07 -04:00
Sachin Kadyan 75889e9a9a Added switch for using the single sequence embedder when using the model in `seqemb` mode.
- `seqemb_mode_enabled` added as a configuration option.
- `model.py` switches to using the `PreembeddingEmbedder` when the flag is `True`.
2023-10-10 16:33:07 -04:00
Sachin Kadyan aacf1b6fb2 Added single seq mode in inference script and forwarded to the FASTA processors.
- Added `use_single_seq_mode` flag in inference script arguments.
- Passed on the flag to the FASTA file `data_processor`.
2023-10-10 16:33:07 -04:00
Sachin Kadyan cf054ce9e3 Added loading of sequence embeddings in inference flow when in seq_emb mode.
- Added a method to load and process sequence embedding `*.pt` files.
- In `seqemb_mode`, now add seqemb features to the feature dictionary.
2023-10-10 16:33:07 -04:00
Sachin Kadyan 19d090cb92 Added switch in inference flow for using sequence embedding instead of MSAs.
- Added a `seq_emb` bool flag to `data_pipeline > process_fasta()`
- In `seqemb_mode` use dummy MSA features instead of full ones.
2023-10-10 16:33:07 -04:00
Sachin Kadyan e40900d897 Added dummy MSA generation for seq-emb mode.
- New method for generating dummy MSA features.
2023-10-10 16:33:07 -04:00
Sachin Kadyan 21a88b6ff9 Added sequence-embedding mode config.
- Added configuration options in `config.py` for toggling seq-emb mode
- Added list for specifying features to be used in seq-emb mode.
2023-10-10 16:33:07 -04:00
Sachin Kadyan 40325b186e Added embedder for handling single-sequence embeddings.
- Added a `PreembeddingEmbedder` for embedding single-sequence (NUM_RESIDUE, ...) shaped embeddings as input.
2023-10-10 16:33:07 -04:00
jnwei 60d0b15ac3
Merge pull request #350 from aqlaboratory/fix-msastack-test-error
Fixes cuda/float wrapper error in unit tests
2023-09-21 10:45:17 -04:00
Jennifer Wei 73ff40b655 Preserves one copy of `tests/test_data/sample_feats.pickle.gz` for unit
tests in test_data_transforms.py
2023-09-20 13:04:47 -07:00
Jennifer Wei 8baae516a4 Adds cuda wrapper to pytorch vectors to fix TestModel.test_dry_run 2023-09-20 13:01:51 -07:00
Jennifer Wei 48668ca30b Adds float wrapper to to vectors in TestExtraMSAStack in
test_evoformer.py
2023-09-18 14:57:27 -07:00
26 changed files with 804 additions and 250 deletions

.github/dependabot.yml (new file)
View File

@ -0,0 +1,7 @@
version: 2
updates:
- package-ecosystem: "github-actions"
directory: "/"
schedule:
interval: "daily"

View File

@ -10,6 +10,6 @@ jobs:
build:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v2
- uses: actions/checkout@v4
- name: Build the Docker image
run: docker build . --file Dockerfile --tag openfold:$(date +%s)

View File

@ -4,8 +4,8 @@ jobs:
undefined_names:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v2
- uses: actions/setup-python@v2
- uses: actions/checkout@v4
- uses: actions/setup-python@v4
- run: pip install --upgrade pip
- run: pip install flake8
- run: flake8 . --count --select=E9,F63,F7,F82 --show-source --statistics

View File

@ -1,4 +1,4 @@
FROM nvidia/cuda:11.3.1-cudnn8-runtime-ubuntu18.04
FROM nvidia/cuda:11.3.1-cudnn8-devel-ubuntu18.04
# metainformation
LABEL org.opencontainers.image.version = "1.0.0"
@ -13,24 +13,23 @@ RUN apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/
RUN apt-get update && apt-get install -y wget libxml2 cuda-minimal-build-11-3 libcusparse-dev-11-3 libcublas-dev-11-3 libcusolver-dev-11-3 git
RUN wget -P /tmp \
"https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh" \
&& bash /tmp/Miniconda3-latest-Linux-x86_64.sh -b -p /opt/conda \
&& rm /tmp/Miniconda3-latest-Linux-x86_64.sh
"https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-Linux-x86_64.sh" \
&& bash /tmp/Miniforge3-Linux-x86_64.sh -b -p /opt/conda \
&& rm /tmp/Miniforge3-Linux-x86_64.sh
ENV PATH /opt/conda/bin:$PATH
COPY environment.yml /opt/openfold/environment.yml
# installing into the base environment since the docker container wont do anything other than run openfold
RUN conda env update -n base --file /opt/openfold/environment.yml && conda clean --all
RUN mamba env update -n base --file /opt/openfold/environment.yml && mamba clean --all
RUN export LD_LIBRARY_PATH=${CONDA_PREFIX}/lib:${LD_LIBRARY_PATH}
COPY openfold /opt/openfold/openfold
COPY scripts /opt/openfold/scripts
COPY run_pretrained_openfold.py /opt/openfold/run_pretrained_openfold.py
COPY train_openfold.py /opt/openfold/train_openfold.py
COPY setup.py /opt/openfold/setup.py
COPY lib/openmm.patch /opt/openfold/lib/openmm.patch
RUN wget -q -P /opt/openfold/openfold/resources \
https://git.scicore.unibas.ch/schwede/openstructure/-/raw/7102c63615b64735c4941278d92b554ec94415f8/modules/mol/alg/src/stereo_chemical_props.txt
RUN patch -p0 -d /opt/conda/lib/python3.9/site-packages/ < /opt/openfold/lib/openmm.patch
WORKDIR /opt/openfold
RUN python3 setup.py install

View File

@ -29,7 +29,7 @@ vice versa (see `scripts/convert_of_weights_to_jax.py`).
OpenFold has the following advantages over the reference implementation:
- **Faster inference** on GPU, sometimes by as much as 2x. The greatest speedups are achieved on (>= Ampere) GPUs.
- **Faster inference** on GPU, sometimes by as much as 2x. The greatest speedups are achieved on Ampere or higher architecture GPUs.
- **Inference on extremely long chains**, made possible by our implementation of low-memory attention
([Rabe & Staats 2021](https://arxiv.org/pdf/2112.05682.pdf)). OpenFold can predict the structures of
sequences with more than 4000 residues on a single A100, and even longer ones with CPU offloading.
@ -49,37 +49,19 @@ and one of {`jackhmmer`, [MMseqs2](https://github.com/soedinglab/mmseqs2) (night
installed on your system. You'll need `git-lfs` to download OpenFold parameters.
Finally, some download scripts require `aria2c` and `aws`.
For convenience, we provide a script that installs Miniconda locally, creates a
`conda` virtual environment, installs all Python dependencies, and downloads
useful resources, including both sets of model parameters. Run:
This package currently supports CUDA 11 and PyTorch 1.12.
```bash
scripts/install_third_party_dependencies.sh
```
To install:
1. Clone the repository, e.g. `git clone https://github.com/aqlaboratory/openfold.git`
1. From the `openfold` repo:
- Create a [Mamba](https://github.com/conda-forge/miniforge/releases/latest/download/) environment, e.g.
`mamba env create -n openfold_env -f environment.yml`
Mamba is recommended because OpenFold's dependencies are quite large and mamba resolves them faster than conda.
- Activate the environment, e.g. `conda activate openfold_env`
1. Run `scripts/install_third_party_dependencies.sh` to configure kernels and folding resources.
To activate the environment, run:
For some systems, it may help to append the Conda environment library path to `$LD_LIBRARY_PATH`. The `install_third_party_dependencies.sh` script does this once, but you may need to repeat it in each new shell session (see the example below).
```bash
source scripts/activate_conda_env.sh
```
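If you need to set this manually, a minimal example is the same line used by the Dockerfile and the install script (run it inside the activated environment, where `$CONDA_PREFIX` points at that environment):
```bash
export LD_LIBRARY_PATH=$CONDA_PREFIX/lib:$LD_LIBRARY_PATH
```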
To deactivate it, run:
```bash
source scripts/deactivate_conda_env.sh
```
With the environment active, compile OpenFold's CUDA kernels with
```bash
python3 setup.py install
```
To install the HH-suite to `/usr/bin`, run
```bash
# scripts/install_hh_suite.sh
```
## Usage
@ -233,6 +215,51 @@ efficient AlphaFold-Multimer more than double the time. Use the
at once. The `run_pretrained_openfold.py` script can enable this config option with the
`--long_sequence_inference` command line option
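As an illustrative sketch only (the input paths and config preset here are placeholders; the relevant part is the final flag):
```bash
python3 run_pretrained_openfold.py \
    fasta_dir \
    data/pdb_mmcif/mmcif_files/ \
    --output_dir ./ \
    --model_device "cuda:0" \
    --config_preset "model_1_ptm" \
    --long_sequence_inference
```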
#### SoloSeq Inference
To run inference for a sequence using the SoloSeq single-sequence model, you can either precompute ESM-1b embeddings in bulk, or you can generate them during inference.
For generating ESM-1b embeddings in bulk, use the provided script `scripts/precompute_embeddings.py`. The script takes a directory of FASTA files (one sequence per file) and generates ESM-1b embeddings in the same format and directory structure required by SoloSeq. The following is an example command:
```bash
python scripts/precompute_embeddings.py fasta_dir/ embeddings_output_dir/
```
In the same per-label subdirectories inside `embeddings_output_dir`, you can also place `*.hhr` files (outputs from HHSearch) containing details of the structures you want to use as templates. If you do not place any such file, templates will not be used and only the ESM-1b embeddings will be used to predict the structure. If you want to use templates, you also need to pass the PDB mmCIF dataset to the command.
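For illustration, assuming two hypothetical sequence labels `tag1` and `tag2`, `embeddings_output_dir` would be laid out roughly as follows (the `.pt` files are written by `precompute_embeddings.py`; the `.hhr` files are optional):
```
embeddings_output_dir/
├── tag1/
│   ├── tag1.pt
│   └── tag1.hhr   # optional HHSearch hits, used for templates
└── tag2/
    └── tag2.pt
```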
Now, you are ready to run inference:
```bash
python run_pretrained_openfold.py \
fasta_dir \
data/pdb_mmcif/mmcif_files/ \
--use_precomputed_alignments embeddings_output_dir \
--output_dir ./ \
--model_device "cuda:0" \
--config_preset "seq_model_esm1b_ptm" \
--openfold_checkpoint_path openfold/resources/openfold_params/seq_model_esm1b_ptm.pt
```
To generate the embeddings during inference, skip the `--use_precomputed_alignments` argument. The `*.hhr` files will also be generated if you pass the paths to the relevant databases and tools, as specified in the command below. If you skip the database and tool arguments, HHSearch will not be used to find templates and only the generated ESM-1b embeddings will be used to predict the structure.
```bash
python3 run_pretrained_openfold.py \
fasta_dir \
data/pdb_mmcif/mmcif_files/ \
--output_dir ./ \
--model_device "cuda:0" \
--config_preset "seq_model_esm1b_ptm" \
--openfold_checkpoint_path openfold/resources/openfold_params/seq_model_esm1b_ptm.pt \
--uniref90_database_path data/uniref90/uniref90.fasta \
--pdb70_database_path data/pdb70/pdb70 \
--jackhmmer_binary_path lib/conda/envs/openfold_venv/bin/jackhmmer \
--hhsearch_binary_path lib/conda/envs/openfold_venv/bin/hhsearch \
--kalign_binary_path lib/conda/envs/openfold_venv/bin/kalign
```
For generating template information, you will need the UniRef90 and PDB70 databases and the JackHmmer and HHSearch binaries.
SoloSeq allows you to use the same flags and optimizations as the MSA-based OpenFold. For example, you can skip relaxation using `--skip_relaxation`, save all model outputs using `--save_outputs`, and generate output files in MMCIF format using `--cif_output`.
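As a sketch, the earlier precomputed-embeddings command with two of these flags added looks like:
```bash
python run_pretrained_openfold.py \
    fasta_dir \
    data/pdb_mmcif/mmcif_files/ \
    --use_precomputed_alignments embeddings_output_dir \
    --output_dir ./ \
    --model_device "cuda:0" \
    --config_preset "seq_model_esm1b_ptm" \
    --openfold_checkpoint_path openfold/resources/openfold_params/seq_model_esm1b_ptm.pt \
    --skip_relaxation \
    --cif_output
```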
**NOTE:** Due to the nature of the ESM-1b embeddings, the sequence length for inference using the SoloSeq model is limited to 1022 residues. Sequences longer than that will be truncated.
### Training
To train the model, you will first need to precompute protein alignments.
@ -440,7 +467,7 @@ Please cite our paper:
```bibtex
@article {Ahdritz2022.11.20.517210,
author = {Ahdritz, Gustaf and Bouatta, Nazim and Kadyan, Sachin and Xia, Qinghui and Gerecke, William and O{\textquoteright}Donnell, Timothy J and Berenberg, Daniel and Fisk, Ian and Zanichelli, Niccolò and Zhang, Bo and Nowaczynski, Arkadiusz and Wang, Bei and Stepniewska-Dziubinska, Marta M and Zhang, Shang and Ojewole, Adegoke and Guney, Murat Efe and Biderman, Stella and Watkins, Andrew M and Ra, Stephen and Lorenzo, Pablo Ribalta and Nivon, Lucas and Weitzner, Brian and Ban, Yih-En Andrew and Sorger, Peter K and Mostaque, Emad and Zhang, Zhao and Bonneau, Richard and AlQuraishi, Mohammed},
author = {Ahdritz, Gustaf and Bouatta, Nazim and Floristean, Christina and Kadyan, Sachin and Xia, Qinghui and Gerecke, William and O{\textquoteright}Donnell, Timothy J and Berenberg, Daniel and Fisk, Ian and Zanichelli, Niccolò and Zhang, Bo and Nowaczynski, Arkadiusz and Wang, Bei and Stepniewska-Dziubinska, Marta M and Zhang, Shang and Ojewole, Adegoke and Guney, Murat Efe and Biderman, Stella and Watkins, Andrew M and Ra, Stephen and Lorenzo, Pablo Ribalta and Nivon, Lucas and Weitzner, Brian and Ban, Yih-En Andrew and Sorger, Peter K and Mostaque, Emad and Zhang, Zhao and Bonneau, Richard and AlQuraishi, Mohammed},
title = {{O}pen{F}old: {R}etraining {A}lpha{F}old2 yields new insights into its learning mechanisms and capacity for generalization},
elocation-id = {2022.11.20.517210},
year = {2022},

View File

@ -1,33 +1,36 @@
name: openfold_venv
name: openfold-venv
channels:
- conda-forge
- bioconda
- pytorch
dependencies:
- conda-forge::python=3.9
- conda-forge::setuptools=59.5.0
- conda-forge::pip
- conda-forge::openmm=7.5.1
- conda-forge::pdbfixer
- conda-forge::cudatoolkit==11.3.*
- python=3.9
- libgcc=7.2
- setuptools=59.5.0
- pip
- openmm=7.7
- pdbfixer
- cudatoolkit==11.3.*
- pytorch-lightning==1.5.10
- biopython==1.79
- numpy==1.21
- PyYAML==5.4.1
- requests
- scipy==1.7
- tqdm==4.62.2
- typing-extensions==3.10
- wandb==0.12.21
- modelcif==0.7
- awscli
- ml-collections
- aria2
- git
- bioconda::hmmer==3.3.2
- bioconda::hhsuite==3.3.0
- bioconda::kalign2==2.04
- pytorch::pytorch=1.12.*
- pip:
- biopython==1.79
- deepspeed==0.12.2
- dm-tree==0.1.6
- ml-collections==0.1.0
- numpy==1.21.2
- PyYAML==5.4.1
- requests==2.26.0
- scipy==1.7.1
- tqdm==4.62.2
- typing-extensions==3.10.0.2
- pytorch_lightning==1.5.10
- wandb==0.12.21
- modelcif==0.7
- git+https://github.com/NVIDIA/dllogger.git
- git+https://github.com/microsoft/DeepSpeed.git
# TODO: Replace above when version becomes available
# - deepspeed==0.10.4
- git+https://github.com/Dao-AILab/flash-attention.git@5b838a8

View File

@ -1,42 +0,0 @@
Index: simtk/openmm/app/topology.py
===================================================================
--- simtk.orig/openmm/app/topology.py
+++ simtk/openmm/app/topology.py
@@ -356,19 +356,35 @@
def isCyx(res):
names = [atom.name for atom in res._atoms]
return 'SG' in names and 'HG' not in names
+ # This function is used to prevent multiple di-sulfide bonds from being
+ # assigned to a given atom. This is a DeepMind modification.
+ def isDisulfideBonded(atom):
+ for b in self._bonds:
+ if (atom in b and b[0].name == 'SG' and
+ b[1].name == 'SG'):
+ return True
+
+ return False
cyx = [res for res in self.residues() if res.name == 'CYS' and isCyx(res)]
atomNames = [[atom.name for atom in res._atoms] for res in cyx]
for i in range(len(cyx)):
sg1 = cyx[i]._atoms[atomNames[i].index('SG')]
pos1 = positions[sg1.index]
+ candidate_distance, candidate_atom = 0.3*nanometers, None
for j in range(i):
sg2 = cyx[j]._atoms[atomNames[j].index('SG')]
pos2 = positions[sg2.index]
delta = [x-y for (x,y) in zip(pos1, pos2)]
distance = sqrt(delta[0]*delta[0] + delta[1]*delta[1] + delta[2]*delta[2])
- if distance < 0.3*nanometers:
- self.addBond(sg1, sg2)
+ if distance < candidate_distance and not isDisulfideBonded(sg2):
+ candidate_distance = distance
+ candidate_atom = sg2
+ # Assign bond to closest pair.
+ if candidate_atom:
+ self.addBond(sg1, candidate_atom)
+
+
class Chain(object):
"""A Chain object represents a chain within a Topology."""

View File

@ -152,9 +152,42 @@ def model_config(
c.model.template.enabled = False
c.model.heads.tm.enabled = True
c.loss.tm.weight = 0.1
# SINGLE SEQUENCE EMBEDDING PRESETS
elif name == "seqemb_initial_training":
c.data.train.max_msa_clusters = 1
c.data.eval.max_msa_clusters = 1
c.data.train.max_distillation_msa_clusters = 1
elif name == "seqemb_finetuning":
c.data.train.max_msa_clusters = 1
c.data.eval.max_msa_clusters = 1
c.data.train.max_distillation_msa_clusters = 1
c.data.train.crop_size = 384
c.loss.violation.weight = 1.
c.loss.experimentally_resolved.weight = 0.01
elif name == "seq_model_esm1b":
c.data.common.use_templates = True
c.data.common.use_template_torsion_angles = True
c.model.template.enabled = True
c.data.predict.max_msa_clusters = 1
elif name == "seq_model_esm1b_ptm":
c.data.common.use_templates = True
c.data.common.use_template_torsion_angles = True
c.model.template.enabled = True
c.data.predict.max_msa_clusters = 1
c.model.heads.tm.enabled = True
c.loss.tm.weight = 0.1
else:
raise ValueError("Invalid model name")
if name.startswith("seq"):
# Tell the data pipeline that we will use sequence embeddings instead of MSAs.
c.data.seqemb_mode.enabled = True
c.globals.seqemb_mode_enabled = True
# In seqemb mode, we turn off the ExtraMSAStack and Evoformer's column attention.
c.model.extra_msa.enabled = False
c.model.evoformer_stack.no_column_attention = True
c.update(seq_mode_config.copy_and_resolve_references())
if long_sequence_inference:
assert(not train)
c.globals.offload_inference = True
@ -189,6 +222,11 @@ c_m = mlc.FieldReference(256, field_type=int)
c_t = mlc.FieldReference(64, field_type=int)
c_e = mlc.FieldReference(64, field_type=int)
c_s = mlc.FieldReference(384, field_type=int)
# For seqemb mode, dimension size of the per-residue sequence embedding passed to the model
# In the current model, this is the ESM-1b embedding dimension, i.e. 1280.
preemb_dim_size = mlc.FieldReference(1280, field_type=int)
blocks_per_ckpt = mlc.FieldReference(None, field_type=int)
chunk_size = mlc.FieldReference(4, field_type=int)
aux_distogram_bins = mlc.FieldReference(64, field_type=int)
@ -301,6 +339,9 @@ config = mlc.ConfigDict(
"use_templates": templates_enabled,
"use_template_torsion_angles": embed_template_torsion_angles,
},
"seqemb_mode": { # Configuration for sequence embedding mode
"enabled": False, # If True, use seq emb instead of MSA
},
"supervised": {
"clamp_prob": 0.9,
"supervised_features": [
@ -365,6 +406,7 @@ config = mlc.ConfigDict(
},
# Recurring FieldReferences that can be changed globally here
"globals": {
"seqemb_mode_enabled": False, # Global flag for enabling seq emb mode
"blocks_per_ckpt": blocks_per_ckpt,
"chunk_size": chunk_size,
# Use DeepSpeed memory-efficient attention kernel. Mutually
@ -497,6 +539,7 @@ config = mlc.ConfigDict(
"transition_n": 4,
"msa_dropout": 0.15,
"pair_dropout": 0.25,
"no_column_attention": False,
"blocks_per_ckpt": blocks_per_ckpt,
"clear_cache_between_blocks": False,
"tune_chunk_size": tune_chunk_size,
@ -618,3 +661,31 @@ config = mlc.ConfigDict(
"ema": {"decay": 0.999},
}
)
seq_mode_config = mlc.ConfigDict({
"data": {
"common": {
"feat": {
"seq_embedding": [NUM_RES, None],
},
"seqemb_features": [ # List of features to be generated in seqemb mode
"seq_embedding"
],
},
"seqemb_mode": { # Configuration for sequence embedding mode
"enabled": True, # If True, use seq emb instead of MSA
},
},
"globals": {
"seqemb_mode_enabled": True,
},
"model": {
"preembedding_embedder": { # Used in sequence embedding mode
"tf_dim": 22,
"preembedding_dim": preemb_dim_size,
"c_z": c_z,
"c_m": c_m,
"relpos_k": 32,
},
}
})

View File

@ -186,7 +186,8 @@ class OpenFoldSingleDataset(torch.utils.data.Dataset):
mmcif=mmcif_object,
alignment_dir=alignment_dir,
chain_id=chain_id,
alignment_index=alignment_index
alignment_index=alignment_index,
seqemb_mode=self.config.seqemb_mode.enabled
)
return data
@ -239,6 +240,7 @@ class OpenFoldSingleDataset(torch.utils.data.Dataset):
elif(ext == ".core"):
data = self.data_pipeline.process_core(
path, alignment_dir, alignment_index,
seqemb_mode=self.config.seqemb_mode.enabled,
)
elif(ext == ".pdb"):
structure_index = None
@ -251,6 +253,7 @@ class OpenFoldSingleDataset(torch.utils.data.Dataset):
chain_id=chain_id,
alignment_index=alignment_index,
_structure_index=structure_index,
seqemb_mode=self.config.seqemb_mode.enabled,
)
else:
raise ValueError("Extension branch missing")
@ -260,6 +263,7 @@ class OpenFoldSingleDataset(torch.utils.data.Dataset):
fasta_path=path,
alignment_dir=alignment_dir,
alignment_index=alignment_index,
seqemb_mode=self.config.seqemb_mode.enabled,
)
if(self._output_raw):

View File

@ -19,6 +19,7 @@ from multiprocessing import cpu_count
from typing import Mapping, Optional, Sequence, Any
import numpy as np
import torch
from openfold.data import templates, parsers, mmcif_parsing
from openfold.data.templates import get_custom_template_features
@ -260,6 +261,18 @@ def make_msa_features(
return features
# Generate 1-sequence MSA features having only the input sequence
def make_dummy_msa_feats(input_sequence):
msas = [[input_sequence]]
deletion_matrices = [[[0 for _ in input_sequence]]]
msa_features = make_msa_features(
msas=msas,
deletion_matrices=deletion_matrices,
)
return msa_features
def make_sequence_features_with_custom_template(
sequence: str,
mmcif_path: str,
@ -627,11 +640,28 @@ class DataPipeline:
return msa_features
# Load and process sequence embedding features
def _process_seqemb_features(self,
alignment_dir: str,
) -> Mapping[str, Any]:
seqemb_features = {}
for f in os.listdir(alignment_dir):
path = os.path.join(alignment_dir, f)
ext = os.path.splitext(f)[-1]
if (ext == ".pt"):
# Load embedding file
seqemb_data = torch.load(path)
seqemb_features["seq_embedding"] = seqemb_data["representations"][33]
return seqemb_features
def process_fasta(
self,
fasta_path: str,
alignment_dir: str,
alignment_index: Optional[str] = None,
seqemb_mode: bool = False,
) -> FeatureDict:
"""Assembles features for a single sequence in a FASTA file"""
with open(fasta_path) as f:
@ -658,12 +688,19 @@ class DataPipeline:
num_res=num_res,
)
msa_features = self._process_msa_feats(alignment_dir, input_sequence, alignment_index)
sequence_embedding_features = {}
# If using seqemb mode, generate dummy MSA features using just the sequence
if seqemb_mode:
msa_features = make_dummy_msa_feats(input_sequence)
sequence_embedding_features = self._process_seqemb_features(alignment_dir)
else:
msa_features = self._process_msa_feats(alignment_dir, input_sequence, alignment_index)
return {
**sequence_features,
**msa_features,
**template_features
**template_features,
**sequence_embedding_features,
}
def process_mmcif(
@ -672,6 +709,7 @@ class DataPipeline:
alignment_dir: str,
chain_id: Optional[str] = None,
alignment_index: Optional[str] = None,
seqemb_mode: bool = False,
) -> FeatureDict:
"""
Assembles features for a specific chain in an mmCIF object.
@ -696,10 +734,16 @@ class DataPipeline:
self.template_featurizer,
query_release_date=to_date(mmcif.header["release_date"])
)
msa_features = self._process_msa_feats(alignment_dir, input_sequence, alignment_index)
return {**mmcif_feats, **template_features, **msa_features}
sequence_embedding_features = {}
# If using seqemb mode, generate dummy MSA features using just the sequence
if seqemb_mode:
msa_features = make_dummy_msa_feats(input_sequence)
sequence_embedding_features = self._process_seqemb_features(alignment_dir)
else:
msa_features = self._process_msa_feats(alignment_dir, input_sequence, alignment_index)
return {**mmcif_feats, **template_features, **msa_features, **sequence_embedding_features}
def process_pdb(
self,
@ -709,6 +753,7 @@ class DataPipeline:
chain_id: Optional[str] = None,
_structure_index: Optional[str] = None,
alignment_index: Optional[str] = None,
seqemb_mode: bool = False,
) -> FeatureDict:
"""
Assembles features for a protein in a PDB file.
@ -742,15 +787,22 @@ class DataPipeline:
self.template_featurizer,
)
msa_features = self._process_msa_feats(alignment_dir, input_sequence, alignment_index)
sequence_embedding_features = {}
# If in sequence embedding mode, generate dummy MSA features using just the input sequence
if seqemb_mode:
msa_features = make_dummy_msa_feats(input_sequence)
sequence_embedding_features = self._process_seqemb_features(alignment_dir)
else:
msa_features = self._process_msa_feats(alignment_dir, input_sequence, alignment_index)
return {**pdb_feats, **template_features, **msa_features}
return {**pdb_feats, **template_features, **msa_features, **sequence_embedding_features}
def process_core(
self,
core_path: str,
alignment_dir: str,
alignment_index: Optional[str] = None,
seqemb_mode: bool = False,
) -> FeatureDict:
"""
Assembles features for a protein in a ProteinNet .core file.
@ -770,9 +822,15 @@ class DataPipeline:
self.template_featurizer,
)
msa_features = self._process_msa_feats(alignment_dir, input_sequence)
sequence_embedding_features = {}
# If in sequence embedding mode, generate dummy MSA features using just the input sequence
if seqemb_mode:
msa_features = make_dummy_msa_feats(input_sequence)
sequence_embedding_features = self._process_seqemb_features(alignment_dir)
else:
msa_features = self._process_msa_feats(alignment_dir, input_sequence)
return {**core_feats, **template_features, **msa_features}
return {**core_feats, **template_features, **msa_features, **sequence_embedding_features}
def process_multiseq_fasta(self,
fasta_path: str,

View File

@ -40,9 +40,11 @@ def np_to_tensor_dict(
Returns:
A dictionary of features mapping feature names to features. Only the given
features are returned, all other ones are filtered out.
"""
"""
# torch generates warnings if feature is already a torch Tensor
to_tensor = lambda t: torch.tensor(t) if type(t) != torch.Tensor else t.clone().detach()
tensor_dict = {
k: torch.tensor(v) for k, v in np_example.items() if k in features
k: to_tensor(v) for k, v in np_example.items() if k in features
}
return tensor_dict
@ -61,6 +63,10 @@ def make_data_config(
feature_names = cfg.common.unsupervised_features
# Add seqemb related features if using seqemb mode.
if cfg.seqemb_mode.enabled:
feature_names += cfg.common.seqemb_features
if cfg.common.use_templates:
feature_names += cfg.common.template_features

View File

@ -139,6 +139,100 @@ class InputEmbedder(nn.Module):
return msa_emb, pair_emb
class PreembeddingEmbedder(nn.Module):
"""
Embeds the sequence pre-embedding passed to the model and the target_feat features.
"""
def __init__(
self,
tf_dim: int,
preembedding_dim: int,
c_z: int,
c_m: int,
relpos_k: int,
**kwargs,
):
"""
Args:
tf_dim:
End channel dimension of the incoming target features
preembedding_dim:
End channel dimension of the incoming embeddings
c_z:
Pair embedding dimension
c_m:
Single-Seq embedding dimension
relpos_k:
Window size used in relative position encoding
"""
super(PreembeddingEmbedder, self).__init__()
self.tf_dim = tf_dim
self.preembedding_dim = preembedding_dim
self.c_z = c_z
self.c_m = c_m
self.linear_tf_m = Linear(tf_dim, c_m)
self.linear_preemb_m = Linear(self.preembedding_dim, c_m)
self.linear_preemb_z_i = Linear(self.preembedding_dim, c_z)
self.linear_preemb_z_j = Linear(self.preembedding_dim, c_z)
# Relative Positional Encoding
self.relpos_k = relpos_k
self.no_bins = 2 * relpos_k + 1
self.linear_relpos = Linear(self.no_bins, c_z)
def relpos(self, ri: torch.Tensor):
"""
Computes relative positional encodings
Args:
ri:
"residue_index" feature of shape [*, N]
Returns:
Relative positional encoding of protein using the
residue_index feature
"""
d = ri[..., None] - ri[..., None, :]
boundaries = torch.arange(
start=-self.relpos_k, end=self.relpos_k + 1, device=d.device
)
reshaped_bins = boundaries.view(((1,) * len(d.shape)) + (len(boundaries),))
d = d[..., None] - reshaped_bins
d = torch.abs(d)
d = torch.argmin(d, dim=-1)
d = nn.functional.one_hot(d, num_classes=len(boundaries)).float()
d = d.to(ri.dtype)
return self.linear_relpos(d)
def forward(
self,
tf: torch.Tensor,
ri: torch.Tensor,
preemb: torch.Tensor,
inplace_safe: bool = False,
) -> Tuple[torch.Tensor, torch.Tensor]:
tf_m = (
self.linear_tf_m(tf)
.unsqueeze(-3)
)
preemb_emb = self.linear_preemb_m(preemb[..., None, :, :]) + tf_m
preemb_emb_i = self.linear_preemb_z_i(preemb)
preemb_emb_j = self.linear_preemb_z_j(preemb)
pair_emb = self.relpos(ri.type(preemb_emb_i.dtype))
pair_emb = add(pair_emb,
preemb_emb_i[..., None, :],
inplace=inplace_safe)
pair_emb = add(pair_emb,
preemb_emb_j[..., None, :, :],
inplace=inplace_safe)
return preemb_emb, pair_emb
class RecyclingEmbedder(nn.Module):
"""
Embeds the output of an iteration of the model for recycling.

View File

@ -87,7 +87,6 @@ class MSATransition(nn.Module):
no_batch_dims=len(m.shape[:-2]),
)
def forward(
self,
m: torch.Tensor,
@ -326,6 +325,7 @@ class EvoformerBlock(nn.Module):
transition_n: int,
msa_dropout: float,
pair_dropout: float,
no_column_attention: bool,
inf: float,
eps: float,
):
@ -339,12 +339,15 @@ class EvoformerBlock(nn.Module):
inf=inf,
)
self.msa_att_col = MSAColumnAttention(
c_m,
c_hidden_msa_att,
no_heads_msa,
inf=inf,
)
# Specifically, seqemb mode does not use column attention
self.no_column_attention = no_column_attention
if not self.no_column_attention:
self.msa_att_col = MSAColumnAttention(
c_m,
c_hidden_msa_att,
no_heads_msa,
inf=inf,
)
self.msa_dropout_layer = DropoutRowwise(msa_dropout)
@ -402,18 +405,20 @@ class EvoformerBlock(nn.Module):
),
inplace=inplace_safe,
)
m = add(m,
self.msa_att_col(
m,
mask=msa_mask,
chunk_size=chunk_size,
use_deepspeed_evo_attention=use_deepspeed_evo_attention,
use_lma=use_lma,
use_flash=use_flash,
),
inplace=inplace_safe,
)
# Specifically, column attention is not used in seqemb mode.
if not self.no_column_attention:
m = add(m,
self.msa_att_col(
m,
mask=msa_mask,
chunk_size=chunk_size,
use_deepspeed_evo_attention=use_deepspeed_evo_attention,
use_lma=use_lma,
use_flash=use_flash,
),
inplace=inplace_safe,
)
if(not inplace_safe):
input_tensors = [m, input_tensors[1]]
@ -605,6 +610,7 @@ class EvoformerStack(nn.Module):
msa_dropout: float,
pair_dropout: float,
blocks_per_ckpt: int,
no_column_attention: bool,
inf: float,
eps: float,
clear_cache_between_blocks: bool = False,
@ -642,6 +648,9 @@ class EvoformerStack(nn.Module):
Dropout used for pair activations
blocks_per_ckpt:
Number of Evoformer blocks in each activation checkpoint
no_column_attention:
When True, doesn't use column attention. Required for running
sequence embedding mode
clear_cache_between_blocks:
Whether to clear CUDA's GPU memory cache between blocks of the
stack. Slows down each block but can reduce fragmentation
@ -668,6 +677,7 @@ class EvoformerStack(nn.Module):
transition_n=transition_n,
msa_dropout=msa_dropout,
pair_dropout=pair_dropout,
no_column_attention=no_column_attention,
inf=inf,
eps=eps,
)

View File

@ -24,6 +24,7 @@ from openfold.model.embedders import (
TemplateAngleEmbedder,
TemplatePairEmbedder,
ExtraMSAEmbedder,
PreembeddingEmbedder,
)
from openfold.model.evoformer import EvoformerStack, ExtraMSAStack
from openfold.model.heads import AuxiliaryHeads
@ -71,11 +72,19 @@ class AlphaFold(nn.Module):
self.config = config.model
self.template_config = self.config.template
self.extra_msa_config = self.config.extra_msa
self.seqemb_mode = config.globals.seqemb_mode_enabled
# Main trunk + structure module
self.input_embedder = InputEmbedder(
**self.config["input_embedder"],
)
# If using seqemb mode, embed the sequence embeddings passed
# to the model ("preembeddings") instead of embedding the sequence
if self.seqemb_mode:
self.input_embedder = PreembeddingEmbedder(
**self.config["preembedding_embedder"],
)
else:
self.input_embedder = InputEmbedder(
**self.config["input_embedder"],
)
self.recycling_embedder = RecyclingEmbedder(
**self.config["recycling_embedder"],
)
@ -238,17 +247,27 @@ class AlphaFold(nn.Module):
seq_mask = feats["seq_mask"]
pair_mask = seq_mask[..., None] * seq_mask[..., None, :]
msa_mask = feats["msa_mask"]
## Initialize the MSA and pair representations
# m: [*, S_c, N, C_m]
## Initialize the SingleSeq and pair representations
# m: [*, 1, N, C_m]
# z: [*, N, N, C_z]
m, z = self.input_embedder(
feats["target_feat"],
feats["residue_index"],
feats["msa_feat"],
inplace_safe=inplace_safe,
)
if self.seqemb_mode:
m, z = self.input_embedder(
feats["target_feat"],
feats["residue_index"],
feats["seq_embedding"]
)
else:
## Initialize the MSA and pair representations
# m: [*, S_c, N, C_m]
# z: [*, N, N, C_z]
m, z = self.input_embedder(
feats["target_feat"],
feats["residue_index"],
feats["msa_feat"],
inplace_safe=inplace_safe,
)
# Unpack the recycling embeddings. Removing them from the list allows
# them to be freed further down in this function, saving memory

View File

@ -23,7 +23,7 @@ if deepspeed_is_installed:
import deepspeed
if ds4s_is_installed:
from deepspeed.ops.deepspeed4science import DS4Sci_EvoformerAttention
from deepspeed.ops.deepspeed4science import EvoformerFusedAttention
fa_is_installed = importlib.util.find_spec("flash_attn") is not None
if fa_is_installed:
@ -661,18 +661,19 @@ def _deepspeed_evo_attn(
v = reshape_dims(v)
biases = [reshape_dims(b) for b in biases]
biases.extend([None] * (2 - len(biases)))
# DeepSpeed attn. kernel requires inputs to be type bf16 or fp16
# Cast to bf16 so kernel can be used during inference
orig_dtype = q.dtype
if orig_dtype not in [torch.bfloat16, torch.float16]:
o = DS4Sci_EvoformerAttention(q.to(dtype=torch.bfloat16),
k.to(dtype=torch.bfloat16),
v.to(dtype=torch.bfloat16),
[b.to(dtype=torch.bfloat16) for b in biases])
inputs_bf16 = [x.to(dtype=torch.bfloat16) if x is not None else x
for x in (q, k, v, biases[0], biases[1])]
o = EvoformerFusedAttention.apply(*inputs_bf16)
o = o.to(dtype=orig_dtype)
else:
o = DS4Sci_EvoformerAttention(q, k, v, biases)
o = EvoformerFusedAttention.apply(q, k, v, biases[0], biases[1])
o = o.reshape(orig_shape)
return o

View File

@ -28,18 +28,10 @@ import openfold.utils.loss as loss
from openfold.np.relax import cleanup, utils
import ml_collections
import numpy as np
try:
# openmm >= 7.6
import openmm
from openmm import unit
from openmm import app as openmm_app
from openmm.app.internal.pdbstructure import PdbStructure
except ImportError:
# openmm < 7.6 (requires DeepMind patch)
from simtk import openmm
from simtk import unit
from simtk.openmm import app as openmm_app
from simtk.openmm.app.internal.pdbstructure import PdbStructure
import openmm
from openmm import unit
from openmm import app as openmm_app
from openmm.app.internal.pdbstructure import PdbStructure
ENERGY = unit.kilocalories_per_mole
LENGTH = unit.angstroms

View File

@ -20,14 +20,8 @@ cases like removing chains of length one (see clean_structure).
import io
import pdbfixer
try:
# openmm >= 7.6
from openmm import app
from openmm.app import element
except ImportError:
# openmm < 7.6 (requires DeepMind patch)
from simtk.openmm import app
from simtk.openmm.app import element
from openmm import app
from openmm.app import element
def fix_pdb(pdbfile, alterations_info):

View File

@ -18,14 +18,8 @@ import io
from openfold.np import residue_constants
from Bio import PDB
import numpy as np
try:
# openmm >= 7.6
from openmm import app as openmm_app
from openmm.app.internal.pdbstructure import PdbStructure
except ImportError:
# openmm < 7.6 (requires DeepMind patch)
from simtk.openmm import app as openmm_app
from simtk.openmm.app.internal.pdbstructure import PdbStructure
from openmm import app as openmm_app
from openmm.app.internal.pdbstructure import PdbStructure
def overwrite_pdb_coordinates(pdb_str: str, pos) -> str:

View File

@ -159,7 +159,7 @@ def run_model(model, batch, tag, output_dir):
out = model(batch)
inference_time = time.perf_counter() - t
logger.info(f"Inference time: {inference_time}")
update_timings({"inference": inference_time}, os.path.join(output_dir, "timings.json"))
update_timings({tag: {"inference": inference_time}}, os.path.join(output_dir, "timings.json"))
model.config.template.enabled = template_enabled

View File

@ -55,6 +55,7 @@ from openfold.utils.trace_utils import (
pad_feature_dict_seq,
trace_model_,
)
from scripts.precompute_embeddings import EmbeddingGenerator
from scripts.utils import add_data_args
@ -73,17 +74,29 @@ def precompute_alignments(tags, seqs, alignment_dir, args):
os.makedirs(local_alignment_dir)
alignment_runner = data_pipeline.AlignmentRunner(
jackhmmer_binary_path=args.jackhmmer_binary_path,
hhblits_binary_path=args.hhblits_binary_path,
hhsearch_binary_path=args.hhsearch_binary_path,
uniref90_database_path=args.uniref90_database_path,
mgnify_database_path=args.mgnify_database_path,
bfd_database_path=args.bfd_database_path,
uniclust30_database_path=args.uniclust30_database_path,
pdb70_database_path=args.pdb70_database_path,
no_cpus=args.cpus,
)
# In seqemb mode, use AlignmentRunner only to generate templates
if args.use_single_seq_mode:
alignment_runner = data_pipeline.AlignmentRunner(
jackhmmer_binary_path=args.jackhmmer_binary_path,
hhsearch_binary_path=args.hhsearch_binary_path,
uniref90_database_path=args.uniref90_database_path,
pdb70_database_path=args.pdb70_database_path,
no_cpus=args.cpus,
)
embedding_generator = EmbeddingGenerator()
embedding_generator.run(tmp_fasta_path, alignment_dir)
else:
alignment_runner = data_pipeline.AlignmentRunner(
jackhmmer_binary_path=args.jackhmmer_binary_path,
hhblits_binary_path=args.hhblits_binary_path,
hhsearch_binary_path=args.hhsearch_binary_path,
uniref90_database_path=args.uniref90_database_path,
mgnify_database_path=args.mgnify_database_path,
bfd_database_path=args.bfd_database_path,
uniclust30_database_path=args.uniclust30_database_path,
pdb70_database_path=args.pdb70_database_path,
no_cpus=args.cpus,
)
alignment_runner.run(
tmp_fasta_path, local_alignment_dir
)
@ -116,7 +129,9 @@ def generate_feature_dict(
local_alignment_dir = os.path.join(alignment_dir, tag)
feature_dict = data_processor.process_fasta(
fasta_path=tmp_fasta_path, alignment_dir=local_alignment_dir
fasta_path=tmp_fasta_path,
alignment_dir=local_alignment_dir,
seqemb_mode=args.use_single_seq_mode,
)
else:
with open(tmp_fasta_path, "w") as fp:
@ -140,6 +155,8 @@ def main(args):
# Create the output directory
os.makedirs(args.output_dir, exist_ok=True)
if args.config_preset.startswith("seq"):
args.use_single_seq_mode = True
config = model_config(args.config_preset, long_sequence_inference=args.long_sequence_inference)
if(args.trace_model):
@ -314,6 +331,10 @@ if __name__ == "__main__":
help="""Path to alignment directory. If provided, alignment computation
is skipped and database path arguments are ignored."""
)
parser.add_argument(
"--use_single_seq_mode", action="store_true", default=False,
help="""Use single sequence embeddings instead of MSAs."""
)
parser.add_argument(
"--output_dir", type=str, default=os.getcwd(),
help="""Name of the directory in which to output the prediction""",

View File

@ -1,54 +1,26 @@
#!/bin/bash
CONDA_INSTALL_URL=${CONDA_INSTALL_URL:-"https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh"}
source scripts/vars.sh
# Install Miniconda locally
rm -rf lib/conda
rm -f /tmp/Miniconda3-latest-Linux-x86_64.sh
wget -P /tmp \
"${CONDA_INSTALL_URL}" \
&& bash /tmp/Miniconda3-latest-Linux-x86_64.sh -b -p lib/conda \
&& rm /tmp/Miniconda3-latest-Linux-x86_64.sh
# Grab conda-only packages
export PATH=lib/conda/bin:$PATH
lib/conda/bin/python3 -m pip install nvidia-pyindex
conda env create --name=${ENV_NAME} -f environment.yml
source scripts/activate_conda_env.sh
echo "Attempting to install FlashAttention"
git clone https://github.com/HazyResearch/flash-attention
CUR_DIR=$PWD
cd flash-attention
git checkout 5b838a8bef
python3 setup.py install
cd $CUR_DIR
echo "Attempting to download CUTLASS, required for Deepspeed Evoformer attention kernel"
git clone https://github.com/NVIDIA/cutlass.git
conda env config vars set CUTLASS_PATH=$PWD/cutlass
source scripts/activate_conda_env.sh
# Install DeepMind's OpenMM patch
OPENFOLD_DIR=$PWD
pushd lib/conda/envs/$ENV_NAME/lib/python3.9/site-packages/ \
&& patch -p0 < $OPENFOLD_DIR/lib/openmm.patch \
&& popd
# Download folding resources
wget --no-check-certificate -P openfold/resources \
wget -N --no-check-certificate -P openfold/resources \
https://git.scicore.unibas.ch/schwede/openstructure/-/raw/7102c63615b64735c4941278d92b554ec94415f8/modules/mol/alg/src/stereo_chemical_props.txt
# Certain tests need access to this file
mkdir -p tests/test_data/alphafold/common
ln -rs openfold/resources/stereo_chemical_props.txt tests/test_data/alphafold/common
echo "Downloading OpenFold parameters..."
bash scripts/download_openfold_params.sh openfold/resources
echo "Downloading AlphaFold parameters..."
bash scripts/download_alphafold_params.sh openfold/resources
# Decompress test data
gunzip tests/test_data/sample_feats.pickle.gz
gunzip -c tests/test_data/sample_feats.pickle.gz > tests/test_data/sample_feats.pickle
python setup.py install
export LD_LIBRARY_PATH=$CONDA_PREFIX/lib:$LD_LIBRARY_PATH
echo "Attempting to download CUTLASS, required for Deepspeed Evoformer attention kernel"
git clone https://github.com/NVIDIA/cutlass --depth 1
conda env config vars set CUTLASS_PATH=$PWD/cutlass
# This setting is used to fix a worker assignment issue during data loading
conda env config vars set KMP_AFFINITY=none
# Reactivate env so that the above environment variables take effect
conda activate $CONDA_PREFIX

View File

@ -0,0 +1,200 @@
# Some functions borrowed from [ESM](https://www.github.com/facebookresearch/esm)
import argparse
import logging
import os
import torch
from openfold.data import parsers
logging.basicConfig(level=logging.INFO)
class SequenceDataset(object):
def __init__(self, labels, sequences) -> None:
self.labels = labels
self.sequences = sequences
@classmethod
def from_file(cls, fasta_file):
labels, sequences = [], []
with open(fasta_file, "r") as infile:
fasta_str = infile.read()
sequences, labels = parsers.parse_fasta(fasta_str)
assert len(set(labels)) == len(labels),\
"Sequence labels need to be unique. Duplicates found!"
return cls(labels, sequences)
def __len__(self):
return len(self.labels)
def __getitem__(self, idx):
return self.labels[idx], self.sequences[idx]
def get_batch_indices(self, toks_per_batch, extra_toks_per_seq):
sizes = [(len(s), i) for i, s in enumerate(self.sequences)]
sizes.sort()
batches = []
buf = []
max_len = 0
def _flush_current_buf():
nonlocal max_len, buf
if len(buf) == 0:
return
batches.append(buf)
buf = []
max_len = 0
for sz, i in sizes:
sz += extra_toks_per_seq
if max(sz, max_len) * (len(buf)+1) > toks_per_batch:
_flush_current_buf()
max_len = max(max_len, sz)
buf.append(i)
_flush_current_buf()
return batches
class EmbeddingGenerator:
"""Generates the ESM-1b embeddings for the single sequence model"""
def __init__(self,
toks_per_batch: int = 4096,
truncate: bool = True,
use_local_esm: str = None,
nogpu: bool = False,
):
self.toks_per_batch = toks_per_batch
self.truncate = truncate
self.use_local_esm = use_local_esm
self.nogpu = nogpu
# Generate embeddings in bulk
if self.use_local_esm:
self.model, self.alphabet = torch.hub.load(self.use_local_esm, "esm1b_t33_650M_UR50S", source='local')
else:
self.model, self.alphabet = torch.hub.load("facebookresearch/esm:main", "esm1b_t33_650M_UR50S")
if torch.cuda.is_available() and not self.nogpu:
self.model = self.model.to(device="cuda")
def parse_sequences(self, fasta_dir, output_dir):
labels = []
seqs = []
# Generate a single bulk file
for f in os.listdir(fasta_dir):
f_name, ext = os.path.splitext(f)
if ext != '.fasta' and ext != '.fa':
logging.warning(f"Ignoring non-FASTA file: {f}")
continue
with open(os.path.join(fasta_dir, f), 'r') as infile:
seq = infile.readlines()[1].strip()
labels.append(f_name)
seqs.append(seq)
lines = []
for label, seq in zip(labels, seqs):
lines += f'>{label}\n'
lines += f'{seq}\n'
os.makedirs(output_dir, exist_ok=True)
temp_fasta_file = os.path.join(output_dir, 'temp.fasta')
with open(temp_fasta_file, 'w') as outfile:
outfile.writelines(lines)
return temp_fasta_file
def run(
self,
fasta_file,
output_dir,
):
dataset = SequenceDataset.from_file(fasta_file)
batches = dataset.get_batch_indices(self.toks_per_batch, extra_toks_per_seq=1)
data_loader = torch.utils.data.DataLoader(
dataset, collate_fn=self.alphabet.get_batch_converter(), batch_sampler=batches
)
logging.info("Loaded all sequences")
repr_layers = [33]
with torch.no_grad():
for batch_idx, (labels, strs, toks) in enumerate(data_loader):
logging.info(f"Processing {batch_idx + 1} of {len(batches)} batches ({toks.size(0)} sequences)")
if torch.cuda.is_available() and not self.nogpu:
toks = toks.to(device="cuda", non_blocking=True)
if self.truncate:
toks = toks[:1022]
out = self.model(toks, repr_layers=repr_layers, return_contacts=False)
representations = {
33: out["representations"][33].to(device="cpu")
}
for i, label in enumerate(labels):
os.makedirs(os.path.join(output_dir, label), exist_ok=True)
result = {"label": label}
result["representations"] = {
33: representations[33][i, 1: len(strs[i]) + 1].clone()
}
torch.save(
result,
os.path.join(output_dir, label, label+".pt")
)
def main(args):
logging.info("Loading the model...")
embedding_generator = EmbeddingGenerator(
args.toks_per_batch,
args.truncate,
args.use_local_esm,
args.nogpu)
logging.info("Loading the sequences and running the inference...")
temp_fasta_file = embedding_generator.parse_sequences(
args.fasta_dir,
args.output_dir
)
embedding_generator.run(
temp_fasta_file,
args.output_dir
)
os.remove(temp_fasta_file)
logging.info("Completed.")
if __name__ == "__main__":
parser = argparse.ArgumentParser()
parser.add_argument(
"fasta_dir", type=str,
help="""Path to directory containing FASTA files."""
)
parser.add_argument(
"output_dir", type=str,
help="Directory in which to output embeddings"
)
parser.add_argument(
"--toks_per_batch", type=int, default=4096,
help="maximum tokens in a batch"
)
parser.add_argument(
"--truncate", action="store_true", default=True,
help="Truncate sequences longer than 1022 (ESM restriction). Default: True"
)
parser.add_argument(
"--use_local_esm", type=str, default=None,
help="Use a local ESM repository instead of cloning from Github"
)
parser.add_argument(
"--nogpu", action="store_true",
help="Do not use GPU"
)
args = parser.parse_args()
main(args)

View File

@ -17,6 +17,7 @@ import numpy as np
import unittest
from openfold.model.embedders import (
InputEmbedder,
PreembeddingEmbedder,
RecyclingEmbedder,
TemplateAngleEmbedder,
TemplatePairEmbedder,
@ -46,6 +47,28 @@ class TestInputEmbedder(unittest.TestCase):
self.assertTrue(pair_emb.shape == (b, n_res, n_res, c_z))
class TestPreembeddingEmbedder(unittest.TestCase):
def test_shape(self):
tf_dim = 22
preembedding_dim = 1280
c_z = 4
c_m = 6
relpos_k = 10
batch_size = 4
num_res = 20
tf = torch.rand((batch_size, num_res, tf_dim))
ri = torch.rand((batch_size, num_res))
preemb = torch.rand((batch_size, num_res, preembedding_dim))
pe = PreembeddingEmbedder(tf_dim, preembedding_dim, c_z, c_m, relpos_k)
seq_emb, pair_emb = pe(tf, ri, preemb)
self.assertTrue(seq_emb.shape == (batch_size, 1, num_res, c_m))
self.assertTrue(pair_emb.shape == (batch_size, num_res, num_res, c_z))
class TestRecyclingEmbedder(unittest.TestCase):
def test_shape(self):
batch_size = 2

View File

@ -66,6 +66,7 @@ class TestEvoformerStack(unittest.TestCase):
msa_dropout,
pair_stack_dropout,
blocks_per_ckpt=None,
no_column_attention=False,
inf=inf,
eps=eps,
).eval()
@ -86,6 +87,62 @@ class TestEvoformerStack(unittest.TestCase):
self.assertTrue(z.shape == shape_z_before)
self.assertTrue(s.shape == (batch_size, n_res, c_s))
def test_shape_without_column_attention(self):
batch_size = consts.batch_size
n_seq = consts.n_seq
n_res = consts.n_res
c_m = consts.c_m
c_z = consts.c_z
c_hidden_msa_att = 12
c_hidden_opm = 17
c_hidden_mul = 19
c_hidden_pair_att = 14
c_s = consts.c_s
no_heads_msa = 3
no_heads_pair = 7
no_blocks = 2
transition_n = 2
msa_dropout = 0.15
pair_stack_dropout = 0.25
inf = 1e9
eps = 1e-10
es = EvoformerStack(
c_m,
c_z,
c_hidden_msa_att,
c_hidden_opm,
c_hidden_mul,
c_hidden_pair_att,
c_s,
no_heads_msa,
no_heads_pair,
no_blocks,
transition_n,
msa_dropout,
pair_stack_dropout,
blocks_per_ckpt=None,
no_column_attention=True,
inf=inf,
eps=eps,
).eval()
m_init = torch.rand((batch_size, n_seq, n_res, c_m))
z_init = torch.rand((batch_size, n_res, n_res, c_z))
msa_mask = torch.randint(0, 2, size=(batch_size, n_seq, n_res))
pair_mask = torch.randint(0, 2, size=(batch_size, n_res, n_res))
shape_m_before = m_init.shape
shape_z_before = z_init.shape
m, z, s = es(
m_init, z_init, chunk_size=4, msa_mask=msa_mask, pair_mask=pair_mask
)
self.assertTrue(m.shape == shape_m_before)
self.assertTrue(z.shape == shape_z_before)
self.assertTrue(s.shape == (batch_size, n_res, c_s))
@compare_utils.skip_unless_alphafold_installed()
def test_compare(self):
def run_ei(activations, masks):
@ -206,7 +263,7 @@ class TestExtraMSAStack(unittest.TestCase):
n_res,
),
device="cuda",
)
).float()
pair_mask = torch.randint(
0,
2,
@ -216,7 +273,7 @@ class TestExtraMSAStack(unittest.TestCase):
n_res,
),
device="cuda",
)
).float()
shape_z_before = z.shape

View File

@ -47,27 +47,27 @@ class TestModel(unittest.TestCase):
c.model.evoformer_stack.blocks_per_ckpt = None # don't want to set up
# deepspeed for this test
model = AlphaFold(c)
model = AlphaFold(c).cuda()
model.eval()
batch = {}
tf = torch.randint(c.model.input_embedder.tf_dim - 1, size=(n_res,))
tf = torch.randint(c.model.input_embedder.tf_dim - 1, size=(n_res,)).cuda()
batch["target_feat"] = nn.functional.one_hot(
tf, c.model.input_embedder.tf_dim
).float()
batch["aatype"] = torch.argmax(batch["target_feat"], dim=-1)
batch["residue_index"] = torch.arange(n_res)
batch["msa_feat"] = torch.rand((n_seq, n_res, c.model.input_embedder.msa_dim))
).float().cuda()
batch["aatype"] = torch.argmax(batch["target_feat"], dim=-1).cuda()
batch["residue_index"] = torch.arange(n_res).cuda()
batch["msa_feat"] = torch.rand((n_seq, n_res, c.model.input_embedder.msa_dim)).cuda()
t_feats = random_template_feats(n_templ, n_res)
batch.update({k: torch.tensor(v) for k, v in t_feats.items()})
batch.update({k: torch.tensor(v).cuda() for k, v in t_feats.items()})
extra_feats = random_extra_msa_feats(n_extra_seq, n_res)
batch.update({k: torch.tensor(v) for k, v in extra_feats.items()})
batch.update({k: torch.tensor(v).cuda() for k, v in extra_feats.items()})
batch["msa_mask"] = torch.randint(
low=0, high=2, size=(n_seq, n_res)
).float()
batch["seq_mask"] = torch.randint(low=0, high=2, size=(n_res,)).float()
).float().cuda()
batch["seq_mask"] = torch.randint(low=0, high=2, size=(n_res,)).float().cuda()
batch.update(data_transforms.make_atom14_masks(batch))
batch["no_recycling_iters"] = torch.tensor(2.)
batch["no_recycling_iters"] = torch.tensor(2.).cuda()
add_recycling_dims = lambda t: (
t.unsqueeze(-1).expand(*t.shape, c.data.common.max_recycling_iters)
@ -77,6 +77,46 @@ class TestModel(unittest.TestCase):
with torch.no_grad():
out = model(batch)
def test_dry_run_seqemb_mode(self):
n_seq = 1
n_templ = consts.n_templ
n_res = consts.n_res
msa_dim = 49
c = model_config("seq_model_esm1b")
c.model.evoformer_stack.no_blocks = 2
c.model.evoformer_stack.blocks_per_ckpt = None
model = AlphaFold(c)
model.to(torch.device('cuda'))
model.eval()
batch = {}
tf = torch.randint(c.model.preembedding_embedder.tf_dim - 1, size=(n_res,))
batch["target_feat"] = nn.functional.one_hot(tf, c.model.preembedding_embedder.tf_dim).float()
batch["aatype"] = torch.argmax(batch["target_feat"], dim=-1)
batch["residue_index"] = torch.arange(n_res)
batch["msa_feat"] = torch.rand((n_seq, n_res, msa_dim))
batch["seq_embedding"] = torch.rand((n_res, c.model.preembedding_embedder.preembedding_dim))
t_feats = random_template_feats(n_templ, n_res)
batch.update({k: torch.tensor(v) for k, v in t_feats.items()})
batch["seq_mask"] = torch.randint(low=0, high=2, size=(n_res,)).float()
batch.update(data_transforms.make_atom14_masks(batch))
batch["msa_mask"] = torch.randint(low=0, high=2, size=(n_seq, n_res)).float()
batch["no_recycling_iters"] = torch.tensor(2.)
add_recycling_dims = lambda t: (
t.unsqueeze(-1).expand(*t.shape, c.data.common.max_recycling_iters)
)
batch = tensor_tree_map(add_recycling_dims, batch)
to_cuda_device = lambda t: t.to(torch.device("cuda"))
batch = tensor_tree_map(to_cuda_device, batch)
with torch.no_grad():
out = model(batch)
@compare_utils.skip_unless_alphafold_installed()
def test_compare(self):
def run_alphafold(batch):

View File

@ -416,6 +416,10 @@ if __name__ == "__main__":
help='''Cutoff for all templates. In training mode, templates are also
filtered by the release date of the target'''
)
parser.add_argument(
"--use_single_seq_mode", type=str, default=False,
help="Use single sequence embeddings instead of MSAs."
)
parser.add_argument(
"--distillation_data_dir", type=str, default=None,
help="Directory containing training PDB files"