
Commit 5a1f457

hidden states and flash attention (#168)
1 parent 41af605 · commit 5a1f457


15 files changed: +282 -52 lines changed


README.md

Lines changed: 28 additions & 15 deletions
@@ -1,15 +1,18 @@
-- [Installation ](#installation)
-- [ESM C](#esm-c)
-- [ESM C 300M and 600M via GitHub](#esm-c-github)
-- [ESM C via Forge API for Free Non-Commercial Use](#esm-c-forge)
-- [ESM C via SageMaker for Commercial Use](#esm-c-sagemaker)
-- [ESM C Example Usage](#esmc-example)
-- [ESM 3](#esm3)
-- [Quickstart for ESM3-open](#esm3-quickstart)
-- [Forge: Access to larger ESM3 models](#esm3-forge)
-- [ESM 3 Example Usage](#esm3-example)
-- [Responsible Development ](#responsible-development)
-- [Licenses](#licenses)
+- [Installation ](#installation-)
+- [ESM C ](#esm-c-)
+- [ESM C Local Models via GitHub ](#esm-c-local-models-via-github-)
+- [Using ESM C 6B via Forge API](#using-esm-c-6b-via-forge-api)
+- [ESM C via Forge API for Free Non-Commercial Use ](#esm-c-via-forge-api-for-free-non-commercial-use--)
+- [ESM C via SageMaker for Commercial Use ](#esm-c-via-sagemaker-for-commercial-use--)
+- [ESM C Example Usage](#esm-c-example-usage)
+- [ESM 3 ](#esm-3--)
+- [Quickstart for ESM3-open ](#quickstart-for-esm3-open-)
+- [EvolutionaryScale Forge: Access to larger ESM3 models](#evolutionaryscale-forge-access-to-larger-esm3-models)
+- [ESM3 Example Usage](#esm3-example-usage)
+- [Responsible Development ](#responsible-development-)
+- [Licenses ](#licenses--)
+- [How can I access the models and which licenses apply?](#how-can-i-access-the-models-and-which-licenses-apply)
+- [What changed with the release of ESM C?](#what-changed-with-the-release-of-esm-c)
 
 
 ## Installation <a name="installation"></a>
@@ -46,6 +49,16 @@ logits_output = client.logits(
 print(logits_output.logits, logits_output.embeddings)
 ```
 
+To use Flash Attention with the open weights:
+
+Simply install the flash-attn package, which will enable Flash Attention automatically:
+```
+pip install flash-attn --no-build-isolation
+```
+
+You can also disable flash-attn by passing ``use_flash_attn=False`` to loader functions like ``ESMC_300M_202412``.
+
+### Using ESM C 6B via Forge API
 ### ESM C via Forge API for Free Non-Commercial Use <a name="esm-c-forge"></a>
 
 The ESM C model family, including ESMC 6B, are accessible via EvolutionaryScale Forge for free [non-commercial use](#licenses).
@@ -235,13 +248,13 @@ The models can be accessed in three different ways, each with its own licensing
 1. **Code and weights** via GitHub and HuggingFace are available under either a [non-commercial](https://www.evolutionaryscale.ai/policies/cambrian-non-commercial-license-agreement) (ESM C 600M, ESM3-small-open) or an [open license](https://www.evolutionaryscale.ai/policies/cambrian-open-license-agreement) (codebase, ESM C 300M).
    1. **Building with ESM encouraged**: You can use embeddings, model predictions, fine-tune the models and use components of both the models and code. We strongly encourage anyone to build on ESM C and ESM3! Just remember to maintain the same license terms and release under the ESM name.
 2. **Free non-commercial inference API** via Forge. All models are available this way, with free credits granted to students and researchers. We want to enable academics under [non-commercial Terms of Use](https://www.evolutionaryscale.ai/policies/terms-of-use), which mirrors the non-commercial license.
-3. **Paid commercial Inference API** for commercial use via SageMaker (Forge coming soon). All ESM C models are available this way to commercial entities for commercial use under a [clickthrough license agreement](https://www.evolutionaryscale.ai/policies/cambrian-inference-clickthrough-license-agreement) with few restrictions.
+3. **Paid commercial Inference API** for commercial use via SageMaker (Forge coming soon). All ESM C models are available this way to commercial entities for commercial use under a [clickthrough license agreement](https://www.evolutionaryscale.ai/policies/cambrian-inference-clickthrough-license-agreement) with few restrictions.
    1. In broad strokes: standard commercial use like developing molecules and developing downstream ML models and methods with the model is allowed, while training competing models on the API outputs is not.
-   2. Note: For ESM3 commercial use, reach out to [[email protected]](mailto:[email protected])
+   2. Note: For ESM3 commercial use, reach out to [[email protected]](mailto:[email protected])
 
 ### What changed with the release of ESM C?
 
-We introduced a [clickthrough license agreement](https://www.evolutionaryscale.ai/policies/cambrian-inference-clickthrough-license-agreement) to enable frictionless commercial use of ESM C.
+We introduced a [clickthrough license agreement](https://www.evolutionaryscale.ai/policies/cambrian-inference-clickthrough-license-agreement) to enable frictionless commercial use of ESM C.
 
 We introduced the new [Cambrian Open License](https://www.evolutionaryscale.ai/policies/cambrian-open-license-agreement) for ESM C 300M, and at the same time moved all code in the [`esm` repo](https://github.com/evolutionaryscale/esm) under that permissive license.
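The Flash Attention note added to the README above boils down to two knobs: installing flash-attn turns the fast path on, and the `use_flash_attn` flag on the open-weights loaders turns it off. A minimal sketch, not part of this commit, assuming a CUDA device and that the checkpoint can be fetched:

```python
# Sketch of the README instructions above. Flash Attention is picked up
# automatically when the flash-attn package is importable; the flag opts out.
from esm.pretrained import ESMC_300M_202412

model = ESMC_300M_202412(device="cuda")  # flash path if flash-attn is installed
model_no_flash = ESMC_300M_202412(device="cuda", use_flash_attn=False)  # plain SDPA path
```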

esm/__init__.py

Lines changed: 1 addition & 1 deletion
@@ -1,2 +1,2 @@
-__version__ = "3.1.1"
+__version__ = "3.1.2"

esm/layers/attention.py

Lines changed: 52 additions & 3 deletions
@@ -5,7 +5,15 @@
 import torch.nn.functional as F
 from torch import nn
 
-from esm.layers.rotary import RotaryEmbedding
+from esm.layers.rotary import (
+    RotaryEmbedding,
+    TritonRotaryEmbedding,
+)
+
+try:
+    from flash_attn import flash_attn_varlen_qkvpacked_func  # type:ignore
+except ImportError:
+    flash_attn_varlen_qkvpacked_func = None  # fall back to the plain SDPA path
 
 
 class MultiHeadAttention(nn.Module):
@@ -49,9 +57,8 @@ def forward(self, x, seq_id):
         )
         query_BLD, key_BLD = self._apply_rotary(query_BLD, key_BLD)
 
-        n_heads = self.n_heads
         reshaper = functools.partial(
-            einops.rearrange, pattern="b s (h d) -> b h s d", h=n_heads
+            einops.rearrange, pattern="b s (h d) -> b h s d", h=self.n_heads
         )
 
         query_BHLD, key_BHLD, value_BHLD = map(
@@ -72,5 +79,47 @@ def forward(self, x, seq_id):
         context_BHLD = F.scaled_dot_product_attention(
             query_BHLD, key_BHLD, value_BHLD
         )
+
         context_BLD = einops.rearrange(context_BHLD, "b h s d -> b s (h d)")
+
         return self.out_proj(context_BLD)
+
+
+class FlashMultiHeadAttention(MultiHeadAttention):
+    def __init__(
+        self, d_model: int, n_heads: int, bias: bool = False, qk_layernorm: bool = True
+    ):
+        super().__init__(
+            d_model=d_model, n_heads=n_heads, bias=bias, qk_layernorm=qk_layernorm
+        )
+
+        # Flash attention rotary.
+        self.rotary = TritonRotaryEmbedding(d_model // n_heads)
+
+    def forward(self, x, seq_id):
+        assert seq_id.dtype == torch.bool
+
+        seqlens = seq_id.sum(dim=-1, dtype=torch.int32)
+        cu_seqlens = F.pad(torch.cumsum(seqlens, dim=0, dtype=torch.int32), (1, 0))
+        max_seqlen = seqlens.max().item()
+
+        qkv_ND3 = self.layernorm_qkv(x)
+
+        query_ND, key_ND, value_ND = torch.chunk(qkv_ND3, 3, dim=-1)
+        query_ND, key_ND = (
+            self.q_ln(query_ND).to(query_ND.dtype),
+            self.k_ln(key_ND).to(query_ND.dtype),
+        )
+
+        qkv_N3D = torch.stack([query_ND, key_ND, value_ND], dim=1)
+        qkv_N3HD = einops.rearrange(
+            qkv_N3D, pattern="n a (h d) -> n a h d", h=self.n_heads
+        )
+        qkv_N3HD = self.rotary(qkv_N3HD, cu_seqlens, max_seqlen)
+
+        context_NHD = flash_attn_varlen_qkvpacked_func(
+            qkv_N3HD, cu_seqlens, max_seqlen, softmax_scale=self.d_head**-0.5
+        )
+        context_ND = einops.rearrange(context_NHD, "n h d -> n (h d)")
+
+        return self.out_proj(context_ND)
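The varlen kernel used by `FlashMultiHeadAttention` sees one packed tensor of tokens rather than a padded batch, so the boolean `seq_id` mask is reduced to cumulative sequence offsets. A standalone sketch of that bookkeeping, mirroring the lines added above (values are illustrative only, not part of this commit):

```python
import torch
import torch.nn.functional as F

# Two sequences of lengths 3 and 4 in a padded batch of width 4.
seq_id = torch.tensor([
    [True, True, True, False],
    [True, True, True, True],
])

seqlens = seq_id.sum(dim=-1, dtype=torch.int32)                              # tensor([3, 4])
cu_seqlens = F.pad(torch.cumsum(seqlens, dim=0, dtype=torch.int32), (1, 0))  # tensor([0, 3, 7])
max_seqlen = int(seqlens.max().item())                                       # 4

# cu_seqlens[i]:cu_seqlens[i + 1] indexes the i-th sequence inside the packed
# (total_tokens, ...) tensor consumed by flash_attn_varlen_qkvpacked_func.
print(cu_seqlens, max_seqlen)
```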

esm/layers/blocks.py

Lines changed: 13 additions & 4 deletions
@@ -2,7 +2,10 @@
 import torch.nn as nn
 import torch.nn.functional as F
 
-from esm.layers.attention import MultiHeadAttention
+from esm.layers.attention import (
+    FlashMultiHeadAttention,
+    MultiHeadAttention,
+)
 from esm.layers.geom_attention import (
     GeometricReasoningOriginalImpl,
 )
@@ -78,6 +81,7 @@ def __init__(
         n_heads: int,
         use_geom_attn: bool = False,
         use_plain_attn: bool = True,
+        use_flash_attn: bool = False,
         v_heads: int | None = None,
         bias: bool = False,
         expansion_ratio: float = 4.0,
@@ -89,9 +93,14 @@
         super().__init__()
         self.use_plain_attn = use_plain_attn
         if self.use_plain_attn:
-            self.attn = MultiHeadAttention(
-                d_model, n_heads, bias, qk_layernorm=qk_layernorm
-            )
+            if use_flash_attn:
+                self.attn = FlashMultiHeadAttention(
+                    d_model, n_heads, bias, qk_layernorm=qk_layernorm
+                )
+            else:
+                self.attn = MultiHeadAttention(
+                    d_model, n_heads, bias, qk_layernorm=qk_layernorm
+                )
         self.use_geom_attn = use_geom_attn
         if self.use_geom_attn:
             if v_heads is None:
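The new branch simply routes each block to one of the two attention implementations, with the flag threaded down from `ESMC` through `TransformerStack`. An equivalent way to read it, as a hypothetical helper rather than the committed code:

```python
from esm.layers.attention import FlashMultiHeadAttention, MultiHeadAttention

def build_attention(d_model: int, n_heads: int, bias: bool, qk_layernorm: bool, use_flash_attn: bool):
    # Hypothetical helper (not in the repo), equivalent to the if/else added above.
    attn_cls = FlashMultiHeadAttention if use_flash_attn else MultiHeadAttention
    return attn_cls(d_model, n_heads, bias, qk_layernorm=qk_layernorm)
```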

esm/layers/rotary.py

Lines changed: 40 additions & 0 deletions
@@ -25,6 +25,13 @@
 import torch
 from einops import rearrange, repeat
 
+try:
+    from flash_attn.ops.triton.rotary import (  # type:ignore
+        apply_rotary as apply_triton_rotary,
+    )
+except ImportError:
+    apply_triton_rotary = None
+
 
 def rotate_half(x, interleaved=False):
     if not interleaved:
@@ -219,3 +226,36 @@ def forward(
             )  # type: ignore
         else:
             assert False
+
+
+class TritonRotaryEmbedding(RotaryEmbedding):
+    def forward(self, qkv: torch.Tensor, cu_seqlens, max_seqlen) -> torch.Tensor:
+        """
+        qkv: (n, 3, nheads, headdim)
+        cu_seqlens: cumulative sequence lengths
+        max_seqlen: max sequence length
+        """
+        self._update_cos_sin_cache(max_seqlen, device=qkv.device, dtype=qkv.dtype)
+        assert self._cos_cached is not None
+        assert self._sin_cached is not None
+
+        assert apply_triton_rotary is not None
+        # In-place rotation of the query (qkv[:, 0]) and key (qkv[:, 1]) slices.
+        apply_triton_rotary(
+            qkv[:, 0],
+            self._cos_cached,
+            self._sin_cached,
+            cu_seqlens=cu_seqlens,
+            max_seqlen=max_seqlen,
+            inplace=True,
+        )
+        apply_triton_rotary(
+            qkv[:, 1],
+            self._cos_cached,
+            self._sin_cached,
+            cu_seqlens=cu_seqlens,
+            max_seqlen=max_seqlen,
+            inplace=True,
+        )
+
+        return qkv
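`TritonRotaryEmbedding` reuses the parent class's cos/sin cache and rotates only the query and key slices of the packed qkv tensor in place via flash-attn's Triton kernel. For reference, a pure-PyTorch sketch of the same non-interleaved rotation for a single unpacked sequence; the shapes and cos/sin layout here are illustrative assumptions, and the Triton kernel additionally handles `cu_seqlens`-packed batches:

```python
import torch

def rotate_half(x: torch.Tensor) -> torch.Tensor:
    # GPT-NeoX style: split the head dim into halves and rotate (x1, x2) -> (-x2, x1).
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)

def apply_rotary_reference(t: torch.Tensor, cos: torch.Tensor, sin: torch.Tensor) -> torch.Tensor:
    # t: (seqlen, n_heads, head_dim); cos, sin: (seqlen, head_dim // 2).
    cos = torch.cat((cos, cos), dim=-1)[:, None, :]  # broadcast over heads
    sin = torch.cat((sin, sin), dim=-1)[:, None, :]
    return t * cos + rotate_half(t) * sin
```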

esm/layers/transformer_stack.py

Lines changed: 3 additions & 2 deletions
@@ -36,6 +36,7 @@ def __init__(
         qk_layernorm: bool = True,
         ffn_type: str = "swiglu",  # swiglu | gelu
         expansion_ratio: float = 8 / 3,
+        use_flash_attn: bool = False,
     ):
         super().__init__()
         self.blocks = nn.ModuleList(
@@ -45,6 +46,7 @@
                     n_heads,
                     v_heads=v_heads,
                     use_geom_attn=i < n_layers_geom,
+                    use_flash_attn=use_flash_attn,
                     residue_scaling_factor=(
                         math.sqrt(n_layers / 36) if scale_residue else 1.0
                     ),
@@ -66,7 +68,7 @@
         affine: Affine3D | None = None,
         affine_mask: torch.Tensor | None = None,
         chain_id: torch.Tensor | None = None,
-    ) -> tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
+    ) -> tuple[torch.Tensor, torch.Tensor, list[torch.Tensor]]:
         """
         Forward pass of the TransformerStack.
 
@@ -89,5 +91,4 @@
         for block in self.blocks:
             x = block(x, sequence_id, affine, affine_mask, chain_id)
             hiddens.append(x)
-        hiddens = torch.stack(hiddens, dim=0)
         return self.norm(x), x, hiddens
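With `hiddens` now returned as a plain list (one tensor per block) instead of a pre-stacked tensor, stacking is left to the caller; `ESMC.forward` below does it after re-padding. A small sketch of the new return type, with arbitrary sizes and the assumption that the stack's defaults allow constructing it standalone:

```python
import torch

from esm.layers.transformer_stack import TransformerStack

# Tiny stack purely for illustration, mirroring how ESMC constructs it.
stack = TransformerStack(64, 4, None, 2, n_layers_geom=0)
x = torch.randn(1, 8, 64)

x_normed, x_last, hiddens = stack(x, sequence_id=torch.ones(1, 8, dtype=torch.bool))
print(len(hiddens))                       # 2: one hidden state per block
print(torch.stack(hiddens, dim=0).shape)  # torch.Size([2, 1, 8, 64])
```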

esm/models/esmc.py

Lines changed: 57 additions & 3 deletions
@@ -7,6 +7,15 @@
 import torch.nn as nn
 from attr import dataclass
 
+try:
+    from flash_attn.bert_padding import pad_input, unpad_input  # type:ignore
+
+    is_flash_attn_available = True
+except ImportError:
+    pad_input = None
+    unpad_input = None
+    is_flash_attn_available = False
+
 from esm.layers.regression_head import RegressionHead
 from esm.layers.transformer_stack import TransformerStack
 from esm.sdk.api import (
@@ -43,13 +52,26 @@ class ESMC(nn.Module, ESMCInferenceClient):
     """
 
     def __init__(
-        self, d_model: int, n_heads: int, n_layers: int, tokenizer: EsmSequenceTokenizer
+        self,
+        d_model: int,
+        n_heads: int,
+        n_layers: int,
+        tokenizer: EsmSequenceTokenizer,
+        use_flash_attn: bool = True,
     ):
         super().__init__()
         self.embed = nn.Embedding(64, d_model)
+
+        self._use_flash_attn = is_flash_attn_available and use_flash_attn
         self.transformer = TransformerStack(
-            d_model, n_heads, None, n_layers, n_layers_geom=0
+            d_model,
+            n_heads,
+            None,
+            n_layers,
+            n_layers_geom=0,
+            use_flash_attn=self._use_flash_attn,
         )
+
         self.sequence_head = RegressionHead(d_model, 64)
         self.tokenizer = tokenizer
 
@@ -109,10 +131,41 @@ def forward(
 
         """
         if sequence_id is None:
-            sequence_id = sequence_tokens == self.tokenizer.pad_token_id
+            # For ESMC, a boolean mask is created in place of sequence_id if not specified.
+            sequence_id = sequence_tokens != self.tokenizer.pad_token_id
 
         x = self.embed(sequence_tokens)
+
+        B, L = x.shape[:2]
+
+        # With Flash Attention, sequence_id must be a boolean mask and the batch is unpadded (packed).
+        if self._use_flash_attn:
+            assert (
+                sequence_id.dtype == torch.bool
+            ), "sequence_id must be a boolean mask if Flash Attention is used"
+            assert sequence_id.shape == (B, L)
+            assert unpad_input is not None
+            x, indices, _, _, _ = unpad_input(  # type: ignore
+                x, sequence_id
+            )
+        else:
+            indices = None
+
         x, _, hiddens = self.transformer(x, sequence_id=sequence_id)
+
+        if self._use_flash_attn:
+            assert indices is not None
+            assert pad_input is not None
+            x = pad_input(x, indices, B, L)  # Back to [B, L, D]
+            hiddens = [
+                # Back to [[B, L, D], ...]
+                pad_input(h, indices, B, L)
+                for h in hiddens
+            ]
+
+        # Stack hidden states into a [n_layers, B, L, D] matrix.
+        hiddens = torch.stack(hiddens, dim=0)  # type: ignore
+
         sequence_logits = self.sequence_head(x)
         output = ESMCOutput(
             sequence_logits=sequence_logits, embeddings=x, hidden_states=hiddens
@@ -161,4 +214,5 @@ def logits(
                 sequence=output.sequence_logits if config.sequence else None
             ),
             embeddings=output.embeddings if config.return_embeddings else None,
+            hidden_states=output.hidden_states if config.return_hidden_states else None,
         )
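The `logits()` change wires the stacked hidden states through to the output when the config asks for them. A hedged sketch of requesting them through the client API; it assumes `LogitsConfig` exposes a `return_hidden_states` flag (as `config.return_hidden_states` above suggests), the `ESMC.from_pretrained("esmc_300m")` entry point from the README's ESM C example, and a GPU with the weights available:

```python
import torch

from esm.models.esmc import ESMC
from esm.sdk.api import ESMProtein, LogitsConfig

client = ESMC.from_pretrained("esmc_300m").to("cuda")
protein = ESMProtein(sequence="AAAAA")
protein_tensor = client.encode(protein)

output = client.logits(
    protein_tensor,
    LogitsConfig(sequence=True, return_embeddings=True, return_hidden_states=True),
)
print(output.embeddings.shape)     # expected [1, sequence_length, d_model]
print(output.hidden_states.shape)  # expected [n_layers, 1, sequence_length, d_model]
```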

esm/pretrained.py

Lines changed: 12 additions & 4 deletions
@@ -62,10 +62,14 @@ def ESM3_function_decoder_v0(device: torch.device | str = "cpu"):
     return model
 
 
-def ESMC_300M_202412(device: torch.device | str = "cpu"):
+def ESMC_300M_202412(device: torch.device | str = "cpu", use_flash_attn: bool = True):
     with torch.device(device):
         model = ESMC(
-            d_model=960, n_heads=15, n_layers=30, tokenizer=get_esmc_model_tokenizers()
+            d_model=960,
+            n_heads=15,
+            n_layers=30,
+            tokenizer=get_esmc_model_tokenizers(),
+            use_flash_attn=use_flash_attn,
         ).eval()
         state_dict = torch.load(
             data_root("esmc-300") / "data/weights/esmc_300m_2024_12_v0.pth",
@@ -76,10 +80,14 @@ def ESMC_300M_202412(device: torch.device | str = "cpu"):
     return model
 
 
-def ESMC_600M_202412(device: torch.device | str = "cpu"):
+def ESMC_600M_202412(device: torch.device | str = "cpu", use_flash_attn: bool = True):
     with torch.device(device):
         model = ESMC(
-            d_model=1152, n_heads=18, n_layers=36, tokenizer=get_esmc_model_tokenizers()
+            d_model=1152,
+            n_heads=18,
+            n_layers=36,
+            tokenizer=get_esmc_model_tokenizers(),
+            use_flash_attn=use_flash_attn,
         ).eval()
         state_dict = torch.load(
             data_root("esmc-600") / "data/weights/esmc_600m_2024_12_v0.pth",
