2 changes: 1 addition & 1 deletion extra/DEVELOPER_DOCS/Listeners.md
@@ -194,7 +194,7 @@ model = chain(
)
```

but the standalone `Tok2VecTransformer` has an additional `split_trf_batch` chained inbetween the model
but the standalone `Tok2VecTransformer` has an additional `split_trf_batch` chained in between the model
and `trfs2arrays`:

```
4 changes: 2 additions & 2 deletions extra/DEVELOPER_DOCS/Satellite Packages.md
@@ -6,7 +6,7 @@ This is a list of all the active repos relevant to spaCy besides the main one, w

These packages are always pulled in when you install spaCy. Most of them are direct dependencies, but some are transitive dependencies through other packages.

- [spacy-legacy](https://github.com/explosion/spacy-legacy): When an architecture in spaCy changes enough to get a new version, the old version is frozen and moved to spacy-legacy. This allows us to keep the core library slim while also preserving backwards compatability.
- [spacy-legacy](https://github.com/explosion/spacy-legacy): When an architecture in spaCy changes enough to get a new version, the old version is frozen and moved to spacy-legacy. This allows us to keep the core library slim while also preserving backwards compatibility.
- [thinc](https://github.com/explosion/thinc): Thinc is the machine learning library that powers trainable components in spaCy. It wraps backends like Numpy, PyTorch, and Tensorflow to provide a functional interface for specifying architectures.
- [catalogue](https://github.com/explosion/catalogue): Small library for adding function registries, like those used for model architectures in spaCy.
- [confection](https://github.com/explosion/confection): This library contains the functionality for config parsing that was formerly contained directly in Thinc.
@@ -67,7 +67,7 @@ These repos are used to support the spaCy docs or otherwise present information

These repos are used for organizing data around spaCy, but are not something an end user would need to install as part of using the library.

- [spacy-models](https://github.com/explosion/spacy-models): This repo contains metadata (but not training data) for all the spaCy models. This includes information about where their training data came from, version compatability, and performance information. It also includes tests for the model packages, and the built models are hosted as releases of this repo.
- [spacy-models](https://github.com/explosion/spacy-models): This repo contains metadata (but not training data) for all the spaCy models. This includes information about where their training data came from, version compatibility, and performance information. It also includes tests for the model packages, and the built models are hosted as releases of this repo.
- [wheelwright](https://github.com/explosion/wheelwright): A tool for automating our PyPI builds and releases.
- [ec2buildwheel](https://github.com/explosion/ec2buildwheel): A small project that allows you to build Python packages in the manner of cibuildwheel, but on any EC2 image. Used by wheelwright.

2 changes: 1 addition & 1 deletion extra/DEVELOPER_DOCS/StringStore-Vocab.md
@@ -145,7 +145,7 @@ These are things stored in the vocab:
- `get_noun_chunks`: a syntax iterator
- lex attribute getters: functions like `is_punct`, set in language defaults
- `cfg`: **not** the pipeline config, this is mostly unused
- `_unused_object`: Formerly an unused object, kept around until v4 for compatability
- `_unused_object`: Formerly an unused object, kept around until v4 for compatibility

Some of these, like the Morphology and Vectors, are complex enough that they
need their own explanations. Here we'll just look at Vocab-specific items.
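
For context, a tiny user-facing example of the lex attribute getters mentioned in the list above (a sketch using a blank English pipeline):

```python
import spacy

nlp = spacy.blank("en")
# Lexeme attributes such as is_punct are computed by the lex attribute getter
# functions stored on the vocab (set in the language defaults).
lexeme = nlp.vocab["!"]
print(lexeme.is_punct, lexeme.like_num)  # True False
```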
4 changes: 2 additions & 2 deletions extra/example_data/textcat_example_data/CC_BY-SA-3.0.txt
@@ -34,7 +34,7 @@ CONDITIONS.
Collection will not be considered an Adaptation for the purpose of
this License. For the avoidance of doubt, where the Work is a musical
work, performance or phonogram, the synchronization of the Work in
timed-relation with a moving image ("synching") will be considered an
timed-relation with a moving image ("syncing") will be considered an
Adaptation for the purpose of this License.
b. "Collection" means a collection of literary or artistic works, such as
encyclopedias and anthologies, or performances, phonograms or
@@ -264,7 +264,7 @@ subject to and limited by the following restrictions:
UNLESS OTHERWISE MUTUALLY AGREED TO BY THE PARTIES IN WRITING, LICENSOR
OFFERS THE WORK AS-IS AND MAKES NO REPRESENTATIONS OR WARRANTIES OF ANY
KIND CONCERNING THE WORK, EXPRESS, IMPLIED, STATUTORY OR OTHERWISE,
INCLUDING, WITHOUT LIMITATION, WARRANTIES OF TITLE, MERCHANTIBILITY,
INCLUDING, WITHOUT LIMITATION, WARRANTIES OF TITLE, MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE, NONINFRINGEMENT, OR THE ABSENCE OF
LATENT OR OTHER DEFECTS, ACCURACY, OR THE PRESENCE OF ABSENCE OF ERRORS,
WHETHER OR NOT DISCOVERABLE. SOME JURISDICTIONS DO NOT ALLOW THE EXCLUSION
2 changes: 1 addition & 1 deletion spacy/cli/_util.py
@@ -99,7 +99,7 @@ def parse_config_overrides(
RETURNS (Dict[str, Any]): The parsed dict, keyed by nested config setting.
"""
env_string = os.environ.get(env_var, "") if env_var else ""
env_overrides = _parse_overrides(split_arg_string(env_string))
env_overrides = _parse_overrides(split_arg_string(env_string)) # type: ignore[operator]
cli_overrides = _parse_overrides(args, is_cli=True)
if cli_overrides:
keys = [k for k in cli_overrides if k not in env_overrides]
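
As a rough illustration of what "keyed by nested config setting" means here, a standalone sketch (not the actual `_parse_overrides` implementation) that turns `--section.key value` arguments into such a dict:

```python
from typing import Any, Dict, List

def parse_overrides_sketch(args: List[str]) -> Dict[str, Any]:
    """Turn ["--training.batch_size", "128"] into {"training.batch_size": "128"}."""
    overrides: Dict[str, Any] = {}
    key = None
    for arg in args:
        if arg.startswith("--"):
            key = arg[2:]
        elif key is not None:
            overrides[key] = arg
            key = None
    return overrides

print(parse_overrides_sketch(["--training.batch_size", "128", "--paths.train", "train.spacy"]))
# {'training.batch_size': '128', 'paths.train': 'train.spacy'}
```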
2 changes: 1 addition & 1 deletion spacy/cli/info.py
@@ -84,7 +84,7 @@ def info(


def info_spacy() -> Dict[str, Any]:
"""Generate info about the current spaCy intallation.
"""Generate info about the current spaCy installation.

RETURNS (dict): The spaCy info.
"""
2 changes: 1 addition & 1 deletion spacy/glossary.py
@@ -354,7 +354,7 @@ def explain(term):
# https://github.com/ltgoslo/norne
"EVT": "Festivals, cultural events, sports events, weather phenomena, wars, etc.",
"PROD": "Product, i.e. artificially produced entities including speeches, radio shows, programming languages, contracts, laws and ideas",
"DRV": "Words (and phrases?) that are dervied from a name, but not a name in themselves, e.g. 'Oslo-mannen' ('the man from Oslo')",
"DRV": "Words (and phrases?) that are derived from a name, but not a name in themselves, e.g. 'Oslo-mannen' ('the man from Oslo')",
"GPE_LOC": "Geo-political entity, with a locative sense, e.g. 'John lives in Spain'",
"GPE_ORG": "Geo-political entity, with an organisation sense, e.g. 'Spain declined to meet with Belgium'",
}
6 changes: 3 additions & 3 deletions spacy/language.py
@@ -106,7 +106,7 @@ class BaseDefaults:

def create_tokenizer() -> Callable[["Language"], Tokenizer]:
"""Registered function to create a tokenizer. Returns a factory that takes
the nlp object and returns a Tokenizer instance using the language detaults.
the nlp object and returns a Tokenizer instance using the language defaults.
"""

def tokenizer_factory(nlp: "Language") -> Tokenizer:
@@ -173,7 +173,7 @@ def __init__(
current models may run out memory on extremely long texts, due to
large internal allocations. You should segment these texts into
meaningful units, e.g. paragraphs, subsections etc, before passing
them to spaCy. Default maximum length is 1,000,000 charas (1mb). As
them to spaCy. Default maximum length is 1,000,000 chars (1mb). As
a rule of thumb, if all pipeline components are enabled, spaCy's
default models currently requires roughly 1GB of temporary memory per
100,000 characters in one text.
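
A short illustration of the `max_length` workaround described in this docstring (the limit value below is only an example):

```python
import spacy

nlp = spacy.blank("en")
# The default limit is 1,000,000 characters. If you only run lightweight
# components on long plain text you can raise nlp.max_length, but the safer
# option is to split the text into paragraphs and process them separately.
nlp.max_length = 2_000_000
```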
@@ -2446,7 +2446,7 @@ def send(self) -> None:
q.put(item)

def step(self) -> None:
"""Tell sender that comsumed one item. Data is sent to the workers after
"""Tell sender that consumed one item. Data is sent to the workers after
every chunk_size calls.
"""
self.count += 1
2 changes: 1 addition & 1 deletion spacy/pipeline/_edit_tree_internals/edit_trees.pxd
@@ -12,7 +12,7 @@ cdef extern from "<algorithm>" namespace "std" nogil:
# An edit tree (Müller et al., 2015) is a tree structure that consists of
# edit operations. The two types of operations are string matches
# and string substitutions. Given an input string s and an output string t,
# subsitution and match nodes should be interpreted as follows:
# substitution and match nodes should be interpreted as follows:
#
# * Substitution node: consists of an original string and substitute string.
# If s matches the original string, then t is the substitute. Otherwise,
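
A simplified Python sketch of the two node types described in this comment, following the Müller et al. (2015) formulation rather than spaCy's Cython implementation: a match node copies the middle of the string and delegates the prefix and suffix to child trees, while a substitution node rewrites the whole (sub)string.

```python
from dataclasses import dataclass
from typing import Optional, Union

@dataclass
class Subst:
    orig: str
    subst: str

@dataclass
class Match:
    prefix_len: int
    suffix_len: int
    prefix: Optional["Node"] = None
    suffix: Optional["Node"] = None

Node = Union[Subst, Match]

def apply(tree: Optional[Node], s: str) -> Optional[str]:
    if tree is None:
        return s  # no subtree: copy this piece unchanged
    if isinstance(tree, Subst):
        return tree.subst if s == tree.orig else None  # None = tree not applicable
    prefix = s[: tree.prefix_len]
    suffix = s[len(s) - tree.suffix_len :] if tree.suffix_len else ""
    middle = s[tree.prefix_len : len(s) - tree.suffix_len]
    left = apply(tree.prefix, prefix)
    right = apply(tree.suffix, suffix)
    if left is None or right is None:
        return None
    return left + middle + right

# Example: strip the "ed" suffix, so "worked" lemmatizes to "work".
tree = Match(prefix_len=4, suffix_len=2, suffix=Subst("ed", ""))
print(apply(tree, "worked"))  # work
```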
2 changes: 1 addition & 1 deletion spacy/pipeline/legacy/entity_linker.py
@@ -1,5 +1,5 @@
# This file is present to provide a prior version of the EntityLinker component
# for backwards compatability. For details see #9669.
# for backwards compatibility. For details see #9669.

import random
import warnings
2 changes: 1 addition & 1 deletion spacy/pipeline/lemmatizer.py
@@ -187,7 +187,7 @@ def rule_lemmatize(self, token: Token) -> List[str]:
if univ_pos == "":
warnings.warn(Warnings.W108)
return [string.lower()]
# See Issue #435 for example of where this logic is requied.
# See Issue #435 for example of where this logic is required.
if self.is_base_form(token):
return [string.lower()]
index_table = self.lookups.get_table("lemma_index", {})
4 changes: 2 additions & 2 deletions spacy/tests/doc/test_span.py
@@ -247,7 +247,7 @@ def test_issue13769():
(1, 4, "This is"), # Overlapping with 2 sentences
(0, 2, "This is"), # Beginning of the Doc. Full sentence
(0, 1, "This is"), # Beginning of the Doc. Part of a sentence
(10, 14, "And a"), # End of the Doc. Overlapping with 2 senteces
(10, 14, "And a"), # End of the Doc. Overlapping with 2 sentences
(12, 14, "third."), # End of the Doc. Full sentence
(1, 1, "This is"), # Empty Span
],
@@ -676,7 +676,7 @@ def test_span_comparison(doc):
(3, 6, 2, 2), # Overlapping with 2 sentences
(0, 4, 1, 2), # Beginning of the Doc. Full sentence
(0, 3, 1, 2), # Beginning of the Doc. Part of a sentence
(9, 14, 2, 3), # End of the Doc. Overlapping with 2 senteces
(9, 14, 2, 3), # End of the Doc. Overlapping with 2 sentences
(10, 14, 1, 2), # End of the Doc. Full sentence
(11, 14, 1, 2), # End of the Doc. Partial sentence
(0, 0, 1, 1), # Empty Span
2 changes: 1 addition & 1 deletion spacy/tests/matcher/test_matcher_logic.py
@@ -670,7 +670,7 @@ def test_matcher_remove():
# removing once should work
matcher.remove("Rule")

# should not return any maches anymore
# should not return any matches anymore
results2 = matcher(nlp(text))
assert len(results2) == 0

2 changes: 1 addition & 1 deletion spacy/tests/parser/test_ner.py
@@ -351,7 +351,7 @@ def test_oracle_moves_whitespace(en_vocab):


def test_accept_blocked_token():
"""Test succesful blocking of tokens to be in an entity."""
"""Test successful blocking of tokens to be in an entity."""
# 1. test normal behaviour
nlp1 = English()
doc1 = nlp1("I live in New York")
2 changes: 1 addition & 1 deletion spacy/tests/pipeline/test_entity_linker.py
@@ -1288,7 +1288,7 @@ def create_kb(vocab):
entity_linker.set_kb(create_kb) # type: ignore
nlp.initialize(get_examples=lambda: train_examples)

# Add a custom rule-based component to mimick NER
# Add a custom rule-based component to mimic NER
ruler = nlp.add_pipe("entity_ruler", before="entity_linker")
ruler.add_patterns([{"label": "PERSON", "pattern": [{"LOWER": "mahler"}]}]) # type: ignore
doc = nlp(text)
2 changes: 1 addition & 1 deletion spacy/tests/pipeline/test_pipe_methods.py
@@ -47,7 +47,7 @@ def string_generator():
nlp = English()
for i, d in enumerate(nlp.pipe(string_generator())):
# We should run cleanup more than one time to actually cleanup data.
# In first run — clean up only mark strings as «not hitted».
# In first run — clean up only mark strings as «not hit».
if i == 10000 or i == 20000 or i == 30000:
gc.collect()
for t in d:
2 changes: 1 addition & 1 deletion spacy/tests/test_displacy.py
@@ -34,7 +34,7 @@ def test_issue2728(en_vocab):
@pytest.mark.issue(3288)
def test_issue3288(en_vocab):
"""Test that retokenization works correctly via displaCy when punctuation
is merged onto the preceeding token and tensor is resized."""
is merged onto the preceding token and tensor is resized."""
words = ["Hello", "World", "!", "When", "is", "this", "breaking", "?"]
heads = [1, 1, 1, 4, 4, 6, 4, 4]
deps = ["intj", "ROOT", "punct", "advmod", "ROOT", "det", "nsubj", "punct"]
2 changes: 1 addition & 1 deletion website/docs/api/curatedtransformer.mdx
@@ -410,7 +410,7 @@ attribute.

| Name | Description |
| ----------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `all_outputs` | List of `Ragged` tensors that correspends to outputs of the different transformer layers. Each tensor element corresponds to a piece identifier's representation. ~~List[Ragged]~~ |
| `all_outputs` | List of `Ragged` tensors that corresponds to outputs of the different transformer layers. Each tensor element corresponds to a piece identifier's representation. ~~List[Ragged]~~ |
| `last_layer_only` | If only the last transformer layer's outputs are preserved. ~~bool~~ |

### DocTransformerOutput.embedding_layer {id="doctransformeroutput-embeddinglayer",tag="property"}
2 changes: 1 addition & 1 deletion website/docs/api/language.mdx
@@ -1116,7 +1116,7 @@ customize the default language data:
| --------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `stop_words` | List of stop words, used for `Token.is_stop`.<br />**Example:** [`stop_words.py`](%%GITHUB_SPACY/spacy/lang/en/stop_words.py) ~~Set[str]~~ |
| `tokenizer_exceptions` | Tokenizer exception rules, string mapped to list of token attributes.<br />**Example:** [`de/tokenizer_exceptions.py`](%%GITHUB_SPACY/spacy/lang/de/tokenizer_exceptions.py) ~~Dict[str, List[dict]]~~ |
| `prefixes`, `suffixes`, `infixes` | Prefix, suffix and infix rules for the default tokenizer.<br />**Example:** [`puncutation.py`](%%GITHUB_SPACY/spacy/lang/punctuation.py) ~~Optional[Sequence[Union[str, Pattern]]]~~ |
| `prefixes`, `suffixes`, `infixes` | Prefix, suffix and infix rules for the default tokenizer.<br />**Example:** [`punctuation.py`](%%GITHUB_SPACY/spacy/lang/punctuation.py) ~~Optional[Sequence[Union[str, Pattern]]]~~ |
| `token_match` | Optional regex for matching strings that should never be split, overriding the infix rules.<br />**Example:** [`fr/tokenizer_exceptions.py`](%%GITHUB_SPACY/spacy/lang/fr/tokenizer_exceptions.py) ~~Optional[Callable]~~ |
| `url_match` | Regular expression for matching URLs. Prefixes and suffixes are removed before applying the match.<br />**Example:** [`tokenizer_exceptions.py`](%%GITHUB_SPACY/spacy/lang/tokenizer_exceptions.py) ~~Optional[Callable]~~ |
| `lex_attr_getters` | Custom functions for setting lexical attributes on tokens, e.g. `like_num`.<br />**Example:** [`lex_attrs.py`](%%GITHUB_SPACY/spacy/lang/en/lex_attrs.py) ~~Dict[int, Callable[[str], Any]]~~ |
6 changes: 3 additions & 3 deletions website/docs/api/large-language-models.mdx
@@ -590,7 +590,7 @@ candidate.
protocol required by [`spacy.EntityLinker.v1`](#el-v1). The built-in candidate
selector method allows loading existing knowledge bases in several ways, e. g.
loading from a spaCy pipeline with a (not necessarily trained) entity linking
component, and loading from a file describing the knowlege base as a .yaml file.
component, and loading from a file describing the knowledge base as a .yaml file.
Either way the loaded data will be converted to a spaCy `InMemoryLookupKB`
instance. The KB's selection capabilities are used to select the most likely
entity candidates for the specified mentions.
@@ -1103,7 +1103,7 @@ prompting.

| Argument | Description |
| --------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `examples` | Optional function that generates examples for few-shot learning. Deafults to `None`. ~~Optional[Callable[[], Iterable[Any]]]~~ |
| `examples` | Optional function that generates examples for few-shot learning. Defaults to `None`. ~~Optional[Callable[[], Iterable[Any]]]~~ |
| `parse_responses` (NEW) | Callable for parsing LLM responses for this task. Defaults to the internal parsing method for this task. ~~Optional[TaskResponseParser[SpanCatTask]]~~ |
| `prompt_example_type` (NEW) | Type to use for fewshot examples. Defaults to `TextCatExample`. ~~Optional[Type[FewshotExample]]~~ |
| `scorer` (NEW) | Scorer function that evaluates the task performance on provided examples. Defaults to the metric used by spaCy. ~~Optional[Scorer]~~ |
@@ -1624,7 +1624,7 @@ the same documents at each run that keeps batches of documents stored on disk.
| Argument | Description |
| -------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------ |
| `path` | Cache directory. If `None`, no caching is performed, and this component will act as a NoOp. Defaults to `None`. ~~Optional[Union[str, Path]]~~ |
| `batch_size` | Number of docs in one batch (file). Once a batch is full, it will be peristed to disk. Defaults to 64. ~~int~~ |
| `batch_size` | Number of docs in one batch (file). Once a batch is full, it will be persisted to disk. Defaults to 64. ~~int~~ |
| `max_batches_in_mem` | Max. number of batches to hold in memory. Allows you to limit the effect on your memory if you're handling a lot of docs. Defaults to 4. ~~int~~ |

When retrieving a document, the `BatchCache` will first figure out what batch
2 changes: 1 addition & 1 deletion website/docs/api/tokenizer.mdx
@@ -1,6 +1,6 @@
---
title: Tokenizer
teaser: Segment text into words, punctuations marks, etc.
teaser: Segment text into words, punctuation marks, etc.
tag: class
source: spacy/tokenizer.pyx
---
2 changes: 1 addition & 1 deletion website/docs/models/index.mdx
@@ -152,7 +152,7 @@ For faster processing, you may only want to run a subset of the components in a
trained pipeline. The `disable` and `exclude` arguments to
[`spacy.load`](/api/top-level#spacy.load) let you control which components are
loaded and run. Disabled components are loaded in the background so it's
possible to reenable them in the same pipeline in the future with
possible to re-enable them in the same pipeline in the future with
[`nlp.enable_pipe`](/api/language/#enable_pipe). To skip loading a component
completely, use `exclude` instead of `disable`.
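
A minimal usage sketch of the `disable`/`exclude` pattern described in this hunk (assuming the `en_core_web_sm` package is installed):

```python
import spacy

# Load the pipeline with the parser disabled: it is still loaded in the
# background, so it can be switched back on later in the same pipeline.
nlp = spacy.load("en_core_web_sm", disable=["parser"])
nlp.enable_pipe("parser")

# To skip loading a component entirely, use exclude instead of disable.
nlp_small = spacy.load("en_core_web_sm", exclude=["ner"])
```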
