2 changes: 1 addition & 1 deletion extra/DEVELOPER_DOCS/Listeners.md
@@ -194,7 +194,7 @@ model = chain(
)
```

but the standalone `Tok2VecTransformer` has an additional `split_trf_batch` chained inbetween the model
but the standalone `Tok2VecTransformer` has an additional `split_trf_batch` chained in between the model
and `trfs2arrays`:

```
4 changes: 2 additions & 2 deletions extra/DEVELOPER_DOCS/Satellite Packages.md
@@ -6,7 +6,7 @@ This is a list of all the active repos relevant to spaCy besides the main one, w

These packages are always pulled in when you install spaCy. Most of them are direct dependencies, but some are transitive dependencies through other packages.

- [spacy-legacy](https://github.com/explosion/spacy-legacy): When an architecture in spaCy changes enough to get a new version, the old version is frozen and moved to spacy-legacy. This allows us to keep the core library slim while also preserving backwards compatability.
- [spacy-legacy](https://github.com/explosion/spacy-legacy): When an architecture in spaCy changes enough to get a new version, the old version is frozen and moved to spacy-legacy. This allows us to keep the core library slim while also preserving backwards compatibility.
- [thinc](https://github.com/explosion/thinc): Thinc is the machine learning library that powers trainable components in spaCy. It wraps backends like Numpy, PyTorch, and Tensorflow to provide a functional interface for specifying architectures.
- [catalogue](https://github.com/explosion/catalogue): Small library for adding function registries, like those used for model architectures in spaCy.
- [confection](https://github.com/explosion/confection): This library contains the functionality for config parsing that was formerly contained directly in Thinc.
@@ -67,7 +67,7 @@ These repos are used to support the spaCy docs or otherwise present information

These repos are used for organizing data around spaCy, but are not something an end user would need to install as part of using the library.

- [spacy-models](https://github.com/explosion/spacy-models): This repo contains metadata (but not training data) for all the spaCy models. This includes information about where their training data came from, version compatability, and performance information. It also includes tests for the model packages, and the built models are hosted as releases of this repo.
- [spacy-models](https://github.com/explosion/spacy-models): This repo contains metadata (but not training data) for all the spaCy models. This includes information about where their training data came from, version compatibility, and performance information. It also includes tests for the model packages, and the built models are hosted as releases of this repo.
- [wheelwright](https://github.com/explosion/wheelwright): A tool for automating our PyPI builds and releases.
- [ec2buildwheel](https://github.com/explosion/ec2buildwheel): A small project that allows you to build Python packages in the manner of cibuildwheel, but on any EC2 image. Used by wheelwright.

2 changes: 1 addition & 1 deletion extra/DEVELOPER_DOCS/StringStore-Vocab.md
@@ -145,7 +145,7 @@ These are things stored in the vocab:
- `get_noun_chunks`: a syntax iterator
- lex attribute getters: functions like `is_punct`, set in language defaults
- `cfg`: **not** the pipeline config, this is mostly unused
- `_unused_object`: Formerly an unused object, kept around until v4 for compatability
- `_unused_object`: Formerly an unused object, kept around until v4 for compatibility

Some of these, like the Morphology and Vectors, are complex enough that they
need their own explanations. Here we'll just look at Vocab-specific items.
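
For context, a tiny user-facing example of the lex attribute getters mentioned in the list above (a sketch using a blank English pipeline):

```python
import spacy

nlp = spacy.blank("en")
# Lexeme attributes such as is_punct are computed by the lex attribute getter
# functions stored on the vocab (set in the language defaults).
lexeme = nlp.vocab["!"]
print(lexeme.is_punct, lexeme.like_num)  # True False
```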
4 changes: 2 additions & 2 deletions extra/example_data/textcat_example_data/CC_BY-SA-3.0.txt
@@ -34,7 +34,7 @@ CONDITIONS.
Collection will not be considered an Adaptation for the purpose of
this License. For the avoidance of doubt, where the Work is a musical
work, performance or phonogram, the synchronization of the Work in
timed-relation with a moving image ("synching") will be considered an
timed-relation with a moving image ("syncing") will be considered an
Adaptation for the purpose of this License.
b. "Collection" means a collection of literary or artistic works, such as
encyclopedias and anthologies, or performances, phonograms or
@@ -264,7 +264,7 @@ subject to and limited by the following restrictions:
UNLESS OTHERWISE MUTUALLY AGREED TO BY THE PARTIES IN WRITING, LICENSOR
OFFERS THE WORK AS-IS AND MAKES NO REPRESENTATIONS OR WARRANTIES OF ANY
KIND CONCERNING THE WORK, EXPRESS, IMPLIED, STATUTORY OR OTHERWISE,
INCLUDING, WITHOUT LIMITATION, WARRANTIES OF TITLE, MERCHANTIBILITY,
INCLUDING, WITHOUT LIMITATION, WARRANTIES OF TITLE, MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE, NONINFRINGEMENT, OR THE ABSENCE OF
LATENT OR OTHER DEFECTS, ACCURACY, OR THE PRESENCE OF ABSENCE OF ERRORS,
WHETHER OR NOT DISCOVERABLE. SOME JURISDICTIONS DO NOT ALLOW THE EXCLUSION
2 changes: 1 addition & 1 deletion spacy/cli/_util.py
@@ -99,7 +99,7 @@ def parse_config_overrides(
RETURNS (Dict[str, Any]): The parsed dict, keyed by nested config setting.
"""
env_string = os.environ.get(env_var, "") if env_var else ""
env_overrides = _parse_overrides(split_arg_string(env_string))
env_overrides = _parse_overrides(split_arg_string(env_string)) # type: ignore[operator]
cli_overrides = _parse_overrides(args, is_cli=True)
if cli_overrides:
keys = [k for k in cli_overrides if k not in env_overrides]
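
As a rough illustration of what "keyed by nested config setting" means here, a standalone sketch (not the actual `_parse_overrides` implementation) that turns `--section.key value` arguments into such a dict:

```python
from typing import Any, Dict, List

def parse_overrides_sketch(args: List[str]) -> Dict[str, Any]:
    """Turn ["--training.batch_size", "128"] into {"training.batch_size": "128"}."""
    overrides: Dict[str, Any] = {}
    key = None
    for arg in args:
        if arg.startswith("--"):
            key = arg[2:]
        elif key is not None:
            overrides[key] = arg
            key = None
    return overrides

print(parse_overrides_sketch(["--training.batch_size", "128", "--paths.train", "train.spacy"]))
# {'training.batch_size': '128', 'paths.train': 'train.spacy'}
```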
2 changes: 1 addition & 1 deletion spacy/cli/info.py
@@ -84,7 +84,7 @@ def info(


def info_spacy() -> Dict[str, Any]:
"""Generate info about the current spaCy intallation.
"""Generate info about the current spaCy installation.

RETURNS (dict): The spaCy info.
"""
2 changes: 1 addition & 1 deletion spacy/glossary.py
@@ -354,7 +354,7 @@ def explain(term):
# https://github.com/ltgoslo/norne
"EVT": "Festivals, cultural events, sports events, weather phenomena, wars, etc.",
"PROD": "Product, i.e. artificially produced entities including speeches, radio shows, programming languages, contracts, laws and ideas",
"DRV": "Words (and phrases?) that are dervied from a name, but not a name in themselves, e.g. 'Oslo-mannen' ('the man from Oslo')",
"DRV": "Words (and phrases?) that are derived from a name, but not a name in themselves, e.g. 'Oslo-mannen' ('the man from Oslo')",
"GPE_LOC": "Geo-political entity, with a locative sense, e.g. 'John lives in Spain'",
"GPE_ORG": "Geo-political entity, with an organisation sense, e.g. 'Spain declined to meet with Belgium'",
}
6 changes: 3 additions & 3 deletions spacy/language.py
@@ -106,7 +106,7 @@ class BaseDefaults:

def create_tokenizer() -> Callable[["Language"], Tokenizer]:
"""Registered function to create a tokenizer. Returns a factory that takes
the nlp object and returns a Tokenizer instance using the language detaults.
the nlp object and returns a Tokenizer instance using the language defaults.
"""

def tokenizer_factory(nlp: "Language") -> Tokenizer:
@@ -173,7 +173,7 @@ def __init__(
current models may run out memory on extremely long texts, due to
large internal allocations. You should segment these texts into
meaningful units, e.g. paragraphs, subsections etc, before passing
them to spaCy. Default maximum length is 1,000,000 charas (1mb). As
them to spaCy. Default maximum length is 1,000,000 chars (1mb). As
a rule of thumb, if all pipeline components are enabled, spaCy's
default models currently requires roughly 1GB of temporary memory per
100,000 characters in one text.
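
A short illustration of the `max_length` workaround described in this docstring (the limit value below is only an example):

```python
import spacy

nlp = spacy.blank("en")
# The default limit is 1,000,000 characters. If you only run lightweight
# components on long plain text you can raise nlp.max_length, but the safer
# option is to split the text into paragraphs and process them separately.
nlp.max_length = 2_000_000
```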
@@ -2446,7 +2446,7 @@ def send(self) -> None:
q.put(item)

def step(self) -> None:
"""Tell sender that comsumed one item. Data is sent to the workers after
"""Tell sender that consumed one item. Data is sent to the workers after
every chunk_size calls.
"""
self.count += 1
2 changes: 1 addition & 1 deletion spacy/pipeline/_edit_tree_internals/edit_trees.pxd
@@ -12,7 +12,7 @@ cdef extern from "<algorithm>" namespace "std" nogil:
# An edit tree (Müller et al., 2015) is a tree structure that consists of
# edit operations. The two types of operations are string matches
# and string substitutions. Given an input string s and an output string t,
# subsitution and match nodes should be interpreted as follows:
# substitution and match nodes should be interpreted as follows:
#
# * Substitution node: consists of an original string and substitute string.
# If s matches the original string, then t is the substitute. Otherwise,
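
A simplified Python sketch of the two node types described in this comment, following the Müller et al. (2015) formulation rather than spaCy's Cython implementation: a match node copies the middle of the string and delegates the prefix and suffix to child trees, while a substitution node rewrites the whole (sub)string.

```python
from dataclasses import dataclass
from typing import Optional, Union

@dataclass
class Subst:
    orig: str
    subst: str

@dataclass
class Match:
    prefix_len: int
    suffix_len: int
    prefix: Optional["Node"] = None
    suffix: Optional["Node"] = None

Node = Union[Subst, Match]

def apply(tree: Optional[Node], s: str) -> Optional[str]:
    if tree is None:
        return s  # no subtree: copy this piece unchanged
    if isinstance(tree, Subst):
        return tree.subst if s == tree.orig else None  # None = tree not applicable
    prefix = s[: tree.prefix_len]
    suffix = s[len(s) - tree.suffix_len :] if tree.suffix_len else ""
    middle = s[tree.prefix_len : len(s) - tree.suffix_len]
    left = apply(tree.prefix, prefix)
    right = apply(tree.suffix, suffix)
    if left is None or right is None:
        return None
    return left + middle + right

# Example: strip the "ed" suffix, so "worked" lemmatizes to "work".
tree = Match(prefix_len=4, suffix_len=2, suffix=Subst("ed", ""))
print(apply(tree, "worked"))  # work
```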
2 changes: 1 addition & 1 deletion spacy/pipeline/legacy/entity_linker.py
@@ -1,5 +1,5 @@
# This file is present to provide a prior version of the EntityLinker component
# for backwards compatability. For details see #9669.
# for backwards compatibility. For details see #9669.

import random
import warnings
2 changes: 1 addition & 1 deletion spacy/pipeline/lemmatizer.py
@@ -187,7 +187,7 @@ def rule_lemmatize(self, token: Token) -> List[str]:
if univ_pos == "":
warnings.warn(Warnings.W108)
return [string.lower()]
# See Issue #435 for example of where this logic is requied.
# See Issue #435 for example of where this logic is required.
if self.is_base_form(token):
return [string.lower()]
index_table = self.lookups.get_table("lemma_index", {})
4 changes: 2 additions & 2 deletions spacy/tests/doc/test_span.py
@@ -247,7 +247,7 @@ def test_issue13769():
(1, 4, "This is"), # Overlapping with 2 sentences
(0, 2, "This is"), # Beginning of the Doc. Full sentence
(0, 1, "This is"), # Beginning of the Doc. Part of a sentence
(10, 14, "And a"), # End of the Doc. Overlapping with 2 senteces
(10, 14, "And a"), # End of the Doc. Overlapping with 2 sentences
(12, 14, "third."), # End of the Doc. Full sentence
(1, 1, "This is"), # Empty Span
],
@@ -676,7 +676,7 @@ def test_span_comparison(doc):
(3, 6, 2, 2), # Overlapping with 2 sentences
(0, 4, 1, 2), # Beginning of the Doc. Full sentence
(0, 3, 1, 2), # Beginning of the Doc. Part of a sentence
(9, 14, 2, 3), # End of the Doc. Overlapping with 2 senteces
(9, 14, 2, 3), # End of the Doc. Overlapping with 2 sentences
(10, 14, 1, 2), # End of the Doc. Full sentence
(11, 14, 1, 2), # End of the Doc. Partial sentence
(0, 0, 1, 1), # Empty Span
2 changes: 1 addition & 1 deletion spacy/tests/matcher/test_matcher_logic.py
@@ -670,7 +670,7 @@ def test_matcher_remove():
# removing once should work
matcher.remove("Rule")

# should not return any maches anymore
# should not return any matches anymore
results2 = matcher(nlp(text))
assert len(results2) == 0

2 changes: 1 addition & 1 deletion spacy/tests/parser/test_ner.py
@@ -351,7 +351,7 @@ def test_oracle_moves_whitespace(en_vocab):


def test_accept_blocked_token():
"""Test succesful blocking of tokens to be in an entity."""
"""Test successful blocking of tokens to be in an entity."""
# 1. test normal behaviour
nlp1 = English()
doc1 = nlp1("I live in New York")
2 changes: 1 addition & 1 deletion spacy/tests/pipeline/test_entity_linker.py
@@ -1288,7 +1288,7 @@ def create_kb(vocab):
entity_linker.set_kb(create_kb) # type: ignore
nlp.initialize(get_examples=lambda: train_examples)

# Add a custom rule-based component to mimick NER
# Add a custom rule-based component to mimic NER
ruler = nlp.add_pipe("entity_ruler", before="entity_linker")
ruler.add_patterns([{"label": "PERSON", "pattern": [{"LOWER": "mahler"}]}]) # type: ignore
doc = nlp(text)
2 changes: 1 addition & 1 deletion spacy/tests/pipeline/test_pipe_methods.py
@@ -47,7 +47,7 @@ def string_generator():
nlp = English()
for i, d in enumerate(nlp.pipe(string_generator())):
# We should run cleanup more than one time to actually cleanup data.
# In first run — clean up only mark strings as «not hitted».
# In first run — clean up only mark strings as «not hit».
if i == 10000 or i == 20000 or i == 30000:
gc.collect()
for t in d:
2 changes: 1 addition & 1 deletion spacy/tests/test_displacy.py
@@ -34,7 +34,7 @@ def test_issue2728(en_vocab):
@pytest.mark.issue(3288)
def test_issue3288(en_vocab):
"""Test that retokenization works correctly via displaCy when punctuation
is merged onto the preceeding token and tensor is resized."""
is merged onto the preceding token and tensor is resized."""
words = ["Hello", "World", "!", "When", "is", "this", "breaking", "?"]
heads = [1, 1, 1, 4, 4, 6, 4, 4]
deps = ["intj", "ROOT", "punct", "advmod", "ROOT", "det", "nsubj", "punct"]
2 changes: 1 addition & 1 deletion website/docs/api/curatedtransformer.mdx
@@ -410,7 +410,7 @@ attribute.

| Name | Description |
| ----------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `all_outputs` | List of `Ragged` tensors that correspends to outputs of the different transformer layers. Each tensor element corresponds to a piece identifier's representation. ~~List[Ragged]~~ |
| `all_outputs` | List of `Ragged` tensors that corresponds to outputs of the different transformer layers. Each tensor element corresponds to a piece identifier's representation. ~~List[Ragged]~~ |
| `last_layer_only` | If only the last transformer layer's outputs are preserved. ~~bool~~ |

### DocTransformerOutput.embedding_layer {id="doctransformeroutput-embeddinglayer",tag="property"}
2 changes: 1 addition & 1 deletion website/docs/api/language.mdx
@@ -1116,7 +1116,7 @@ customize the default language data:
| --------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `stop_words` | List of stop words, used for `Token.is_stop`.<br />**Example:** [`stop_words.py`](%%GITHUB_SPACY/spacy/lang/en/stop_words.py) ~~Set[str]~~ |
| `tokenizer_exceptions` | Tokenizer exception rules, string mapped to list of token attributes.<br />**Example:** [`de/tokenizer_exceptions.py`](%%GITHUB_SPACY/spacy/lang/de/tokenizer_exceptions.py) ~~Dict[str, List[dict]]~~ |
| `prefixes`, `suffixes`, `infixes` | Prefix, suffix and infix rules for the default tokenizer.<br />**Example:** [`puncutation.py`](%%GITHUB_SPACY/spacy/lang/punctuation.py) ~~Optional[Sequence[Union[str, Pattern]]]~~ |
| `prefixes`, `suffixes`, `infixes` | Prefix, suffix and infix rules for the default tokenizer.<br />**Example:** [`punctuation.py`](%%GITHUB_SPACY/spacy/lang/punctuation.py) ~~Optional[Sequence[Union[str, Pattern]]]~~ |
| `token_match` | Optional regex for matching strings that should never be split, overriding the infix rules.<br />**Example:** [`fr/tokenizer_exceptions.py`](%%GITHUB_SPACY/spacy/lang/fr/tokenizer_exceptions.py) ~~Optional[Callable]~~ |
| `url_match` | Regular expression for matching URLs. Prefixes and suffixes are removed before applying the match.<br />**Example:** [`tokenizer_exceptions.py`](%%GITHUB_SPACY/spacy/lang/tokenizer_exceptions.py) ~~Optional[Callable]~~ |
| `lex_attr_getters` | Custom functions for setting lexical attributes on tokens, e.g. `like_num`.<br />**Example:** [`lex_attrs.py`](%%GITHUB_SPACY/spacy/lang/en/lex_attrs.py) ~~Dict[int, Callable[[str], Any]]~~ |
6 changes: 3 additions & 3 deletions website/docs/api/large-language-models.mdx
@@ -590,7 +590,7 @@ candidate.
protocol required by [`spacy.EntityLinker.v1`](#el-v1). The built-in candidate
selector method allows loading existing knowledge bases in several ways, e. g.
loading from a spaCy pipeline with a (not necessarily trained) entity linking
component, and loading from a file describing the knowlege base as a .yaml file.
component, and loading from a file describing the knowledge base as a .yaml file.
Either way the loaded data will be converted to a spaCy `InMemoryLookupKB`
instance. The KB's selection capabilities are used to select the most likely
entity candidates for the specified mentions.
@@ -1103,7 +1103,7 @@ prompting.

| Argument | Description |
| --------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `examples` | Optional function that generates examples for few-shot learning. Deafults to `None`. ~~Optional[Callable[[], Iterable[Any]]]~~ |
| `examples` | Optional function that generates examples for few-shot learning. Defaults to `None`. ~~Optional[Callable[[], Iterable[Any]]]~~ |
| `parse_responses` (NEW) | Callable for parsing LLM responses for this task. Defaults to the internal parsing method for this task. ~~Optional[TaskResponseParser[SpanCatTask]]~~ |
| `prompt_example_type` (NEW) | Type to use for fewshot examples. Defaults to `TextCatExample`. ~~Optional[Type[FewshotExample]]~~ |
| `scorer` (NEW) | Scorer function that evaluates the task performance on provided examples. Defaults to the metric used by spaCy. ~~Optional[Scorer]~~ |
@@ -1624,7 +1624,7 @@ the same documents at each run that keeps batches of documents stored on disk.
| Argument | Description |
| -------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------ |
| `path` | Cache directory. If `None`, no caching is performed, and this component will act as a NoOp. Defaults to `None`. ~~Optional[Union[str, Path]]~~ |
| `batch_size` | Number of docs in one batch (file). Once a batch is full, it will be peristed to disk. Defaults to 64. ~~int~~ |
| `batch_size` | Number of docs in one batch (file). Once a batch is full, it will be persisted to disk. Defaults to 64. ~~int~~ |
| `max_batches_in_mem` | Max. number of batches to hold in memory. Allows you to limit the effect on your memory if you're handling a lot of docs. Defaults to 4. ~~int~~ |

When retrieving a document, the `BatchCache` will first figure out what batch
2 changes: 1 addition & 1 deletion website/docs/api/tokenizer.mdx
@@ -1,6 +1,6 @@
---
title: Tokenizer
teaser: Segment text into words, punctuations marks, etc.
teaser: Segment text into words, punctuation marks, etc.
tag: class
source: spacy/tokenizer.pyx
---
2 changes: 1 addition & 1 deletion website/docs/models/index.mdx
@@ -152,7 +152,7 @@ For faster processing, you may only want to run a subset of the components in a
trained pipeline. The `disable` and `exclude` arguments to
[`spacy.load`](/api/top-level#spacy.load) let you control which components are
loaded and run. Disabled components are loaded in the background so it's
possible to reenable them in the same pipeline in the future with
possible to re-enable them in the same pipeline in the future with
[`nlp.enable_pipe`](/api/language/#enable_pipe). To skip loading a component
completely, use `exclude` instead of `disable`.
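
A minimal usage sketch of the `disable`/`exclude` pattern described in this hunk (assuming the `en_core_web_sm` package is installed):

```python
import spacy

# Load the pipeline with the parser disabled: it is still loaded in the
# background, so it can be switched back on later in the same pipeline.
nlp = spacy.load("en_core_web_sm", disable=["parser"])
nlp.enable_pipe("parser")

# To skip loading a component entirely, use exclude instead of disable.
nlp_small = spacy.load("en_core_web_sm", exclude=["ner"])
```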
