@ruanchaves
Add Hashformers to spaCy Universe

Description

This PR adds Hashformers to the spaCy Universe.

Hashformers is a word segmentation library that uses transformer language models and beam search to split text written without spaces, such as hashtags, into words. It fills the gap between heuristic-based splitters and prompt-based LLM segmentation, and works with any autoregressive model from the Hugging Face Model Hub.
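
For reviewers unfamiliar with the approach, here is a minimal sketch of beam search over space-insertion decisions. It is not the Hashformers implementation: the toy unigram table `TOY_LOGPROB`, the per-character penalty, and all helper names are assumptions made for illustration, standing in for scores that would normally come from a language model.

```python
# Toy unigram log-probabilities: a hypothetical stand-in for language model scores.
TOY_LOGPROB = {
    "we": -3.0, "need": -4.0, "a": -2.0, "national": -6.0,
    "nation": -6.0, "park": -5.0, "ice": -5.0, "cold": -5.0,
}
UNKNOWN_PER_CHAR = -10.0  # assumed penalty per character of an unknown word


def word_score(word, is_open):
    """Score one word; an open (possibly unfinished) word is scored optimistically."""
    if is_open:
        # Best score among vocabulary words the open word could still grow into.
        options = [lp for w, lp in TOY_LOGPROB.items() if w.startswith(word)]
        if options:
            return max(options)
    elif word in TOY_LOGPROB:
        return TOY_LOGPROB[word]
    return UNKNOWN_PER_CHAR * len(word)


def hypothesis_score(words, final):
    """Sum the scores of completed words plus the last word (open unless final)."""
    *done, last = words
    total = sum(word_score(w, is_open=False) for w in done)
    return total + word_score(last, is_open=not final)


def segment(text, beam_width=4):
    """Insert spaces into `text`, keeping the `beam_width` best hypotheses per step."""
    if not text:
        return ""
    beam = [[text[0]]]  # each hypothesis is a list of words
    for i, ch in enumerate(text[1:], start=1):
        final = i == len(text) - 1
        candidates = []
        for words in beam:
            candidates.append(words[:-1] + [words[-1] + ch])  # extend the last word
            candidates.append(words + [ch])                   # start a new word
        candidates.sort(key=lambda ws: hypothesis_score(ws, final), reverse=True)
        beam = candidates[:beam_width]  # prune to the beam width
    return " ".join(beam[0])


print(segment("weneedanationalpark"))  # we need a national park
print(segment("icecold"))              # ice cold
```

In this sketch the scorer is a lookup table, so the search is cheap. With a transformer as the scorer, each candidate costs a model evaluation, which is why pruning to a fixed beam width matters.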

Key Features

  • 🔤 Word Segmentation: Segments hashtags and concatenated text into individual words
  • 🤗 Hugging Face Compatible: Works with any autoregressive model (GPT-2, LLaMA, etc.)
  • 🌍 Multilingual: Supports any language with a compatible language model
  • 🔬 State-of-the-art: Recognized as SOTA for hashtag segmentation at LREC 2022
  • spaCy Integration: Available as a pipeline component via pip install hashformers[spacy]

Example Usage

from hashformers import TransformerWordSegmenter as WordSegmenter

ws = WordSegmenter(segmenter_model_name_or_path="distilgpt2")
result = ws.segment(["#weneedanationalpark"])
print(result)  # ['we need a national park']

spaCy Pipeline Component

import spacy
import hashformers.spacy  # registers the "hashformers" component

nlp = spacy.blank("en")
nlp.add_pipe("hashformers", config={"model": "distilgpt2"})

doc = nlp("#weneedanationalpark")
print(doc._.segmented)  # "we need a national park"

Checklist

  • Open-source license (MIT)
  • README with usage instructions
  • Available on PyPI (pip install hashformers)
  • GitHub repository
  • Working demo (Google Colab notebook)
  • spaCy pipeline component integration

Citations

The library has been cited in several academic papers including work on multilingual sentiment analysis, abusive language detection, and text processing.

@misc{rodrigues2021zeroshot,
      title={Zero-shot hashtag segmentation for multilingual sentiment analysis}, 
      author={Ruan Chaves Rodrigues and Marcelo Akira Inuzuka and others},
      year={2021},
      eprint={2112.03213},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
