|
1 | | -twitter-text in Rust |
2 | | -============ |
| 1 | +# twitter-text in Rust |
3 | 2 |
|
4 | | -This repo is a Rust implementation of twitter-text. All aspects of tweet text are parsed by a [Pest](https://github.com/pest-parser/pest) [PEG](https://en.wikipedia.org/wiki/Parsing_expression_grammar) grammar, with the exception of URL length and character weighting. See the [parser](rust/parser/src) directory for the grammar. Procedural validation for URL lengths and character weights is performed by the [Extractor](rust/twitter-text/src/extractor.rs) code. |
| 3 | +A Rust implementation of [twitter-text](https://github.com/twitter/twitter-text) that parses tweet text using a [Pest](https://github.com/pest-parser/pest) [PEG](https://en.wikipedia.org/wiki/Parsing_expression_grammar) grammar. Includes bindings for Ruby, Python, Java, C++, Swift, and WebAssembly. |
5 | 4 |
|
6 | | -To run the tests, [install Rust](https://www.rust-lang.org/tools/install), and then try this in the terminal: |
| 5 | +## Features |
| 6 | + |
| 7 | +- **Entity extraction**: URLs, @mentions, #hashtags, $cashtags, and emoji |
 | 8 | +- **Tweet validation**: Limit of 280 weighted characters, with configurable weights |
| 9 | +- **Autolinking**: Convert entities to HTML links |
| 10 | +- **Hit highlighting**: Highlight search terms in tweet text |
| 11 | +- **Unicode 17.0**: Full emoji support including ZWJ sequences and skin tone modifiers |
| 12 | + |
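The weighted character limit listed under Features can be illustrated with a minimal sketch. It is not this crate's API: the function and constant names are hypothetical, and the numbers are assumed from the published twitter-text v3 default configuration rather than read from `rust/config/`. It also skips steps a real implementation performs, such as Unicode normalization, fixed-length URL counting, and emoji handling.

```rust
// Hypothetical constants, assumed to mirror the twitter-text v3 defaults.
const SCALE: u32 = 100;
const DEFAULT_WEIGHT: u32 = 200;
const LIGHT_WEIGHT: u32 = 100;
const MAX_WEIGHTED_LENGTH: u32 = 280;

// Inclusive code point ranges that count as a single character (assumed values).
const LIGHT_RANGES: [(u32, u32); 4] = [(0, 4351), (8192, 8205), (8208, 8223), (8242, 8247)];

/// Weight of one code point: light inside the listed ranges, default elsewhere.
fn code_point_weight(c: char) -> u32 {
    let cp = c as u32;
    if LIGHT_RANGES.iter().any(|&(lo, hi)| (lo..=hi).contains(&cp)) {
        LIGHT_WEIGHT
    } else {
        DEFAULT_WEIGHT
    }
}

/// Sum of the per-code-point weights, divided by the scale.
fn weighted_length(text: &str) -> u32 {
    let total: u32 = text.chars().map(code_point_weight).sum();
    total / SCALE
}

/// A non-empty text is within the limit if its weighted length is at most 280.
fn is_within_limit(text: &str) -> bool {
    !text.is_empty() && weighted_length(text) <= MAX_WEIGHTED_LENGTH
}
```

Under these assumed weights, a Latin letter counts once and a CJK character counts twice, so `weighted_length("hello")` is 5 and `weighted_length("日本語")` is 6.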
| 13 | +## Quick Start |
| 14 | + |
| 15 | +### Using Cargo |
| 16 | + |
| 17 | +```bash |
| 18 | +cargo build |
| 19 | +cargo test |
7 | 20 | ``` |
8 | | -> cargo build |
9 | | -> cargo test |
| 21 | + |
| 22 | +### Using Bazel |
| 23 | + |
| 24 | +```bash |
| 25 | +# Build everything |
| 26 | +bazel build //rust/... |
| 27 | + |
| 28 | +# Run all tests |
| 29 | +bazel test //rust/... |
10 | 30 | ``` |
11 | 31 |
|
12 | | -### Ruby Bindings |
| 32 | +## Language Bindings |
13 | 33 |
|
14 | | -The Ruby bindings require **Ruby 3.4.1 or higher**. If you're on macOS with the system Ruby (2.6.x), you'll need to install a newer version: |
| 34 | +| Language | Directory | Requirements | Technology | |
| 35 | +|----------|-----------|--------------|------------| |
| 36 | +| Ruby | `rust/ruby-bindings/` | Ruby 3.3+ | [Magnus](https://github.com/matsadler/magnus) FFI | |
| 37 | +| Python | `rust/python-bindings/` | Python 3.12 | [PyO3](https://github.com/PyO3/pyo3) | |
| 38 | +| Java | `rust/java-bindings/` | JDK 23+ | Foreign Function & Memory API | |
| 39 | +| C++ | `rust/cpp-bindings/` | C++17 | [cxx.rs](https://github.com/dtolnay/cxx) | |
| 40 | +| Swift | `rust/swift-bindings/` | Swift 6.0+ | C FFI | |
 | 41 | +| WebAssembly | `rust/wasm-bindings/` | - | [wasm-bindgen](https://github.com/rustwasm/wasm-bindgen) |
| 42 | + |
| 43 | +### Building Bindings |
15 | 44 |
|
16 | | -**Option 1: Homebrew (simplest)** |
17 | 45 | ```bash |
18 | | -brew install ruby |
| 46 | +# Ruby |
| 47 | +bazel build //rust/ruby-bindings:twittertext |
| 48 | + |
| 49 | +# Python |
| 50 | +bazel build //rust/python-bindings:twitter_text |
| 51 | + |
| 52 | +# Java |
| 53 | +bazel build //rust/java-bindings:twitter_text_java_ffm |
| 54 | + |
| 55 | +# C++ |
| 56 | +bazel build //rust/cpp-bindings/... |
| 57 | + |
| 58 | +# Swift |
| 59 | +bazel build //rust/swift-bindings:TwitterText |
| 60 | + |
| 61 | +# WebAssembly |
| 62 | +bazel build //rust/wasm-bindings:twitter_text_wasm |
19 | 63 | ``` |
20 | 64 |
|
21 | | -**Option 2: Ruby version manager** |
22 | | -* [rbenv](https://github.com/rbenv/rbenv) |
23 | | -* [rvm](https://rvm.io/) |
24 | | -* [asdf](https://asdf-vm.com/) |
| 65 | +## Architecture |
| 66 | + |
| 67 | +### Core Components |
| 68 | + |
| 69 | +- **PEG Grammar Parser** (`rust/parser/`): Pest grammar for parsing tweet entities |
| 70 | +- **Main Library** (`rust/twitter-text/`): Extraction, validation, autolinking, and highlighting |
| 71 | +- **Configuration** (`rust/config/`): Character weights and URL length settings |
| 72 | +- **Conformance Tests** (`rust/conformance/`): Tests against canonical twitter-text test suites |
| 73 | + |
| 74 | +### Entity Parsing Order |
| 75 | + |
| 76 | +The grammar processes entities in this order to resolve ambiguities: |
| 77 | +1. URLs (including t.co short URLs) |
| 78 | +2. Hashtags |
| 79 | +3. Mentions |
| 80 | +4. Cashtags |
| 81 | + |
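The effect of this ordering can be sketched with a hand-written scanner that tries the rules in the same sequence at each position, much like a PEG ordered choice. This is an illustration only, not the Pest grammar in `rust/parser/`: the names `Kind`, `match_url`, `match_sigil`, and `extract` are hypothetical, and the matching rules are heavily simplified (for example, there are no word-boundary checks).

```rust
#[derive(Debug, PartialEq)]
enum Kind {
    Url,
    Hashtag,
    Mention,
    Cashtag,
}

/// Byte length of a URL at the start of `s`: scheme prefix up to the next whitespace.
fn match_url(s: &str) -> Option<usize> {
    if s.starts_with("https://") || s.starts_with("http://") {
        Some(s.find(char::is_whitespace).unwrap_or(s.len()))
    } else {
        None
    }
}

/// Byte length of `sigil` followed by one or more alphanumerics or underscores.
fn match_sigil(s: &str, sigil: char) -> Option<usize> {
    let rest = s.strip_prefix(sigil)?;
    let body = rest
        .find(|c: char| !(c.is_alphanumeric() || c == '_'))
        .unwrap_or(rest.len());
    if body == 0 {
        None
    } else {
        Some(sigil.len_utf8() + body)
    }
}

/// Scan left to right; at each position the first rule that matches wins.
fn extract(text: &str) -> Vec<(Kind, &str)> {
    let mut out = Vec::new();
    let mut i = 0;
    while i < text.len() {
        let rest = &text[i..];
        let hit = match_url(rest)
            .map(|n| (Kind::Url, n))
            .or_else(|| match_sigil(rest, '#').map(|n| (Kind::Hashtag, n)))
            .or_else(|| match_sigil(rest, '@').map(|n| (Kind::Mention, n)))
            .or_else(|| match_sigil(rest, '$').map(|n| (Kind::Cashtag, n)));
        match hit {
            Some((kind, n)) => {
                out.push((kind, &rest[..n]));
                i += n;
            }
            // No rule matched here: skip one character and try again.
            None => i += rest.chars().next().map_or(1, char::len_utf8),
        }
    }
    out
}
```

Because the URL rule runs first and consumes the whole URL, the `#top` fragment in `https://example.com/#top` is never offered to the hashtag rule, while a standalone `#rust` later in the same text is still reported as a hashtag.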
| 82 | +## Dependencies |
| 83 | + |
| 84 | +- **Rust**: 1.91.1+ |
| 85 | +- **Bazel**: 8.4.2+ (for full build) |
| 86 | +- **Ruby**: 3.3+ (requires libyaml: `brew install libyaml` on macOS) |
| 87 | +- **Python**: 3.12 |
| 88 | +- **Java**: JDK 23+ |
| 89 | +- **LLVM**: 17.0.6 (hermetic toolchain via Bazel) |
| 90 | + |
| 91 | +## Conformance |
| 92 | + |
| 93 | +This implementation passes the canonical twitter-text conformance tests in `conformance/*.yml`. These tests cover: |
| 94 | +- Autolink (URL/mention/hashtag linking) |
| 95 | +- Extract (entity extraction) |
| 96 | +- Validation (tweet validity) |
| 97 | +- Hit highlighting |
| 98 | + |
| 99 | +## License |
25 | 100 |
|
26 | | -The Ruby bindings use the [magnus](https://github.com/matsadler/magnus) crate which requires Ruby 3.2.3+ APIs. |
| 101 | +Apache 2.0 |