Merged
1 change: 1 addition & 0 deletions .gitignore
@@ -9,3 +9,4 @@ VERSION
 *.DS_Store
 .env*
 .serena/cache
+.specify/
1 change: 1 addition & 0 deletions .serena/.gitignore
@@ -0,0 +1 @@
+/cache
152 changes: 21 additions & 131 deletions LICENSE
@@ -1,131 +1,21 @@
-# PolyForm Noncommercial License 1.0.0
-
-<https://polyformproject.org/licenses/noncommercial/1.0.0>
-
-## Acceptance
-
-In order to get any license under these terms, you must agree
-to them as both strict obligations and conditions to all
-your licenses.
-
-## Copyright License
-
-The licensor grants you a copyright license for the
-software to do everything you might do with the software
-that would otherwise infringe the licensor's copyright
-in it for any permitted purpose. However, you may
-only distribute the software according to [Distribution
-License](#distribution-license) and make changes or new works
-based on the software according to [Changes and New Works
-License](#changes-and-new-works-license).
-
-## Distribution License
-
-The licensor grants you an additional copyright license
-to distribute copies of the software. Your license
-to distribute covers distributing the software with
-changes and new works permitted by [Changes and New Works
-License](#changes-and-new-works-license).
-
-## Notices
-
-You must ensure that anyone who gets a copy of any part of
-the software from you also gets a copy of these terms or the
-URL for them above, as well as copies of any plain-text lines
-beginning with `Required Notice:` that the licensor provided
-with the software. For example:
-
-> Required Notice: Copyright Yoyodyne, Inc. (http://example.com)
-
-## Changes and New Works License
-
-The licensor grants you an additional copyright license to
-make changes and new works based on the software for any
-permitted purpose.
-
-## Patent License
-
-The licensor grants you a patent license for the software that
-covers patent claims the licensor can license, or becomes able
-to license, that you would infringe by using the software.
-
-## Noncommercial Purposes
-
-Any noncommercial purpose is a permitted purpose.
-
-## Personal Uses
-
-Personal use for research, experiment, and testing for
-the benefit of public knowledge, personal study, private
-entertainment, hobby projects, amateur pursuits, or religious
-observance, without any anticipated commercial application,
-is use for a permitted purpose.
-
-## Noncommercial Organizations
-
-Use by any charitable organization, educational institution,
-public research organization, public safety or health
-organization, environmental protection organization,
-or government institution is use for a permitted purpose
-regardless of the source of funding or obligations resulting
-from the funding.
-
-## Fair Use
-
-You may have "fair use" rights for the software under the
-law. These terms do not limit them.
-
-## No Other Rights
-
-These terms do not allow you to sublicense or transfer any of
-your licenses to anyone else, or prevent the licensor from
-granting licenses to anyone else. These terms do not imply
-any other licenses.
-
-## Patent Defense
-
-If you make any written claim that the software infringes or
-contributes to infringement of any patent, your patent license
-for the software granted under these terms ends immediately. If
-your company makes such a claim, your patent license ends
-immediately for work on behalf of your company.
-
-## Violations
-
-The first time you are notified in writing that you have
-violated any of these terms, or done anything with the software
-not covered by your licenses, your licenses can nonetheless
-continue if you come into full compliance with these terms,
-and take practical steps to correct past violations, within
-32 days of receiving notice. Otherwise, all your licenses
-end immediately.
-
-## No Liability
-
-***As far as the law allows, the software comes as is, without
-any warranty or condition, and the licensor will not be liable
-to you for any damages arising out of these terms or the use
-or nature of the software, under any kind of legal claim.***
-
-## Definitions
-
-The **licensor** is the individual or entity offering these
-terms, and the **software** is the software the licensor makes
-available under these terms.
-
-**You** refers to the individual or entity agreeing to these
-terms.
-
-**Your company** is any legal entity, sole proprietorship,
-or other kind of organization that you work for, plus all
-organizations that have control over, are under the control of,
-or are under common control with that organization. **Control**
-means ownership of substantially all the assets of an entity,
-or the power to direct its management and policies by vote,
-contract, or otherwise. Control can be direct or indirect.
-
-**Your licenses** are all the licenses granted to you for the
-software under these terms.
-
-**Use** means anything you do with the software requiring one
-of your licenses.
+MIT License
+
+Copyright (c) 2025 CIB Mango Tree
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.
17 changes: 10 additions & 7 deletions README.md
@@ -1,25 +1,28 @@
-<h2 align="center">mango-tango-cli</h2>
-<h3 align="center">A Python command-line tool for detecting coordinated inauthentic behavior</h3>
+<h2 align="center">CIB Mango Tree</h2>
+<h3 align="center">An Interactive Command Line and Dashboard Tool for Detecting Coordinated Inauthentic Behavior Datasets of Online Activity</h3>

 <p align="center">
 <img src="https://raw.githubusercontent.com/CIB-Mango-Tree/CIB-Mango-Tree-Website/main/assets/images/mango-text.PNG" alt="Mango logo" style="width:200px;"/>
 </p>

 <p align="center">
-<a href="https://www.python.org/"><img alt="code" src="https://img.shields.io/badge/code-Python%203.12-blue?logo=Python"></a>
+<a href="https://www.python.org/"><img alt="code" src="https://img.shields.io/badge/Python-3.12-blue?logo=Python"></a>
+<a href="https://docs.astral.sh/ruff/"><img alt="style: black" src="https://img.shields.io/badge/Polars-1.9-skyblue?logo=Polars"></a>
+<a href="https://plotly.com/python/"><img alt="style: black" src="https://img.shields.io/badge/Plotly-5.24.1-purple?logo=Plotly"></a>
+<a href="https://github.com/Textualize/rich"><img alt="style: black" src="https://img.shields.io/badge/Rich-14.0.0-gold?logo=Rich"></a>
+<a href="https://civictechdc.github.io/mango-tango-cli/"><img alt="style: black" src="https://img.shields.io/badge/docs-website-blue"></a>
 <a href="https://black.readthedocs.io/en/stable/"><img alt="style: black" src="https://img.shields.io/badge/style-Black-black?logo=Black"></a>
-<a href="https://docs.astral.sh/ruff/"><img alt="style: black" src="https://img.shields.io/badge/tool-Polars-skyblue?logo=Polars"></a>
 </p>

 ---

 ## Technical Documentation

-For in-depth technical docs related to this repository please visit [https://civictechdc.github.io/mango-tango-cli](https://civictechdc.github.io/mango-tango-cli)
+For in-depth technical docs related to this repository please visit: [https://civictechdc.github.io/mango-tango-cli](https://civictechdc.github.io/mango-tango-cli)

 ## Requirements

-Python 3.12
+Python 3.12 (see [requirements.txt](https://github.com/civictechdc/mango-tango-cli/blob/main/requirements.txt))

 ## Setting up

@@ -41,7 +44,7 @@ python -m venv venv
 ## Starting the application

 ```shell
-python -m mangotango
+python -m cibmangotree
 ```

 ## Development Guide and Documentation
2 changes: 1 addition & 1 deletion app/shiny.py
@@ -62,7 +62,7 @@ def call_handlers(self, inputs: Inputs, outputs: Outputs, session: Session):


 class LayoutManager(BaseModel):
-    title: Optional[str] = "Mango Tango Dashboard"
+    title: Optional[str] = "CIB Mango Tree"
     elements: Optional[List[_navs.NavPanel]] = []

     class Config:
4 changes: 3 additions & 1 deletion services/tokenizer/basic/patterns.py
@@ -98,7 +98,9 @@
 )

 # Word patterns for different script types
-LATIN_WORD_PATTERN = r"[a-zA-Z]+(?:\'[a-zA-Z]+)*"  # Handle contractions
+
+LATIN_WORD_PATTERN = r"[a-zA-Z]+(?:\.[a-zA-Z]+)+\.?|[a-zA-Z]+(?:\'[a-zA-Z]+)*"  # Handle abbreviations and contractions
+
 WORD_PATTERN = f"(?:{LATIN_WORD_PATTERN}|{CJK_PATTERN}+|{ARABIC_PATTERN}+|{THAI_PATTERN}+|{SEA_PATTERN}+)"

 # Punctuation (preserve some, group others)
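To see what the widened `LATIN_WORD_PATTERN` matches on its own, the sketch below runs the regex from this hunk through `re.findall`. The `latin_words` helper is hypothetical and exists only for illustration; the real tokenizer combines this pattern with others (URLs, CJK, punctuation), so its final output may differ from what the bare regex returns.

```python
import re

# Copied from the hunk above: the abbreviation alternative comes first,
# the original contraction alternative second.
LATIN_WORD_PATTERN = r"[a-zA-Z]+(?:\.[a-zA-Z]+)+\.?|[a-zA-Z]+(?:\'[a-zA-Z]+)*"


def latin_words(text: str) -> list[str]:
    """Return every LATIN_WORD_PATTERN match in text (hypothetical helper)."""
    # The pattern has no capturing groups, so findall returns whole matches.
    return re.findall(LATIN_WORD_PATTERN, text)


# Abbreviation with trailing period, plus a contraction.
print(latin_words("U.S. citizens don't always agree"))
# → ['U.S.', 'citizens', "don't", 'always', 'agree']

# An ellipsis is not absorbed: each period must be followed by a letter.
print(latin_words("Wait for it..."))
# → ['Wait', 'for', 'it']

# Segments may be any length, so a missing space after a period is merged.
print(latin_words("Done.Next step"))
# → ['Done.Next', 'step']
```

The last call points at a possible side effect worth a test: because the abbreviation branch accepts segments of any length, text with a missing space after a sentence period is matched as one word by this pattern in isolation.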
78 changes: 77 additions & 1 deletion services/tokenizer/basic/test_basic_tokenizer.py
@@ -8,7 +8,7 @@

 import pytest

-from ..core.types import CaseHandling, LanguageFamily, TokenizerConfig, TokenType
+from ..core.types import CaseHandling, TokenizerConfig
 from .tokenizer import BasicTokenizer


@@ -1181,6 +1181,82 @@ def test_international_social_media(self):


 # Fixtures for reusable test data
+
+
+class TestAbbreviationsAndPunctuation:
+    """Test abbreviation handling and punctuation edge cases."""
+
+    def test_abbreviations_basic(self):
+        """Test basic abbreviation tokenization - abbreviations should stay intact."""
+        tokenizer = BasicTokenizer()
+        text = "The c.e.o.s met yesterday"
+        result = tokenizer.tokenize(text)
+
+        # Abbreviations should be preserved as single tokens
+        expected = ["the", "c.e.o.s", "met", "yesterday"]
+        assert result == expected, f"Expected {expected}, got {result}"
+
+    def test_abbreviations_with_trailing_period(self):
+        """Test abbreviation with trailing sentence period."""
+        tokenizer = BasicTokenizer()
+        text = "I live in U.S. now"
+        result = tokenizer.tokenize(text)
+
+        # Abbreviation should be preserved, period is part of the abbreviation
+        expected = ["i", "live", "in", "u.s.", "now"]
+        assert result == expected, f"Expected {expected}, got {result}"
+
+    def test_multiple_abbreviations(self):
+        """Test multiple abbreviations in the same sentence."""
+        tokenizer = BasicTokenizer()
+        text = "U.S. and U.K. relations"
+        result = tokenizer.tokenize(text)
+
+        # Both abbreviations should be preserved
+        expected = ["u.s.", "and", "u.k.", "relations"]
+        assert result == expected, f"Expected {expected}, got {result}"
+
+    def test_ellipses_without_punctuation(self):
+        """Test ellipses handling - ellipses should be filtered out by default."""
+        tokenizer = BasicTokenizer()
+        text = "Wait for it..."
+        result = tokenizer.tokenize(text)
+
+        # Ellipses should be removed with default config (include_punctuation=False)
+        expected = ["wait", "for", "it"]
+        assert result == expected, f"Expected {expected}, got {result}"
+
+    def test_chinese_tokenization_regression(self):
+        """Test that Chinese character tokenization still works correctly (regression check)."""
+        tokenizer = BasicTokenizer()
+        text = "你好世界"
+        result = tokenizer.tokenize(text)
+
+        # Chinese should still be tokenized character by character
+        expected = ["你", "好", "世", "界"]
+        assert result == expected, f"Expected {expected}, got {result}"
+
+    def test_contractions_regression(self):
+        """Test that contractions are still handled correctly (regression check)."""
+        tokenizer = BasicTokenizer()
+        text = "I don't think it's ready"
+        result = tokenizer.tokenize(text)
+
+        # Contractions should be preserved as single tokens
+        expected = ["i", "don't", "think", "it's", "ready"]
+        assert result == expected, f"Expected {expected}, got {result}"
+
+    def test_abbreviations_and_contractions_together(self):
+        """Test complex sentence with both abbreviations and contractions."""
+        tokenizer = BasicTokenizer()
+        text = "U.S. citizens don't always agree"
+        result = tokenizer.tokenize(text)
+
+        # Both abbreviations and contractions should be preserved
+        expected = ["u.s.", "citizens", "don't", "always", "agree"]
+        assert result == expected, f"Expected {expected}, got {result}"
+
+
 @pytest.fixture
 def basic_config():
     """Basic tokenizer configuration for tests."""
34 changes: 24 additions & 10 deletions services/tokenizer/basic/tokenizer.py
@@ -9,7 +9,7 @@
 from typing import Optional

 from ..core.base import AbstractTokenizer
-from ..core.types import LanguageFamily, TokenizerConfig, TokenList, TokenType
+from ..core.types import LanguageFamily, TokenizerConfig, TokenList
 from .patterns import get_patterns


@@ -219,15 +219,29 @@ def _is_url_like(self, token: str) -> bool:
         if self._is_email_like(token):
             return False

-        return (
-            token.startswith(("http://", "https://", "www."))
-            or "://" in token
-            or (
-                token.count(".") >= 1
-                and any(c.isalpha() for c in token)
-                and "@" not in token
-            )
-        )
+        # Explicit URL indicators (http://, https://, www., or protocol markers)
+        if token.startswith(("http://", "https://", "www.")) or "://" in token:
+            return True
+
+        # Domain-like patterns (e.g., "example.com")
+        # But NOT abbreviations (e.g., "U.S.", "c.e.o.s")
+        # Heuristic: URLs have at least one period NOT followed by a single uppercase/lowercase letter
+        # This allows "example.com" but excludes "U.S." and "c.e.o.s"
+        if (
+            token.count(".") >= 1
+            and any(c.isalpha() for c in token)
+            and "@" not in token
+        ):
+            # Check if this looks like an abbreviation (single letters between periods)
+            # Pattern: letter(s).letter(s).letter(s) where segments are 1-3 chars
+            abbreviation_pattern = r"^[a-z]{1,3}(?:\.[a-z]{1,3})+\.?$"
+
+            if re.match(abbreviation_pattern, token, re.IGNORECASE):
+                return False  # This is an abbreviation, not a URL
+            # If it has a period and looks like a domain, it's URL-like
+            return True
+
+        return False

     def _is_email_like(self, token: str) -> bool:
         """Check if token looks like an email address."""
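The reworked `_is_url_like` can be exercised outside the class with a small standalone approximation. This is only a sketch: the free function `is_url_like` and the module-level `ABBREVIATION_PATTERN` are hypothetical names, and the `_is_email_like` pre-check from the method is omitted because its body is not shown in this diff.

```python
import re

# Same regex as in the hunk above, precompiled; segments of 1-3 letters
# separated by periods, with an optional trailing period.
ABBREVIATION_PATTERN = re.compile(r"^[a-z]{1,3}(?:\.[a-z]{1,3})+\.?$", re.IGNORECASE)


def is_url_like(token: str) -> bool:
    """Approximate the new _is_url_like logic (email pre-check omitted)."""
    # Explicit URL indicators always win.
    if token.startswith(("http://", "https://", "www.")) or "://" in token:
        return True

    # Domain-like tokens: contain a period and a letter, and no "@".
    if "." in token and any(c.isalpha() for c in token) and "@" not in token:
        # Tokens shaped like abbreviations are excluded from the URL class.
        return ABBREVIATION_PATTERN.match(token) is None

    return False


for token in ["example.com", "U.S.", "c.e.o.s", "https://t.co/abc", "t.co", "bit.ly"]:
    print(f"{token!r}: {is_url_like(token)}")
```

In this sketch `example.com` is URL-like and `U.S.` and `c.e.o.s` are not, matching the intent stated in the comments. Short bare domains such as `t.co` and `bit.ly` also fit the 1-3 letter segment shape, so they fall on the abbreviation side unless they carry a scheme or `www.` prefix; whether that matters depends on how the tokenizer's other URL patterns handle them.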