Simil CUDA Text (SCT)

A command-line tool for text similarity analysis.

Description

This tool calculates the pairwise similarity between the text files in a given directory. It reads every .txt file in the specified directory, preprocesses the text, and computes the cosine similarity between each pair of files. It then outputs the file pairs whose similarity score is above a given threshold.

Functionality

File Scanning: Scans the input directory for .txt files.
Text Preprocessing:
- Converts all text to lowercase.
- Removes punctuation.
- Tokenizes the text into words.
Vectorization: Creates a term frequency (TF) vector for each document.
Similarity Calculation: Calculates the cosine similarity between all pairs of TF vectors.
Output: Prints the file pairs with a similarity score above the specified threshold.
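
The preprocessing, vectorization, and similarity steps above can be sketched as follows. This is a minimal illustration, not the project's actual source: the names tokenize, term_frequency, cosine_similarity, and TermFrequency are hypothetical, and the sketch assumes that punctuation is deleted outright (rather than replaced by a space) and that an empty document has similarity 0 with everything.

```cpp
#include <cctype>
#include <cmath>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical type: maps each word to the number of times it occurs.
using TermFrequency = std::map<std::string, double>;

// Lowercase the text, drop punctuation, and split on whitespace.
// Assumption: punctuation is deleted, so "don't" becomes "dont".
std::vector<std::string> tokenize(const std::string& text) {
    std::string cleaned;
    cleaned.reserve(text.size());
    for (unsigned char c : text) {
        if (std::ispunct(c)) continue;
        cleaned.push_back(static_cast<char>(std::tolower(c)));
    }
    std::istringstream stream(cleaned);
    std::vector<std::string> tokens;
    std::string word;
    while (stream >> word) tokens.push_back(word);
    return tokens;
}

// Count raw occurrences of each token (an unnormalized TF vector).
TermFrequency term_frequency(const std::vector<std::string>& tokens) {
    TermFrequency tf;
    for (const std::string& token : tokens) tf[token] += 1.0;
    return tf;
}

// Cosine similarity: dot(a, b) / (|a| * |b|).
// Assumption: returns 0 when either vector is empty.
double cosine_similarity(const TermFrequency& a, const TermFrequency& b) {
    double dot = 0.0, norm_a = 0.0, norm_b = 0.0;
    for (const auto& entry : a) {
        norm_a += entry.second * entry.second;
        auto match = b.find(entry.first);
        if (match != b.end()) dot += entry.second * match->second;
    }
    for (const auto& entry : b) norm_b += entry.second * entry.second;
    if (norm_a == 0.0 || norm_b == 0.0) return 0.0;
    return dot / (std::sqrt(norm_a) * std::sqrt(norm_b));
}
```

With these definitions, "the cat sat" and "The cat ran." share two of three words each, so their cosine similarity is 2/3. Because cosine similarity normalizes by vector length, raw counts and length-normalized frequencies give the same score.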

Command-Line Arguments

-i <input_dir>: The input directory containing the text files. Defaults to the current directory (.).
-t <threshold>: The similarity threshold, a value between 0 and 1. Defaults to 0.85.
-o <output_file>: The output file to write the results to. Defaults to similarity_results.csv.
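
One way the flags and defaults listed above could be handled is sketched below. The Options struct and parse_args function are hypothetical names used only for illustration; the tool's real parsing code may differ, and the error handling shown (throwing on unknown flags, missing values, or an out-of-range threshold) is an assumption.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical container for the three documented options and their defaults.
struct Options {
    std::string input_dir = ".";
    double threshold = 0.85;
    std::string output_file = "similarity_results.csv";
};

// Parse "-flag value" pairs; args excludes the program name.
// Assumption: invalid input is reported by throwing std::invalid_argument.
Options parse_args(const std::vector<std::string>& args) {
    Options options;
    for (std::size_t i = 0; i < args.size(); i += 2) {
        const std::string& flag = args[i];
        if (i + 1 >= args.size()) {
            throw std::invalid_argument("missing value for " + flag);
        }
        const std::string& value = args[i + 1];
        if (flag == "-i") {
            options.input_dir = value;
        } else if (flag == "-t") {
            // std::stod itself throws std::invalid_argument on non-numeric input.
            options.threshold = std::stod(value);
            if (options.threshold < 0.0 || options.threshold > 1.0) {
                throw std::invalid_argument("threshold must be between 0 and 1");
            }
        } else if (flag == "-o") {
            options.output_file = value;
        } else {
            throw std::invalid_argument("unknown option " + flag);
        }
    }
    return options;
}
```

In a main function this could be called as parse_args(std::vector<std::string>(argv + 1, argv + argc)), so that any flag left out keeps its documented default.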

Usage

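To run the application, invoke the sct binary with an input directory and a threshold. The path below is an example; adjust it to wherever the binary lives on your system: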

/User/Github/Projects/SCT/sct -i <input_dir> -t <threshold>

For example:

/User/Github/Projects/SCT/sct -i data -t 0.5

This will calculate the similarity between all .txt files in the data directory and output the file pairs with a similarity score above 0.5.
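
The selection step in this example can be sketched as a loop over every unordered pair of files. The Match struct and pairs_above function are hypothetical, and the sketch assumes the pairwise scores have already been computed into a symmetric matrix and that "above" means strictly greater than the threshold.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical record for one reported pair of files.
struct Match {
    std::string first;
    std::string second;
    double score;
};

// Return every pair (i, j) with i < j whose score exceeds the threshold.
// Assumption: scores[i][j] holds the similarity of names[i] and names[j].
std::vector<Match> pairs_above(const std::vector<std::string>& names,
                               const std::vector<std::vector<double>>& scores,
                               double threshold) {
    std::vector<Match> matches;
    for (std::size_t i = 0; i < names.size(); ++i) {
        // Start at i + 1 so each pair is visited once and no file is
        // compared with itself.
        for (std::size_t j = i + 1; j < names.size(); ++j) {
            if (scores[i][j] > threshold) {
                matches.push_back(Match{names[i], names[j], scores[i][j]});
            }
        }
    }
    return matches;
}
```

For n files this visits n(n-1)/2 pairs, so the work grows quadratically with the number of files in the directory.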

Dependencies

C++11 or higher
CMake

Building the Project

Create a build directory: mkdir build
Navigate to the build directory: cd build
Run CMake: cmake ..
Build the project: make

Notes

This version of the tool uses a CPU-based implementation for similarity calculation.
