A command-line tool for text similarity analysis.
This tool calculates the similarity between text files in a given directory. It reads all .txt files in the specified directory, preprocesses the text, and calculates the cosine similarity between each pair of files. The tool then outputs the file pairs that have a similarity score above a given threshold.
- File Scanning: Scans the input directory for .txt files.
- Text Preprocessing:
  - Converts all text to lowercase.
  - Removes punctuation.
  - Tokenizes the text into words.
- Vectorization: Creates a term frequency (TF) vector for each document.
- Similarity Calculation: Calculates the cosine similarity between all pairs of TF vectors.
- Output: Prints the file pairs with a similarity score above the specified threshold.
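
The preprocessing, vectorization, and similarity steps listed above can be sketched in C++11 as follows. This is a minimal illustration, not the tool's actual source: the names `tokenize`, `termFrequency`, `cosineSimilarity`, and `TermFreq` are hypothetical, and the sketch assumes ASCII text in which any character that is neither alphanumeric nor whitespace counts as punctuation.

```cpp
#include <cctype>
#include <cmath>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical names throughout; the tool's real internals are not shown here.
using TermFreq = std::unordered_map<std::string, double>;

// Lowercase the text, drop punctuation, and split on whitespace.
// Assumption: ASCII input; non-alphanumeric, non-space bytes are discarded.
std::vector<std::string> tokenize(const std::string& text) {
    std::vector<std::string> tokens;
    std::string current;
    for (unsigned char ch : text) {
        if (std::isalnum(ch)) {
            current += static_cast<char>(std::tolower(ch));
        } else if (std::isspace(ch) && !current.empty()) {
            tokens.push_back(current);
            current.clear();
        }
    }
    if (!current.empty()) tokens.push_back(current);
    return tokens;
}

// Build a raw term-frequency vector: word -> number of occurrences.
TermFreq termFrequency(const std::vector<std::string>& tokens) {
    TermFreq tf;
    for (const auto& token : tokens) tf[token] += 1.0;
    return tf;
}

// Cosine similarity: dot(a, b) / (|a| * |b|). Returns 0 for an empty vector.
double cosineSimilarity(const TermFreq& a, const TermFreq& b) {
    double dot = 0.0, normA = 0.0, normB = 0.0;
    for (const auto& kv : a) {
        normA += kv.second * kv.second;
        auto it = b.find(kv.first);
        if (it != b.end()) dot += kv.second * it->second;
    }
    for (const auto& kv : b) normB += kv.second * kv.second;
    if (normA == 0.0 || normB == 0.0) return 0.0;
    return dot / (std::sqrt(normA) * std::sqrt(normB));
}
```

With these helpers, two documents would be compared by calling `cosineSimilarity(termFrequency(tokenize(a)), termFrequency(tokenize(b)))` and keeping the pair when the result exceeds the threshold. Because term counts are never negative, the score always falls between 0 and 1.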
- -i <input_dir>: The input directory containing the text files. Defaults to the current directory (.).
- -t <threshold>: The similarity threshold between 0 and 1. Defaults to 0.85.
- -o <output_file>: The output file to write the results to. Defaults to similarity_results.csv.
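
One way the options above might be parsed is sketched below. The `Options` struct and `parseArgs` function are hypothetical names used only for illustration; the defaults are taken from the option descriptions, while rejecting unknown flags and out-of-range thresholds is an assumption about reasonable behavior rather than a description of what the tool does.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical option holder; defaults mirror the documented ones.
struct Options {
    std::string inputDir = ".";
    double threshold = 0.85;
    std::string outputFile = "similarity_results.csv";
};

// Parse "-i <dir> -t <threshold> -o <file>" from the arguments that follow
// the program name. Throws on unknown flags, missing values, or a threshold
// outside [0, 1] (assumed behavior, not confirmed by the tool).
Options parseArgs(const std::vector<std::string>& args) {
    Options opts;
    for (std::size_t i = 0; i < args.size(); ++i) {
        const std::string& flag = args[i];
        if (flag != "-i" && flag != "-t" && flag != "-o") {
            throw std::invalid_argument("unknown option: " + flag);
        }
        if (i + 1 >= args.size()) {
            throw std::invalid_argument("missing value for " + flag);
        }
        const std::string& value = args[++i];
        if (flag == "-i") {
            opts.inputDir = value;
        } else if (flag == "-o") {
            opts.outputFile = value;
        } else {
            opts.threshold = std::stod(value);
            if (opts.threshold < 0.0 || opts.threshold > 1.0) {
                throw std::out_of_range("threshold must be between 0 and 1");
            }
        }
    }
    return opts;
}
```

In a `main(int argc, char** argv)` function, the argument vector could be built with `std::vector<std::string>(argv + 1, argv + argc)` before being passed to `parseArgs`.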
To run the application, use the following command:
    /User/Github/Projects/SCT/sct -i <input_dir> -t <threshold>

For example:

    /User/Github/Projects/SCT/sct -i data -t 0.5

This will calculate the similarity between all .txt files in the data directory and output the file pairs with a similarity score above 0.5.
- C++11 or higher
- CMake
- Create a build directory:
    mkdir build
- Navigate to the build directory:
    cd build
- Run CMake:
    cmake ..
- Build the project:
    make
This version of the tool uses a CPU-based implementation for similarity calculation.