Commit 5054470

Add readme
1 parent 32fd062 commit 5054470

1 file changed: +6 -11 lines changed

MinHash_LSH/README.md

Lines changed: 6 additions & 11 deletions
@@ -1,21 +1,16 @@
1-
## This is an implementation of the TF-IDF algorithm with cosine similarity in Spark 2.1.1 with Python 2.7
2-
An implementation of the TF-IDF algorithm with cosine similarity on the Spark platform, used as the similarity measure for K-Means. The implementation of k-means is provided by Spark in examples/src/main/python/ml/kmeans_example.py.
1+
## This is an implementation of the MinHash and Locality-Sensitive Hashing (LSH) algorithms in Spark 2.1.1 with Python 2.7
2+
An implementation of MinHash and LSH for finding similar sets/users from their item/movie preference data. MinHash and LSH run on the Spark platform to speed up the calculation of similarity between sets/users, which is measured by Jaccard similarity (the Jaccard coefficient). LSH: the implementation of Locality-Sensitive Hashing in Spark uses MinHash functions to compute a signature for each set/user and splits the results of these MinHash functions into bands. Each band contains the results of R MinHash functions.
33

4-
## Algorithm: TF-IDF algorithm with cosine similarity
4+
## Algorithm: MinHash and Locality-Sensitive Hashing (LSH)
55

66
## Task:
7-
The task is to implement the TF-IDF algorithm with cosine similarity in Apache Spark using Python.
8-
Given a set of vectors, each representing a document, as input, calculate TF-IDF with cosine similarity to cluster the documents by similarity.
7+
Given a set of vectors, each representing a document, as input, cluster the documents via the MinHash and Locality-Sensitive Hashing (LSH) algorithms.
8+
9+
#### Usage: bin/spark-submit input_file.txt output_file.txt
910

10-
#### Usage: bin/spark-submit kmeans <file> <k> <convergeDist> [outputfile.txt]
11-
k - the number of clusters
12-
convergeDist - the convergence distance/similarity at which the program stops iterating.
13-
14-
example: bin\spark-submit .\kmeans.py .\docword.enron_s.txt 10 0.00001 kmeans_output.txt
1511

1612
#### Input: Takes an input file from a folder
1713

1814
1915
#### Output: Saves all results into one text file.
2016

21-
kmeans_output.txt
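
The MinHash signature and LSH banding scheme described in the new README can be sketched in plain Python, without Spark. This is a minimal illustration under stated assumptions, not the code from this repository: the function names (`jaccard`, `make_hash_params`, `minhash_signature`, `lsh_candidate_pairs`), the hash family `(a*x + b) mod PRIME`, the sample users, and the choice of 5 bands of 4 rows each are all hypothetical and chosen only for the example.

```python
import random
from collections import defaultdict
from itertools import combinations

# Assumption: item ids are non-negative integers smaller than this prime (2^31 - 1).
PRIME = 2147483647


def jaccard(a, b):
    """Exact Jaccard similarity |A & B| / |A | B| of two sets."""
    if not a and not b:
        return 1.0
    return float(len(a & b)) / len(a | b)


def make_hash_params(num_hashes, seed=0):
    """Draw (a, b) coefficients for hash functions h(x) = (a*x + b) mod PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME))
            for _ in range(num_hashes)]


def minhash_signature(items, params):
    """One signature entry per hash function: the minimum hash over the set.

    The set must be non-empty, otherwise min() raises ValueError.
    """
    return [min((a * x + b) % PRIME for x in items) for a, b in params]


def lsh_candidate_pairs(signatures, bands, rows):
    """Split each signature into `bands` bands of `rows` values each.

    Two sets become a candidate pair if they agree on every value
    of at least one band.
    """
    candidates = set()
    for band in range(bands):
        buckets = defaultdict(list)
        for name, sig in signatures.items():
            key = tuple(sig[band * rows:(band + 1) * rows])
            buckets[key].append(name)
        for names in buckets.values():
            for pair in combinations(sorted(names), 2):
                candidates.add(pair)
    return candidates


# Hypothetical sample data: user -> set of movie ids.
users = {
    "u1": {1, 2, 3, 4, 5},
    "u2": {1, 2, 3, 4, 5},
    "u3": {100, 200, 300},
}

params = make_hash_params(20)  # 20 hash functions = 5 bands x 4 rows
signatures = {u: minhash_signature(items, params) for u, items in users.items()}
pairs = lsh_candidate_pairs(signatures, bands=5, rows=4)

# Confirm each candidate pair with the exact Jaccard similarity.
for left, right in sorted(pairs):
    print(left, right, jaccard(users[left], users[right]))
```

Identical sets always produce identical signatures, so they land in the same bucket in every band. For other pairs, more rows per band makes a band match less likely, while more bands give a pair more chances to match, so the two parameters together set the similarity level at which pairs tend to become candidates.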
