Add readme

Cheng-Lin-Li · Cheng-Lin-Li · commit 5054470e62ce · 2017-10-15T18:07:53.000-07:00
diff --git a/MinHash_LSH/README.md b/MinHash_LSH/README.md
@@ -1,21 +1,16 @@
-## This is an implementation of TF-IDF algorithm with cosin similarity algorithm in Spark 2.1.1 with Python 2.7
-A similarity algorithm implementation of TF-IDF algorithm with cosin similarity implementation on spark platform as the measure of K-Means. The implementation of k-means is provided by Spark in examples/src/main/python/ml/kmeans_example.py.
+## An implementation of the MinHash and Locality-Sensitive Hashing (LSH) algorithms in Spark 2.1.1 with Python 2.7
+An implementation of MinHash and LSH that finds similar sets/users from their item/movie preference data. Similarity is measured by Jaccard similarity (the Jaccard coefficient); MinHash and LSH run on the Spark platform to speed up the calculation. LSH: MinHash functions produce a signature for each set/user, and the signature is split into bands, each holding the results of R MinHash functions.
 
-## Algorithm: TF-IDF algorithm with cosin similarity
+## Algorithm: MinHash and Locality-Sensitive Hashing (LSH)
 
 ## Task:
-The task is to implement TF-IDF algorithm with cosin similarity in Apache Spark using Python. 
-Given a set of vectors to present a document as input, calculating the TF-IDF with cosin similarity to cluster those documents via similarity.
+Given a set of vectors, each representing a document, cluster those documents via the MinHash and Locality-Sensitive Hashing (LSH) algorithms.
+
+#### Usage: bin/spark-submit input_file.txt output_file.txt
 
-#### Usage: bin/spark-submit kmeans <file> <k> <convergeDist> [outputfile.txt]
-	k - the number of clusters
-	convergDist - The converge distance/similarity to stop program iterations.
-	
-	example: 	bin\spark-submit .\kmeans.py .\docword.enron_s.txt 10 0.00001 kmeans_output.txt
 
 #### Input: Takes input file from folder as the input
 
 		
 #### Output: Save all results into one text file. 
 
-kmeans_output.txt
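
Review note: the banding scheme described in the new README text (a MinHash signature per set/user, split into bands of R values each) can be illustrated with a minimal plain-Python sketch. This is not the code from this commit and it does not use Spark. The function names, the `(a*x + b) mod PRIME` hash family, the seed, and the assumption that items are non-empty sets of integer ids are all hypothetical choices made for illustration.

```python
import random
from collections import defaultdict
from itertools import combinations

# Assumption: a large prime modulus (2^31 - 1) for the illustrative hash family.
PRIME = 2147483647


def make_hash_funcs(n, seed=42):
    """Return n (a, b) coefficient pairs for hashes h(x) = (a*x + b) % PRIME."""
    rng = random.Random(seed)
    return [(rng.randint(1, PRIME - 1), rng.randint(0, PRIME - 1))
            for _ in range(n)]


def minhash_signature(items, hash_funcs):
    """Signature = minimum hash value over the (non-empty) set, per function."""
    return [min((a * x + b) % PRIME for x in items) for a, b in hash_funcs]


def lsh_candidate_pairs(signatures, bands, rows):
    """Pairs whose signatures agree on all `rows` values in at least one band."""
    candidates = set()
    for band in range(bands):
        buckets = defaultdict(list)
        for key, sig in signatures.items():
            # The slice of the signature for this band is the bucket key.
            buckets[tuple(sig[band * rows:(band + 1) * rows])].append(key)
        for members in buckets.values():
            for pair in combinations(sorted(members), 2):
                candidates.add(pair)
    return candidates


def jaccard(a, b):
    """Exact Jaccard similarity of two sets (at least one must be non-empty)."""
    return len(a & b) / float(len(a | b))
```

A typical flow would build `bands * rows` hash functions, compute one signature per user, collect candidate pairs with `lsh_candidate_pairs`, and then confirm each candidate with the exact `jaccard` value. Sets with identical contents always share every band, so they are always reported as candidates; less similar sets are reported with lower probability.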