This repository was archived by the owner on Feb 8, 2019. It is now read-only.
fix GEARPUMP-110, try streaming kmeans on Gearpump #5
Open
gy910210 wants to merge 1 commit into apache:master from gy910210:kmeans
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,41 @@ | ||
| Streaming k-means clustering | ||
| ============================== | ||
| ## Introduction | ||
| This application follows the streaming k-means clustering implementation in Spark; for details, see | ||
| <https://databricks.com/blog/2015/01/28/introducing-streaming-k-means-in-spark-1-2.html>. | ||
|
|
||
| The data source is `RandomRBFGenerator`, which is based on the one in Huawei's `StreamDM` <https://github.com/huawei-noah/streamDM>. | ||
|
|
||
| ## Gearpump topology | ||
| The Gearpump topology is as follows: | ||
|
|
||
|  | ||
|
|
||
| The `Source Processor` produces points over time and broadcasts each point to the `Distribution Processor`. | ||
| The `Distribution Processor` has k tasks, where each task stores one center and the points assigned to it. | ||
| When a `Distribution Processor` task receives a point from the `Source Processor`, it calculates the distance from the point to its center and sends the distance, the point, and its `taskId` to the `Collection Processor`. | ||
| When the `Collection Processor` receives a distance from the `Distribution Processor`, it counts the points seen so far, determines whether it is time to update the centers, chooses the smallest distance, and then sends the point along with the `taskId` of the corresponding `Distribution Processor` task through the broadcast partitioner. | ||
| When the `Distribution Processor` receives the result message, the task with the matching `taskId` accumulates the point. If the message indicates that it is time to update the centers, every task updates its own center. | ||
|
|
||
| This procedure is streaming, so the cluster centers change over time. | ||
|
|
||
| ## How to use it | ||
| You can run this application with the following command: | ||
|
|
||
| ``` | ||
| bin/gear app -jar examples/streamingkmeans-2.11-0.7.7-SNAPSHOT-assembly.jar io.gearpump.streaming.examples.streamingkmeans.StreamingKmeansExample | ||
| ``` | ||
|
|
||
| Optionally, you can configure the clustering task with the following options: | ||
|
|
||
| ``` | ||
| -k <how many clusters (k in kmeans)> | ||
| -dimension <dimension of a point> | ||
| -maxBatch <number of data points per batch for DataSourceProcessor> | ||
| -maxNumber <number of data points per clustering update> | ||
| -decayFactor <decay factor for clustering, used when updating centers> | ||
| ``` | ||
|
|
||
| ## Evaluation | ||
| The `Distribution Processor` has k tasks, and each task stores one cluster center. | ||
| Each task outputs its cluster center once it has been updated. |
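
The assignment step described in the README can be sketched in plain Scala, outside Gearpump. This is a minimal illustration and not the code from this pull request: `AssignmentSketch`, `distance`, and `nearestCenter` are hypothetical names, and the k distribution tasks are modeled simply as indices into a collection of centers.

```scala
object AssignmentSketch {
  type Point = List[Double]

  /** Euclidean distance between a point and a center of the same dimension. */
  def distance(point: Point, center: Point): Double =
    math.sqrt(point.zip(center).map { case (p, c) => (p - c) * (p - c) }.sum)

  /**
   * Every "task" (an index into `centers`) reports its distance to the point;
   * the collector keeps the task with the smallest one.
   * Assumes `centers` is non-empty.
   */
  def nearestCenter(point: Point, centers: IndexedSeq[Point]): (Int, Double) =
    centers.indices
      .map(taskId => (taskId, distance(point, centers(taskId))))
      .minBy(_._2)
}
```

In the topology itself this work is split across processors: each distribution task computes only its own distance, and the collection task performs the minimum selection.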
6 changes: 6 additions & 0 deletions
examples/streaming/streamingkmeans/src/main/resources/geardefault.conf
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,6 @@ | ||
| gearpump { | ||
| serializers { | ||
| "io.gearpump.streaming.examples.streamingkmeans.InputMessage" = "" | ||
| "io.gearpump.streaming.examples.streamingkmeans.ResultMessage" = "" | ||
| } | ||
| } |
66 changes: 66 additions & 0 deletions
...ans/src/main/scala/io/gearpump/streaming/examples/streamingkmeans/ClusterCollection.scala
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,66 @@ | ||
| /* | ||
| * Licensed to the Apache Software Foundation (ASF) under one | ||
| * or more contributor license agreements. See the NOTICE file | ||
| * distributed with this work for additional information | ||
| * regarding copyright ownership. The ASF licenses this file | ||
| * to you under the Apache License, Version 2.0 (the | ||
| * "License"); you may not use this file except in compliance | ||
| * with the License. You may obtain a copy of the License at | ||
| * | ||
| * http://www.apache.org/licenses/LICENSE-2.0 | ||
| * | ||
| * Unless required by applicable law or agreed to in writing, software | ||
| * distributed under the License is distributed on an "AS IS" BASIS, | ||
| * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
| * See the License for the specific language governing permissions and | ||
| * limitations under the License. | ||
| */ | ||
|
|
||
| package io.gearpump.streaming.examples.streamingkmeans | ||
|
|
||
| import io.gearpump.Message | ||
| import io.gearpump.cluster.UserConfig | ||
| import io.gearpump.streaming.task.{StartTime, Task, TaskContext} | ||
|
|
||
| class ClusterCollection(taskContext: TaskContext, conf: UserConfig) extends Task(taskContext, conf) { | ||
| import taskContext.output | ||
|
|
||
| private val k = conf.getInt("k").get | ||
| private val maxNumber = conf.getInt("maxNumber").get | ||
|
|
||
| private[streamingkmeans] var minTaskId = 0 | ||
| private[streamingkmeans] var minDistance = Double.MaxValue | ||
| private[streamingkmeans] var minDistPoint : List[Double] = null | ||
|
|
||
| private[streamingkmeans] var currentNumber = 0 | ||
| private[streamingkmeans] var totalNumber = 0 | ||
|
|
||
| override def onStart(startTime: StartTime): Unit = super.onStart(startTime) | ||
|
|
||
| override def onNext(msg: Message): Unit = { | ||
| if (null == msg) { | ||
| return | ||
| } | ||
|
|
||
| val (taskId, distance, point) = msg.msg.asInstanceOf[(Int, Double, List[Double])] | ||
| if (distance < minDistance) { | ||
| minDistance = distance | ||
| minDistPoint = point | ||
| minTaskId = taskId | ||
| } | ||
|
|
||
| currentNumber += 1 | ||
| if (k == currentNumber) { | ||
| currentNumber = 0 | ||
| totalNumber += 1 | ||
| if (maxNumber == totalNumber) { | ||
| totalNumber = 0 | ||
| output(new Message(new ResultMessage(minTaskId, minDistPoint, true))) | ||
| } else { | ||
| output(new Message(new ResultMessage(minTaskId, minDistPoint, false))) | ||
| } | ||
| } | ||
| } | ||
|
|
||
| override def onStop(): Unit = super.onStop() | ||
| } |
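
The collection logic can be sketched as a small state machine in plain Scala. Note that the diff above updates `minDistance` only when a smaller distance arrives and does not show it being reset between points; the sketch below assumes a per-point reset is the intended behavior. `CollectorSketch`, `RoundCollector`, `RoundResult`, and `report` are hypothetical names, and nothing here uses the Gearpump API.

```scala
object CollectorSketch {
  type Point = List[Double]

  /** Outcome of one completed round: winning task, the point, and whether to update centers. */
  final case class RoundResult(taskId: Int, point: Point, doCluster: Boolean)

  /** Hypothetical stand-in for the collector's state; not a Gearpump Task. */
  final class RoundCollector(k: Int, maxNumber: Int) {
    private var minTaskId = 0
    private var minDistance = Double.MaxValue
    private var minPoint: Point = Nil
    private var received = 0 // reports seen for the current point
    private var rounds = 0   // points completed since the last center update

    /** Feed one (taskId, distance, point) report; yields a result once all k tasks have reported. */
    def report(taskId: Int, distance: Double, point: Point): Option[RoundResult] = {
      if (distance < minDistance) {
        minDistance = distance
        minPoint = point
        minTaskId = taskId
      }
      received += 1
      if (received < k) None
      else {
        rounds += 1
        val doCluster = rounds == maxNumber
        if (doCluster) rounds = 0
        val result = RoundResult(minTaskId, minPoint, doCluster)
        // Assumed step: clear the per-point state so the next point starts fresh.
        received = 0
        minDistance = Double.MaxValue
        Some(result)
      }
    }
  }
}
```

With `k = 2` and `maxNumber = 2`, the second point is assigned on its own distances rather than on the minimum left over from the first point, and the second result carries `doCluster = true`.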
143 changes: 143 additions & 0 deletions
...s/src/main/scala/io/gearpump/streaming/examples/streamingkmeans/ClusterDistribution.scala
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,143 @@ | ||
| /* | ||
| * Licensed to the Apache Software Foundation (ASF) under one | ||
| * or more contributor license agreements. See the NOTICE file | ||
| * distributed with this work for additional information | ||
| * regarding copyright ownership. The ASF licenses this file | ||
| * to you under the Apache License, Version 2.0 (the | ||
| * "License"); you may not use this file except in compliance | ||
| * with the License. You may obtain a copy of the License at | ||
| * | ||
| * http://www.apache.org/licenses/LICENSE-2.0 | ||
| * | ||
| * Unless required by applicable law or agreed to in writing, software | ||
| * distributed under the License is distributed on an "AS IS" BASIS, | ||
| * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
| * See the License for the specific language governing permissions and | ||
| * limitations under the License. | ||
| */ | ||
|
|
||
| package io.gearpump.streaming.examples.streamingkmeans | ||
|
|
||
| import java.util.concurrent.LinkedBlockingQueue | ||
|
|
||
| import io.gearpump.Message | ||
| import io.gearpump.cluster.UserConfig | ||
| import io.gearpump.streaming.task.{StartTime, Task, TaskContext} | ||
|
|
||
| import scala.collection.mutable | ||
| import scala.util.Random | ||
|
|
||
| class ClusterDistribution(taskContext: TaskContext, conf: UserConfig) extends Task(taskContext, conf) { | ||
| import taskContext.output | ||
|
|
||
| private[streamingkmeans] val dataQueue: LinkedBlockingQueue[List[Double]] = new LinkedBlockingQueue[List[Double]]() | ||
| private[streamingkmeans] var isBegin: Boolean = true | ||
|
|
||
| private val decayFactor = conf.getDouble("decayFactor").get | ||
| private val dimension = conf.getInt("dimension").get | ||
|
|
||
| private[streamingkmeans] val center: Array[Double] = new Array[Double](dimension) | ||
| private[streamingkmeans] val points: mutable.MutableList[List[Double]] = new mutable.MutableList() | ||
| private[streamingkmeans] var previousNumber = 0 | ||
| private[streamingkmeans] var currentNumber = 0 | ||
|
|
||
|
|
||
| /** | ||
| * init center randomly | ||
| */ | ||
| private[streamingkmeans] def initCenter(): Unit = { | ||
| val random = new Random() | ||
| for (i <- center.indices) { | ||
| center.update(i, random.nextGaussian()) | ||
| } | ||
| } | ||
|
|
||
| /** | ||
| * The update algorithm uses the "mini-batch" KMeans rule, | ||
| * generalized to incorporate forgetfulness (i.e. decay). | ||
| * The update rule (for each cluster) is: | ||
| * | ||
| * {{{ | ||
| * c_t+1 = [(c_t * n_t * a) + (x_t * m_t)] / [n_t + m_t] | ||
| * n_t+1 = n_t * a + m_t | ||
| * }}} | ||
| * | ||
| * Where c_t is the previously estimated centroid for that cluster, | ||
| * n_t is the number of points assigned to it thus far, x_t is the centroid | ||
| * estimated on the current batch, and m_t is the number of points assigned | ||
| * to that centroid in the current batch. | ||
| * | ||
| * The decay factor 'a' scales the contribution of the clusters as estimated thus far, | ||
| * by applying a as a discount weighting on the current point when evaluating | ||
| * new incoming data. If a=1, all batches are weighted equally. If a=0, new centroids | ||
| * are determined entirely by recent data. Lower values correspond to | ||
| * more forgetting. | ||
| */ | ||
| private[streamingkmeans] def updateCenter(): Unit = { | ||
| if (0 == currentNumber) { | ||
| return | ||
| } | ||
|
|
||
| val newCenter: Array[Double] = new Array[Double](dimension) | ||
| for (i <- newCenter.indices) { | ||
| var sum = 0.0 | ||
| for (point <- points) { | ||
| sum += point(i) | ||
| } | ||
| sum /= currentNumber | ||
| newCenter.update(i, sum) | ||
| } | ||
|
|
||
| for (i <- center.indices) { | ||
| center.update(i, | ||
| (center(i) * previousNumber * decayFactor + newCenter(i) * currentNumber) | ||
| / (previousNumber + currentNumber)) | ||
| } | ||
| } | ||
|
|
||
| private[streamingkmeans] def getDistance(point: List[Double]): Double = { | ||
| var distance = 0.0 | ||
| for (i <- 0 until dimension) { | ||
| distance += ((point(i) - center(i)) * (point(i) - center(i))) | ||
| } | ||
| Math.sqrt(distance) | ||
| } | ||
|
|
||
| override def onStart(startTime: StartTime): Unit = { | ||
| initCenter() | ||
| } | ||
|
|
||
| override def onNext(msg: Message): Unit = { | ||
| if (null == msg) { | ||
| return | ||
| } | ||
|
|
||
| val message = msg.msg.asInstanceOf[ClusterMessage] | ||
|
|
||
| message match { | ||
| case InputMessage(point) => | ||
| if (isBegin) { | ||
| isBegin = false | ||
| output(new Message((taskContext.taskId.index, getDistance(point), point))) | ||
| } else { | ||
| dataQueue.put(point) | ||
| } | ||
| case ResultMessage(taskId, point, doCluster) => | ||
| if (taskContext.taskId.index == taskId) { | ||
| points += point | ||
| currentNumber += 1 | ||
| } | ||
| if (doCluster) { | ||
| updateCenter() | ||
| LOG.info(s"task ${taskContext.taskId.index}, center ${center.mkString(",")}") | ||
| points.clear() | ||
| previousNumber += currentNumber | ||
| currentNumber = 0 | ||
| } | ||
| val newPoint = dataQueue.take() | ||
|
Member
This is a blocking call. We advise against blocking in a Task. You may use "poll" instead.
||
| output(new Message((taskContext.taskId.index, getDistance(newPoint), newPoint))) | ||
| } | ||
| } | ||
|
|
||
| override def onStop(): Unit = super.onStop() | ||
| } | ||
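
The decayed update documented on `updateCenter` can be tried out as a pure function. This is a sketch under a stated assumption, not the code in the diff: it divides by the discounted count `n_t * a + m_t` rather than by `n_t + m_t`, so that the result stays a weighted average and `a = 0` yields exactly the batch mean, as the comment describes. `DecayUpdateSketch` and `update` are hypothetical names.

```scala
object DecayUpdateSketch {
  /**
   * center    : previous centroid c_t
   * n         : number of points assigned so far, n_t
   * batchMean : centroid of the current batch, x_t
   * m         : number of points in the current batch, m_t
   * a         : decay factor
   * Returns the new centroid and the new count n_t * a + m_t.
   */
  def update(center: Vector[Double], n: Double,
             batchMean: Vector[Double], m: Double,
             a: Double): (Vector[Double], Double) = {
    val discounted = n * a
    val total = discounted + m
    if (m == 0.0 || total == 0.0) (center, n) // empty batch: leave everything unchanged
    else {
      val next = center.zip(batchMean).map { case (c, x) =>
        (c * discounted + x * m) / total
      }
      (next, total)
    }
  }
}
```

For example, starting from center (0, 0) with 10 earlier points and a batch of 10 points whose mean is (2, 4), `a = 1` weights both halves equally and gives (1, 2), while `a = 0` forgets the history and gives (2, 4).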
23 changes: 23 additions & 0 deletions
...kmeans/src/main/scala/io/gearpump/streaming/examples/streamingkmeans/ClusterMessage.scala
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,23 @@ | ||
| /* | ||
| * Licensed to the Apache Software Foundation (ASF) under one | ||
| * or more contributor license agreements. See the NOTICE file | ||
| * distributed with this work for additional information | ||
| * regarding copyright ownership. The ASF licenses this file | ||
| * to you under the Apache License, Version 2.0 (the | ||
| * "License"); you may not use this file except in compliance | ||
| * with the License. You may obtain a copy of the License at | ||
| * | ||
| * http://www.apache.org/licenses/LICENSE-2.0 | ||
| * | ||
| * Unless required by applicable law or agreed to in writing, software | ||
| * distributed under the License is distributed on an "AS IS" BASIS, | ||
| * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
| * See the License for the specific language governing permissions and | ||
| * limitations under the License. | ||
| */ | ||
|
|
||
| package io.gearpump.streaming.examples.streamingkmeans | ||
|
|
||
| trait ClusterMessage extends Serializable | ||
| case class InputMessage(point: List[Double]) extends ClusterMessage | ||
| case class ResultMessage(taskId: Int, point: List[Double], doCluster: Boolean) extends ClusterMessage |
Using the non-blocking "offer" is better.
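
The two review suggestions, "poll" and "offer", can be illustrated with a small wrapper around `java.util.concurrent.LinkedBlockingQueue`. This is only a sketch of the non-blocking calls, not a proposed patch: `NonBlockingQueueSketch`, `PointBuffer`, `enqueue`, and `next` are hypothetical names.

```scala
import java.util.concurrent.LinkedBlockingQueue

object NonBlockingQueueSketch {
  type Point = List[Double]

  /** Hypothetical buffer that uses only offer and poll, neither of which waits. */
  final class PointBuffer {
    private val queue = new LinkedBlockingQueue[Point]()

    /** offer returns false instead of waiting when the queue cannot accept the point. */
    def enqueue(point: Point): Boolean = queue.offer(point)

    /** poll returns null immediately when the queue is empty; wrap it in an Option. */
    def next(): Option[Point] = Option(queue.poll())
  }
}
```

A task built this way would need to handle the `None` case itself, for example by remembering that it is idle so that the next incoming point is processed directly instead of being queued.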