Conversation


@watchMaker-sylar watchMaker-sylar commented Nov 18, 2025

What changes were proposed in this pull request?

Added cluster_coefficient to UDGA (#369).

How was this PR tested?

  • Tests have been added for the changes
  • Production environment verified

Contributor

kitalkuyo-gita commented Nov 19, 2025

@watchMaker-sylar I have updated my understanding of the algorithm in the comments.

Suggested Improvements or Alternatives:

  • Modify the counting logic: In iteration 3, for each neighbor u, the intersection size |N(v) ∩ N(u)| should be calculated and added to an accumulator; for undirected graphs, care should be taken to avoid counting the same edge/triangle twice and to divide by an appropriate factor at the end.
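
The corrected counting logic can be checked against a minimal single-machine sketch (illustrative only, not the PR's UDGA code; `localClusteringCoefficient` and `sampleAdj` are made-up names). Summing |N(v) ∩ N(u)| over all neighbors u counts every triangle through v twice (once at each non-v corner), and dividing by deg(v)·(deg(v)−1) absorbs exactly that factor, since C(v) = 2·T(v) / (deg(v)·(deg(v)−1)):

```java
import java.util.*;

public class ClusteringSketch {
    // Sum |N(v) ∩ N(u)| over all neighbors u of v; each triangle through v
    // is counted twice, and dividing by d*(d-1) absorbs the factor of 2,
    // yielding C(v) = 2*T(v) / (d*(d-1)).
    static double localClusteringCoefficient(long v, Map<Long, Set<Long>> adj) {
        Set<Long> nv = adj.getOrDefault(v, Collections.emptySet());
        int d = nv.size();
        if (d < 2) return 0.0;
        long pairSum = 0;
        for (long u : nv) {
            for (long w : adj.getOrDefault(u, Collections.emptySet())) {
                if (w != v && nv.contains(w)) pairSum++; // contribution to |N(v) ∩ N(u)|
            }
        }
        return (double) pairSum / ((long) d * (d - 1));
    }

    // Tiny test graph: triangle 1-2-3 plus a pendant edge 1-4.
    static Map<Long, Set<Long>> sampleAdj() {
        Map<Long, Set<Long>> adj = new HashMap<>();
        adj.put(1L, new HashSet<>(Arrays.asList(2L, 3L, 4L)));
        adj.put(2L, new HashSet<>(Arrays.asList(1L, 3L)));
        adj.put(3L, new HashSet<>(Arrays.asList(1L, 2L)));
        adj.put(4L, new HashSet<>(Arrays.asList(1L)));
        return adj;
    }

    public static void main(String[] args) {
        System.out.println(localClusteringCoefficient(1L, sampleAdj())); // 1/3: one closed pair of three
        System.out.println(localClusteringCoefficient(2L, sampleAdj())); // 1.0: its two neighbors are linked
    }
}
```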

Reduce communication volume:

  • Use sampling or thresholding for higher-order nodes; Use more compact data structures (such as bloom filters, bit vectors, or hash signatures) to transmit neighbor sets to reduce message volume (sacrificing accuracy or requiring subsequent verification);

  • Use a triangle-counting aggregation algorithm: for example, based on edge directionality (only allowing edges where deg(u) < deg(v) to participate in the exchange), the message volume can be reduced to O(∑ min(deg(u), deg(v))).

  • Merge iterations (if the framework supports it): if the runtime supports reading and writing vertex values within the same round and using them immediately, some steps can be merged to save one iteration; however, the existing code clearly relies on the semantics of updateVertexValue becoming visible in the next round, so merging requires framework modifications.

  • Fix type filtering conditions, clearly define the meaning of vertexType (whether it's "only calculate vertices with this label" or "only include neighbors with this label in the calculation"), and specify this in code comments/documentation.
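
One way to realize the "compact data structures" suggestion above is a fixed-width bit-vector signature of the neighbor set. This is a sketch under assumed parameters: the 1024-bit width and the mixing constant are arbitrary tuning choices, and intersection sizes become estimates that can over-count when ids collide on a bit:

```java
import java.util.*;

public class NeighborSignature {
    // Signature width in bits; an arbitrary tuning choice (128 bytes per vertex).
    static final int BITS = 1024;

    // Hash every neighbor id into a fixed-width bit vector. The mixing
    // constant is the 64-bit golden-ratio multiplier; any good mixer works.
    static BitSet signature(Collection<Long> neighbors) {
        BitSet sig = new BitSet(BITS);
        for (long id : neighbors) {
            long h = id * 0x9E3779B97F4A7C15L;
            sig.set((int) Long.remainderUnsigned(h, BITS));
        }
        return sig;
    }

    // Estimate |A ∩ B| by ANDing the signatures. Collisions can inflate the
    // estimate, so this trades accuracy (or later verification) for size.
    static int estimatedIntersection(BitSet a, BitSet b) {
        BitSet and = (BitSet) a.clone();
        and.and(b);
        return and.cardinality();
    }

    public static void main(String[] args) {
        BitSet a = signature(Arrays.asList(1L, 2L, 3L));
        BitSet b = signature(Arrays.asList(2L, 3L, 4L, 5L));
        // Roughly |{2, 3}| = 2, barring hash collisions.
        System.out.println(estimatedIntersection(a, b));
    }
}
```

Each vertex then ships a fixed 128-byte signature instead of its full neighbor list; as the bullet notes, the over-counting may require exact re-verification afterwards.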

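The degree-orientation trick from the triangle-counting bullet can be sketched as follows (a single-machine version; `countTriangles` and `graph` are illustrative names). Each edge is oriented from its lower-(degree, id) endpoint to the higher one, so every triangle is counted exactly once at its lowest-ranked corner; in a distributed run only the lower-ranked endpoint of each edge would ship its out-neighbor list, which bounds the message volume by ∑ min(deg(u), deg(v)):

```java
import java.util.*;

public class OrientedTriangles {
    // Rank vertices by (degree, id); each edge is oriented from the
    // lower-ranked endpoint to the higher-ranked one.
    static boolean lessRank(long a, long b, Map<Long, Integer> deg) {
        int da = deg.get(a), db = deg.get(b);
        return da < db || (da == db && a < b);
    }

    // Count each triangle exactly once, at its lowest-ranked corner.
    static long countTriangles(Map<Long, Set<Long>> adj) {
        Map<Long, Integer> deg = new HashMap<>();
        adj.forEach((v, ns) -> deg.put(v, ns.size()));
        // Out-neighbors under the degree orientation.
        Map<Long, Set<Long>> out = new HashMap<>();
        adj.forEach((v, ns) -> {
            Set<Long> o = new HashSet<>();
            for (long u : ns) if (lessRank(v, u, deg)) o.add(u);
            out.put(v, o);
        });
        long count = 0;
        for (long v : adj.keySet()) {
            for (long u : out.get(v)) {
                for (long w : out.get(v)) {
                    // For each ranked pair of out-neighbors, check the closing edge.
                    if (lessRank(u, w, deg) && out.get(u).contains(w)) count++;
                }
            }
        }
        return count;
    }

    // Build an undirected adjacency map from an edge list.
    static Map<Long, Set<Long>> graph(long[][] edges) {
        Map<Long, Set<Long>> adj = new HashMap<>();
        for (long[] e : edges) {
            adj.computeIfAbsent(e[0], k -> new HashSet<>()).add(e[1]);
            adj.computeIfAbsent(e[1], k -> new HashSet<>()).add(e[0]);
        }
        return adj;
    }

    public static void main(String[] args) {
        // Triangle 1-2-3 with a pendant edge 1-4: one triangle.
        System.out.println(countTriangles(graph(new long[][]{{1,2},{1,3},{2,3},{1,4}}))); // 1
        // Complete graph K4: four triangles.
        System.out.println(countTriangles(graph(new long[][]{{1,2},{1,3},{1,4},{2,3},{2,4},{3,4}}))); // 4
    }
}
```
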
In my implementation, only three iterations are needed because the process of sending/receiving messages has already sent out the neighbor list in the first iteration and received and processed these lists in the second iteration. Since the vertex stores its neighbor list in the vertex value, there is no need for an additional round to "prepare" or "distribute" this data. Therefore, it can be completed in just three steps: sending (1) → receiving and counting (2) → calculating the output (3).
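
The three supersteps described above can be replayed as a toy single-machine simulation (a sketch assuming a Pregel-style message model, not the actual UDGA API; `run` and `sampleAdj` are illustrative names):

```java
import java.util.*;

public class ThreeRoundSim {
    // Superstep 1: every vertex "sends" its neighbor list to each neighbor.
    // Superstep 2: each vertex sums, over the received lists, how many
    //              entries also occur in its own neighbor list.
    // Superstep 3: divide the accumulated count by d*(d-1) to output C(v).
    static Map<Long, Double> run(Map<Long, Set<Long>> adj) {
        Map<Long, List<Set<Long>>> inbox = new HashMap<>();
        adj.forEach((v, ns) -> {
            for (long u : ns) inbox.computeIfAbsent(u, k -> new ArrayList<>()).add(ns);
        });
        Map<Long, Long> pairSum = new HashMap<>();
        adj.forEach((v, ns) -> {
            long c = 0;
            for (Set<Long> msg : inbox.getOrDefault(v, Collections.emptyList())) {
                for (long w : msg) if (w != v && ns.contains(w)) c++;
            }
            pairSum.put(v, c);
        });
        Map<Long, Double> cc = new HashMap<>();
        adj.forEach((v, ns) -> {
            int d = ns.size();
            cc.put(v, d < 2 ? 0.0 : (double) pairSum.get(v) / ((long) d * (d - 1)));
        });
        return cc;
    }

    // Test graph: triangle 1-2-3 plus a pendant edge 1-4.
    static Map<Long, Set<Long>> sampleAdj() {
        Map<Long, Set<Long>> adj = new HashMap<>();
        adj.put(1L, new HashSet<>(Arrays.asList(2L, 3L, 4L)));
        adj.put(2L, new HashSet<>(Arrays.asList(1L, 3L)));
        adj.put(3L, new HashSet<>(Arrays.asList(1L, 2L)));
        adj.put(4L, new HashSet<>(Arrays.asList(1L)));
        return adj;
    }

    public static void main(String[] args) {
        // e.g. {1=0.3333333333333333, 2=1.0, 3=1.0, 4=0.0} (map order may vary)
        System.out.println(run(sampleAdj()));
    }
}
```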

@watchMaker-sylar
Author

watchMaker-sylar commented Nov 19, 2025

@kitalkuyo-gita Thanks for your reply

  1. Modify the counting logic:
    Yes, you are right. I have fixed it.
  2. Use sampling or thresholding for higher-order nodes:
    Is there a way to get the number of nodes in a graph? Perhaps I could choose the message-sending strategy based on the node count.
  3. Fix type filtering conditions:
    In my implementation, vertexType means "only calculate vertices (including neighbors) with this label", so it needs four iterations, since a vertex does not store its neighbors' types.

@kitalkuyo-gita
Contributor

Regarding the second point, you can try the following usage:

// 1. Get the original degree (the number of neighbors)
int originDegree = neighborIds.size();

// 2. Decide the list of send targets (sample only when the degree exceeds the threshold)
List<Long> targetNeighbors;
if (originDegree > samplingThreshold && sampleSize > 0) {
    targetNeighbors = sampleNeighbors(neighborIds, sampleSize);
} else {
    targetNeighbors = neighborIds;
}

// 3. Build the message for the current vertex.
// Note: the message carries the original degree (originDegree) together
// with the (possibly sampled) target list.
ObjectRow message = ObjectRow.create(originDegree, targetNeighbors);

// 4. Send the message to each target neighbor
targetNeighbors.forEach(neighborId -> {
    this.context.sendMessage(neighborId, message);
});

Sampling method (memory-saving; adjust as needed):

private List<Long> sampleNeighbors(List<Long> neighbors, int k) {
    if (neighbors == null) return new ArrayList<>();
    int n = neighbors.size();
    if (n <= k) return new ArrayList<>(neighbors);

    // Strategy threshold: if only a tiny fraction is needed (e.g. under 5%),
    // pick random indices instead of copying the entire (possibly huge) list
    if (k < n * 0.05) {
        return pickRandomIndices(neighbors, n, k);
    }

    // Otherwise use a partial Fisher-Yates shuffle
    return partialShuffle(neighbors, n, k);
}

// Helper: sample via random indices (memory-saving)
private List<Long> pickRandomIndices(List<Long> neighbors, int n, int k) {
    List<Long> result = new ArrayList<>(k);
    // A Set guarantees the chosen indices are distinct
    java.util.Set<Integer> selectedIndices = new java.util.HashSet<>(k);
    ThreadLocalRandom rnd = ThreadLocalRandom.current();

    while (selectedIndices.size() < k) {
        int idx = rnd.nextInt(n);
        if (selectedIndices.add(idx)) { // add returns true only for a new index
            result.add(neighbors.get(idx));
        }
    }
    return result;
}

// Helper: partial Fisher-Yates shuffle
private List<Long> partialShuffle(List<Long> neighbors, int n, int k) {
    List<Long> copy = new ArrayList<>(neighbors);
    ThreadLocalRandom rnd = ThreadLocalRandom.current();
    for (int i = 0; i < k; i++) {
        Collections.swap(copy, i, i + rnd.nextInt(n - i));
    }
    return new ArrayList<>(copy.subList(0, k));
}

@watchMaker-sylar
Author

I understand what you mean. But I see this issue has already been resolved, so do I need to raise the PR again? Or how should I handle this PR?

@kitalkuyo-gita
Contributor

> I understand what you mean. But I see this issue has already been resolved, so do I need to raise the PR again? Or how should I handle this PR?

In my implementation, there is no sampling function to control the number of communications between nodes. If you wish, you may be able to add a sampling function based on my version.

@watchMaker-sylar
Author

> I understand what you mean. But I see this issue has already been resolved, so do I need to raise the PR again? Or how should I handle this PR?
>
> In my implementation, there is no sampling function to control the number of communications between nodes. If you wish, you may be able to add a sampling function based on my version.

Your code structure is excellent, which allows me to easily add sampling functionality directly on top of your version. Can you help me check if it meets the expectations?

@kitalkuyo-gita
Contributor

> I understand what you mean. But I see this issue has already been resolved, so do I need to raise the PR again? Or how should I handle this PR?
>
> In my implementation, there is no sampling function to control the number of communications between nodes. If you wish, you may be able to add a sampling function based on my version.
>
> Your code structure is excellent, which allows me to easily add sampling functionality directly on top of your version. Can you help me check if it meets the expectations?

I just saw your message. I've been working overtime these past few days. I'll try to review your code when I have some free time this week.

@watchMaker-sylar
Author

No problem at all! I really appreciate you taking the time to check in, especially when you've been so busy with overtime.

Please don't feel any pressure or rush. Focus on your work and your own well-being first. Whenever you get a spare moment, I'll be happy to hear your thoughts.

@kitalkuyo-gita
Contributor

> No problem at all! I really appreciate you taking the time to check in, especially when you've been so busy with overtime.
>
> Please don't feel any pressure or rush. Focus on your work and your own well-being first. Whenever you get a spare moment, I'll be happy to hear your thoughts.

Please fix CI first.

Contributor

@kitalkuyo-gita kitalkuyo-gita left a comment


LGTM
