Conversation


@watchMaker-sylar watchMaker-sylar commented Nov 18, 2025

What changes were proposed in this pull request?

Added cluster_coefficient to UDGA (#369).

How was this PR tested?

  • Tests have been added for the changes
  • Production environment verified

Contributor

kitalkuyo-gita commented Nov 19, 2025

@watchMaker-sylar I have updated my understanding of the algorithm in the comments.

Suggested Improvements or Alternatives:

  • Modify the counting logic: In iteration 3, for each neighbor u, the intersection size |N(v) ∩ N(u)| should be calculated and added to an accumulator; for undirected graphs, care should be taken to avoid counting the same edge/triangle twice and to divide by an appropriate factor at the end.
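
The corrected counting logic can be checked against a minimal single-machine sketch (illustrative only, not the PR's UDGA code; `localClusteringCoefficient` and `sampleAdj` are made-up names). Summing |N(v) ∩ N(u)| over all neighbors u counts every triangle through v twice (once at each non-v corner), and dividing by deg(v)·(deg(v)−1) absorbs exactly that factor, since C(v) = 2·T(v) / (deg(v)·(deg(v)−1)):

```java
import java.util.*;

public class ClusteringSketch {
    // Sum |N(v) ∩ N(u)| over all neighbors u of v; each triangle through v
    // is counted twice, and dividing by d*(d-1) absorbs the factor of 2,
    // yielding C(v) = 2*T(v) / (d*(d-1)).
    static double localClusteringCoefficient(long v, Map<Long, Set<Long>> adj) {
        Set<Long> nv = adj.getOrDefault(v, Collections.emptySet());
        int d = nv.size();
        if (d < 2) return 0.0;
        long pairSum = 0;
        for (long u : nv) {
            for (long w : adj.getOrDefault(u, Collections.emptySet())) {
                if (w != v && nv.contains(w)) pairSum++; // contribution to |N(v) ∩ N(u)|
            }
        }
        return (double) pairSum / ((long) d * (d - 1));
    }

    // Tiny test graph: triangle 1-2-3 plus a pendant edge 1-4.
    static Map<Long, Set<Long>> sampleAdj() {
        Map<Long, Set<Long>> adj = new HashMap<>();
        adj.put(1L, new HashSet<>(Arrays.asList(2L, 3L, 4L)));
        adj.put(2L, new HashSet<>(Arrays.asList(1L, 3L)));
        adj.put(3L, new HashSet<>(Arrays.asList(1L, 2L)));
        adj.put(4L, new HashSet<>(Arrays.asList(1L)));
        return adj;
    }

    public static void main(String[] args) {
        System.out.println(localClusteringCoefficient(1L, sampleAdj())); // 1/3: one closed pair of three
        System.out.println(localClusteringCoefficient(2L, sampleAdj())); // 1.0: its two neighbors are linked
    }
}
```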

Reduce communication volume:

  • Use sampling or thresholding for higher-order nodes; Use more compact data structures (such as bloom filters, bit vectors, or hash signatures) to transmit neighbor sets to reduce message volume (sacrificing accuracy or requiring subsequent verification);

  • Use a triangle-counting aggregation algorithm: for example, based on edge directionality (only allowing edges where deg(u) < deg(v) to participate in the exchange), the message volume can be reduced to O(∑ min(deg(u), deg(v))).

  • Merge iterations (if the framework supports it): if the runtime supports reading and writing vertex values within the same round and using them immediately, some steps can be merged to save one iteration; however, the existing code clearly relies on the semantics of updateVertexValue becoming visible in the next round, so merging requires framework modifications.

  • Fix type filtering conditions, clearly define the meaning of vertexType (whether it's "only calculate vertices with this label" or "only include neighbors with this label in the calculation"), and specify this in code comments/documentation.
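
One way to realize the "compact data structures" suggestion above is a fixed-width bit-vector signature of the neighbor set. This is a sketch under assumed parameters: the 1024-bit width and the mixing constant are arbitrary tuning choices, and intersection sizes become estimates that can over-count when ids collide on a bit:

```java
import java.util.*;

public class NeighborSignature {
    // Signature width in bits; an arbitrary tuning choice (128 bytes per vertex).
    static final int BITS = 1024;

    // Hash every neighbor id into a fixed-width bit vector. The mixing
    // constant is the 64-bit golden-ratio multiplier; any good mixer works.
    static BitSet signature(Collection<Long> neighbors) {
        BitSet sig = new BitSet(BITS);
        for (long id : neighbors) {
            long h = id * 0x9E3779B97F4A7C15L;
            sig.set((int) Long.remainderUnsigned(h, BITS));
        }
        return sig;
    }

    // Estimate |A ∩ B| by ANDing the signatures. Collisions can inflate the
    // estimate, so this trades accuracy (or later verification) for size.
    static int estimatedIntersection(BitSet a, BitSet b) {
        BitSet and = (BitSet) a.clone();
        and.and(b);
        return and.cardinality();
    }

    public static void main(String[] args) {
        BitSet a = signature(Arrays.asList(1L, 2L, 3L));
        BitSet b = signature(Arrays.asList(2L, 3L, 4L, 5L));
        // Roughly |{2, 3}| = 2, barring hash collisions.
        System.out.println(estimatedIntersection(a, b));
    }
}
```

Each vertex then ships a fixed 128-byte signature instead of its full neighbor list; as the bullet notes, the over-counting may require exact re-verification afterwards.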

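The degree-orientation trick from the triangle-counting bullet can be sketched as follows (a single-machine version; `countTriangles` and `graph` are illustrative names). Each edge is oriented from its lower-(degree, id) endpoint to the higher one, so every triangle is counted exactly once at its lowest-ranked corner; in a distributed run only the lower-ranked endpoint of each edge would ship its out-neighbor list, which bounds the message volume by ∑ min(deg(u), deg(v)):

```java
import java.util.*;

public class OrientedTriangles {
    // Rank vertices by (degree, id); each edge is oriented from the
    // lower-ranked endpoint to the higher-ranked one.
    static boolean lessRank(long a, long b, Map<Long, Integer> deg) {
        int da = deg.get(a), db = deg.get(b);
        return da < db || (da == db && a < b);
    }

    // Count each triangle exactly once, at its lowest-ranked corner.
    static long countTriangles(Map<Long, Set<Long>> adj) {
        Map<Long, Integer> deg = new HashMap<>();
        adj.forEach((v, ns) -> deg.put(v, ns.size()));
        // Out-neighbors under the degree orientation.
        Map<Long, Set<Long>> out = new HashMap<>();
        adj.forEach((v, ns) -> {
            Set<Long> o = new HashSet<>();
            for (long u : ns) if (lessRank(v, u, deg)) o.add(u);
            out.put(v, o);
        });
        long count = 0;
        for (long v : adj.keySet()) {
            for (long u : out.get(v)) {
                for (long w : out.get(v)) {
                    // For each ranked pair of out-neighbors, check the closing edge.
                    if (lessRank(u, w, deg) && out.get(u).contains(w)) count++;
                }
            }
        }
        return count;
    }

    // Build an undirected adjacency map from an edge list.
    static Map<Long, Set<Long>> graph(long[][] edges) {
        Map<Long, Set<Long>> adj = new HashMap<>();
        for (long[] e : edges) {
            adj.computeIfAbsent(e[0], k -> new HashSet<>()).add(e[1]);
            adj.computeIfAbsent(e[1], k -> new HashSet<>()).add(e[0]);
        }
        return adj;
    }

    public static void main(String[] args) {
        // Triangle 1-2-3 with a pendant edge 1-4: one triangle.
        System.out.println(countTriangles(graph(new long[][]{{1,2},{1,3},{2,3},{1,4}}))); // 1
        // Complete graph K4: four triangles.
        System.out.println(countTriangles(graph(new long[][]{{1,2},{1,3},{1,4},{2,3},{2,4},{3,4}}))); // 4
    }
}
```
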
In my implementation, only three iterations are needed because the process of sending/receiving messages has already sent out the neighbor list in the first iteration and received and processed these lists in the second iteration. Since the vertex stores its neighbor list in the vertex value, there is no need for an additional round to "prepare" or "distribute" this data. Therefore, it can be completed in just three steps: sending (1) → receiving and counting (2) → calculating the output (3).
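
The three supersteps described above can be replayed as a toy single-machine simulation (a sketch assuming a Pregel-style message model, not the actual UDGA API; `run` and `sampleAdj` are illustrative names):

```java
import java.util.*;

public class ThreeRoundSim {
    // Superstep 1: every vertex "sends" its neighbor list to each neighbor.
    // Superstep 2: each vertex sums, over the received lists, how many
    //              entries also occur in its own neighbor list.
    // Superstep 3: divide the accumulated count by d*(d-1) to output C(v).
    static Map<Long, Double> run(Map<Long, Set<Long>> adj) {
        Map<Long, List<Set<Long>>> inbox = new HashMap<>();
        adj.forEach((v, ns) -> {
            for (long u : ns) inbox.computeIfAbsent(u, k -> new ArrayList<>()).add(ns);
        });
        Map<Long, Long> pairSum = new HashMap<>();
        adj.forEach((v, ns) -> {
            long c = 0;
            for (Set<Long> msg : inbox.getOrDefault(v, Collections.emptyList())) {
                for (long w : msg) if (w != v && ns.contains(w)) c++;
            }
            pairSum.put(v, c);
        });
        Map<Long, Double> cc = new HashMap<>();
        adj.forEach((v, ns) -> {
            int d = ns.size();
            cc.put(v, d < 2 ? 0.0 : (double) pairSum.get(v) / ((long) d * (d - 1)));
        });
        return cc;
    }

    // Test graph: triangle 1-2-3 plus a pendant edge 1-4.
    static Map<Long, Set<Long>> sampleAdj() {
        Map<Long, Set<Long>> adj = new HashMap<>();
        adj.put(1L, new HashSet<>(Arrays.asList(2L, 3L, 4L)));
        adj.put(2L, new HashSet<>(Arrays.asList(1L, 3L)));
        adj.put(3L, new HashSet<>(Arrays.asList(1L, 2L)));
        adj.put(4L, new HashSet<>(Arrays.asList(1L)));
        return adj;
    }

    public static void main(String[] args) {
        // e.g. {1=0.3333333333333333, 2=1.0, 3=1.0, 4=0.0} (map order may vary)
        System.out.println(run(sampleAdj()));
    }
}
```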

@watchMaker-sylar
Author

watchMaker-sylar commented Nov 19, 2025

@kitalkuyo-gita Thanks for your reply

  1. Modify the counting logic:
    Yes, you are right. I have fixed it.
  2. Use sampling or thresholding for higher-order nodes:
    Is there a way to get the number of nodes in a graph? Perhaps I could choose the message-sending strategy based on the node count.
  3. Fix type filtering conditions:
    In my implementation, vertexType means "only calculate vertices (including neighbors) with this label", so it needs four iterations, since a vertex does not store its neighbors' types.

@kitalkuyo-gita
Contributor

Regarding the second point, you can try the following usage:

// 1. Get the original degree (the number of neighbors)
int originDegree = neighborIds.size();

// 2. Decide the list of send targets (sample only when the degree exceeds the threshold)
List<Long> targetNeighbors;
if (originDegree > samplingThreshold && sampleSize > 0) {
    targetNeighbors = sampleNeighbors(neighborIds, sampleSize);
} else {
    targetNeighbors = neighborIds;
}

// 3. Build the message for the current vertex.
// Note: the message carries the original degree (originDegree) together
// with the (possibly sampled) target list.
ObjectRow message = ObjectRow.create(originDegree, targetNeighbors);

// 4. Send the message to each target neighbor
targetNeighbors.forEach(neighborId -> {
    this.context.sendMessage(neighborId, message);
});

Sampling method (memory-saving; adjust as needed):

private List<Long> sampleNeighbors(List<Long> neighbors, int k) {
    if (neighbors == null) return new ArrayList<>();
    int n = neighbors.size();
    if (n <= k) return new ArrayList<>(neighbors);

    // Strategy threshold: if only a tiny fraction is needed (e.g. under 5%),
    // pick random indices instead of copying the entire (possibly huge) list
    if (k < n * 0.05) {
        return pickRandomIndices(neighbors, n, k);
    }

    // Otherwise use a partial Fisher-Yates shuffle
    return partialShuffle(neighbors, n, k);
}

// Helper: sample via random indices (memory-saving)
private List<Long> pickRandomIndices(List<Long> neighbors, int n, int k) {
    List<Long> result = new ArrayList<>(k);
    // A Set guarantees the chosen indices are distinct
    java.util.Set<Integer> selectedIndices = new java.util.HashSet<>(k);
    ThreadLocalRandom rnd = ThreadLocalRandom.current();

    while (selectedIndices.size() < k) {
        int idx = rnd.nextInt(n);
        if (selectedIndices.add(idx)) { // add returns true only for a new index
            result.add(neighbors.get(idx));
        }
    }
    return result;
}

// Helper: partial Fisher-Yates shuffle
private List<Long> partialShuffle(List<Long> neighbors, int n, int k) {
    List<Long> copy = new ArrayList<>(neighbors);
    ThreadLocalRandom rnd = ThreadLocalRandom.current();
    for (int i = 0; i < k; i++) {
        Collections.swap(copy, i, i + rnd.nextInt(n - i));
    }
    return new ArrayList<>(copy.subList(0, k));
}

@watchMaker-sylar
Author

I understand what you mean. But I see this issue has already been resolved, so do I need to raise the PR again? Or how should I handle this PR?

@kitalkuyo-gita
Contributor

> I understand what you mean. But I see this issue has already been resolved, so do I need to raise the PR again? Or how should I handle this PR?

In my implementation, there is no sampling function to control the number of communications between nodes. If you wish, you may be able to add a sampling function based on my version.

@watchMaker-sylar
Author

> I understand what you mean. But I see this issue has already been resolved, so do I need to raise the PR again? Or how should I handle this PR?
>
> In my implementation, there is no sampling function to control the number of communications between nodes. If you wish, you may be able to add a sampling function based on my version.

Your code structure is excellent, which allows me to easily add sampling functionality directly on top of your version. Can you help me check if it meets the expectations?

@kitalkuyo-gita
Contributor

> I understand what you mean. But I see this issue has already been resolved, so do I need to raise the PR again? Or how should I handle this PR?
>
> In my implementation, there is no sampling function to control the number of communications between nodes. If you wish, you may be able to add a sampling function based on my version.
>
> Your code structure is excellent, which allows me to easily add sampling functionality directly on top of your version. Can you help me check if it meets the expectations?

I just saw your message. I've been working overtime these past few days. I'll try to review your code when I have some free time this week.

@watchMaker-sylar
Author

No problem at all! I really appreciate you taking the time to check in, especially when you've been so busy with overtime.

Please don't feel any pressure or rush. Focus on your work and your own well-being first. Whenever you get a spare moment, I'll be happy to hear your thoughts.

@kitalkuyo-gita
Contributor

> No problem at all! I really appreciate you taking the time to check in, especially when you've been so busy with overtime.
>
> Please don't feel any pressure or rush. Focus on your work and your own well-being first. Whenever you get a spare moment, I'll be happy to hear your thoughts.

Please fix CI first.

Contributor

@kitalkuyo-gita kitalkuyo-gita left a comment


LGTM
