Add shm size check #978

Merged
hiworldwzj merged 8 commits into ModelTC:main from flyinglandlord:add-shm-check on Aug 18, 2025
@flyinglandlord
Collaborator

New startup argument: --disable-shm-warning.

By default, LightLLM now checks the available SHM size every 120 seconds and logs a warning if it is below the recommended threshold (128GB).

If --disable-shm-warning is set, LightLLM performs a single SHM size check at startup and skips the periodic checks.
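
The behavior described above can be sketched roughly as follows. This is a minimal illustration, not the code merged in this PR: `build_parser`, `start_shm_monitoring`, and the injected `check` callable are hypothetical names. Only the `--disable-shm-warning` flag and the 120-second interval come from the description.

```python
import argparse
import threading
import time

SHM_CHECK_INTERVAL_S = 120  # interval taken from the PR description


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical parser; only the flag name comes from the PR description.
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--disable-shm-warning",
        action="store_true",
        help="check /dev/shm once at startup and skip the periodic checks",
    )
    return parser


def start_shm_monitoring(args, check, interval_s=SHM_CHECK_INTERVAL_S):
    """Run `check` once; unless disabled, keep re-running it in a daemon thread.

    Returns the monitoring thread, or None when periodic checks are disabled.
    """
    check()  # the startup check always runs
    if args.disable_shm_warning:
        return None

    def loop():
        while True:
            time.sleep(interval_s)
            check()

    thread = threading.Thread(target=loop, daemon=True)
    thread.start()
    return thread
```

Because the thread is a daemon, it does not keep the process alive once the server exits.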

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Summary of Changes

Hello @flyinglandlord, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a proactive shared memory (/dev/shm) monitoring feature to LightLLM. It aims to alert users about potentially insufficient SHM space, which is critical for optimal performance, by providing both continuous periodic checks and an option for a one-time startup verification.

Highlights

  • New Feature: SHM Size Monitoring: LightLLM now includes a mechanism to periodically check the available /dev/shm size, logging warnings if it falls below the recommended 128GB threshold. This helps users identify potential memory issues that could impact performance.
  • New Command-Line Argument: A new --disable-shm-warning argument has been added to api_cli.py. When this argument is set, LightLLM will perform a single SHM size check only during startup and then disable further periodic checks, providing flexibility for users who prefer a less verbose output.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This PR introduces a shared memory size check, enhancing robustness. The review suggests clarifying the memory check (total vs. available) and improving maintainability via constants and import conventions.

Comment on lines 23 to 39
def get_shm_size_gb():
    """
    Get the total size of /dev/shm (in GB).
    """
    try:
        shm_path = "/dev/shm"
        if not os.path.exists(shm_path):
            logger.error(f"{shm_path} not exist, this may indicate a system or Docker configuration anomaly.")
            return 0

        # shutil.disk_usage returns (total, used, free)
        total_bytes = shutil.disk_usage(shm_path).total
        total_gb = total_bytes / (1024 ** 3)
        return total_gb
    except Exception as e:
        logger.error(f"Error getting /dev/shm size: {e}")
        return 0
Contributor

high

The function calculates the total SHM size but the PR description refers to "available" space. This can be misleading. To check for available space, use shutil.disk_usage(shm_path).free and rename the function to get_shm_free_size_gb for clarity.

Suggested change
- def get_shm_size_gb():
-     """
-     Get the total size of /dev/shm (in GB).
-     """
-     try:
-         shm_path = "/dev/shm"
-         if not os.path.exists(shm_path):
-             logger.error(f"{shm_path} not exist, this may indicate a system or Docker configuration anomaly.")
-             return 0
-         # shutil.disk_usage returns (total, used, free)
-         total_bytes = shutil.disk_usage(shm_path).total
-         total_gb = total_bytes / (1024 ** 3)
-         return total_gb
-     except Exception as e:
-         logger.error(f"Error getting /dev/shm size: {e}")
-         return 0
+ def get_shm_free_size_gb():
+     """
+     Get the available size of /dev/shm (in GB).
+     """
+     try:
+         shm_path = "/dev/shm"
+         if not os.path.exists(shm_path):
+             logger.error(f"{shm_path} not exist, this may indicate a system or Docker configuration anomaly.")
+             return 0
+         # shutil.disk_usage returns (total, used, free)
+         free_bytes = shutil.disk_usage(shm_path).free
+         free_gb = free_bytes / (1024 ** 3)
+         return free_gb
+     except Exception as e:
+         logger.error(f"Error getting /dev/shm size: {e}")
+         return 0
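
To make the total-versus-free distinction concrete, here is a small sketch that reports both figures for any path. The helper name `shm_usage_gb` is hypothetical and not part of the PR. It relies only on the standard `shutil.disk_usage` call, which returns a named tuple with `total`, `used`, and `free` fields in bytes.

```python
import shutil

GIB = 1024 ** 3  # bytes per GiB, matching the 1024 ** 3 divisor used in the PR


def shm_usage_gb(path="/dev/shm"):
    """Return (total_gb, free_gb) for the filesystem that contains `path`."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    return usage.total / GIB, usage.free / GIB
```

On a host where other processes already hold large shared-memory segments, the two values can differ substantially, so a threshold check gives different answers depending on which one it uses.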

Comment on lines 47 to 51
required_size = 128  # 128G
if shm_size < required_size:
    logger.warning(f"{RED}Available shm size is less than 128G: {shm_size:.2f}G{ENDC}")
else:
    logger.info(f"{GREEN}/dev/shm available space is sufficient ({shm_size:.2f} GB >= {required_size} GB).{ENDC}")
Contributor

medium

For better maintainability, define required_size as a constant REQUIRED_SIZE_GB and use it in the warning message to avoid hardcoding the value 128G.

Suggested change
- required_size = 128  # 128G
- if shm_size < required_size:
-     logger.warning(f"{RED}Available shm size is less than 128G: {shm_size:.2f}G{ENDC}")
- else:
-     logger.info(f"{GREEN}/dev/shm available space is sufficient ({shm_size:.2f} GB >= {required_size} GB).{ENDC}")
+ REQUIRED_SIZE_GB = 128  # 128G
+ if shm_size < REQUIRED_SIZE_GB:
+     logger.warning(f"{RED}Available shm size is less than {REQUIRED_SIZE_GB}G: {shm_size:.2f}G{ENDC}")
+ else:
+     logger.info(f"{GREEN}/dev/shm available space is sufficient ({shm_size:.2f} GB >= {REQUIRED_SIZE_GB} GB).{ENDC}")
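
One way to take the suggestion a step further is to keep the threshold at module level and separate the comparison from the logging, which makes the check easy to test. The sketch below assumes this split. The function name `shm_status_message` is hypothetical, and the color codes from the PR are omitted for brevity.

```python
REQUIRED_SIZE_GB = 128  # recommended /dev/shm size from the PR description


def shm_status_message(shm_size_gb, required_gb=REQUIRED_SIZE_GB):
    """Return (is_sufficient, message) for a measured shm size in GB."""
    if shm_size_gb < required_gb:
        return False, f"Available shm size is less than {required_gb}G: {shm_size_gb:.2f}G"
    return True, (
        f"/dev/shm available space is sufficient "
        f"({shm_size_gb:.2f} GB >= {required_gb} GB)."
    )
```

A caller could then pass the message to `logger.warning` or `logger.info` depending on the returned flag.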

Comment on lines 100 to 108
import threading

def periodic_shm_warning():
    while True:
        check_shm_size()
        time.sleep(120)  # log a warning every 120 seconds

shm_warning_thread = threading.Thread(target=periodic_shm_warning, daemon=True)
shm_warning_thread.start()
Contributor

medium

For style and maintainability:

  • Move import threading to the top of the file (PEP 8).
  • Define the sleep interval 120 as a named constant (e.g., SHM_CHECK_INTERVAL_S = 120) at the module level.

Since I can't suggest changes outside the diff, I'll define the constant locally, but consider moving both the import and the constant to the module level.

Suggested change
- import threading
- def periodic_shm_warning():
-     while True:
-         check_shm_size()
-         time.sleep(120)  # log a warning every 120 seconds
- shm_warning_thread = threading.Thread(target=periodic_shm_warning, daemon=True)
- shm_warning_thread.start()
+ import threading
+ SHM_CHECK_INTERVAL_S = 120  # Consider moving this to a module-level constant
+ def periodic_shm_warning():
+     while True:
+         check_shm_size()
+         time.sleep(SHM_CHECK_INTERVAL_S)  # log a warning every 120 seconds
+ shm_warning_thread = threading.Thread(target=periodic_shm_warning, daemon=True)
+ shm_warning_thread.start()
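
As a possible variation on the loop above (not what the PR merged), the sleep can be replaced with `threading.Event.wait`, so the checker can be stopped promptly instead of waiting out a full interval. The names `start_periodic_check` and `check` are hypothetical.

```python
import threading

SHM_CHECK_INTERVAL_S = 120  # module-level constant, as the review suggests


def start_periodic_check(check, interval_s=SHM_CHECK_INTERVAL_S):
    """Call `check` now and then every `interval_s` seconds in a daemon thread.

    Returns (thread, stop_event); setting stop_event ends the loop.
    """
    stop_event = threading.Event()

    def loop():
        while not stop_event.is_set():
            check()
            # wait() returns as soon as stop_event is set, or after interval_s
            stop_event.wait(interval_s)

    thread = threading.Thread(target=loop, daemon=True)
    thread.start()
    return thread, stop_event
```

Calling `stop_event.set()` followed by `thread.join()` shuts the checker down cleanly, which can be convenient in tests.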

args.cache_capacity * (image_size_bytes + image_token_size_bytes)
+ req_shm_size_bytes
+ out_token_queue_size_bytes
)
Collaborator Author

args.running_max_req_size * 8 * 2 * args.max_req_total_len

Collaborator Author

@flyinglandlord flyinglandlord Aug 13, 2025

size of input_tokens and logprobs
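
Read together, the two comments above appear to describe an extra term for two per-request arrays (input tokens and logprobs), each holding up to `max_req_total_len` elements of 8 bytes. That reading of the factors `8` and `2` is an assumption, and the argument values below are made up for illustration.

```python
def token_and_logprob_shm_bytes(
    running_max_req_size,
    max_req_total_len,
    bytes_per_element=8,  # assumed: 8-byte elements
    num_arrays=2,         # assumed: one array for input tokens, one for logprobs
):
    """Estimate shm bytes as running_max_req_size * 8 * 2 * max_req_total_len."""
    return running_max_req_size * bytes_per_element * num_arrays * max_req_total_len


# Hypothetical settings: 1000 concurrent requests, 16384 tokens per request.
size_bytes = token_and_logprob_shm_bytes(1000, 16384)
print(f"{size_bytes} bytes = {size_bytes / 1024 ** 3:.3f} GB")
```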

# Assume that loading an image at the maximum resolution yields the most image_tokens from the tokenizer
if not hasattr(tokenizer, "get_image_token_length"):
    raise AttributeError("Tokenizer must have a 'get_image_token_length' method for multimodal models.")
max_image_tokens = tokenizer.get_image_token_length(None)
Collaborator Author

Construct an image and pass it in here to avoid errors.

req_class_size = ctypes.sizeof(ChunkedPrefillReq)
req_shm_size_bytes = req_class_size * args.running_max_req_size

# Estimate the shm size required by OutTokenQueue
Collaborator Author

This part was already counted by the sizeof above, so there is no need to compute it again.
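
The point about `sizeof` can be illustrated with a toy example: `ctypes.sizeof` of a structure already includes every embedded field, nested structures and arrays included, so adding the nested part again would double count it. The structures below are hypothetical stand-ins and do not reflect the real layout of `ChunkedPrefillReq`.

```python
import ctypes


class HypotheticalQueue(ctypes.Structure):
    # Stand-in for a queue embedded in the request object.
    _fields_ = [
        ("items", ctypes.c_int32 * 8),
        ("head", ctypes.c_int32),
        ("tail", ctypes.c_int32),
    ]


class HypotheticalReq(ctypes.Structure):
    # Stand-in for a request structure that embeds the queue by value.
    _fields_ = [
        ("request_id", ctypes.c_int64),
        ("out_queue", HypotheticalQueue),
    ]


def req_shm_size_bytes(running_max_req_size):
    """Total bytes for all request slots; the embedded queue is already included."""
    return ctypes.sizeof(HypotheticalReq) * running_max_req_size
```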

@hiworldwzj hiworldwzj merged commit 81b9ecb into ModelTC:main Aug 18, 2025
1 check passed
2 participants