- [2025.03.10] More detailed baseline instructions are provided!
- [2025.02.13] OpenCV Webinar by Professor Lu introducing LPCVC 2025
- [2025.02.01] Sample solution of Track2: X-Decoder released
- [2025.01.10] LPCVC 2025 accepted as CVPR 2025 Workshop
- [2024.12.10] LPCVC 2025 announced at NeurIPS 2024
Please refer to [LPCVC2025_BASELINE_INSTRUCTION] for the baseline environment setup and training instructions, as well as some hints for obtaining a better solution for the competition!
For more architectures and details, please refer to [XDecoder].
For detailed instructions on model compilation, profiling and inference, refer to the [AI Hub Documentation].
LPCVC 2025 Track 2: the sample solution's compilation, profiling, and inference pipeline is available in [compile_profile_inference_aihub]. Run `python ./compile_and_profile/compile_profile_inference_aihub.py` to run torch model inference, convert the model to ONNX format, and compile, profile, and run the QNN model on AIHub, all in one step.
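For reference, the ONNX export step of that pipeline looks roughly like the sketch below. It assumes a loaded torch `model` whose forward takes the image and text tensors described later in this document; the file name and dummy values are illustrative, not taken from the repo.
import torch

dummy_image = torch.randn(1, 3, 1024, 1024)        # image input: 1x3x1024x1024
dummy_text = torch.randint(0, 49408, (2, 1, 77))   # tokenized text + attention mask: 2x1x77
torch.onnx.export(
    model,                                          # assumed: the loaded torch model
    (dummy_image, dummy_text),
    "lpcvc25_track2_sample_solution.onnx",
    input_names=["image_input", "text_input"],
    output_names=["grounding_mask"],
)
model_path = "lpcvc25_track2_sample_solution.onnx"  # consumed by the compile job below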
# Submit compilation job to AIHub
compile_job = hub.submit_compile_job(
    model=model_path,
    name="lpcvc25_track2_sample_solution",
    device=hub.Device(deploy_device),
    options="--target_runtime qnn_context_binary",
)
# IMPORTANT! You must share your compile job with the LPCVC organizers so that we can pull and evaluate it.
compile_job.modify_sharing(add_emails=['[email protected]'])
model = compile_job.get_target_model()
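# (A sketch) The compiled QNN context binary can also be saved locally for
# inspection; Model.download() is part of the qai-hub client, the file name is illustrative.
model.download("lpcvc25_track2_sample_solution.bin")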
# Profile model if requested
profile_job = hub.submit_profile_job(
    name="lpcvc25_track2_sample_solution",
    model=model,
    device=hub.Device(deploy_device),
)
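# (A sketch) Profiling results can be downloaded as a dict for inspection.
# Field names follow current AI Hub profiles and may vary across client versions.
profile = profile_job.download_profile()
print(profile["execution_summary"]["estimated_inference_time"])  # in microseconds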
See [compile_profile_inference_aihub] for the complete inference and evaluation pipeline.
During evaluation, only the following inference commands will be used. Ensure your submitted model is correctly compiled and produces valid outputs on AIHub:
# Prepare inputs
aihub_inputs = {
    'image_input': [image_input.detach().cpu().numpy()],
    'text_input': [text_input.detach().cpu().numpy()]
}
# Run inference
inference_job = hub.submit_inference_job(
    name="lpcvc25_track2_sample_solution",
    model=model,
    device=hub.Device(deploy_device),
    inputs=aihub_inputs
)
qnn_outputs = inference_job.download_output_data() # shape=[1024, 1024], numpy.array
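To sanity-check the on-device output, the returned mask can be written out for visual inspection. A minimal sketch, assuming `qnn_outputs` has been unpacked to the 1024x1024 binary numpy array noted above (the file path is illustrative):
import cv2
import numpy as np

mask = np.asarray(qnn_outputs, dtype=np.uint8) * 255  # binary {0, 1} mask -> 0/255 grayscale
cv2.imwrite("pred_mask.png", mask)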
- Evaluation device: Snapdragon X Elite CRD
- Test set: 1000 images from ~200 categories
- 3-5 annotated masks per image
- Balanced across mask sizes and categories
- 3-5 text descriptions per mask
- Text descriptions include:
- Keywords
- Short phrases
- Detailed sentences (appearance, location, semantics, relationships, etc.)
- Image:
- RGB format, shape: 1x3x1024x1024
- Longest edge resized to 1024, padded to square
- Text:
Shape: 2x1x77 (token IDs + attention mask)
- Uses CLIP tokenizer output
# Image preprocessing example
import cv2
import numpy as np
import torch

def preprocess_image(img_path):
    """Loads an image, resizes it while maintaining aspect ratio, and pads it to 1024x1024."""
    # Read the image and convert from OpenCV's BGR to RGB
    image = cv2.imread(img_path)
    image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
    h, w, _ = image.shape
    # Resize the longest edge to 1024 while maintaining the aspect ratio
    if h > w:
        new_h = 1024
        new_w = int((w / h) * new_h)
    else:
        new_w = 1024
        new_h = int((h / w) * new_w)
    image_resized = cv2.resize(image, (new_w, new_h))
    pad_image = np.zeros((1024, 1024, 3), np.uint8)  # 1024x1024 canvas for padding
    pad_image[:new_h, :new_w] = image_resized  # place the resized image in the top-left corner
    pad_image = pad_image.astype(np.float32)
    pad_image = torch.as_tensor(np.ascontiguousarray(pad_image.transpose(2, 0, 1)))
    image_input = torch.unsqueeze(pad_image, 0).detach().cpu().numpy()  # shape: 1x3x1024x1024
    return image_input
All input images share the same shape, 1x3x1024x1024, with RGB values in [0, 255]. Each original image is first resized so that its longest edge equals 1024, then zero-padded to a 1024x1024 square.
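Example usage of the helper above (the image path is illustrative):
image_input = preprocess_image("./assets/example.jpg")
print(image_input.shape)  # (1, 3, 1024, 1024)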
- Uses OpenAI CLIP tokenizer
- Format:
[input_ids; attention_mask]
# Text tokenization
import torch
from transformers import CLIPTokenizer

tokenizer = CLIPTokenizer.from_pretrained('openai/clip-vit-base-patch32')
tokenizer.add_special_tokens({'cls_token': tokenizer.eos_token})
text = text if text.endswith('.') else text + '.'
tokens = tokenizer(text, padding='max_length', truncation=True, max_length=77, return_tensors='pt')
text_input = torch.stack((tokens['input_ids'], tokens['attention_mask']))  # Shape: 2x1x77
(Text tokenization) The QNN library does not yet support tokenizing text input on device. To reduce the influence of different text tokenizers on final performance, accuracy, and latency, we fixed the text tokenizer and input only the tokenized vector of the text to the model.
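Putting both preprocessing steps together, the AI Hub inference inputs can be assembled as below. A minimal sketch that reuses the `preprocess_image` helper and the `tokens` produced above; the image path and text are illustrative.
import torch

image_input = preprocess_image("./assets/example.jpg")                     # numpy, 1x3x1024x1024
text_input = torch.stack((tokens['input_ids'], tokens['attention_mask']))  # torch, 2x1x77
aihub_inputs = {
    'image_input': [image_input],
    'text_input': [text_input.detach().cpu().numpy()],
}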
- mIoU (mean Intersection over Union)
def compute_IoU(pred_seg, gd_seg):
    I = (pred_seg & gd_seg)
    U = (pred_seg | gd_seg)
    return I.sum() / (U.sum() + 1e-6)
# Compute mIoU across test set
pred = output['grounding_mask']  # the model output should already be a binary mask (after thresholding prediction.sigmoid() > 0.5)
gt = input['groundings']['masks'].bool()
IoUs = [compute_IoU(p, g) for p, g in zip(pred, gt)]
mIoU = sum(IoUs) / len(IoUs) * 100
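A quick sanity check of the metric on toy 2x2 masks (values illustrative):
import numpy as np

pred = np.array([[1, 1], [0, 0]], dtype=bool)
gt = np.array([[1, 0], [0, 0]], dtype=bool)
print(compute_IoU(pred, gt))  # intersection=1, union=2 -> ~0.5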
- The sample solution for LPCVC 2025 Track 2 is built on [XDecoder].
LPCVC 2025 Organizers:
- Website: Homepage
- Community: Slack
- Email: [email protected]