- [2025.03.10] More detailed baseline instructions are provided!
- [2025.02.13] OpenCV Webinar by Professor Lu introducing LPCVC 2025
- [2025.02.01] Sample solution of Track2: X-Decoder released
- [2025.01.10] LPCVC 2025 accepted as CVPR 2025 Workshop
- [2024.12.10] LPCVC 2025 announced at NeurIPS 2024
Please refer to [LPCVC2025_BASELINE_INSTRUCTION] for the baseline environment setup and training instructions, as well as some hints for obtaining a better solution for the competition!
For more architectures and details, please refer to [XDecoder].
For detailed instructions on model compilation, profiling and inference, refer to the [AI Hub Documentation].
LPCVC 2025 Track 2: the sample solution's compilation, profiling, and inference pipeline is available in [compile_profile_inference_aihub]. Run `python ./compile_and_profile/compile_profile_inference_aihub.py` to run torch model inference, convert the model to ONNX format, and compile, profile, and run the QNN model on AIHub, all in one step.
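For reference, the ONNX export step of that pipeline looks roughly like the sketch below. It assumes a loaded torch `model` whose forward takes the image and text tensors described later in this document; the file name and dummy values are illustrative, not taken from the repo.
import torch

dummy_image = torch.randn(1, 3, 1024, 1024)        # image input: 1x3x1024x1024
dummy_text = torch.randint(0, 49408, (2, 1, 77))   # tokenized text + attention mask: 2x1x77
torch.onnx.export(
    model,                                          # assumed: the loaded torch model
    (dummy_image, dummy_text),
    "lpcvc25_track2_sample_solution.onnx",
    input_names=["image_input", "text_input"],
    output_names=["grounding_mask"],
)
model_path = "lpcvc25_track2_sample_solution.onnx"  # consumed by the compile job below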
# Submit compilation job to AIHub
compile_job = hub.submit_compile_job(
    model=model_path,
    name="lpcvc25_track2_sample_solution",
    device=hub.Device(deploy_device),
    options="--target_runtime qnn_context_binary",
)
# IMPORTANT! You must share your compile job with the LPCVC organizers so that we can pull and evaluate it.
compile_job.modify_sharing(add_emails=['[email protected]'])
model = compile_job.get_target_model()
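# (A sketch) The compiled QNN context binary can also be saved locally for
# inspection; Model.download() is part of the qai-hub client, the file name is illustrative.
model.download("lpcvc25_track2_sample_solution.bin")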
# Profile model if requested
profile_job = hub.submit_profile_job(
    name="lpcvc25_track2_sample_solution",
    model=model,
    device=hub.Device(deploy_device),
)
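# (A sketch) Profiling results can be downloaded as a dict for inspection.
# Field names follow current AI Hub profiles and may vary across client versions.
profile = profile_job.download_profile()
print(profile["execution_summary"]["estimated_inference_time"])  # in microseconds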
See [compile_profile_inference_aihub] for the complete inference and evaluation pipeline.
During evaluation, only the following inference commands will be used. Ensure your submitted model is correctly compiled and produces valid outputs on AIHub:
# Prepare inputs
aihub_inputs = {
    'image_input': [image_input.detach().cpu().numpy()],
    'text_input': [text_input.detach().cpu().numpy()]
}
# Run inference
inference_job = hub.submit_inference_job(
    name="lpcvc25_track2_sample_solution",
    model=model,
    device=hub.Device(deploy_device),
    inputs=aihub_inputs
)
qnn_outputs = inference_job.download_output_data() # shape=[1024, 1024], numpy.array
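To sanity-check the on-device output, the returned mask can be written out for visual inspection. A minimal sketch, assuming `qnn_outputs` has been unpacked to the 1024x1024 binary numpy array noted above (the file path is illustrative):
import cv2
import numpy as np

mask = np.asarray(qnn_outputs, dtype=np.uint8) * 255  # binary {0, 1} mask -> 0/255 grayscale
cv2.imwrite("pred_mask.png", mask)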
- Evaluation device: Snapdragon X Elite CRD
- Test set: 1000 images from ~200 categories
- 3-5 annotated masks per image
- Balanced across mask sizes and categories
- 3-5 text descriptions per mask
- Text descriptions include:
- Keywords
- Short phrases
- Detailed sentences (appearance, location, semantics, relationships, etc.)
- Image:
- RGB format, shape: 1x3x1024x1024
- Longest edge resized to 1024, padded to square
- Text:
Shape: 2x1x77 (token IDs + attention mask)
- Uses CLIP tokenizer output
# Image preprocessing example
import cv2
import numpy as np
import torch

def preprocess_image(img_path):
    """Loads an image, resizes it while maintaining aspect ratio, and pads it to 1024x1024."""
    # Read the image and convert from OpenCV's BGR to RGB
    image = cv2.imread(img_path)
    image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
    h, w, _ = image.shape
    # Resize the longest edge to 1024 while maintaining the aspect ratio
    if h > w:
        new_h = 1024
        new_w = int((w / h) * new_h)
    else:
        new_w = 1024
        new_h = int((h / w) * new_w)
    image_resized = cv2.resize(image, (new_w, new_h))
    pad_image = np.zeros((1024, 1024, 3), np.uint8)  # 1024x1024 canvas for padding
    pad_image[:new_h, :new_w] = image_resized  # place the resized image in the top-left corner
    pad_image = pad_image.astype(np.float32)
    pad_image = torch.as_tensor(np.ascontiguousarray(pad_image.transpose(2, 0, 1)))
    image_input = torch.unsqueeze(pad_image, 0).detach().cpu().numpy()  # shape: 1x3x1024x1024
    return image_input
All input images share the same shape, 1x3x1024x1024, with RGB values in [0, 255]. Each original image is first resized so that its longest edge equals 1024, then zero-padded to a 1024x1024 square.
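Example usage of the helper above (the image path is illustrative):
image_input = preprocess_image("./assets/example.jpg")
print(image_input.shape)  # (1, 3, 1024, 1024)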
- Uses OpenAI CLIP tokenizer
- Format:
[input_ids; attention_mask]
# Text tokenization
import torch
from transformers import CLIPTokenizer

tokenizer = CLIPTokenizer.from_pretrained('openai/clip-vit-base-patch32')
tokenizer.add_special_tokens({'cls_token': tokenizer.eos_token})
text = text if text.endswith('.') else text + '.'
tokens = tokenizer(text, padding='max_length', truncation=True, max_length=77, return_tensors='pt')
text_input = torch.stack((tokens['input_ids'], tokens['attention_mask']))  # Shape: 2x1x77
(Text tokenization) The QNN library does not yet support tokenizing text input on device. To reduce the influence of different text tokenizers on final performance, accuracy, and latency, we fixed the text tokenizer and input only the tokenized vector of the text to the model.
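Putting both preprocessing steps together, the AI Hub inference inputs can be assembled as below. A minimal sketch that reuses the `preprocess_image` helper and the `tokens` produced above; the image path and text are illustrative.
import torch

image_input = preprocess_image("./assets/example.jpg")                     # numpy, 1x3x1024x1024
text_input = torch.stack((tokens['input_ids'], tokens['attention_mask']))  # torch, 2x1x77
aihub_inputs = {
    'image_input': [image_input],
    'text_input': [text_input.detach().cpu().numpy()],
}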
- mIoU (mean Intersection over Union)
def compute_IoU(pred_seg, gd_seg):
    I = (pred_seg & gd_seg)
    U = (pred_seg | gd_seg)
    return I.sum() / (U.sum() + 1e-6)
# Compute mIoU across test set
pred = output['grounding_mask']  # the model output should already be a binary mask (after thresholding prediction.sigmoid() > 0.5)
gt = input['groundings']['masks'].bool()
IoUs = [compute_IoU(p, g) for p, g in zip(pred, gt)]
mIoU = sum(IoUs) / len(IoUs) * 100
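A quick sanity check of the metric on toy 2x2 masks (values illustrative):
import numpy as np

pred = np.array([[1, 1], [0, 0]], dtype=bool)
gt = np.array([[1, 0], [0, 0]], dtype=bool)
print(compute_IoU(pred, gt))  # intersection=1, union=2 -> ~0.5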
- The sample solution for LPCVC 2025 Track 2 is built on [XDecoder].
LPCVC 2025 Organizers:
- Website: Homepage
- Community: Slack
- Email: [email protected]