Open-vocabulary detection and segmentation

The open-vocabulary AI runs on cloud GPUs (SAM2 + GroundingDINO + CLIP) and answers free-form prompts such as "find every traffic sign", "segment the road surface", or "highlight cracks larger than 10 cm". It is the cross-vertical edge-case handler: anything the trained models do not specifically cover can usually be addressed with a well-phrased prompt here.

Tier: Live for detection and segmentation on outdoor RGB imagery.

Run from the product

  1. Open a survey, raster, or single image.
  2. Click the AI icon.
  3. Pick Open-vocabulary detection or Open-vocabulary segmentation.
  4. Type a prompt. Examples:
    • traffic sign (detection)
    • road surface defect (detection or segmentation)
    • vegetation overhanging the carriageway (segmentation)
    • manhole cover (detection)
    • corroded section of pipe (segmentation)
  5. Review the estimated credit cost (1 credit per inference on a single image), then confirm.

Results return in 3 to 8 seconds depending on image size. Detections appear as bounding boxes; segmentations appear as coloured masks.

Run the public demo (no account required)

The public /try page at stratumly.com/try lets you upload one image and run an open-vocabulary inference without signing up. It is useful for quick demos or for showing a prospect what the AI does.

  • 5 requests per IP per day.
  • 20 MB max upload size.
  • EXIF metadata is stripped before processing.

Each /try request runs a single inference, with no feedback, no persistence, and no per-tenant fine-tuning. For production use, run the same models from inside a Stratumly project.
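
The upload and rate limits listed above can be checked on the client before an image is sent. The following is a minimal sketch, not part of any Stratumly SDK: the helper names are hypothetical, and reading "20 MB" as 20 * 1024 * 1024 bytes is an assumption, since the page does not say which convention applies.

```python
import os

# Limits quoted for the /try page. Treating "20 MB" as 20 * 1024 * 1024
# bytes is an assumption; the page does not state the convention.
MAX_UPLOAD_BYTES = 20 * 1024 * 1024
MAX_REQUESTS_PER_DAY = 5


def can_submit(size_bytes: int, requests_today: int) -> tuple[bool, str]:
    """Return (ok, reason) for a planned /try upload (hypothetical helper)."""
    if size_bytes > MAX_UPLOAD_BYTES:
        return False, "image exceeds the 20 MB upload limit"
    if requests_today >= MAX_REQUESTS_PER_DAY:
        return False, "daily limit of 5 requests reached"
    return True, "ok"


def can_submit_file(path: str, requests_today: int) -> tuple[bool, str]:
    """Same check, reading the size from a file on disk."""
    return can_submit(os.path.getsize(path), requests_today)
```

The server enforces the request limit per IP, so a local counter is only a convenience; other users behind the same IP address may already have used part of the daily allowance.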

Prompting tips

The models work best on prompts that name a visible object or surface:

  • Good: "manhole cover", "broken bollard", "rusted railing".
  • Less good: "anything unsafe", "interesting features" (too abstract).

For segmentation tasks, prompt the surface or region you want masked, not the abstract concept:

  • Good: "road surface", "water in a flooded area", "snow cover".
  • Less good: "areas that need repair" (no visual concept to ground on).

If a prompt returns no detections, try:

  1. A simpler phrasing.
  2. Lowering the confidence threshold in the result viewer.
  3. A photo of the same scene from a different vantage point.
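
To illustrate why lowering the confidence threshold can surface results that were hidden, here is a small sketch of threshold filtering. The `Detection` record and its field names are hypothetical and do not describe the product's actual result format; confidences are assumed to lie between 0 and 1.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    # Hypothetical result record; field names are illustrative only.
    label: str
    confidence: float  # assumed to lie in [0, 1]
    box: tuple[float, float, float, float]  # x_min, y_min, x_max, y_max


def visible(detections: list[Detection], threshold: float) -> list[Detection]:
    """Keep only detections whose confidence meets the threshold."""
    return [d for d in detections if d.confidence >= threshold]


results = [
    Detection("manhole cover", 0.62, (120, 340, 210, 420)),
    Detection("manhole cover", 0.31, (800, 95, 860, 150)),
]

print(len(visible(results, 0.5)))  # 1
print(len(visible(results, 0.3)))  # 2
```

A lower threshold shows more candidates but also admits more false positives, so it pairs naturally with the Accept / Reject review step.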

Combine with trained models

The trained defect-detection and segmentation models handle the head of your distribution (the common asset classes in your vertical). The open-vocabulary model handles the long tail. Common pattern:

  1. Run defect detection across the survey.
  2. Review the results.
  3. For anything the trained model missed, run an open-vocabulary prompt against the specific photo.

This pattern keeps credit usage low (the trained model is fast and cheap) while still catching unusual cases.
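
A rough way to see the saving is to compare the open-vocabulary credits for a targeted follow-up against running a prompt over the whole survey. The sketch below uses the rate of 1 credit per single-image inference; the photo IDs and survey size are made up, one prompt per photo is assumed, and the cost of the trained model is left out because it is not stated here.

```python
# 1 credit per inference on a single image, as shown at confirmation.
OPEN_VOCAB_CREDITS_PER_IMAGE = 1


def open_vocab_credits(photo_ids: list[str]) -> int:
    """Credits for one open-vocabulary prompt on each distinct photo."""
    return OPEN_VOCAB_CREDITS_PER_IMAGE * len(set(photo_ids))


# Made-up survey of 500 photos and three photos flagged during review.
survey = [f"photo_{i:04d}" for i in range(500)]
missed = ["photo_0012", "photo_0207", "photo_0431"]

print(open_vocab_credits(missed))  # 3
print(open_vocab_credits(survey))  # 500
```

Images large enough to be tiled may cost more than this estimate suggests.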

Provide feedback

Each open-vocabulary detection or segmentation has Accept / Reject controls. Accept promotes the result to a feature in your asset register or to an inspection finding. Reject (with an optional reason) marks it as a false positive.

Accept / Reject feedback on a recurring prompt is how an open-vocabulary inference graduates into a trained model for your tenant. After enough feedback on the same prompt class, we can train a per-tenant model that is faster and cheaper than the open-vocabulary path.
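
If you keep your own log of review decisions, a simple tally per prompt shows which prompts are accumulating feedback. This is only an illustrative sketch: the log format is hypothetical, grouping prompts by lower-cased text is an assumption about what counts as the same prompt class, and no particular amount of feedback is implied to be "enough".

```python
from collections import Counter


def tally(feedback: list[tuple[str, bool]]) -> dict[str, Counter]:
    """Count accepts and rejects per prompt from (prompt, accepted) pairs."""
    totals: dict[str, Counter] = {}
    for prompt, accepted in feedback:
        key = prompt.strip().lower()  # assumed grouping rule
        outcome = "accept" if accepted else "reject"
        totals.setdefault(key, Counter())[outcome] += 1
    return totals


# Hypothetical review log.
log = [
    ("manhole cover", True),
    ("Manhole cover", True),
    ("manhole cover", False),
    ("rusted railing", True),
]

totals = tally(log)
print(totals["manhole cover"]["accept"], totals["manhole cover"]["reject"])  # 2 1
```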

Limitations

  • Cloud GPU inference adds 3 to 8 seconds of latency per call. Not suitable for live video.
  • The model has a limited context window per inference, so very wide aerial imagery may need to be tiled. The worker handles tiling automatically, but it costs more credits.
  • Indoor imagery, thermal imagery, and night-time imagery work worse than daylight outdoor RGB.
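
To get a feel for how tiling multiplies the work on wide imagery, the sketch below counts the square tiles needed to cover an image. The tile size and overlap are hypothetical values chosen for illustration, not the worker's actual settings, and how tiles translate into credits is not stated here.

```python
import math


def tile_grid(width: int, height: int, tile: int, overlap: int = 0) -> tuple[int, int]:
    """Return (columns, rows) of square tiles needed to cover the image."""
    if tile <= 0 or not 0 <= overlap < tile:
        raise ValueError("need tile > 0 and 0 <= overlap < tile")
    stride = tile - overlap  # distance between the starts of adjacent tiles
    cols = max(1, math.ceil((width - overlap) / stride))
    rows = max(1, math.ceil((height - overlap) / stride))
    return cols, rows


# Hypothetical 8000 x 6000 px aerial image with 1024 px tiles.
cols, rows = tile_grid(8000, 6000, tile=1024)
print(cols, rows, cols * rows)  # 8 6 48

# The same image with a 128 px overlap between neighbouring tiles.
cols, rows = tile_grid(8000, 6000, tile=1024, overlap=128)
print(cols, rows, cols * rows)  # 9 7 63
```

Overlap helps avoid cutting objects in half at tile borders, at the price of more tiles to process.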

What next?