SegmentObject

class agentlego.tools.SegmentObject(sam_model='sam_vit_h_4b8939.pth', grounding_model='glip_atss_swin-t_a_fpn_dyhead_pretrain_obj365', device='cuda', toolmeta=None)

A tool to segment all objects of a specified kind in an image.

Parameters:
  • sam_model (str) – The name of the Segment Anything checkpoint used for inference. Available names are listed in the segment_anything repository. Defaults to sam_vit_h_4b8939.pth.

  • grounding_model (str) – The name of the model used for grounding. Available names are listed in the MMDetection repository. Defaults to glip_atss_swin-t_a_fpn_dyhead_pretrain_obj365.

  • device (str) – The device to load the models on. Defaults to 'cuda'.

  • toolmeta (None | dict | ToolMeta) – Additional information about the tool, used in place of the default tool meta. Defaults to None.
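
Because both models are loaded onto the device named by `device`, a script that may run on machines without a GPU can choose the device before loading the tool. The following is a minimal sketch, assuming PyTorch is the backend; `pick_device` is a hypothetical helper and not part of agentlego.

```python
def pick_device(preferred: str = 'cuda') -> str:
    """Return `preferred` if it looks usable, otherwise fall back to 'cpu'.

    Hypothetical helper for illustration; not an agentlego API.
    """
    # Non-CUDA requests (e.g. 'cpu') are passed through unchanged.
    if not preferred.startswith('cuda'):
        return preferred
    try:
        import torch
    except ImportError:
        # Without PyTorch there is no way to confirm CUDA support.
        return 'cpu'
    return preferred if torch.cuda.is_available() else 'cpu'


device = pick_device()
# The result can then be passed on, e.g.:
# tool = load_tool('SegmentObject', device=device)
```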

Default Tool Meta

  • name: SegmentObject

  • description: This tool can segment objects of the specified kind in the input image and return the segmentation result image.

  • inputs:

    • image (ImageIO)

    • text (str): The object to segment.

  • outputs:

    • ImageIO: The segmentation result image.
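
The default tool meta above can be read as a plain data structure. The sketch below mirrors the listed fields in a Python dict and derives a one-line call signature from it. The key names and layout are assumptions for illustration only; they are not confirmed to match the schema of agentlego's `ToolMeta` or of the `toolmeta` argument.

```python
# Hypothetical dict layout mirroring the "Default Tool Meta" list above.
DEFAULT_META = {
    'name': 'SegmentObject',
    'description': (
        'This tool can segment the specified kind of objects in the input '
        'image, and return the segmentation result image.'
    ),
    'inputs': [
        {'name': 'image', 'type': 'ImageIO'},
        {'name': 'text', 'type': 'str', 'description': 'The object to segment.'},
    ],
    'outputs': [
        {'type': 'ImageIO', 'description': 'The segmentation result image.'},
    ],
}


def signature(meta: dict) -> str:
    """Render the meta as a readable call signature."""
    args = ', '.join(f"{item['name']}: {item['type']}" for item in meta['inputs'])
    outs = ', '.join(item['type'] for item in meta['outputs'])
    return f"{meta['name']}({args}) -> {outs}"


print(signature(DEFAULT_META))
```

This makes the calling convention explicit: the tool takes an image and a text prompt, in that order, and returns an image.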

Examples

Download the demo resource

wget http://download.openmmlab.com/agentlego/cups.png

Use the tool directly (without agent)

from agentlego.apis import load_tool

# load tool
tool = load_tool('SegmentObject', device='cuda')

# apply tool
segmentation = tool('cups.png', 'water cup')

With Lagent

from lagent import ReAct, GPTAPI, ActionExecutor
from agentlego.apis import load_tool

# load tools and build agent
# Set the `OPENAI_API_KEY` environment variable before running.
tool = load_tool('SegmentObject', device='cuda').to_lagent()
agent = ReAct(GPTAPI(temperature=0.), action_executor=ActionExecutor([tool]))

# agent running with the tool.
ret = agent.chat('Please segment all water cups in the image `cups.png`.')
for step in ret.inner_steps[1:]:
    print('------')
    print(step['content'])

Set up

Before using the tool, make sure the required dependencies are installed by running the commands below.

pip install openmim segment_anything
mim install mmdet
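
To confirm the installation from Python, the importable modules can be checked without loading any models. This is a minimal sketch using only the standard library; the mapping from package name to import name (in particular `openmim` to `mim`) is an assumption, and `missing_dependencies` is a hypothetical helper.

```python
import importlib.util

# Package name -> import name (assumed; adjust if your installation differs).
REQUIRED = {
    'openmim': 'mim',
    'segment_anything': 'segment_anything',
    'mmdet': 'mmdet',
}


def missing_dependencies(required=REQUIRED):
    """Return the package names whose module cannot be found."""
    return [
        package
        for package, module in required.items()
        if importlib.util.find_spec(module) is None
    ]


missing = missing_dependencies()
if missing:
    print('Missing dependencies:', ', '.join(missing))
else:
    print('All dependencies found.')
```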

Reference

This tool uses the Segment Anything Model (SAM) for segmentation and GLIP for grounding. See the following papers for details.

@misc{kirillov2023segment,
      title={Segment Anything},
      author={Alexander Kirillov and Eric Mintun and Nikhila Ravi and Hanzi Mao and Chloe Rolland and Laura Gustafson and Tete Xiao and Spencer Whitehead and Alexander C. Berg and Wan-Yen Lo and Piotr Dollár and Ross Girshick},
      year={2023},
      eprint={2304.02643},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}
@inproceedings{li2021grounded,
      title={Grounded Language-Image Pre-training},
      author={Liunian Harold Li* and Pengchuan Zhang* and Haotian Zhang* and Jianwei Yang and Chunyuan Li and Yiwu Zhong and Lijuan Wang and Lu Yuan and Lei Zhang and Jenq-Neng Hwang and Kai-Wei Chang and Jianfeng Gao},
      year={2022},
      booktitle={CVPR},
}