AudioTextToImage

class agentlego.tools.AudioTextToImage(device='cpu', toolmeta=None)

A tool that generates an image from an audio clip and a text prompt.

Parameters:

device (str) – The device on which to load the model. Defaults to 'cpu'.
toolmeta (None | dict | ToolMeta) – Additional metadata for the tool. Defaults to None.

Default Tool Meta

name: AudioTextToImage
description: This tool can generate an image according to the input audio and the input description.
inputs:
- audio (AudioIO)
- prompt (str)
outputs:
- ImageIO

Examples

Download the demo resource

wget http://download.openmmlab.com/agentlego/cat.wav
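Before passing the file to the tool, it can help to confirm that the download produced a readable audio file. The following is a minimal sketch using only the Python standard library. It assumes `cat.wav` is a PCM WAV file in the current directory. `describe_wav` is a hypothetical helper written for this illustration and is not part of AgentLego.

```python
import wave
from pathlib import Path


def describe_wav(path):
    """Return basic properties of a PCM WAV file, read from its header.

    Hypothetical helper for illustration; not part of AgentLego.
    Raises wave.Error if the file is not a WAV file the wave module can read.
    """
    with wave.open(str(path), 'rb') as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
        return {
            'channels': wav.getnchannels(),
            'sample_rate': rate,
            'duration_sec': frames / rate if rate else 0.0,
        }


# Assumption: the demo file was saved as `cat.wav` in the current directory.
demo = Path('cat.wav')
if demo.exists():
    print(describe_wav(demo))
else:
    print('cat.wav not found; run the wget command first.')
```

If the file is truncated or is not a WAV file, `wave.open` raises `wave.Error`, which is a quick signal to download it again.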

Use the tool directly (without agent)

from agentlego.apis import load_tool

# load tool
tool = load_tool('AudioTextToImage', device='cuda')

# apply tool
image = tool('cat.wav', 'flying in the sky')

With Lagent

from lagent import ReAct, GPTAPI, ActionExecutor
from agentlego.apis import load_tool

# load tools and build agent
# Set the `OPENAI_API_KEY` environment variable before running.
tool = load_tool('AudioTextToImage', device='cuda').to_lagent()
agent = ReAct(GPTAPI(temperature=0.), action_executor=ActionExecutor([tool]))

# run the agent with the tool
ret = agent.chat('Please generate an image according to the audio `cat.wav`, and it should fly in the sky.')
for step in ret.inner_steps[1:]:
    print('------')
    print(step['content'])

Set up

Before using the tool, make sure the required dependencies are installed by running the command below.

pip install timm ftfy iopath diffusers pytorchvideo
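To check which of these packages are already available in the current environment, a small sketch like the following may be useful. It uses only the standard library and assumes that each package's import name matches its pip name, which appears to hold for the five packages listed. `missing_packages` is a hypothetical helper for this illustration, not an AgentLego function.

```python
from importlib.util import find_spec

# Package names taken from the pip command above.
# Assumption: each import name is the same as the pip name.
REQUIRED = ['timm', 'ftfy', 'iopath', 'diffusers', 'pytorchvideo']


def missing_packages(names):
    """Return the names that cannot be located as top-level modules.

    find_spec looks the module up without importing it, so the check
    stays fast even for heavy packages.
    """
    return [name for name in names if find_spec(name) is None]


missing = missing_packages(REQUIRED)
if missing:
    print('Missing packages: ' + ', '.join(missing))
else:
    print('All listed dependencies are importable.')
```

This only confirms that the modules can be found. It does not verify versions or that a CUDA-enabled build of the underlying framework is installed.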

Reference

This tool uses an ImageBind model. See the following paper for details.

@misc{girdhar2023imagebind,
      title={ImageBind: One Embedding Space To Bind Them All},
      author={Rohit Girdhar and Alaaeldin El-Nouby and Zhuang Liu and Mannat Singh and Kalyan Vasudev Alwala and Armand Joulin and Ishan Misra},
      year={2023},
      eprint={2305.05665},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}