SpeechToText

class agentlego.tools.SpeechToText(model='openai/whisper-base', device='cuda', toolmeta=None)

A tool to recognize speech and convert to text.

Parameters:
  • model (str) – The name of the model used for inference. Available model names are listed on the HuggingFace model page. Defaults to openai/whisper-base.

  • device (str) – The device to load the model on. Defaults to 'cuda'.

  • toolmeta (None | dict | ToolMeta) – The additional info of the tool. Defaults to None.
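Because the signature defaults to `device='cuda'`, it can be convenient to fall back to the CPU on machines without a usable GPU. The helper below is a minimal sketch of that idea; `pick_device` is a hypothetical name, not part of AgentLego, and it assumes PyTorch is the backend whose CUDA availability matters.

```python
import importlib.util


def pick_device(preferred: str = "cuda") -> str:
    """Return `preferred` if it looks usable, otherwise fall back to 'cpu'.

    Hypothetical helper for illustration only; not an AgentLego API.
    """
    # Non-CUDA requests (for example 'cpu') are passed through unchanged.
    if not preferred.startswith("cuda"):
        return preferred
    # Only query CUDA if PyTorch is actually installed.
    if importlib.util.find_spec("torch") is not None:
        import torch

        if torch.cuda.is_available():
            return preferred
    return "cpu"


print(pick_device())
```

The result can then be passed as the `device` argument, for example `load_tool('SpeechToText', device=pick_device())`.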

Default tool information

  • Name: SpeechToText

  • Description: The tool can translate spoken language audio into text.

  • Inputs:

    • audio (AudioIO)

  • Outputs:

    • str

Examples

Use the tool directly (without agent)

from agentlego.apis import load_tool

# load tool
tool = load_tool('SpeechToText', device='cuda')

# apply tool
text = tool('examples/demo.m4a')
print(text)

With Lagent

from lagent import ReAct, GPTAPI, ActionExecutor
from agentlego.apis import load_tool

# load the tool and build the agent
# set the `OPENAI_API_KEY` environment variable before running
tool = load_tool('SpeechToText', device='cuda').to_lagent()
agent = ReAct(GPTAPI(temperature=0.), action_executor=ActionExecutor([tool]))

# run the agent with the tool
audio_path = 'examples/demo.m4a'
ret = agent.chat(f'Please tell me the content of the audio `{audio_path}`')
for step in ret.inner_steps[1:]:
    print('------')
    print(step['content'])

Set up

Before using the tool, make sure the required dependencies are installed by running the command below.

pip install -U transformers
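To confirm the installation from Python, a quick importability check can help. This is a minimal sketch using only the standard library; `missing_packages` is a hypothetical helper name, and the package list assumes `transformers` is the only dependency named in this section.

```python
import importlib.util


def missing_packages(names):
    """Return the top-level package names that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_packages(["transformers"])
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are importable.")
```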

Reference

By default, this tool uses a Whisper model. See the following paper for details.

@misc{radford2022whisper,
  doi = {10.48550/ARXIV.2212.04356},
  url = {https://arxiv.org/abs/2212.04356},
  author = {Radford, Alec and Kim, Jong Wook and Xu, Tao and Brockman, Greg and McLeavey, Christine and Sutskever, Ilya},
  title = {Robust Speech Recognition via Large-Scale Weak Supervision},
  publisher = {arXiv},
  year = {2022},
  copyright = {arXiv.org perpetual, non-exclusive license}
}
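For readers who want to see what running the default Whisper checkpoint looks like outside an agent, the sketch below uses the `transformers` speech-recognition pipeline directly. It is an illustration under assumptions, not a description of how the tool is implemented internally: `transcribe` is a hypothetical function, the `'cpu'` default is chosen here for portability, and decoding a file path such as an `.m4a` typically requires `ffmpeg` to be installed.

```python
def transcribe(audio_path: str,
               model: str = "openai/whisper-base",
               device: str = "cpu") -> str:
    """Transcribe an audio file to text with a Whisper checkpoint.

    Hypothetical helper for illustration; not part of AgentLego.
    """
    # Imported lazily so defining this function does not require transformers.
    from transformers import pipeline

    # Build an automatic speech recognition pipeline for the given checkpoint.
    asr = pipeline("automatic-speech-recognition", model=model, device=device)
    # The pipeline returns a dict; the transcription is under the "text" key.
    result = asr(audio_path)
    return result["text"]
```

Calling `transcribe('examples/demo.m4a')` would download the checkpoint on first use and return the recognized text as a string.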