
Why Your Next Chatbot Should Understand More Than Just Text

Chatbots are getting smarter, but many still rely on text-only inputs, limiting their usefulness in today’s multimodal world.

Customers interact with your business using images, voice notes, and PDFs, not just words. That means your digital touchpoints should be able to understand these inputs. Thankfully, the rise of multimodal AI is making this more accessible than ever.

Open-source models like Bagel and commercial offerings from OpenAI and others allow bots to process multiple forms of input simultaneously. This opens up innovative use cases:

  • Insurance: Clients can submit photos of damage and have a bot begin the claim process
  • Health & Wellness: Users can upload a fitness report and get personalized advice
  • Ecommerce Support: Bots can process screenshots of carts to resolve order issues faster
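To make this concrete, here is a minimal sketch of what a multimodal request can look like, using the official OpenAI Python SDK with the insurance example above. The model name, file path, and prompt are illustrative placeholders; other commercial providers and open-source models expose similar interfaces.

```python
# Minimal sketch: one image plus a typed question sent to a vision-capable
# model via the official OpenAI Python SDK (v1.x). File path and prompt are
# illustrative placeholders, not a production claims workflow.
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Encode the customer's uploaded photo (e.g. vehicle damage for a claim)
with open("damage_photo.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="gpt-4o",  # any vision-capable model works here
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "Describe the damage shown and suggest the next step for a claim.",
                },
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"},
                },
            ],
        }
    ],
)

print(response.choices[0].message.content)
```

The key point is that the photo and the typed request travel in the same message, so the model reasons over both at once instead of bouncing the customer between separate upload and chat flows.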

But what’s the ROI?

Faster resolution times, better customer satisfaction, and reduced burden on human support teams. In early pilots we’ve supported, multimodal chatbots have reduced average handling time by up to 40%.

Smart questions to ask your development partner:

  • Can this bot handle both structured and unstructured data (e.g. an image + a typed request)?
  • How will the system deal with ambiguity in visual inputs? (see the sketch after this list)
  • Can we fine-tune a lightweight model on our own data to reduce hosting costs?
  • What privacy controls and data flows are in place for user-uploaded content?
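On the ambiguity question in particular, one common pattern is to have the model flag its own uncertainty and fall back to a clarifying question rather than guessing. Here is a rough sketch using the OpenAI Python SDK and the ecommerce screenshot example; the JSON shape, prompts, and helper name are assumptions for illustration, not a prescribed design.

```python
# Sketch of an ambiguity fallback: ask the model to report whether it could
# confidently read the screenshot, and ask a clarifying question if not.
# The JSON schema and wording are illustrative assumptions.
import json
from openai import OpenAI

client = OpenAI()

def interpret_screenshot(image_data_url: str, user_message: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",
        response_format={"type": "json_object"},  # ask for structured JSON output
        messages=[
            {
                "role": "system",
                "content": (
                    "Extract the order details visible in the screenshot. "
                    'Reply as JSON: {"order_id": string or null, "issue": string, '
                    '"confident": boolean}. Set "confident" to false if the image '
                    "is unclear or details are missing."
                ),
            },
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": user_message},
                    {"type": "image_url", "image_url": {"url": image_data_url}},
                ],
            },
        ],
    )
    result = json.loads(response.choices[0].message.content)

    if not result.get("confident") or not result.get("order_id"):
        # Don't guess: ask a clarifying question (or escalate to a human agent).
        return (
            "I couldn't read the order number clearly. Could you re-upload the "
            "screenshot or type the order ID?"
        )
    return f"Thanks! I found order {result['order_id']} and I'm looking into: {result['issue']}"
```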

Like any tech investment, success starts with good questions tied to real business goals. Multimodal doesn’t mean replacing humans; it means augmenting them with richer context so they can serve your customers more effectively.

We're doing a lot of exciting work in this space. If you’re exploring smarter chatbot experiences, or want to discuss how this could work for your customers, I’d love to chat.