Multimodal AI in Research Is Coming, Just Not Yet

August 7, 2025

AI-driven voice-to-voice research has made significant progress. We now have tools capable of conducting live, human-like conversations using real-time speech recognition, natural language processing, and sentiment detection. This is already changing how some qualitative and quantitative research is conducted.

The next evolution many are anticipating is multimodal AI, which integrates voice and vision to create a more complete picture of respondent experience. I have been thinking about what this could mean for marketing research and where current limitations exist.

What Is Multimodal AI?

Multimodal AI refers to systems that process and interpret multiple types of input at the same time. In marketing research, this means capturing and interpreting what the respondent says, how they say it, and how they appear while saying it. These systems analyze audio, tone of voice, facial expressions, gaze, and posture using a combination of speech recognition, sentiment analysis, and computer vision. These signals are interpreted together to assess engagement, emotion, and behavioral cues.
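To make this concrete, the short Python sketch below shows one simplified way separately scored signals might be read together to produce a rough engagement estimate. Everything in it, from the signal names to the weights and thresholds, is an illustrative assumption rather than a description of any existing platform.

```python
from dataclasses import dataclass

@dataclass
class RespondentMoment:
    """Hypothetical per-moment signals a multimodal system might produce.

    Each value is assumed to be normalized to the range -1.0 to 1.0
    (negative = disengaged or negative affect, positive = engaged or positive affect).
    """
    transcript_sentiment: float  # from speech-to-text plus sentiment analysis
    vocal_tone: float            # from prosody / tone-of-voice analysis
    facial_expression: float     # from computer vision (e.g., smile vs. frown)
    gaze_on_screen: float        # attention proxy: 1.0 = steady gaze, -1.0 = looking away


def fuse_signals(moment: RespondentMoment) -> dict:
    """Combine the separate modality scores into a single rough read.

    A simple weighted average stands in for whatever fusion model a real
    platform would use; the weights below are illustrative only.
    """
    weights = {
        "transcript_sentiment": 0.4,
        "vocal_tone": 0.25,
        "facial_expression": 0.25,
        "gaze_on_screen": 0.10,
    }
    engagement = sum(getattr(moment, name) * w for name, w in weights.items())

    # Flag moments where what was said and how the respondent looked disagree --
    # the kind of cue a human moderator would notice and probe.
    mismatch = abs(moment.transcript_sentiment - moment.facial_expression) > 0.8

    return {"engagement": round(engagement, 2), "verbal_nonverbal_mismatch": mismatch}


if __name__ == "__main__":
    # A respondent who says positive things but looks disengaged.
    moment = RespondentMoment(
        transcript_sentiment=0.7,
        vocal_tone=0.1,
        facial_expression=-0.5,
        gaze_on_screen=-0.2,
    )
    print(fuse_signals(moment))
    # -> {'engagement': 0.16, 'verbal_nonverbal_mismatch': True}
```

A real system would replace the weighted average with trained models, but the point stands: the value comes from reading the signals together rather than in isolation.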

The goal is to replicate what a skilled human moderator does naturally: observe, listen, and respond based on a combination of verbal and nonverbal cues.

Why It Matters

Multimodal AI could provide deeper emotional understanding, more accurate sentiment and intent detection, a more natural respondent experience, and richer nonverbal context in remote interviews. When used effectively, it may help researchers surface reactions and emotions that respondents do not or cannot articulate directly.

Voice-to-Voice Research Is Here. Multimodal Is Not.

Voice-to-voice research is already being used. Platforms like unSurvey.ai allow AI interviewers to engage respondents in dynamic, natural conversations using real-time voice generation and adaptive probing. These interviews can feel realistic and scale far beyond what human teams can support.

Multimodal research is still in development. While companies like Realeyes, Affectiva, and Entropik offer facial expression and attention analysis, these tools are typically used after interviews or in controlled environments. AI avatars from platforms like Synthesia and HeyGen are improving in visual realism and voice synchronization, but they remain scripted and are not capable of live interaction or emotional adaptation.

Today, there is no commercially viable platform that conducts real-time, fully multimodal interviews with a human-like AI moderator.

How Soon Could We Get There?

It is my understanding that several developments are underway that could bring multimodal AI closer to viable use in marketing research. We are beginning to see broader use of tools that analyze facial expressions and vocal tone after interviews, particularly in ad testing and user experience research. AI-generated avatars that look and sound human are also improving, but they remain scripted and asynchronous. They are not yet able to interact in real time or adapt meaningfully to complex emotional signals.

In adjacent fields like education and mental health, emotionally aware AI agents are being tested in controlled environments. These systems may respond to shifts in tone or facial expression, but they still operate within narrow parameters and lack the nuanced understanding that live human interaction requires.

At this point, I am not aware of any fully developed multimodal AI applications being used in commercial marketing research. Some companies may be experimenting in this space, but nothing appears ready for broad adoption. In the next few years, we may begin to see prototypes designed for specific use cases, such as concept testing or one-on-one interviews, but significant limitations will remain. These include emotional accuracy, cultural sensitivity, privacy protections, and interpretability of outputs.

Importantly, it may be years before multimodal AI is capable of conducting a focus group or even a mini-group. These formats require the ability to manage turn-taking, interpret group dynamics, respond to overlapping speech, and adjust based on subtle interpersonal signals. Human moderators handle these dynamics with fluidity and judgment. AI is not yet capable of managing these layers of interaction, even with access to voice and visual data.

In the near term, the most likely use of multimodal AI will be to support researchers rather than replace them. It may help identify emotional signals, flag potential moments of confusion, or enhance post-session analysis. But for now, live group moderation remains firmly in the hands of human professionals.

What Multimodal AI Misses

Even as the technology improves, multimodal AI remains limited in key areas. AI may recognize a frown or a pause, but it does not understand why the respondent is reacting that way. A human moderator can tell when someone is distressed, disengaged, guarded, or confused, and will adjust accordingly.

Humans bring context, empathy, and lived experience. We know when to pause, when to probe, and when to abandon a line of questioning. These are not capabilities that AI can replicate, even with visual input. Misreading facial or vocal signals can lead to incorrect interpretations. This is especially concerning when models are not well-calibrated across cultures, skin tones, or emotional baselines.

Multimodal AI may eventually reach a point where it adds value to live conversations, but it is unlikely to replace the nuance, social intelligence, and emotional intelligence of a trained moderator.

The Role of Human Oversight

As AI tools take on more of the research process, human oversight will become even more important. Someone will need to determine whether the AI is interpreting signals correctly, whether it missed something important, and whether bias or error crept in.

Roles that will become more important include:

  • AI Research Oversight Specialists who validate inputs and outputs, monitor for bias, and ensure ethical practices
  • Human Insight Interpreters who bring behavioral, cultural, and emotional nuance to the findings
  • Strategic Research Architects who design studies that use AI where appropriate and ensure research design aligns with strategic needs

These roles are not temporary stopgaps. They will become core to ensuring that AI is used well and responsibly.

Human Interviewing as a Future Luxury

Looking ahead, it is likely that live human moderation will become a premium service. As AI becomes more capable and cost-effective, brands may reserve human moderators for emotionally complex topics, high-stakes audiences, or mission-critical initiatives.

It is also likely that organizations will hire human researchers to audit and validate AI-generated interviews and insights. In the same way that companies audit financial models or legal advice, they may audit research systems to ensure integrity and trustworthiness.

Hybrid approaches, where humans and AI collaborate throughout the research process, are also likely. For example, an AI moderator may flag emotional shifts in real time, while a human reviews the session afterward to interpret intent and guide decisions.

Watch This Space: Multimodal AI Will Evolve, But Not Overnight

Multimodal AI offers real promise. It could help researchers capture more of the emotional and behavioral layers of communication in virtual settings. It could make interviews feel more natural. And it could open up new possibilities for understanding customers more deeply and at scale.

But we are not there yet. The tools are not fully ready. The models are not yet emotionally aware. And the risks of misinterpretation remain high.

Even when the technology matures, there will still be a need for human researchers. Insight depends on more than signals. It depends on interpretation, critical thinking, judgment, and empathy. These are not things AI is built to replace.

As researchers, we should stay curious and engaged. We should guide how these tools are used and ensure they serve people, not just processes. The best research in the future will be the result of human and machine working together, each doing what they do best.


Kirsty Nunez is the President and Chief Research Strategist at Q2 Insights, a research and innovation consulting firm with international reach and offices in San Diego. Q2 Insights specializes in a wide range of research methodologies and predictive analytics. The firm uses AI tools to enhance the speed and quality of insights delivery while relying on the expertise and judgment of human researchers. AI is applied exclusively to respondent data and is never used to generate findings, which remain grounded in human analysis and interpretation.