The race for human-AI interaction usage data is on—and the stakes are high

April 25, 2024

With the announcement of its new Llama 3 model, Meta is getting ready to put artificial intelligence across WhatsApp, Instagram, and Facebook. It’s a move that could eventually put the model in the hands of more than 3 billion daily users. This unprecedented flow of human-AI interaction data could help Meta pull ahead in the AI race.

For the past year, OpenAI’s GPT-4 has been the dominant large language model. The LLM’s popularity dwarfs that of rival offerings from Google, Meta, and Mistral, but the gap in model quality is getting smaller and smaller. After Cohere’s Command R+, Llama 3 is the second openly available model this month to surpass the original GPT-4 in the popular LMSYS Chatbot Arena. In the past couple of days, we saw Microsoft launch its Phi-3 models, and even Snowflake entered the race with an LLM of its own. The language model space is getting increasingly crowded, and its players struggle to set themselves apart.

The intense competition makes for a fast-moving space. An incredible technical breakthrough one day can become table stakes in a matter of months, as competitors quickly adapt their approach. Innovation in architectures and algorithms has not led to a defensible competitive advantage. Neither has budget or scale, even as big tech and VCs pour in billions of dollars in funding. The one way a lasting advantage might be found is via more and better data.

A positive feedback loop for AI

The research accompanying recent LLM releases has shown that even models trained on trillions of words of training text remain undertrained. Increasing the data volume and quality is a proven way to drive performance. With the leading players already consuming most suitable public training data, the next place to look will be LLM usage itself. The data that stems from how we query and guide language models through conversations is more than just a volume augmentation: it can help reinforce what models already do well, and target and improve where they struggle. OpenAI and Google, which both offer their models directly to end users through a chat interface, have already started building that dataset.

When usage data improves quality, and quality drives more usage, the positive feedback loop can turn a space into a near-monopoly. We’ve seen this story before with search engines. While Google first pulled ahead thanks to an innovative ranking algorithm using public data, it has managed to maintain and grow that lead by learning from the massive volume of usage data originating from its users. This has left other players struggling to capture even a small fraction of the market. Even Microsoft CEO Satya Nadella testified last fall that Bing is worse than Google, and that it’s all tied to data.

We have yet to see a company capitalize on the AI usage feedback loop effectively. While improvements to GPT-4 since its initial release likely stem in part from usage data, the changes feel iterative rather than transformational. With over 100 million monthly active users, however, OpenAI has built an unprecedented conversational dataset. With the highly anticipated GPT-5 rumored to be released this summer, we may finally see that data come into play. From there on, we could watch OpenAI widen the gap and continue asserting its dominance.

Even if quality gains from usage data were modest, your past interactions would still enable a model to tailor itself to you better than any rival could. According to Sam Altman, that’s where the real long-term differentiation for AI models lies: knowing you best.

Mainstream AI will be delivered via messaging apps and social networks

The data feedback loop and OpenAI’s upcoming release make Meta’s timing particularly interesting. Much more than just claiming the time and attention of billions of people, Meta owns the default place most people go to chat. Bringing its Llama-based AI capabilities straight into that inbox could potentially grow adoption faster than the eye-watering rate of ChatGPT.

In the remainder of the year, we will likely see AI providers that can capitalize on user data pull ahead of everyone else. Players who own the virtual spaces where people interact with their models will be in much better shape than those who just provide the models but can’t collect the data. That’s great news for OpenAI, Microsoft, Google, and Meta, but could leave contenders outside Big Tech, like Mistral and Cohere, out in the cold.

Next year’s AI space will likely have a very different competitive dynamic than what we see today. While overwhelming funding and competitive open-source models make the most powerful AI capabilities surprisingly cheap today, that price point is unlikely to be sustained once only the big players can deliver the strongest AI experience. Meta’s current habit of open-sourcing its models makes it the friendliest option today, but it may reconsider when other open-source players fall behind. Today’s top-notch AI is probably the most open and accessible we’ll ever see—we’ll do well to enjoy it while it lasts.

Jeroen Van Hautte is the cofounder & CPO/CTO at TechWolf, a VC-backed AI/HR-tech startup based in Ghent, Belgium. He has authored several research papers focusing on the application of artificial intelligence to HR data.

The opinions expressed in Fortune.com commentary pieces are solely the views of their authors and do not necessarily reflect the opinions and beliefs of Fortune.
