Predicting the Next State of Generative AI
What will be the I/O layer for generative AI? Will it forever be natural language chat? Or is chat just the degenerate case of a much broader tapestry of artificial linguistics?
Large language models (LLMs) try to predict the “optimal” next fragment of a word—called a token—given some cumulative history of a conversation. These models “generate” strings of text by recursively predicting one token after another in sequence. Each new token output by the model gets appended to the running conversation, thereby becoming part of the input to the next run of the model.
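To make that loop concrete, here is a toy sketch in Python. The “model” is just a bigram lookup table rather than a neural network, but the generation loop (predict, append, repeat) has the same shape as what an LLM does.

```python
# Toy illustration of autoregressive generation: a trivial "model" picks the
# most likely next token given the last token, and each prediction is appended
# to the running context before the next step. Real LLMs condition on the whole
# context with a neural network, but the loop structure is the same.

from collections import Counter, defaultdict

corpus = "the coffee pot turns on and the toaster turns on soon after".split()

# Build a bigram table: for each token, count which tokens tend to follow it.
follows = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    follows[prev][nxt] += 1

def predict_next(context):
    """Return the most frequent follower of the last token in the context."""
    candidates = follows.get(context[-1])
    return candidates.most_common(1)[0][0] if candidates else None

context = ["the", "coffee"]
for _ in range(5):
    token = predict_next(context)
    if token is None:
        break
    context.append(token)      # the output becomes part of the next input

print(" ".join(context))       # e.g. "the coffee pot turns on and the"
```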
ChatGPT wowed the world with what next-token prediction can accomplish, given a sufficiently large and well-trained language model. Tech companies are tripping over themselves to ride the coattails of these legitimately impressive innovations. This has led to quite a lot of specious—possibly GPT-generated—fuzzy thinking about just how applicable LLMs will be to other domains of human endeavor.
What fraction of the work done by humans can realistically be replaced by a chatbot? Quite a few industries will no doubt be disrupted. But the economy is big, and it produces much more than just words.
How might AI actually be usefully applied to more than just words and images? What’s the natural language, token-embedding equivalent for a global supply chain, or a nationwide cell network, or a network of interdependent financial institutions? I propose that there is a language, of sorts, that underlies each of these fields—but it’s not English—and LLMs trained on the collective writings of humanity can’t speak it. Thankfully, that’s okay, because there are efficient ways for AI to autonomously learn these languages.
Enough beating around the bush. Here’s my claim: the ideal “tokens” for a learnable machine language consist of the state vectors of the machine’s parameter space. That’s a mouthful. Let me break down what I mean, and how it generalizes the techniques of LLMs to supply chains, cell networks, finance, and more.
First, let’s acknowledge that machines already run the world. So applying AI to non-natural-language realms inherently implies working with machine data. Hence the focus on the language of machines.
Second, it’s important to emphasize that machines are characterized by complex state. A coffee pot can be in the on state or the off state. A kitchen stove has several states: each burner can independently be turned on or off. A particular arrangement of the knobs that control a stove’s burners can be represented by a list of states, called a state vector.
Third, the set of all possible states a particular system can occupy is referred to as its parameter space. The parameter space of a kitchen stove is all possible combinations of dial positions.
Back to the claim: the ideal “tokens” for a learnable machine language consist of the state vectors of the machine’s parameter space. A robot chef could communicate with a kitchen stove by exchanging state vectors (lists of which burners should be in what positions). This is what makes it a language: it’s a symbolic representation used for communication.
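To make this concrete, here is a rough sketch in Python. The names (Burner, StoveState, and the knob encoding) are mine, invented for illustration; they don’t correspond to any real appliance API.

```python
# Hypothetical sketch of a stove's state vector: one entry per burner knob.
# The names and encodings here are illustrative, not a real API.

from dataclasses import dataclass
from enum import Enum

class Burner(Enum):
    OFF = 0
    LOW = 1
    MEDIUM = 2
    HIGH = 3

@dataclass
class StoveState:
    burners: tuple[Burner, Burner, Burner, Burner]   # one entry per knob

    def as_vector(self) -> list[int]:
        """Flatten the knob positions into a plain numeric state vector."""
        return [b.value for b in self.burners]

# The stove's parameter space is every combination of knob positions: 4^4 = 256 states.
# A robot chef "speaks" to the stove by sending it one of these vectors.
desired = StoveState((Burner.HIGH, Burner.OFF, Burner.LOW, Burner.OFF))
print(desired.as_vector())    # [3, 0, 1, 0]
```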
Now that we have a systematic method for generating a language for any machine, we can begin to apply the techniques of large language models. But instead of next-token prediction, we’ll apply next-state prediction.
To spice things up, let’s extend the parameter space of our example to include not just the stove, but every gadget in the kitchen. Our state vectors now include entries for things like whether or not the dishwasher is running, or the fridge is open. The beauty of working with state vectors is that you can concatenate them. You can construct a big “language” by merging together many little languages. The inherent composability of next-state prediction is what sets it apart from directly applying LLMs to, say, textual log data. Log data does not add together in a way that LLMs can efficiently reason about in context. And context is what you really need.
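In code, that composition is nothing fancier than concatenation. A sketch, with made-up appliances and encodings:

```python
# Sketch: composing a kitchen-wide state vector by concatenating per-appliance
# vectors. Field names and encodings are made up for illustration.

stove      = [3, 0, 1, 0]      # four burner knobs: OFF=0 .. HIGH=3
dishwasher = [1]               # 1 = running, 0 = idle
fridge     = [0, 37]           # door-open flag, internal temperature (F)

# Concatenation builds the "big" language out of the little ones...
kitchen = stove + dishwasher + fridge
print(kitchen)                 # [3, 0, 1, 0, 1, 0, 37]

# ...and slicing recovers the little ones from the big one.
print(kitchen[0:4])            # the stove's state again: [3, 0, 1, 0]
```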
If you’re skeptical about the universality of state vectors, consider that the universe itself appears to be but a big vector of quantum states, with a Hamiltonian “language” governing its evolution through time (observations notwithstanding).
Just as an LLM learns the patterns of natural language by training on sequences of tokens, so too can we train models to predict the next most likely state of a kitchen, given a sliding window of recent kitchen states. Such a model might learn, for example, that when the coffee pot turns on, the toaster usually turns on soon after. Note that the more context gets included in the state vector, the more predictable the system becomes. If you just look at the toaster in isolation, its behavior appears random. But when considered in context, its behavior can be highly predictable.
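Here is a minimal sketch of what next-state prediction might look like, assuming a small PyTorch model and a synthetic history of kitchen states. The architecture and hyperparameters are placeholders for illustration, not a recommendation.

```python
# Minimal sketch of next-state prediction over a sliding window of recent
# kitchen states, using a tiny PyTorch MLP and synthetic data.

import torch
import torch.nn as nn

STATE_DIM, WINDOW = 7, 4            # 7-entry kitchen vector, last 4 samples as context

class NextStateModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(STATE_DIM * WINDOW, 64),
            nn.ReLU(),
            nn.Linear(64, STATE_DIM),     # predict the next full state vector
        )

    def forward(self, window):               # window: (batch, WINDOW, STATE_DIM)
        return self.net(window.flatten(1))   # -> (batch, STATE_DIM)

# Synthetic history: the coffee pot (index 0) turning on precedes the toaster (index 1).
history = torch.zeros(100, STATE_DIM)
history[20:30, 0] = 1.0
history[25:35, 1] = 1.0

# Slice the history into (sliding window, next state) training pairs.
windows = torch.stack([history[i:i + WINDOW] for i in range(len(history) - WINDOW)])
targets = history[WINDOW:]

model = NextStateModel()
optim = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(200):                          # tiny training loop
    loss = nn.functional.mse_loss(model(windows), targets)
    optim.zero_grad(); loss.backward(); optim.step()

print(model(windows[-1:]).detach())           # predicted next kitchen state
```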
Now consider the state vector of all product inventory of all the stores, warehouses, and trucks on Earth! Overwhelming, right? But the beauty of state vectors cuts both ways: you can subdivide them just as routinely as you can combine them. The incomprehensible state vector of all the routers that make up the internet is just a combination of many simpler systems, each comprehensible—and translatable—in its own right.
But what about training? Much of the appeal of LLMs stems from the fact that the $100m training bill can be amortized across all users of language. Who’s going to train a model on the state vectors of railroads, and how much will it cost? This is where machine data has a massive advantage: computers are exceptionally chatty.
It’s straightforward to construct state vectors from any data source. And the necessary training data are just successive samples of these state vectors. The real challenge is stitching enough state together—at a high enough sample rate—so that the correlations between states become apparent. This is best accomplished by working directly with streaming data, joining many streams together in real time.
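As a sketch of that stitching, here is one way to join a few event streams into periodically sampled state vectors. The source names, timestamps, and sample rate are invented for illustration.

```python
# Sketch: stitching several event streams into periodically sampled state
# vectors. Each source reports on its own schedule; we keep the last known
# value per source and snapshot the combined vector at a fixed sample rate.

# (timestamp_seconds, source, value)
events = [
    (0.2, "coffee_pot", 1),
    (0.9, "toaster", 1),
    (1.4, "dishwasher", 1),
    (2.1, "coffee_pot", 0),
]

sources = ["coffee_pot", "toaster", "dishwasher"]
latest = {s: 0 for s in sources}        # last known state of each source
samples = []                            # the training data: successive state vectors

sample_period, next_sample = 0.5, 0.0
for ts, source, value in sorted(events):
    while next_sample <= ts:            # emit snapshots up to this event's time
        samples.append([latest[s] for s in sources])
        next_sample += sample_period
    latest[source] = value              # then apply the event

print(samples)
# [[0, 0, 0], [1, 0, 0], [1, 1, 0], [1, 1, 1], [1, 1, 1]]
```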
Stream-to-stream joins are not the forte of traditional stream processors, which lack the data locality to splice streams together efficiently. A further challenge arises from the fact that streaming data is often differential in nature, and needs to be integrated into a coherent state before it’s meaningful to a model. A retail point-of-sale terminal, for example, publishes messages about goods sold (each a difference in inventory), which need to be integrated to recover the net inventory.
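A sketch of that integration step, with made-up SKUs and quantities:

```python
# Sketch: integrating differential point-of-sale messages into coherent
# inventory state. Each message reports a change in inventory; a model
# needs the running totals.

from collections import defaultdict

# (store, sku, quantity_delta): negative = sold, positive = restocked
deltas = [
    ("store_12", "espresso_beans", -3),
    ("store_12", "filters", -1),
    ("store_12", "espresso_beans", +24),   # restock
    ("store_12", "espresso_beans", -2),
]

# Starting inventory, then integrate each difference as it arrives.
inventory = defaultdict(int, {("store_12", "espresso_beans"): 10,
                              ("store_12", "filters"): 5})
snapshots = []                             # successive state vectors for training

skus = ["espresso_beans", "filters"]
for store, sku, delta in deltas:
    inventory[(store, sku)] += delta       # integrate the difference
    snapshots.append([inventory[("store_12", s)] for s in skus])

print(snapshots[-1])                       # net inventory state: [29, 4]
```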
A streaming application platform—like SwimOS + Nstream—is ideally suited to join many high-rate data streams together in real time, to transform streaming data into normalized real-time state, to continuously aggregate those states into large, coherent vectors, to sample those states into sliding windows of recent history, to access that monumental amount of state at memory latency, to feed it to autonomously trained models that generate the most likely sequence of future states, and to continually update those predictions in real time. In fact, we’re currently doing exactly this—in production at scale.
Much remains to be discovered about how to apply generative AI techniques to next-state prediction. There are some key distinctions from LLMs that make training state machine models much more feasible for domains that lack centuries-old public datasets. In particular, I believe that societies of millions of smaller models offer significant cost and latency advantages over monolithic models. But that’s a topic for another day.