A little while ago I read through the engineering publications from Bungie. Some of the AI papers discuss one of the worst problems in game AI: recognizing the player’s intention. The next time I played on Xbox Live, I picked up my controller, put on my headset, and had an idea: why not use voice to help AI figure out what you’re thinking?

For example, if you say “fall back,” the AI could recognize the phrase and do it. If you’re the squad member with thermal vision and you spot enemies around a corner, you might say “enemies around that corner.” Even though you didn’t tell the AI which corner you meant, it could either guess, or – since it’s controlled by the game system – use information the game already has to figure out which corner you mean. It’s cheating, but it improves the user experience, so it works.

So why isn’t this in games already? Aside from localization issues, there really aren’t many barriers. Speech recognition works pretty well – it’s a bit slow and not always accurate – but that’s good enough to assist an AI in a game (not to entirely drive it – I’m talking about a layer on top of the normal AI you’d find in a game). The only other issue is that not everyone has a headset. But if voice is purely an assist to the AI, players without one still get a quality experience. We should at least be seeing this sort of thing in top-tier games.

In fact, to demonstrate how easy it is to get some very rudimentary speech-driven AI working, I decided to code a little test tonight. I set aside some time and went to work. Fifteen minutes later I was done. It turns out .NET 3.0 has speech recognition built in, under the System.Speech.Recognition namespace. The result is this:

Download it here
(requires .NET 3.0 and XNA 2.0)
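If you just want the gist without downloading, the core of a demo like this is roughly the following. This is a sketch rather than the actual demo source – the class and handler names are mine – but the System.Speech.Recognition types are real:

```csharp
using System;
using System.Speech.Recognition; // ships with .NET 3.0 on Windows

class VoiceCommandDemo
{
    static void Main()
    {
        // Restrict the recognizer to a tiny command grammar -- a closed
        // vocabulary is far more reliable (and faster) than free dictation.
        Choices commands = new Choices("top", "bottom", "left", "right", "center");
        Grammar grammar = new Grammar(new GrammarBuilder(commands));

        SpeechRecognitionEngine recognizer = new SpeechRecognitionEngine();
        recognizer.LoadGrammar(grammar);
        recognizer.SetInputToDefaultAudioDevice(); // i.e. your headset mic

        recognizer.SpeechRecognized +=
            new EventHandler<SpeechRecognizedEventArgs>(OnSpeechRecognized);

        // Keep listening for commands until the user presses Enter.
        recognizer.RecognizeAsync(RecognizeMode.Multiple);
        Console.ReadLine();
    }

    static void OnSpeechRecognized(object sender, SpeechRecognizedEventArgs e)
    {
        // In the XNA demo this is where you'd set the ball's target
        // position; here it just prints the recognized command.
        Console.WriteLine("Heard: " + e.Result.Text);
    }
}
```

That’s essentially all there is to it – define a grammar, hook an event, and react to the recognized text, which is why the whole thing took fifteen minutes.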

You say some variation of “top,” “bottom,” “left,” “right,” or “center” to move the ball to that position on the screen. It’d be trivial to translate that into a 3D space based on a first-person perspective (though “front” would be a useful addition). It could also be made context-sensitive – e.g. “on the left” said after “the door” could refer to the left side of the hall rather than the player’s current “left.”
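That context-sensitive resolution could be sketched very simply: remember the last landmark the player mentioned, and anchor directional words to it when one exists. Everything here (the `Landmark` struct, the field names) is hypothetical:

```csharp
// Hypothetical sketch: resolving "left" against conversational context.
struct Landmark
{
    public string Name;
    public float X; // landmark position along the relevant axis
}

class ContextResolver
{
    private Landmark lastMentioned;
    private bool hasLandmark;

    // Called when the recognizer hears the player name an object,
    // e.g. "the door".
    public void Mention(Landmark landmark)
    {
        lastMentioned = landmark;
        hasLandmark = true;
    }

    // "Left" means the left side of the last-mentioned landmark if
    // there is one; otherwise it falls back to the player's own left.
    public float ResolveLeft(float playerX, float offset)
    {
        float anchor = hasLandmark ? lastMentioned.X : playerX;
        return anchor - offset;
    }
}
```

The AI would clear or age out the remembered landmark after a few seconds so stale context doesn’t misdirect later commands.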

So what are your thoughts? Why aren’t we seeing this in games already? Are there games that do something like this? I know there are games that use sound recognition, but why wouldn’t the more popular games use voice recognition? I looked for patents, and there are a couple, but they seem very specific and not really applicable to the first-person genre. Other than consuming system resources (which should become less of an issue as hardware gets more powerful), what reasons are there for this not being in games already?