Recently I have been using the Siri-like voice command capabilities on my phone. I have an Android-based device and I have no idea what the feature is called, but it does the same thing. If I ask, “Where is the nearest restaurant?” it fires up Google Maps and shows me, without me having to type anything into Google Maps. I know other apps can be context aware, but I am curious how an aural interface would work if you didn’t need to pull up a screen reader; it would just be always on.
When you have a full desktop browser and a mouse this still wouldn’t serve much purpose, but on a device without a mouse you could just say “follow, link title” and the browser would find the link with that title and go there. Another example: while reading text on a page you could say “scroll down one page” and the screen would scroll automatically for you. Dragon NaturallySpeaking apparently accommodates much of this already.
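The Web Speech API’s SpeechRecognition interface (still prefixed as webkitSpeechRecognition in Chrome) looks like the obvious place to experiment. Here is a rough, untested sketch of how those two commands might be wired up; the command phrases and the exact matching logic are just my guesses at how it could work:

```js
var SpeechRecognition = window.SpeechRecognition || window.webkitSpeechRecognition;

if (SpeechRecognition) {
  var recognition = new SpeechRecognition();
  recognition.continuous = true;       // keep listening, i.e. "always on"
  recognition.interimResults = false;  // only act on final transcripts

  recognition.onresult = function (event) {
    var transcript = event.results[event.results.length - 1][0].transcript
      .trim().toLowerCase();

    if (transcript.indexOf('follow ') === 0) {
      // "follow, link title": find a link whose text matches and activate it
      var title = transcript.slice('follow '.length).replace(/^,?\s*/, '');
      var links = document.querySelectorAll('a');
      for (var i = 0; i < links.length; i++) {
        if (links[i].textContent.trim().toLowerCase() === title) {
          links[i].click();
          break;
        }
      }
    } else if (transcript === 'scroll down one page') {
      // "scroll down one page": scroll by one viewport height
      window.scrollBy(0, window.innerHeight);
    }
  };

  recognition.start();
}
```

Exact text matching on the link title is obviously brittle, but it shows the general shape of what an always-on aural interface could do with nothing more than what browsers already ship.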
CSS 3 already has a Speech module (currently a W3C Candidate Recommendation). That spec is really geared toward providing an enhanced experience for screen reader users, but I suspect that if you improve your markup for screen readers it opens a doorway for use without screen readers as well. This is the classic progressive enhancement win that has been exemplified with each iteration of HTML, CSS, and JS.
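For anyone who hasn’t looked at it, the module defines properties for controlling how a page is rendered as synthesized speech. A minimal sketch of the kind of rules it allows (as far as I know no mainstream browser actually renders these yet, and the cue file below is made up, so treat this as illustrative only):

```css
/* Give headings a distinct voice and a pause before they are spoken */
h1, h2 {
  voice-family: female;
  voice-rate: slow;
  pause-before: strong;
}

/* Play a short audio cue before links so they are announced distinctly */
a {
  cue-before: url(link-ping.wav); /* hypothetical sound file */
}

/* Keep purely decorative elements out of the aural rendering entirely */
.decorative {
  speak: never;
}
```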
Pairing this with future enhancements in eye tracking technology, one day we might be able to look at a page and then dictate what we want to do with it. I think that would be very liberating as a user who is on a computer for 8+ hours a day. Beyond the CSS 3 spec, pages are now often built with ARIA landmark regions, which make it very straightforward to identify the role and function of each part of the page.
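Those landmark roles could double as voice navigation targets. A quick hypothetical sketch, assuming the page marks up its regions with the standard ARIA landmark roles (banner, navigation, main, complementary, contentinfo); the goToLandmark helper and the “go to …” phrasing are my own invention:

```js
// Jump focus to an ARIA landmark region, e.g. in response to "go to navigation".
function goToLandmark(role) {
  var region = document.querySelector('[role="' + role + '"]');
  if (!region) {
    return;
  }
  region.setAttribute('tabindex', '-1'); // make the region programmatically focusable
  region.focus();
  region.scrollIntoView();
}

// Hooked into the recognition handler above, "go to <landmark>" might look like:
// if (transcript.indexOf('go to ') === 0) {
//   goToLandmark(transcript.slice('go to '.length));
// }
```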
If anyone is aware of a JS library, technology, or stack that accommodates any of the above, please let me know. It would be lots of fun to experiment further with this idea.