Martin Geddes

WebRTC, Hypervoice & Valuable Metadata

The Web is still an infant technology. Adding voice may have some unexpected consequences.

The Web’s conception was based on an idea about data structures and their navigation; the “hyper” of “hypertext” was about following links between documents. However, its evolution has been defined by the human-computer interface elements that enable it. Without the mouse and pointer metaphor, hypertext was unusable by general users: how could you quickly select one link from the many on offer?

As we add richer forms of gesture and interaction – (multi)touch, device orientation, gesticulation, facial expression recognition, eye movement, and so on – the Web itself will mutate in response. Richer interaction capabilities allow us to interact with richer data structures – think of Apple’s Cover Flow in iTunes, or the pinch gesture in Google Maps. Richer data structures will in turn drive the need for richer modes of interaction. Round and round we go.

The current evolution of the Web’s user interface is centred on voice, which is of particular interest to the Hypervoice community. The hot technology is WebRTC. It allows browsers to access the microphone and camera inputs of a device, together with standard API elements to control the transfer of the resulting media streams. Thankfully, these capabilities are modular and not hard-wired to any particular signalling system or application paradigm.
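To make that concrete, here is a minimal sketch of the capture side, assuming a placeholder <video> element with the id “localPreview” in the page; the standard getUserMedia call is how a browser asks for microphone and camera access:

```typescript
// Minimal sketch: ask the browser for microphone and camera access,
// then preview the resulting media stream in a <video> element.
async function startLocalMedia(): Promise<MediaStream> {
  const stream = await navigator.mediaDevices.getUserMedia({
    audio: true,
    video: true,
  });

  // "localPreview" is an assumed <video autoplay> element in the host page.
  const preview = document.getElementById("localPreview") as HTMLVideoElement;
  preview.srcObject = stream;

  return stream;
}
```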

The present focus of WebRTC is purely on enabling two-way audio and video, using web pages as the front end. The ubiquity of browser endpoints, with their ability to be remotely updated to newer releases, means this capability is likely to be adopted at record speed. So what might the overlap with Hypervoice be?
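A rough sketch of offering those captured tracks to another browser might look like the following; sendToSignallingServer is a hypothetical placeholder, since WebRTC deliberately leaves the choice of signalling transport to the application:

```typescript
// Minimal sketch: attach a local stream to an RTCPeerConnection and create
// an offer. How the offer and ICE candidates reach the other browser is
// left open; WebRTC does not prescribe the signalling channel.
async function startCall(localStream: MediaStream): Promise<RTCPeerConnection> {
  const pc = new RTCPeerConnection();

  // Send our audio and video tracks to the remote peer.
  for (const track of localStream.getTracks()) {
    pc.addTrack(track, localStream);
  }

  // Play whatever the remote peer sends back, assuming a "remoteVideo" element.
  pc.ontrack = (event) => {
    const remote = document.getElementById("remoteVideo") as HTMLVideoElement;
    remote.srcObject = event.streams[0];
  };

  const offer = await pc.createOffer();
  await pc.setLocalDescription(offer);
  // sendToSignallingServer(offer); // placeholder for an application-specific transport

  return pc;
}
```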

Imagine a call centre agent servicing a customer. They are clicking on different records, interacting with trouble tickets, entering new data. All of these user interface elements can be given metadata, encoded using what are known as microformats. In other words, every gesture the user makes in the Web page can create an activity stream of meaningful, structured interaction data. This data has value: it can annotate any recorded audio along its timeline; the different gestures and data elements can be associated together in search indexes; and it can be mined for emerging trends and for ways to enhance productivity.
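Purely as a sketch of the idea, the illustrative data-object-type and data-object-id attributes below stand in for whatever microformat vocabulary a real application would choose; each click on an annotated element is appended to an activity stream keyed to the session’s timeline:

```typescript
// Minimal sketch: treat microformat-style attributes on page elements as
// metadata, and turn each click into a timestamped activity-stream event
// that can later be aligned with a call recording's timeline.
interface InteractionEvent {
  at: number;          // milliseconds since the session started
  action: string;      // e.g. "click"
  objectType?: string; // taken from the element's metadata
  objectId?: string;
}

const activityStream: InteractionEvent[] = [];
const sessionStart = Date.now();

document.addEventListener("click", (e) => {
  // data-object-type / data-object-id are illustrative annotations, standing in
  // for whichever microformat vocabulary the application actually uses.
  const el = (e.target as HTMLElement).closest<HTMLElement>("[data-object-type]");
  if (!el) return;

  activityStream.push({
    at: Date.now() - sessionStart,
    action: "click",
    objectType: el.dataset.objectType,
    objectId: el.dataset.objectId,
  });
});
```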

Astute readers will now spot the Hypervoice angle: when voice becomes a native of the Web, the browser itself takes on the role of creating the “history” of your interactions. In the early Web, your history was just a list of Web pages you opened. Going forward, it is a stream of valuable metadata that will include when you were talking, and time-based links to the business objects you were interacting with. It is the browser, not the back-end applications, that is best placed to remember what you touched and tapped.
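Such a browser-side “history” record might, speculatively, take a shape like the one below; it is purely illustrative, reusing the InteractionEvent entries from the earlier sketch and adding talk spurts and a link to any recording:

```typescript
// Illustrative sketch of a session history that puts talking and touching
// on a single timeline. The shape is an assumption, not a proposed standard.
interface TalkSpurt {
  start: number; // ms offset into the session
  end: number;   // ms offset into the session
}

interface SessionHistory {
  sessionId: string;
  startedAt: string;                // wall-clock start time, e.g. ISO 8601
  talkSpurts: TalkSpurt[];          // when each party was actually speaking
  interactions: InteractionEvent[]; // the activity stream from the earlier sketch
  recordingUrl?: string;            // time-aligned audio recording, if one exists
}
```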

It is therefore possible to see WebRTC evolve not just to capture and transmit voice and video, but also to include an event stream of interactions plus metadata that may enable whole new classes of application to emerge. The future of voice as a hypermedium has barely begun to unfold.