Multimodality Starts Walking and Talking


By Ellen Muraskin


In which we witness mobile-phone and Pocket PC demos, sort out some of the markup confusion, note trial deployments, and highlight the platform maker furthest along in supporting applications that work visually and aurally.


We can’t put it off any longer. The time has come to talk about multimodality. The volume of buzz at speech, telecom, and wireless trade shows has gotten too loud to ignore. Microsoft and IBM have stomped in with all four feet, the former buying network prime time to show the mass market scenarios of Pocket PCs and PDAs you can talk to, talk through, browse on, tap on, photograph, bar-code scan and sing along with – all retrieving and affecting data on the other side of a wireless network. In a string of rollouts of games and messaging apps, we see the wireless operators hanging their ARPU hopes on data.


While these picture-sending, wireless web apps aren’t what we mean by “multimodality,” they help prepare the public for it. Anything that puts across the notion of PDA/handset as digital Swiss Army Knife is a step in that direction. Equally important, they publicize the spread of a 2.5G wireless data stream that does support multimodal applications.


“Multimodality,” of course, adds voice to data, and data to voice applications in the wireless (or the wired) network. The term describes an automated interaction that combines visual and auditory modes, perhaps visually prompting for a user’s utterance and then displaying the recognized word, or a requested map or timetable, on a cell phone’s mini-browser or on a PDA or Pocket PC browser.


The faster our wireless data stream gets, the more comfortably we can make this multimodal stimulus-response work. Multimodality’s Golden Age might arrive at 3G, but in the meantime, speak-in, see-out applications have plenty of punch on 2.5G GPRS/GSM networks, where they are being demoed to good effect in the US over VoiceStream and other carriers and, to date, tested on actual everyday users by wireless carriers in France.


On the enterprise (and carrier) side, multimodal apps look and sound great on Pocket PCs or hybrid phone/PDAs operating over 802.11b wireless LANs. Still qualifying as multimodal – and even simultaneous – are applications on 2G networks that combine SMS concurrently with voice. Think here of directory assistance queries in Europe that send back an SMS whose number can be stored in the handset’s speed dial; carriers have found users willing to pay extra for this. I would even include the app from Pocket This (cross ref innovations Nov.) in this category. That’s the one that lets you dial a published number and get back links and details in SMS for later use.


The best way I have found to explore the current state of multimodality is to focus on one company – Kirusa (Berkeley Heights, NJ -- 908-464-4467) – that has put all its eggs in this basket. From what I’ve seen, Kirusa has the longest history in this space, going back to 2000, after VoiceXML and WML appeared and perhaps at the same time we first started hearing about SALT, then promoted by Microsoft as a multimodal markup. Kirusa is also the furthest along in demoed apps and real-world trials, with Orange and Bouygues (“Bweeg”) Telecom in France. We can also use Kirusa, as the company itself does, as a context in which to briefly clear up confusions about multimodality and markup languages.


Kirusa Platform

Kirusa sells a software server platform, the KMMP (see Figure 1[1]), that reads markup language from a web application server such as Tomcat or iPlanet’s on the front end and manages the interaction between the voice and visual sides of the application. It couples into a voice platform on the back end; demos of such teaming have ranged from the ASP VoiceXML interpretation of Voxeo (Orlando, FL -- 407-835-9300) to the SoftServer media server-cum-application broker of SandCherry (Boulder, CO -- 720-562-4500) and more stripped-down voice platforms. Kirusa has also integrated with Telera, VoiceGenie, SpeechWorks, Nuance, and Telisma, and claims that the voice integration task is no real barrier.


Sequential Multimodality

Kirusa began with its own M3L markup, combining WML and VoiceXML “and two or three tags of our own creation that describe how and when to move from voice to visual or visual to voice,” according to Mike Sajor, Kirusa VP, Business Development. With this you can create a sequential multimodal experience on a standard WAP device. It is sequential because WAP and voice cannot travel together over the same call.


Noah Breslow, Manager of Business Development and Investor Relations, demonstrates an English version of France Telecom/Orange’s sequential multimodal trial application. See sidebar: “Sequential Multimodal Demo: WML + VoiceXML.” The trial is under way in Paris with the KMMP and the VoiceXML interpreter of Telisma, France Telecom’s ASR spinoff.



Sidebar 1: Sequential multimodal demo: WML + VoiceXML

In New Jersey, Breslow is using his Motorola P280 WAP phone and VoiceStream/T-Mobile’s GPRS service. His WAP inbox shows three email messages and a voice mail. By pushing buttons in a standard WAP interaction, he selects message number three, an email message with a voicemail attachment. Since that must be presented aurally, a Kirusa link for changing modes is displayed in WAP.

A click on the link triggers the KMMP to refresh the screen with a “fetching audio content” message and to make a voice call to the VoiceXML partner platform, being fed VoiceXML from the Kirusa platform. Noah hears the voice message, then says “next,” to navigate to the next one. When he says “show message,” the KMMP ends the voice call and goes back to WAP mode, because the next message is a graphical one (Kirusa’s logo). KMMP keeps track of what “next message” means, irrespective of modes. Another WAP click to reply to message three in voice launches another call to the voice server, which records the reply. We are then returned to WAP.


The VoiceXML part of this application is being interpreted, in this instance, by Voxeo’s platform and Nuance speech-recognition engine in Florida, receiving KMMP’s VoiceXML through HTTP transfer. KMMP is coordinating all voice activity with the screen refreshes in WAP, keeping track of the choices made in emails and navigation. It’s also working through included KMMP interfaces to Openwave or other WML servers.
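
The bookkeeping this sidebar describes, one inbox position shared by two modes, can be sketched in a few lines of Python. This is an illustration only, not Kirusa’s code: the class names, the content types, and the rule mapping each content type to a mode are all assumptions made for the example, and the point where a real platform would place or end a voice call is reduced to a log entry.

```python
from dataclasses import dataclass
from enum import Enum


class Mode(Enum):
    VISUAL = "wap"
    VOICE = "voice"


@dataclass
class Message:
    ident: int
    kind: str  # "text", "audio", or "image" (hypothetical content types)


# Hypothetical rule: the mode each kind of content must be presented in.
REQUIRED_MODE = {"text": Mode.VISUAL, "image": Mode.VISUAL, "audio": Mode.VOICE}


class SequentialSession:
    """Tracks one user's inbox position independently of the active mode."""

    def __init__(self, inbox):
        self.inbox = inbox
        self.index = 0
        self.mode = Mode.VISUAL
        self.log = []  # (from_mode, to_mode) for every switch, in order

    def _present(self):
        message = self.inbox[self.index]
        needed = REQUIRED_MODE[message.kind]
        if needed is not self.mode:
            # A real platform would start or end a voice call here;
            # this sketch only records that the switch happened.
            self.log.append((self.mode.value, needed.value))
            self.mode = needed
        return message

    def select(self, index):
        """Jump to a message chosen on screen (zero-based index)."""
        self.index = index
        return self._present()

    def next(self):
        """Advance one message; same meaning whether tapped or spoken."""
        self.index = min(self.index + 1, len(self.inbox) - 1)
        return self._present()


if __name__ == "__main__":
    inbox = [Message(1, "text"), Message(2, "text"),
             Message(3, "audio"), Message(4, "image")]
    session = SequentialSession(inbox)
    session.select(2)  # message three is audio: WAP to voice
    session.next()     # the next message is an image: voice back to WAP
    print(session.log)
```

Because the position lives in the session rather than in either browser, “next” resolves to the same message no matter which mode the request arrived in.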

End sidebar 1


Phase II of the France Telecom/Orange trial, with apps and environment being specified at presstime, is a simultaneous, VoiceXML-plus-HTML multimodal app for the enterprise, centering on field force automation and CRM. A third, parallel phase is another simultaneous application that will debut with a new consumer device being launched in time for the Christmas shopping season.

Simultaneous MultiModality

Glitzier demos are performed on Compaq iPAQ 3870 Pocket PCs over Wi-Fi networks, simulating the bandwidth of 2.5G carrier networks or a true intra-campus enterprise app, say, in a warehouse, or a consumer app that might work within the range of an Internet hot spot. At this point, all voice and data travel over the same data link. Packetization and compression are part of a Kirusa client in the hand-held. Also at this point, the visual part of the markup is plain HTML. (See Sidebar, “Simultaneous MultiModality: VoiceXML + HTML/‘Pepper.’”)


The hand-held is also performing the initial speech sampling and feature extraction, using a push-to-talk button. This distributed method of speech recognition (DSR) protects the utterance from packet loss and gets the bandwidth requirement down as low as 4.8 kbps, according to Sandeep Sibal, Kirusa’s CTO. “We’re using DSR with great results,” adds Sajor. “This fits in well with a one-up, three-down GPRS, which is the most stringent bandwidth limitation we have to meet [for simultaneous multimodality].”
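
A quick back-of-the-envelope check shows why that figure matters. The sketch below assumes the quoted DSR rate is in kilobits per second, takes 64 kbps as the rate of uncompressed telephone-grade audio, and uses a nominal 13.4 kbps per GPRS timeslot (the CS-2 coding scheme). The per-slot figure is an assumption for illustration; real throughput varies with coding scheme and radio conditions.

```python
DSR_KBPS = 4.8         # feature stream as quoted (unit assumed: kbit/s)
PCM_KBPS = 64.0        # 8 kHz x 8-bit telephone audio, uncompressed
GPRS_SLOT_KBPS = 13.4  # nominal CS-2 rate per timeslot (assumption)
UPLINK_SLOTS = 1       # the "one-up" half of one-up, three-down


def uplink_kbps(slots=UPLINK_SLOTS, per_slot=GPRS_SLOT_KBPS):
    """Nominal uplink capacity for a given number of timeslots."""
    return slots * per_slot


def share_of_uplink(stream_kbps):
    """Fraction of the nominal uplink that a stream would occupy."""
    return stream_kbps / uplink_kbps()


if __name__ == "__main__":
    print(f"DSR features: {share_of_uplink(DSR_KBPS):.0%} of the uplink")
    print(f"Raw PCM audio: {share_of_uplink(PCM_KBPS):.1f}x the uplink")
```

Under these assumptions the feature stream fits in a single uplink slot with room to spare, while raw audio would need several times the available capacity.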


“GPRS in the US is fairly wide,” adds Sajor, citing Cingular, T-Mobile/VoiceStream, and AT&T Wireless, “or CDMA 1xRTT by Sprint PCS and Verizon is equally suitable.”


Sidebar 2: Simultaneous MultiModality: VoiceXML + HTML/”Pepper.”

The 2.5G simultaneous application demonstrates a much more fluid exchange than WAP and VoiceXML can manage, flipping from voice to visual, from uttered field to its appearance in a box, from played prompt to stylus-entered field, or from uttered request to answering display, with the same broadband alacrity we’ve come to expect from purely visual applications. The one Breslow shows me is a banking app, displayed at Intel’s Orlando developer conference. It uses a natural language interface, accepts queries and fund transfers between accounts, and verifies all actions visually, complete with Disney icons.


What this application hears (transfer $5000 from savings to checking) is what you see filled in on the web site boxes. The theory here? That printable, visual confirmation of recognized amounts and accounts provides just the reassurance a customer needs to trust mobile transactions.

End sidebar 2


Sajor says that after the simultaneous multimodal banking demo, a German mobile operator handed over his own hybrid GPRS phone/Pocket PC and challenged Sajor to replicate the demo on that device. In the course of 15 minutes, Sajor says, he loaded the phone with the Kirusa client from the company’s NJ server, established a multimodal session over that operator’s network, and ran it between Germany and NJ. The delay from speaking the amount to seeing the amount filled in on the handset’s browser field was about 1.5 seconds. Did they sign up on the spot? No, but they’re getting close to trials. Trials of HTML + VoiceXML simultaneous multimodality on Kirusa’s platform are also in progress with Bouygues Telecom, the number three wireless provider in France, as well as with Orange.


SALT, Pepper, X+V

Kirusa’s markup for the simultaneous multimodal banking applications is its own variant of VoiceXML-plus-HTML, wryly named “Pepper.” But the company would hurry to add – and has demoed as well, see below – that its KMMP can administer multimodal applications through SALT markup too.


SALT, for Speech Application Language Tags, was initially promoted by Microsoft and then by the SALT Forum, whose founding members were the Redmond folks plus Comverse, Cisco, Intel, Philips, and SpeechWorks. SALT had the multimodal stage to itself for a while. This was simply because VoiceXML made (and still makes) no claims to anything but voice, because SALT had marketing muscle behind it, and because companies such as Kirusa were still in stealth mode. Leaving aside for the moment the fact that SALT V. 1.0 itself was only received for comment by the W3C last August, it has since had to share developer mindshare with IBM’s, Motorola’s, and Opera’s joint announcement of X + V, or XHTML + VoiceXML. So at this point, we’re up to Pepper, SALT, and X+V.


Tools Race

SALT is somewhat ahead on the tools front; Microsoft’s .NET Speech SDK for SALT has been available in beta for some months, operating within the .NET Visual Studio IDE; Philips has announced a SALT browser for the voice mode and SpeechWorks is rewriting its Dialog Modules for SALT, bundling its ASR engine into Microsoft’s .NET  SDK. SpeechWorks is also talking up its compact, embedded ASR should the SR task be offloaded to hand-helds.


Microsoft also is preparing a “.NET Speech Application Platform”[2] (See Figure 2), to run SALT applications and hopefully grow speech developers from its worldwide pool of Visual Studio .NET programming talent.  This platform will include an “enterprise grade” ASR engine from Microsoft, a W3C-compliant SALT browser-interpreter, telephony platform software interfaces from Intervoice and Intel, and hardware supplied by Intel for call processing and telephony network interface. As of presstime, a limited developer release is planned for late October, geared to promote enterprise applications in vertical markets. The platform also depends, on the receiver side, on new speech objects for speech-enabling Microsoft’s  IE web browsers.


A bit behind, IBM/Opera/Motorola’s X+V SDK is due for announcement at October’s SpeechTEK show. They say they have some pilots going in the U.S., but can’t specify carriers. As with their VoiceXML interpreter and speech server, IBM will promote an end-to-end, speech engine-to-web server range of offerings.


Neither Kirusa nor the speech technology vendors really care whether Pepper or IBM’s X + V learns to find happiness alongside SALT or if one vanquishes the other. Kirusa’s platform will run SALT generated on Microsoft’s SDK or, with minor modifications, X+V from IBM. (Sajor says that Kirusa’s “Pepper” markup is but a few grains (tags) different from X+V, and could easily be changed to match.) Kirusa positions itself at the end of all roads to multimodality. Religious wars – in the programming sense of the term – will only draw the media attention these companies need and can’t afford to buy in 2002.


Look just a little ways beyond Kirusa’s accommodating flexibility, however, and the nagging question arises: If SALT’s program control is determined by the host program, and if SALT is inherently multimodal, why do we need a separate “multimodality” platform?


Here’s a two-part answer. First, Kirusa’s Sajor concedes that in a “perfect world,” where all end points can be outfitted with SALT plug-ins to their browsers, SALT could shake out the whole marketplace. But the world is not perfect; not all end points can fit IE browsers, and these will need different mediating platforms to synchronize the visual and aural parts of applications (see more on this below). Kirusa is also buttressing its offer with carrier-grade provisioning, billing, and customer care elements that leverage its management’s long backgrounds at Lucent, AT&T, Airtouch, and other places.


Secondly, Microsoft seems, at least for now, to be gunning for the enterprise and for its own Pocket PC/Pocket IE marketplace. Although it pledges that its browser will be W3C SALT compliant, and that means conforming to other XMLs as well, like WML, Microsoft is demoing through SALTed HTML browsers for now.


Microsoft’s sights may widen, however. Listen for developments on Intel’s newly announced Manitoba chip for cellular phones. In the words of Uri Barkai, R&D manager of Intel's Cellular Communications Division, responsible for developing the chip, quoted on “We have an innovative architecture called PCA (Personal Internet Client Architecture), which we’re marketing with heavyweights like Microsoft, Palm (Nasdaq:PALMD) and Symbian. The objective is to create a new standard for cellular devices and hand-held devices that will improve their combined capabilities.”


Markup Religious Wars

The SALT-vs.-X+V comparison is best laid out by James Larson, Intel’s Manager, Advanced Human I/O and can be viewed at: Larson also lays out excellent cases for when multimodality beats voice or data alone in a slide presentation he promises to post at


“XHTML” is an XML-ified version of HTML, but for now let’s lump Kirusa’s “Pepper,” i.e., Voice-plus HTML, together with X+V. In very broad strokes, Pepper looks like recognizable VoiceXML code followed in the same page by HTML code, with linking tags embedded in both. A lot of its visible “cleanliness” owes itself to the fact that a VoiceXML browser and its forms interpretation algorithm take care of most of the control and coordination activities, leaving the programmer to specify the prompts, grammars, and event handlers using a declarative style of programming.


A page of SALT is not SALT per se, but the governing, event-triggered UI language – HTML, XHTML, or XML, say – with SALT tags embedded in it. It may be the choice of those who are used to working in a more classic web model. It is a lower-level and more flexible programming model, and keeps with the object-oriented nature of the document object model. SALT also has a sophisticated call control object that is used to create telephone call legs for both telephony and multimedia applications.


In Larson’s words, “different versions of SALT browsers execute on a server or on a client such as a hand-held computer or advanced cell phone. SALT Forum members are expected to announce the availability of SALT browsers, development tools, and other products after the SALT 1.0 specification is submitted to a standards organization and made public this summer.” Microsoft plans to SALT-enable its Internet Explorer and Pocket Internet Explorer browsers, thereby making a large base of Pocket PCs multimodal-capable.


SALT Simultaneous MM Demo

The endpoint, again, is a Compaq iPAQ 3870 Pocket PC, loaded with a Kirusa MultiModal browser and a distributed speech rec engine that extracts the speech features needed by the recognition engine. Users have tight control of ASR because they press the push-to-talk button to speak, a habit now growing on users by virtue of commonly sold cell phones with push-to-talk voice-activated dialing. The link to Kirusa’s multimodal platform is an 802.11b wireless LAN, but could just as well be GPRS/GSM or CDMA 1xRTT.

Kirusa’s MM platform is 1) fetching the HTML + SALT pages from a standard web application server, 2) translating those SALT requests and responses into Java so they fit on this particular end point, and 3) requesting speech resources, recorded audio, and grammars from a SandCherry SoftServer, a media server/applications broker. Both of the SALT demonstrations I saw were developed jointly by Kirusa and SandCherry. See sidebar: “Simultaneous Multimodality: SALT.”


Sidebar 3: Simultaneous Multimodality: SALT

The application is for World Cup soccer fans. It opens on the iPAQ’s browser to an HTML page of links to teams, and the user can speak or use the stylus to drop down a menu of country names. Say “Ireland,” and the browser shows the flag of the Republic of Ireland. Say “submit” and get back a menu of team roster, game highlights, or video. Back on the application home page, I say “Player: Ronaldo.” The application is not sure whether I want Ronaldo or Rivaldo; it sends me back both written names, letting me choose the correct one by stylus. I say “video” and a streaming video presentation appears on the screen with Ronaldo’s winning goal. Another Flash video is a World Cup quiz, which buzzes me if I tap in the wrong answer and applauds if I get it right.
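
The Ronaldo/Rivaldo moment is the classic multimodal fallback: when the recognizer’s top guesses are too close to call, show them and let the stylus decide. A minimal sketch of that decision follows. The confidence scores and both thresholds are invented for illustration, and nothing here is taken from the Kirusa or SandCherry implementation; the function also assumes at least one hypothesis is supplied.

```python
def resolve(n_best, accept=0.80, margin=0.15):
    """Decide whether to accept the top hypothesis or ask the user.

    n_best: non-empty list of (text, confidence) pairs, confidence in [0, 1].
    accept, margin: illustrative thresholds, not tuned values.
    Returns ("accept", text) or ("choose", [texts to display]).
    """
    ranked = sorted(n_best, key=lambda pair: pair[1], reverse=True)
    best_text, best_score = ranked[0]
    runner_up = ranked[1][1] if len(ranked) > 1 else 0.0
    if best_score >= accept and best_score - runner_up >= margin:
        return ("accept", best_text)
    # Too close to call: display every hypothesis near the top score.
    close = [text for text, score in ranked if best_score - score < margin]
    return ("choose", close)


if __name__ == "__main__":
    print(resolve([("Ireland", 0.93), ("Iceland", 0.41)]))
    print(resolve([("Ronaldo", 0.62), ("Rivaldo", 0.58), ("Roberto", 0.20)]))
```

A voice-only application would have to re-prompt at this point; with a screen available, a short pick list turns a recognition miss into a single tap.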

Another demo – one the industry really pins its hopes on, because it can sell KMMPs and voice servers to enterprises – is hypothetically carried by a Coca-Cola delivery truck driver. The driver sees a table of the stops on his route and says, “Sort by distance.” The table sorts itself. He says, “Sort by urgency,” then “Show map for stop number two,” and sees the map to that stop displayed on the screen. He says, “Zoom in.” Best of all, the driver says, “Show stock for number one,” and the application retrieves a chart for stop number one with each Coke brand’s inventory on hand, total required, and computed delivery amount. Says Sajor: “The driver can then carry out inventory activities by voice, by saying things like ‘Diet Coke, inventory 23 delivery 12’ – and see the results right on screen. If it’s not right, just say ‘undo’ and try again.”
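
To make the “say it, see it, undo it” loop concrete, here is a small sketch of how a recognized inventory phrase might be turned into an on-screen row, with a history stack behind “undo.” The phrase pattern is modeled on the example Sajor gives; the class, its fields, and the regular expression are hypothetical and stand in for whatever grammar the real application uses.

```python
import re

# Pattern modeled on the quoted phrase: "<product>, inventory N delivery M".
COMMAND = re.compile(
    r"^(?P<product>.+?),?\s+inventory\s+(?P<inventory>\d+)"
    r"\s+delivery\s+(?P<delivery>\d+)$",
    re.IGNORECASE,
)


class StopSheet:
    """Inventory rows for one delivery stop, with single-step undo."""

    def __init__(self):
        self.rows = {}      # product -> (inventory on hand, delivery amount)
        self._history = []  # earlier copies of rows, newest last

    def hear(self, utterance):
        """Apply one recognized utterance and return the rows to display."""
        text = utterance.strip()
        if text.lower() == "undo":
            if self._history:
                self.rows = self._history.pop()
            return self.rows
        match = COMMAND.match(text)
        if match is None:
            raise ValueError(f"unrecognized command: {utterance!r}")
        self._history.append(dict(self.rows))  # snapshot before changing
        self.rows[match["product"]] = (
            int(match["inventory"]),
            int(match["delivery"]),
        )
        return self.rows


if __name__ == "__main__":
    sheet = StopSheet()
    print(sheet.hear("Diet Coke, inventory 23 delivery 12"))
    print(sheet.hear("Diet Coke, inventory 32 delivery 12"))  # misheard
    print(sheet.hear("undo"))
```

The screen does the confirming here: the driver never has to listen to a read-back, only glance at the row and say “undo” if the numbers look wrong.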

End Sidebar 3


As we’ve seen in both voice and wireless data applications, the opportunity here is to let the field forces of all kinds of companies enjoy the wireless data advantages of a FedEx or UPS, without OEMing their own tablets and networks. Any multimodal app can be branded with logos on PDAs or even WAP phones. And why add voice to this? “Speed and convenience,” says SpeechWorks’ CEO Steve Chambers. “One-handed operation,” says Sajor.


Note in the simultaneous MM demos that the Kirusa handset client is not a SALT browser. Instead, it understands the Java output by KMMP. “There is a crucial difference between what others do in SALT and us,” says Mike Sajor, Kirusa Vice President, Business Development. “Classic wisdom says the SALT browser is out in the hand-held device. We don’t agree. We don’t see every device and every OS – iDEN, Symbian, Palm – being able to handle SALT browsers. That’s a non-trivial thing, to be able to manage the voice-data alignment, to be able to send the right requests to the voice platform in the network. Some devices just aren’t that sophisticated. So in our architecture, SALT talks to our multimodal platform, and our platform decides what to do with each type of handset. In the case of the iPAQ, we’ve sent it Java elements. This gives us a leg up in being able to support emerging J2ME through Symbian devices.”


And from the mobile operator’s or the enterprise’s standpoint, making multimodality work is only 20 percent of the problem, says Sajor, a 21-year Lucent alumnus. Making it work reliably, scalably, with human factors factored in, operations support, registration, authorization, provisioning, maintenance, TMM compliance, CORBA, SNMP – all that junk in the telecom domain – that’s 80 percent of the problem.


In the same vein, an important part of the KMMP platform is its billing interface and customer care. For a sequential multimodal app, these can combine multiple calls into one billable event. For a simultaneous one, the platform might charge by volume of data traffic, or by flat rate. Also included: hooks for personalization data, using its own Oracle database or LDAP or HLR and VLR. KMMP runs on Windows 2000 or Solaris, on hardware ranging from a Dell 1650 to a Sun Netra. Written mainly in Java, it can sit in a Java containerized architecture for maximum scalability.
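
The “multiple calls into one billable event” idea is easy to picture: a sequential session produces a string of separate WAP and voice legs, and billing wants one record per session. The sketch below shows one plausible way to roll them up. The record fields, the session identifier, and the units are all assumptions for illustration, not a description of KMMP’s billing interface.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Leg:
    """One network leg of a multimodal session (hypothetical record)."""
    session_id: str
    kind: str          # "wap" or "voice"
    seconds: int = 0   # duration, meaningful for voice legs
    kilobytes: int = 0 # data volume, meaningful for WAP legs


def billable_events(legs):
    """Collapse all legs that share a session into one billable record."""
    totals = defaultdict(
        lambda: {"legs": 0, "voice_seconds": 0, "kilobytes": 0}
    )
    for leg in legs:
        record = totals[leg.session_id]
        record["legs"] += 1
        if leg.kind == "voice":
            record["voice_seconds"] += leg.seconds
        record["kilobytes"] += leg.kilobytes
    return dict(totals)


if __name__ == "__main__":
    legs = [
        Leg("s1", "wap", kilobytes=12),
        Leg("s1", "voice", seconds=40),
        Leg("s1", "wap", kilobytes=3),
        Leg("s1", "voice", seconds=25),
    ]
    print(billable_events(legs))
```

From a record like this an operator could price the session by voice time, by data volume, or at a flat rate, without the subscriber seeing four separate line items.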


Like its predecessors in the VoiceXML platform space, Kirusa is trying to nurture a multimodal community by making its platform available to qualified developers. At present, it only supports the M3L WAP-plus-VoiceXML markup. Developers need only their own web application server and endpoints to test. They are also issuing multimodal style guides and demo applications with KADP membership, and offer their own professional development and consulting services as well.

[1] Put in KMMP platform diagram here

[2] Put in Microsoft platform diagram here