23 JAN 2013

Posted by Will at 04:00 PM EST

WebRTC, SIP, and HTML5: A Brief Introduction

WebRTC. It has certainly generated a lot of interest in the web community. Last month, you may have even caught us saying we believe the browser to be the ultimate destination of SIP communications. And with another Java security flaw being discovered (and patched) this month, the idea of a purely browser-based option is very appealing. So what is this great new technology? It's actually a couple of different HTML5 specifications, each with its own role. Let's take a look.

Note: For the sake of brevity, I have left off browser-specific prefixes. Be sure to check resources such as Can I Use... when implementing your web app.
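As a rough illustration of what handling prefixes can look like, here is a minimal sketch of a lookup helper. The name resolvePrefixed and the fakeNavigator object are hypothetical, introduced only for this example; in a browser you would pass the real navigator object, and the prefix list you need depends on the browsers you target.

```javascript
// Sketch: find a possibly vendor-prefixed method on a host object.
// `resolvePrefixed` is a hypothetical helper, not a browser API.
function resolvePrefixed(host, name, prefixes) {
  if (typeof host[name] === 'function') {
    return host[name].bind(host);
  }
  // Prefixed names capitalize the first letter:
  // getUserMedia -> webkitGetUserMedia
  var capitalized = name.charAt(0).toUpperCase() + name.slice(1);
  for (var i = 0; i < prefixes.length; i++) {
    var candidate = host[prefixes[i] + capitalized];
    if (typeof candidate === 'function') {
      return candidate.bind(host);
    }
  }
  return null; // nothing found; the caller decides how to degrade
}

// Stand-in for `navigator` in a browser that only ships the prefixed method
var fakeNavigator = {
  webkitGetUserMedia: function (constraints, onSuccess, onError) {
    onSuccess({ id: 'fake-stream' });
  }
};

var getUserMedia = resolvePrefixed(fakeNavigator, 'getUserMedia', ['webkit', 'moz']);

var capturedStream = null;
getUserMedia({ audio: false, video: true }, function (stream) {
  capturedStream = stream;
}, function (error) {
  console.log('something bad happened', error);
});
```

Returning null rather than throwing leaves it to the application to decide whether a missing API is fatal.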

The Pieces

WebRTC requires the use of two main component JavaScript APIs: MediaStream (more commonly known by its JavaScript function getUserMedia) and RTCPeerConnection. The MediaStream API provides the ability to capture video and audio data from the user's device and turn them into usable JavaScript objects. Creating an RTCPeerConnection, meanwhile, allows the browser to connect directly to other users' browsers, or peers. Peers are found through the exchange and negotiation of session information, the exact method of which is left up to the application.

Now that may seem nice and simple, but it isn't the whole story. While you can set up a simple video chat with just getUserMedia and RTCPeerConnection, a full-featured web application probably requires a few more common HTML5 APIs. For example, when an application needs to send the media stream to a server rather than a peer, WebSockets, another HTML5 addition to JavaScript, can allow this real-time communication to take place. The destination need not be just a server, either. Existing IP phones don't support receiving RTCPeerConnection data, so the media stream must be proxied through a gateway server, possibly using these WebSockets.

If you're thinking to yourself, "Why do I have to use two different APIs to send real-time information, depending on where I'm sending it?" you're not alone. WebRTC (along with every other API discussed in this post) is still only a draft specification. In August 2012, Microsoft proposed a competing spec, Customizable, Ubiquitous Real Time Communication over the Web (CU-RTC-WEB), that cites this incompatibility with existing devices and a few other flaws as reasons to discard WebRTC. Instead, they propose a lower-level, more customizable API. While it seems to be a reasonable argument, CU-RTC-WEB has mostly been seen as Microsoft’s gut reaction to competitor Google’s WebRTC and so has not yet been formally drafted by the W3C. Currently, the specification for WebRTC recognizes CU-RTC-WEB as a proposed alternative but advises that it is unclear how or if it will affect the final design of WebRTC. For now at least, WebRTC is the most popular solution.

MediaStream API and getUserMedia

The MediaStream API allows web applications to capture data from the user’s microphone and camera. A call to getUserMedia kicks off this process by prompting the user for permission. Two callbacks are passed to getUserMedia: one that runs on success and receives the stream, and one that runs on error. Here’s an example that displays camera input in a video element:


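navigator.getUserMedia({audio: false, video: true}, function (stream) {
  // Success: point the video element at the captured stream
  var video = document.getElementById('myVideo');
  video.src = window.URL.createObjectURL(stream);
  video.play();
}, function (error) {
  // Error: the user denied permission or no camera is available
  console.log('something bad happened', error);
});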

Handling audio is exactly the same. For a demo of similar code in action, try talking to yourself. getUserMedia is a very simple yet important building block of WebRTC. For more information about the MediaStream API, see HTML5 Rocks's article on getUserMedia.
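The example above attaches the stream through an object URL. Browsers also expose a srcObject property on media elements for this purpose, so a defensive application might check for it first. The following is only a sketch under that assumption: attachStream is a hypothetical helper name, and the plain objects at the bottom are stand-ins for a real video element and window.URL.

```javascript
// Hypothetical helper: attach a stream to a media element.
// Prefers the element's `srcObject` property when it exists and falls
// back to the object-URL approach used in the example above.
function attachStream(element, stream, urlApi) {
  if ('srcObject' in element) {
    element.srcObject = stream;
    return 'srcObject';
  }
  element.src = urlApi.createObjectURL(stream);
  return 'objectURL';
}

// Stand-ins so the sketch can run outside a browser
var fakeStream = { id: 'fake-stream' };
var fakeUrlApi = {
  createObjectURL: function (stream) { return 'blob:' + stream.id; }
};
var newerElement = { srcObject: null, src: '' };
var olderElement = { src: '' };

var newerResult = attachStream(newerElement, fakeStream, fakeUrlApi);
var olderResult = attachStream(olderElement, fakeStream, fakeUrlApi);
```

In a browser, the call would look like attachStream(document.getElementById('myVideo'), stream, window.URL).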

Standardized Peer2Peer

After obtaining a media stream, all that remains is sending it to a peer (and receiving theirs back). RTCPeerConnection attempts to standardize this process. After an RTCPeerConnection is created and a local stream is added to it, an offer is created. The offer is simply a description of the possible codecs, encryption, etc. available for sessions. It uses none other than SDP, the Session Description Protocol also used in SIP. After the description is created, it is sent to a potential peer via an unspecified method:


var callerPC = new RTCPeerConnection();
callerPC.addStream(localStream);
callerPC.createOffer(gotSDP);

function gotSDP(description) {
  callerPC.setLocalDescription(description);
  /* Now the description must be sent to the peer.
  This tells the peer how to connect.  It could be anything,
  but here we’ll name this method ‘invite’ */
  invite(description);
}
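
To make "a description of the possible codecs" more concrete: the description object carries the SDP text in its sdp property, and codecs appear there as a=rtpmap lines under each m= media section. The sketch below pulls them out of a hand-written sample. The sample SDP is illustrative and much shorter than what a browser produces, and listCodecs is a hypothetical helper, not part of WebRTC.

```javascript
// Hypothetical helper: list the codecs advertised in an SDP string.
function listCodecs(sdp) {
  var codecs = [];
  var currentMedia = null;
  sdp.split(/\r?\n/).forEach(function (line) {
    if (line.indexOf('m=') === 0) {
      // m=<media> <port> <proto> <payload types...>
      currentMedia = line.slice(2).split(' ')[0];
    } else if (line.indexOf('a=rtpmap:') === 0 && currentMedia !== null) {
      // a=rtpmap:<payload type> <encoding name>/<clock rate>[/<channels>]
      var parts = line.slice('a=rtpmap:'.length).split(' ');
      var encoding = parts[1].split('/');
      codecs.push({
        media: currentMedia,
        payloadType: Number(parts[0]),
        name: encoding[0],
        clockRate: Number(encoding[1])
      });
    }
  });
  return codecs;
}

// Abbreviated, hand-written sample for illustration only
var sampleSdp = [
  'v=0',
  'o=- 20518 0 IN IP4 0.0.0.0',
  's=-',
  't=0 0',
  'm=audio 54609 RTP/SAVPF 111 0',
  'a=rtpmap:111 opus/48000/2',
  'a=rtpmap:0 PCMU/8000',
  'm=video 54609 RTP/SAVPF 100',
  'a=rtpmap:100 VP8/90000'
].join('\r\n');

var offeredCodecs = listCodecs(sampleSdp);
```

Inside gotSDP, the equivalent call would be listCodecs(description.sdp).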

On the callee side, the description is received, again through an unspecified means:


var calleePC;

// This function would be called when receiving a remote connection
function processRemoteDescription(description) {
  // Callee creates PeerConnection
  calleePC = new RTCPeerConnection();
  calleePC.setRemoteDescription(description);
  calleePC.createAnswer(function (localDescription) {
    calleePC.setLocalDescription(localDescription);

    /* We need to send our own SDP back to the caller.
    This method is left up to the application to create.  Here,
    we name the function ‘okay’ */
    okay(localDescription);
  });
  calleePC.onaddstream = function (event) {
    // event.stream is the caller's stream; show it in a video/audio element
  };
}

Lastly, the caller needs to receive the answer and add the new remote media stream:


function onOkay(description) {
  callerPC.setRemoteDescription(description);
}
callerPC.onaddstream = function (event) {
  // event.stream is the callee's stream; show it in a video/audio element
};

For a demo of RTCPeerConnection in action, try connecting to yourself. RTCPeerConnection also includes some features like NAT traversal and built-in jitter buffers. To read more about RTCPeerConnection, check out HTML5 Rocks's Getting Started Guide.

Where does SIP fit into all this?

Throughout this post, session descriptions have been exchanged by certain ‘unspecified’ methods. This lack of specification is not an oversight, but an intentional decision by the designers of WebRTC. The signaling, or actual method of exchanging these descriptions, is protocol agnostic. That means that applications can use Google’s Channel API, HTTP POSTs, email, or SIP. I hinted at this by naming the functions in the examples above ‘invite’ and ‘okay’; that’s exactly what they do. WebRTC’s offer/answer model fits very naturally onto the idea of a SIP signaling mechanism.
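
To show how little WebRTC assumes about signaling, here is a minimal sketch of the invite/okay exchange over an in-memory channel. Everything in it is hypothetical: createSignalingPair stands in for whatever transport the application picks (SIP, WebSockets, HTTP), the message shape is invented for this example, and the SDP strings are placeholders rather than real descriptions.

```javascript
// Hypothetical in-memory signaling channel. A real application would put
// SIP, WebSockets, or HTTP where `send` hands the message to the peer.
function createSignalingPair() {
  function makeEndpoint() {
    var handlers = {};
    var endpoint = {
      peer: null,
      on: function (type, handler) {
        handlers[type] = handler;
      },
      send: function (type, sdp) {
        // Serialize the message as it might travel on the wire
        endpoint.peer.receive(JSON.stringify({ type: type, sdp: sdp }));
      },
      receive: function (wire) {
        var message = JSON.parse(wire);
        if (handlers[message.type]) {
          handlers[message.type](message.sdp);
        }
      }
    };
    return endpoint;
  }
  var caller = makeEndpoint();
  var callee = makeEndpoint();
  caller.peer = callee;
  callee.peer = caller;
  return { caller: caller, callee: callee };
}

var pair = createSignalingPair();
var signalingLog = [];

// Callee side: this is where processRemoteDescription would run,
// and its `okay` function would be callee.send('okay', ...)
pair.callee.on('invite', function (offerSdp) {
  signalingLog.push('callee got offer: ' + offerSdp);
  pair.callee.send('okay', 'placeholder-answer-sdp');
});

// Caller side: this is where onOkay would run
pair.caller.on('okay', function (answerSdp) {
  signalingLog.push('caller got answer: ' + answerSdp);
});

// The caller's `invite` function would be caller.send('invite', ...)
pair.caller.send('invite', 'placeholder-offer-sdp');
```

Swapping the transport means changing only what send and receive do; the offer/answer handlers stay the same.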

There are, however, some other technical issues that make SIP somewhat of a challenge to implement with WebRTC, such as connecting to SIP proxies via WebSocket and sending media streams between browsers and phones. These issues probably deserve a blog post of their own, but they are not insurmountable. Several JavaScript SIP stacks are being developed, such as sipML5 (‘The world’s first open source HTML5 SIP client’) and the older, also open source SIP-JS project. It surely won’t be long until a full-fledged SIP client is available in the browser, thanks to WebRTC.

Edit 2014-04-28: As of today, we've released our own JavaScript SIP stack for WebRTC developers: SIP.js. Check it out, or visit our new OnSIP Network platform as a service offering.