Music On Hold and the SIP Offer/Answer Model - What Were We Thinking?

by Erick Johnson

An OnSIP engineer shares his thoughts on the design and implementation of OnSIP's MoH service.

Published: May 16, 2011

We received a comment on a recent blog post that asked us why we have implemented our music on hold (MOH) service in the manner we did. While reading the comment I realized that we released the service and alerted customers it existed, but nobody in the engineering department stepped up to explain any of the technical details behind the mechanisms we decided to support - by nobody I mean me... oops. The comment I'm referring to is quoted here:

While you are right about that INVITEs without SDP conform with RFC3261 why would you use such [an] obscure mechanism for music on hold? ...

The commenter concluded that we require an INVITE without SDP for MOH based on one of our own misstatements from the same blog post:

...INVITES without SDP are not supported, which is required for music on hold...

As you can see, everybody is confused, and I take that to be my fault for not clearly explaining how the OnSIP MOH service works when we first released it. I'll also try to share some of the thoughts that went into our design of the service.

The Basics

My last evaluation of the MOH landscape was early in 2010, so please forgive me if much of what I say has changed. First let me explain, in the most basic sense, how a third party MOH service works. Imagine two callers, Bob and Alice. When Bob and Alice are on the phone with each other and Bob places Alice on hold, he wants Alice to hear hold music. If his phone isn't the source of the music, then his phone must ask a third party source to send the music to Alice for him. A simple protocol-agnostic sequence diagram may look like this:

           Alice                Bob                   Music Service
             |                   |                          |
             | bob calls alice   |                          |
             |<------------------|                          |
             |                   |                          |
             | alice answers     |                          |
             |------------------>|                          |
             |                   |                          |
             | talk,talk,talk    |                          |
             |<=================>|                          |
            ...                 ...                        ...
             |                   |                          |
             | bob holds alice   |                          |
             |<------------------|                          |
             |                   | bob asks to send music   |
             |                   |   to alice               |
             |                   |------------------------->|
             |                   |                          |
             |                 music                        |
             |<=============================================|
             |                   |                          |
             | bob unholds alice |                          |
             |<------------------|                          |
             |                   | bob tells music service  |
             |                   |   to stop playing music  |
             |                   |------------------------->|
             |                   |                          |
            ... talk,talk,talk  ...                        ...
             |<=================>|                          |
             |                   |                          |

Nothing groundbreaking here: Bob and Alice are on the phone. Bob puts Alice on hold and asks something somewhere to somehow get music to Alice. Before we go any further, let's pause for a moment and lay out some unofficial requirements for what it means to be on hold:

  1. There is a holder and a holdee. The holder places the holdee on hold.
  2. Once on hold, the holder MUST be able to:
    1. see that the holdee is still on hold
    2. unhold the holdee
    3. end the call while on hold
  3. Once on hold, the holdee MUST be able to:
    1. see that she is on hold with the holder
    2. place the holder on hold, at which point both parties would be simultaneously a holder and a holdee
    3. end the call

The State of Music on Hold and SIP

OK, so now we have a simple diagram and some ground rules for what it means to be on hold. To move forward with an implementation of a music on hold service in SIP, we need to look at the available recommendations, and SIP certainly has no shortage of those. The standard I'll be referring to for MOH is RFC 5359 - SIP Service Examples; specifically section 2.3, "Music on Hold". As a hosted provider whose goal is to support a heterogeneous user-agent environment, we must also consider what the phone manufacturers actually implement. First let's look at the RFC 5359 signaling recommendation for music on hold:

           Alice             Bob       Music Server
             |                |              |
             |    INVITE      |              |
             |--------------->|              |
             | 180 Ringing    |              |
             |<---------------|              |
             |    200 OK      |              |
             |<---------------|              |
             |     ACK        |              |
             |--------------->|              |
             |       RTP      |              |
             |<==============>|              |
             |                |              |
             |   Bob places Alice on hold    |
             |                |              |
             | INVITE (hold)                 |
             |<---------------|              |
             |    200 OK      |              |
             |--------------->|              |
             |     ACK        |              |
             |<---------------|              |
             |    no RTP      |              |
             |                |              |
             |  Bob initiates music on hold  |
             |                |              |
             |                |   REFER Refer-To: A
             |                |------------->|
             |                |    202       |
             |                |<-------------|
             |                |   NOTIFY     |
             |                |<-------------|
             |                |    200       |
             |                |------------->|
             |  INVITE     Replaces: B       |
             |<------------------------------|
             |          200 OK               |
             |------------------------------>|
             |           ACK                 |
             |<------------------------------|
             |           RTP Music           |
             |<==============================|
             |     BYE        |              |
             |--------------->|              |
             |                |  NOTIFY      |
             |  200 OK        |<-------------|
             |<---------------|  200 OK      |
             |                |------------->|
             |                |              |
             | The music on hold is complete |
             |                |              |
             |    Bob takes Alice off hold   |
             |                |              |
             |  INVITE Replaces: M           |
             |<---------------|              |
             |    200 OK      |              |
             |--------------->|              |
             |     ACK        |              |
             |<---------------|              |
             |       RTP      |              |
             |<==============>|              |
             |            BYE                |
             |------------------------------>|
             |          200 OK               |
             |<------------------------------|

Let's analyze this briefly. The RFC asks phone manufacturers to kick off the hold process normally, but that is where "normal" hold ends. The holder then sends an out-of-dialog REFER with Replaces to the MOH server. The MOH server takes that REFER, sends its own INVITE with Replaces to the holdee, and replaces the holder's call on the holdee. In conformance with RFC 3515, the MOH server then sends a NOTIFY to the holder via the implicit subscription created by the REFER. Additionally, the MOH server must put the identifying parts of its new call with Alice (the Call-ID, from tag, and to tag) in the NOTIFY body. Per standard INVITE with Replaces behavior, the holdee then sends a BYE to the holder, ending the original dialog. When the holder wants to retrieve the call, he is supposed to send the holdee a second INVITE with Replaces, filled in with the parameters from the second NOTIFY from the MOH server.

<crazy-rant> IMHO... Yuck!... What is the holder's phone supposed to look like while there is no call on it, even though he effectively still has Alice on hold? Why can't we leverage the old hold mechanism of redirected SDP by putting the holdee media session in sendonly? What if Alice hangs up the phone - how is Bob supposed to know that? What if Alice wants to put Bob on hold while she is on hold? She can't!!! She doesn't have a call with Bob anymore. What if Alice's phone doesn't support INVITE with Replaces? Why does Alice's phone need to support INVITE with Replaces in order for her to receive hold music? Is Bob's phone supposed to analyze the Supported headers (if any) from earlier in the dialog, or the result of the NOTIFY, to switch between "music on hold mode" and "plain old hold mode"? That's just error prone. What does Alice's call display say - is it the caller ID of the MOH server?

Further "yucks" - the holder's phone must now support receiving a NOTIFY with a message/sipfrag body and store values from the sipfrag for later use... why? Also, the MOH server must support the REFER method, an out-of-dialog REFER with Replaces to be specific. It must also support sending INVITE with Replaces in order to complete the attended transfer. </crazy-rant>

It's my belief that the RFC authors wrote a severely crippled hold specification here. The recommendation (without significantly more guidance) breaks nearly every requirement of hold that I laid out above. This spec is much more like a call parking service than hold. If that's what it were called, I'd have no problem with it. I think the phone manufacturers believe this too, because I have yet to find a phone that supports this flow. If you know of one, please let me know in the comments.

So then what do the phone manufacturers implement? Well, Polycom and Cisco (though Cisco has a few bugs in their implementation) do something much simpler that requires no extra support from the holdee's phone. If it supports standard hold, it can support music on hold. Here's the call flow:

           Alice             Bob       Music Server
             |                |              |
             |    INVITE      |              |
             |--------------->|              |
             |    200 OK      |              |
             |<---------------|              |
             |     ACK        |              |
             |--------------->|              |
             |       RTP      |              |
             |<==============>|              |
             |                |              |
             |   Bob places Alice on hold    |
             |                |              |
             | INVITE (no SDP)|              |
             |<---------------|              |
             |    200 OK      |              |
             |   with SDP     |              |
             |--------------->|              |
             |                | INVITE w/    |
             |                | alice's SDP  |
             |                |------------->|
             |                |  200 OK      |
             |                |<-------------|
             |     ACK        |              |
             | with MOH SDP   |              |
             |<---------------|              |
             |                |              |
             |               RTP             |
             |<=============================>|
             |                |              |
             | INVITE (unhold)|              |
             | with bob's SDP |              |
             |<---------------|    BYE       |
             |   200 OK       |------------->|
             |--------------->|  200 OK      |
             |     ACK        |<-------------|
             |<---------------|              |
             |     RTP        |              |
             |<==============>|              |
            ...              ...            ...
             |                |              |

To be fair, Grandstream and Snom may support this as well, but as I mentioned above, I did this investigation over a year ago and just can't remember the relevant details. I know Counterpath does NOT support this, and Aastra did not at the time of the investigation.

In any case, this call flow is much simpler from everyone's vantage point. Bob and Alice keep their call up, and Alice and the MOH server never know they are communicating with each other. Bob's phone acts like a black box between the two, managing the media session via signaling. The hold mechanism marks the SDP answer sent to Alice with sendonly like a standard hold request would (although normally in the offer), except that instead of supplying the old "0.0.0.0" media IP address, it provides the IP/port of the MOH server. The implementation of the MOH server becomes trivial - accept session setups via INVITE. Period; Done.

To be fair, there is a drawback to this method: the delayed ACK generally results in at least one retransmission of the 200 OK response from Alice to Bob. Also, I'm not sure that delaying the ACK like this doesn't violate some portion of RFC 3261's specification for handling 200 OKs to an INVITE, but if I had to guess, I would say it's OK.
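As a rough illustration of the SDP handling described above, here is a minimal Python sketch of what Bob's phone might do with the MOH server's answer before placing it in the ACK to Alice. The function names, the sample SDP, and the 192.0.2.10 address are all made up for the example, and the sketch assumes a single audio stream; it is not taken from any phone's actual implementation.

```python
from typing import Tuple

# SDP direction attributes that the rewrite replaces.
DIRECTION_ATTRS = {"a=sendrecv", "a=sendonly", "a=recvonly", "a=inactive"}

# Hypothetical SDP answer from the MOH server (address and port are made up).
MOH_SDP = "\r\n".join([
    "v=0",
    "o=moh 1 1 IN IP4 192.0.2.10",
    "s=-",
    "c=IN IP4 192.0.2.10",
    "t=0 0",
    "m=audio 40000 RTP/AVP 0",
    "a=rtpmap:0 PCMU/8000",
    "a=sendrecv",
]) + "\r\n"


def hold_answer_from_moh(moh_sdp: str) -> str:
    """Copy the MOH server's SDP and force its direction to sendonly.

    The c= and m= lines are left alone, so the media address stays the
    MOH server's rather than the old "0.0.0.0" hold address.
    """
    kept = [
        line for line in moh_sdp.splitlines()
        if line.strip() and line.strip() not in DIRECTION_ATTRS
    ]
    # Assumption: one media stream, so appending puts the attribute in it.
    kept.append("a=sendonly")
    return "\r\n".join(kept) + "\r\n"


def media_target(sdp: str) -> Tuple[str, int]:
    """Return the (address, port) from the first c= and m= lines."""
    address = None
    port = None
    for line in sdp.splitlines():
        if line.startswith("c=") and address is None:
            address = line.split()[-1]
        elif line.startswith("m=") and port is None:
            port = int(line.split()[1])
    if address is None or port is None:
        raise ValueError("SDP has no c= or m= line")
    return address, port


if __name__ == "__main__":
    answer = hold_answer_from_moh(MOH_SDP)
    print(answer)
    print(media_target(answer))
```

The point of the sketch is only that the held party is handed a real media address and port, not a blackholed one, which is why a phone that understands standard hold needs nothing extra.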

The Offer/Answer Model

So why does this work? SIP utilizes what it calls the SDP offer/answer model and clearly defines the rules surrounding the creation of, and responses to, the INVITE in the context of this offer/answer. The relevant sections of the core RFC are quoted here:

13.2.1 Creating the Initial INVITE
...
In this specification, offers and answers can only appear in INVITE requests and responses, and ACK. The usage of offers and answers is further restricted. For the initial INVITE transaction, the rules are:

  o The initial offer MUST be in either an INVITE or, if not there, in the first reliable non-failure message from the UAS back to the UAC. In this specification, that is the final 2xx response.

  o If the initial offer is in an INVITE, the answer MUST be in a reliable non-failure message from UAS back to UAC which is correlated to that INVITE. For this specification, that is only the final 2xx response to that INVITE. That same exact answer MAY also be placed in any provisional responses sent prior to the answer. The UAC MUST treat the first session description it receives as the answer, and MUST ignore any session descriptions in subsequent responses to the initial INVITE.

  o If the initial offer is in the first reliable non-failure message from the UAS back to UAC, the answer MUST be in the acknowledgement for that message (in this specification, ACK for a 2xx response).

And then later

13.3.1 Processing of the INVITE
...
If the INVITE does not contain a session description, the UAS is being asked to participate in a session, and the UAC has asked that the UAS provide the offer of the session. It MUST provide the offer in its first non-failure reliable message back to the UAC. In this specification, that is a 2xx response to the INVITE.

And then even later

14.1 UAC Behavior
The same offer-answer model that applies to session descriptions in INVITEs (Section 13.2.1) applies to re-INVITEs.
...
Of course, a UAC MAY send a re-INVITE with no session description, in which case the first reliable non-failure response to the re-INVITE will contain the offer (in this specification, that is a 2xx response).
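The quoted rules boil down to a small lookup: where the offer sits decides where the answer must go. The Python sketch below is only an illustration of that reading; the function name and the message labels are invented for the example and do not come from any SIP stack.

```python
from typing import Dict


def offer_answer_placement(invite_has_sdp: bool) -> Dict[str, str]:
    """Say which message carries the offer and which carries the answer.

    Follows the RFC 3261 rules quoted above for an INVITE or re-INVITE
    transaction that ends in a 2xx response.
    """
    if invite_has_sdp:
        # Offer in the INVITE, so the answer comes back in the 2xx.
        return {"offer": "INVITE", "answer": "2xx"}
    # No SDP in the INVITE (delayed offer): the UAS offers in the 2xx
    # and the UAC answers in the ACK.
    return {"offer": "2xx", "answer": "ACK"}


if __name__ == "__main__":
    # The two legs of the MOH hold flow, as seen from Bob's phone.
    alice_leg = offer_answer_placement(invite_has_sdp=False)
    moh_leg = offer_answer_placement(invite_has_sdp=True)
    print("Bob -> Alice re-INVITE:", alice_leg)
    print("Bob -> MOH INVITE:    ", moh_leg)
```

Seen this way, the hold flow is just two ordinary transactions chained together: Alice's offer from her 2xx becomes the offer in Bob's INVITE to the MOH server, and the MOH server's answer from its 2xx becomes the answer in Bob's ACK to Alice.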

Wrap it up

OK so let me try to wind down here. The original commenter asked:

While you are right about that INVITEs without SDP conform with RFC3261 why would you use such [an] obscure mechanism for music on hold? ...

The problem I had with the comment was that he appeared to understand the intricacies of the signaling involved, yet he failed to offer any solution to the problem or explain why he felt the delayed offer was obscure. It's my position that the scheme we chose to support is the simplest and most widely supported. It requires nothing more than being able to create and respond to an INVITE correctly, as described in RFC 3261 - something every SIP stack should be able to do. The RFC 5359 solution requires supporting RFC 3261 (obviously), RFC 3515 (SIP Refer), RFC 3265 (SIP Eventing), RFC 3420 (message/sipfrag), and RFC 3891 (SIP Replaces header). Aside from the extra requirements, the RFC 5359 solution creates a far less standard hold experience for both the holder and holdee. It's for all these reasons that we decided to support the delayed offer approach instead of the RFC 5359 one.

Our earlier statement that MOH requires supporting INVITE without SDP was not quite accurate. It is more accurate to say that to be put on hold by any of the major handset manufacturers' devices, and expect to hear hold music, your device must accept and properly respond to an INVITE without SDP (the delayed offer/answer model). So as you can see, the OnSIP MOH service does not really levy the signaling requirements here; the phones do.

The only "OnSIP" requirement is that you have enabled MOH in your domain, and that you call the MOH server from an AOR that can authenticate in the same domain as the AOR of the MOH service. For example, if your OnSIP account is "cupcakes.onsip.com", you can call the "moh@cupcakes.onsip.com" address from any user address in the "cupcakes.onsip.com" domain. The reason we restrict it by domain is that we want to make sure someone isn't stealing your MOH service and sticking you with the bill ;)

I hope this clears up some of the misinformation. Please comment if there is anything else you'd like to know about how our MOH service works. Also, if you know of any other music on hold implementations, please let us know and we'll evaluate them and try to add them to our todo list.
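For the curious, the same-domain rule described above can be pictured with a few lines of Python. This is purely illustrative: the helper names are invented, it only compares the domain part of two addresses, and it says nothing about how OnSIP actually authenticates callers.

```python
def aor_domain(aor: str) -> str:
    """Return the lowercased domain of an AOR such as "moh@cupcakes.onsip.com".

    Assumption: a bare user@host address, optionally prefixed with
    "sip:" or "sips:", with no port or URI parameters.
    """
    aor = aor.strip()
    if aor.lower().startswith(("sip:", "sips:")):
        aor = aor.split(":", 1)[1]
    user, sep, host = aor.rpartition("@")
    if not sep or not user or not host:
        raise ValueError("not a user@domain address: %r" % aor)
    return host.lower()


def may_call_moh(caller_aor: str, moh_aor: str) -> bool:
    """True when the caller's AOR is in the same domain as the MOH AOR."""
    return aor_domain(caller_aor) == aor_domain(moh_aor)


if __name__ == "__main__":
    moh = "moh@cupcakes.onsip.com"
    print(may_call_moh("bob@cupcakes.onsip.com", moh))
    print(may_call_moh("sip:mallory@example.com", moh))
```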