Following my last post on Lync, Strings and Cans, I need to report further detail on my test findings, in which I identified that some Lync Online traffic routes peer-to-peer. This was an exciting finding for us, and remains so, although we’ve also uncovered some initially unexpected nuances. My first post describes a model for understanding Lync traffic and details the default experience. In this post, I’ll talk about how, in some cases, a two-person session switches from peer-to-peer routing to a conference mediated by the Lync Online Edge servers. In these cases, Lync traffic routes in the same way as a multiple-participant conference, even though only two users are involved. Put another way, all traffic in these cases routes via Office 365’s Lync Online Edge servers, even if all the internal ports are open for peer-to-peer communications.
NAT Traversal and Candidate Testing
To understand what’s going on, you first need to understand what to look for. Part of the reason for the delay in producing this second post is that I’ve been trying to explain this by picking apart network monitor data. At first these captures were nothing more than an attempt to validate assumed behaviour, but the traffic is quite a bit more complicated than I expected. Thankfully, there are some excellent resources that describe what I’ve seen with greater precision and detail than I could hope to reverse engineer. Having spun my wheels for a bit, I would recommend some healthy RTFM (getting to grips with Lync topology and possibly even consulting the protocol documents) if you’ll spend any amount of time trying to decipher Lync network traffic. I cite some of these resources at the bottom of this post, but for the immediate considerations I’m focusing on some key descriptions from Bernd Ott’s article, How Communicator Uses SDP and ICE To Establish a Media Channel.
There are some relatively light passages that I’ll call out here, as they’re about as concise as anyone is going to get with this stuff. The beginning of the “Starting a PC2PC Call by obtaining the Candidate List” section includes this summary of the call initiation process:
When Alice initiates the Communicator call to Bob, before sending out any SIP INVITE, OC* needs to determine what possible candidates Alice can send to Bob. This is the time for ICE, STUN and TURN and if you want to see more details on what is happening, you will have to use a network sniffing tool of your choice. Two very popular tools are Network Monitor 3 and Wireshark.
The candidate list includes the local list of IP address and port combinations (host candidates), a list of IP address and port combinations allocated by a NAT device (server reflexive candidates) and a list of TURN server IP address and port combinations (relayed candidates).
* “OC” refers to Office Communicator, Lync’s predecessor.
This basically means that each client is building up a list of possible routes for communication with the other participant. Later in the article, he clarifies the process of selecting a channel from these possible combinations:
After Alice received the SDP candidate list from Bob, she will start connection testing and build a matrix with possible media channels to Bob (for more details, please check the “ICE Candidate testing” section). The same process happens on Bob’s side. Depending on the priority of the possible candidates, Alice will send a single SDP candidate in a second “SIP INVITE” and, as she is the controlling agent, she will ask Bob to use certain candidates from his list for this media session. Bob now has to double check the proposed candidates from Alice and will accept the candidates in his answer packet (“SIP 200 OK”).
As both parties agreed on their IP, protocol and port combinations, they will now create the media channels and media information gets transmitted between both parties. Depending on the intermediate network layout, this might be a direct connection (always preferred) or a relayed connection with the A/V Edge as the data relay.
The key point here is that a direct connection is always preferred and the specific candidate pair is selected by analysing priority. This is the process that enables peer-to-peer communications. In our testing we’ve confirmed this is the normal behaviour for media traffic, whenever all the required ports are open. When firewalls prevent either peer from communicating directly with the other, that channel routes via the Microsoft Edge servers (asymmetrically, if needed). This is the technical description of what I’ve detailed in my last post, but there are some exceptions to follow that form the real meat of these topics.
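To make the "priority" idea concrete, here is a minimal sketch of how candidate pairs can be ranked. It uses the priority formulas and recommended type preferences from the public ICE specification (RFC 5245); I'm assuming, not asserting, that Lync's own values follow the same shape, and the MS-ICE documents are the place to check. The `Candidate` class, the `best_pair` helper and the `reachable` callback (a stand-in for the real STUN connectivity checks) are all hypothetical names, and the addresses are documentation examples rather than anything from our captures.

```python
from dataclasses import dataclass
from itertools import product

# Type preferences recommended by RFC 5245. Lync's real values are an assumption here.
TYPE_PREFERENCE = {"host": 126, "srflx": 100, "relay": 0}


@dataclass(frozen=True)
class Candidate:
    kind: str             # "host", "srflx" (server reflexive) or "relay"
    address: str
    port: int
    component: int = 1    # 1 = RTP, 2 = RTCP
    local_pref: int = 65535

    @property
    def priority(self) -> int:
        # RFC 5245: 2^24 * type preference + 2^8 * local preference + (256 - component ID)
        return (TYPE_PREFERENCE[self.kind] << 24) + (self.local_pref << 8) + (256 - self.component)


def pair_priority(controlling: int, controlled: int) -> int:
    # RFC 5245: 2^32 * min(G, D) + 2 * max(G, D) + (1 if G > D else 0)
    g, d = controlling, controlled
    return (min(g, d) << 32) + 2 * max(g, d) + (1 if g > d else 0)


def best_pair(local, remote, reachable):
    """Return the highest-priority pair that passes the connectivity check, or None."""
    pairs = [(l, r) for l, r in product(local, remote) if reachable(l, r)]
    if not pairs:
        return None
    return max(pairs, key=lambda p: pair_priority(p[0].priority, p[1].priority))


# Hypothetical candidate lists for Alice (controlling agent) and Bob.
alice = [
    Candidate("host", "10.0.0.5", 50000),
    Candidate("srflx", "198.51.100.7", 50000),
    Candidate("relay", "203.0.113.10", 3478),
]
bob = [
    Candidate("host", "10.0.1.9", 50010),
    Candidate("srflx", "198.51.100.8", 50010),
    Candidate("relay", "203.0.113.11", 3478),
]

# All ports open: every pair passes, so the direct host/host pair ranks highest.
open_network = best_pair(alice, bob, lambda l, r: True)

# Firewalled: only pairs that involve a relayed candidate pass the checks.
firewalled = best_pair(alice, bob, lambda l, r: "relay" in (l.kind, r.kind))
```

With everything reachable, the host/host pair wins, which matches the "direct connection is always preferred" behaviour. Once direct paths fail their checks, the best surviving pair necessarily includes a relayed candidate, which is the Edge-mediated route.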
When Peer-to-Peer Communications Require Mediation from the Edge
There are a few things that disrupt peer-to-peer routing:
- Most significantly, we’ve found that initiating collaboration (a Poll, Whiteboard or PowerPoint presentation) during a peer-to-peer media session (desktop sharing, voice or video) will move the entire session to the Lync Online Edge servers. These collaborative services route via Lync’s Edge because these conferencing facilities belong to the server infrastructure rather than the Lync client. Once this collaborative traffic starts to route via the Edge, everything else goes via the Edge as well, for the duration of that session. To be clear, this means that the entire session routes over the internet from the moment that Poll, Whiteboard or PowerPoint data is presented. From my perspective (or rather, from an Office 365 user’s perspective) this isn’t really desirable. It’s perhaps not an enormous issue since the Lync codecs are optimised for the WAN, but it’s still not ideal. I’ve discussed this with a Microsoft TSP and I’ve been told that Lync wasn’t designed to decouple media traffic from the datacenter services in a peer-to-peer session, and there is no facility to re-trigger Candidate Testing once the collaborative sharing is complete. In short, this is just how it works. If this is truly disruptive to users (if internet connectivity is saturated) or the business (perhaps there’s an impact on other services) then the solution would be to consider further investment in connectivity or to move Lync on-premises. But this needs to be put in perspective. The service exists in the cloud, so it’s probably not unreasonable to expect that some traffic will be mediated by the service. Another way of looking at it is that any peer-to-peer communication is a bonus.
- The most obvious way to fundamentally alter the peer-to-peer context is to invite another peer to the session. This immediately requires mediation and all participants will route via the Lync Online Edge servers for the duration of the session. Perhaps less obviously, if the session contracts to two participants, all traffic continues to route via the Edge servers. This is entirely expected, but it helps to get your head around why this is the case. As described above, there is no facility to re-initiate Candidate Testing within a session.
- Along these same lines, when we blocked the channel on either end of a peer-to-peer session, it terminated. When we reconnected the session, all media traffic routed via the Lync Online Edge servers. This is completely expected since the peer-to-peer route is no longer available. However, when we re-opened the blocked channel during the same session, media traffic continued to route via the Lync Online Edge server. Again, there is no facility to re-initiate Candidate Testing. This is also expected behaviour, according to our TSP, but it’s worth being clear about this with support staff at a minimum, as re-connecting to the session or putting a call on hold does not change the routing. For the duration of the conversation, all traffic routes via the Lync Online Edge servers.
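The three cases above share one rule: routing is decided once, and a move to the Edge is never undone within the session. A minimal sketch of that rule as a one-way state machine follows. The `Session` class and its method names are my own invention for illustration, not anything from the Lync client or its API, and the model only encodes the behaviour we observed in testing.

```python
from enum import Enum


class Route(Enum):
    PEER_TO_PEER = "peer-to-peer"
    EDGE = "edge"


class Session:
    """Hypothetical model of the observed routing behaviour for one conversation."""

    def __init__(self, direct_path_open: bool = True):
        # Candidate Testing runs once, at session setup.
        self.participants = 2
        self.route = Route.PEER_TO_PEER if direct_path_open else Route.EDGE

    def _escalate(self) -> None:
        # One-way transition: nothing in this model ever sets the route back.
        self.route = Route.EDGE

    def start_collaboration(self) -> None:
        # Poll, Whiteboard or PowerPoint: served by the server infrastructure.
        self._escalate()

    def stop_collaboration(self) -> None:
        # No facility to re-trigger Candidate Testing, so the route is unchanged.
        pass

    def add_participant(self) -> None:
        self.participants += 1
        self._escalate()

    def remove_participant(self) -> None:
        # Contracting back to two people does not restore peer-to-peer routing.
        self.participants = max(2, self.participants - 1)

    def block_direct_path(self) -> None:
        # After the channel is blocked and the session reconnects, media uses the Edge.
        self._escalate()

    def reopen_direct_path(self) -> None:
        # Re-opening the channel mid-session does not change the routing.
        pass


session = Session()
before = session.route
session.start_collaboration()
session.stop_collaboration()
after = session.route
```

Here `before` is peer-to-peer and `after` is Edge, even though the collaboration has finished. Only a fresh session (a new `Session()` in this model) gets another round of Candidate Testing.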
The most important planning implication to fall out of this is that bandwidth modelling may struggle to account for the behaviour described in the first point above. The impact will depend entirely on when the collaborative services are invoked. But having said this, the Lync bandwidth calculator must have been validated in a context where this behaviour occurred. I suppose like most things in the world of capacity management, this underscores the importance of piloting with a representative group of participants who will use the service in a similar way to the bulk of users.
The most important operational consideration is that support staff need to understand why this behaviour occurs so they can respond to related issues and track quality issues beside internet bandwidth usage. If an internet connection becomes saturated then media session quality issues may emerge.
- Troubleshoot STUN with TURN in Office Communications Server 2007 R2
- Office Communications Server 2007 R2 Audio/Media Negotiation
- Lync Server 2010 Port Ranges and Audio/Media Negotiation
- Lync 2010 Media Bypass with and without Voice Resilience
MS-ICE Protocol documents
- Interactive Connectivity Establishment (ICE) Extensions
- Candidates Gathering-Phase Timer (should be 10 seconds)
- Connectivity Phase Timer (must be 10 seconds)
- ICE keep-alive Timer (must be 19 seconds or less)