About the Authors

Scott Firestone holds a master's degree in computer science from MIT and has designed video conferencing and voice products since 1992, resulting in five patents. During his 10 years as a technical leader at Cisco, Scott developed architectures and solutions related to video conferencing, voice and video streaming, and voice-over-IP security. Thiya Ramalingam is an engineering manager for the Unified Communications organization at Cisco. Thiya holds a master's degree in computer engineering and...

Accessing the Focus

The central entity in the distributed architecture is called the focus. The focus maintains a signaling relationship with all the endpoints (or participants) in the conference. Conference and participant operations, such as creating, maintaining, and destroying conferences and adding and deleting participants, occur in the focus. Each conference must have a unique address of record (AoR) that corresponds to a focus. A conference server could contain multiple focus instances, and each focus may control a...

Ad Hoc Conference Initiation: Conference Button

The Conference button on the phone creates an ad hoc conference by expanding a two-party call into a multiparty conference. Consider the following call scenario:
1. Bob places a call to Alice, and Alice answers.
2. Bob decides to include Fred in the call. Bob presses the Conference button to put Alice on hold.
3. Bob places a call to Fred, and Fred answers. Bob announces that he will include Fred in the preexisting conversation with Alice.
4. Bob presses the Conference button again to connect...

Ad Hoc Conferences

As previously stated, ad hoc conferences are the simplest form of meeting. Phone users create them in two ways:
- When the meeting host presses the Conference button on the phone. The conference functionality enables a user to escalate an existing two-party call into one with multiple participants.
- By using the Meet Me option on the phone.
Ad hoc meetings do not reserve resources in advance and do not require participants to interact with a voice user interface before joining the meeting.

Ad Hoc Video Conferencing

A video-enabled endpoint uses the same procedure to join a conference but offers additional parameters in the SDP offer to describe the properties of the video media stream. Example 5-4 shows an SDP offer, in which endpoint A sends an INVITE to the conference server.

Example 5-4 SDP Offer from an Endpoint for Joining Ad Hoc Video Conference

o=san 1549546120 0 IN IP4 10.10.10.26
c=IN IP4 10.10.10.26
m=audio 49220 RTP/AVP 0 8
m=video 49222 RTP/AVP 109 34 96 31
a=rtpmap:109 H264/90000
a=fmtp:109...

Address- and Port-Dependent Filtering

Figure 8-11 shows a NAT that implements address- and port-dependent filtering.

Figure 8-11 Address- and Port-Dependent Filtering

After the NAT creates the binding, it forwards a packet from the external network to the internal network only if:
- The source address:port of the packet is Ae:Pe
- The destination address:port of the packet is Am:Pm
In this case, only the endpoint that received the packet can send a packet back to the internal network, and the packet must have a source port equal to the...
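To make the filtering rule concrete, the following Python sketch models the forwarding decision. The names (Binding, allow_inbound) and the address tuples are illustrative assumptions, not from the book.

from dataclasses import dataclass

@dataclass(frozen=True)
class Binding:
    internal: tuple      # (Ai, Pi), internal endpoint address:port
    mapped: tuple        # (Am, Pm), external mapped address:port on the NAT
    remote: tuple        # (Ae, Pe), external endpoint the internal side contacted

def allow_inbound(binding: Binding, src: tuple, dst: tuple) -> bool:
    """Forward an external packet inward only if its source is exactly the
    endpoint (and port) previously contacted, and it targets the mapping."""
    return src == binding.remote and dst == binding.mapped

b = Binding(internal=("10.0.0.5", 4000),
            mapped=("192.0.2.1", 62000),
            remote=("198.51.100.7", 5004))

print(allow_inbound(b, ("198.51.100.7", 5004), ("192.0.2.1", 62000)))  # True
print(allow_inbound(b, ("198.51.100.7", 9999), ("192.0.2.1", 62000)))  # False: wrong source port

Under endpoint-independent filtering, by contrast, the same function would check only the destination mapping, not the packet's source.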

Asymmetric Encryption: Public Key Cryptography

Unlike symmetric encryption, where both sender and receiver use the same key, public key encryption uses two keys. In this approach, each endpoint creates a public key and a private key. Each endpoint keeps the private key secret but makes the public key widely available. Public key cryptography can perform two major functions: encryption and integrity protection. When used for encryption, public key cryptography relies on the fact that data encrypted with the public key can be decrypted only...
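As an illustration of that property, the following sketch uses RSA from the Python "cryptography" package (an assumption; the book does not prescribe a library) to encrypt with a public key and decrypt with the matching private key.

# Minimal public key encryption sketch: data encrypted with the public key
# can be decrypted only with the matching private key.
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import rsa, padding

private_key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
public_key = private_key.public_key()

oaep = padding.OAEP(mgf=padding.MGF1(algorithm=hashes.SHA256()),
                    algorithm=hashes.SHA256(), label=None)

ciphertext = public_key.encrypt(b"media session key", oaep)
plaintext = private_key.decrypt(ciphertext, oaep)
assert plaintext == b"media session key"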

Audio Mixer

Within a conference, the audio mixer is responsible for selecting the input streams and summing these streams into a mixed output stream. This section provides a detailed view into the various modules that comprise it. The audio mixer is the core component in the media plane. It is responsible for selecting incoming audio streams, summing them, and distributing the summed output back to the participants. When mixing audio streams in a large conference, the audio mixer selects only a subset of...
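A minimal sketch of that selection-plus-summing step appears below; it is illustrative, not the book's implementation. Frames are lists of 16-bit PCM samples keyed by participant id, and the helper names are assumptions.

def energy(frame):
    # Average energy of one frame, used for loudest-speaker selection.
    return sum(s * s for s in frame) / len(frame)

def mix(frames, n_speakers=3):
    """Pick the n loudest input streams, then sum them into one output."""
    selected = sorted(frames, key=lambda pid: energy(frames[pid]),
                      reverse=True)[:n_speakers]
    length = len(next(iter(frames.values())))
    mixed = [0] * length
    for pid in selected:
        for i, s in enumerate(frames[pid]):
            mixed[i] += s
    # Clamp the sum back into the 16-bit range to avoid wraparound.
    mixed = [max(-32768, min(32767, s)) for s in mixed]
    return mixed, set(selected)

In a real mixer, each selected speaker would receive the mix minus its own contribution so that talkers do not hear themselves echoed back.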

Audio Receiver Path

The receiver requires the jitter buffer in the audio path because packets arriving at the receiver do not have uniform arrival times. The sending endpoint typically sends fixed-sized RTP packets onto the network at uniform intervals, generating a stream with a constant audio bit rate. However, jitter in the network due to transient delays causes nonuniform spacing between packet arrival times at the receiver. If the network imposes a temporary delay on a sequence of several packets, those...
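The following is a minimal fixed-depth jitter buffer sketch, assuming reordering by RTP sequence number and a target depth measured in packets; the class and parameter names are illustrative.

import heapq

class JitterBuffer:
    def __init__(self, depth=3):           # target depth in packets
        self.depth = depth
        self.heap = []                     # (sequence_number, payload)

    def push(self, seq, payload):
        # Packets may arrive out of order; the heap restores sequence order.
        heapq.heappush(self.heap, (seq, payload))

    def pop(self):
        """Return the next packet in sequence order, or None if the buffer
        has not yet built up enough packets to absorb network jitter."""
        if len(self.heap) < self.depth:
            return None
        return heapq.heappop(self.heap)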

Call Hold Signaling with the Empty Capability Set

To indicate to the remote device that a hold operation is in progress, the endpoint initiating the hold operation sends a special form of the TCS, known as the ECS message, sometimes referred to as TCS=0. The ECS is a TCS with all capability fields set to null; support for it is a mandatory part of H.323 Version 2 and later. It does not disconnect the call, but simply informs the remote side that the sender does not currently have any decoding capability. As a result, the remote side closes...

CAM Table Flooding

One Layer 2 exploit is a content-addressable memory (CAM) table flood, which allows an attacker to make a switch act like a hub. A hub forwards all packets to all ports. A switch learns about Ethernet MAC addresses at each of its ports so that it can forward packets only to the port that provides a link to the destination address of the packet. In a heavily switched environment, an attacker receives only packets destined for the attacker. By exploiting a CAM table flood, the attacker can cause...
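A toy model of the mechanism (illustrative only, not from the book, and not attack code) shows why a full CAM table degrades a switch into a hub: frames to unlearned destinations are flooded out every port.

class Switch:
    def __init__(self, capacity):
        self.cam = {}                  # MAC address -> port
        self.capacity = capacity

    def learn(self, src_mac, port):
        # Normal learning; once the table is full, new MACs cannot be added.
        if src_mac in self.cam or len(self.cam) < self.capacity:
            self.cam[src_mac] = port

    def forward_ports(self, dst_mac, all_ports, in_port):
        if dst_mac in self.cam:
            return [self.cam[dst_mac]]                   # switched unicast
        return [p for p in all_ports if p != in_port]    # flood, like a hub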

Canonical RTP Model

Figure 7-12 shows the canonical RTP/RTCP model for a video/audio sender and receiver.

Figure 7-12 Canonical RTP/RTCP Model

Figure 7-12 shows five different clocks. At the sender:
- Clock A, used by the audio capture hardware to sample audio data
- Clock B, used by the video capture hardware to sample video data
- Clock C, the common timebase clock at the sender, used for the purposes of stream synchronization with RTCP packets
At the receiver:
- Clock D, the clock used by the audio playout hardware to play audio data...

Codecs, Bit Rates, and Annexes Supported by Endpoints

Table A-23 identifies the annexes and codecs supported by different enterprise endpoints. The Polycom ViewStation, for example, supports annexes F, I, and T at 64K and 128K bit rates with the H.261, H.263, H.263-1998, and H.264 codecs; the VSX 3000 and VSX 7000 also support SIP signaling; and among the Cisco soft clients (Cisco Unified Personal Communicator (CUPC) and Cisco Unified Video Advantage (CUVA)), E-Conf Version 4 supports the H.264 baseline profile.

Common Reference Lip Sync

The goal of lip sync is to preserve the relationship between audio and video in the presence of fluctuating end-to-end delays in both the network and the endpoints themselves. Therefore, the most important restriction to keep in mind when discussing lip sync for video conferencing is the following: Video conferencing systems cannot accurately measure or predict all delays in the end-to-end path for either the audio or video stream. This restriction leads to the most important corollary of lip...

Components of a Conferencing System

A conferencing system is composed of several components, including a user interface, a conference policy manager, media control, a player/recorder, and other subsystems. This section explores these individual elements, providing details about the functionality found in each service and how together they make up a conferencing system. Figure 2-1 shows the major layers of a conferencing system:
- User interface: The user interface typically consists of several separate interfaces, including a scheduler to...

Conference Control

The conference control layer has three main functions:
- Conference management and scheduling: The conference scheduler works with the resource allocation module to reserve ports during the time window when meetings are scheduled to be active. The resource allocation module is aware of how the administrator has configured the system with respect to conferencing, floater, and overbook ports and uses this information when responding to resource allocation requests. At meeting time, after the user has...

Conference Policy Server

The conference policy server is the repository for the various policies stored in the system. There is only one instance of the conference policy server within the system. No standard protocol exists for communication between the focus and the policy server. Users join a conference by sending a SIP INVITE to the unique URI of the focus. If the conference policy allows it, the focus connects the participant to the conference. When a participant's SIP endpoint wants to leave the conference, the...

Conference URI

A conference in a SIP framework is identified through a conference URI. The conference URI is the destination to which all SIP requests are sent; it is created and managed by the conference server. An example of a conference URI is sip:meetingplace@cisco.com. Users can enter these URIs manually in their SIP client to dial into the conference system. Alternatively, the conference system embeds this in a web link and sends the link to the user through e-mail or instant messenger. If the user dials in...

Connection Hijacking

After two video conferencing endpoints establish a legitimate connection, an attacker might attempt to hijack the connection by impersonating one of the participants, issuing signaling commands to take over the conversation. The attacker might also use this type of spoofing to cause the connection to fail, in which case the attack is also considered a DoS attack. Solution: Endpoints can thwart connection hijacking by authenticating the signaling messages.

RTP Hijacking

Whereas connection...

Continuous Presence Conferences

Continuous presence (CP) conferences have the benefit of displaying two or more participants simultaneously, not just the image of the loudest speaker. In this mode, the video MP tiles together streams from multiple participants into a single composite video image, as illustrated in Figure 1-2. CP conferences are also referred to as composition mode conferences or Hollywood Squares conferences. The video MP can either scale down the input streams before compositing or maintain the sizes of...

Correlating Timebases Using RTCP

The RTCP protocol specifies the use of RTCP packets to provide information that allows the sender to map the RTP domain of each stream into a common reference timebase on the sender, called the Network Time Protocol (NTP) time. NTP time is also referred to as wall clock time because it is the common timebase used for all media transmitted by a sending endpoint. NTP is just a clock measured in seconds. RTCP uses a separate wall clock because the sender may synchronize any combination of media...

Delays in the Network Path

A lip sync solution must work in the presence of many delays in the end-to-end path, both in the endpoints themselves and in the network. Figure 7-3 shows the sources of delay in the network between the sender and the receiver. The network-related elements consist of routers, switches, and the WAN.

Figure 7-3 End-to-End Delays in a Video Conferencing System

Router X experiences congestion at time T, resulting in a step...

Detecting Stream Loss

Conference server components must handle endpoint failures properly. Signaling protocols might provide some failure information, such as the SIP Session-Expires header. However, the media plane of the entire conferencing architecture must ensure that a backup mechanism detects and handles an endpoint failure in mid-session. The two common mechanisms to handle such scenarios are Internet Control Message Protocol (ICMP) unreachable messages and RTP inactivity timeouts. If the application...
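The RTP inactivity mechanism can be sketched as a simple watchdog; the class name, timeout value, and endpoint ids below are illustrative assumptions.

import time

class InactivityWatchdog:
    def __init__(self, timeout_s=30):
        self.timeout_s = timeout_s
        self.last_seen = {}            # endpoint id -> last RTP arrival time

    def on_rtp_packet(self, endpoint_id):
        # Called for every RTP packet received from the endpoint.
        self.last_seen[endpoint_id] = time.monotonic()

    def expired_endpoints(self):
        # Endpoints silent longer than the timeout are treated as failed.
        now = time.monotonic()
        return [ep for ep, t in self.last_seen.items()
                if now - t > self.timeout_s]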

DHCP Exhaustion

DHCP exhaustion is a Layer 2 attack that also results in a DoS. An attacker sends a flood of DHCP request packets to the DHCP server, each requesting an IP address for a random MAC address. Eventually, the DHCP server runs out of available IP addresses and stops issuing DHCP bindings. This failure means that other hosts on the network cannot obtain a DHCP lease, which causes a DoS. Solution: Cisco switches implement a feature called DHCP snooping, which places a rate limit on DHCP requests.

Early and Delayed Offer

Endpoints establish connections on the media plane by first negotiating media properties such as codec types, packetization periods, media IP address/RTP port numbers, and so on. This information is transmitted with SIP messages using SDP. An endpoint may use two methods of exchanging SDP information:
- Early offer: In the early offer, the endpoint sends the media SDP in the initial INVITE and receives an answer from the conference server.
- Delayed offer: In a delayed offer, the endpoint sends an...

Endpoint-Independent Filtering

Figure 8-9 shows a NAT that uses endpoint-independent filtering.

Figure 8-9 Endpoint-Independent Filtering

Figure 8-9 includes the following addresses that appear on the internal private network:
- Ai:Pi, the source address:port of packets from the internal endpoint
- Ae:Pe, the destination address:port of packets from the internal endpoint
Figure 8-9 also includes the following addresses that appear on the public network:
- Am:Pm, the source address:port of packets from the NAT to endpoints on the public...

Entropy Coding

Table A-18 shows the attributes of entropy coding in H.264.

Table A-18 Entropy Coding for H.264

- The run and level are not coded jointly.
- H.264 codes the number of coefficients using a context-adaptive VLC table.
- H.264 codes the zero-run length sequence using a context-adaptive VLC.
- H.264 codes the coefficient levels using a fixed VLC table.
- H.264 codes trailing ones (+1 or -1) as a special case.
- Motion vectors are coded using a modified Exp-Golomb, nonadaptive VLC.
- Two zigzag...

Entry IVR

Entry IVR prompts: "Welcome to xxx. Enter conference ID."

In a distributed conferencing model, however, one central, logical conference server is composed of many individual servers. An endpoint might need to be moved from one physical server to another. In Figure 5-12, endpoint EP dials into the entry IVR associated with the conference server, enters the meeting ID, and goes through the name-recording process. Centralized logic then moves the endpoint to another entity in the conference server that hosts the...

Error Resiliency

If the network drops bitstream packets, decoders may have difficulty resuming the decoding process for several reasons:
- Bitstream parameters may change incrementally from one MB to another. One example is the quantization level: most codecs allow the bitstream to change the quantization level by a delta amount between MBs. If the network drops a packet, the decoder will not have access to the previous incremental changes in the quantization level and will not be able to determine the current...

Escalation of Point-to-Point to Multipoint Call

In this scenario, a point-to-point call between two participants becomes a conference call with more than two parties. Participant A is in a point-to-point call with participant B and wants to invite a third participant, participant C. Participant A finds a conference server, sets up the conference, gets the URI or meeting ID, and transfers the point-to-point call to the conference server. Participant A then invites participant C into the conference call. Participant A can add participant C...

Evaluating Video Quality: Bit Rate and Signal-to-Noise Ratio

When evaluating the efficiency of a video codec, there is one primary criterion: the quality at a given bit rate. Most video conferencing endpoints negotiate a maximum channel bit rate before connecting a call, and the endpoints must limit the short-term one-way average bit rate to a level below this negotiated channel bit rate. A higher-efficiency codec can provide a higher-quality decoded video stream at the negotiated bit rate. Quality can be directly measured in two ways: by visually...
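The standard objective measure behind the signal-to-noise criterion is peak signal-to-noise ratio (PSNR). The helper below is a generic sketch operating on 8-bit luma samples, not taken from the book.

import math

def psnr(original, decoded, peak=255):
    """original/decoded: equal-length sequences of 8-bit luma samples."""
    mse = sum((o - d) ** 2 for o, d in zip(original, decoded)) / len(original)
    if mse == 0:
        return float("inf")            # identical frames
    return 10 * math.log10(peak * peak / mse)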

Event Subscription and Notification

RFC 3265 extends the SIP specification, RFC 3261, to support a general mechanism allowing subscription to asynchronous events. Such events can include statistics, alarms, and so on. The two types of event subscriptions are in-dialog and out-of-dialog. A subscription that uses the Call-ID of an existing dialog is an in-dialog subscription, whereas the out-of-dialog subscription carries a Call-ID that is not part of the existing ongoing dialogs. Figure 5-6 shows an example of out-of-dialog...

Feedback Information

At Cisco Press, our goal is to create in-depth technical books of the highest quality and value. Each book is crafted with care and precision, undergoing rigorous development that involves the unique expertise of members from the professional technical community. Readers' feedback is a natural continuation of this process. If you have any comments regarding how we could improve the quality of this book, or otherwise alter it to better suit your needs, you can contact us through email at...

Forming RTCP Packets

Each RTP stream has an associated RTCP packet stream, and the sender transmits an RTCP packet once every few seconds, according to a formula given in RFC 3550. As a result, RTCP packets consume a small amount of bandwidth compared to the RTP media stream. For each RTP stream, the sender issues RTCP packets at regular intervals, and those packets contain a pair of time stamps: an NTP time stamp and the corresponding RTP time stamp associated with that RTP stream. This pair of time stamps...
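The following sketch shows how a receiver can use one such (NTP, RTP) pair to place a later RTP time stamp on the sender's wall clock; it ignores 32-bit time stamp wraparound, and the variable names are illustrative.

def rtp_to_ntp(rtp_ts, sr_rtp_ts, sr_ntp_seconds, clock_rate):
    """Map an RTP time stamp to sender NTP time, using the most recent
    sender report pair (sr_rtp_ts, sr_ntp_seconds) and the stream clock rate."""
    elapsed = (rtp_ts - sr_rtp_ts) / clock_rate     # seconds since the SR
    return sr_ntp_seconds + elapsed

# 8-kHz audio: 1600 ticks after the SR is 0.2 s of wall clock time.
print(rtp_to_ntp(161600, 160000, 1000.0, 8000))     # 1000.2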

Full Mesh Networks

Another option for decentralized conferencing is a full-mesh conference, shown in Figure 2-7. This architecture has no centralized audio mixer or MP. Instead, each endpoint contains an MP that performs media mixing, and all endpoints exchange media with all other endpoints in the conference, creating an N-by-N mesh. Endpoints with less-capable MPs provide less mixing functionality. Because each device sends its media to every other device, each one establishes a one-to-one media connection with...
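The quadratic growth implied by an N-by-N mesh is easy to quantify; the small helper below is illustrative arithmetic, not from the book.

def mesh_load(n):
    per_endpoint = n - 1               # streams each endpoint sends (and receives)
    total_links = n * (n - 1) // 2     # distinct one-to-one media connections
    return per_endpoint, total_links

for n in (4, 8, 16):
    print(n, mesh_load(n))             # e.g. 16 endpoints -> 15 streams each, 120 links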

Gatekeeper Signaling Options

There are two signaling modes in a gatekeeper-controlled H.323 network: gatekeeper-routed call signaling (GKRCS) and direct endpoint signaling. When the gatekeeper is configured for direct endpoint signaling, the calling and called endpoints exchange RAS admission control messages with the gatekeeper, but the H.225 and H.245 messages are exchanged directly between the calling and called endpoints, without gatekeeper involvement. Figure 6-12 shows the signaling path for direct endpoint signaling.

Figure 6-12 Direct Endpoint...

H

H.224, FECC applications, 17-18 H.225, 188 gatekeepers, 217 messages, 188-189 Alerting, 190 Call Proceeding, 190 Connect, 190 Notify, 191 Release Complete, 191 Setup, 189-190 Setup ACK, 190 H.232v4, H.235, 313 H.235.1, 314-316 H.235.2, 316-319 H.235.3, 319 H.235.6, 319-320 H.245, 191-192 DTMF relay support indicators, 193-194 messages CLC ACK, 201 Close...

H.225 Call Setup for Video Devices Using a Gatekeeper

The message sequence chart shown in Figure 6-16 illustrates two endpoints registering with a gatekeeper. The call flow shows endpoint A initiating a video call to endpoint B. In the diagram, both endpoints first register with the H.323 gatekeeper. After registration, endpoint A initiates a call to endpoint B using the gatekeeper direct endpoint signaling model.

Figure 6-16 H.225 Connection Establishment with a Gatekeeper

H.225 Video Call Establishment via Gatekeeper Direct Endpoint Signaling...

H.245 Control Protocol

The H.245 recommendation provides the mechanism for the negotiation of media types and RTP channel establishment between endpoints. Using the H.245 control protocol, endpoints exchange details about the audio and video decoding capability each device supports. H.245 also describes how logical channels are opened so that media may be transmitted. Like H.225, H.245 messages are encoded using ASN.1 notation. The H.245 session information is conveyed to the calling device during the H.225 exchange....

H.323 Overview

H.323 is a widely deployed International Telecommunication Union (ITU) standard, originally established in 1996. It is part of the H.32x series of protocols and describes a mechanism for providing real-time multimedia communication (audio, video, and data) over an IP network. In this chapter, the intent is to familiarize you with some of the basic concepts involved in the H.323 architecture and signaling models, with an emphasis on voice and video conferencing. It does not attempt to cover all...

How This Book Is Organized

Chapter 1 provides an overview of the conferencing models and introduces the basic concepts. Chapters 2 through 8 are the core chapters and can be read in any order. If you intend to read them all, the order in the book is an excellent sequence to use. The chapters cover the following topics:
- Chapter 1, "Overview of Conferencing Services": This chapter reviews the elementary concepts of conferencing, describing the various types of conferences and the features found in each. It also provides an...

Hybrid Decoder

When analyzing a hybrid codec, it is easier to start by analyzing the decoder rather than the encoder, because the encoder has a decoder embedded within it. Figure 3-17 shows the block diagram for the hybrid decoder. The encoder creates a bitstream for the decoder by starting with an original image, with frame number N, denoted by Fn(o). Because this frame is the original input to the encoder, it is not shown in the decoder diagram of Figure 3-17. For this image, the output of the encoder...

Intra Prediction

H.264 has an intra prediction mode that predicts pixels in the spatial domain before the intra transform process. For luminance, the encoder can use two different modes a 16x16 prediction mode or a 4x4 prediction mode. For chrominance, the encoder can use an 8x8 prediction mode. In both cases, the pixels inside the block are predicted from previously decoded pixels adjacent to the block. The 16x16 prediction mode has four methods of prediction. Figure A-3 shows two modes. Figure A-3 Two of the...

IPsec

IPsec operates by applying encryption at the IP layer, below the TCP and UDP stack. Because IPsec applies to the lowest layers of the IP stack, endpoints typically implement it as part of the operating system kernel, independently of the upper-layer application. Therefore, the applications are unaware of the underlying security, but the IPsec tunnel protects the UDP and TCP packets. However, administrators and users must manually configure IPsec on the originating and terminating endpoints and...

ISDN Gateway

In the early days of IP video conferencing, the only practical way to allow NAT/FW traversal between enterprises was to circumvent the problem by using H.320 ISDN gateways to connect two endpoints over the public switched telephone network (PSTN). Figure 8-14 shows the topology for interenterprise H.323 connectivity, in which two endpoints connect over the PSTN WAN.

Figure 8-14 Using ISDN to Circumvent the NAT/FW Traversal Problem

The major downside of this approach is the added delay of...

Layer 2 Attacks

Several attacks are possible at Layer 2, the Ethernet link layer. These attacks often require the attacker to have direct access to the internal network. Layer 2 attacks are extremely virulent because after an attacker compromises Layer 2, the layers above it might not detect the attack. Solution: Add security at Layer 2 within the network. A deployment that implements Layer 2 protection inside the network and Layer 3 firewall protections at the edge achieves layered security. An enterprise...

Lecture Mode Conferences

A lecture mode conference has a lecturer who presents a topic, and the rest of the participants can ask questions. There are two different styles of lecture mode meetings:
- Open: Open meetings allow participants to ask questions any time without requesting permission to speak.
- Controlled: In a controlled meeting, the meeting administrator or lecturer must give a participant permission to ask questions or speak. If the administrator denies the request from an audience member to ask a question, the...

Low Resolution Video Input

If the video endpoint is configured to send low-resolution video, the endpoint typically starts with a full-resolution interlaced video sequence and then discards every other field. The resulting video has full resolution in the horizontal direction but half the resolution in the vertical direction, as shown in Table 7-2.

Table 7-2 Video Formats Field Sizes

When capturing from a typical interlaced camera and using only one of the fields, the encoder must...

M

Macroblocks, 101-102, 172 malleable playout devices, 244 malware, 262 mapping characteristics of NAT, 278-279 matrix quantization, 61 MC (multipoint controller), 10 MCTF (motion-compensated temporal filtering), 353 MCUs (multipoint control units), 9, 26, 209 MC, 10 service prefixes, 219-220 transrating, 12 media control support for ad hoc video conferencing, 172-173 media encryption MIKEY, 313 security-descriptions, 312 media multiplexing, 294 media plane, 22, 27 generation module, 32 speaker...

Man-in-the-Middle Attacks

A MitM attack occurs when an attacker inserts a rogue device between two connected endpoints. The MitM can then listen to packets that flow between the endpoints and can modify packets in transit. The MitM is invisible to the two endpoints, which are unaware of the attack. One way for an attacker to become a MitM is to spoof the identity of each endpoint to the other. Figure 8-3 shows this scenario.

Figure 8-3 A Man-in-the-Middle Attack Between Two Endpoints

Mid-Call Bandwidth Requests

When a device needs to modify the session bandwidth during a call, it sends a bandwidth request message to the gatekeeper. For instance, an endpoint might need to request additional bandwidth when it adds video streams to an existing call. Endpoints adjust the bandwidth by sending a Bandwidth Request (BRQ) message to the gatekeeper with the new bandwidth requirement. If the bandwidth is available, the gatekeeper grants the request, signaled via the Bandwidth Confirm (BCF) message. If the...

MIKEY

Another key exchange method is Multimedia Internet Keying (MIKEY). The base MIKEY specification is defined in RFC 3830, and the method that describes using it with SDP information is RFC 4567. Like s-descriptions, MIKEY inserts the key material as a parameter entry inside the SDP section of the SIP message. However, unlike s-descriptions, MIKEY encrypts this SDP entry. One of the benefits of MIKEY is that the SDP information, and therefore the SIP messaging, can transit in the clear, without an...

Motion Vectors

For the purpose of assigning MVs, each 16x16 MB may be segmented in several ways: as a 16x16 block, as two 8x16 blocks, as two 16x8 blocks, or as four 8x8 blocks. The four-8x8 segmentation mode allows any of the 8x8 blocks to be further subdivided as two 4x8 blocks, two 8x4 blocks, or four 4x4 blocks, as shown in Figure A-2.

Figure A-2 Segmentation of a Macroblock in H.264

As a result, an H.264 MB may contain 16 4x4 blocks, and in a B-frame, each block may have up to two MVs, for a total of 32...

Mute and Unmute

An endpoint can mute itself using one of two methods:
- The endpoint can halt transmission of audio/video media packets to the conference server.
- The endpoint can request that the conference server ignore packets from the endpoint.
An endpoint can instruct a conference server to ignore audio or video media packets by sending the proper DTMF tones. In Figure 5-16, the key sequence 5 notifies the conference server that the endpoint wants to be muted. In response, the conference server plays an...

NAT Complications for VoIP Protocols

NAT presents multiple problems for video conferencing and VoIP protocols, such as the following:
- External endpoints cannot connect to an internal endpoint in the private address space until the internal endpoint creates a NAT binding by sending packets to the external endpoint. In other words, internal endpoints may not receive unsolicited connections. Of course, this restriction may be considered a security feature. However, one of the goals of NAT traversal is to allow authorized external...

NAT Mapping Characteristics

The mapping characteristic of a NAT describes how the NAT allocates external addresses Am:Pm, based on the internal source address Ai:Pi. The NAT may implement two main types of mapping:
- Endpoint-independent mapping: The internal endpoint may send packets with source address Ai:Pi to multiple external endpoints, each with different addresses. Figure 8-7 shows a NAT that implements endpoint-independent mapping. In this case, the NAT uses the same external mapped address Am:Pm for packets destined...

Overview of RTCP

RTCP is the companion control protocol for RTP. It provides periodic reports that include statistics, quality of reception, and information for synchronizing audio and video streams. As stated in RFC 3550, RTCP performs two major functions:
- It provides feedback on the quality of the media distribution. This function is performed by RTCP receiver and sender reports.
- For each sender, RTCP maps RTP time stamps for each RTP stream to a common sender clock, which allows audio and video...

Overview of RTP

The Audio Video Transport (AVT) working group of the Internet Engineering Task Force (IETF) developed RTP in 1996 and adopted it as a standard in RFC 1889. Subsequently, the IETF added more refinements to the protocol and republished it as RFC 3550. Always refer to the later RFC for the most current information on RTP. Figure 4-1 shows the relevance of RTP to other protocols used in IP collaboration systems.

Figure 4-1 RTP in IP Collaboration Systems

Poor Man's Lip Sync

The simplest incarnation of a lip sync algorithm is known as Poor Man's lip sync. In this method, the receiver uses one criterion to synchronize audio and video: packets of audio and video that arrive simultaneously at the network interface of the receiver are considered to be synchronized to each other. This approach is fundamentally flawed because delays in the end-to-end path vary both in space (at different points of the path) and time (fluctuations in delay from one moment to the next). In...

Predictor Loops for Parameters

The section Predictor Loop explained the predictor loop for a hybrid codec, in which the output of the encoder is a coded residual image based on a motion-compensated predicted frame. This prediction loop forms the outer loop of a hybrid coder. However, this paradigm of coding a residual can be extended to other parts of the codec algorithm and is not limited to motion-compensated prediction of pixel areas. Most codecs use smaller prediction loops in various parts of the bitstream. Like the...

Preprocessing

Before an image is handed to the encoder for compression, most video conference endpoints apply a preprocessor to reduce video noise and to remove information that goes undetected by the human visual system. Noise consists of high-frequency spatial information, which can significantly increase the pixel data content, and therefore increase the number of bits needed to represent the image. One of the simpler methods of noise reduction uses an infinite impulse response (IIR) temporal filter, as...
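The general shape of such a first-order IIR temporal filter can be sketched as a per-pixel blend of the new frame with the previous filtered frame; the coefficient and helper name below are illustrative assumptions.

def iir_temporal_filter(frame, prev_filtered, alpha=0.75):
    """frame, prev_filtered: equal-length sequences of luma samples.
    Blending toward the previous filtered frame suppresses uncorrelated,
    high-frequency temporal noise."""
    if prev_filtered is None:
        return list(frame)             # first frame passes through
    return [alpha * x + (1 - alpha) * y
            for x, y in zip(frame, prev_filtered)]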

R

RAS messages (H.323), 213-214 RAS signaling (H.323), 212-213 receiver-side processing, 241 reconnaissance attacks, mitigating, 264 reconstructed images, 74 record routing (SIP), 153 redirect servers, 147 redundant slices, error resiliency, 90 reenrollment, 309 reference frames, 73 reflexive transport addresses, 276 registrars, 147 Release Complete messages (H.225), 191 Request Channel Close message (H.245), 201 components of, 150-151 required H.323 gatekeeper features, 209-210 reservationless...

Receiver Video Path

The receiver has several delays in the video path:
- The packetization delay: This latency might be required if the video decoder needs access to more than one slice (or group of blocks) to start the decoding process. However, video conferencing endpoints typically use a low-latency bitstream that allows endpoints to decode a slice without needing to use information from other slices. In this case, the input video packetization process simply reformats the video packet and does not perform any type...

References

ATSC Implementation Subcommittee, "Finding Relative Timing of Sound and Vision for Broadcast Operations," Document IS-191 of the ATSC (Advanced Television Systems Committee), June 2003. www.atsc.org/standards/is_191.pdf

Blakowski, G., and R. Steinmetz, "A Media Synchronization Survey: Reference Model, Specification, and Case Studies," IEEE Journal on Selected Areas in Communications, Vol. 14, No. 1, January 1996.

Schulzrinne, H., S. Casner, R. Frederick, and V. Jacobson, IETF RFC 3550, "RTP: A Transport...

RSVP/QoS Support in Conferencing Flows

Bandwidth reservation is important for the audio and video streams, and RFC 3312 provides the resource-reservation support in SIP. Audio streams should have a higher quality of service (QoS) than video streams because video tolerates delays better than audio. The endpoint may include a successful bandwidth reservation as a precondition of joining the conference. Or, the endpoint can make the reservation optional. Figure 5-18 shows a Resource Reservation Protocol (RSVP) conference flow where the...

RTCP Receiver Report

The RTP receivers (endpoints or conference server) provide periodic feedback on the quality of the received media through the RR packet type. An endpoint can use this information to dynamically adjust its transmit rate based on network congestion. For example, if a video endpoint detects high network congestion as a result of packet loss, the endpoint may choose to send at a lower bit rate until the congestion clears. Figure 4-7 illustrates the format of the RR report, and the following list...
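One RR field worth seeing in code is the interarrival jitter estimate defined in RFC 3550, where both arrival times and RTP time stamps are expressed in RTP time stamp units; the function below is a direct sketch of that formula.

def update_jitter(jitter, prev_arrival, prev_rtp_ts, arrival, rtp_ts):
    # D is the difference in relative transit time between two packets;
    # the running estimate moves 1/16 of the way toward each new sample.
    d = (arrival - prev_arrival) - (rtp_ts - prev_rtp_ts)
    return jitter + (abs(d) - jitter) / 16.0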

RTCP Sender Report

The RTP senders (endpoints or conference server) provide information about their RTP streams through the SR packet type. SRs serve three functions:
- They provide information to synchronize multiple RTP streams.
- They provide overall statistics on the number of packets and bytes sent.
- They provide one half of a two-way handshake that allows endpoints to calculate the network round-trip time between the two endpoints.
Figure 4-6 illustrates the format of the SR.

Figure 4-6 RTCP Sender Report Format...
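The round-trip-time half of that handshake reduces to one subtraction once the receiver report echoes back the LSR (last SR time stamp) and DLSR (delay since last SR) fields. The sketch below uses seconds for clarity; on the wire these fields use a 16.16 fixed-point NTP format.

def round_trip_time(rr_arrival, lsr, dlsr):
    """RTT = time RR arrived - time SR was sent - receiver's holding delay."""
    return rr_arrival - lsr - dlsr

print(round_trip_time(1000.350, 1000.000, 0.250))   # 0.1 s round trip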

RTCP Source Description (SDES)

RTCP SDES packets provide participant information and other supplementary details (such as location information, presence, and so on). Figure 4-8 shows the packet format of the RTCP SDES. The following list explains the format:
- Payload type: Set to 202.
- Source count (SC): Indicates the number of SSRC/CSRC items included in this packet.
- SDES items: Follow the SSRC. A list of SDES items describes that SSRC source. Each of the SDES items is of the format Type (8 bits), Length (8 bits), and Value (text of...

RTP Header Extensions

RFC 3550 provides the flexibility for individual implementations to extend the RTP header to add information. RTP header extensions are most useful in distributed conferencing systems. To extend the RTP header, the sender sets the X bit to 1 in the first octet of the RTP fixed header. Figure 4-4 shows the RTP header extension format. The first 16 bits of the header extension are left open for distinguishing identifiers or parameters. The format of these 16 bits is defined by the application...

RTP Time Stamps

Each capture device (microphone and video capture hardware) has a clock that provides the RTP time stamps for its media stream. The units for the RTP time stamps depend on whether the media stream is audio or video. For the audio stream, RTP uses a sample clock that is equal to the audio sample rate. For example, an 8-kHz audio stream uses a sample clock of 8 kHz. In this case, RTP time stamps for audio are actually sample stamps, because the time stamp can be considered a sample index. If an...
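The sample-clock rule makes time stamp increments simple arithmetic. The numbers below work through the common cases of 20-ms packets of 8-kHz audio and 30 frame/s video on the standard 90-kHz RTP video clock.

audio_rate = 8000                      # 8-kHz audio sample clock
packet_ms = 20
audio_increment = audio_rate * packet_ms // 1000
print(audio_increment)                 # 160 ticks per 20-ms audio packet

video_rate = 90000                     # standard RTP video clock
fps = 30
video_increment = video_rate // fps
print(video_increment)                 # 3000 ticks per frame at 30 frames/s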

Scalable Layered Codecs

Scalable codecs offer a way to achieve progressive refinement for a video bitstream. A scalable bitstream is composed of a base layer accompanied by one or more enhancement layers. The base layer provides a base level of quality, and each enhancement layer provides incremental information that adds to the quality of the base layer. These codecs are also called layered codecs because they provide layers of enhancement. A video conferencing system can use scalable codecs in several ways: Capacity...

Sender Audio Path

This section focuses on the audio path, which uses an analog-to-digital (A/D) converter to capture analog audio samples and convert them into digital streams. For the purposes of synchronization, it is necessary to understand how each of the processing elements adds delay to the media stream. The delays in the audio transmission path consist of several components:
- Audio capture/packetization delay: Typically, audio capture hardware provides audio in packets, consisting of a fixed number of...

Sender Video Path

Video capture hardware digitizes each image from the video camera and stores the resulting fields of video in a set of circular frame buffers in memory, as shown in Figure 7-7. The capture hardware fills the frame buffers in order until it reaches the last buffer, and then it loops back to frame 1, overwriting the data in frame buffer 1. Notice that each frame buffer contains two fields, an odd field and an even field, corresponding to the odd and even field of each frame of interlaced video. To...

Send-Side Voice Activity Detection Module

Voice Activity Detection (VAD) is a network optimization that omits packets with a low energy level. If the energy level drops below a certain threshold, RTP packets are no longer transmitted. The use of VAD can significantly reduce the amount of bandwidth consumed by a VoIP call. When VAD is active, the sending side stops transmitting audio RTP packets and instead transmits a special silence packet to the remote device. The silence packet carries a silence detection (SID) payload, indicating...
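An energy-threshold classifier captures the idea; the threshold and names below are illustrative assumptions, and real VAD implementations use more robust speech detection.

def classify_frames(frames, threshold=1e4):
    """Yield ('speech', frame) or ('sid', None), collapsing silence runs:
    one SID indication starts a silence run, and further silent frames
    produce no output at all (no RTP packets would be sent)."""
    in_silence = False
    for frame in frames:
        energy = sum(s * s for s in frame) / len(frame)
        if energy >= threshold:
            in_silence = False
            yield ("speech", frame)
        elif not in_silence:
            in_silence = True
            yield ("sid", None)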

Session Description Protocol

SIP uses SDP (defined in RFC 2327), which defines a syntax to describe media sessions. The SDP is carried as an application body (Content-Type: application/sdp) in SIP messages. SDP consists of text messages using the ISO 10646 character set in UTF-8 encoding. An SDP description consists of a session-level description (details that apply to the whole session and all media streams) and, optionally, several media-level descriptions (details that apply to a single media stream). Table 5-2 describes the...

Setting Up Scheduled Conferences

When creating a scheduled meeting, the meeting organizer might specify the resources required to support the number of participants and whether a meeting should support video callers. The organizer also specifies the start and end times of the meeting. Because conferencing system resources such as dial-in capacity and audio processing power are finite, the scheduling system must manage these facilities. The conferencing system's scheduler must ensure that a meeting will actually have the...

Signaling Protocols Conferencing Using SIP

Session Initiation Protocol (SIP) is a signaling protocol used for establishing media (audio, video, and instant messaging) sessions as part of audio/video conferencing, telephony, and other IP collaboration systems. SIP can also be used for presence and event notifications. SIP is defined in RFC 3261. This chapter addresses the following topics:
- Overview of SIP, including different elements of the protocol and message structures
- Overview of Session Description Protocol (SDP) and its different...

SIP Encryption

The SIP standard defines a method of establishing a secure SIP signaling connection by using TLS on port 5061. In this case, endpoints use a sips URL rather than the usual sip URL. TLS offers either single-sided authentication or mutual authentication, and it provides encryption and integrity for data flow in both directions. The downside of TLS is that it is hop by hop: for the end-to-end connection to be secure, devices at all hops in the end-to-end path must trust each other. An example of a...

SIP Requests

The following are the different types of SIP requests:
- INVITE: Invites an endpoint to join the call
- BYE: Terminates the dialog between two UAs
- OPTIONS: Requests information on the capabilities of the remote UA
- MESSAGE: Sends instant messages (not part of a dialog)
- ACK: Confirms that a UA has received a final response to an INVITE method
- REGISTER: Provides the registration of the location
- CANCEL: Terminates the last pending request
- INFO: A mid-session method to pass the informational elements
- PRACK...

SIP Transactions and Dialogs

A transaction is defined by a request/response sequence: a SIP client sends requests to a SIP server, and the SIP server returns responses to the client. In Figure 5-5, a SIP UA sends an INVITE to another SIP UA and receives the responses (100 Trying, 200 OK). The initial INVITE and the responses are considered to be part of one transaction. In general, ACK is not considered part of the transaction. Later SIP messages may include the disconnect request, known as the BYE message; these later...

SNR and Spatial Scalability

SNR scalability uses a base layer, providing a lower level of image quality. Each enhancement layer acts much like the residual difference image of a hybrid codec and represents a correction layer that is added to the base layer. The addition of each enhancement layer reduces the error between the decoded image and the original image, thus increasing the SNR and the quality. Spatial scalability uses a base layer consisting of a smaller-size image sequence. The enhancement layers add information...

SSRC Collisions

SSRC collision occurs if the endpoint and the conference server choose the same SSRC for their RTP streams. RFC 3550 specifies solutions for how to handle SSRC collisions. If the conference server finds that both it and the endpoint use the same SSRC for the same session, the conference server should send an RTCP BYE packet, close the connection, and reestablish the connection using another SSRC. RFC 3550 requires that the SSRC identifiers be unique among the devices in the mixer or...

Standard and High Definitions

Chapter 7, "Lip Synchronization in Video Conferencing," describes the formats for standard-definition (SD) and high-definition (HD) video. Some high-end video conferencing systems, such as telepresence endpoints, support HD video cameras. These cameras provide video images with a higher resolution than the traditional SD formats (NTSC/PAL/SECAM) allow. SD and HD differ in several aspects:
- Aspect ratio: Aspect ratio refers to the ratio of width to height of the video frame. SD typically has a...

Summary

This chapter provided an overview of voice and video conferencing systems. The chapter discussed the various modes in which conferencing systems operate and briefly described the components that comprise a system. In addition, you learned about the features available in each conference type and how the user interacts with and invokes them. The chapter closed with a description of the three tiers of video conferencing endpoints currently available in the marketplace and a description of their...

Symmetric Encryption

Data encryption allows a sender and receiver to ensure the confidentiality of data. Video conferencing algorithms encrypt signaling or media using symmetric encryption schemes, which use a single fixed-length key to both encrypt and decrypt the data. Figure 8-21 shows the operation of symmetric encryption. The original, unencrypted data is called the cleartext, and the encrypted data is called the ciphertext. The conferencing industry is moving to adopt the Advanced Encryption Standard (AES)...
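As an illustration, the sketch below uses AES-GCM from the Python "cryptography" package to turn cleartext into ciphertext and back with one shared 128-bit key. This is a generic example; the exact AES mode and keying that media encryption protocols such as SRTP use differ.

import os
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

key = AESGCM.generate_key(bit_length=128)   # one shared symmetric key
nonce = os.urandom(12)                      # unique per encryption

ciphertext = AESGCM(key).encrypt(nonce, b"cleartext media", None)
cleartext = AESGCM(key).decrypt(nonce, ciphertext, None)
assert cleartext == b"cleartext media"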

Telepresence Systems

At the extreme high end of room conferencing is the telepresence system. These systems use studio-quality high-definition cameras, large display systems, and special room lighting to provide a life-size view of the remote conference room and participants. Discrete multichannel, high-quality speaker systems and spatial audio codecs provide a vastly improved experience over traditional room conferencing systems. Some systems such as the Hewlett-Packard HALO video collaboration system require a...

Temporal Scalability

In a bitstream with temporal scalability, the base layer represents a lower frame rate sequence, and the enhancement layer adds information to increase the frame rate. Two methods are commonly used for temporal scalability: B-frames and temporal sub-band filtering. B-frames offer temporal scalability, because they can be discarded by either the encoder or the decoder. As described previously, no frame in a bitstream relies on information in B-frames, which means that either the encoder or decoder...

Trademark Acknowledgments

All terms mentioned in this book that are known to be trademarks or service marks have been appropriately capitalized. Cisco Press or Cisco Systems, Inc., cannot attest to the accuracy of this information. Use of a term in this book should not be regarded as affecting the validity of any trademark or service mark.

Publisher: Paul Boger
Associate Publisher: Dave Dusthimer
Executive Editor: Kristin Weinberger
Managing Editor: Patrick Kanouse
Development Editor: Dayna Isley
Senior Project Editor: San...

Understanding the Receive Side

Figure 7-10 shows the receiver-side processing. The audio path consists of the jitter buffer, followed by the audio decoder, followed by the digital-to-analog (D/A) converter. The video path consists of a video decoder, a video buffer, and a video playout device.

Figure 7-10 Receiver-Side Processing

Understanding the Sender Side

Figure 7-4 shows the video and audio transmit subsection of a video conferencing endpoint. The microphone and camera on the left provide analog signals to the capture hardware, which converts those signals into digital form. The sender encodes both audio and video streams and then packetizes the encoded data for transport over the network.

Using RTCP for Media Synchronization

The method of synchronizing audio and video is to consider the audio stream the master and to delay the video as necessary to achieve lip sync. However, this scheme has one wrinkle: if video arrives later than audio, the audio stream, not the video stream, must be delayed. In this case, audio is still considered the master; however, the receiver must first add latency to the audio jitter buffer to make the audio the most delayed stream and to ensure that synchronization can be achieved by...
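The delay decision reduces to buffering whichever stream is currently less delayed; the sketch below (illustrative names and numbers) returns the extra buffering each stream needs so both play out at the same total latency.

def added_delays(audio_latency_ms, video_latency_ms):
    """Return (extra_audio_ms, extra_video_ms) so that both streams end up
    with the same total end-to-end latency."""
    if audio_latency_ms >= video_latency_ms:
        # Usual case: video is ahead, so hold video to match audio.
        return 0, audio_latency_ms - video_latency_ms
    # Wrinkle case: video is later, so the audio jitter buffer grows.
    return video_latency_ms - audio_latency_ms, 0

print(added_delays(180, 120))   # video buffered 60 ms extra
print(added_delays(120, 180))   # audio buffered 60 ms extra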

Using Service Prefixes with MCUs

MCUs can host multiple conferences simultaneously, and a single conference may have multiple video layouts or video presentation modes. Predefined service prefix codes allow MCUs to associate network services and video layouts with specific patterns within E.164 access numbers. Users can call different numbers to access the same meeting, but with different bit rates and different video layouts. For example, a user could start a conference by dialing the digit sequence shown in Table 6-3. In Table 6-3,...

Using the Empty Capability Set

Basic phone features include the ability to transfer a call to another party and to place a call on hold and resume it later. Calls are placed on hold or transferred by means of the hold and transfer buttons on the phone. As part of the hold and transfer operations, the RTP media channels are closed and reopened again. In the case of hold/resume, the channels are reopened to the same phone; for transfer, media resumes with a new device. The next section describes how the Empty Capability Set (ECS)...

Video Formats

Table A-20 shows the source video formats and options possible with MPEG-4, Part 2.

Table A-20 Video Formats for MPEG-4, Part 2

- Field/frame coding per MB: The top half of the MB is one field, and the bottom half is the other field.
- There are no standard sizes; all sizes are custom.
- Five standard aspect ratios, plus custom aspect ratios.
- There are no standard frame rates; all frame rates are custom.

Video Source Format

Most video conferencing endpoints can accept analog video signals from a standard-definition video camera. Three video formats exist:
- National Television Systems Committee (NTSC), used primarily in North America and Japan
- Phase-Alternating Line (PAL), used primarily in Europe
- Séquentiel couleur à mémoire (SECAM), used primarily in France
Many video endpoints can accept either NTSC or PAL formats, whereas SECAM is less well supported. Table 7-1 shows the maximum possible resolution of each format...

Video Stream Hierarchy

Most codecs organize the bitstream into a hierarchy, as shown in Figure 3-34.

Figure 3-34 Definition of the Bitstream Hierarchy

At the top of the hierarchy is a group of pictures (GOP). A GOP often consists of a fixed pattern of I-, P-, or B-frames. One level down in the hierarchy is a picture, consisting of an intra- or interframe. One level further down within this frame is a group of MBs; codecs may refer to this group as either a slice or a...

Video Stream RTP Formats

This section describes the RTP payload formats for three video codecs: H.263v1, H.263v2, and H.264. The payload formats describe how the bitstream for a single frame may be fragmented across multiple RTP packets. In addition, each payload format defines a payload header, containing details such as key frame indicators. Because H.263 has largely replaced H.261, this section does not go into the details of H.261 packetization. As discussed earlier in this chapter, each RTP packet consists of three...

Video Switcher

The video switcher gets the video streams from the endpoints, applies the conference policy to select one or more of the video streams, and sends the video streams back to the endpoints (with no transcoding, transrating, or composition). The video switcher is implemented either as an appliance that just runs this application or as part of the conference server. The video switcher is also known as a media switcher or video passthrough device. Video switchers do not change the payload carried in...

Video Transcoder

Video transcoding converts one stream type into another and changes one or more of the video characteristics. The block diagram of a transcoder is shown in Figure 2-4. A video transcoder may change the encoding format (codec), bit rate, resolution, and frame rate by decoding the incoming stream into a raw video buffer and then re-encoding it. Because the transcoder can easily select the output bit rate, transrating functionality is built in, and therefore, conference topologies do not need a...

Video Transrater

A video transrater is a device inserted in the path between two endpoints that lowers the video bit rate in one direction. Figure 2-2 shows a topology with several endpoints and a transrater. Video transrating is a key component needed to create an integrated conferencing service that links endpoints from LAN, broadband, and mobile networks.

Figure 2-2 Video Transrating Network (in-figure stream labels: Video H.263, 10 frames/sec, at 128 kbps and 256 kbps)...