RTP — Real-Time Transport Protocol

How RTP carries real-time audio and video over UDP, what the RTP header provides for jitter compensation and payload identification, and how RTCP provides feedback and synchronization between media streams.

Tags: layer4, rtp, rtcp, voip, streaming, jitter, timestamp, rfc3550

Overview

Audio and video data have unique transport requirements that neither TCP nor raw UDP satisfies well. TCP’s retransmission causes jitter — a late-arriving retransmission disrupts real-time playback. Raw UDP provides no mechanism for the receiver to detect lost packets, measure jitter, or synchronize audio and video streams.

RTP (Real-Time Transport Protocol) fills this gap. RTP is an application-layer protocol that runs on top of UDP and provides the minimum additional structure needed for real-time media delivery:

  - Sequence numbers, so the receiver can detect loss and reorder packets
  - Timestamps, so the receiver can reconstruct playback timing and measure jitter
  - Payload type identification, so the receiver knows which codec decodes the payload
  - Source identification (SSRC), so multiple streams can be distinguished in one session

RTP is paired with RTCP (RTP Control Protocol), which runs alongside RTP and carries statistics and control information: packet loss rates, jitter measurements, round-trip time, and synchronization between audio and video streams.

RTP is defined in RFC 3550. It is the transport protocol behind every VoIP call (SIP+RTP), every video conferencing session (WebRTC uses SRTP, the secure version of RTP), most IPTV systems, and countless streaming applications.


Why RTP Exists on Top of UDP

Raw UDP gives you port numbers (to identify which application receives the datagrams) and optional error detection (checksum). It does not give you:

  - A way to detect lost, duplicated, or reordered packets
  - Timing information to reconstruct playback and measure jitter
  - A way to identify which codec encoded the payload
  - A way to distinguish multiple media streams sharing the same ports

RTP adds all of this with a compact 12-byte header (plus optional extensions). Unlike TCP, RTP does not retransmit lost packets — real-time media cannot wait for retransmissions. Instead, codecs are designed to gracefully handle missing packets through interpolation, frame concealment, and error correction.


RTP Header

RTP Header — 12 bytes minimum (RFC 3550, section 5.1):

```
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|V=2|P|X|  CC   |M|     PT      |        sequence number        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                           timestamp                           |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|           synchronization source (SSRC) identifier            |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
```
| Field | Size | Description |
|---|---|---|
| V (Version) | 2 bits | Always 2 for current RTP |
| P (Padding) | 1 bit | If set, the packet ends with padding bytes (for alignment requirements of some encryption methods) |
| X (Extension) | 1 bit | If set, a header extension follows the fixed header (for extra fields needed by specific applications) |
| CC (CSRC Count) | 4 bits | Number of Contributing Source (CSRC) identifiers following the fixed header — used by conferencing mixers |
| M (Marker) | 1 bit | Application-defined; for video, typically marks the last packet of a frame; for audio, marks the start of a talk spurt |
| Payload Type | 7 bits | Identifies the codec — what format the media payload uses |
| Sequence Number | 16 bits | Increments by 1 for each RTP packet sent. Used to detect loss and reorder out-of-sequence packets. |
| Timestamp | 32 bits | The sampling instant of the first byte of the payload. The clock rate depends on the codec (8000 Hz for PCMU/G.711, 90000 Hz for video). Used to schedule playback and measure jitter. |
| SSRC | 32 bits | Synchronization Source — a random 32-bit identifier unique to each stream. Distinguishes multiple streams in the same RTP session. |
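The fixed header unpacks with ordinary bit operations. A minimal parser sketch in Python (field layout per RFC 3550; the sample packet bytes below are made up for illustration):

```python
import struct

def parse_rtp_header(packet: bytes) -> dict:
    """Parse the 12-byte fixed RTP header (RFC 3550, section 5.1)."""
    if len(packet) < 12:
        raise ValueError("packet shorter than the 12-byte fixed header")
    b0, b1, seq, ts, ssrc = struct.unpack("!BBHII", packet[:12])
    return {
        "version": b0 >> 6,           # should always be 2
        "padding": bool(b0 & 0x20),
        "extension": bool(b0 & 0x10),
        "csrc_count": b0 & 0x0F,
        "marker": bool(b1 & 0x80),
        "payload_type": b1 & 0x7F,
        "sequence": seq,
        "timestamp": ts,
        "ssrc": ssrc,
    }

# Hypothetical packet: version 2, PT=0 (PCMU), seq=1, ts=160, SSRC=0x12345678
pkt = bytes([0x80, 0x00, 0x00, 0x01]) \
    + (160).to_bytes(4, "big") + (0x12345678).to_bytes(4, "big")
hdr = parse_rtp_header(pkt)
```

The `!BBHII` format string reads the header big-endian (network byte order), as all RTP fields are transmitted.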

Sequence Numbers and Loss Detection

The RTP sequence number starts at a random value (to make known-plaintext attacks on the encrypted stream harder) and increments by 1 for every packet sent. The receiver tracks received sequence numbers and can immediately determine:

  - Whether packets were lost (a gap in the sequence)
  - Whether packets arrived out of order (a sequence number lower than expected)
  - Whether a packet is a duplicate (a sequence number already seen)

What the receiver does with loss:

  - Conceals it: codecs interpolate across the gap or replay the previous frame (packet loss concealment)
  - Reports it: loss statistics go back to the sender in RTCP Receiver Reports

What the receiver does NOT do: retransmit lost packets or pause playback to wait for them. Real-time media cannot tolerate retransmission latency.
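Because the sequence number wraps at 2^16, gap detection has to use modular arithmetic rather than plain subtraction. A sketch of the signed-delta trick (not from any particular implementation):

```python
def seq_delta(prev: int, cur: int) -> int:
    """Signed distance from prev to cur in 16-bit sequence space.
    +1 means the next expected packet; >1 means a gap (loss);
    <=0 means a duplicate or a late, reordered packet."""
    return ((cur - prev + 0x8000) & 0xFFFF) - 0x8000

# Wraparound: 65535 -> 0 is still "the next packet"
assert seq_delta(65535, 0) == 1
# A delta of 3 means packets prev+1 and prev+2 are missing (so far)
assert seq_delta(100, 103) == 3
# A negative delta marks an out-of-order (late) arrival
assert seq_delta(100, 98) == -2
```

The bias-and-mask trick maps any 16-bit distance into the range [-32768, 32767], so a late packet and a huge forward gap are never confused.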


Timestamps and Jitter Compensation

Network jitter is the variation in delay between packets. Even though RTP packets are sent at regular intervals (e.g., every 20ms for voice), they may arrive irregularly — 20ms, 25ms, 15ms, 30ms gaps between arrivals.

If the receiver played audio immediately upon receipt, the playback would sound choppy and irregular. Instead, the receiver maintains a jitter buffer — a small buffer that absorbs jitter and plays audio at a constant rate.

The RTP timestamp is the key to jitter buffer operation:

  1. The sender encodes audio samples and timestamps each packet based on when the samples were captured, not when the packet was sent. For G.711 at 8000 Hz with 20ms packets: each packet advances the timestamp by 8000 × 0.020 = 160 samples.

  2. The receiver plots received packets on a timeline using their timestamps. Regardless of when they arrived, the timestamps tell the receiver the correct playback order and spacing.

  3. The receiver plays audio from the jitter buffer at the correct timestamp rate, adding a configurable delay (the playout delay or buffer depth) to absorb jitter. A 100ms playout delay means the receiver buffers 100ms of audio before starting playback, smoothing out jitter up to 100ms.

Jitter buffer tradeoff: A larger buffer absorbs more jitter but adds more latency. For interactive voice, the total one-way latency must be under ~150ms to feel conversational. Too much jitter buffering makes the call feel like talking through a satellite link.
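The playout scheduling described above can be sketched as follows; the 8000 Hz clock and fixed 100 ms playout delay are assumptions chosen for illustration (real jitter buffers adapt their depth dynamically):

```python
CLOCK_RATE = 8000       # G.711 sample rate (assumed)
PLAYOUT_DELAY = 0.100   # fixed 100 ms jitter-buffer depth (assumed)

def playout_time(first_arrival: float, first_ts: int, ts: int) -> float:
    """Wall-clock time at which to play a packet: anchor the first
    packet's timestamp at its arrival time, then space every later
    packet by its RTP timestamp offset plus a fixed playout delay."""
    ts_offset = (ts - first_ts) & 0xFFFFFFFF   # 32-bit timestamp arithmetic
    return first_arrival + PLAYOUT_DELAY + ts_offset / CLOCK_RATE

# Packets 160 samples apart in media time (20 ms at 8000 Hz) are played
# exactly 20 ms apart, no matter how irregularly they arrived.
t0 = playout_time(first_arrival=10.0, first_ts=5000, ts=5000)
t1 = playout_time(first_arrival=10.0, first_ts=5000, ts=5160)
```

Note that arrival times of later packets never appear in the formula: only the timestamps control spacing, which is exactly how jitter is absorbed.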


SSRC — Synchronization Source

Each RTP stream has a randomly chosen 32-bit SSRC identifier. In a simple point-to-point call, the audio session carries two streams, one in each direction, and each direction has its own SSRC.

In a conference call with a mixing server (where multiple audio streams are combined into one), the mixer generates its own SSRC for the mixed stream. The CSRC (Contributing Source) list in the RTP header identifies the original participant SSRCs that contributed to each mixed packet.

The SSRC is also what RTCP reports reference — when a receiver reports statistics, it identifies the stream it is reporting about by SSRC.


Payload Types

The 7-bit Payload Type (PT) field identifies the codec used to encode the media. IANA maintains a registry of statically assigned payload types:

| PT | Codec | Clock Rate | Channels |
|---|---|---|---|
| 0 | PCMU (G.711 µ-law) | 8000 Hz | 1 |
| 8 | PCMA (G.711 A-law) | 8000 Hz | 1 |
| 9 | G.722 | 8000 Hz (payload clock; actual sampling 16 kHz) | 1 |
| 18 | G.729 | 8000 Hz | 1 |
| 26 | JPEG video | 90000 Hz | |
| 31 | H.261 | 90000 Hz | |
| 34 | H.263 | 90000 Hz | |
| 96–127 | Dynamic | Negotiated via SDP | |

Payload types 96–127 are dynamic — their meaning is negotiated out-of-band (typically via SDP in a SIP or WebRTC signaling exchange). This is how modern codecs like Opus (audio), H.264, VP8, VP9, and AV1 (video) are identified. The codec is announced in the SDP offer/answer exchange, and both sides agree on which payload type number corresponds to which codec for the duration of the session.
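The payload-type-to-codec binding can be read out of the `a=rtpmap` attribute lines of the SDP body (format defined in RFC 4566 / RFC 8866). A minimal parsing sketch; the SDP fragment is made up for illustration:

```python
def parse_rtpmaps(sdp: str) -> dict:
    """Map payload type numbers to (codec, clock rate) from the
    a=rtpmap lines of an SDP body.  Line format (RFC 4566):
    a=rtpmap:<payload type> <encoding name>/<clock rate>[/<channels>]"""
    table = {}
    for line in sdp.splitlines():
        if line.startswith("a=rtpmap:"):
            pt, enc = line[len("a=rtpmap:"):].split(" ", 1)
            name, clock, *channels = enc.split("/")
            table[int(pt)] = (name, int(clock))
    return table

# Fragment of a hypothetical WebRTC SDP answer offering Opus and PCMU:
sdp = """m=audio 9 UDP/TLS/RTP/SAVPF 111 0
a=rtpmap:111 opus/48000/2
a=rtpmap:0 PCMU/8000"""
codecs = parse_rtpmaps(sdp)
# codecs[111] -> ('opus', 48000); codecs[0] -> ('PCMU', 8000)
```

Payload type 111 for Opus is only a convention some stacks happen to use; any number in 96–127 is valid as long as both sides agree via the offer/answer exchange.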


RTCP — RTP Control Protocol

RTCP runs alongside RTP on the next higher port (if RTP uses port 16384, RTCP uses port 16385). RTCP carries control and statistics information and uses approximately 5% of the session bandwidth.

RTCP packet types:

SR (Sender Report): Sent by active senders. Contains:

  - An NTP (wall-clock) timestamp paired with the corresponding RTP timestamp, which lets receivers map media time to wall-clock time and synchronize audio with video (lip sync)
  - Cumulative counts of packets and bytes sent

RR (Receiver Report): Sent by receivers. Contains, for each SSRC being received:

  - Fraction of packets lost since the last report, and cumulative packets lost
  - Highest sequence number received
  - Interarrival jitter estimate
  - Timestamp of the last SR received (LSR) and the delay since receiving it (DLSR), from which the sender can compute round-trip time

SDES (Source Description): Carries descriptive information about participants: CNAME (canonical name — a persistent identifier for a participant), NAME, EMAIL.

BYE: Signals that a participant is leaving the session.

```
Sender                                         Receiver
  |  RTP packets (continuous)  ─────────────►  |   audio/video media over UDP
  |  RTCP SR (every ~5s)  ──────────────────►  |   sent-packet stats + NTP/RTP timestamp mapping
  |  ◄──────────────────  RTCP RR (every ~5s)  |   loss fraction: 2%, jitter: 12ms, RTT feedback
```
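The LSR and DLSR fields in a Receiver Report let the sender compute round-trip time without synchronized clocks (RFC 3550, section 6.4.1). A sketch of the calculation; the example times are invented:

```python
def rtt_from_rr(arrival_ntp: float, lsr: int, dlsr: int) -> float:
    """RTT from a Receiver Report:
    RTT = (time the RR arrived) - (delay since last SR) - (last SR timestamp).
    lsr and dlsr are in 1/65536-second units (the middle 32 bits of an NTP
    timestamp); arrival_ntp is the RR arrival time in seconds on the
    sender's own NTP timeline, so only the sender's clock is ever used."""
    arrival_units = int(arrival_ntp * 65536) & 0xFFFFFFFF
    return ((arrival_units - lsr - dlsr) & 0xFFFFFFFF) / 65536

# SR sent at t=100.0s, held by the receiver for 50 ms before it reported,
# RR arrives back at t=100.120s -> network round trip is about 70 ms.
rtt = rtt_from_rr(100.120, lsr=100 * 65536, dlsr=int(0.050 * 65536))
```

Because the receiver echoes the sender's own timestamp back (LSR) and declares how long it sat on it (DLSR), clock offset between the two hosts cancels out of the subtraction.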

SRTP — Secure RTP

Plain RTP provides no encryption or authentication. Media content is transmitted in plaintext. On a shared network or over the internet, this means anyone in the path can listen to voice calls.

SRTP (Secure Real-time Transport Protocol), defined in RFC 3711, adds:

  - Encryption of the RTP payload (AES in counter mode by default)
  - Message authentication covering the header and payload (HMAC-SHA1 by default)
  - Replay protection, using a packet index derived from the sequence number

SRTP requires keys to be established out-of-band. The two common mechanisms:

  - SDES: keys are carried in the SDP of the signaling exchange, which must itself be protected (e.g. SIP over TLS)
  - DTLS-SRTP: a DTLS handshake over the media path derives the SRTP keys; this is what WebRTC uses


RTP in WebRTC

WebRTC — the technology enabling real-time audio and video in web browsers — uses SRTP for media transport. A WebRTC session involves:

  1. ICE for NAT traversal and finding a usable network path
  2. DTLS for key negotiation
  3. SRTP for encrypted media
  4. RTCP (also encrypted as SRTCP) for statistics and feedback

The entire media stack runs in user space (in the browser or application), over UDP. WebRTC typically uses Opus for audio and H.264 or VP8/VP9 for video.


Key Concepts

RTP timestamps are media clocks, not wall clocks

The RTP timestamp represents the sampling instant of the media, not the time the packet was sent. For a 20ms G.711 packet, the timestamp advances by 160 regardless of network delay. This is what allows the receiver to reconstruct correct playback timing even after packets are buffered in the jitter buffer.

RTCP enables adaptive quality

The loss fraction and jitter measurements in RTCP Receiver Reports are fed back to the sender. The sender can use this feedback to adjust the codec bitrate, change packet sizes, request keyframes, or trigger congestion control algorithms. WebRTC’s media engine uses RTCP as a continuous quality feedback loop to maintain the best possible quality given current network conditions.
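A toy sketch of such a feedback loop; the thresholds and step sizes here are invented for illustration, and real implementations (e.g. WebRTC's congestion controller) are far more sophisticated:

```python
def adapt_bitrate(bitrate: int, loss_fraction: float,
                  min_bps: int = 6_000, max_bps: int = 510_000) -> int:
    """Toy sender-side reaction to the loss fraction in an RTCP RR:
    back off multiplicatively on loss, probe upward slowly when clean.
    All thresholds are hypothetical."""
    if loss_fraction > 0.10:        # heavy loss: cut hard
        bitrate = int(bitrate * 0.5)
    elif loss_fraction > 0.02:      # mild loss: trim
        bitrate = int(bitrate * 0.9)
    else:                           # clean: probe up 5%
        bitrate = int(bitrate * 1.05)
    return max(min_bps, min(max_bps, bitrate))

rate = adapt_bitrate(64_000, loss_fraction=0.15)   # heavy loss -> 32000
rate = adapt_bitrate(rate, loss_fraction=0.0)      # clean -> probe back up
```

The asymmetry (fast decrease, slow increase) mirrors classic congestion-control design: reacting quickly to loss protects the network, while cautious probing recovers quality once conditions improve.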

