Building a Global RTC Infrastructure with Zenlayer – Part 1: What Is RTC?

About the author

Jeff Ji is a Product Solutions Manager at Zenlayer. With nearly 10 years of experience in cloud and infrastructure solutions, Jeff enjoys helping clients from a wide variety of industries achieve business assurance and drive new innovations. He’s always looking for new ways to create more value for our clients and their users.



This is the era of video! Research shows that video usage grew 24% in 2022, and now accounts for 65% of all internet traffic. Video is profoundly changing our lifestyles, communication methods, entertainment preferences, and even knowledge structures.

Recent advancements in real-time communication (RTC) are making videos more accessible and interactive than ever before.

In this two-part series, we’ll explore how RTC is enabling live interactive video, and how Zenlayer can help build a powerful and flexible global RTC service infrastructure. Read on for part one, which includes an overview of RTC, a technical analysis, and a common usage scenario.


Getting to know RTC

Like many of my colleagues, I frequently browse TikTok after work to unwind. Recently, I discovered that I could watch TikTok videos in sync with my friends. Even if my friends and I are in different places, we can still chat and watch videos together like we’re in the same room.

Interactive video content is found not only on TikTok but also in many other social and entertainment applications, such as virtual offices, remote meetings, online education, online karaoke, and cloud gaming. In addition, RTC has become an indispensable technology for the metaverse, which demands ultra-low latency for an immersive experience. By leveraging real-time audio and video, these new interactive applications are faster, more direct, and more engaging for users.

So, what makes interactive video content possible? It’s all thanks to network connectivity and real-time communications. Users can now watch and meet wherever and whenever they want, without having to download content or wait for it to load.


How is RTC different from traditional livestreaming?

Traditional livestreaming, which features one-way videos and no audience interaction, is mostly used for scenarios like entertainment, e-commerce, and news. It has an average latency of 3-5 seconds.

RTC supports PaaS and SaaS architectures and is mainly used for one-to-one and one-to-many host connections like video calls, video conferences, online education, and other scenarios where low latency and user participation are important. Most audio and video social apps are built on RTC technology. The latency requirement for RTC is generally under 400 milliseconds (ms), with 200-300 ms being optimal. Remote control and cloud gaming scenarios requiring split-second interactions may demand even lower latency, sometimes down to tens of milliseconds.
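To make the 400 ms figure more concrete, here is a minimal sketch of a latency budget check. The stage names and per-stage values are illustrative assumptions, not measurements from any real deployment; only the 400 ms ceiling comes from the text above.

```python
# Upper bound for RTC latency quoted in the text above.
RTC_BUDGET_MS = 400


def check_budget(stages, budget_ms=RTC_BUDGET_MS):
    """Sum per-stage delays (in ms) and report whether they fit the budget."""
    total = sum(stages.values())
    return total, total <= budget_ms


# Hypothetical per-stage delays for one direction of a video call.
stages = {
    "capture": 20,
    "encode": 30,
    "network": 120,
    "jitter_buffer": 60,
    "decode": 15,
    "render": 15,
}

total, within_budget = check_budget(stages)
print(total, within_budget)
```

A breakdown like this helps show where the budget goes: in this made-up example the network leg is the largest single contributor, which is why the rest of this article focuses on topology, protocols, and routing.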




RTC’s technology and applications

Mesh Structure

You might be familiar with the traditional livestreaming architecture, which consists of a source, secondary nodes, and edge nodes in a tree-like structure. This architecture requires the video stream to be transmitted through each node in sequence, with a certain delay at each node.

RTC, on the other hand, uses a mesh structure in which all nodes have equal status and can forward packets downstream. This allows for path optimization and various strategies based on the real-time status and location of the participants, reducing the number of intermediate steps and ultimately, latency.
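The path optimization described above can be sketched as a shortest-path search over the mesh. This is a minimal illustration using Dijkstra's algorithm, not a description of how any particular RTC provider routes traffic; the node names and link latencies are hypothetical.

```python
import heapq


def lowest_latency_path(links, src, dst):
    """Dijkstra over an undirected mesh; links maps (a, b) -> latency in ms."""
    graph = {}
    for (a, b), ms in links.items():
        graph.setdefault(a, []).append((b, ms))
        graph.setdefault(b, []).append((a, ms))

    best = {src: 0}   # lowest known latency from src to each node
    prev = {}         # predecessor on the best known path
    heap = [(0, src)]
    while heap:
        cost, node = heapq.heappop(heap)
        if node == dst:
            break
        if cost > best.get(node, float("inf")):
            continue  # stale heap entry
        for nxt, ms in graph.get(node, []):
            new_cost = cost + ms
            if new_cost < best.get(nxt, float("inf")):
                best[nxt] = new_cost
                prev[nxt] = node
                heapq.heappush(heap, (new_cost, nxt))

    if dst not in best:
        return None, float("inf")
    path = [dst]
    while path[-1] != src:
        path.append(prev[path[-1]])
    return path[::-1], best[dst]


# Hypothetical link latencies between mesh nodes, in milliseconds.
links = {
    ("tokyo", "la"): 100,
    ("la", "frankfurt"): 150,
    ("tokyo", "singapore"): 70,
    ("singapore", "frankfurt"): 160,
    ("tokyo", "frankfurt"): 240,
}
path, latency_ms = lowest_latency_path(links, "tokyo", "frankfurt")
print(path, latency_ms)
```

With these made-up numbers, the relay through Singapore beats the direct link. Because any node can forward packets, the route can be recomputed whenever measured link latencies change, which a fixed tree cannot do.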




Protocol Selection

RTC relies on protocols, or rules, to determine how data moves across the internet. Traditional live video streams are usually based on RTMP, HLS, DASH, and other protocols, all of which run over TCP. Even with optimization, they still have a latency of around 2-3 seconds.

In contrast, the open-source, UDP-based WebRTC project has dramatically reduced latency to the millisecond level. Unlike other protocols, WebRTC is a streaming communication framework covering the entire process of audio and video capture, encoding and decoding, transmission, and rendering for real-time audio and video.


TCP is a reliable transport protocol that trades timeliness for data integrity. In weak network environments, the “three-way handshake” it performs before any data is sent, along with the in-order retransmission of lost packets, can cause significant latency.

UDP has the advantage of high real-time performance, but does not guarantee data arrival and sequencing, making it less reliable. Despite this, real-time audio and video services often use UDP as the transport layer protocol. For example, Agora’s self-developed transport layer protocol, AUT, optimizes UDP at the protocol and algorithm layers to improve transmission reliability and logic.
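Because UDP guarantees neither arrival nor ordering, the application layer has to restore order itself and decide when to stop waiting for a missing packet. The sketch below shows one simple way to do that with sequence numbers. It is a generic illustration, not Agora's AUT or WebRTC's actual jitter buffer; the `ReorderBuffer` class and its `max_wait` parameter are hypothetical, and sequence-number wraparound is ignored for brevity.

```python
class ReorderBuffer:
    """Releases payloads in sequence order and skips a gap once
    max_wait newer packets are waiting behind it."""

    def __init__(self, first_seq=0, max_wait=3):
        self.next_seq = first_seq   # next sequence number to release
        self.max_wait = max_wait    # how many buffered packets justify skipping
        self.pending = {}           # seq -> payload, received out of order
        self.lost = []              # sequence numbers given up on

    def push(self, seq, payload):
        """Accept one datagram; return the payloads now ready, in order."""
        if seq < self.next_seq or seq in self.pending:
            return []  # too late to be useful, or a duplicate
        self.pending[seq] = payload
        ready = []
        while True:
            if self.next_seq in self.pending:
                ready.append(self.pending.pop(self.next_seq))
                self.next_seq += 1
            elif len(self.pending) >= self.max_wait:
                # Real-time media prefers skipping over stalling.
                self.lost.append(self.next_seq)
                self.next_seq += 1
            else:
                break
        return ready


buf = ReorderBuffer(max_wait=2)
print(buf.push(0, "a"))  # in order, released immediately
print(buf.push(2, "c"))  # held back while 1 is missing
print(buf.push(1, "b"))  # gap filled, both released in order
```

The key difference from TCP is the skip branch: instead of blocking everything behind a lost packet until it is retransmitted, the receiver moves on and lets the decoder conceal the loss.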


Real-Time Network (RTN)

In addition to the right protocols, you also need to have a proper network in place to support RTC.

A Real-Time Network (RTN) is a stable, high-quality transmission network designed specifically for real-time communication. It’s built on top of the public internet or dedicated lines using software-defined networking (SDN) for network virtualization, and focuses on route calculation and recovery from link failures. Its control plane is mainly responsible for network quality detection, path planning, and rule configuration management. Its data plane is responsible for data transmission and forwarding, with its nodes acting as edge and intermediate nodes.

Based on a decentralized architecture, a real-time audio and video transmission network gives end-users access from nearby edge nodes. Using intelligent routing algorithms to calculate the optimal transmission path in real-time, it effectively solves routing link and bandwidth cost issues.
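As a rough illustration of the control plane's quality detection, the sketch below picks an access edge node from probe results. The scoring rule (round-trip time plus a penalty proportional to packet loss), the thresholds, and the node names are all assumptions made for this example, not a documented RTN algorithm.

```python
from dataclasses import dataclass


@dataclass
class Probe:
    node: str          # edge node identifier (hypothetical names below)
    rtt_ms: float      # measured round-trip time
    loss_rate: float   # fraction of probe packets lost, 0.0 to 1.0


def pick_edge(probes, loss_penalty_ms=1000.0, max_loss=0.2):
    """Return the usable node with the lowest score, or None if none qualify.

    Score = RTT + loss_rate * loss_penalty_ms (an assumed heuristic).
    """
    usable = [p for p in probes if p.loss_rate <= max_loss]
    if not usable:
        return None
    best = min(usable, key=lambda p: p.rtt_ms + p.loss_rate * loss_penalty_ms)
    return best.node


probes = [
    Probe("edge-a", rtt_ms=20, loss_rate=0.05),  # close but lossy
    Probe("edge-b", rtt_ms=35, loss_rate=0.0),   # slightly farther, clean
    Probe("edge-c", rtt_ms=10, loss_rate=0.5),   # closest, but unusable
]
print(pick_edge(probes))
```

The point of the example is that "nearby" is measured by network quality rather than geography alone: the node with the lowest raw RTT is rejected here because of its loss rate.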


How RTC is used

To demonstrate the entire RTC architecture, let’s look at a typical scenario involving real-time interactive livestreaming between a host and guests.

• In this scenario, the roles include the host, guests, and audience, where guests can be either viewers in the same room or hosts from other rooms.
• In interactive scenarios, the audio and video streams from the host and the connected user are pushed to the RTC server.
• The interactive audio and video content is mixed into one audio and video stream on another server, transcoded, then pushed to the CDN server for viewing by the audience.
• Once an audience member’s request to go live is approved, their CDN address is switched to an RTC address for interaction.

RTC is usually combined with a content delivery network (CDN), with RTC as the foundation and CDN as the supplement. RTC guarantees that the delay of host and guest interaction is within 200-400 ms. On the viewer side, CDN can support thousands of viewers watching simultaneously, resolving the challenge of high concurrency in RTC technology.

CDN is relatively inexpensive compared to RTC and is suitable for viewers who do not require high-frequency interactions and can tolerate some delays.
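The split between RTC for participants and CDN for viewers can be sketched as a simple role-based lookup. Everything below is a hypothetical model: the role names follow the scenario above, while the URL formats, the `rtc://` scheme, and the `example.com` hosts are placeholders rather than real endpoints.

```python
from dataclasses import dataclass
from enum import Enum


class Role(Enum):
    AUDIENCE = "audience"
    GUEST = "guest"
    HOST = "host"


@dataclass
class Participant:
    user_id: str
    role: Role = Role.AUDIENCE


def stream_address(p, room):
    """Passive viewers pull the mixed stream from the CDN;
    hosts and guests connect to the RTC service directly."""
    if p.role is Role.AUDIENCE:
        return f"https://cdn.example.com/live/{room}.m3u8"
    return f"rtc://rtc.example.com/{room}?user={p.user_id}"


def approve_go_live(p):
    """Promote a viewer to guest once the host approves the request."""
    p.role = Role.GUEST


viewer = Participant("u42")
print(stream_address(viewer, "room1"))  # CDN address while watching
approve_go_live(viewer)
print(stream_address(viewer, "room1"))  # RTC address once interacting
```

Keeping the decision in one place makes the cost trade-off explicit: only the small set of interacting users consumes RTC resources, while the large audience stays on the cheaper CDN path.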


Of course, real-time interactive audio and video technology creates a huge demand for “distribution” and “computing” — or distributed computing.

From the perspective of distribution, the traffic and bandwidth generated by real-time audio and video on the RTN are increasing, and global users are demanding smoother playback experiences for live broadcasts. The interactive requirements of high-traffic applications such as cloud gaming, 4K/8K video, and AR/VR/XR are also growing.

From the perspective of computing, RTC nodes and CDNs are typical resource architectures deployed globally at the edge. In the future, high bit rates or immense computing power may be required for video transcoding processing and AI analysis.


The complete business flow of RTC

Let’s take video interactions as an example. Behind our experience of seamless video calls, there is a series of technical components supporting smooth communications, clear visuals, and even filters that make us look more attractive.

The business flow of RTC applications includes video capture, video pre-processing (low-light enhancement, beautification, AI processing, etc.), and adaptive video encoding according to specific device conditions. Once encapsulated in RTP packets, the video is sent to the app’s server.
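To illustrate the RTP encapsulation step, here is a minimal sketch that builds and parses the 12-byte fixed RTP header defined in RFC 3550. It omits padding, header extensions, and CSRC entries, and the payload type 96 (from the dynamic range) and the sample values are assumptions chosen for the example.

```python
import struct

# Fixed RTP header: flags, marker + payload type, sequence, timestamp, SSRC.
RTP_HEADER = struct.Struct("!BBHII")


def build_rtp_packet(payload, seq, timestamp, ssrc, payload_type=96, marker=False):
    """Prefix an encoded media payload with a fixed RTP header."""
    byte0 = 2 << 6  # version 2, no padding, no extension, zero CSRCs
    byte1 = (int(marker) << 7) | (payload_type & 0x7F)
    header = RTP_HEADER.pack(
        byte0, byte1, seq & 0xFFFF, timestamp & 0xFFFFFFFF, ssrc & 0xFFFFFFFF
    )
    return header + payload


def parse_rtp_header(packet):
    """Decode the fixed header fields from the first 12 bytes of a packet."""
    byte0, byte1, seq, timestamp, ssrc = RTP_HEADER.unpack(packet[:12])
    return {
        "version": byte0 >> 6,
        "marker": bool(byte1 >> 7),
        "payload_type": byte1 & 0x7F,
        "seq": seq,
        "timestamp": timestamp,
        "ssrc": ssrc,
    }


packet = build_rtp_packet(b"frame", seq=7, timestamp=90000, ssrc=0x1234, marker=True)
print(len(packet), parse_rtp_header(packet))
```

The sequence number lets the receiver detect loss and reordering, and the timestamp lets it schedule playback, which is what makes RTP a natural fit on top of an unreliable transport.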

At this point, app operators need to consider how to enable users to send videos smoothly and efficiently. This requires employing protocol optimization and other means to combat weak network conditions. Users around the world then upload videos to RTC node servers in their respective regions.

To enable global users to interact, a global real-time transmission network needs to be built. Based on different interactive and non-interactive requirements, the videos are then mixed, transcoded, and either pushed to the CDN to be delivered to the audience or directly connected to interactive users.



Part 2 is coming soon…
