Buyers' Guide: Hardware Transcoders


Although software transcoding is acceptable for transcoding most VOD streams and even low-volume live programming, most high-volume live applications need hardware for efficient transcoding, both to save you money and to save the planet. This buyers' guide will cover:

  • What hardware transcoders are
  • What you need to bring to the table to identify the best hardware transcoder
  • Factors to consider when choosing one
  • Choosing a hardware transcoder for cloud workflows
  • Choosing a hardware transcoder for on-prem workflows

As with all buyers' guides, lists are intended to be representative, not exhaustive. If you have a hardware transcoding device you think should be mentioned, leave a comment on the web version of this article, or let me know at jan.ozer[at]streaminglearningcenter.com.

For the record, all throughput, quality, and cost calculations are for H.264 only.

What Hardware Transcoders Are

For the purposes of this article, hardware transcoders include devices that enable high-volume transcoding, such as those powered by graphics processing units (GPUs), field-programmable gate arrays (FPGAs), and application-specific integrated circuits (ASICs). I don’t include hardware transcoding available in a CPU, like Intel’s Quick Sync, because this approach is challenging to scale. As you’ll see, you can pack 10 or more GPU- or ASIC-powered devices in a server, but that’s challenging (and much less affordable) with CPUs.

What You Need to Bring to the Table

I’ll be computing costs throughout this article using the following assumptions:

  • A 100-channel FAST service
  • Running 24/7/365
  • Using the encoding ladder in Table 1

While the numbers will vary with your specific requirements, you should be able to adjust the analysis to fit most transcoding configurations.

There’s nothing special about the ladder in Table 1, but I needed some numbers to work with. In the table, you see the column “% of 1080p” and the line “1080p equivalents.” This represents each rung as a percentage of the pixel count of a single 1080p stream. I’ll add these percentages to calculate a total workload of 1.87 1080p streams for this encoding ladder, which I’ll use to estimate throughput in a later section. Also note the total bitrate in kilobits per second, which I’ll use to compute bandwidth costs later.
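As a rough illustration of the “% of 1080p” arithmetic, here is a minimal Python sketch. The ladder below is an assumption chosen for illustration (Table 1's actual rungs may differ), and it assumes every rung runs at the same frame rate so that pixel count alone drives the workload.

```python
# Hypothetical ladder as (width, height); not necessarily the rungs in Table 1.
LADDER = [(1920, 1080), (1280, 720), (960, 540), (640, 360), (480, 270)]


def pct_of_1080p(width, height):
    """Pixel count of a rung as a fraction of one 1920x1080 frame."""
    return (width * height) / (1920 * 1080)


def ladder_1080p_equivalents(ladder):
    """Sum the per-rung fractions to get the workload per channel."""
    return sum(pct_of_1080p(w, h) for w, h in ladder)


total = ladder_1080p_equivalents(LADDER)
print(f"{total:.2f} 1080p equivalents per channel")
print(f"{100 * total:.0f} 1080p equivalents for 100 channels")
```

With this assumed ladder, the per-channel total rounds to 1.87, and 100 channels round to 187 1080p equivalents.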

sample encoding ladder

If you want to follow along, you should know the number of input streams and the number and configuration of output streams. You’ll also need your cost per gigabyte for transferred bandwidth for future calculations. I’m working off a Google Sheet that contains all of the calculations shown in this article. It’s not exhaustive, but it should be a useful starting place for anyone who is making these calculations. You won’t be able to modify the Google Sheet, but you can download it to an Excel file that you can use as you wish.

Factors to Consider When Choosing a Transcoder

Let’s review the factors to consider when choosing a hardware transcoder. Obviously, the device must support your current output format and any codecs you might add over the next 3–4 years. For most services, this includes H.264, HEVC, and AV1. None of the current transcoders output VVC, and I didn’t consider LCEVC, although some of the transcoders mentioned likely support it with the necessary software.

Next, consider how you’re going to control transcoding. Virtually all transcoding devices will support FFmpeg and offer a lower-level API, with GStreamer as another popular option. FFmpeg and GStreamer are straightforward, but if you’re controlling the hardware via the API, you may find significant differences in complexity and ease of support. Assess this before making a buying decision. If you’re using an application like Norsk or Wowza Streaming Engine, be sure that it supports the transcoder you’re considering.
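To give a sense of what FFmpeg control looks like, here is a minimal Python sketch that builds (but does not run) an FFmpeg command line for an NVIDIA GPU. It assumes an FFmpeg build with NVENC/CUDA support; the file names, bitrate, and preset are placeholders, and other vendors expose their own encoder names and options, so treat this as one illustrative case rather than a template for every device.

```python
import shlex


def nvenc_h264_command(src, dst, bitrate="5M", preset="p7"):
    """Build an FFmpeg argv for GPU decode plus NVENC H.264 encode (not executed)."""
    return [
        "ffmpeg",
        "-hwaccel", "cuda",                # decode on the GPU
        "-hwaccel_output_format", "cuda",  # keep decoded frames in GPU memory
        "-i", src,
        "-c:v", "h264_nvenc",              # NVIDIA hardware H.264 encoder
        "-preset", preset,                 # p1 (fastest) through p7 (highest quality)
        "-b:v", bitrate,
        "-c:a", "copy",                    # pass audio through untouched
        dst,
    ]


cmd = nvenc_h264_command("input.ts", "output.ts")
print(shlex.join(cmd))
```

Building the argument list separately from running it makes it easy to log or review the exact command before handing it to a process runner.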

Next up are throughput and quality, which go hand in hand. As with software transcoders, most hardware transcoders have presets that balance quality and throughput. With software encoding for VOD, you typically care about encoding speed, but with hardware transcoding for live, throughput is all about the number of real-time streams the transcoder can output, not frames per second. That’s because some hardware transcoders are limited in the number of simultaneous streams they can encode or decode; just because a transcoder can output 1080p30 at 1,200 frames per second doesn’t mean it can produce 40 1080p30 outputs. If you’re transcoding VOD, frames per second is relevant as a measure of encoding speed; if transcoding live streams, only the number of real-time outputs matters.
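The distinction between raw speed and real-time outputs can be sketched in a few lines of Python. The session cap used below is a hypothetical value for illustration, not a spec for any particular device.

```python
def realtime_outputs(fps_throughput, frame_rate, max_sessions=None):
    """Real-time streams a device can sustain, limited by speed and session cap."""
    by_speed = int(fps_throughput // frame_rate)
    if max_sessions is None:
        return by_speed
    return min(by_speed, max_sessions)


# Speed alone suggests 40 outputs at 1080p30.
print(realtime_outputs(1200, 30))
# A hypothetical cap of 24 simultaneous sessions becomes the real limit.
print(realtime_outputs(1200, 30, max_sessions=24))
```

The point is simply that the smaller of the two limits governs, which is why a frames-per-second figure on its own can overstate live capacity.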

As I discuss in the article Choosing the Best Preset for Live Transcoding, once you get above 200–300 viewers per channel, considering hardware and bandwidth costs, it makes the most economic sense to use the highest-quality preset to deliver maximum quality at the lowest bitrate. Although your transcoding costs will be the highest using these presets, your reduced bandwidth costs at even moderate viewer levels should more than make up the difference. We’ll look at a calculation of this in a moment.

Measuring Throughput

Throughput specs are most useful when differentiated by the encoding preset, like those shown at go2sm.com/ma35d for the AMD MA35D. The density numbers in the MA35D specs are worth exploring. Single density means streams encoded with H.264, HEVC, and mid-quality AV1, while double density means H.264, HEVC, and mid-quality AV1, plus high-quality AV1.

This is because the MA35D deploys two ASICs: one capable of H.264, HEVC, and mid-quality AV1 and the other, high-quality AV1 only. If your application involves equal streams of H.264, HEVC, or mid-quality AV1 and high-quality AV1, you can double capacity at no extra cost. Specifically, the MA35D can transcode 32 1080p30 streams of H.264, HEVC, or mid-quality AV1. Simultaneously, it can handle 32 additional 1080p30 streams of high-quality AV1 on the same hardware.

In general, if a vendor doesn’t identify the preset behind its throughput specifications, you should assume that the company used a high-throughput/low-quality preset that you likely won’t want to use for production. That means you have to run the tests yourself.

NETINT presents throughput data on its spec sheets and some product reviews published on its site. Another valuable resource for performance and quality results is Derrick Freeman’s excellent review of Quadra for Streaming Media.

NVIDIA provides some performance data here, but it has too many encoding-capable GPUs to fully document their performance (see the decoding and encoding support matrices here). Intel documents the features but not the performance of its transcoding-capable CPUs here but expects its integrators to document the performance of their respective apps on the Intel GPU technologies.

Wowza has a resource that details the throughput of various AWS instances, including AWS EC2 G4, C5, and VT1, which I’ll refer to in the next section.

In all cases, note that throughput is shown for generic configurations, which may or may not match your own. If you’re inputting interlaced feeds, check if the transcoder can de-interlace in hardware; otherwise, you’ll have to de-interlace using the host CPU, which will cut throughput. Ditto for input formats the board might not natively support, like MPEG-2- or AV1-encoded contribution streams.

Assessing Quality

Next up is quality. Most vendors provide basic quality-related information, like this from AMD: “The MA35D card nominally produces video quality that is closely correlated to x264 medium, x265 medium and x265 slow presets, concerning its accelerated AVC, HEVC and AV1 encoders.” Obviously, this is too vague to help you pick a quality winner among available cards.

There are some published quality comparisons, but you probably won’t find a study that covers all of the devices you’re considering in your anticipated configuration. One useful study is the Moscow State University (MSU) hardware comparison, which benchmarked AMD’s Radeon RX 6800 XT, Intel’s Arc A380 GPU, and NVIDIA’s RTX 4070 Ti, along with several other hardware devices that aren’t as commercially available. Unfortunately, the report doesn’t include the AMD MA35D.

Table 2 shows the bitrate savings each codec delivered compared to the x265 codec using the very fast preset as measured by VMAF. As an example, the Intel Arc A380 produced H.265 at 80.8% of the reference bitrate as compared to 99.3% for the NVIDIA card. This means that the bitrate of the NVIDIA stream would have to be 23% higher to deliver the same quality as Intel’s. The situation reversed with AV1, where NVIDIA was about 7.5% more efficient than Intel.
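The 23% figure follows directly from the two percentages quoted above. Here is a minimal Python sketch of that arithmetic; the function name is hypothetical and only the 80.8% and 99.3% inputs come from the text.

```python
def extra_bitrate_pct(less_efficient, more_efficient):
    """Percent more bitrate the less efficient encoder needs for equal quality.

    Both inputs are bitrates expressed as a percentage of the same reference.
    """
    return (less_efficient / more_efficient - 1) * 100


# NVIDIA at 99.3% of the reference versus Intel at 80.8% for H.265.
print(f"{extra_bitrate_pct(99.3, 80.8):.1f}% higher bitrate")
```

The result is roughly 22.9%, which rounds to the 23% cited.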

bitrate savings per codec

Table 3 shows the 5-year bandwidth cost for the 100-channel FAST service example in this article, based on a $0.04/GB CloudFront charge for that volume level. Using the Intel transcoder for HEVC production could reduce these charges by 23%, saving approximately $170,000. Similarly, producing AV1 with NVIDIA could reduce bandwidth costs by 7.5%, saving around $55,300. Both represent substantial savings worth considering in your purchase decision.
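For readers who want to see how bitrate translates into a bandwidth bill, here is a minimal Python sketch. Only the $0.04/GB rate comes from the text; the 3,000 kbps average delivered bitrate and the single always-on viewer are placeholder assumptions, the sketch uses decimal gigabytes, and it is not a reconstruction of the model behind Table 3.

```python
def delivery_cost(avg_kbps, viewer_hours, usd_per_gb=0.04):
    """Cost of delivering video, using decimal gigabytes (1 GB = 1e9 bytes)."""
    bytes_per_hour = avg_kbps * 1000 / 8 * 3600
    gigabytes = bytes_per_hour * viewer_hours / 1e9
    return gigabytes * usd_per_gb


HOURS_5_YEARS = 5 * 365 * 24

# One hypothetical viewer watching a 3,000 kbps stream around the clock.
baseline = delivery_cost(3000, HOURS_5_YEARS)
# The same viewer with a 7.5% lower bitrate at equal quality.
reduced = delivery_cost(3000 * (1 - 0.075), HOURS_5_YEARS)
print(f"baseline ${baseline:,.2f}, savings ${baseline - reduced:,.2f}")
```

Because cost scales linearly with bitrate, any percentage reduction in bitrate carries through as the same percentage reduction in delivery cost; multiply by your own viewer-hours to scale it up.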

5-year bandwidth cost fast

Note that the MSU report includes results from the NETINT Quadra. Citing a desire for privacy, NETINT declined to cooperate with MSU and wanted its results pulled from the study. For this reason, I didn’t present them in Table 2.

Also, note that MSU tested game-oriented GPUs, as opposed to data center cards like the NVIDIA T4 that you’re more likely to deploy in a cloud platform. I would guess there are a few qualitative differences between the encoding delivered by game and data center GPUs, but that’s just a guess.

Interestingly, a Tom’s Hardware review tested multiple gaming-oriented cards from AMD, Intel, and NVIDIA. It found NVIDIA to be first in performance and quality, with Intel in the middle and AMD a distant third. If you’re evaluating GPU transcoders, that review points to NVIDIA as the strongest candidate.

Before wrapping up this quality section, note that if your application involves scaling to lower resolutions, you should measure quality at those resolutions. All hardware devices use different internal scaling algorithms that are optimized for speed, rather than quality. There are likely substantive quality differences between the hardware alternatives, but you’ll need to test using your encoding ladder to quantify them.

I’ve covered basic qualifying variables (codec and application support) and the likely need to measure quality and throughput yourself. Once you have this data, here’s how you would apply it to identify the best transcoder for cloud and on-prem use.

Transcoding in the Cloud

Let’s explore hardware transcoders available in the cloud, starting with transcoding-specific hardware instances. Amazon EC2 VT1 instances configured with up to eight AMD Alveo U30 media accelerator cards have been up and running since 2021. AMD’s MA35D is available on Microsoft Azure as NMads MA35D Virtual Machines in a preview mode. No performance specs or pricing are provided, but performance should be similar to what AMD provides.

NETINT Quadra cards are available in a beta program on the Akamai Cloud as of this writing. The beta is a free program for “approved customers who have identified workloads that will benefit from NETINT T1U VPU Accelerated plans.” No pricing or performance data is provided, although information on NETINT’s website at go2sm.com/t1t2t4 should offer some guidance.

The next option is generic GPU instances. All cloud platforms offer a variety of NVIDIA instances for GPU-based hardware transcoding, but there are too many options to list. Instead, I’ll focus on the NVIDIA T4-powered g4dn instances benchmarked in the aforementioned Wowza performance data and the VT1 instance also tested by Wowza. Specifically, these benchmarks report the number of simultaneous 1080p30 streams each instance can deliver.

Again, for this comparison, we assume a 100-channel FAST service, where each channel uses an encoding ladder consisting of multiple renditions (e.g., 720p, 540p, 360p). The ladder increases the 1080p30 workload by a factor of 1.87, as calculated in Table 1. Table 4 analyzes the hardware options, assuming a 3-year commitment to AWS for pricing.

hardware options 100-channel fast service

Wowza tested two configurations of the AMD U30 instances: the vt1.6xlarge with two U30 cards and the vt1.3xlarge with one. Since performance and pricing were both linear, the cost for each was identical. That is, although the vt1.6xlarge (96 streams) costs twice as much as the vt1.3xlarge (48 streams), it delivers twice the throughput, making the per-stream cost identical.

With the NVIDIA T4 instances, the pricing and performance aren’t linear. In my comparison, the g4dn.xlarge delivered 60 streams for $0.21/hour. The larger g4dn.16xlarge costs 8.3x more but only delivers 10% greater throughput. The g4dn.12xlarge costs 7.4x more than the g4dn.xlarge but delivers only 3.5x the throughput.

The g4dn.xlarge emerges as the most economical cloud option, costing $36,792 over 5 years (four instances at $0.21/hour, running 24/7). Again, this is with a 3-year commitment to Amazon; dropping this commitment to 1 year would increase costs by roughly 50%.
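The cloud figure can be checked with a short Python sketch. The inputs (187 required 1080p30 equivalents, 60 streams per g4dn.xlarge, and $0.21/hour) come from the text; the function name is hypothetical, and the sketch assumes whole instances running 24/7 with no other charges.

```python
import math


def cloud_cost(required_streams, streams_per_instance, usd_per_hour, years):
    """Instances needed and total cost for always-on operation."""
    instances = math.ceil(required_streams / streams_per_instance)
    hours = years * 365 * 24
    return instances, instances * usd_per_hour * hours


instances, cost = cloud_cost(187, 60, 0.21, years=5)
print(f"{instances} instances, ${cost:,.0f}")
```

With these inputs, four instances over 43,800 hours come to $36,792, matching the figure quoted for the g4dn.xlarge.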

Note that if there are substantial quality differences between the alternatives, you should compute the associated bandwidth costs as shown around Table 3 and factor them into the equation.

While the decision between cloud and on-prem solutions often involves a mix of practical considerations and entrenched preferences, cost comparison remains a critical factor. With that in mind, let’s turn our focus to the on-prem side and examine how the numbers stack up.

On-Prem Installations

If you’re buying your gear for on-prem or co-location, you have to compute the CapEx and OpEx components. CapEx will include the transcoders and server(s) to house them, while OpEx will include co-location or other storage allocation costs plus power.

Table 5 contains cost and performance data for three hardware transcoders. Throughput numbers for AMD are from here, with cost and power figures from the AMD website. For the 187 required 1080p30 equivalents, we need six cards. Supermicro quoted a turnkey price of $18,500 for the 2RU AS-2015HS-TNR server with six MA35D cards installed.

cost and performance data 3 hardware transcoders

The NETINT data is for the Quadra Video server, which is sold fully configured (and yes, it’s also a Supermicro server). If you’re considering a NETINT system and don’t need all cards, it’s worth checking if you can buy a unit with fewer than 10. Note that NETINT sells a $19,000 system with a lower-powered CPU, which might work for many applications, and also a more expensive Ampere-based system if you need de-interlacing or transcription.

T4 throughput numbers are from here and from my tests here. At 16 1080p30 streams, you’ll need 12 cards to deliver the 187 required streams. You can verify the power draw at go2sm.com/t4. The T4 costs around $750, and Supermicro quoted a price of $4,500 for the 4RU GPU SuperWorkstation 7049GP-TRT that can house up to six GPUs. This means two 4RU servers, which means twice the CapEx and OpEx.
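The card and server counts above are simple ceiling divisions, sketched here in Python. The per-card throughput (32 streams for the MA35D, 16 for the T4), the six-GPU chassis, and the prices are taken from the text; the helper name is hypothetical.

```python
import math


def units_needed(required, capacity_per_unit):
    """Smallest whole number of units that covers the requirement."""
    return math.ceil(required / capacity_per_unit)


REQUIRED_STREAMS = 187  # 100 channels x 1.87 1080p equivalents

ma35d_cards = units_needed(REQUIRED_STREAMS, 32)
t4_cards = units_needed(REQUIRED_STREAMS, 16)
t4_servers = units_needed(t4_cards, 6)          # six GPUs per 4RU server
t4_capex = t4_cards * 750 + t4_servers * 4500   # cards plus servers

print(ma35d_cards, t4_cards, t4_servers, t4_capex)
```

This gives six MA35D cards, 12 T4 cards in two servers, and a T4 CapEx of $18,000, slightly under the $18,500 turnkey MA35D quote.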

Table 6 shows how the three systems compare. Co-location costs are from go2sm.com/4uco but will vary widely. In all cases, I assumed power would be charged separately at $0.15 per kilowatt hour.

3 systems compared

As you can see, even with two systems needed, the T4 is the cheapest CapEx-wise, but housing and powering two 4RU servers for 5 years more than makes up the difference. Otherwise, from a cost perspective, the MA35D and Quadra are very close, although at $1,500 a card, the Quadra would gain a significant advantage if NETINT allowed you to buy a system with only six cards. Either way, it’s probably not enough financial difference to drive the decision; it will come down to quality or other implementation details.
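If you want to rerun this comparison with your own quotes, a minimal Python sketch of the 5-year total follows. Only the $0.15 per kilowatt hour rate and the $18,500 CapEx come from the text; the 500 W draw and $300 monthly co-location fee are placeholder assumptions, not the values in Table 6.

```python
def five_year_tco(capex, watts, colo_per_month, usd_per_kwh=0.15, years=5):
    """CapEx plus power and co-location charges over the period."""
    power = watts / 1000 * 24 * 365 * years * usd_per_kwh
    colo = colo_per_month * 12 * years
    return capex + power + colo


# Hypothetical single 2RU system: 500 W average draw, $300/month co-location.
print(f"${five_year_tco(18500, 500, 300):,.0f}")
```

A two-server configuration would roughly double the co-location and power terms, which is how a lower CapEx can still lose over 5 years.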

Note that if there are substantial quality differences between the alternatives, you should compute the associated bandwidth costs as shown around Table 3 and factor them into the equation.

As for cloud versus on-prem expenses, buying your own gear is about 22% less costly than operating in the cloud, without considering the time value of money. That assumes a 3-year AWS EC2 commitment, which seems fair given you’re making a lifetime commitment on the hardware purchase.

Of course, it’s not all about comparing the cloud to on-prem operation; it’s about choosing the best option for either operating mode. Hopefully, the analyses presented in this article and the associated Google Sheet will help you do just that.

Author’s Note: I would like to thank Ben Lee from Supermicro for supplying all of the system and related information and pricing. Also, you should know that I worked at NETINT between August 2022 and March 2024 and that I produced a training course for AMD’s MA35D in 2024. I have no continuing contractual agreements with either company.
