Category Archives: HomeLab

The Hyperconverged Homelab—Hardware Accelerated Video Transcoding


There are many uses for a dedicated GPU besides gaming. Potentially the most appealing, for homelab enthusiasts, is providing hardware accelerated transcoding of video files. Recent consumer graphics cards include dedicated hardware modules for both decoding and encoding the most common video compression formats. This allows the heavy lifting of video transcoding to be offloaded from the CPU, freeing up compute resources for other tasks and enabling faster, more efficient video encoding and decoding.

This is essentially the same functionality as Intel’s QuickSync hardware-accelerated video transcoding, but available on consumer and enterprise graphics cards as opposed to being part of the on-chip graphics package. Although I initially hoped to leverage QuickSync on my current processor, I have not been able to get pass-through of the integrated graphics working properly. I’ll cover those issues in another entry, but for now I want to talk about leveraging the dedicated video transcoding functionality of consumer-grade NVIDIA graphics cards.

NVENC On the Cheap

Today, the absolute best bang for your buck in the GPU market is a secondhand GTX 1060. With a Passmark score right around 9000, depending on the variant (8971 for the 3GB, 9095 for the 6GB), these GPUs have adequate performance for last-gen gaming at 1080p. And at a consistent local Craigslist price of around $140 USD, they represent an absolutely phenomenal 64–65 Passmarks per dollar. To illustrate my point, here’s the total Passmark score and Passmark/$ for typical local Craigslist prices for all 10-series GPUs:
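For the two 1060 variants, the value figure quoted above can be reproduced with a few lines of Python. This is only a sketch of the arithmetic; the scores and the roughly $140 price are the ones stated in the text, and the function name is my own.

```python
# PassMark scores and the ~$140 local price are the figures quoted in the text.
PRICE_USD = 140

CARDS = {
    "GTX 1060 3GB": 8971,
    "GTX 1060 6GB": 9095,
}


def passmarks_per_dollar(score: int, price: float) -> float:
    """Value metric: benchmark points bought per dollar spent."""
    return score / price


if __name__ == "__main__":
    for name, score in CARDS.items():
        value = passmarks_per_dollar(score, PRICE_USD)
        print(f"{name}: {value:.1f} PassMarks/$")
```

The same function applied to any other listing (score and asking price) gives a directly comparable number.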

As you can see, the 1060 variants lead the pack in terms of value, at more than double the performance per dollar of a 1080 Ti. And while they may no longer be suitable for demanding gaming and rendering tasks due to their relatively low total performance, they’re still quite capable. Side note: while the GTX 1050 theoretically includes the hardware we care about, the overall graphics performance is much lower and the RAM is significantly less; the used market prices are not cheap enough to justify these shortcomings.

The GP106 chipset that first-gen GTX 1060 cards are based on is the same as in the Quadro P2000. Like almost all Pascal architecture GPUs, it features dedicated hardware for video encoding and decoding. Looking at the official video codec support matrix, we can also see that this chipset is recent enough to support modern codecs, including up to H.265 4K YUV 4:4:4 encoding and up to H.265 12-bit YUV 4:2:0 decoding. H.265 4:4:4 is still relatively uncommon, and hardware decoding for it isn’t supported on any Pascal architecture card, so we’ll accept that as a reasonable upper limit.

Quadro vs. GTX

So what’s the difference between a Quadro P2000 and a GTX 1060 when it comes to video transcoding? On the hardware side, not much that I’ve been able to find. The memory size and bus width differ, but the decode/encode module appears to be the same. Of course, NVIDIA imposes a number of artificial software-based limitations on their GTX cards designed to force enterprise customers to pay the premium for the Quadro brand. For video encoding, this manifests as an artificial restriction on the number of simultaneous transcoding sessions supported by the driver. Where Quadro cards (from the P2000 and up) are “Unrestricted”, all GTX cards are capped at 2 sessions. What does the potential performance look like if we ignore this restriction?

According to the latest revision of NVIDIA’s technical notes for NVENC and NVDEC, the maximum performance for encoding H.264 is 648 FPS, while decoding is 658 FPS. This is a per-module performance figure, so it doesn’t depend on the GPU chipset generally but only on the number of modules included in a specific GPU’s configuration. The P2000 and GTX 1060 both feature single NVENC and NVDEC modules. Here’s a look at Pascal video coding performance by quality option:

Frames Per Second, single source, higher is better, maximum values in red. Data pulled from NVENC/NVDEC Application Notes as of 2 April 2018.

Doing a little simple math (Encoding FPS ÷ 30 FPS per stream) we can estimate the performance for multiple stream encoding. A single NVENC module should be capable of encoding 21 simultaneous streams in High Performance mode and 11 in High Quality Two-Pass mode. NVDEC performance is similar, at 21 streams regardless of their encoding level. A P2000 or GTX 1060 should be able to perform the theoretical maximum 21 simultaneous 1080p30 transcodes on its single NVENC module, minus some potential overhead for task switching and performance degradation from memory restriction. Of course, this is all hypothetical, but YouTuber Alex over at Sloth Tech TV reports that he was able to achieve 18 streams on a P2000. That seems fairly realistic, and is in the ballpark for previously-published numbers for Pascal NVENC modules (575 FPS, according to a previous version of the technical notes), as well as being approximately what would be expected when accounting for the lower boost clock speed of the P2000 compared to the GTX 1060.
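The stream estimate is just integer division of the module's frame rate by the frame rate of one stream. A minimal sketch, using only the FPS figures quoted above and assuming every stream is 1080p30 (the function name is mine):

```python
STREAM_FPS = 30  # one real-time 1080p30 stream


def max_streams(module_fps: int, stream_fps: int = STREAM_FPS) -> int:
    """Whole number of real-time streams one module can sustain."""
    return module_fps // stream_fps


if __name__ == "__main__":
    print("NVENC, 648 FPS:", max_streams(648))
    print("NVDEC, 658 FPS:", max_streams(658))
    print("NVENC, 575 FPS (older notes):", max_streams(575))
```

This ignores task-switching overhead and memory limits, so it is an upper bound rather than a prediction.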

Hardware Transcoding, Un-Capped

What does this look like in the real world? Thanks to some clever reverse-engineering of the NVIDIA drivers, it is now possible to un-cap consumer GTX cards for unrestricted video transcoding. This is the same basic technique used by DifferentSLIAuto, the utility for enabling SLI on unsupported hardware configurations: we simply patch some new data into the right memory address in the driver blob. Fortunately, others have already gone to the trouble of verifying this modification and have even put together a simple shell script that will safely do it for you.

Unfortunately, this driver requirement means that it’s not possible to do this under FreeNAS, so I’ve spun up a fresh Ubuntu Server 18.04 VM for testing purposes. First things first, we’ve enabled PCI Passthrough on the GPU in the Host’s hardware configuration. Then we simply add the card (and its associated audio device, even though we won’t be using it) to the VM as a regular PCI device, making sure to reserve all guest memory:

Then we hide the hypervisor from the guest OS so the NVIDIA drivers will load and not simply refuse to work. Add hypervisor.cpuid.v0 = FALSE under the guest’s Advanced Configuration:

Finally, although disabling Message Signaled Interrupts was necessary for functionality under Windows 10, I have not had any trouble with them under Ubuntu 18.04. If the GPU is flaky, it may be necessary to configure pciPassthroughX.MSIEnabled = FALSE for each device; I’ll update this entry if extended testing surfaces any stability problems.

Now we can boot up the VM and install the NVIDIA drivers. The caveat here is that you need a version of the driver which is supported by the patcher. A list is maintained in the README; for Ubuntu 18.04 I was able to use the community PPA to install a version 415 driver whose point release is supported by the patcher:

Once this is done, reboot the system and check functionality with nvidia-smi. You should see your card(s) listed alongside some performance stats:

From here, instructions proceed as documented in the patch README. Clone the repo then simply run bash ./patch.sh. The script will automatically validate the driver version, locate the blob, and apply the patch bytes. Reboot to complete the process.

Full Transcoding With Plex

In order to test out our new hardware configuration in a common real-world scenario, I’ve selected the personal cloud streaming software Plex. It runs on pretty much anything, and will allow us to verify both encoding and decoding functionality without having to manually invoke something like ffmpeg a bunch of times and then do napkin math to figure out what that means for other workloads. This will allow me to test multiple simultaneous real-time transcoding sessions in a realistic environment.

There are a couple caveats to hardware transcoding with Plex. The first, of course, is that you need a Plex Pass subscription to enable the currently-beta Hardware Accelerated Transcoding feature. The second is that you actually need to be running one of the latest beta releases if you want to enable hardware accelerated decoding on NVIDIA chips. Although QuickSync/VAAPI hardware decoding and NVENC encoding have been available for a while, NVDEC support is a recent addition.

According to Plex Forums user AnonymousRetard, the main change in Plex appears to have been updating the version of ffmpeg they use for transcoding to a more current release that supports NVDEC. However, this change is so recent that it’s not even unofficially supported in the beta releases, so some slight modification is needed to inject a flag to the encoder to enable support.

User revr3nd on the Plex Forums has put together a handy guide and convenient little script for this part of the process. Essentially all you need to do is get a copy of Plex version 1.15.1.791 or later and then patch the Plex Transcoder with a shim to conditionally inject the NVDEC argument when invoked on supported codecs. Find it on GitHub; I did have to submit a PR with a couple fixes, though, so your mileage may vary. Since NVDEC only works on certain codecs, it’s important that Plex doesn’t attempt to use it when it won’t work. The codecs supported vary by architecture, chipset, and specific card version. Although Pascal generally supports MPEG-1, MPEG-2, VC-1, H.264, and H.265 4:2:0, support for VP8 and VP9 10- and 12-bit is only found on some cards.
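The conditional injection the shim performs can be sketched in Python. This is not the actual script: the flag pair `-hwaccel nvdec` is ffmpeg's own option and is only assumed to be what the Plex Transcoder accepts, and the function name, the way the source codec is passed in, and the allowed-codec set are all hypothetical.

```python
from typing import List, Sequence, Set

# Assumption: ffmpeg's "-hwaccel nvdec" input option. The exact flag the
# real shim injects is not stated in this post.
HWACCEL_ARGS = ["-hwaccel", "nvdec"]


def inject_nvdec(argv: Sequence[str], source_codec: str,
                 allowed: Set[str]) -> List[str]:
    """Return argv with the hwaccel flag added only for supported codecs."""
    args = list(argv)
    # Leave the command untouched for unsupported codecs, or if a
    # hardware decoder has already been requested.
    if source_codec not in allowed or "-hwaccel" in args:
        return args
    try:
        input_index = args.index("-i")
    except ValueError:
        return args  # no input option found, nothing sensible to do
    # Input options must appear before the "-i" they apply to.
    return args[:input_index] + HWACCEL_ARGS + args[input_index:]
```

Passing the codec list explicitly mirrors the point above: the set has to match what your particular card can decode, otherwise the transcode fails instead of falling back to the CPU.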

Confirm which codecs are supported by your GPU (check the NVDEC support matrix) and invoke the script accordingly. In our case, running a GTX 1060 (GP106) under Ubuntu, the command is:

Note: you will need to re-run this command every time Plex is updated, until NVDEC is officially supported, since upgrades will overwrite the shim.

Performance

Finally, the moment of truth, that magic “(hw)” on the Dashboard:

To confirm that hardware accelerated decoding is active, check the “dec” output column of nvidia-smi dmon -s u:

Utilization will spike when you initiate a new stream, as Plex transcodes a buffer, and will then settle out to low values as the transcode keeps up with playback.

Over a short four-minute test interval running six transcodes from H.264 1080p ~17Mbps to the 10Mbps preset, I measured an average of 24.5% encode utilization and 21.1% decode utilization from nvidia-smi. CPU utilization, only for container and audio codec transcoding, was minimal, at a little under 10% per stream (note that percentage is per-core: total system capacity is 400% utilization). However, reported CPU usage from the ESXi host was rather high, with total combined CPU use from the FreeNAS and Plex VMs averaging around 85% of total capacity during this stress test. This was largely due to file transfer overhead of SMB, which should be improved by switching to NFS or iSCSI.
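Averages like these can be computed from captured nvidia-smi dmon -s u output rather than by eye. The sketch below assumes the output has a header row starting with "#" that names the columns (including "enc" and "dec"), followed by whitespace-separated numeric rows; the function name is hypothetical and the layout may differ between driver versions.

```python
from statistics import mean
from typing import Dict, List


def average_utilization(dmon_text: str) -> Dict[str, float]:
    """Average the 'enc' and 'dec' columns of captured dmon output."""
    columns: List[str] = []
    samples: Dict[str, List[float]] = {}
    for line in dmon_text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue
        if stripped.startswith("#"):
            # The first header row names the columns; the units row and
            # any repeated headers are skipped.
            if not columns:
                columns = stripped.lstrip("#").split()
            continue
        for name, value in zip(columns, stripped.split()):
            if name in ("enc", "dec") and value != "-":
                samples.setdefault(name, []).append(float(value))
    return {name: mean(values) for name, values in samples.items()}
```

Redirect the dmon output to a file for the duration of the test, then pass the file's contents to the function.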

As much as I would like to further test this setup, I don’t have enough playback devices at hand. I intend to write a more thorough performance report when I am able to borrow additional playback devices.

That’s it for now.

The Hyperconverged Homelab—Upgrades

Thanks to the magic of Craigslist and eBay.

After two years of trouble-free service running FreeNAS and Ubiquiti’s UniFi Controller under an Ubuntu Server 18.04 VM, it was finally time for some upgrades. Although I was able to expand my storage capacity by growing a vDev of old, small drives, I wanted to take this opportunity to future-proof and expand my capabilities.

Goals:

  • GPU for passthrough. My system has more capacity than needed for its primary tasks, so I want to try out VM gaming.
  • Better network monitoring and control.
  • Full-size motherboard. Although this project originally started life as a mini-ITX build, my needs have changed, and I am no longer size-constrained on my case.
  • More SATA devices. Using slow consumer spinning platters means I can put a large number of drives on a single HBA before exceeding the available bandwidth and creating a performance bottleneck.

As pictured, clockwise from upper left:

  • EVGA Superclocked GeForce GTX 1060 3GB—Craigslist, $140 with 2 years mfr. warranty, used/like-new
  • Ubiquiti UniFi 8-port Gigabit Managed Switch with 4 PoE (US-8-60W)—eBay, $109 shipped, new/open box
  • SuperMicro C7Z170-OCE-O LGA 1151 ATX Intel Motherboard—eBay, $165 shipped, new/old stock
  • Intel RES2CV240 24-Ports SAS / SATA 6.0Gbps RAID Expander Card—eBay, $149 shipped, new/open box

The GPU was selected by crawling Craigslist for every local listing with “GeForce”. I then constructed a spreadsheet of the listings and calculated the PassMark/$ score to find the best value. At 64 PassMarks per dollar, this EVGA card was one of the best values outside of 1080 and 1080 Ti models previously used for cryptocurrency mining. Crypto mining is to GPUs like drifting is to cars: you can do it safely, if you’re careful, but when buying second-hand the deals aren’t worth the potential headache of having a unit that’s been thrashed. The 1060 class cards also have enough performance to run current-gen games at decent settings.

I purchased the UniFi switch because I was experiencing some bizarre network performance issues with the server. After way too much mucking around on the software side of things I discovered that it was just the consumer-grade Intel NIC on the server motherboard dying (as they are known to). I switched to the other, unused NIC (an Atheros unit) and my connectivity problems went away. It was too late to cancel the order, and I figured that it would be nice to have a managed switch with PoE anyways. Unfortunately it turns out my UAP-AC-LR was cheap for a reason: despite packaging to the contrary, it predates that model’s support for 802.3af standard PoE and requires 24V passive… Oh well, no real loss.

The motherboard was chosen by chance, as I was browsing Newegg for compatible models and was surprised to see that SuperMicro made a desktop gaming-oriented motherboard. A quick trip to eBay surprised me even more with this inexpensive new old stock unit, which I quickly purchased. Single 1GbE, no WiFi/BT, but with Thunderbolt-capable USB 3.1 module. Interesting possibilities abound.

The Intel SAS Expander I found by recommendation on one of the forums, either FreeNAS, ZFS, or ServeTheHome (I can’t recall). This model is particularly desirable not only for its performance to cost ratio, but because it supports dual uplink configuration. Two SAS ports can be used for transparent uplink to the HBA, doubling the throughput available compared to using a single channel uplink. Since 6Gb SAS transfers approximately 600MB/s, splitting it in half leaves 300MB/s available for each hard drive channel in simultaneous utilization. That’s enough to saturate my slow 5k drives. Without this dual uplink, I would be limited to 120MB/s per drive at full utilization, which would be a performance bottleneck.
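One reading that reproduces both per-drive figures is to treat each uplink as a four-lane connector at roughly 600MB/s per lane, with every remaining port on the 24-port expander feeding a drive. That lane layout is my assumption, not something stated above, and the function name is hypothetical.

```python
LANE_MB_S = 600        # ~600MB/s per 6Gb/s SAS lane (figure from the text)
TOTAL_PORTS = 24       # port count of the expander
LANES_PER_UPLINK = 4   # assumption: one uplink is a 4-lane mini-SAS connector


def per_drive_mb_s(uplinks: int) -> float:
    """Bandwidth per drive when every non-uplink port is busy at once."""
    uplink_lanes = uplinks * LANES_PER_UPLINK
    drive_ports = TOTAL_PORTS - uplink_lanes
    return uplink_lanes * LANE_MB_S / drive_ports


if __name__ == "__main__":
    print("single uplink:", per_drive_mb_s(1), "MB/s per drive")
    print("dual uplink:  ", per_drive_mb_s(2), "MB/s per drive")
```

Under these assumptions a single uplink shares 2400MB/s across 20 drives and a dual uplink shares 4800MB/s across 16, which matches the 120MB/s and 300MB/s quoted.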

Aside from the SAS expander, cable management went quite well.

Breaking down the server, I took the opportunity to perform some much-needed cleaning. Although not particularly old, this device has had to live in some fairly awful conditions, including the dustiest room in the dustiest house in the dustiest neighborhood I have ever lived in. Unfortunately, I don’t have filters for the HDD bays, which serve as the primary system intake (the top radiator is the primary exhaust). They also live at ground level. Addressing this enclosure shortcoming is on my to-do list.

As you can see the case fits my components fairly well, and I’ve used the back panel cable management for everything except the SAS cables. This includes fans, pump, front IO, boot disk, and even the motherboard and CPU power supply cables. It’s quite tidy without the SAS cables. Unfortunately there is no way to terminate the SAS cables myself to custom length (the plug ends are actually PCBs), and the cables really don’t like to be bent and do not hold their shape at all, so they get to be spaghetti.

The SAS expander is simply suspended by its Molex 4-pin power connector (don’t crucify me: it doesn’t weigh a lot, this system doesn’t move, and it’s only temporary while I sort out a new case) and held in position by the fairly stiff SAS cables.

The GPU has been mounted in the primary PCIe slot. The IBM M1015, mounted below it, is half-height and so does not obscure the GPU intake fan too badly.

The case will soon be replaced with a Rosewill RSV-R4000 or RSV-L4500, to be rack mounted. This unit is plug and play with my existing hot swap cages, provides plenty of room for my GPU and water cooling loop, is extremely cheap, and even has a front panel intake filter.

Next time, the trials of GPU passthrough.

The Hyperconverged Homelab—Growing RAIDZ vDevs


Quickly approaching 85% utilization of my pool, I found myself in need of more storage capacity. Since the first revision of this project’s hardware was scrounged together on a small budget and utilized some already-owned drives, one of my vDevs ended up being a RAIDZ1 vDev of only 3x2TB. Adding more vDevs to my pool would require either an additional HBA (not possible with my now-undersized motherboard’s single PCIe slot) or a SAS expander. In either case, I would need the drives themselves. I figured that this was a good opportunity to experience growing a ZFS pool by increasing the size of a vDev’s disks.

ZFS does not support “growing” of arrays by adding disks (yet!), unlike some other RAID and RAID-like products. The only way to increase the size of a pool (think of it as pooling the capacity of a bunch of individual RAID arrays) is to add vDevs (the individual RAID arrays in this example), or to replace every single disk in a vDev with a larger-capacity one. vDevs can be constructed out of mixed-size disks, but every disk is treated as if it were the size of the smallest one. For example, a ZFS vDev containing 2x 2TB and 1x 1TB disks has the same usable capacity as one containing 3x 1TB disks: the “extra” is ignored and unused. Replace the lone undersized disk, however, and ZFS can grow the vDev to the full available size.
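The smallest-disk rule is easy to express as a formula. The sketch below gives nominal RAIDZ capacity only, ignoring ZFS metadata and padding overhead, so real usable space will be somewhat lower; the function name is mine.

```python
from typing import Sequence


def raidz_usable_tb(disk_sizes_tb: Sequence[float], parity: int = 1) -> float:
    """Nominal usable capacity: each disk counts as the smallest disk,
    and `parity` disks' worth of space goes to redundancy."""
    count = len(disk_sizes_tb)
    if count <= parity:
        raise ValueError("need more disks than parity devices")
    return (count - parity) * min(disk_sizes_tb)


if __name__ == "__main__":
    print(raidz_usable_tb([2, 2, 1]))  # mixed sizes
    print(raidz_usable_tb([1, 1, 1]))  # same result as the mixed vDev
    print(raidz_usable_tb([2, 2, 2]))  # once the 1TB disk is replaced
```

Because of the `min`, the capacity only jumps when the last small disk is replaced, which is why every disk in the vDev has to be swapped before any growth appears.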

Expanding vDevs is a replace-in-place strategy that essentially works the same as rebuilding (“resilvering”) after a disk failure. Recent versions of ZFS support manually replacing a disk without first failing it out of the vDev, which means that on single-parity (RAIDZ1) vDevs this process can be accomplished safely, without losing fault-tolerance. The FreeNAS documentation provides more information and instructions.

Growing by “too much” is not recommended and will result in poor performance, as some metadata will be a non-optimal size for the new disk size. As far as I have read (unfortunately I can’t find a link for this), it’s definitely “too much” around an order of magnitude, although aiming for no more than a factor of five is probably wise. For my case, as an example, we’re growing from 2TB disks to 6TB disks, which is only a factor of 3. This should be perfectly fine.

Speaking of 6TB drives… Hard drives may be cheap in historical terms, but there’s still value in being thrifty. My use-case is currently read-oriented archival storage: it grows mostly by write-once additions, holds backups and media, and is accessed over 1Gb network links, so the performance requirements are rather low. The data is (mostly) replaceable, so single redundancy is adequate. This means that I can safely use the cheapest hard drives possible, which are currently found in Seagate Backup Plus Hub 8TB carried by Costco for only $129. (At the time I purchased, the last of their stock of the 6TB variant was being cleared for even cheaper.)

These drives are Seagate Barracuda ST8000DM005, which are SMR drives. This technology, which has been used to great effect to increase the size of cheap consumer drives, essentially by overlapping the data on the platters, is only really suitable for write-once use and is known to be rather failure-prone. However, these have plenty of cache and perform just fine for reading, and adequately for writing, so are perfectly acceptable for my use-case.

Growing the target vDev was fairly straightforward. I had unused drive bays, so I simply shucked the drives from their plastic enclosures and proceeded one at a time. After formatting each disk for FreeNAS, I initiated the resilvering process. This took somewhere between 36 and 48 hours to resilver 1.7TB of data per drive. I found this performance rather poor, but was not able to locate an obvious bottleneck at the time. In hindsight, inadequate RAM was likely the cause. After resilvering I removed the old drive to make room for the next replacement. Although my drive bays are hot-swap (and this is supported by both my HBA and FreeNAS), I didn’t label the drive bays when I installed them initially and had some difficulty identifying the unused drives. The best solution I found was to leverage the per-disk activity light of the Rosewill hotswap cages.
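To put a number on "rather poor", the resilver times above translate into an average throughput per drive. A quick sketch, assuming decimal units (1TB = 1,000,000MB) and using only the 1.7TB and 36 to 48 hour figures from the text:

```python
MB_PER_TB = 1_000_000  # decimal units, as drive vendors count them


def resilver_mb_s(data_tb: float, hours: float) -> float:
    """Average throughput in MB/s for a resilver of data_tb over hours."""
    return data_tb * MB_PER_TB / (hours * 3600)


if __name__ == "__main__":
    for hours in (36, 48):
        rate = resilver_mb_s(1.7, hours)
        print(f"{hours} h: {rate:.1f} MB/s")
```

That works out to roughly 10 to 13MB/s sustained per drive.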

A lovely sight.

With capacity to spare, I can finally test out some new backup strategies, such as Time Machine over SMB.

The Hyperconverged Homelab—Configuration c.2018


Although my original use-case included virtualizing a router/firewall, it was only beneficial for a couple months while I was still living in accommodation with a shared network. I ran OpenWRT for simplicity of configuration and had two separate vSwitches configured in ESXi, one for each NIC. This allowed me to connect to the shared network while retaining control over my own subnet and not leaking device access or mDNS. I had hoped to pass through the motherboard’s 802.11ac WiFi NIC (which worked fine), but was stymied by OpenWRT’s glacial upgrade cycle. They were running an absolutely ancient version of the Linux kernel which predated support for my WiFi chipset. I considered working around this by creating a virtual Access Point using a VM of Ubuntu Server or other lightweight Linux which would support the WiFi chipset, but it just wasn’t worth the trouble.

After spending a couple months abroad with the server powered down I returned home and found a new apartment. I was able to get CenturyLink’s symmetric Gigabit offering installed, and running their provided router eliminated the need for a virtual router appliance. The OpenWRT VM was quickly mothballed and replaced with an Ubuntu Server 18.04 VM to run Ubiquiti’s UniFi Controller.

The current (Dec. 2018) software configuration is fairly simple:

  • ESXi Server 6.5
    • FreeNAS 9.10
      • 12GB RAM, 4vCPU, 8GB boot disk
      • IBM M1015 IT Mode via PCIe passthrough
      • 2x RAIDZ1 vDevs of 3 disks (consumer 2 and 5TB drives)
      • Jails for utilities benefiting from direct pool access
    • Ubuntu Server 18.04
      • 2GB RAM, 2vCPU, 8GB boot disk
      • Ubiquiti UniFi Controller
      • DIY Linode dynamic DNS

The Hyperconverged HomeLab—Introduction


Now in its second relatively trouble-free year, it’s finally time to get some upgrades on my hyperconverged homelab. First, however, a long-overdue introduction!

The current case configuration: a modified Cooler Master Centurion 590 mid-tower case.

This project started out as a compact, low-power, ultra-quiet NAS build. However, I quickly decided that I wanted to virtualize and give myself more power and flexibility. At the very least, being able to run pfSense or another router/firewall appliance on the same device represented a significant benefit in terms of portability: I could plug into basically any network without making the NAS available on it.

I decided to use a 35W Intel desktop processor and consumer motherboard. They’re economical and readily available, with plenty of products available for performance and cooling enhancement. At the time, Skylake (6th Gen.) was mature and Kaby Lake didn’t have an official release date, so I chose the i5-6500T. The $100 premium on MSRP and near total lack of single unit availability prevented me from choosing an i7-6700T.

For the motherboard I chose Gigabyte’s GA-H87N-WIFI (rev. 2.0), a mini-ITX motherboard from their well-regarded UltraDurable line. The primary driver of this decision was the onboard dual 1GBase-T LAN and the 802.11a/b/g/n plus Bluetooth 4.0 M.2 card. Dual LAN was critical for the device’s potential use as a router, as virtualizing my NAS would require utilizing the single available PCIe slot for an HBA or RAID card.

RAM was sourced as 2x16GB G.Skill Aegis modules (still the cheapest DDR4-2133 2x16GB kit on the market), providing a solid starting point while leaving two DIMMs free for later expansion to the motherboard and processor’s max supported 64GB. I sourced a Seasonic SS460FL2, a 460W fanless modular PSU; a cheap SanDisk 240GB SSD for a boot drive; and Corsair’s H115i all-in-one liquid cooling loop.

At this time I was still case-less, and waffling on the purchase of a U-NAS NSC-800 hot-swap enclosure, when I discovered Rosewill’s 4-in-3 hot swap cages. I quickly located the Cooler Master Centurion 590 on local Craigslist, which represented a decent compromise on size and offered nine 5.25″ drive bays.

The final piece of the puzzle was the HBA, an IBM M1015 RAID card which I cross-flashed to LSI generic IT Mode firmware. See this other post for details. With that, the build was hardware-complete and went together (fairly) smoothly. Only minor case modification was required to fit the ridiculously over-sized water cooling radiator, which had to be mounted on the top of the case with the fans inside, since the case was not designed for water cooling and there was inadequate clearance above the motherboard.

I installed ESXi on the boot disk and then installed FreeNAS into a VM. (Yes, I should have drive redundancy for my VM datastore.) After flashing the M1015 everything was relatively plug-and-play, set-and-forget, with the only notable downside being that the motherboard refused to POST without detecting an attached display. That issue was solved when I discovered that an HDMI-to-VGA adapter I purchased acted as a display simulator. This system has served me well for the last couple years, but recently I’ve wanted to expand my capabilities. Having a single PCIe slot is somewhat limiting, especially since I didn’t end up buying a mini-ITX sized case…

Crossflash IBM M1015 to LSI 9220-8i IT Mode for FreeNAS


The IBM M1015 is a widely available LSI SAS2008-based RAID controller card. It is an extremely popular choice for home and enthusiast server builders, especially among FreeNAS users, for its low price point (~$60 US secondhand on eBay) and excellent performance.

In essence, it’s hardware equivalent to the LSI 9211-8i; officially, it’s the 9220-8i, sold to OEMs to be rebadged. Two SFF-8087 mini-SAS quad-channel SAS2/SATA3 ports, no cache, no battery backup. Cross-flash it to LSI generic firmware in IT mode, they say, and you get an excellent SATA III HBA on the cheap. Turns out that’s easier said than done, especially if you’re working with a recent consumer motherboard.

The comprehensive, only slightly dated, instructions are here. Ironically, I only found them after I had pieced together the procedure for myself.

At this point, FreeNAS 9.10 is compatible with version P20 firmware. User Spearfoot on the FreeNAS forums has a package containing the utilities and firmware files. I’ve also attached it to this post: m1015.

Notes:

  • If your motherboard lacks an easily accessible EFI shell, use the one in rEFInd.
  • If you get the error “application not started from shell”, that’s an EFI shell version compatibility issue. Use the shell provided in the link.
  • “No LSI SAS adapters found!” from sas2flsh.exe likely indicates that IBM firmware is still present. Use megarec to erase it.
  • “ERROR: Failed to initialize PAL. Exiting Program.” means your motherboard is not compatible with the DOS sas2flsh. Use the EFI version.

Additional References:

  • GeekGoneOld on the FreeNAS forums has a quick guide: #18
  • And a useful reply in that same thread: #28
  • Redditor /u/PhyxsiusPrime describes the EFI shell compatibility workaround via rEFInd here.