DL380 Fan Control Part 2

My server has been in use for a few months now, and has been performing admirably, running the usual self-hosted suspects: Nextcloud, Plex, UniFi, etc., as well as facilitating my thesis for my master's in educational sociology.

This is going great — but there was one nagging annoyance: the fan curve had a bit of hysteresis in it that ramped the fans up, then down, then up, then down. My server closet is pretty much in the middle of the house as well, so noise becomes annoying fast. And it is a literal closet, so the thermal solution isn't great. Oh, to have a basement or outhouse…

There is a fix, though: changing the hysteresis profile associated with the first thermal sensor (index 0) in the DL380e.

This is achieved by running

fan t 0 hyst 2

from the iLO prompt.

Hysteresis profile 2 was selected on the basis of being the profile of all other sensors, except the PSU sensors. The front ambient sensors ran hysteresis profile 3 previously.
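
If you'd rather not log into the iLO prompt interactively, the same command can be sent over SSH, reusing the connection settings from the fan control entry earlier in this series (the username and hostname below are mine; adjust to taste):

ssh -o KexAlgorithms=+diffie-hellman-group1-sha1 martin@nas-ilo "fan t 0 hyst 2"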

The results

Fan utilization is now at a consistent 9.4-20-13-16-20-16 percent, which is decently quiet. The tradeoff, however, is higher component and ambient temperatures (all values in Celsius, at mostly idle):

Sensor      Now     Before
==========================
Ambient     28-33   26-32
Exhaust     47-50   44-46
P420        70-75   64-66
CPU         50-55   40-45
Disks       44-52   42-47

This, I’m willing to accept. YMMV. It is, however, imperative to actively cool the P420 - that’s a HOT chip.

Now, I want to investigate an R210ii fan swap. The higher ambient temps brought on by summer and, y'know, actually using this old thing, have set those fans (slightly) on fire. Those 40mm fans make their presence known!

More to play with

You can change the threshold values with:

fan t <n> caut <offset from original value>
fan t <n> crit <offset from original caut>

But I haven't had a chance, or a need, to mess around. There are also a lot of tunables per sensor regarding setpoints, gain values and such. I get rather lost, rather quickly.
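
For illustration only, since I haven't tested these myself: going by the syntax above, nudging sensor 2's caution threshold up by 5 degrees, and its critical threshold to 10 above the original caution value, would presumably look like:

fan t 2 caut 5
fan t 2 crit 10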

Failed iLO management port

Last week, I got my hands on a set of three Intel S3700 datacenter SSDs, and today was going to be the day I installed them.

The disk installation was straightforward enough. Legos and duct tape were involved. More on that some other time, perhaps.

The issue

So, I get the disks in, close up, and plug everything back in. And somehow I manage to kill the iLO management port along the way.

Well, the server still works. But now the iLO can’t be accessed over the network, which means my fan control scripts won’t work.

This greatly reduces the Everyone-In-The-House Acceptance Factor (EITHAF?), which is a problem, since I’ve actually, finally, started using the server for thesis word counting.

The solution

HPE has a utility for their servers called hponcfg, which allows you to set iLO parameters directly.

I goofed around with RPMs and missing drivers in unRAID trying to get a sign of life from the iLO, but to no avail. It was dead.

Well, the network port was. No link light on either the switch or the iLO port, across multiple cables. A monitor let me see what was going on, which indicated that the iLO hardware was otherwise fine.

So: I needed to set the iLO to run on the onboard NIC of my 380e.

grml

grml is the grumpy sound sysadmins make when they can’t automate every task in front of them. It is also a very capable live CD, based on Debian.

Download grml, and get it on a bootable USB somehow. I used Etcher.
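
If you're already on Linux, plain dd also works, since grml ships hybrid ISOs (image name and device node below are placeholders; triple-check the device before running):

dd if=grml64-full.iso of=/dev/sdX bs=4M status=progress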

Boot off the grml USB. I typed grml ssh=secret at the boot prompt to be able to SSH in.

hponcfg

hponcfg is available from HPE as part of the Management Component Pack.

To install it, first add the repository:

cat "deb http://downloads.linux.hpe.com/SDR/repo/mcp buster/current non-free" >> /etc/apt/sources.list.d/hp-mpc.list

and the GPG key:

curl http://downloads.linux.hpe.com/SDR/hpePublicKey2048_key1.pub | apt-key add -

Then pull a fresh list of packages, and install hponcfg:

apt-get update && apt-get install hponcfg
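
hponcfg talks to the iLO through the hpilo kernel driver, which lives in the mainline kernel. I believe grml's Debian kernel ships it as a module, so if hponcfg complains it can't find the iLO, try:

# load the iLO channel interface driver (assumption: built as a module)
modprobe hpilo
# its device nodes should show up here
ls /dev/hpilo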

The XML file

I really don't like XML as a representation of data. Too verbose. But nevertheless, hponcfg consumes XML.

The following snippet tells the iLO to use the onboard NIC instead of the management port:

<!-- HPONCFG VERSION = "5.5.0" -->
<!-- Generated 5/12/2020 23:15:55 -->
<RIBCL VERSION="2.1">
 <LOGIN USER_LOGIN="Administrator" PASSWORD="password">
  <RIB_INFO MODE="write">
  <MOD_NETWORK_SETTINGS>
    <SHARED_NETWORK_PORT VALUE="Y"/>
  </MOD_NETWORK_SETTINGS>
  </RIB_INFO>
 </LOGIN>
</RIBCL>

Save it on your grml live environment as, say, lom.xml, and apply the configuration with hponcfg -f lom.xml. After a bit of time, the iLO will reset and come back up on the onboard NIC.
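
To confirm the change stuck, hponcfg can also dump the running configuration back out; on my version, -w writes the current settings to a file (check hponcfg -h for yours):

hponcfg -w /tmp/ilo-running.xml
grep -i SHARED_NETWORK_PORT /tmp/ilo-running.xml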

Notes

I really want these tools in base unRAID, or at least the drivers for them.

After an iLO reset, it had forgotten to run off the shared network interface.

Running the iLO off the LOM means that the server itself cannot SSH into its own iLO.

unRAID Persistent Home

In the unRAID subreddit I came across someone needing to set up an SSH key for accessing files remotely.

There is, however, a step zero to perform before this can work.

The problem

When unRAID boots, most of the filesystem is copied from the boot USB into RAM. This means that any changes will be lost upon a reboot. Thus, simply doing mkdir /home/norseghost for user settings won’t work.

Persistent home is still possible, though.

The solution

Create a home directory somewhere on the array. Mine is at /mnt/user/system/home. I've set the system share to be cache-prefer.

Install the User Scripts plugin. Create a new script - let’s call it mount_home:

#!/bin/bash

mount -o bind /mnt/user/system/home /home

This bind-mounts the newly created home dir onto the system /home. Set it to run on array start. And run it now, while you're at it.

Create another one - unmount_home:

#!/bin/bash

umount /home

Set that one to run on array stop. This is to enable array shutdown without “device busy” errors.

Profit!

Now unRAID has a persistent home. Create a home directory for any user that would log in remotely:

root@unraid# mkdir /home/norseghost
root@unraid# chown norseghost:norseghost /home/norseghost

And now you can add passwordless ssh login, shell profile customizations, or whatever.
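
As a sketch of that last bit (assuming your sshd is configured to let that user log in at all), key-based login is then the same as on any Linux box; the key below is a placeholder:

mkdir -p /home/norseghost/.ssh
# paste the public key from the machine you'll connect from
echo "ssh-ed25519 AAAA...example user@laptop" >> /home/norseghost/.ssh/authorized_keys
chmod 700 /home/norseghost/.ssh
chmod 600 /home/norseghost/.ssh/authorized_keys
chown -R norseghost:norseghost /home/norseghost/.ssh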

Add a GPU to the DL380e

Because the CPU is for, uh, my thesis!

After observing Plex streams churning up some CPU time, I decided to get a GPU to offload transcoding tasks. I went with a GTX 1050 Ti 4GB. This is more or less the same chip as the Quadro P2000, which is the current ~budget~, uh, low-powered darling. With a sneaky workaround, the artificial two-transcode limit is easily circumvented. At a fifth of the price of a P2000, I'll take that deal!

GPU installation patch

To pass your GPU through to your Plex Docker container, a few preparatory steps are needed.

0. Install Community Applications

1. Install unRAID-nVidia

From Community Applications, install unRAID-nVidia. Go to Settings → unRAID-nVidia, select the nVidia build for your version of unRAID, and install.

2. Prepare patch

su
cd /boot
wget https://raw.githubusercontent.com/keylase/nvidia-patch/master/patch.sh
chmod +x patch.sh
mv patch.sh nvidia-patch.sh
cat /boot/nvidia-patch.sh >> config/go
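
Appending the script body to config/go means the patch is re-applied on every boot, since unRAID runs /boot/config/go at startup. If you'd rather keep the go file readable, calling the script instead should work just as well (untested on my box):

echo "bash /boot/nvidia-patch.sh" >> /boot/config/go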

3. Power down, and install GPU

Installing the card is straightforward. Remove the PCIe riser, insert the card in the 16x wide slot, reinsert, reboot.

The GPU fits!

Do note that the x16 slot is x8 electrical, but this does not matter for our purposes. My particular card did not need an extra GPU power cable. If yours does, you need the 10-pin to GPU power adapter from HP, or get this one from moddiy.com

4. Reboot, and configure the PLeX docker

…but first, go to Settings → unRAID-nVidia, and copy your GPU GUID somewhere convenient.

In the unRAID web UI, navigate to Docker and reconfigure the Plex container. Switch to advanced view, and under "Extra Parameters" add --runtime=nvidia. Under NVIDIA_VISIBLE_DEVICES, add that GUID. Save, which restarts the Plex container.
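
For reference, outside the unRAID UI those settings boil down to roughly this docker invocation (a sketch only; the GUID is a placeholder, and the image name assumes the official plexinc/pms-docker image rather than whatever your template uses):

docker run -d --name plex \
  --runtime=nvidia \
  -e NVIDIA_VISIBLE_DEVICES=GPU-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx \
  plexinc/pms-docker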

5. Enable hardware transcoding in PLeX

In the Plex web UI, go to settings for your server. Under "Transcoder", select "Use hardware acceleration when available".

And there you go!

Pipe down, you!

Remember the fan control rain dance from the last entry in this series? And HPE's aggressive stance towards fan control? Well, HPE's not gonna let you forget. After installing the GPU, my fans were running at an… excessive > 60 %. And my previous fan hack - just setting every fan baseline to 1 - didn't work anymore. This Reddit post pointed me in the right direction, though.

This process is also a little involved, so buckle up.

1. Reset iLO.

There's a bug in the fan-control hacked firmware where, after a reset, only the first SSH session displays command output. And that output is important for the next step.

2. Figure out which sensor is making iLO freak out

iLO is, as my son so eloquently put it, crying "stranger danger" on account of not recognizing the GPU. This can be illustrated by SSHing into the iLO and running the command fan info g.

A nice table like the following should be presented:

GROUPINGS
0: FASTEST Output:  63  [02*07 ...
1: FASTEST Output:  63  [02*07 ...
2: FASTEST Output:  35  [01 02*...
3: FASTEST Output:  36  [01 02 ...
4: FASTEST Output:  60  [01 03 ...
5: FASTEST Output:  60  [01 05 ...

(Example borrowed from the linked Reddit post, since I forgot to save my actual output)

Note that some numbers are marked with an *. This indicates the sensor iLO is reading as the hottest - in my case, sensor 52.
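
If you'd rather not eyeball the table, something like this pulls the starred sensor numbers out (a sketch, reusing my SSH settings from the first fan control entry):

ssh -o KexAlgorithms=+diffie-hellman-group1-sha1 martin@nas-ilo "fan info g" | grep -oE '[0-9]+\*' | tr -d '*' | sort -un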

I said, be QUIET!

To quiet down just that sensor, run fan pid 52 hi 300, or some other low number. And enjoy immediate relief as your fans settle down somewhere around 10-15 %.

Results

Quick testing yielded two 4K → 1080p transcodes at ~1500 MB of GPU RAM each, alongside one 1080p → 720p transcode. Realistically, I won't have much more than one 4K transcode at any given moment, if at all. Very nearly zero CPU usage though, which was nice.
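
nvidia-smi, which ships with the unRAID-nVidia build, is handy for watching this live; each Plex transcode shows up as a process with its VRAM usage:

watch -n 2 nvidia-smi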

Feels good when a plan comes together.

A recent acquisition

My new NAS, remote workstation and so much more

Pursuant to my master's in educational sociology, I've been writing a fair bit of R recently. I've quickly run into resource bottlenecks though — the intersection of a fairly large dataset and a mere X1C6 turns out to… not be great. So… what better excuse to buy retired enterprise hardware? It's for my degree!

The hardware

| Base     | HP DL380e gen 8         |
| Chassis  | 12 LFF bays, 1x750W PSU |
| CPU      | Dual Xeon E5-2450L      |
| RAM      | 96 GB (4x8 + 4x16)      |
| HBA(ish) | P420, B120i             |
| Data     | 5x 6TB HGST NL-SAS      |
| Cache    | 500GB WD Red NAS SSD    |

Went whole hog on the RAM, as that's where my programming efforts are stymied. I've got room to grow - HPE says the box maxes out at 192 GB, while unRAID reports 384 GB max capacity.

The disks were used, and a steal at ~15 USD per terabyte. All report A-OK.

Hacks

Extra 12v/5v power

To use the B120i for cache SSDs, I needed to find extra power somewhere. My 380e came without the rear drive cage option; but did come with the rear drive cage cable. Measurements yielded this pinout:

    |-|
+---------+  1: 8v ground    4: 12/8v (yellow)
| 1  2  3 |  2: empty        5: 12v/5v ground
| 4  5  6 |  3: 1v ground    6: 5v/1v (red)
+---------+

Also, the cable fits a female 6-pin PCI Express connector perfectly. So, after a massacre of a 6-to-8-pin adapter as well as a Molex extender, we have power!

The yellow and red leads on the Molex adapter go to pins 4 and 6, and the two black leads meet and go to pin 5.

The first of four planned cache drives

Fan Control

HP servers are notorious for having an… aggressive approach to fan profiles. This means they can be hard to share a small home with. But never fear! Nerds to the rescue — turns out, there’s a hack for that.

Do note: THIS IS A HACK. IT MAY NOT WORK. IT MAY BRICK YOUR SERVER.

I hate noise more than I have sense, and I was fine in the end. YMMV.

0. Actively cool the P420

The P420 is a hot chip, and largely responsible for the baseline fan levels. iLO does all it can to keep it at or below 85 degrees C, which means running the fans hard. I zip-tied a Noctua 40mm fan (the NF-A4x20 FLX) to the heatsink, with great results. Powered from the rear drive bay power cable (another adapter in the chain), this keeps the RAID card at a comfortable 65-67 degrees C, and iLO can stop worrying.

So. Many. Adapters

Except that, in HPE's infinite wisdom, any detected PCIe card means fans 3, 4, and 5 will run at a minimum of 35-40 %. More steps must be taken!

1. Install exploitable iLO firmware

wget https://downloads.hpe.com/pub/softlib2/software1/sc-linux-fw-ilo/p192122427/v112485/CP027911.scexe
chmod +x CP027911.scexe
./CP027911.scexe --unpack=ilo

Install the 2.50 firmware however you like. I used the web interface.

2. Install hacked firmware

git clone git@github.com:airbus-seclab/ilo4_toolbox.git
yay -S keystone hexdump
cd ilo4_toolbox/scripts/iLO4/exploits
wget https://uc2e993615a24a6915b40d722b8c.dl.dropboxusercontent.com/cd/0/get/A1CIhVjQEhr9ukukz8Qw_dHKizKB0RGgnFjfrp6z1rUtvBFclCvn4t6LErPcGVl0At3NQKzgezKAb8eV9-W5eg1P_0lRnZ47R-d5u0r4VvTpbmRBuItsv5RL2b2aKbyY7_M/file?_download_id=16760008867236312560412850928972566356913390752513665509633372074&_notify_domain=www.dropbox.com&dl=1
python2 exploit_write_flash.py 250 ilo4_healthcommands.bin

I had the exploit stall the first time I ran it. Tried again, the planets were aligned.

3. Control fan speed

This will reduce the base speed of the fans to a more bearable level across the board, while allowing the firmware to respond as designed to high temperatures. It really quiets down around sensor 32, which, incidentally, is the P420. Adding disks can cause the fans to spin up again, which requires a re-run of the command. Change the user and iLO hostname to suit your environment. If private keys are not set up, add sshpass -p <password> after do, as in the variant below the loop.

for I in `seq 1 65`; do ssh -o KexAlgorithms=+diffie-hellman-group1-sha1 martin@nas-ilo "fan pid $I lo 125"; done
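
With a password instead of keys, that becomes (sshpass must be installed, and your iLO password ends up in shell history; the password below is a placeholder):

for I in `seq 1 65`; do sshpass -p 'secret' ssh -o KexAlgorithms=+diffie-hellman-group1-sha1 martin@nas-ilo "fan pid $I lo 125"; done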

I decided to run this command periodically, in case the box gets confused and ramps up the fan profiles again.
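
One way to do that is to wrap the loop in a script and hand it to unRAID's User Scripts plugin, which accepts a custom cron schedule (a sketch; the filename and interval are arbitrary):

#!/bin/bash
# quiet-fans.sh: re-apply the low fan baseline on every run
for I in `seq 1 65`; do
  ssh -o KexAlgorithms=+diffie-hellman-group1-sha1 martin@nas-ilo "fan pid $I lo 125"
done

scheduled with a cron expression like */30 * * * * for every half hour.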