Free Veeam Linux backups

I didn’t feel like fighting with Nutanix CE any more, so I copied my VMs out using qemu-img and rebuilt the entire cluster with free ESXi 6.7. Seems to be OK so far, with a few minor issues. Now that I’m back in operation, I have a concern about having only a single copy of my VMs on standalone hosts with single-disk RAID0 configs. A bit of a roll of the dice, but it works. So, I need a backup solution. The free ESXi license doesn’t allow API connections for backups, so you have to use something else. I’m using the free Veeam Agent for Linux on each of my VMs now, sending the backups to an NFS share on the Unraid server. Here’s my list of commands so I can do this again later and not wonder how I did it.

Make sure we’re updated before starting:

sudo apt-get update
sudo apt-get upgrade
sudo apt-get autoremove

SCP the Veeam download file to the server; I think it might just be a repository and user-add script.

sudo dpkg -i veeam-release-deb_1.0.5_amd64.deb
sudo apt-get update
sudo apt-get install veeam

Next we’ll take care of the NFS share mounting.

sudo apt-get install nfs-common
sudo mkdir /mnt/backup
sudo nano /etc/fstab

Here’s my one fstab line for connecting to the Unraid NFS share:

192.168.169.8:/mnt/user/vmware /mnt/backup nfs auto 0  0

Mount the new share:

sudo mount -a

You should now be able to run the Veeam agent with "sudo veeam", which will launch the ncurses GUI. It’s pretty obvious how to work through it from there: just select local storage and browse to /mnt/backup, or whatever you named the mount point.
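The ncurses GUI is fine for setup, but the agent also ships a veeamconfig CLI, which is handy if you’d rather kick jobs off from cron once they’re defined. A minimal sketch, guarded so it’s safe to paste anywhere; the job name is hypothetical (list your real ones with "veeamconfig job list"):

```shell
#!/bin/sh
# Hypothetical job name -- check yours with "veeamconfig job list".
JOB="NightlyToUnraid"

if command -v veeamconfig >/dev/null 2>&1; then
    # Start the configured job non-interactively (e.g. from cron).
    sudo veeamconfig job start --name "$JOB"
else
    echo "veeamconfig not found; install the veeam package first"
fi
```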

Nutanix CE, Dell H700 and disk detection

Not sure why I didn’t post this before, but here it is.  When installing Nutanix CE on a Dell server (an R710 in my case) with a PERC H700 card, you can’t pass the disks through directly.  You have to configure each one as a single-disk RAID0 in the PERC config (Ctrl-R at boot for the H700).  That works, but then CE can’t detect what type of drive each virtual disk is.  It will fail with a message about missing requirements.  You have to manually set the "rotational" flag so the install routine can tell the SSD from the HDDs.

Exit the install process and log in as root (password nutanix/4u).

dmesg | grep sda

You might need to do this for sda, sdb, sdc and so on until you identify which device is actually the SSD.  The drive sizes in the dmesg output should tell you which is which.

echo 0 > /sys/block/sd?/queue/rotational

Where sd? is replaced with the correct device.  Writing a 0 to the rotational flag marks that device as non-rotational, which is how the installer decides it has found an SSD.  Log out and log back in as install.  Should be fine from there.  This is being done with the IMG install, not the ISO.  I did get a similar error on the ISO but didn’t try to work around it.  I’ll see if it works the same way with that.
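Before and after the echo, it’s worth dumping the flag for every device so you can compare. This little helper is my own sketch (not part of the installer); it just walks a /sys/block-style tree and prints each device’s rotational flag (1 = HDD, 0 = SSD):

```shell
#!/bin/sh
# report_rota DIR -- print "<device>: <flag>" for every */queue/rotational
# under DIR (normally /sys/block).
report_rota() {
    for f in "$1"/*/queue/rotational; do
        [ -e "$f" ] || continue   # no matching devices on this box
        printf '%s: %s\n' "$(basename "$(dirname "$(dirname "$f")")")" "$(cat "$f")"
    done
}

report_rota /sys/block
```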

Ok, not bailing on Nutanix CE just yet

Since my last post I’ve been fighting to get the cluster back to some semblance of stability.  Doesn’t look like that’s going to happen.  However, there isn’t anything out there that approaches the ease of configuration, fault tolerance and hyperconverged storage I’m looking for.  So…I’m making sure I have a backup copy of my VMs (especially Grafana) that would be problematic to rebuild, and then I’ll blow away the entire cluster and start over from scratch.

First challenge is backups.  I found this link which describes what appears to be the simplest process:  Backup and restore VMs in Nutanix CE

I’ve had to modify this a bit for my particular circumstances, but it appears to be working well.  First, I’m lucky enough to have an Unraid NAS that holds all my media.  Plenty of space, and it happens to support NFS.  With some trial and error and the instructions above, I managed to work out the following command:

qemu-img convert -c nfs://127.0.0.1/default-container-46437110940265/.acropolis/vmdisk/be09f08c-56bf-472c-b8a6-16c022333ca5 -O qcow2 nfs://192.168.169.8/mnt/user/vmware/UnifiVideo.qcow2

Using this command and the previous instructions I was able to send the backup qcow2 image directly to Unraid for backup purposes.  It sure would be nice to have a checkbox option, but this will do for now.  Also, be sure to use the screen option in the other post because it’s likely your session will time out on modestly sized VMs.
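Since these conversions can run for a long time, I find it easier to build the command with a tiny helper and run it inside screen. This is my own sketch; the container path and NAS address are from my setup, so substitute your own:

```shell
#!/bin/sh
# Source container path and NFS target from my environment -- adjust both.
CTR="nfs://127.0.0.1/default-container-46437110940265/.acropolis/vmdisk"
NAS="nfs://192.168.169.8/mnt/user/vmware"

# build_convert <vmdisk_uuid> <vm_name> -- prints the qemu-img command to run
build_convert() {
    printf 'qemu-img convert -c %s/%s -O qcow2 %s/%s.qcow2\n' "$CTR" "$1" "$NAS" "$2"
}

# Run it inside screen so an SSH timeout doesn't kill the copy:
#   screen -S export
#   eval "$(build_convert <vmdisk_uuid> MyVM)"
# Detach with Ctrl-a d, reattach later with: screen -r export
build_convert be09f08c-56bf-472c-b8a6-16c022333ca5 UnifiVideo
```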

Bailing on Nutanix CE

I’ve had a number of problems with CE lately that look like they’ll be difficult to fix.  Not sure I’ll have a better experience with a different platform, but I figure it’s about time to try.  If nothing else, I’ll have some backup copies of my VMs, which is not an obvious thing in CE.

So, here’s what I’ve had to go through to export VMs:

Log into a CVM and then run "acli" to find the vmdisk_uuid.  Do a "vm.get [vm name] include_vmdisk_paths=1" to see a list of parameters.  Copy the vmdisk_uuid from about the middle of the output.  Exit the acli.

Run this:

qemu-img convert -c nfs://127.0.0.1/PrismCentral-CTR/.acropolis/vmdisk/[vmdisk_uuid from the previous step] -O qcow2 ./data/stargate-storage/disks/drive-scsi0-0-0-1/NameOfVM.qcow2

This command might be different depending on your setup.  I’m running PrismCentral so I had to use this location to find the disk.  The path is listed above the uuid in the acli command.  The output target will need to be adjusted depending on your space requirements.  If you leave out the target destination I believe it will save it in your current directory.  That might be ok or it might be too small for the VM.  I checked mine and decided to save it to the slow tier disk.  Depending on the disk size it might take a very long time.

Once it completes you can use SCP (not SFTP) to copy the file off.  I used WinSCP to connect to the same CVM.  The path for the above command is /home/nutanix/data/stargate-storage/disks/drive-scsi0-0-0-1.  The disk copy is in there for me and I can SCP it to somewhere else.  I tried sending it directly to an NFS share I have running on the network but it failed permissions, despite being whitelisted.  This process is cumbersome, but it works.  I’m sure there are better ways…
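For the copy itself, the one-liner looks roughly like this, run from the machine that should receive the file. The CVM address is a placeholder; the path is where the export landed in the step above:

```shell
#!/bin/sh
# CVM address is a placeholder for your environment.
CVM="nutanix@192.168.x.x"
IMG="/home/nutanix/data/stargate-storage/disks/drive-scsi0-0-0-1/NameOfVM.qcow2"

# Print the command rather than running it blindly -- copy/paste once verified.
echo "scp $CVM:$IMG ."
```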

Grafana and Chromecasting to a TV

I’ve wanted to use simple Chromecast dongles for pumping a Grafana dashboard to a TV for a while now.  The challenge has been how to effectively manage the casting source.  Chromecasts can’t manage any of their own content, they can only be a casting target.  I don’t want a mobile device sitting in the rack with its sole purpose being the casting function.  Management of that would be difficult.  I also want to be able to cast to multiple Chromecasts with the same content or different content.

Google makes this difficult by limiting the signing certificate in the casting protocol.  However, some people have worked around it.  I’ve tried two different casting servers and I’m having success with:

https://mrothenbuecher.github.io/Chromecast-Kiosk/

I set up a dedicated VM with pretty light resources, installed Tomcat and then added the Kiosk server.  It works really well with one caveat.

The Chromecast dongles will arbitrarily decide if the TV is 720P or 1080P.  For most video content this doesn’t have a dramatic impact, but when you’re trying to display a dense Grafana dashboard it can make all the difference.  Unfortunately, this isn’t controllable in any way.  You have to test it against the TV and hope it works.

I now have a 32″ TV in the kitchen which is 1080P (also hard to find at 32″) and displaying a pretty dense Grafana dashboard.  I’ll try to add a picture here later.  I think this could be incredibly useful for business monitoring scenarios and is a lot less expensive than putting a PC on a TV.

Unraid shell script for getting stats into Grafana

Continuing the documentation effort.  This is a shell script you run from Unraid in a cron job to feed stats to InfluxDB.  You can then present them in Grafana.  A note about that: I was having a lot of trouble getting the Grafana graphs to present correctly for anything coming from this script.  I had to change the Fill from "null" to "none" in the graph.  Not sure why that’s happening, but "none" gets it to behave just like everything else.

#!/bin/bash
## Assembled from this post: https://lime-technology.com/forum/index.php?topic=52220.msg512346#msg512346
## add to cron like:
## * * * * * sleep 10; /boot/custom/influxdb.sh > /dev/null 2>&1
## //0,10 * * * * /boot/custom/influxdb.sh > /dev/null 2>&1
#
# Set Vars
#
DBURL=http://192.168.x.x:8086 ## IP address of your InfluxDB server
DBNAME=dashboard ## Easier if you pick an existing DB
DEVICE="UNRAID"
CURDATE=`date +%s`

# Current array assignment.
# I could pull this automatically from /var/local/emhttp/disks.ini
# Parsing it wouldn't be that easy though.
DISK_ARRAY=( sdn sdl sdf sdc sdj sde sdo sdh sdi sdd sdk sdm sdg sdp sdb )
DESCRIPTION=( parity disk1 disk2 disk3 disk4 disk5 disk6 disk7 disk8 disk9 disk10 disk11 disk12 disk13 cache )

#
# Added -n standby to the check so smartctl is not spinning up my drives
#
i=0
for DISK in "${DISK_ARRAY[@]}"
do
  smartctl -n standby -A /dev/$DISK | grep "Temperature_Celsius" | awk '{print $10}' | while read TEMP
  do
    curl -is -XPOST "$DBURL/write?db=$DBNAME" --data-binary "DiskTempStats,DEVICE=${DEVICE},DISK=${DESCRIPTION[$i]} Temperature=${TEMP} ${CURDATE}000000000" >/dev/null 2>&1
  done
  ((i++))
done

# Had to increase to 10 samples because I was getting a spike each time I read it. This seems to smooth it out more
top -b -n 10 -d.2 | grep "Cpu" | tail -n 1 | awk '{print $2,$4,$6,$8,$10,$12,$14,$16}' | while read CPUusr CPUsys CPUnic CPUidle CPUio CPUirq CPUsirq CPUst
do
  top -bn1 | head -3 | awk '/load average/ {print $12,$13,$14}' | sed 's/,//g' | while read LAVG1 LAVG5 LAVG15
  do
    curl -is -XPOST "$DBURL/write?db=$DBNAME" --data-binary "cpuStats,Device=${DEVICE} CPUusr=${CPUusr},CPUsys=${CPUsys},CPUnic=${CPUnic},CPUidle=${CPUidle},CPUio=${CPUio},CPUirq=${CPUirq},CPUsirq=${CPUsirq},CPUst=${CPUst},CPULoadAvg1m=${LAVG1},CPULoadAvg5m=${LAVG5},CPULoadAvg15m=${LAVG15} ${CURDATE}000000000" >/dev/null 2>&1
  done
done

if [[ -f byteCount.tmp ]] ; then
  # Read the last values from the tmpfile - Line "eth0"
  grep "eth0" byteCount.tmp | while read dev lastBytesIn lastBytesOut
  do
    cat /proc/net/dev | grep "eth0" | grep -v "veth" | awk '{print $2, $10}' | while read currentBytesIn currentBytesOut
    do
      # Write out the current stats to the temp file for the next read
      echo "eth0" ${currentBytesIn} ${currentBytesOut} > byteCount.tmp
      totalBytesIn=`expr ${currentBytesIn} - ${lastBytesIn}`
      totalBytesOut=`expr ${currentBytesOut} - ${lastBytesOut}`
      curl -is -XPOST "$DBURL/write?db=$DBNAME" --data-binary "interfaceStats,Interface=eth0,Device=${DEVICE} bytesIn=${totalBytesIn},bytesOut=${totalBytesOut} ${CURDATE}000000000" >/dev/null 2>&1
    done
  done
else
  # Write out blank file
  echo "eth0 0 0" > byteCount.tmp
fi

# Gets the stats for boot, disk#, cache, user
#
df | grep "mnt/\|/boot\|docker" | grep -v "user0\|containers" | sed 's/\/mnt\///g' | sed 's/%//g' | sed 's/\/var\/lib\///g' | sed 's/\///g' | while read MOUNT TOTAL USED FREE UTILIZATION DISK
do
  if [ "${DISK}" = "user" ]; then
    DISK="array_total"
  fi
  curl -is -XPOST "$DBURL/write?db=$DBNAME" --data-binary "drive_spaceStats,Device=${DEVICE},Drive=${DISK} Free=${FREE},Used=${USED},Utilization=${UTILIZATION} ${CURDATE}000000000" >/dev/null 2>&1
done
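After the cron job has fired a couple of times, you can spot-check that points actually landed with the influx CLI on the InfluxDB host. This is my own check, not part of the forum script; the database name matches DBNAME above, and the command is guarded so it no-ops where the CLI isn’t installed:

```shell
#!/bin/sh
# Database name matches DBNAME in the script above.
DB="dashboard"
Q="SELECT * FROM DiskTempStats ORDER BY time DESC LIMIT 3"

if command -v influx >/dev/null 2>&1; then
    # Show the three most recent temperature points.
    influx -database "$DB" -execute "$Q"
else
    echo "influx CLI not available on this machine"
fi
```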

Telegraf mixed SNMP config

Following my previous post about Grafana, once everything is installed you’ll want to capture some data.  Otherwise, what’s the point?  Telegraf is a data-gathering tool made by InfluxData.  It’s stupid simple to get working with InfluxDB.  After following the previous script, go to /etc/telegraf/ and edit telegraf.conf.  Near the top is the Output Plugins section.  Make sure that’s modified for your InfluxDB install.  From there, scroll down to Input Plugins.  There’s a ridiculous number of input plugins available.  We’re focused on SNMP today, but it’s worth looking through the list to see if a “need” can be solved with Telegraf before using some other custom script.

For me, I needed to add SNMP for my Ubiquiti ER-X firewall and my Nutanix CE cluster.  Here’s my SNMP config section with the obvious security bits redacted:

# # Retrieves SNMP values from remote agents
[[inputs.snmp]]
agents = [ "192.168.x.x:161" ] ## Nutanix CE CVM IP
timeout = "5s"
version = 3
max_repetitions = 50

sec_name = "username"
auth_protocol = "SHA" # Values: "MD5", "SHA", ""
auth_password = "password"
sec_level = "authPriv" # Values: "noAuthNoPriv", "authNoPriv", "authPriv"
priv_protocol = "AES" # Values: "DES", "AES", ""
priv_password = "password"

name = "nutanix"
[[inputs.snmp.field]]
name = "host1CPU"
oid = "1.3.6.1.4.1.41263.9.1.6.1"
[[inputs.snmp.field]]
name = "host2CPU"
oid = "1.3.6.1.4.1.41263.9.1.6.2"
[[inputs.snmp.field]]
name = "host3CPU"
oid = "1.3.6.1.4.1.41263.9.1.6.3"
[[inputs.snmp.field]]
name = "host4CPU"
oid = "1.3.6.1.4.1.41263.9.1.6.4"
[[inputs.snmp.field]]
name = "ClusterIOPS"
oid = "1.3.6.1.4.1.41263.506.0"
[[inputs.snmp.field]]
name = "Host1MEM"
oid = "1.3.6.1.4.1.41263.9.1.8.1"
[[inputs.snmp.field]]
name = "Host2MEM"
oid = "1.3.6.1.4.1.41263.9.1.8.2"
[[inputs.snmp.field]]
name = "Host3MEM"
oid = "1.3.6.1.4.1.41263.9.1.8.3"
[[inputs.snmp.field]]
name = "Host4MEM"
oid = "1.3.6.1.4.1.41263.9.1.8.4"

[[inputs.snmp]]
agents = [ "192.168.0.1:161" ] ## Firewall IP
timeout = "5s"
retries = 3
version = 2
community = "RO_community_string"
max_repetitions = 10

name = "ERX"
[[inputs.snmp.field]]
name = "Bytes.Out"
oid = "1.3.6.1.2.1.2.2.1.10.2"
[[inputs.snmp.field]]
name = "Bytes.In"
oid = "1.3.6.1.2.1.2.2.1.16.2"

You’ll have to get Telegraf to read in the config again.  The sledgehammer method would be a reboot.  I think a Telegraf service restart would also do the trick.  Reboots for me take about 5 seconds (yep, really), so it’s useful to make sure it’s coming up clean on a reboot anyway.
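A service restart does do the trick, and I’d also dry-run the file first so a typo doesn’t take the collector down. A sketch of both steps, guarded so it’s harmless on a box without Telegraf (the config path is the standard package location):

```shell
#!/bin/sh
CONF="/etc/telegraf/telegraf.conf"

# Parse-check the config before restarting anything.
if command -v telegraf >/dev/null 2>&1; then
    telegraf --config "$CONF" --test || echo "config failed to parse"
fi

# Then bounce the service so it re-reads the file (no reboot needed).
if command -v systemctl >/dev/null 2>&1; then
    sudo -n systemctl restart telegraf 2>/dev/null || echo "could not restart telegraf here"
fi
```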

Grafana on Ubuntu 16.04…easy, I think

Just went through setting up Grafana on Ubuntu 16.04 and thought I would grab the steps I went through.  I’m using a combination of Telegraf and some custom remote scripts to get data into InfluxDB.

curl -sL https://repos.influxdata.com/influxdb.key | sudo apt-key add -
source /etc/lsb-release
echo "deb https://repos.influxdata.com/${DISTRIB_ID,,} ${DISTRIB_CODENAME} stable" | sudo tee /etc/apt/sources.list.d/influxdb.list
sudo apt-get update && sudo apt-get install influxdb
sudo service influxdb start
echo "deb https://packagecloud.io/grafana/testing/debian/ wheezy main" | sudo tee /etc/apt/sources.list.d/grafana.list
curl https://packagecloud.io/gpg.key | sudo apt-key add -
sudo apt-get update && sudo apt-get install grafana
sudo service grafana-server start
wget https://dl.influxdata.com/telegraf/releases/telegraf_1.2.1_amd64.deb
sudo dpkg -i telegraf_1.2.1_amd64.deb
telegraf -sample-config > telegraf.conf
nano telegraf.conf
telegraf -config telegraf.conf
sudo cp telegraf.conf /etc/telegraf/telegraf.conf
sudo systemctl enable grafana-server.service
sudo systemctl enable telegraf.service
sudo reboot

This gets things installed.  I’ll have another post to describe other configuration that’s required.

Grafana, Telegraf, Smokeping, oh my…

So, I’ve been working on something.  I keep seeing all of these very nice home lab dashboards on /r/homelab and I thought it would be useful to create one for myself.  I present to you, my home dashboard, which is hanging in the kitchen on an old iPad we weren’t using:

Getting to this point was not without challenges.  In fact, it was painful at times.  I’m going to try to document my setup here.  Because of all of the twists and turns along the way, I would say this is not a complete guide.  There are parts of this that you’ll have to figure out for yourself.  It also assumes some knowledge of linux, Ubuntu in particular.  If I get comments asking about specific sections, I’ll try to update the post with current info.

So, what do we have here?  The picture you see above is made up of a number of components.  InfluxDB is a time-series DB, much like RRDtool or the original MRTG.  It’s designed to take in datapoints, tag them with a timestamp, and then move on.  It might be capable of more, but we’re not using it for anything else.  Grafana is the visualization tool that creates what you see above.  Grafana is very configurable, which I’ll dive into more in a bit.  The final piece of the puzzle is data collection.  There are a number of ways to get data into InfluxDB.  I’m using Telegraf and some interesting scripting.

Let’s start by getting some links in here.  I’ll update this as I update the post.

This is where it all started for me:

https://lkhill.com/using-influxdb-grafana-to-display-network-statistics/

This was useful for the Grafana configuration:

Setup a wicked Grafana Dashboard to monitor practically anything

InfluxData, which includes InfluxDB and Telegraf

https://www.influxdata.com/

Grafana for the visualization:

http://grafana.org/

The “SmokePing” stand-in:

https://hveem.no/visualizing-latency-variance-with-grafana

The Unraid tools:

https://lime-technology.com/forum/index.php?topic=52220.msg512346#msg512346

Ok, here we go…

First, I would start with the top link to lkhill’s instructions.  Use that to get up and running with InfluxDB and Grafana installed.  DO NOT follow that guide for the InfluxSNMP install.  Telegraf takes care of SNMP now.  If I recall, InfluxData wants your…data, in order to download InfluxDB.  It’s cool though, because they’ll send you some swanky stickers.  I believe these are still valid instructions for installing Telegraf:  https://docs.influxdata.com/telegraf/v1.1/introduction/installation/

I would suggest getting to this point with InfluxDB, Grafana and Telegraf installed and not throwing errors before you proceed with any configuration.  I know I’m skipping a lot of things that might not work without some tweaking.  Like I said, I’ll update this if I get feedback that these installations need to be detailed.  Add the data source as shown in lkhill’s instructions.

At this point you should have some data being populated for the localhost and the data source should have been available.  I would suggest diverting from lkhill’s instructions at this point.  Instead of adding a graph for SNMP stats (we have none at this point), let’s set up a graph of the local CPU utilization.  Add a new dashboard and then click on the small green square in the upper left.  Click on the “A” select statement and it’ll expand to show you options for finding the data.  Clicking on each of the fields will either give you a drop down list of options, or it might give you an X above the item.  For instance, if you click on mean() you’ll get the x above that.  Click the x to delete mean().  Clicking the + at the end of each row will give you a list of options to add from.  Try to get your selection to look like this:

Click the big X out on the right of the tab bar, past Time range, to close the edit and return to the dashboard.  Congrats, you just made your first dashboard!  Let’s get some useful data in there.

First thing to take care of is to add SNMP.  Go to /etc/telegraf/ and edit telegraf.conf.  If there’s not a conf file, there might be a template called dpkg-dist in there.  If not, you can create a new template.  I found this extremely helpful for working through Telegraf issues:  https://github.com/influxdata/telegraf  You can also go right to the SNMP readme at https://github.com/influxdata/telegraf/tree/master/plugins/inputs/snmp

You can see that Telegraf has quite a few plugins for gathering data.  SNMP is only one part of it.  Some configuration is necessary to start using Telegraf.  Near the top of the file are general settings that must be configured.  Make sure in the Output Plugins section the urls, database and username/password are uncommented and correct.  The database can be called whatever you want, and you can have multiple databases in Grafana.  Find the "inputs.snmp" section and we’ll begin editing it.  Here’s mine:

# # Retrieves SNMP values from remote agents
[[inputs.snmp]]
agents = [ "192.x.x.x:161" ]
timeout = "5s"
version = 3
max_repetitions = 50

sec_name = "SNMPv3User"
auth_protocol = "SHA" # Values: "MD5", "SHA", ""
auth_password = "topsecret"
sec_level = "authPriv" # Values: "noAuthNoPriv", "authNoPriv", "authPriv"
priv_protocol = "AES" # Values: "DES", "AES", ""
priv_password = "alsotopsecret"

name = "nutanix"
[[inputs.snmp.field]]
name = "host1CPU"
oid = "1.3.6.1.4.1.41263.9.1.6.1"
[[inputs.snmp.field]]
name = "host2CPU"
oid = "1.3.6.1.4.1.41263.9.1.6.2"
[[inputs.snmp.field]]
name = "host3CPU"
oid = "1.3.6.1.4.1.41263.9.1.6.3"
[[inputs.snmp.field]]
name = "ClusterIOPS"
oid = "1.3.6.1.4.1.41263.506.0"
[[inputs.snmp.field]]
name = "Host1MEM"
oid = "1.3.6.1.4.1.41263.9.1.8.1"
[[inputs.snmp.field]]
name = "Host2MEM"
oid = "1.3.6.1.4.1.41263.9.1.8.2"
[[inputs.snmp.field]]
name = "Host3MEM"
oid = "1.3.6.1.4.1.41263.9.1.8.3"

[[inputs.snmp]]
agents = [ "192.x.x.x:161" ]
timeout = "5s"
retries = 3
version = 2
community = "topsecret"
max_repetitions = 10

name = "ERX"
[[inputs.snmp.field]]
name = "Bytes.Out"
oid = "1.3.6.1.2.1.2.2.1.10.2"
[[inputs.snmp.field]]
name = "Bytes.In"
oid = "1.3.6.1.2.1.2.2.1.16.2"

I’ve edited the IP addresses and security info, so make sure that matches whatever you have set up.  Oh yeah, you have to enable SNMP on your devices!  A couple of key points for this, you can have different SNMP versions or authentication methods defined by adding a new [[inputs.snmp]] for each one.  I’m also using the full OIDs, but you can see in the template that it’s possible to reference a MIB by name as well.  Save that and exit.  You can test the file with

telegraf -config telegraf.conf -test

This will give you lines for each device you’ve configured and show you what the response is.  If you don’t see data, something’s wrong with the SNMP config.