Let me tell you about the still-not-defunct real-time log processing
pipeline we built at my now-defunct last job. It handled logs from a large
number of embedded devices that our ISP operated on behalf of residential
customers. (I wrote and presented previously about some of the cool wifi
diagnostics that were possible with this data set.)
Lately, I've had a surprisingly large number of conversations about logs
processing pipelines. I can find probably 10+ already-funded, seemingly
successful startups processing logs, and the Big Name Cloud providers all
have some kind of logs thingy, but still, people are not satisfied. It's
expensive and slow. And if you complain, you mostly get told that you
shouldn't be using unstructured logs anyway, you should be using
structured events.
That advice is not wrong, but it's incomplete.
Instead of doing a survey of the whole unhappy landscape, let's just ignore
what other people suffer with and talk about what does work. You can
probably find, somewhere, something similar to each of the components I'm
going to talk about, but you probably can't find a single solution that
combines it all with good performance and super-low latency for a reasonable
price. At least, I haven't found it. I was a little surprised by this,
because I didn't think we were doing anything all that innovative.
Apparently I was incorrect.
The big picture
Let's get started. Here's a handy diagram of all the parts we're going to
talk about:
[diagram: device logs -> uploader -> log receiver -> file storage ->
grinder -> reducers/indexers -> dashboards, left to right]
The ISP where I worked has a bunch of embedded Linux devices (routers,
firewalls, wifi access points, and so on) that we wanted to monitor. The
number increased rapidly over time, but let's talk about a nice round
number, like 100,000 of them. Initially there were zero, then maybe 10 in
our development lab, and eventually we hit 100,000, and later there were
many more than that. Whatever. Let's work with 100,000. But keep in mind
that this architecture works pretty much the same with any number of
devices.
(It's a "distributed system" in the sense of
scalability, but it's also the simplest thing that really works for any
number of devices more than a handful, which makes it different from many
"distributed systems" where you could have solved the problem much more
simply if you didn't care about scaling. Since our logs are coming from
multiple sources, we can't make it non-distributed, but we can try to
minimize the number of parts that have to deal with the extra complexity.)
Now, these are devices we're monitoring, not apps or services or
containers or whatever. That means two things: we had to deal with lots of
weird problems (like compiler/kernel bugs and hardware failures), and most
of the software was off-the-shelf OS stuff we couldn't easily control (or
didn't want to rewrite).
(Here's the good news: because embedded devices have all the problems from
top to bottom, any solution that works for my masses of embedded devices will
work for any other log-pipeline problem you might have. If you're lucky, you
can just leave out some parts.)
That means the debate about "events" vs "logs" was kind of moot. We didn't
control all the parts in our system, so telling us to forget logs and use
only structured events doesn't help. udhcpd produces messages the way it
wants to produce messages, and that's life. Sometimes the kernel panics and
prints whatever it wants to print, and that's life. Move on.
Of course, we also had our own apps, which means we could also produce our
own structured events when it was relevant to our own apps. Our team had
whole never-ending debates about which is better, logs or events, structured
or unstructured. In fact, in a move only overfunded megacorporations can
afford, we actually implemented both and ran them both for a long time.
Thus, I can now tell you the final true answer, once and for all: you want
structured events in your database.
...but you need to be able to produce them from unstructured logs.
And once you can do that, exactly how those structured events are produced
(either from logs or directly from structured trace output) turns out to be
mostly an implementation detail.
But we're getting ahead of ourselves a bit. Let's take our flow diagram,
one part at a time, from left to right.
Userspace and kernel messages, in a single stream
Some people who have been hacking on Linux for a while may know about
/proc/kmsg: that's the file good old (pre-systemd)
klogd reads kernel
messages from, and pumps them to
syslogd, which saves them to a file.
Nowadays systemd does roughly the same thing but with more d-bus and
more corrupted binary log files. Ahem. Anyway. When you run the dmesg
command, it reads the same messages (in a slightly different way).
What you might not know is that you can go the other direction. There's a
file, /dev/kmsg (note: /dev and not /proc), which, if you write
to it, produces messages into the kernel's buffer. Let's do that! For all
our messages!
Wait, what? Am I crazy? Why do that?
Because we want strict sequencing of log messages between programs. And we
want that even if your kernel panics.
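To make that concrete, here's a minimal sketch of the idea (examples here
and below are in Go for brevity, even though our real tools were C; the app
name and message are made up):

    package main

    import (
        "fmt"
        "os"
    )

    func main() {
        // /dev/kmsg accepts writes; each write() becomes exactly one
        // kernel log record.
        kmsg, err := os.OpenFile("/dev/kmsg", os.O_WRONLY, 0)
        if err != nil {
            panic(err)
        }
        defer kmsg.Close()

        // "<6>" is a syslog-style priority prefix (6 = KERN_INFO).
        // Prefixing the app name keeps interleaved messages attributable.
        fmt.Fprintf(kmsg, "<6>myapp: user requested video playback\n")
    }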
Imagine you have, say, a TV DVR running on an embedded Linux system, and
whenever you go to play a particular recorded video, the kernel panics
because your chipset vendor hates you. Hypothetically. (The feeling is,
hypothetically, mutual.) Ideally, you would like your logs to contain a
note that the user requested the video, the video is about to start playing,
we've opened the file, we're about to start streaming the file to the
proprietary and very buggy (hypothetical) video decoder... boom. Panic.
What now? Well, if you're writing the log messages to disk, the joke's on
you, because I bet you didn't fsync() after each one. (Once upon a time,
syslogd actually did fsync() after each one. It was insanely disk-grindy
and had very low throughput. Those days are gone.) Moreover, a kernel panic
kills the disk driver, so you have no chance to fsync() it after the panic,
unless you engage one of the more terrifying hacks like, after a panic,
booting into a secondary kernel whose only job is to stream the message
buffer into a file, hoping desperately that the disk driver isn't the thing
that panicked, that the disk itself hasn't fried, and that even if you do
manage to write to some disk blocks, they are the right ones because your
filesystem data structure is reasonably intact.
(I suddenly feel a lot of pity for myself after reading that
paragraph. I think I am more scars than person at this point.)
The kernel log buffer lives in a fixed-size memory buffer in RAM. It defaults
to being kinda small (tens or hundreds of kBytes), but you can make it
bigger if you want (eg. with the log_buf_len= kernel boot parameter). I
suggest you do so.
By itself, this won't solve your kernel panic problems, because RAM is even
more volatile than disk, and you have to reboot after a kernel panic. So
the RAM is gone, right?
Well, no. Sort of. Not exactly.
Once upon a time, your PC BIOS would go through all your RAM at boot time
and run a memory test. I remember my ancient 386DX PC used to do this with
my amazingly robust and life-changing 4MB of RAM. It took quite a while.
You could press ESC to skip it if you were a valiant risk-taking rebel like
me.
Now, memory is a lot faster than it used to be, but unfortunately it has
gotten bigger more quickly than it has gotten faster, especially if you
disable memory caching, which you certainly must do at boot time in order to
write the very specific patterns needed to see if there are any bit errors.
So... we don't. That ended years ago. If you reboot your system, the
memory mostly will contain the stuff it contained before you rebooted. The
OS kernel has to know that and zero out pages as they get used. (Sometimes
the kernel gets fancy and pre-zeroes some extra pages when it's not busy, so
it can hand out zero pages more quickly on demand. But it always has to
zero them before handing them out.)
So, the pages are still around when the system reboots. What we want to
happen is this:
1. The system reboots automatically after a kernel panic. You can do this
by giving your kernel a boot parameter like "panic=1", which reboots it
after one second. (This is not nearly enough time for an end user to read
and contemplate the panic message. That's fine, because a) on a desktop PC,
X11 will have crashed in graphics mode so you can't see the panic message
anyway, and b) on an embedded system there is usually no display to put the
message on. End users don't care about panic messages. Our job is to
reboot, ASAP, so they don't try to "help" by power cycling the device, which
really does lose your memory.) (Advanced users will make it reboot after
zero seconds. I think panic=0 disables the reboot feature rather than doing
that, so you might have to patch the kernel. I forget. We did it, whatever
it was.)

2. The kernel always initializes the dmesg buffer in the same spot in RAM.

3. The kernel notices that a previous dmesg buffer is already in that spot
in RAM (because of a valid signature or checksum or whatever) and decides to
append to that buffer instead of starting fresh.

4. In userspace, we pick up log processing where we left off. We can
capture the log messages starting before (and therefore including) the
panic!

5. And because we redirected userspace logs to the kernel message buffer,
we have also preserved the exact sequence of events that led up to the
panic.
If you want all this to happen, I have good news and bad news. The good
news is we open sourced all our code; the bad news is it didn't get
upstreamed anywhere so there are no batteries included and no documentation
and it probably doesn't quite work for your use case. Sorry.
Open source code:
- logos, a tool for sending userspace logs to /dev/klogd. (It's logs... for
the OS.. and it's logical... and it brings your logs back from the dead
after a reboot... get it? No? Oh well.) This includes two per-app token
buckets (burst and long-term; there's a sketch of the idea below) so that an
out-of-control app won't overfill the limited amount of dmesg space.
- PRINTK_PERSIST, a kernel patch to make Linux reuse the dmesg buffer
across reboots.
Even if you don't do any of the rest of this, everybody should use
PRINTK_PERSIST on every computer, virtual or physical. Seriously. It's so
useful.
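Back to those token buckets for a second. Here's a hedged sketch of the
dual-bucket idea, reconstructed from the description above rather than from
the real logos source: one bucket enforces a short-term burst limit, the
other a long-term average, and a message has to fit in both.

    package ratelimit

    import "time"

    // bucket is a classic token bucket: it refills at `rate` tokens/sec
    // up to `max`, and each allowed message costs one token.
    type bucket struct {
        tokens, max float64
        rate        float64 // refill rate, messages per second
        last        time.Time
    }

    func (b *bucket) allow(now time.Time) bool {
        b.tokens += b.rate * now.Sub(b.last).Seconds()
        if b.tokens > b.max {
            b.tokens = b.max
        }
        b.last = now
        if b.tokens < 1 {
            return false // over budget: drop (and ideally count) it
        }
        b.tokens--
        return true
    }

    // allowMessage passes a message only if it fits both the short-term
    // burst budget and the long-term average budget.
    func allowMessage(burst, longterm *bucket) bool {
        now := time.Now()
        okBurst := burst.allow(now)
        okLong := longterm.allow(now)
        return okBurst && okLong
    }

The real thing also has to decide what to do about the dropped messages
(count them, at minimum), but that's the shape of it.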
(Note: room for improvement: it would be better if we could just redirect
app stdout/stderr directly to
/dev/kmsg, but that doesn't work as well as
we want. First, it doesn't auto-prefix incoming messages with the app name.
Second, libc functions like printf() actually write a few bytes at a time,
not one message per write() call, so they would end up producing more than
one dmesg entry per line. Third,
/dev/kmsg doesn't support the token
bucket rate control that logos does, which turns out to be essential,
because sometimes apps go crazy. So we'd have to further extend the kernel
API to make it work. It would be worthwhile, though, because the extra
userspace process causes an unavoidable delay between when a userspace
program prints something and when it actually gets into the kernel log.
That delay is enough time for a kernel to panic, and the userspace message
gets lost. Writing directly to
/dev/kmsg would take less CPU, leave
userspace latency unchanged, and ensure the message is safely in the kernel
buffer before the app's write() call even returns.)
(In related news, all of syslogd is kinda extraneous for this reason. So is
whatever systemd does. Why do we make everything so complicated? Just
write directly to files or the kernel log buffer. It's cheap and easy.)
Uploading the logs
Next, we need to get the messages out of the kernel log buffer and into our
log processing server, wherever that might be.
(Note: if we do the above trick - writing userspace messages to the kernel
buffer - then we can't also use klogd to read them back into syslogd.
That would create an infinite loop, and would end badly. Ask me how I
know.)
So, no klogd -> syslogd -> file. Instead, we have something like syslogd ->
kmsg -> uploader or app -> kmsg -> uploader.
What is a log uploader? Well, it's a thing that reads messages from the
kernel kmsg buffer as they arrive, and uploads them to a server, perhaps
over https. It might be almost as simple as "dmesg | curl", like my
original prototype, but we can get a bit fancier (there's a sketch of the
loop after this list):
- Figure out which messages we've already uploaded (eg. from the persistent
buffer before we rebooted) and don't upload those again.
- Log the current wall-clock time before uploading, giving us sync points
between monotonic time (/dev/kmsg logs "microseconds since boot" by
default, which is very useful, but we also want to be able to correlate
that with "real" time so we can match messages between related machines).
- Compress the file on the way out.
- Somehow authenticate with the log server.
- Bonus: if the log server is unavailable because of a network partition,
try to keep around the last few messages from before the partition, as
well as the recent messages once the partition is restored. If the
network partition was caused by the client - not too rare if you, like us,
were in the business of making routers and wifi access points - you really
would like to see the messages from right before the connectivity loss.
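Here's the promised sketch of that loop, covering the first few items
(again in Go for readability; the real uploader is C, and the endpoint URL
and high-water-mark persistence here are placeholders):

    package main

    import (
        "bytes"
        "compress/gzip"
        "net/http"
        "os"
        "strconv"
        "strings"
        "time"
    )

    // seqOf extracts the kernel sequence number from a kmsg record,
    // which looks like "<prio>,<seq>,<usec-since-boot>,<flags>;<text>".
    func seqOf(rec string) uint64 {
        n, _ := strconv.ParseUint(strings.SplitN(rec, ",", 3)[1], 10, 64)
        return n
    }

    func upload(batch []byte) error {
        var body bytes.Buffer
        gz := gzip.NewWriter(&body)
        // Wall-clock sync point for the monotonic timestamps inside.
        gz.Write([]byte("# walltime: " + time.Now().UTC().Format(time.RFC3339) + "\n"))
        gz.Write(batch)
        gz.Close()
        resp, err := http.Post("https://logs.example.com/upload", // placeholder
            "application/gzip", &body)
        if err == nil {
            resp.Body.Close()
        }
        return err // real code: retry with capped, jittered backoff (see below)
    }

    func main() {
        kmsg, err := os.Open("/dev/kmsg")
        if err != nil {
            panic(err)
        }
        lastSeq := uint64(0) // real code: persist this across restarts
        var batch bytes.Buffer
        buf := make([]byte, 8192)
        flush := time.Now().Add(60 * time.Second)
        for {
            n, err := kmsg.Read(buf) // one read() returns one record
            if err != nil {
                break
            }
            if s := seqOf(string(buf[:n])); s <= lastSeq {
                continue // uploaded before the reboot; skip it
            } else {
                lastSeq = s
            }
            batch.Write(buf[:n])
            if time.Now().After(flush) { // real code: flush on a timer too
                if upload(batch.Bytes()) == nil {
                    batch.Reset()
                }
                flush = time.Now().Add(60 * time.Second)
            }
        }
    }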
Luckily for you, we also open sourced our code for this. It's in C so it's
very small and low-overhead. We never did get the code for the "bonus"
feature working quite right, though; we kinda got interrupted at the last
minute.
Open source code:
- loguploader, a C client, including an rsyslog plugin for Debian in case
you don't want to use the kmsg trick above.
- devcert, a tool (and Debian package) which auto-generates a self-signed
"device certificate" wherever it's installed. The device
certificate is used by a device (or VM, container, whatever) to identify
itself to the log server, which can then decide how to classify and store
(or reject) its logs.
One thing we unfortunately didn't get around to doing was modifying the
logupload client to stream logs to the server. This is possible using
HTTP POST and Chunked transfer encoding,
but our server at the time was unable to accept streaming POST requests due
to (I think now fixed) infrastructure limitations.
(Note: if you write load
balancing proxy servers or HTTP server frameworks, make sure they can start
processing a POST request as soon as all the headers have arrived, rather
than waiting for the entire blob to be complete! Then a log upload server
can just stream the bytes straight to the next stage even before the whole
request has finished.)
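For what it's worth, the client side of streaming is nearly free in a
modern HTTP library. A hedged Go sketch: hand the request an io.Pipe, and
because the content length is unknown, the transport switches to chunked
transfer encoding on its own. The URL is, again, a placeholder.

    package streamer

    import (
        "io"
        "net/http"
    )

    // streamLogs POSTs log records as they arrive. The body is an
    // io.Pipe with unknown Content-Length, so Go's transport sends it
    // with chunked Transfer-Encoding; a server that starts processing
    // on headers can read the bytes before the request is complete.
    func streamLogs(src io.Reader) error {
        pr, pw := io.Pipe()
        go func() {
            _, err := io.Copy(pw, src) // pump records into the request
            pw.CloseWithError(err)
        }()
        resp, err := http.Post("https://logs.example.com/stream", "text/plain", pr)
        if err != nil {
            return err
        }
        return resp.Body.Close()
    }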
Because we lacked streaming in the client, we had to upload chunks of log
periodically, which leads to a tradeoff about what makes a good upload
period. We eventually settled on about 60 seconds, which ended up
accounting for almost all the end-to-end latency from message generation to
our monitoring console.
Most people probably think 60 seconds is not too bad. But thanks to the
awesome team I was working with, we managed to squeeze all the other
pipeline phases down to tens of milliseconds in total. So the remaining 60
seconds (technically: anywhere from 0 to 60 seconds after a message was
produced) was kinda embarrassing.
The log receiver
So okay, we're uploading the logs from client to some kind of server. What
does the server do?
This part is both the easiest and the most reliability-critical. The job is
this: receive an HTTP POST request, write the POST data to a file, and
return HTTP 200 OK. Anybody who has any server-side experience at all can
write this in their preferred language in about 10 minutes.
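For illustration, a minimal Go version of the whole job. The port, path,
and file naming scheme are all placeholders; ours wrote into an S3-like
blob store, not a local disk.

    package main

    import (
        "fmt"
        "io"
        "net/http"
        "os"
        "time"
    )

    func handle(w http.ResponseWriter, r *http.Request) {
        // One file per uploaded chunk; the naming scheme is made up.
        name := fmt.Sprintf("/var/logs/in/%d.log", time.Now().UnixNano())
        f, err := os.Create(name)
        if err != nil {
            http.Error(w, "try again later", http.StatusServiceUnavailable)
            return
        }
        defer f.Close()
        if _, err := io.Copy(f, r.Body); err != nil {
            http.Error(w, "try again later", http.StatusServiceUnavailable)
            return
        }
        // Ack only after the write (real code would fsync, or rely on
        // the blob store's durability guarantees).
        w.WriteHeader(http.StatusOK)
    }

    func main() {
        http.HandleFunc("/upload", handle)
        http.ListenAndServe(":8080", nil)
    }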
We intentionally want to make this phase as absolutely simple as
possible. This is the phase that accepts logs from the limited-size kmsg
buffer on the client and puts them somewhere persistent. It's nice to have
real-time alerts, but if I have to choose between somewhat delayed alerts or
randomly losing log messages when things get ugly, I'll have to accept the
delayed alerts. Don't lose log messages! You'll regret it.
The best way to not lose messages is to minimize the work done by your log
receiver. So we did. It receives the uploaded log file chunk and appends
it to a file, and that's it. The "file" is actually in a cloud storage
system that's more-or-less like S3. When I explained this to someone, they
asked why we didn't put it in a Bigtable-like thing or some other database,
because isn't a filesystem kinda cheesy? No, it's not cheesy, it's simple.
Simple things don't break. Our friends on the "let's use structured events
to make metrics" team streamed those events straight into a database, and it
broke all the time, because databases have configuration options and you
inevitably set those options wrong, and it'll fall over under heavy load,
and you won't find out until you're right in the middle of an emergency and
you really want to see those logs. Or events.
Of course, the file storage service we used was encrypted-at-rest,
heavily audited, and auto-deleted files after N days. When you're a
megacorporation, you have whole teams of people dedicated to making sure you
don't screw this up. They will find you. Best not to annoy them.
We had to add one extra feature, which was authentication. It's not
okay for random people on the Internet to be able to impersonate your
devices and spam your logs - at least without putting some work into it.
For device authentication, we used the rarely-used HTTP client-side
certificates option and the devcert program (linked above) so that the
client and server could mutually authenticate each other. The server didn't
check the certificates against a certification authority (CA), like web
clients usually do; instead, it had a database with a whitelist of exactly
which certs we're allowing today. So in case someone stole a device cert
and started screwing around, we could remove their cert from the whitelist
and not worry about CRL bugs and latencies and whatnot.
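Here's a sketch of that arrangement in Go: demand a client cert, skip CA
verification, and check the cert's fingerprint against your own whitelist.
The isWhitelisted lookup is hypothetical (ours was a database query).

    package receiver

    import (
        "crypto/sha256"
        "crypto/tls"
        "crypto/x509"
        "errors"
    )

    // isWhitelisted stands in for the database lookup of allowed
    // device-cert fingerprints. Hypothetical.
    func isWhitelisted(fingerprint [32]byte) bool { return false }

    func tlsConfig() *tls.Config {
        return &tls.Config{
            // Demand a client cert, but don't verify it against a CA
            // (devcerts are self-signed)...
            ClientAuth: tls.RequireAnyClientCert,
            // ...instead, allow only certs on today's whitelist.
            VerifyPeerCertificate: func(raw [][]byte, _ [][]*x509.Certificate) error {
                if len(raw) == 0 || !isWhitelisted(sha256.Sum256(raw[0])) {
                    return errors.New("unknown device certificate")
                }
                return nil
            },
        }
    }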
Unfortunately, because our log receiver was an internal app relying on
internal infrastructure, it wasn't open sourced. But there really wasn't
much there, honest. The first one was written in maybe 150 lines of python,
and the replacement was rewritten in slightly more lines of Go. No problem.
Retries and floods
Of course, things don't always go smoothly. If you're an ISP, the least
easy thing is dealing with cases where a whole neighbourhood gets
disconnected, either because of a power loss or because someone cut the
fiber Internet feed to the neighbourhood.
Now, disconnections are not such a big deal for logs processing - you don't
have any. But reconnection is a really big deal. Now you have tens or
hundreds of thousands of your devices coming back online at once, and a)
they have accumulated a lot more log messages than they usually do, since
they couldn't upload them, and b) they all want to talk to your server at
the same time. Uh oh.
Luckily, our system was designed carefully (uh... eventually it was), so it
could handle these situations pretty smoothly:
- The log uploader uses a backoff timer so that if it's been trying to
upload for a while, it uploads less often. (However, the backoff timer was
limited to no more than the usual inter-upload interval. I don't know why
more people don't do this. It's rather silly for your system to wait longer
between uploads in a failure situation than it would in a success situation.
This is especially true with logs, where when things come back online, you
want a status update now. And clearly your servers have enough capacity
to handle uploads at the usual rate, because they usually don't crash.
Sorry if I sound defensive here, but I had to have this argument a few times
with a few SREs. I understand why limiting the backoff period isn't always
the right move. It's the right move here.)
- Less obviously, even under normal conditions, the log uploader uses a
randomized interval between uploads. This avoids traffic spikes where,
after the Internet comes back online, everybody uploads again exactly 60
seconds later, and so on. (Both the capped backoff and the jitter are
sketched after this list.)
- The log upload client understands the idea that the server can't accept
its request right now. It has to, anyway, because if the Internet goes
down, there's no server. So it treats server errors exactly like it treats
lack of connectivity. And luckily, log uploading is not really an
"interactive" priority task, so it's okay to sacrifice latency when things
get bad. Users won't notice. And apparently our network is down, so the
admins already noticed.
- The /dev/kmsg buffer was configured for the longest reasonable outage
we could expect, so that it wouldn't overflow during "typical" downtime. Of
course, there's a judgement call here. But the truth is, if you're having
system-wide downtime, what the individual devices were doing during that
downtime is not usually what you care about. So you only need to handle,
say, the 90th percentile of downtime. Ignore the black
swans for once.
- The log receiver aggressively rejects requests that come faster than its
ability to write files to disk. Since the clients know how to retry with a
delay, this allows us to smooth out bursty traffic without needing to either
over-provision the servers or lose log messages.
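Here's the retry-timing sketch promised above: exponential backoff capped
at the normal upload interval, plus jitter so a neighbourhood full of
devices doesn't come back in lockstep. The constants are illustrative.

    package backoff

    import (
        "math/rand"
        "time"
    )

    const uploadInterval = 60 * time.Second

    // nextDelay picks the wait before the next upload attempt after
    // `failures` consecutive failures (0 = last upload succeeded).
    func nextDelay(failures int) time.Duration {
        if failures > 6 {
            failures = 6 // keep the shift from overflowing
        }
        d := time.Second << uint(failures) // 1s, 2s, 4s, ...
        if d > uploadInterval {
            // Never wait longer after a failure than after a success.
            d = uploadInterval
        }
        // Randomize +/-50% so the fleet doesn't synchronize.
        return d/2 + time.Duration(rand.Int63n(int64(d)))
    }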
(Pro tip: if you're
writing a log receiver in Go, don't do the obvious thing and fire off a
goroutine for every incoming request. You'll run out of memory. Define a
maximum number of threads you're willing to handle at once, and limit your
request handling to that. It's okay to set this value low, just to be safe:
remember, the uploader clients will come back later.)
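In Go specifically, the standard library already spawns a goroutine per
connection, so the trick is a counting semaphore that caps how many
requests you actually process; the rest get an immediate 503 and the
clients retry. A sketch, wrapping a handler like the one above (the limit
of 64 is arbitrary):

    package receiver

    import "net/http"

    var sem = make(chan struct{}, 64) // at most 64 uploads in flight

    // limited wraps a handler with a concurrency cap. When the
    // semaphore is full we reject immediately; the uploader clients
    // know how to come back later.
    func limited(h http.HandlerFunc) http.HandlerFunc {
        return func(w http.ResponseWriter, r *http.Request) {
            select {
            case sem <- struct{}{}:
                defer func() { <-sem }()
                h(w, r)
            default:
                http.Error(w, "busy", http.StatusServiceUnavailable)
            }
        }
    }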
Okay! Now our (unstructured) logs from all our 100,000 devices are sitting
safely in a big distributed filesystem. We have a little load-balanced,
multi-homed cluster of log receivers accepting the uploads, and they're so
simple that they should pretty much never die, and even if they do because
we did something dumb (treacherous, treacherous goroutines!), the
clients will try again.
What might not be obvious is this: our reliability, persistence, and scaling
problems are solved. Or rather, as long as we have enough log receiver
instances to handle all our devices, and enough disk quota to store all our
logs, we will never again lose a log message.
That means the rest of our pipeline can be best-effort, complicated, and
frequently exploding. And that's a good thing, because we're going to start
using more off-the-shelf stuff, we're going to let random developers
reconfigure the filtering rules, and we're not going to bother to configure
it with any redundancy.
Grinding the logs
The next step is to take our unstructured logs and try to understand them.
In other words, we want to add some structure. Basically we want to look
for lines that are "interesting" and parse out the "interesting" data and
produce a stream of events, each with a set of labels describing what
categories they apply to.
Note that, other than this phase, there is little difference between how
you'd design a structured event reporting pipeline and a log pipeline. You
still need to collect the events. You still (if you're like me) need to
persist your events across kernel panics. You still need to retry uploading
them if your network gets partitioned. You still need the receivers to
handle overloading, burstiness, and retries. You still would like to stream
them (if your infrastructure can handle it) rather than uploading every 60
seconds. You still want to be able to handle a high volume of them. You're
just uploading a structured blob instead of an unstructured blob.
Okay. Fine. If you want to upload structured blobs, go for it. It's just
an HTTP POST that appends to a file. Nobody's stopping you. Just please
try to follow my advice when designing the parts of the pipeline before and
after this phase, because otherwise I guarantee you'll be sad eventually.
Anyway, if you're staying with me, now we have to parse our unstructured
logs. What's really cool - what makes this a killer design compared to
starting with structured events in the first place - is that we can, at any
time, change our minds about how to parse the logs, without redeploying
all the software that produces them.
This turns out to be amazingly handy. It's so amazingly handy that nobody
believes me. Even I didn't believe me until I experienced it; I was sure,
in the beginning, that the unstructured logs were only temporary and we'd
initially use them to figure out what structured events we wanted to record,
and then modify the software to send those, then phase out the logs over
time. This never happened. We never settled down. Every week, or at least
every month, there was some new problem which the existing "structured"
events weren't configured to catch, but which, upon investigating, we
realized we could diagnose and measure from the existing log message stream.
And so we did!
Now, I have to put this in perspective. Someone probably told you that log
messages are too slow, or too big, or too hard to read, or too hard to use,
or you should use them while debugging and then delete them. All those
people were living in the past and they didn't have our fancy log
pipeline. Computers are really, really fast now. Storage is really, really
cheap.
So we let it all out. Our devices produced an average of 50 MB of
(uncompressed) logs per day, each. For the baseline 100,000 devices that we
discussed above, that's about 5TB of logs per day. Ignoring compression,
how much does it cost to store, say, 60 days of logs in S3 at 5TB per day?
"Who cares," that's how much. You're amortizing it over 100,000 devices.
Heck, a lot of those devices were DVRs, each with 2TB of storage. With
100,000 DVRs, that's 200,000 TB of storage. Another 300TB of logs is
literally a rounding error (like, smaller than the difference between
whether it's really 2TB or 2TiB per DVR).
Our systems barfed up logs vigorously and continuously, like a non-drunken
non-sailor with seasickness. And it was beautiful.
(By the way, now would be a good time to mention some things we didn't
log: personally identifiable information or information about people's
Internet usage habits. These were diagnostic logs for running the network
and detecting hardware/software failures. We didn't track what you did with
the network. That was an intentional decision from day 1.)
(Also, this is why I think all those log processing services are so badly
overpriced. I wanna store 50 MB per device, for lots of devices. I need to
pay S3 rates for that, not a million dollars a gigabyte. If I have to
overpay for storage, I'll have to start writing fewer logs. I love my logs. I
need my logs. I know you're just storing them in S3 anyway. Let's be
honest.)
But the grinding, though
Oh right. So the big box labeled "Grinder" in my diagram was, in fact, just
one single virtual machine, for a long time. It lasted like that for much
longer than we expected.
Whoa, how is that possible, you ask?
Well, at 5TB per day per 100,000 devices, that's an average of 57 MBytes per
second. And remember, burstiness has already been absorbed by our carefully
written log receivers and clients, so we'll just grind these logs as fast as
they arrive or as fast as we can, and if there are fluctuations, they'll
average out. Admittedly, some parts of the day are busier than others.
Let's say 80 MBytes per second at peak.
80 MBytes per second? My laptop can do that on its spinning disk. I don't
even need an SSD! 80 MBytes per second is a toy.
And of course, it's not just one spinning disk. The data itself is stored
on some fancy heavily-engineered distributed filesystem that I didn't have
to design. Assuming there are no, er, collossal, failures in provisioning
(no comment), there's no reason we shouldn't be able to read files at a rate
that saturates the network interface available to our machine. Surely
that's at least 10 Gbps (~1 GByte/sec) nowadays, which is 12.5 of those.
1.25 million devices, all processed by a single grinder.
Of course you'll probably need to use a few CPU cores. And the more work
you do per log entry, the slower it'll get. But these estimates aren't too
far off what we could handle.
And yeah, sometimes that VM gets randomly killed by the cluster's Star
Trek-esque hive mind for no reason. It doesn't matter, because the input
data was already persisted by the log receivers. Just start a new grinder
and pick up where you left off. You'll have to be able to handle process
restarts no matter what. And that's a lot easier than trying to make a
distributed system you didn't need.
As for what the grinder actually does? Anything you want. But it's
basically the "map" phase in a mapreduce. It reads the data in one side,
does some stuff to it, and writes out postprocessed stuff on the other side.
Use your imagination. And if you want to write more kinds of mappers, you
can run them, either alongside the original Grinder or downstream from it.
Our Grinder mostly just ran regexes and put out structures (technically
protobufs) that were basically sets of key-value pairs.
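As a sketch of that inner loop: regexes with named capture groups, each
match becoming a set of key-value pairs. We emitted protobufs; a plain map
stands in here, and the patterns are invented.

    package grinder

    import "regexp"

    var patterns = []*regexp.Regexp{
        // eg. "mydvr: play requested id=12345"
        regexp.MustCompile(`(?P<app>\w+): play requested id=(?P<video_id>\d+)`),
        // eg. "Out of memory: Kill process 123 (flaky) ..."
        regexp.MustCompile(`Out of memory: Kill process (?P<pid>\d+) \((?P<proc>\w+)\)`),
    }

    // grind turns one unstructured log line into zero or more structured
    // events. Changing what we extract means editing this table, not
    // redeploying 100,000 devices.
    func grind(line string) []map[string]string {
        var events []map[string]string
        for _, p := range patterns {
            m := p.FindStringSubmatch(line)
            if m == nil {
                continue
            }
            ev := map[string]string{}
            for i, name := range p.SubexpNames() {
                if name != "" {
                    ev[name] = m[i]
                }
            }
            events = append(events, ev)
        }
        return events
    }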
(For some reason, when I search the Internet for "streaming mapreduce," I
don't get programs that do this real-time processing of lots of files as
they get written. Instead, I seem to get batch-oriented mapreduce clones
that happen to read from stdin, which is a stream. I guess. But... well,
now you've wasted some perfectly good words that could have meant
something. So okay, too bad, it's a Grinder. Sue me.)
Reducers and Indexers
Once you have a bunch of structured events... well, I'm not going to explain
that in a lot of detail, because it's been written about a lot.
You probably want to aggregate them a bit - eg. to count up reboots across
multiple devices, rather than storing each event for each device separately
- and dump them into a time-series database. Perhaps you want to save and
postprocess the results in a monitoring system named after Queen Elizabeth
or her pet.
Whatever. Plug in your favourite.
What you probably think you want to do, but it turns out you rarely need, is
full-text indexing. People just don't grep the logs across 100,000 devices
all that often. I mean, it's kinda nice to have. But it doesn't have to be
instantaneous. You can plug in your favourite full text indexer if you
like. But most of the time, just an occasional big parallel grep (perhaps
using your favourite mapreduce clone or something more modern... or possibly
just using grep) of a subset of the logs is sufficient.
(If you don't have too many devices, even a serial grep can be fine.
Remember, a decent cloud computer should be able to read through ~1
GByte/sec, no problem. How much are you paying for someone to run some
bloaty full-text indexer on all your logs, to save a few milliseconds per
search?)
I mean, run a full text indexer if you want. The files are right there.
Don't let me stop you.
On the other hand, being able to retrieve the exact series of logs - let's
call it the "narrative" - from a particular time period across a subset of
devices turns out to be super useful. A mini-indexer that just remembers
which logs from which devices ended up in which files at which offsets is
nice to have. Someone else on our team built one of those eventually (once
we grew so much that our parallel grep started taking minutes instead of
seconds), and it was very nice.
And then you can build your dashboards
Once you've reduced, aggregated, and indexed your events into your favourite
output files and databases, you can read those databases to build very
fast-running dashboards. They're fast because the data has been
preprocessed in mostly-real time.
As I mentioned above, we had our pipeline reading the input files as fast as
they could come in, so the receive+grind+reduce+index phase only took a few
tens of milliseconds. If your pipeline isn't that fast, ask somebody why.
I bet their program is written in java and/or has a lot of sleep() calls.
Again here, I'm not going to recommend a dashboard tool. There are
millions of articles and blog posts about that. Pick one, or many.
Please, please, steal these ideas. Make your log and event processing as
stable as our small team made our log processing. Don't fight over
structured vs unstructured; if you can't agree, just log them both.
Don't put up with weird lags and limits in your infrastructure. We made
50MB/day/device work for a lot of devices, and real-time mapreduced them
all on a single VM. If we can do that, then you can make it work for a few
hundreds, or a few thousands, of container instances. Don't let anyone tell
you you can't. Do the math: of course you can.
Eventually our team's log processing infrastructure evolved to become the
primary monitoring and alerting infrastructure for our ISP. Rather than
alerting on behaviour of individual core routers, it turned out that the
end-to-end behaviour observed by devices in the field was a better way to
detect virtually any problem. Alert on symptoms, not causes, as the SREs
like to say. Who has the symptoms? End users.
We had our devices ping different internal servers periodically and log the
round trip times; in aggregate, we had an amazing view of overloading,
packet loss, bufferbloat, and poor backbone routing decisions, across the
entire fleet, across every port of every switch. We could tell which was
better, IPv4 or IPv6. (It's always IPv4. Almost everyone spends more time
optimizing their IPv4 routes and peering. Sorry, but it's true.)
We detected some weird configuration problems with the DNS servers in one
city by comparing the 90th percentile latency of DNS lookups across all the
devices in every city.
We diagnosed a manufacturing defect in a particular batch of devices, just
based on their CPU temperature curves and fan speeds.
We worked with our CPU vendor to find and work around a bug in their cache
coherency, because we spotted a kernel panic that would happen randomly
every 10,000 CPU-hours; but across 100,000 devices, that's still 10 times
per hour of potential clues.
...and it sure was good for detecting power failures.
Anyway. Log more stuff. Collect those logs. Let it flow. Trust me.