Mix two parts achievement and two parts despair
We use Munin at work to monitor the health of various machines on the network. It works well, is reliable and is easy to configure. I decided I’d try it out at home.
I installed the server, which was easy, and put the client on another machine, which was also easy. So far so easy. I then went away and left it.
By yesterday afternoon it was happily producing pretty graphs. I was pleased and decided to add more machines to the configuration. This I did. Then when I got home from work and looked at my graphs I saw … nothing. The graphs had gone. Munin had decided to stop drawing them. It was still gathering data from the machines; I could see the RRD files changing over time. But it refused to update the graphs.
My investigation into this problem was cut short by a minor LDAP emergency. Every so often (that is to say quite often) when I would start a shell, either over SSH or in a terminal, on a particular machine, the shell would come up with a bogus PATH and would not act on any commands typed. A little digging revealed that each failed shell exactly coincided with a line appearing in the slapd log complaining about error 4: size limit exceeded.
I tried various things to solve the problem. I tweaked and reindexed slapd’s indices. This was a good thing to do although it didn’t help. I tried pointing nss_ldap to another LDAP master. This didn’t work either. I tried downgrading OpenLDAP; I was already running the latest release on that machine but I hadn’t seen the error 4 messages on my other masters running older versions. This also didn’t help.
Then I remembered that I had recently upgraded the C library and recompiled a bunch of stuff. I tried compiling nss_ldap again. That didn’t help. Finally I reverted nss_ldap to the version compiled against my old C library. I’d upgraded it previously because I came across a problem where a statically-linked application would segfault when querying LDAP for netgroups (but not users or groups or other stuff). Of course this upgrade hadn’t solved that problem. The older nss_ldap was the key. My weird shell issue went away immediately.
At first I was all pleased with myself for solving a problem. It was one of those tricky ones where you have to do some digging and try a few off the wall ideas in order to succeed. Solving those problems is always satisfying.
Then I remembered that I had created the problem myself. I’d spent my whole evening battling with an issue of my own creating and made no further progress on my pretty graphs. Which, lest we forget, were at one time working.
Grrr.
I solved the munin graphs problem. Somehow /var/run/munin had become owned by root and the munin user couldn’t create lock files in there. So it gave up.
And when I say "somehow" I of course mean "due to my meddling with a separate script".
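A quick ownership check would probably have caught this sooner. What follows is only a sketch: the function name check_lock_dir is made up for illustration, and I am assuming the account is literally called "munin". It only compares the directory's owner with an expected uid; it does not test group or mode bits.

```python
import os
import pwd


def check_lock_dir(path, expected_uid):
    """Return a list of problems with a lock directory (empty if none found).

    Hypothetical helper: it only checks that the directory exists and that
    its owner matches expected_uid.
    """
    try:
        st = os.stat(path)
    except FileNotFoundError:
        return [f"{path} does not exist"]

    problems = []
    if st.st_uid != expected_uid:
        problems.append(
            f"{path} is owned by uid {st.st_uid}, expected uid {expected_uid}"
        )
    return problems


if __name__ == "__main__":
    try:
        # Assumption: the monitoring account is named "munin".
        munin_uid = pwd.getpwnam("munin").pw_uid
    except KeyError:
        print("no user named 'munin' on this machine")
    else:
        for problem in check_lock_dir("/var/run/munin", munin_uid):
            print(problem)
```

Run from cron or by hand after meddling with anything under /var/run, it prints nothing when the ownership is as expected.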
Comment by iain — 2007-10-25 @ 16:48:22