night has quit [Read error: Connection reset by peer]
NightA has joined #nixos-on-your-router
nwspk has quit [Quit: nwspk]
nwspk has joined #nixos-on-your-router
teto has joined #nixos-on-your-router
<andi->
hpfr: what you can do is just restrict the SSH ciphers to very modern ones that you will have but that ancient CentOS-based attacker will not have for the next 10y ;)
<andi->
IIRC the NixOS defaults are already pretty good in that regard.
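In NixOS terms that restriction is a handful of option settings; a minimal sketch, with the port and algorithm lists purely illustrative rather than a vetted recommendation:

    services.openssh = {
      enable = true;
      ports = [ 22422 ];  # hypothetical non-default port
      # only modern algorithms; an ancient client simply can't negotiate a session
      ciphers = [ "chacha20-poly1305@openssh.com" "aes256-gcm@openssh.com" ];
      kexAlgorithms = [ "curve25519-sha256@libssh.org" ];
      macs = [ "hmac-sha2-512-etm@openssh.com" ];
    };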
<hpfr>
yeah, I'm all green on ssh-audit. I also moved to a different port as aranea suggested. the sshd journal appears to be clean so far haha
<hpfr>
unfortunately, slack's nebula mesh VPN seems unable to handle my CGNAT, which was the main purpose of this VPS charade
<andi->
What I am still looking for is having log (journal) excerpts in my prometheus alerting. Has anyone done something like that yet?
<andi->
hpfr: tell them to use IPv6? ;)
<q3k>
andi-: how do you want to tie alerts to log lines even?
<hpfr>
I don't know what I should have expected from the ipv6 guy :D
<hpfr>
if only
<q3k>
andi-: do you generate metrics from logs?
<andi->
q3k: not like that, I know e.g. that if systemd service XYZ fails, I want the output of said job's journal.
<q3k>
oh
<q3k>
hm
<andi->
I know that if my disk runs full I'd like a graph of the last ~12h or whatever.
<q3k>
i would generally just add context to my alert definitions
<q3k>
but mostly as links to other systems
<andi->
yeah
<q3k>
where whoever handles the alert can click
<q3k>
like the typical way for me is to first redirect to an oncall runbook
<andi->
(whoever = always me) :D
<q3k>
and then the runbook actually contains links
<q3k>
and reminders on what this alert means
<q3k>
and where to look
<q3k>
also my current personal infra alerting setup actually uses grafana alerts
<q3k>
so that gives me some linking to a graph
<q3k>
but the one that caused the alert, not necessarily the one you should be looking at
<q3k>
anyway, good question, no idea
<q3k>
if you're doing centralized log collection, you can attach a URL to a filtered view for the suspect service, i guess
<q3k>
but yeah if you want to automatically ship these off, or even get them on demand - no clue
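A sketch of that link idea as a Prometheus alerting rule, via the NixOS services.prometheus.rules option (rule bodies are plain Prometheus 2 YAML); the log-viewer URL is made up, and the metric comes from node_exporter's systemd collector:

    services.prometheus.rules = [ ''
      groups:
        - name: units
          rules:
            - alert: SystemdUnitFailed
              expr: node_systemd_unit_state{state="failed"} == 1
              for: 5m
              annotations:
                # hypothetical log viewer; {{ $labels.name }} expands to the unit name
                logs: https://logs.example.com/?unit={{ $labels.name }}
    '' ];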
<andi->
I am currently thinking of just writing the glue code in Nix + python/rust/bash/… to make it happen. I must look into adding more context to prometheus alerts. It surely is possible, even if I have to proxy the entire alertmanager API.
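One shape that glue could take, as a sketch only: a NixOS timer that polls the Alertmanager v2 API and dumps a journal excerpt per firing alert. The "unit" label and the output path are assumptions, not anything either of them runs:

    systemd.services.alert-journal-glue = {
      startAt = "*:0/5";  # every five minutes
      path = [ pkgs.curl pkgs.jq ];
      script = ''
        mkdir -p /var/lib/alert-logs
        # list firing alerts, pull out a (hypothetical) "unit" label
        curl -s http://localhost:9093/api/v2/alerts \
          | jq -r '.[].labels.unit // empty' \
          | while read -r unit; do
              journalctl -u "$unit" -n 50 --no-pager > "/var/lib/alert-logs/$unit.log"
            done
      '';
    };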
<q3k>
this also ties into my philosophy of limiting how much you alert on causes
<q3k>
while things like disk space are an unfortunate reality and you should probably log that
<q3k>
s,log,monitor,
<q3k>
i try to not monitor just a systemd service being down
<q3k>
and instead have alerts like 'high request error rate', which might mean anything from service down to service bug or just networking issues
<q3k>
and in that case there's a much less clear direct culprit
<q3k>
so a runbook that fans you out into multiple places to look for logs, etc is then a much more useful approach anyway
<q3k>
i think that's why i never personally really felt the need to attach particular logs to a particular alert
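For reference, a symptom-style rule of the kind q3k describes, in the same NixOS form; the metric name follows the usual http_requests_total convention and the 5% threshold is arbitrary:

    services.prometheus.rules = [ ''
      groups:
        - name: symptoms
          rules:
            - alert: HighRequestErrorRate
              expr: |
                sum(rate(http_requests_total{code=~"5.."}[5m]))
                  / sum(rate(http_requests_total[5m])) > 0.05
              for: 10m
              annotations:
                # hypothetical runbook location
                runbook: https://wiki.example.com/runbooks/high-error-rate
    '' ];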
<andi->
Well I've got lots of systemd timers that are supposed to work, and if those fail I need to be alerted. Those might not produce any measurable "side effects" on my side.
<q3k>
i would instead try to alert on 'last time batch X ran'
<q3k>
since you usually wouldn't care about when exactly it runs
<q3k>
but whether it actually finished the thing it was supposed to do
<q3k>
that also catches more failures than something that systemd can pick up
<q3k>
then your runbook would say something like 'check if systemd actually ran the thing, and check its logs'
<q3k>
for instance, this is a good approach for backup jobs or statistics runs
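A common way to get that "last time it ran" signal is node_exporter's textfile collector: the job records its own finish time as a metric. A sketch, with the job body and paths as placeholders:

    systemd.services.backup = {
      startAt = "daily";
      script = ''
        run-the-backup        # placeholder for the actual work
        # only reached on success, so the timestamp means "last good run"
        mkdir -p /var/lib/prometheus-textfiles
        echo "batch_last_success_timestamp_seconds{job=\"backup\"} $(date +%s)" \
          > /var/lib/prometheus-textfiles/backup.prom
      '';
    };

    services.prometheus.exporters.node = {
      enable = true;
      enabledCollectors = [ "systemd" "textfile" ];
      extraFlags = [ "--collector.textfile.directory=/var/lib/prometheus-textfiles" ];
    };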
<andi->
I do have both. I check the backup target to see if things have been uploaded (unix timestamp of files not > 24h etc..) but when I'm travelling/not at the computer/sick of this shit I'd like to know if it is worth running to the computer to fix it or if it'll just fix itself within another day or so.. That is where the logs idea comes from.
<andi->
In a corporate setting I'd probably do it as you said, with a properly documented procedure, but I do not want to touch or fiddle with any of the systems unless I have to. I'd rather write more code that does that for me.
<q3k>
ah, i see
<q3k>
hm
<q3k>
yeah, those are not really incentives that i've worked with, so this might make sense for your scenario
<q3k>
the only immediate recommendation i have would be to make a non-paging alert at 24h and a paging one at 72h :P
<q3k>
but yeah, not here to fix that problem, sorry
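That two-threshold suggestion, expressed against the timestamp metric from the batch-job sketch above; the severity label values are assumptions and only mean something once Alertmanager routing is set up to match them:

    services.prometheus.rules = [ ''
      groups:
        - name: batch
          rules:
            - alert: BackupStale
              expr: time() - batch_last_success_timestamp_seconds{job="backup"} > 24 * 3600
              labels:
                severity: warning   # non-paging
            - alert: BackupMissedRepeatedly
              expr: time() - batch_last_success_timestamp_seconds{job="backup"} > 72 * 3600
              labels:
                severity: page
    '' ];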
<andi->
Didn't expect anyone to have that fixed, just throwing around ideas :)
<q3k>
no, what i meant is i'm trying to approach this from an organizational perspective, while you're looking for some technical solution
<q3k>
i have this high-horse ivory-tower tendency to see everything as an XY problem caused by organizational issues :P
<andi->
I see.
<andi->
I don't really like runbooks or huge docs if you can mostly automate that stuff. e.g. right now I'm playing with some AWS resources and I've set up a cronjob to just nuke all resources on the account at 2am every day. I'd really like that to succeed, as otherwise at some point I'll be billed stupid amounts of real-world coins.
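That cronjob maps naturally onto a NixOS timer; a sketch assuming the aws-nuke tool is what does the tearing down (the package attribute and config path here are assumptions):

    systemd.services.aws-cleanup = {
      startAt = "02:00";  # nightly, as described
      script = ''
        # tear down everything in the account; aws-nuke only acts with an
        # explicit config and the --no-dry-run flag
        ${pkgs.aws-nuke}/bin/aws-nuke --config /etc/aws-nuke.yml --no-dry-run --force
      '';
    };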
teto has quit [Quit: WeeChat 2.9]
teto has joined #nixos-on-your-router
<hexa->
well, I have loki running, but the alerting story of loki is not quite there yet
<hexa->
and I don't see a possibility to query stuff from within alertmanager
<hexa->
but that would be exactly what's most interesting about alerts: some early context