night has quit [Read error: Connection reset by peer]
NightA has joined #nixos-on-your-router
nwspk has quit [Quit: nwspk]
nwspk has joined #nixos-on-your-router
teto has joined #nixos-on-your-router
<andi->
hpfr: what you can do is just restrict the SSH ciphers to very modern ones that you will have but that ancient CentOS-based attacker will not have for the next 10y ;)
<andi->
IIRC the NixOS defaults are already pretty good in that regard.
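In NixOS terms that restriction is a handful of option settings; a minimal sketch, with the port and algorithm lists purely illustrative rather than a vetted recommendation:

    services.openssh = {
      enable = true;
      ports = [ 22422 ];  # hypothetical non-default port
      # only modern algorithms; an ancient client simply can't negotiate a session
      ciphers = [ "chacha20-poly1305@openssh.com" "aes256-gcm@openssh.com" ];
      kexAlgorithms = [ "curve25519-sha256@libssh.org" ];
      macs = [ "hmac-sha2-512-etm@openssh.com" ];
    };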
<hpfr>
yeah, I'm all green on ssh-audit. I also moved to a different port as aranea suggested. the sshd journal appears to be clean so far haha
<hpfr>
unfortunately, slack's nebula mesh VPN seems unable to handle my CGNAT, which was the main purpose of this VPS charade
<andi->
What I am still looking for is having log (journal) excerpts in my prometheus alerting. Has anyone done something like that yet?
<andi->
hpfr: tell them to use IPv6? ;)
<q3k>
andi-: how do you want to tie alerts to log lines even?
<hpfr>
I don't know what I should have expected from the ipv6 guy :D
<hpfr>
if only
<q3k>
andi-: do you generate metrics from logs?
<andi->
q3k: not like that, I know e.g. that if systemd service XYZ fails, I want the output of said job's journal.
<q3k>
oh
<q3k>
hm
<andi->
I know that if my disk runs full I'd like a graph of the last ~12h or whatever.
<q3k>
i would generally just add context to my alert definitions
<q3k>
but mostly as links to other systems
<andi->
yeah
<q3k>
where whoever handles the alert can click
<q3k>
like the typical way for me is to first redirect to an oncall runbook
<andi->
(whoever = always me) :D
<q3k>
and then the runbook actually contains links
<q3k>
and reminders on what this alert means
<q3k>
and where to look
<q3k>
also my current personal infra alerting setup actually uses grafana alerts
<q3k>
so that gives me some linking to a graph
<q3k>
but the one that caused the alert, not necessarily the one you should be looking at
<q3k>
anyway, good question, no idea
<q3k>
if you're doing centralized log collection, you can attach a URL to a filtered view for the suspect service, i guess
<q3k>
but yeah if you want to automatically ship these off, or even get them on demand - no clue
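A sketch of that link idea as a Prometheus alerting rule, via the NixOS services.prometheus.rules option (rule bodies are plain Prometheus 2 YAML); the log-viewer URL is made up, and the metric comes from node_exporter's systemd collector:

    services.prometheus.rules = [ ''
      groups:
        - name: units
          rules:
            - alert: SystemdUnitFailed
              expr: node_systemd_unit_state{state="failed"} == 1
              for: 5m
              annotations:
                # hypothetical log viewer; {{ $labels.name }} expands to the unit name
                logs: https://logs.example.com/?unit={{ $labels.name }}
    '' ];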
<andi->
I am currently thinking of just writing the glue code in Nix + python/rust/bash/… to make it happen. I must look into adding more context to prometheus alerts. It surely is possible, even if I have to proxy the entire alertmanager API.
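One shape that glue could take, as a sketch only: a NixOS timer that polls the Alertmanager v2 API and dumps a journal excerpt per firing alert. The "unit" label and the output path are assumptions, not anything either of them runs:

    systemd.services.alert-journal-glue = {
      startAt = "*:0/5";  # every five minutes
      path = [ pkgs.curl pkgs.jq ];
      script = ''
        mkdir -p /var/lib/alert-logs
        # list firing alerts, pull out a (hypothetical) "unit" label
        curl -s http://localhost:9093/api/v2/alerts \
          | jq -r '.[].labels.unit // empty' \
          | while read -r unit; do
              journalctl -u "$unit" -n 50 --no-pager > "/var/lib/alert-logs/$unit.log"
            done
      '';
    };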
<q3k>
this also ties into my philosophy of limiting how much you alert on causes
<q3k>
while things like disk space are an unfortunate reality and you should probably log that
<q3k>
s,log,monitor,
<q3k>
i try to not monitor just a systemd service being down
<q3k>
and instead have alerts like 'high request error rate', which might mean anything from service down to service bug or just networking issues
<q3k>
and in that case there's a much less clear direct culprit
<q3k>
so a runbook that fans you out into multiple places to look for logs, etc is then a much more useful approach anyway
<q3k>
i think that's why i never personally really felt the need to attach particular logs to a particular alert
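For reference, a symptom-style rule of the kind q3k describes, in the same NixOS form; the metric name follows the usual http_requests_total convention and the 5% threshold is arbitrary:

    services.prometheus.rules = [ ''
      groups:
        - name: symptoms
          rules:
            - alert: HighRequestErrorRate
              expr: |
                sum(rate(http_requests_total{code=~"5.."}[5m]))
                  / sum(rate(http_requests_total[5m])) > 0.05
              for: 10m
              annotations:
                # hypothetical runbook location
                runbook: https://wiki.example.com/runbooks/high-error-rate
    '' ];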
<andi->
Well I've got lots of systemd timers that are supposed to work, and if those fail I need to be alerted. Those might not produce any measurable "side effects" on my side.
<q3k>
i would instead try to alert on 'last time batch X ran'
<q3k>
since you usually wouldn't care about when exactly it runs
<q3k>
but whether it actually finished the thing it was supposed to do
<q3k>
that also catches more failures than something that systemd can pick up
<q3k>
then your runbook would say something like 'check if systemd actually ran the thing, and check its logs'
<q3k>
for instance, this is a good approach for backup jobs or statistics runs
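A common way to get that "last time it ran" signal is node_exporter's textfile collector: the job records its own finish time as a metric. A sketch, with the job body and paths as placeholders:

    systemd.services.backup = {
      startAt = "daily";
      script = ''
        run-the-backup        # placeholder for the actual work
        # only reached on success, so the timestamp means "last good run"
        mkdir -p /var/lib/prometheus-textfiles
        echo "batch_last_success_timestamp_seconds{job=\"backup\"} $(date +%s)" \
          > /var/lib/prometheus-textfiles/backup.prom
      '';
    };

    services.prometheus.exporters.node = {
      enable = true;
      enabledCollectors = [ "systemd" "textfile" ];
      extraFlags = [ "--collector.textfile.directory=/var/lib/prometheus-textfiles" ];
    };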
<andi->
I do have both. I check the backup target to see if things have been uploaded (unix timestamp of files not > 24h etc..) but when I'm travelling/not at the computer/sick of this shit I'd like to know if it is worth running to the computer to fix it or if it'll just fix itself within another day or so.. That is where the logs idea comes from.
<andi->
In a corporate setting I'd probably do it as you said, with a properly documented procedure, but I do not want to touch or fiddle with any of the systems unless I have to. I'd rather write more code that does that for me.
<q3k>
ah, i see
<q3k>
hm
<q3k>
yeah, those are not really incentives that i've worked with, so this might make sense for your scenario
<q3k>
the only immediate recommendation i have would be to make a non-paging alert at 24h and a paging one at 72h :P
<q3k>
but yeah, not here to fix that problem, sorry
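That two-threshold suggestion, expressed against the timestamp metric from the batch-job sketch above; the severity label values are assumptions and only mean something once Alertmanager routing is set up to match them:

    services.prometheus.rules = [ ''
      groups:
        - name: batch
          rules:
            - alert: BackupStale
              expr: time() - batch_last_success_timestamp_seconds{job="backup"} > 24 * 3600
              labels:
                severity: warning   # non-paging
            - alert: BackupMissedRepeatedly
              expr: time() - batch_last_success_timestamp_seconds{job="backup"} > 72 * 3600
              labels:
                severity: page
    '' ];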
<andi->
Didn't expect anyone to have that fixed, just throwing around ideas :)
<q3k>
no, what i meant is i'm trying to approach this from an organizational perspective, while you're looking for some technical solution
<q3k>
i have this high-horse ivory-tower tendency to see everything as an XY problem caused by organizational issues :P
<andi->
I see.
<andi->
I don't really like runbooks or huge docs if you can mostly automate that stuff. e.g. right now I'm playing with some AWS resources and I've set up a cronjob to just nuke all resources on the account at 2am every day. I'd really like that to succeed, as otherwise at some point I'll be billed stupid amounts of real-world coins.
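That cronjob maps naturally onto a NixOS timer; a sketch assuming the aws-nuke tool is what does the tearing down (the package attribute and config path here are assumptions):

    systemd.services.aws-cleanup = {
      startAt = "02:00";  # nightly, as described
      script = ''
        # tear down everything in the account; aws-nuke only acts with an
        # explicit config and the --no-dry-run flag
        ${pkgs.aws-nuke}/bin/aws-nuke --config /etc/aws-nuke.yml --no-dry-run --force
      '';
    };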
teto has quit [Quit: WeeChat 2.9]
teto has joined #nixos-on-your-router
<hexa->
well, I have loki running, but the alerting story of loki is not quite there yet
<hexa->
and I don't see a possibility to query stuff from within alertmanager
<hexa->
but that would be exactly what's most interesting about alerts: some early context