<LnL>
do you think it would be ok to query hydra directly for that or probably not?
<gchristensen>
I'd rather not
<gchristensen>
this would be a good use case for that database export, or for being able to subscribe to build notifications
<LnL>
yeah figured it's not a great idea, especially since it should check multiple historical builds
<LnL>
do you have an idea on how to expose / access that?
<gchristensen>
the events?
<gchristensen>
or the database dump
<LnL>
I think the database would probably be better for this
<gchristensen>
so I actually have the data already
<gchristensen>
it gets backed up to my machine every 5 minutes
<LnL>
the safest first step I'm thinking of is checking if any build in the last x time has succeeded
<LnL>
which is kind of hard to do with events
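A minimal sketch of the check LnL describes, assuming Hydra's postgres schema from memory (a builds table where finished = 1 plus buildstatus = 0 marks a successful build, and stoptime is a unix epoch); the project/jobset/job names are hypothetical and column names may differ per Hydra version:

```sh
# Has any build of this job succeeded in the last 30 days?
# Schema assumptions: builds(project, jobset, job, finished,
# buildstatus, stoptime) with buildstatus = 0 meaning success.
psql hydra -tA -c "
  SELECT count(*) > 0
  FROM builds
  WHERE project = 'nixpkgs'
    AND jobset  = 'trunk'
    AND job     = 'hello.x86_64-linux'
    AND finished = 1
    AND buildstatus = 0
    AND stoptime > extract(epoch FROM now() - interval '30 days');"
```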
<gchristensen>
yeah
<gchristensen>
but you could have a timeseries database
<LnL>
notifications of breakages are a different thing
<gchristensen>
the thing I want to do next is load the database and do a .sql dump on a daily or weekly basis. this would validate the backup was good. secondary effect is letting other people access the .sql dump for queries like this :)
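A sketch of that load-and-dump cycle, assuming the backup is a postgres data directory; if pg_dump completes end to end, the backup restored cleanly. Paths and the bucket name are hypothetical:

```sh
# Start a scratch postgres on the backed-up data directory, take a
# plain-SQL dump, stop it, and publish the dump.
pg_ctl -D /backups/hydra/pgdata -w start
pg_dump hydra > "/srv/dumps/hydra-$(date +%F).sql"
pg_ctl -D /backups/hydra/pgdata -w stop
aws s3 cp "/srv/dumps/hydra-$(date +%F).sql" s3://example-hydra-dumps/
```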
<gchristensen>
(by events I mean like a rabbitmq or 0mq or whatever mechanism of publishing parsable event data)
<LnL>
right
<gchristensen>
I suppose if I'm loading and dumping the sql, it would not be a far throw to be able to have a list of queries executed and publish those query results too
<LnL>
ah, so you're thinking more of publishing events periodically going over all packages?
<gchristensen>
sorry, 2 different ideas :)
<gchristensen>
the build event data is just purely: here is a firehose of events, do whatever you want, anybody can subscribe. good luck and godspeed.
<gchristensen>
the second idea is if we do it as a batch operation at the same time as validating the database backup
<gchristensen>
since I'm loading the database and doing a pg_dump anyway, might as well execute some list of queries at the same time and publish their results in the same place the pg_dump data goes
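One way that batch step could look, reusing the scratch database from the dump sketch above; the reports directory and bucket are hypothetical:

```sh
# Run every canned query and publish its CSV output next to the
# pg_dump output. psql --csv needs postgres 12+; on older versions
# use -A -F, instead.
for q in /etc/hydra-reports/*.sql; do
  name=$(basename "$q" .sql)
  psql hydra --csv -f "$q" > "/srv/dumps/$name.csv"
  aws s3 cp "/srv/dumps/$name.csv" "s3://example-hydra-dumps/reports/$name.csv"
done
```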
<LnL>
makes sense
<LnL>
so what you'd want is a small daemon that just runs a bunch of queries over jobs, etc. and publishes events for those?
<gchristensen>
ah, by publishing I mostly mean just, like, `aws s3 cp` to a bucket :) not really an event exactly
<LnL>
hmm, a bit confused now
<gchristensen>
sorry :/
<gchristensen>
overloaded words
<LnL>
making food first, then I'll make a diagram of what I was thinking
<gchristensen>
cool
<srk>
have you seen the fedmsg infrastructure for Fedora?
* srk
likes the idea of sql dumps, a front-facing / staging server for testing queries and so on would be even better
<gchristensen>
I've heard of fedmsg
<gchristensen>
but I don't remember how it works
<gchristensen>
we'd need to be able to send a fairly high number of events
<srk>
not sure either, it's some messaging system, maybe even AMQP
<srk>
hmm, zmq is like building blocks for queues and co, it handles a bunch of low-level stuff for you but it's not a fully-fledged message queue by itself
<gchristensen>
this is beautiful LnL
<gchristensen>
how did you make it?
<LnL>
:p
<srk>
Postgres Mirror <3
<LnL>
but does it make any sense?
<LnL>
omnigraffle
<gchristensen>
nice, I love omnigraffle. I have a macos VM just for omnigraffle
<gchristensen>
LnL: I think this makes sense, but let me suggest a few edits
<srk>
so it's AMQP now..
<gchristensen>
Hydra sends me a ZFS filesystem diff every 5min, so on my system I'd take the current state of the filesystem, start postgresql, and make a dump from that
<gchristensen>
the Selector would then operate on the same postgres server which the dump is made from
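A sketch of that receive-side flow, assuming the replicated data lands in a ZFS dataset; dataset names and mountpoints are hypothetical:

```sh
# Clone the newest received snapshot and run postgres off the clone;
# both the pg_dump and the Selector's queries hit this instance.
snap=$(zfs list -t snapshot -o name -s creation tank/hydra-db | tail -n1)
zfs clone "$snap" tank/hydra-scratch
pg_ctl -D /tank/hydra-scratch/pgdata -w start
# ... pg_dump + Selector queries run here ...
pg_ctl -D /tank/hydra-scratch/pgdata -w stop
zfs destroy tank/hydra-scratch
```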
<srk>
ZFS diff replication! mad
* srk
was wondering how you could do that every 5 min
<srk>
is there a backup hydra? :D
<gchristensen>
nah, hehe
<LnL>
right, the details of that don't really matter for the rest of the picture
<gchristensen>
but yeah it uses snapshots for backups
<gchristensen>
the arrow from Selector to build status is a set of queries, right?
<LnL>
yeah
<LnL>
for this probably first listing failed builds on trunk and then a query for each of those
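A rough shape for that two-step query, with the same schema assumptions as the earlier sketch (the jobset name is hypothetical):

```sh
# Step 1: jobs with a failed build on trunk in the last week.
# Step 2: pull each job's recent history to classify the breakage.
# Note: a real version should quote/escape $job properly.
psql hydra -tA -c "
  SELECT DISTINCT job FROM builds
  WHERE jobset = 'trunk' AND finished = 1 AND buildstatus <> 0
    AND stoptime > extract(epoch FROM now() - interval '7 days');" |
while read -r job; do
  psql hydra -tA -c "
    SELECT id, buildstatus, stoptime FROM builds
    WHERE jobset = 'trunk' AND job = '$job' AND finished = 1
    ORDER BY stoptime DESC LIMIT 10;"
done
```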
<gchristensen>
yeah
<LnL>
which either results in an event or not
<LnL>
or always send an event including the delta, whatever makes more sense
<gchristensen>
yeah, so then the output of that could be a stream of "broken-forever" or "broken-recently" messages
<gchristensen>
or a bulk blob of JSON containing that "report"
<gchristensen>
which are you thinking?
<gchristensen>
oh that is what you just said too haha
<LnL>
probably an event for each, I bet the queries could be a bit heavy
<LnL>
long term you might want to make it remember some of the stuff it did so it doesn't start with 0ad every time if it didn't complete a cycle, etc.
<srk>
btw I have a post-receive hook implemented for watching nixpkgs commits, that could be used as a source instead of the webhook. it's a standalone thing for now which passes events to a server, which sends them to clients over websocket to a web frontend
<gchristensen>
LnL: yeah, that sounds like a future thing we can deal with if we have to :P
<gchristensen>
srk: github has post-receive hooks beyond their webhooks?
<srk>
gchristensen: no, it works by checking out a mirror copy of the repo, fetching periodically, then pushing to a repo which has post-receive hooks
<srk>
cause github doesn't make it easy :)
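A minimal sketch of that mirror-and-push arrangement; URLs and paths are hypothetical:

```sh
# One-time: keep a bare mirror of nixpkgs.
git clone --mirror https://github.com/NixOS/nixpkgs.git /srv/nixpkgs.git
# Periodically (cron/timer): refresh the mirror, then push into a
# local bare repo whose post-receive hook emits the events.
git -C /srv/nixpkgs.git fetch --prune
git -C /srv/nixpkgs.git push --mirror /srv/nixpkgs-hooked.git
```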
<gchristensen>
ah
<gchristensen>
we have the webhook setup on github's end
<srk>
sure, but what if I wanted to receive a stream? :)
<gchristensen>
yeah, so the webhook goes right into rabbitmq :)
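And on the consuming side, a subscriber could be as small as this, assuming the amqp-tools CLI from rabbitmq-c; the URL, exchange, and routing key are hypothetical:

```sh
# amqp-consume binds a fresh queue to the exchange and runs the given
# command ("cat" here) once per message body.
amqp-consume --url=amqp://guest:guest@localhost \
  --exchange=github-events --routing-key='nixpkgs.push' cat
```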