#nixos-borg on 2020-04-23

2018-04-19 20:36 gchristensen changed the topic of #nixos-borg to: https://www.patreon.com/ofborg https://monitoring.nix.ci/dashboard/db/ofborg?refresh=10s&orgId=1&from=now-1h&to=now "I get to skip reviewing the PHP code and just wait until it is rewritten in something sane, like POSIX shell." || https://logs.nix.samueldr.com/nixos-borg

00:51 orivej has quit [Ping timeout: 260 seconds]

05:42 cole-h has quit [Quit: Goodbye]

05:58 <LnL> https://gist.github.com/LnL7/6f29180b61fc683319032c46f55198de

06:01 <LnL> kind of odd but seems like everything stopped after the encoding error, the connection errors restarted from what I can tell

06:16 orivej has joined #nixos-borg

06:30 <{^_^}> [ofborg] @ThomasMader opened pull request #466 → config.public.json: add @ThomasMader → https://git.io/JfICD

10:02 globin has joined #nixos-borg

11:04 <gchristensen> Unix-domain socket path "/var/lib/buildkite-agent-pgloadndump/builds/kif-pgloadndump-1/grahamc/postgres-load-dump/socket/.s.PGSQL.5432" is too long (maximum 107 bytes)

11:11 <gchristensen> LnL: https://buildkite.com/grahamc/postgres-load-dump/builds/33#f7503302-061d-4df6-9802-3524e8c45b96

11:16 <LnL> morning :)

11:16 <gchristensen> morning :) https://buildkite.com/grahamc/postgres-load-dump/builds/34#dd50a46c-b963-4424-89bd-ac9526dbd812

11:16 <LnL> yeah unix domain sockets need to be <120 characters

11:18 <gchristensen> github.com/grahamc/hydra-pg-load-dump/

11:20 <LnL> did you see the alerts?

11:20 <LnL> either something's still wrong or a utf-8 thing brought all the linux builds down

11:21 <gchristensen> uhoh

11:21 <gchristensen> I didn't :/

11:22 <gchristensen> interesting

11:23 <LnL> not exactly sure what's going on

11:24 <gchristensen> uhh...hrm

11:24 <LnL> but the answer to "does this still work?" on buildkite was no so couldn't do much

11:24 <gchristensen> ofborg-builder.service

11:24 <gchristensen> Active: active (running) since Wed 2020-04-22 21:16:52 UTC; 14h ago

11:26 <gchristensen> we should maybe revert and put our eggs in the lapin basket

11:28 <LnL> did none of them restart?

11:28 <LnL> definitely looked like some did to me

11:28 <gchristensen> they restarted and then got stuck for like 12h

11:30 <LnL> yeah, but 21:16 (probably +1) is before everything I linked

11:32 <LnL> nevermind it's +2 now

11:32 * LnL hates timezones

11:33 * gchristensen tooo

11:34 <LnL> was that eval-2?

11:35 <gchristensen> ye

11:35 <LnL> 2020-04-22 23:16:42 thread '<unnamed>' panicked at 'cannot access stderr during shutdown', src/libcore/option.rs:1188:5

11:35 <LnL> ok, that matches perfectly then

11:39 <LnL> it did restart afterwards and still processed some stuff however

11:54 <LnL> hmm, did they die again?

11:54 <gchristensen> they're still running.... but dead

11:55 <gchristensen> LnL: I'm going to add your key so you can look :)

11:55 <gchristensen> LnL: root@147.75.65.23 I need to eat breakfast

11:56 <LnL> can't really look right now either, but go eat first :D

11:56 <gchristensen> ok :)

12:34 <gchristensen> these pages suck

12:43 <gchristensen> + pg_dump hydra --create --format=directory --exclude-table users --verbose --host /tmp/tmp.enqPWWTXz0

12:43 <gchristensen> pg_dump: [directory archiver] no output directory specified
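The error above comes from `--format=directory` being used without an output path: the directory format cannot write to stdout, so `pg_dump` needs `--file`. A small sketch of assembling the same invocation with the missing option follows. The function name and the `dump` output directory are hypothetical, and the snippet only builds the argument list rather than running anything.

```python
import shlex


def pg_dump_argv(dbname: str, socket_dir: str, output_dir: str) -> list:
    """Build a pg_dump command line for a directory-format dump."""
    return [
        "pg_dump", dbname,
        "--create",
        "--format=directory",
        # The directory format requires an explicit target; pg_dump creates it
        # and expects that it does not exist yet.
        f"--file={output_dir}",
        "--exclude-table", "users",
        "--verbose",
        "--host", socket_dir,
    ]


argv = pg_dump_argv("hydra", "/tmp/tmp.enqPWWTXz0", "dump")
print(shlex.join(argv))  # shell-quoted form, handy for logging (Python 3.8+)
```

The resulting list can be passed to `subprocess.run(argv, check=True)`, which avoids shell quoting issues.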

12:43 <gchristensen> oeaeousoadetuhoesnuhonthunsoeauntoh

13:08 <gchristensen> LnL: I don't suppose you are familiar with postgres administration and could help with this error: https://buildkite.com/grahamc/postgres-load-dump/builds/38#c37fd019-1ac6-450e-986b-f27606fe6cbd

13:10 <LnL> the role? try -U postgresql

14:49 cole-h has joined #nixos-borg

15:14 <cole-h> So how, if at all, do we deal with these stalled builder alerts? It seems we've got 40 build jobs waiting (on x86_64-linux) because of it

15:18 <gchristensen> I think we should revert the amqp update to ofborg and deploy that

15:18 <gchristensen> for now

15:19 <gchristensen> can you do that? sorry for all the alerts :') my calendar has been shit today for doing ... things

15:19 <gchristensen> https://gsc.io/snaps/3078bc7a-f951-4a09-971b-e8ace645ed3b.png

15:22 <cole-h> Heh wow. I can open the revert, sure. I haven't been briefed on deployment yet, so I'll leave that to you

15:22 <gchristensen> sure, thanks

15:23 <cole-h> Oof, need to recompile openssl 1.0.2u before I can `cargo test`... Oh well, time to blindly open the PR

15:24 <gchristensen> sure

15:25 <cole-h> (also, small note: the old version of amqp we were using doesn't have the "Make the socket blocking during the initial connection" and "Set read timeout only after SSL is initialized" commits)

15:25 <{^_^}> [ofborg] @cole-h opened pull request #467 → Revert "Bump amqp" → https://git.io/JfIPy

15:39 <{^_^}> [ofborg] @grahamc merged pull request #467 → Revert "Bump amqp" → https://git.io/JfIPy

15:39 <{^_^}> [ofborg] @grahamc pushed 2 commits to released: https://git.io/JfIXM

15:40 <cole-h> Good thing we did this (relatively) early in the morning: travis was able to actually update the status :D

15:42 <LnL> I can take a look in a bit but not sure whether it's worth debugging this more

16:16 <LnL> https://gist.github.com/LnL7/a1ae5289936d70aa892e168c67ee7ac8

16:17 <LnL> cole-h: those commits you mentioned sound pretty suspicious in combination with these threads

16:18 <cole-h> Suspicious as in "we should have them" or "those might be causing problems"?

16:19 <cole-h> I was really just noting that ofborg hadn't picked up the two commits from 2019 -- reverting what we did yesterday went back to a commit from 2018

16:27 <LnL> actually this might be after since it recovered already

16:29 <gchristensen> LnL: https://buildkite.com/grahamc/postgres-load-dump/builds/41#0408ca3d-5ac4-49b9-b5aa-0166a447fda3 anxiously watching for this one to succeed :/ -U postgresql didn't work, now I'm trying -U hydra

16:30 <LnL> what did it fail with?

16:31 <gchristensen> unknown role I think

16:31 <LnL> hmm, hydra uses the hydra role but the postgres role should also be there

16:32 <cole-h> Maybe `-U postgresql` didn't work, but `-U postgres` will (slight typo?)

16:32 <gchristensen> oh!

16:32 <gchristensen> -U hydra is doing it!

16:32 <gchristensen> look at it go!

16:33 <LnL> \o/

16:34 <gchristensen> I skipped the "Figure out how to upload it" step

16:34 <LnL> the hydra role is guaranteed to be there so probably better to use that anyway :)

16:34 <gchristensen> :)

16:34 <cole-h> :)

16:34 <gchristensen> LnL: so I guess I should upload these artifacts to S3 or something

16:34 <gchristensen> and set a policy of delete after 3 days or whatever

16:35 <LnL> first question is how big this is

16:35 <gchristensen> $bigish

16:36 <gchristensen> let's hope it doesn't clean up the working directory _after_ the build finishes, and only does the clean up _before_ the build starts

16:36 <LnL> actually exporting might also not be needed (at least for the analyzer stuff) if that could talk to it directly

16:37 <gchristensen> I want to do a full pgdump to validate the backup is good enough to complete a dump

16:38 <gchristensen> (and that the output is ~Xgb)

16:38 <gchristensen> but don't have to upload them

16:48 <MichaelRaskin> Are you talking of a single-shot dump, or does «policy» imply periodic dumps?

16:49 <LnL> this builds table is going to take forever isn't it... :p

16:49 <cole-h> gchristensen: Woo, 4 x86_64-linux builders are up and running again :)

16:49 <gchristensen> nice!

16:49 <gchristensen> LnL: probably :)

16:50 <gchristensen> MichaelRaskin: so, every 5 minutes I get an incremental filesystem snapshot from the database server (so does Rob) and I'm thinking this load-and-dump job would run on a daily basis, and as part of that, uploading a database export

16:51 <LnL> btw I'd love to see if a partial copy is small enough to use as testing data

16:52 <gchristensen> yea

16:52 <MichaelRaskin> Like the whole Nixpkgs history export, or last year?

16:52 <LnL> just the last year

16:53 <LnL> COPY (SELECT * FROM builds as b JOIN jobsetevalmembers as m ON b.id = m.build JOIN jobsetevals as e ON e.id = m.eval WHERE e.project = 'nixpkgs' AND e.jobset = 'trunk' AND e.id > 1515735) TO '/tmp/builds.csv';

16:54 <gchristensen> LnL: maybe you could send a PR to that repo, adding that query as an export step?

16:55 <gchristensen> also I'm not sure about this `--format directory` choice, I was expecting the directory structure to be ... reasonable ...

16:56 <LnL> no idea what that looks like, I've only used -Fc

16:56 <gchristensen> https://gist.github.com/grahamc/105b50af8e7133e6d40ed452ccd14c01

16:57 <LnL> hmm

16:58 <MichaelRaskin> Looks like PostgreSQL internal data layout

16:59 <LnL> reminds me of the layout of timeseries dbs

17:00 <MichaelRaskin> e.id > 1515735 is intended to be date_part('epoch',now())-e.timestamp < 3.3e7, right?

17:00 <gchristensen> what's this?

17:00 <gchristensen> oh

17:00 <gchristensen> gotcha

17:00 <gchristensen> I was hoping it'd be a file-per-table which would be easy to get just some data from, but if it is going to be this useless, might as well just use -Fc :)

17:00 <LnL> that's a random eval

17:01 <LnL> https://hydra.nixos.org/eval/1515735

17:02 <MichaelRaskin> LnL: yes, but hardcoding a constant cutoff sounds like something that will be fine, then not fine

17:02 <MichaelRaskin> I wrote the condition «get current time as Unix timestamp, and go back from it by just a bit over a year»
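The rolling cutoff MichaelRaskin describes can be sketched outside SQL as follows. This mirrors the condition `date_part('epoch', now()) - e.timestamp < 3.3e7`; the function and constant names are assumptions for illustration, and the timestamps are taken to be Unix epoch seconds as discussed in the channel.

```python
import time

# 3.3e7 seconds is roughly 382 days, i.e. "just a bit over a year"
# (365 days would be 31,536,000 seconds).
WINDOW_SECONDS = 3.3e7


def is_within_window(eval_timestamp, now=None):
    """Return True if an eval's Unix timestamp falls inside the rolling window."""
    # Default to the current time, like now() in the SQL condition.
    current = time.time() if now is None else now
    return current - eval_timestamp < WINDOW_SECONDS


# Fixed reference time so the example is reproducible (hypothetical value).
reference = 1_587_661_200
print(is_within_window(reference - 86_400, now=reference))      # one day old
print(is_within_window(reference - 40_000_000, now=reference))  # well over a year old
```

Unlike a hardcoded eval id, the window moves with the clock, so the same script keeps selecting "the last year" whenever it runs.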

17:02 <LnL> ah yeah, let's not put this in the "backup last year" script :D

17:03 <LnL> there's a unix timestamp in one of the tables

17:04 <MichaelRaskin> Yep, e is jobsetevals which carries a timestamp

17:04 <LnL> oh! thought this was just an example

17:04 <LnL> thanks ::D

17:11 <LnL> gchristensen: zfs clone -o canmount=noauto

17:12 <gchristensen> oh cool, that'll make the || true bit unneeded?

17:12 <LnL> I think so

17:13 <gchristensen> Let's give it a go :) want to include that in your PR?

17:15 arianvp has joined #nixos-borg

17:15 <gchristensen> hey arianvp

17:15 <arianvp> hello

17:15 <cole-h> o/

17:16 <cole-h> Welcome to the cool kids club

17:16 <gchristensen> moving back to my desk and I'll share some context / background

17:16 <arianvp> cool

17:17 <gchristensen> okay, back story arianvp

17:18 <gchristensen> every 5min an incremental zfs snapshot goes from hydra's db server to my server, where I want to, daily, start a postgresql server with that data and do a pg_dump, plus run some batch queries.

17:19 <gchristensen> I do it in a buildkite job, so the job runs as a buildkite user. since postgresql needs a mutable directory to work with, I have to clone the snapshot somewhere and mount it, then chown the contents to the buildkite user so it can write

17:20 <gchristensen> https://github.com/grahamc/hydra-pg-load-dump/blob/master/dump.sh

17:20 lordcirth__ has joined #nixos-borg

17:21 <gchristensen> the buildkite user can clone the snapshot and destroy the filesystem it clones to, the remaining issues are: mount, chown, unmount

17:21 lordcirth__ has left #nixos-borg ["Leaving"]

17:21 <gchristensen> I tried to put the filesystem in fstab with the "user" option set, but apparently / according to #zfsonlinux, that requires the mount.zfs helper to be setuid, which upstream doesn't really support

17:22 <arianvp> so wait you have a postgres server on your server and basically you want to have the same content there as on hydra-db ?

17:22 <arianvp> but not 'live' ?

17:23 <gchristensen> right. the main purpose here is validate the snapshot is actually a functional backup

17:25 <arianvp> And why not use Postgres's builtin WAL archiving or WAL streaming?

17:26 <gchristensen> because these are the tools I have

17:26 <gchristensen> besides, the snapshots are pretty good backups actually

17:26 <gchristensen> anyway

17:26 <arianvp> alright besides the point; was just wondering

17:26 <gchristensen> yeah :)

17:28 <gchristensen> https://gist.github.com/grahamc/208c104ef4a1ed8fe2a143281da7a472 this is what I have

17:29 <LnL> gchristensen: where should I put the data? postgres wants an absolute path

17:29 <gchristensen> uh

17:30 <arianvp> so the goal is: this unprivileged user needs to be able to mount the snapshot right?

17:30 <gchristensen> maybe make a directory, $(pwd)/upload/SOMENAME.sql and I'll .gz each file in upload individually and upload in to a datestamped path, LnL?

17:30 <gchristensen> arianvp: mount it, have rw on the files, and then unmount it

17:31 <arianvp> You could use udisksd / udisksctl for this

17:31 <arianvp> to give unprivileged people the possibility to mount/unmount

17:31 <gchristensen> cool

17:31 <arianvp> that's what e.g. Gnome uses to implement mounting in their Files app

17:31 <gchristensen> this is services.udisks2.enable?

17:32 <arianvp> yes

17:32 <gchristensen> I suppose it would be strictly better to do it that way than ... what I'm doing now..?

17:32 <arianvp> you can then use `udisksctl mount` to mount

17:32 <arianvp> more.. conventional? :P

17:32 <gchristensen> nice

17:32 <gchristensen> :P

17:32 <gchristensen> I dunno, I kind of like this :D

17:33 <gchristensen> in a sick and twisted way

17:33 <arianvp> yeh it's kinda simple

17:33 <arianvp> but udisks does basically this but instead of touching files it sends a DBUS message

17:33 <arianvp> :P

17:33 <gchristensen> gotcha

17:33 <LnL> btw, I have no idea how this works under the hood but you can zfs allow users to mount specific volumes

17:33 <gchristensen> LnL: it doesn't work on linux :(

17:33 <LnL> hmm, thought it did

17:34 <arianvp> ok time for some dinner

17:34 <gchristensen> thanks for the tips, arianvp

17:35 <gchristensen> LnL: you can't even give someone a kernel capability to do it, since the capability required is literally CAP_SYS_ADMIN :(

17:36 <LnL> assumed it would go through zed or something

17:36 <gchristensen> unfortunately not

17:37 <gchristensen> (though sort of fortunately so, makes it harder to create security bugs if things have to happen as the user)

17:48 <gchristensen> LnL: got final sizes

17:48 <gchristensen> the `directory` dump is 32g

17:49 <LnL> oh really? that's less than I was expecting

17:50 <cole-h> That's less than most AAA games!

17:50 <gchristensen> lol

17:50 <gchristensen> yeah

17:51 <gchristensen> it is like 300G+ imported in to postgres

17:52 <cole-h> That's slightly more than most AAA games!

17:52 <gchristensen> lol

17:52 <gchristensen> I think a lot of it is because there are tables with way too many indexes that are completely useless

18:08 <gchristensen> «recommendation engine mode» People in #nixos-borg may also be interested in #nixos-infra

18:09 <cole-h> Soon I won't be able to switch to these by hotkey anymore :(

18:14 <LnL> _another_ nix channel, when was that created?

18:14 <gchristensen> heh

18:14 <gchristensen> not so long ago

18:14 * cole-h is still in #nixos-baduk with manveru, all by our lonesomes

18:15 <gchristensen> it used to be that things like hydra admin stuff would pollute #nixos-dev or be done in needlessly private places

18:19 <LnL> does anybody else miss substitute/substituteInPlace when doing stuff outside of nix?

18:19 <gchristensen> yes

18:19 <gchristensen> LnL: have you seen abathur's resholved?

18:19 <gchristensen> because wow

18:21 <LnL> whoa

18:21 <gchristensen> exactly

18:22 <gchristensen> https://github.com/NixOS/nixpkgs/pull/85827

18:22 <{^_^}> #85827 (by abathur, 18 hours ago, open): resholved: init at hopes and dreams

18:24 * cole-h cries in no fish support

18:25 <gchristensen> it is based on oil after all

18:25 <gchristensen> do people write fish shell scripts?

18:25 <cole-h> For plugins, yes :P

18:26 <gchristensen> ah

18:49 <LnL> urgh, doesn't want to import because projects have an owner :/

18:49 <gchristensen> huh?

18:51 <LnL> with the csv data the schema is separate

18:51 <gchristensen> right

19:17 <{^_^}> [ofborg] @emilazy opened pull request #468 → config.public.json: add emilazy to trusted_users → https://git.io/JfI7L

19:18 <{^_^}> [ofborg] @grahamc merged pull request #425 → config.public.json: add marsam to trusted_users → https://git.io/JeQwf

19:18 <{^_^}> [ofborg] @grahamc pushed 2 commits to released: https://git.io/JfI73

19:19 <{^_^}> [ofborg] @grahamc merged pull request #468 → config.public.json: add emilazy to trusted_users → https://git.io/JfI7L

19:19 <{^_^}> [ofborg] @grahamc pushed 2 commits to released: https://git.io/JfI7s

19:19 <{^_^}> [ofborg] @grahamc closed pull request #466 → config.public.json: add @ThomasMader → https://git.io/JfICD

19:19 <{^_^}> [ofborg] @grahamc closed pull request #439 → add symphorien to known users → https://git.io/JvPgx

19:20 <{^_^}> [ofborg] @grahamc closed pull request #427 → known-users: Add myself → https://git.io/Jebrr

19:49 <cole-h> Shocking: the first time GH shows me a unicorn and no ofborg internal errors show up later that day...

19:50 * cole-h knocks on wood

20:40 evanjs has quit [Quit: ZNC 1.7.5 - https://znc.in]

20:41 evanjs has joined #nixos-borg

20:47 <LnL> gchristensen: for my db this is actually bigger so I'm not sure if it's useful

21:18 <cole-h> gchristensen: Was the problem with the comment filter because the systemd service was in a failing state (after the panic)? Or was it still running, but doing nothing?

21:24 andi- has quit [Ping timeout: 256 seconds]

21:29 andi- has joined #nixos-borg

21:30 <gchristensen> still running doing nothing

21:31 <cole-h> Whacky

21:32 <gchristensen> a half-panicked state

21:33 <LnL> still issues after the revert?

21:34 <cole-h> Don't think so, was from way before the revert. Just didn't notice until emilazy pinged in -dev

21:35 <cole-h> No output from the filter service until I checked the "last 12 hours" time window in Loki, so I thought something was strange (since last output was a panic)

21:53 <gchristensen> LnL: zfs' permissions are very precise lol

21:53 <gchristensen> [root@kif:~]# sudo -u buildkite-agent-pgloadndump zfs clone -o canmount=noauto rpool/backups/nixos.org/haumea/safe/postgres@2020-04-23T21:40:00Z rpool/scratch/haumea-load-and-dump/target

21:53 <gchristensen> cannot create 'rpool/scratch/haumea-load-and-dump/target': permission denied

21:53 <gchristensen> why? buildkite-agent-pgloadndump doesn't have the permission to set canmount

21:53 <gchristensen> I dearly wish ZFS's error messages included what check failed

21:54 <LnL> ah, yeah

21:55 <LnL> zfs allow -u buildkite-agent-pgloadndump canmount?

21:55 <gchristensen> yeah :)

21:55 <gchristensen> well

21:55 <gchristensen> zfs allow -dl -u buildkite-agent-pgloadndump create,mount,destroy,canmount rpool/scratch/haumea-load-and-dump

21:57 andi- has quit [Quit: WeeChat 2.8]

21:57 <cole-h> As a zfs newb, what's the difference between mount and canmount?

21:57 andi- has joined #nixos-borg

21:57 <cole-h> (read: no experience with zfs yet)

21:58 <LnL> canmount is a flag, so the permission allows the user to set that

21:58 <gchristensen> https://docs.oracle.com/cd/E23824_01/html/821-1448/gazss.html#gdrcf

21:59 <cole-h> What does mount do, then?

21:59 <gchristensen> it mounts :)

22:00 <cole-h> Hm. So, `create,mount,destroy,canmount` creates, mounts, destroys, and allows the user to mount?

22:00 * cole-h scratches head

22:00 <gchristensen> oh

22:01 <gchristensen> those are permissions

22:01 <gchristensen> can: create, mount, destroy, and set the canmount property

22:01 <cole-h> Ohhh, that makes more sense.

22:01 <gchristensen> https://docs.oracle.com/cd/E36784_01/html/E36871/zfs-allow-1m.html

22:02 andi- has quit [Excess Flood]

22:02 <cole-h> I was thinking that was a list of things to do, which was why I was confused on why canmount was even there... lol

22:02 <gchristensen> :)

22:02 <cole-h> ✨ gchristensen ✨ LnL

22:02 <{^_^}> LnL's karma got increased to 35, gchristensen's karma got increased to 277

22:03 andi- has joined #nixos-borg

22:03 <LnL> it's pretty nice you can give more fine grained permissions to users

22:03 <LnL> except for mounting apparently

22:04 <gchristensen> you tell ZFS to grant it but then it tries and linux says no no no :(

22:05 <cole-h> Am I hearing "upstream a kernel patch that adds CAP_SYS_CANMOUNT" (which was your original idea, I think)?

22:05 <LnL> but you can give a regular service user permissions to create/rotate snapshots for example

22:06 <gchristensen> cole-h: yeah I spent about 5 minutes looking at that before realizing I was nuts for trying

22:06 <gchristensen> I don't have the willpower to do it right now :P

22:07 <cole-h> The hardest part would probably be answering the question "why do you need this?". "Uh, so I can mount some ZFS disks without root..."

22:10 <gchristensen> and then "...."

22:10 <cole-h> And then more "...."

22:10 <cole-h> And finally "...No."

22:58 * cole-h wonders why there's no data for "disk free", "memory", "CPU", and "target branch evaluation failures" in the ofborg dashboard

22:58 <cole-h> (scratch that last one)

22:59 <gchristensen> you could fix those :)

23:02 <cole-h> Heh.

23:02 <gchristensen> hehe

23:02 <cole-h> What should CPU be displaying?

23:03 <gchristensen> what does the graph query for now+?

23:03 <cole-h> We got core throttles total, freq max hz, freq min hz, guest seconds total, package throttles total, scaling freq hz, scaling freq min/max hz, and seconds total

23:03 <cole-h> It queries for `(irate(node_cpu{job="node",mode="idle"}[5m]))`

23:04 <cole-h> (and some other stuff

23:04 <cole-h> )

23:04 <cole-h> Full query: 100 - (avg by (instance) (irate(node_cpu{job="node",mode="idle"}[5m])) * 100)

23:04 <gchristensen> hrm

23:04 <cole-h> Memory and disk free fixed (just had to `_bytes` all the things)

23:06 <gchristensen> I think we want to do this: https://prometheus.io/docs/guides/node-exporter/#exploring-node-exporter-metrics-through-the-prometheus-expression-browser

23:06 <gchristensen> see the table with Metric ...

23:06 <cole-h> Got it

23:08 <cole-h> Should be fixed. Take a peek and see if the graph looks like what you want to see?
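For readers following the dashboard fix: the panel's expression, `100 - (avg by (instance) (irate(...{mode="idle"}[5m])) * 100)`, turns per-CPU idle counters into a usage percentage. The sketch below reproduces that arithmetic on plain numbers. It assumes, as the linked node exporter guide suggests but the log does not state, that the idle counter is `node_cpu_seconds_total` in seconds; the function name and sample values are hypothetical.

```python
def cpu_usage_percent(idle_samples, interval_seconds):
    """Approximate the panel's CPU usage from per-CPU idle counter samples.

    idle_samples: list of (earlier, later) idle-seconds counter readings,
                  one pair per CPU, taken interval_seconds apart.
    """
    # irate: per-second increase between the two most recent samples.
    idle_rates = [(later - earlier) / interval_seconds
                  for earlier, later in idle_samples]
    # avg by (instance): mean idle fraction across the instance's CPUs.
    average_idle = sum(idle_rates) / len(idle_rates)
    # Whatever is not idle counts as usage.
    return 100 - average_idle * 100


# Two CPUs over 60 seconds: one idle for 30 s, the other idle for all 60 s.
samples = [(1000.0, 1030.0), (2000.0, 2060.0)]
print(cpu_usage_percent(samples, 60))  # 25.0
```

The same reasoning explains why a renamed metric makes the panel go blank: with no idle series to average, the expression returns nothing at all instead of zero.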

23:08 <gchristensen> nice!!

23:08 <gchristensen> looks great!

23:08 <cole-h> Wew

23:08 <gchristensen> thanks!

23:08 <cole-h> And also: of course there's no data for "Target branch eval failures" lol

23:08 <cole-h> No eval failures right now :D

23:09 <gchristensen> :)

23:09 <cole-h> If you go to last 12 hours, though, you can see the blip (really, more than a blip) from earlier

23:09 <gchristensen> yeah :)

23:09 <gchristensen> (whoops)

23:09 <cole-h> Hehe

23:09 tilpner_ has joined #nixos-borg

23:09 <cole-h> I don't think that was our fault tho

23:09 <cole-h> Pretty sure that was because of the kernel stuff

23:09 tilpner has quit [Remote host closed the connection]

23:09 <gchristensen> yeah

23:10 tilpner_ is now known as tilpner