gchristensen changed the topic of #nixos-borg to: https://www.patreon.com/ofborg https://monitoring.nix.ci/dashboard/db/ofborg?refresh=10s&orgId=1&from=now-1h&to=now "I get to skip reviewing the PHP code and just wait until it is rewritten in something sane, like POSIX shell. || https://logs.nix.samueldr.com/nixos-borg
orivej has quit [Ping timeout: 260 seconds]
cole-h has quit [Quit: Goodbye]
<LnL> kind of odd but seems like everything stopped after the encoding error, the connrection errors restarted from what I can tell
orivej has joined #nixos-borg
<{^_^}> [ofborg] @ThomasMader opened pull request #466 → config.public.json: add @ThomasMader → https://git.io/JfICD
globin has joined #nixos-borg
<gchristensen> Unix-domain socket path "/var/lib/buildkite-agent-pgloadndump/builds/kif-pgloadndump-1/grahamc/postgres-load-dump/socket/.s.PGSQL.5432" is too long (maximum 107 bytes)
<LnL> morning :)
<LnL> yeah unix domain sockets need to be <120 characters
<gchristensen> github.com/grahamc/hydra-pg-load-dump/
<LnL> did you see the alerts?
<LnL> either something's still wrong or a utf-8 thing brought all the linux builds down
<gchristensen> uhoh
<gchristensen> I didn't :/
<gchristensen> interesting
<LnL> not exactly sure what's going on
<gchristensen> uhh...hrm
<LnL> but the answer to "does this still work?" on buildkite was no so couldn't do much
<gchristensen> ofborg-builder.service
<gchristensen> Active: active (running) since Wed 2020-04-22 21:16:52 UTC; 14h ago
<gchristensen> we should maybe revert and put our eggs in the lapin basket
<LnL> did none of them restart?
<LnL> definitively looked like some did to me
<gchristensen> they restarted and then got stuck for like 12h
<LnL> yeah, but 21:16 (probably +1) is before everything I linked
<LnL> nevermind it's +2 now
* LnL hates timezones
* gchristensen tooo
<LnL> was that eval-2?
<gchristensen> ye
<LnL> 2020-04-22 23:16:42thread '<unnamed>' panicked at 'cannot access stderr during shutdown', src/libcore/option.rs:1188:5
<LnL> ok, that matches perfectly then
<LnL> it did restart afterwards and still processed some stuff however
<LnL> hmm, did they die again?
<gchristensen> they're still running.... but dead
<gchristensen> LnL: I'm going to add your key so you can look :)
<gchristensen> LnL: root@ I need to eat breakfast
<LnL> can't really look right now either, but go eat first :D
<gchristensen> ok :)
<gchristensen> these pages suck
<gchristensen> + pg_dump hydra --create --format=directory --exclude-table users --verbose --host /tmp/tmp.enqPWWTXz0
<gchristensen> pg_dump: [directory archiver] no output directory specified
<gchristensen> oeaeousoadetuhoesnuhonthunsoeauntoh
<gchristensen> LnL: I don't suppose you are familiar with postgres administration and could help with this error: https://buildkite.com/grahamc/postgres-load-dump/builds/38#c37fd019-1ac6-450e-986b-f27606fe6cbd
<LnL> the role? try -U postgresql
cole-h has joined #nixos-borg
<cole-h> So how, if at all, do we deal with these stalled builder alerts? It seems we've got 40 build jobs waiting (on x86_64-linux) because of it
<gchristensen> `I think we should revert the amqp update to ofborg and deploy that
<gchristensen> for now
<gchristensen> can you do that? sorry for allthe alerts :') my calendar has been shit today for doing ... things
<cole-h> Heh wow. I can open the revert, sure. I haven't been briefed on deployment yet, so I'll leave that to you
<gchristensen> sure, thanks
<cole-h> Oof, need to recompile openssl 1.0.2u before I can `cargo test`... Oh well, time to blindly open the PR
<gchristensen> sure
<cole-h> (also, small note: the old version of amqp we were using doesn't have the "Make the socket blocking during the initial connection" and "Set read timeout only after SSL is initialized" commits)
<{^_^}> [ofborg] @cole-h opened pull request #467 → Revert "Bump amqp" → https://git.io/JfIPy
<{^_^}> [ofborg] @grahamc merged pull request #467 → Revert "Bump amqp" → https://git.io/JfIPy
<{^_^}> [ofborg] @grahamc pushed 2 commits to released: https://git.io/JfIXM
<cole-h> Good thing we did this (relatively) early in the morning: travis was able to actually update the status :D
<LnL> I can take a look in a bit but not sure whether it's worth debugging this more
<LnL> cole-h: those commits you mentioned sound pretty suspicious in combination with these threads
<cole-h> Suspicious as in "we should have them" or "those might be causing problems"?
<cole-h> I was really just noting that ofborg hadn't picked up the two commits from 2019 -- reverting what we did yesterday went back to a commit from 2018
<LnL> actually this might be after since it recovered already
<gchristensen> LnL: https://buildkite.com/grahamc/postgres-load-dump/builds/41#0408ca3d-5ac4-49b9-b5aa-0166a447fda3 anxiously watching for this one to succeed :/ -U postgresql didn't work, now I'm trying -U hydra
<LnL> what did it fail with?
<gchristensen> unknown role I think
<LnL> hmm, hydra uses the hydra role but the postgres role should also be there
<cole-h> Maybe `-U postgresql` didn't work, but `-U postgres` will (slight typo?)
<gchristensen> oh!
<gchristensen> -U hydra is doing it!
<gchristensen> look at it go!
<LnL> \o/
<gchristensen> I skipped the "Figure out how to upload it" step
<LnL> the hydra role is guaranteed to be there so probably better to use that anyway :)
<gchristensen> :)
<cole-h> :)
<gchristensen> LnL: so I guess I should upload these artifacts to S3 or something
<gchristensen> and set a policy of delete after 3 days or whatever
<LnL> first question is how big this is
<gchristensen> $bigish
<gchristensen> let's hope it doens't clean up the working directory _after_ the build finishes, and only does the clean up _before_ the build starts
<LnL> actually exporting might also not be needed (at least for the analyzer stuff) if that could talk to it directly
<gchristensen> I want to do a full pgdump to validate the backup is good enough to complete a dump
<gchristensen> (and that the output is ~Xgb)
<gchristensen> but don't have to upload them
<MichaelRaskin> Are you talking of single-shot dump does «policy» imply periodic dumps?
<LnL> this builds table it going to take forever isn't it... :p
<cole-h> gchristensen: Woo, 4 x86_64-linux builders are up and running again :)
<gchristensen> nice!
<gchristensen> LnL: probably :)
<gchristensen> MichaelRaskin: so, every 5 minutes I get an incremental filesystem snapshot from the database server (so does Rob) and I'm thinking this load-and-dump job would run on a daily basis, and as part of that, uploading a database export
<LnL> btw I'd love to see if a partial copy is small enough to use as testing data
<gchristensen> yea
<MichaelRaskin> Like the whole Nixpkgs history export, or last year?
<LnL> just the last year
<LnL> COPY (SELECT * FROM builds as b JOIN jobsetevalmembers as m ON b.id = m.build JOIN jobsetevals as e ON e.id = m.eval WHERE e.project = 'nixpkgs' AND e.jobset = 'trunk' AND e.id > 1515735) TO '/tmp/builds.csv';
<gchristensen> LnL: maybe you could send a PR to that repo, adding that query as an export step?
<gchristensen> also I'm not sure about this `--format directory` choice, I was expecting the directory structure to be ... reasonable ...
<LnL> no idea what that looks like, I've only used -Fc
<LnL> hmm
<MichaelRaskin> Looks like PostgreSQL internal data layout
<LnL> reminds me of the layout of timeseries dbs
<MichaelRaskin> e.id > 1515735 is intended to be date_part('epoch',now())-e.timestamp < 3.3e7, right?
<gchristensen> what's this?
<gchristensen> what's this?oh
<gchristensen> gotcha
<gchristensen> I was hoping it'd be a file-per-table which would be easy to get just some data from, but if it is going to be this useless, might as well just use -Fc :)
<LnL> that's a random eval
<MichaelRaskin> LnL: yes, but hardcoding a constant cutoff sounds like something that will be fine, then not fine
<MichaelRaskin> I wrote the condition «get current time as Unix timestamp, and go back from it by just a bit over a year»
<LnL> ah yeah, let's not put this in the "backup last year" script :D
<LnL> there's a unix timestamp in one of the tables
<MichaelRaskin> Yep, e is jobevals which carries a timestamp
<LnL> oh! thought this was just an example
<LnL> thanks ::D
<LnL> gchristensen: zfs clone -o canmount=noauto
<gchristensen> oh cool, that'll make the || true bit unneeded?
<LnL> I think so
<gchristensen> Let's give it a go :) want to include that in your PR?
arianvp has joined #nixos-borg
<gchristensen> hey arianvp
<arianvp> hello
<cole-h> o/
<cole-h> Welcome to the cool kids club
<gchristensen> moving back to my desk and I'll share some context / background
<arianvp> cool
<gchristensen> okay, back story arianvp
<gchristensen> every 5min an incremental zfs snapshot goes from hydra's db server to my server, where I want to, daily, start a postgresql server with that data and do a pg_dump, plus run some batch queries.
<gchristensen> I do it in a buildkite job, so the job runs as a buildkite user. since postgresql needs a mutable directory to work with, I have to clone the snapshot somewhere and mount it, then chown the contents to the buildkite user so it can write
lordcirth__ has joined #nixos-borg
<gchristensen> the buildkite user can clone the snapshot and destroy the filesystem it clones to, the remaining issues are: mount, chown, unmount
lordcirth__ has left #nixos-borg ["Leaving"]
<gchristensen> I tried to put the filesystem in fstab with the "user" option set, but apparently / according to #zfsonlinux, that requires the mount.zfs helper to be setuid, which upstream doesn't really support
<arianvp> so wait you have a postgres server on your server and basically you want to have the same content there as on hydra-db ?
<arianvp> but not 'live' ?
<gchristensen> right. the main purpose here is validate the snapshot is actually a functional backup
<arianvp> And why not use Postgres's builtin WAL archiving or WAL streaming?
<gchristensen> because these are the tools I have
<gchristensen> besides, the snapshots are pretty good backups actually
<gchristensen> anyway
<arianvp> alright besides the point; was just wondering
<gchristensen> yeah :)
<LnL> gchristensen: where should I put the data? postgres wants an absolute path
<gchristensen> uh
<arianvp> so the goal is: this unprivelged user need to be able to mount the snapshot right?
<gchristensen> maybe make a directory, $(pwd)/upload/SOMENAME.sql and I'll .gz each file in upload individually and upload in to a datestamped path, LnL?
<gchristensen> arianvp: mount it, have rw on the files, and then unmount it
<arianvp> You could use udisksd / udiskctl for this
<arianvp> to give unprivelged people the possibility to mount unmount
<gchristensen> cool
<arianvp> that's what e.g. Gnome uses to implement mounting in their Files app
<gchristensen> this is services.udisks2.enable?
<arianvp> yes
<gchristensen> I suppose it would be strictly better to do it that way than ... what I'm doing now..?
<arianvp> you can then use `udiskctl mount` to mount
<arianvp> more.. conventional? :P
<gchristensen> nice
<gchristensen> :P
<gchristensen> I dunno, I kind of like this :D
<gchristensen> in a sick and twisted way
<arianvp> yeh it's kinda simple
<arianvp> but udisks does basically this but instead of touching files it sends a DBUS message
<arianvp> :P
<gchristensen> gotcha
<LnL> btw, I have no idea how this works under the hood but you can zfs allow users to mount specific volumes
<gchristensen> LnL: it doesn't work on linux :(
<LnL> hmm, thought it did
<arianvp> ok time for some dinner
<gchristensen> thanks for the tips, arianvp
<gchristensen> LnL: you can't even give someone a kernel capability to do it, since the capability required is literally CAP_SYS_ADMIN :(
<LnL> assumed it would go through zed or something
<gchristensen> unfortunately not
<gchristensen> (though sort of fortunately so, makes it harder to create security bugs if things have to happen as the user)
<gchristensen> LnL: got final sizes
<gchristensen> the `directory` dump is 32g
<LnL> oh really? that's less than I was expecting
<cole-h> That's less than most AAA games!
<gchristensen> lol
<gchristensen> yeah
<gchristensen> it is like 300G+ imported in to postgres
<cole-h> That's slightly more than most AAA games!
<gchristensen> lol
<gchristensen> I think a lot of it is because there are tables with way too many indexes that are completely useless
<gchristensen> «recommendation engine mode» People in #nixos-borg may also be interested in #nixos-infra
<cole-h> Soon I won't be able to switch to these by hotkey anymore :(
<LnL> _another_ nix channel, when was that created?
<gchristensen> heh
<gchristensen> not so long ago
* cole-h is still in #nixos-baduk with manveru, all by our lonesomes
<gchristensen> it used to be that things like hydra admin stuff would polute #nixos-dev or be done in needlessly private places
<LnL> does anybody else miss substitute/substituteInPlace when doing stuff outside of nix?
<gchristensen> yes
<gchristensen> LnL: have you seen abathur's resholver?
<gchristensen> because wow
<LnL> whoa
<gchristensen> exactly
<{^_^}> #85827 (by abathur, 18 hours ago, open): resholved: init at hopes and dreams
* cole-h cries in no fish support
<gchristensen> it is based on oil after all
<gchristensen> do people write fish shell scripts?
<cole-h> For plugins, yes :P
<gchristensen> ah
<LnL> urgh, doesn't want to import because projects have an owner :/
<gchristensen> huh?
<LnL> with the csv data the schema is separate
<gchristensen> right
<{^_^}> [ofborg] @emilazy opened pull request #468 → config.public.json: add emilazy to trusted_users → https://git.io/JfI7L
<{^_^}> [ofborg] @grahamc merged pull request #425 → config.public.json: add marsam to trusted_users → https://git.io/JeQwf
<{^_^}> [ofborg] @grahamc pushed 2 commits to released: https://git.io/JfI73
<{^_^}> [ofborg] @grahamc merged pull request #468 → config.public.json: add emilazy to trusted_users → https://git.io/JfI7L
<{^_^}> [ofborg] @grahamc pushed 2 commits to released: https://git.io/JfI7s
<{^_^}> [ofborg] @grahamc closed pull request #466 → config.public.json: add @ThomasMader → https://git.io/JfICD
<{^_^}> [ofborg] @grahamc closed pull request #439 → add symphorien to known users → https://git.io/JvPgx
<{^_^}> [ofborg] @grahamc closed pull request #427 → known-users: Add myself → https://git.io/Jebrr
<cole-h> Shocking: the first time GH shows me a unicorn and no ofborg internal errors show up later that day...
* cole-h knocks on wood
evanjs has quit [Quit: ZNC 1.7.5 - https://znc.in]
evanjs has joined #nixos-borg
<LnL> gchristensen: for my db this is actually bigger so I'm not sure if it's useful
<cole-h> gchristensen: Was the problem with the comment filter because the systemd service was in a failing state (after the panic)? Or was it still running, but doing nothing?
andi- has quit [Ping timeout: 256 seconds]
andi- has joined #nixos-borg
<gchristensen> still running doing nothing
<cole-h> Whacky
<gchristensen> a half-panicked state
<LnL> still issues after the revert?
<cole-h> Don't think so, was from way before the revert. Just didn't notice until emilazy pinged in -dev
<cole-h> No output from the filter service until I checked the "last 12 hours" time window in Loki, so I thought something was strange (since last output was a panic)
<gchristensen> LnL: zfs' permissions are very precise lol
<gchristensen> [root@kif:~]# sudo -u buildkite-agent-pgloadndump zfs clone -o canmount=noauto rpool/backups/nixos.org/haumea/safe/postgres@2020-04-23T21:40:00Z rpool/scratch/haumea-load-and-dump/target
<gchristensen> cannot create 'rpool/scratch/haumea-load-and-dump/target': permission denied
<gchristensen> why? buildkite-agent-pgloadndump doesn't have the permission to set canmount
<gchristensen> I dearly wish ZFS's error messages included what check failed
<LnL> ah, yeah
<LnL> zfs allow -u buildkite-agent-pgloadndump canmount?
<gchristensen> yeah :)
<gchristensen> well
<gchristensen> zfs allow -dl -u buildkite-agent-pgloadndump create,mount,destroy,canmount rpool/scratch/haumea-load-and-dump
andi- has quit [Quit: WeeChat 2.8]
<cole-h> As a zfs newb, what's the difference between mount and canmount?
andi- has joined #nixos-borg
<cole-h> (read: no experience with zfs yet)
<LnL> canmount is a flag, so the permission allows the user to set that
<cole-h> What does mount do, then?
<gchristensen> it mounts :)
<cole-h> Hm. So, `create,mount,destroy,canmount` creates, mounts, destroys, and allows the user to mount?
* cole-h scratches head
<gchristensen> oh
<gchristensen> those are permissions
<gchristensen> can: create, mount, destroy, and set the canmount property
<cole-h> Ohhh, that makes more sense.
andi- has quit [Excess Flood]
<cole-h> I was thinking that was a list of things to do, which was why I was confused on why canmount was even there... lol
<gchristensen> :)
<cole-h> ✨ gchristensen ✨ LnL
<{^_^}> LnL's karma got increased to 35, gchristensen's karma got increased to 277
andi- has joined #nixos-borg
<LnL> it's pretty nice you can give more fine grained permissions for to users
<LnL> except for mounting aparently
<gchristensen> you tell ZFS to grant it but then it tries and linux says no no no :(
<cole-h> Am I hearing "upstream a kernel patch that adds CAP_SYS_CANMOUNT" (which was your original idea, I think)?
<LnL> but you can give a regular service user permissions to create/rotate snapshots for example
<gchristensen> cole-h: yeah I spent about 5 minutes looking at that before realizing I was nuts for trying
<gchristensen> I don't have the willpower to do it right now :P
<cole-h> The hardest part would probably be answering the question "why do you need this?". "Uh, so I can mount some ZFS disks without root..."
<gchristensen> and then "...."
<cole-h> And then more "...."
<cole-h> And finally "...No."
* cole-h wonders why there's no data for "disk free", "memory", "CPU", and "target branch evaluation failures" in the ofborg dashboard
<cole-h> (scratch that last one)
<gchristensen> you could fix those :)
<cole-h> Heh.
<gchristensen> hehe
<cole-h> What should CPU be displaying?
<gchristensen> what does the graph query for now+?
<cole-h> We got core throttles total, freq max hz, freq min hz, guest seconds total, package throttles total, scaling freq hz, scaling freq min/max hz, and seconds total
<cole-h> It queries for `(irate(node_cpu{job="node",mode="idle"}[5m]))`
<cole-h> (and some other stuff
<cole-h> )
<cole-h> Full query: 100 - (avg by (instance) (irate(node_cpu{job="node",mode="idle"}[5m])) * 100)
<gchristensen> hrm
<cole-h> Memory and disk free fixed (just had to `_bytes` all the things)
<gchristensen> see the table with Metric ...
<cole-h> Got it
<cole-h> Should be fixed. Take a peek and see if the graph looks like what you want to see?
<gchristensen> nice!!
<gchristensen> looks great!
<cole-h> Wew
<gchristensen> thanks!
<cole-h> And also: of course there's no data for "Target branch eval failures" lol
<cole-h> No eval failures right now :D
<gchristensen> :)
<cole-h> If you go to last 12 hours, though, you can see the blip (really, more than a blip) from earlier
<gchristensen> yeah :)
<gchristensen> (whoops)
<cole-h> Hehe
tilpner_ has joined #nixos-borg
<cole-h> I don't think that was out fault tho
<cole-h> Pretty sure that was because of the kernel stuff
tilpner has quit [Remote host closed the connection]
<gchristensen> yeah
tilpner_ is now known as tilpner