00:00:57imer quits [Client Quit]
00:17:01AmAnd0A quits [Ping timeout: 265 seconds]
00:18:00AmAnd0A joins
00:25:13imer (imer) joins
00:28:28railen63 quits [Remote host closed the connection]
00:28:43railen63 joins
00:31:28imer quits [Client Quit]
00:49:49AmAnd0A quits [Read error: Connection reset by peer]
00:50:10AmAnd0A joins
00:52:36AmAnd0A quits [Read error: Connection reset by peer]
00:52:53AmAnd0A joins
01:05:04imer (imer) joins
01:10:11@rewby quits [Ping timeout: 252 seconds]
01:11:41BigBrain quits [Ping timeout: 245 seconds]
01:12:25BigBrain (bigbrain) joins
01:28:55AlsoHP_Archivist joins
01:32:11HP_Archivist quits [Ping timeout: 252 seconds]
01:35:29AmAnd0A quits [Ping timeout: 252 seconds]
01:35:46AmAnd0A joins
02:13:57rewby (rewby) joins
02:13:57@ChanServ sets mode: +o rewby
02:14:36JohnnyJ joins
02:26:28AmAnd0A quits [Ping timeout: 252 seconds]
02:27:26AmAnd0A joins
02:54:41dumbgoy_ quits [Ping timeout: 252 seconds]
03:19:43AmAnd0A quits [Read error: Connection reset by peer]
03:19:59AmAnd0A joins
04:03:16Barto quits [Ping timeout: 252 seconds]
04:51:38Barto (Barto) joins
05:18:21BigBrain quits [Ping timeout: 245 seconds]
05:20:21BigBrain (bigbrain) joins
05:36:59nfriedly quits [Ping timeout: 265 seconds]
05:37:32nfriedly joins
05:41:33nicolas17 quits [Remote host closed the connection]
05:41:53nfriedly quits [Ping timeout: 252 seconds]
05:50:37BlueMaxima quits [Client Quit]
05:56:54nicolas17 joins
06:05:01Island quits [Read error: Connection reset by peer]
06:13:04lennier1 quits [Ping timeout: 252 seconds]
06:21:04lennier1 (lennier1) joins
06:27:11nfriedly joins
06:33:54bf_ joins
06:35:11datechnoman quits [Quit: The Lounge - https://thelounge.chat]
06:35:46datechnoman (datechnoman) joins
06:37:31BigBrain quits [Ping timeout: 245 seconds]
07:16:29lennier1 quits [Ping timeout: 252 seconds]
07:37:41hitgrr8 joins
07:42:43<le0n>34
07:58:11lennier1 (lennier1) joins
07:59:21<@OrIdow6> arkiver: Is current best practice for a site (Egloos) with external links still to get them in the grab, or is there a way I can send them to #// somehow?
08:03:40BigBrain (bigbrain) joins
08:05:47BigBrain quits [Remote host closed the connection]
08:07:47BigBrain (bigbrain) joins
08:09:46<@arkiver>OrIdow6: best practice is to send outlinks to #// !
08:10:03<@arkiver>just add them to a table and i'll add a backfeed key for it
08:14:06nighthnh099_ joins
08:14:07<@OrIdow6>arkiver: Alright, thanks
08:16:50<nighthnh099_>what would be the best wget settings for warc files? I'm trying to mirror something that needs cookies and I want to test out mirroring them with warc so I can go through urls faster, then just submit them here later
08:23:10<masterx244|m>nighthnh099_: use grab-site, wget is bugged
08:26:54<nighthnh099_>oh okay, from what I'm reading this is entirely warc? I'll stick to wget for personal mirrors I guess
08:27:56<masterx244|m>yes, but you can extract WARCs later or use a tool like replayweb.page. wpull/grabsite has advantages over wget on queue management, too
08:28:22<masterx244|m>wget does its retries immediately when a request fails, wpull puts them at the end (useful if something 500s and some time waiting unsticks it)
08:29:03<masterx244|m>and wget keeps the entire queue in memory; have fun with large forum crawls with many URLs, that can get a few gigabytes of RAM usage just for the queue. wpull uses a sqlite db for that
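For context, the queue masterx244|m mentions is a plain SQLite file (grab-site writes it as wpull.db inside the crawl directory), so it can be inspected while a crawl runs. A minimal Python sketch; table names differ between wpull versions, so it introspects the schema instead of assuming one:

    import sqlite3

    con = sqlite3.connect("wpull.db")  # lives inside the grab-site crawl directory
    tables = [row[0] for row in con.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'")]
    for table in tables:
        (count,) = con.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()
        print(f"{table}: {count} rows")
    con.close()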
08:50:40<@arkiver>hmm
08:51:14<@arkiver>JAA: masterx244|m: do we have a list anywhere of what would make Wget-AT more usable for the regular user outside of Warrior projects?
08:51:48<@arkiver>I could try to make changes to Wget-AT to support it, or see if we can make some general Lua script that can support this.
08:52:03<@arkiver>I read:
08:52:11<@arkiver> - retries should be different
08:52:21<@arkiver> - wget should not keep entire queue in memory
08:52:33<@arkiver>perhaps we can compile a list of this?
08:56:11<masterx244|m>not sure if wget has the ignores-from-files feature, too
08:56:41<masterx244|m>being able to adjust ignores on-the-fly is really useful when you see a rabbithole that appeared mid-crawl, like a buggy link extraction
08:57:16<masterx244|m>had a forum once where a :// was goofed up and that sent grab-site into an endless link mess until i dished out some well-defined ignores
08:57:33<masterx244|m>arkiver: ^
08:58:24<masterx244|m>the queue-as-db, with ignored urls in it too, is also useful for when you manually need to do some url extraction.
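A conceptual sketch of the on-the-fly ignores described above (the file name and helpers are illustrative, not grab-site's actual internals): the pattern file is re-read while the crawl runs, so patterns added mid-crawl also apply to URLs that are still queued:

    import pathlib
    import re

    def load_ignores(path="ignores.txt"):
        # one regex per line; blank lines and #-comments are skipped
        patterns = []
        for line in pathlib.Path(path).read_text().splitlines():
            line = line.strip()
            if line and not line.startswith("#"):
                patterns.append(re.compile(line))
        return patterns

    def should_fetch(url):
        # reload on every check so edits to ignores.txt take effect immediately
        return not any(p.search(url) for p in load_ignores())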
09:03:00<@arkiver>thanks a lot masterx244|m
09:03:14<@arkiver>yeah, any ideas anyone has, please dump them here! we might compile them into a document later on
09:05:37<masterx244|m>the grab-site interface is also useful once you've got multiple crawls running; switching between screen sessions on linux is annoying.
09:17:15nicolas17 quits [Client Quit]
09:21:22lennier2 joins
09:22:26lennier1 quits [Ping timeout: 252 seconds]
09:22:32lennier2 is now known as lennier1
09:34:28beario joins
10:00:02railen63 quits [Remote host closed the connection]
10:00:18railen63 joins
10:15:19Naruyoko quits [Remote host closed the connection]
10:15:40Naruyoko joins
10:34:49Naruyoko5 joins
10:37:24Naruyoko quits [Client Quit]
10:38:24<@arkiver>OrIdow6: not sure what you qualify as "outlinks", but I usually take any correct URL found that we would not get in the specific warrior, and queue that to #//
10:38:24railen63 quits [Remote host closed the connection]
10:38:41railen63 joins
10:38:42<@arkiver>that might also include certain embeds, etc., that we would not get in the project itself
10:43:45<h2ibot>Arkiver edited GitLab (+724, Merge edit by [[Special:Contributions/Nemo…): https://wiki.archiveteam.org/?diff=49879&oldid=48787
10:43:55<@arkiver>finally fixed that merge conflict
10:45:45<h2ibot>Arkiver edited Deathwatch (+131, Merge edit by [[Special:Contributions/Taka|Taka]]): https://wiki.archiveteam.org/?diff=49880&oldid=49877
10:46:45<masterx244|m>caught some outdated data on reddit project page
10:46:46<h2ibot>MasterX244 edited Reddit (+26, Resync'd archiving status): https://wiki.archiveteam.org/?diff=49881&oldid=49878
10:47:11<masterx244|m>not sure if we should switch reddit to "endangered" on the wiki
10:47:46<h2ibot>Arkiver edited Zippyshare (+1109, Fix conflict): https://wiki.archiveteam.org/?diff=49882&oldid=49671
10:48:06<@arkiver>masterx244|m: i'd say it's not endangered
10:49:18<@arkiver>masterx244|m: you're now an automoderated user, your edits will automatically be applied
10:49:46<h2ibot>Arkiver changed the user rights of User:MasterX244
10:50:02<masterx244|m>we don't know what's happening after the blackout... and fingers crossed that we don't run into target limits on reddit
10:50:31<@arkiver>some of the content on reddit is endangered yes
10:50:45<@arkiver>but reddit itself, I'm not sure, I don't think they're close to running out of money
10:51:26<@arkiver>they seem to just be trying to increase revenue ahead of the IPO
10:51:55<@arkiver>unlike twitter - where there were serious money concerns, as also noted by messages from elon musk
10:54:44<masterx244|m>reddit description on frontpage is outdated, too. it still shows "planned" on old reddit posts
11:04:06decky_e quits [Remote host closed the connection]
11:39:13Ruthalas5 quits [Client Quit]
11:39:29Ruthalas5 (Ruthalas) joins
11:41:52VerifiedJ quits [Quit: The Lounge - https://thelounge.chat]
11:42:18VerifiedJ (VerifiedJ) joins
12:01:50eroc1990 quits [Quit: The Lounge - https://thelounge.chat]
12:02:28eroc1990 (eroc1990) joins
12:16:45Mateon1 quits [Remote host closed the connection]
12:16:48Mateon1 joins
12:36:52sonick quits [Client Quit]
12:55:42Unholy23611 (Unholy2361) joins
12:58:56lennier2 joins
12:59:08Unholy2361 quits [Ping timeout: 252 seconds]
12:59:08Unholy23611 is now known as Unholy2361
13:03:06lennier1 quits [Ping timeout: 265 seconds]
13:03:09lennier2 is now known as lennier1
13:21:42Naruyoko joins
13:24:51Naruyoko5 quits [Ping timeout: 259 seconds]
13:39:17AmAnd0A quits [Ping timeout: 252 seconds]
13:39:28AmAnd0A joins
14:22:34icedice quits [Ping timeout: 252 seconds]
14:35:11AlsoHP_Archivist quits [Client Quit]
14:35:32HP_Archivist (HP_Archivist) joins
14:39:50spirit quits [Quit: Leaving]
14:55:02icedice (icedice) joins
15:28:21BigBrain quits [Ping timeout: 245 seconds]
15:33:33Megame (Megame) joins
15:36:13Island joins
15:43:20decky_e (decky_e) joins
15:49:05BigBrain (bigbrain) joins
15:51:18nighthnh099_ quits [Client Quit]
15:52:23Naruyoko quits [Ping timeout: 252 seconds]
15:54:14decky_e quits [Ping timeout: 252 seconds]
15:54:36decky_e (decky_e) joins
16:16:39imer quits [Client Quit]
16:27:48imer (imer) joins
16:38:06<@JAA>arkiver: Two more things come to mind: setting and adjusting request delays, and easier compilation including different zstd versions (which I know you're already aware of).
16:39:29<@JAA>Actually, I guess delays can already be done via the existing hooks.
16:40:16<@JAA>But if we count that, reading and applying ignores from a file is also possible.
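For illustration, the same adjustable-delay idea done outside the tool, as a toy Python sketch (delay.txt is an illustrative name, not a Wget-AT or grab-site feature); the file is re-read after every request, so it can be edited mid-run:

    import pathlib
    import time
    import urllib.request

    def current_delay(path="delay.txt", default=1.0):
        try:
            return float(pathlib.Path(path).read_text().strip())
        except (FileNotFoundError, ValueError):
            return default

    for url in ["https://example.com/a", "https://example.com/b"]:
        urllib.request.urlopen(url).read()
        time.sleep(current_delay())  # edit delay.txt mid-run to speed up or slow down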
16:50:13<@arkiver>JAA: on zstd versions - just install a different zstd version and compile with that?
16:50:59<@JAA>arkiver: Probably yes, but needs auditing that it actually works correctly across some reasonable range of versions.
16:54:46<@arkiver>right yeah
16:55:05<@JAA>(Ideally with a proper test suite on the upcoming CI.)
16:55:10<@arkiver>well i made this yesterday https://github.com/ArchiveTeam/wget-lua/issues/15
16:55:25<@JAA>Yeah, I saw, that's why I added those brackets above. :-)
16:55:38<@arkiver>test suites are not my strong suit
16:55:51<@arkiver>but yeah i guess
16:56:46<@JAA>Yeah, don't worry about that yet, CI needs to be running first anyway. It shouldn't be too difficult to just do some simple integration tests on it, retrieving a couple pages with and without a custom dict, verifying that the produced file has one frame per record etc.
16:57:04<@arkiver>yep
16:57:07<@JAA>But it would allow us to continuously test against newly released zstd versions, which would be nice.
16:57:07<@arkiver>that sounds good
16:57:13<@arkiver>indeed!
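A rough sketch of the integration check JAA describes, for the no-custom-dictionary case: count zstd frames via the zstd CLI's list mode and compare against a naive WARC record count (assumes the zstd binary and the python-zstandard package are installed; both the output parsing and the record counting are deliberately simplistic):

    import subprocess
    import zstandard

    path = "test.warc.zst"

    # `zstd -l` prints a table whose first column on the data row is the frame count
    listing = subprocess.run(["zstd", "-l", path],
                             capture_output=True, text=True, check=True).stdout
    frames = int(listing.splitlines()[-1].split()[0])

    # naive record count: every WARC record starts with a "WARC/1.x" version line
    # and records are separated by a blank CRLF line (payloads could false-positive)
    data = b""
    with open(path, "rb") as fh:
        reader = zstandard.ZstdDecompressor().stream_reader(
            fh, read_across_frames=True)
        while True:
            chunk = reader.read(1 << 20)
            if not chunk:
                break
            data += chunk
    records = data.count(b"\r\n\r\nWARC/1.") + data.startswith(b"WARC/1.")

    assert frames == records, f"{frames} frames != {records} records"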
16:57:38<@arkiver>i've never looked much into all the github automation (testing/building/whatever), so some examples and help there would be welcome
16:58:00<@JAA>GitHub Actions is meh, we'll have something self-hosted soon.
16:58:13<@arkiver>(github automation is the wrong word - i mean git repo hosting services automation i guess)
16:58:27<@JAA>The logs are not publicly accessible and get wiped after a couple months.
16:58:39<@arkiver>right
16:59:06<@JAA>Very annoying when you come across an old open issue about 'something went wrong in this run: <link>' which just goes 404.
17:00:43<@JAA>Summary of how that automation works: webhook notifies the CI, which then pulls from GitHub and runs whatever's configured, reporting the status back to GitHub so it can display pass/fail.
17:01:37<@JAA>Anyway, soon™!
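A bare-bones sketch of the flow JAA just summarized, using only the Python standard library; the repo, token, port, and test command are placeholders, and a real setup would verify the webhook signature and queue jobs rather than run them inline:

    import json
    import subprocess
    import urllib.request
    from http.server import BaseHTTPRequestHandler, HTTPServer

    REPO = "ArchiveTeam/wget-lua"  # placeholder
    TOKEN = "ghp_..."              # placeholder access token with status permission

    def report_status(sha, state):
        # GitHub commit status API: POST /repos/{owner}/{repo}/statuses/{sha}
        req = urllib.request.Request(
            f"https://api.github.com/repos/{REPO}/statuses/{sha}",
            data=json.dumps({"state": state, "context": "self-hosted-ci"}).encode(),
            headers={"Authorization": f"token {TOKEN}"})
        urllib.request.urlopen(req)

    class Hook(BaseHTTPRequestHandler):
        def do_POST(self):
            payload = json.loads(
                self.rfile.read(int(self.headers["Content-Length"])))
            sha = payload["after"]  # head commit of the push event
            subprocess.run(["git", "pull"], check=True)  # pull from GitHub
            ok = subprocess.run(["make", "check"]).returncode == 0
            report_status(sha, "success" if ok else "failure")
            self.send_response(204)
            self.end_headers()

    HTTPServer(("", 8008), Hook).serve_forever()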
17:01:49<@arkiver>well let's get something up when we reach that point
17:02:02<@arkiver>meanwhile i still have to release proper FTP archiving support for Wget-AT
17:02:26<@arkiver>... which is only stuck on that FTP conversation record order thing
17:02:36<@arkiver>so will make a decision on that this week and just push it out
17:03:37<@arkiver>in short - going purely with WARC specs we'll need a third record to note the order of FTP conversation records; if we don't go with the WARC specs we'll only need a new header
17:03:53<@arkiver>i'm leaning towards going with following WARC specs
17:04:02<@arkiver>(not making up a new header)
17:04:09<@JAA>-dev?
17:04:43<@arkiver>copied to there
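Purely to make the spec-following option concrete (this is not the decided design, which was still open at this point): WARC already defines metadata records and the WARC-Concurrent-To header, so a third record could tie the two FTP conversation records together and carry their order, along these lines:

    WARC/1.1
    WARC-Type: metadata
    WARC-Target-URI: ftp://example.com/pub/file.bin
    WARC-Concurrent-To: <urn:uuid:...control-conversation-record...>
    WARC-Concurrent-To: <urn:uuid:...data-conversation-record...>
    Content-Type: application/warc-fields
    ...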
17:12:39decky_e quits [Remote host closed the connection]
17:19:08railen63 quits [Remote host closed the connection]
17:22:25railen63 joins
17:24:08<fireonlive>ooh there's a -dev
17:24:12<fireonlive>sounds nerdy and fun
17:25:10AmAnd0A quits [Ping timeout: 252 seconds]
17:25:25AmAnd0A joins
17:26:11Arachnophine quits [Remote host closed the connection]
17:27:07Arachnophine (Arachnophine) joins
17:28:32Arachnophine quits [Remote host closed the connection]
17:30:27Arachnophine (Arachnophine) joins
17:54:51Gereon quits [Quit: The Lounge - https://thelounge.chat]
18:08:22decky_e (decky_e) joins
18:10:07Gereon (Gereon) joins
18:20:10decky_e quits [Ping timeout: 252 seconds]
18:20:50decky_e (decky_e) joins
18:55:49Naruyoko joins
19:01:37nicolas17 joins
19:13:24that_lurker quits [Quit: my throat's getting sore from humming modem tones into my phone]
19:18:52that_lurker (that_lurker) joins
19:29:38decky_e quits [Ping timeout: 252 seconds]
19:30:06decky_e (decky_e) joins
19:31:52AmAnd0A quits [Read error: Connection reset by peer]
19:32:09AmAnd0A joins
19:57:47HP_Archivist quits [Client Quit]
20:00:37railen63 quits [Remote host closed the connection]
20:04:05railen63 joins
20:05:02railen63 quits [Remote host closed the connection]
20:05:15railen63 joins
20:10:53Megame quits [Ping timeout: 252 seconds]
20:35:01beario quits [Remote host closed the connection]
20:53:22dumbgoy_ joins
20:54:47hitgrr8 quits [Client Quit]
21:16:36icedice quits [Client Quit]
21:17:06rageear quits [Quit: Leaving]
21:40:41Naruyoko quits [Read error: Connection reset by peer]
21:41:04Naruyoko joins
21:41:43icedice (icedice) joins
21:52:15decky_e quits [Read error: Connection reset by peer]
21:52:44decky_e (decky_e) joins
21:58:50decky_e quits [Read error: Connection reset by peer]
21:59:22decky_e (decky_e) joins
22:02:22bf_ quits [Ping timeout: 252 seconds]
22:05:58<h2ibot>TheTechRobo edited ArchiveTeam Warrior (+81, Clarify that "Project code is out of date"…): https://wiki.archiveteam.org/?diff=49883&oldid=49631
22:10:01BPCZ quits [Quit: eh???]
22:12:26AmAnd0A quits [Ping timeout: 252 seconds]
22:12:51AmAnd0A joins
22:21:06EvanBoehs|m joins
22:30:09<EvanBoehs|m>Heya. Many months ago, I downloaded about a million posts and 500k comments from an old, public forum (est. 2008) whose admin hasn't been active for years. It's exhaustive as of 3 months ago. I've been storing it in a local PgSQL database on my computer, but I don't trust myself to do that responsibly... My computer's falling apart. What is the proper way to offload this (where can it be stored, what format should it be stored in, etc.)
22:34:01AmAnd0A quits [Read error: Connection reset by peer]
22:34:20AmAnd0A joins
22:37:00<pokechu22>Probably the best thing to do is upload whatever you have to archive.org, even if it's not the most convenient format, so that even if something goes wrong at least something's available
22:39:32<pokechu22>as in, just upload the database to archive.org and put it in the community data collection (I assume PgSQL database files are fairly portable?)
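For the mechanics of that suggestion, a minimal sketch using the internetarchive Python library (pip install internetarchive, then run `ia configure` once for credentials); the identifier, file name, and metadata here are placeholders:

    from internetarchive import upload

    upload(
        "example-forum-db-2023",      # placeholder item identifier (must be unique)
        files=["forum_dump.sqlite3"],
        metadata={
            "title": "Example forum database dump",
            "mediatype": "data",
            # pick the appropriate community collection (or leave it to the
            # defaults) rather than hardcoding one here
        },
    )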
22:40:37<EvanBoehs|m>pokechu22: No, but conversion to sqlite is possible, and I wouldn't upload it raw anyway because the API that I pulled from is horrendous and exposes IPs
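A bare-bones sketch of that PostgreSQL-to-SQLite conversion using psycopg2 (the DSN, table, and columns are placeholders); copying only named columns also makes it easy to leave out sensitive fields like the IPs mentioned above:

    import sqlite3
    import psycopg2

    pg = psycopg2.connect("dbname=forum")  # placeholder DSN
    lite = sqlite3.connect("forum_dump.sqlite3")
    lite.execute(
        "CREATE TABLE posts (id INTEGER PRIMARY KEY, author TEXT, body TEXT)")

    with pg.cursor(name="dump") as cur:    # named cursor streams rows server-side
        cur.execute("SELECT id, author, body FROM posts")  # placeholder schema
        for row in cur:
            lite.execute("INSERT INTO posts VALUES (?, ?, ?)", row)

    lite.commit()
    lite.close()
    pg.close()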
22:42:08<nicolas17>also is the forum still up?
22:42:24<EvanBoehs|m>nicolas17: Yes it is, I was just worried about it
22:42:33<pokechu22>Oh, one thing that would also be useful is if you could extract imgur links for our imgur archival project (see https://wiki.archiveteam.org/index.php/Imgur) - if you can get a list of things that look vaguely like imgur links then we can feed them to a bot that will queue valid links for archival
22:42:50<EvanBoehs|m>pokechu22: I can do that for you :)
22:43:08<EvanBoehs|m>Though I doubt there are many imgur links
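A rough sketch of the loose extraction pokechu22 asks for: pull out anything that vaguely looks like an imgur URL and let the queue bot handle validation (the SQLite file and schema are placeholders matching the conversion sketch above):

    import re
    import sqlite3

    IMGUR = re.compile(r"https?://(?:[a-z0-9]+\.)?imgur\.com/[^\s\"'<>)\]]+", re.I)

    con = sqlite3.connect("forum_dump.sqlite3")
    links = set()
    for (body,) in con.execute("SELECT body FROM posts"):  # placeholder schema
        links.update(IMGUR.findall(body or ""))
    con.close()

    with open("imgur_links.txt", "w") as out:
        out.write("\n".join(sorted(links)))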
22:43:21<nicolas17>EvanBoehs|m: in that case, in addition to your download, maybe we can archive it (again) in website form so it's usable on the wayback machine
22:43:35<pokechu22>There's also a mediafire project (https://wiki.archiveteam.org/index.php/MediaFire)
22:44:49<EvanBoehs|m>nicolas17: Hmm. Maybe... Is it easier just to reconstruct the page for each URL? That would be trivial enough
22:45:22<nicolas17>no, wayback machine wants the exact http response from the server for it to count as preservation
22:46:06<EvanBoehs|m>Got it. And as a result of this preservation, anyone can enter URLs into the wayback machine to see the historic pages?
22:46:12<EvanBoehs|m>Like the "save page" does but in bulk
22:47:38<pokechu22>We've got archivebot which lets us recurse over pages on a site, and then everything it saves ends up in the wayback machine
22:48:26<pokechu22>but, files (in the WARC format) saved by random people generally aren't added to web.archive.org. If you've got a list of URLs though there are tools that can save each URL in the list and then that will end up on the wayback machine
22:48:52<EvanBoehs|m>pokechu22: Oh interesting, so you guys have special permission?
22:49:18<pokechu22>Yeah
22:51:50<EvanBoehs|m>I see. It's probably easier because I have a list of all the URLs with content (like the imgur project) as opposed to recursively downloading something like 910000 pages. I'd quite like them all in the wayback machine, so would the best first step be to program a warrior or...
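For the list-of-URLs route pokechu22 describes, a naive sketch against the public Save Page Now endpoint (the authenticated SPN2 API is the more robust choice for bulk work, and real use should respect rate limits):

    import time
    import urllib.request

    with open("urls.txt") as f:
        for url in (line.strip() for line in f):
            if not url:
                continue
            try:
                urllib.request.urlopen(
                    "https://web.archive.org/save/" + url, timeout=120)
            except Exception as exc:
                print("failed:", url, exc)
            time.sleep(5)  # be gentle; the endpoint throttles aggressive clients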
22:55:26BigBrain quits [Ping timeout: 245 seconds]
22:55:57BigBrain (bigbrain) joins
23:04:36BlueMaxima joins
23:05:49BPCZ (BPCZ) joins
23:18:38AmAnd0A quits [Read error: Connection reset by peer]
23:18:40AmAnd0A joins
23:33:17AmAnd0A quits [Read error: Connection reset by peer]
23:34:26AmAnd0A joins
23:37:41wessel1512 quits [Ping timeout: 252 seconds]
23:46:25AmAnd0A quits [Ping timeout: 265 seconds]
23:46:48AmAnd0A joins
23:59:29Iki1 quits [Read error: Connection reset by peer]
23:59:32benjins2_ quits [Read error: Connection reset by peer]
23:59:35mr_sarge quits [Read error: Connection reset by peer]
23:59:38benjins quits [Read error: Connection reset by peer]
23:59:47Justin[home] joins
23:59:51Iki1 joins
23:59:58BlueMaxima_ joins