| 00:00:57 | | imer quits [Client Quit] |
| 00:17:01 | | AmAnd0A quits [Ping timeout: 265 seconds] |
| 00:18:00 | | AmAnd0A joins |
| 00:25:13 | | imer (imer) joins |
| 00:28:28 | | railen63 quits [Remote host closed the connection] |
| 00:28:43 | | railen63 joins |
| 00:31:28 | | imer quits [Client Quit] |
| 00:49:49 | | AmAnd0A quits [Read error: Connection reset by peer] |
| 00:50:10 | | AmAnd0A joins |
| 00:52:36 | | AmAnd0A quits [Read error: Connection reset by peer] |
| 00:52:53 | | AmAnd0A joins |
| 01:05:04 | | imer (imer) joins |
| 01:10:11 | | @rewby quits [Ping timeout: 252 seconds] |
| 01:11:41 | | BigBrain quits [Ping timeout: 245 seconds] |
| 01:12:25 | | BigBrain (bigbrain) joins |
| 01:28:55 | | AlsoHP_Archivist joins |
| 01:32:11 | | HP_Archivist quits [Ping timeout: 252 seconds] |
| 01:35:29 | | AmAnd0A quits [Ping timeout: 252 seconds] |
| 01:35:46 | | AmAnd0A joins |
| 02:13:57 | | rewby (rewby) joins |
| 02:13:57 | | @ChanServ sets mode: +o rewby |
| 02:14:36 | | JohnnyJ joins |
| 02:26:28 | | AmAnd0A quits [Ping timeout: 252 seconds] |
| 02:27:26 | | AmAnd0A joins |
| 02:54:41 | | dumbgoy_ quits [Ping timeout: 252 seconds] |
| 03:19:43 | | AmAnd0A quits [Read error: Connection reset by peer] |
| 03:19:59 | | AmAnd0A joins |
| 04:03:16 | | Barto quits [Ping timeout: 252 seconds] |
| 04:51:38 | | Barto (Barto) joins |
| 05:18:21 | | BigBrain quits [Ping timeout: 245 seconds] |
| 05:20:21 | | BigBrain (bigbrain) joins |
| 05:36:59 | | nfriedly quits [Ping timeout: 265 seconds] |
| 05:37:32 | | nfriedly joins |
| 05:41:33 | | nicolas17 quits [Remote host closed the connection] |
| 05:41:53 | | nfriedly quits [Ping timeout: 252 seconds] |
| 05:50:37 | | BlueMaxima quits [Client Quit] |
| 05:56:54 | | nicolas17 joins |
| 06:05:01 | | Island quits [Read error: Connection reset by peer] |
| 06:13:04 | | lennier1 quits [Ping timeout: 252 seconds] |
| 06:21:04 | | lennier1 (lennier1) joins |
| 06:27:11 | | nfriedly joins |
| 06:33:54 | | bf_ joins |
| 06:35:11 | | datechnoman quits [Quit: The Lounge - https://thelounge.chat] |
| 06:35:46 | | datechnoman (datechnoman) joins |
| 06:37:31 | | BigBrain quits [Ping timeout: 245 seconds] |
| 07:16:29 | | lennier1 quits [Ping timeout: 252 seconds] |
| 07:37:41 | | hitgrr8 joins |
| 07:42:43 | <le0n> | 34 |
| 07:58:11 | | lennier1 (lennier1) joins |
| 07:59:21 | <@OrIdow6> | arkiver: Is current best practice for a site (Egloos) with external links still to get them in the grab, or is there a way I can send them to #// somehow? |
| 08:03:40 | | BigBrain (bigbrain) joins |
| 08:05:47 | | BigBrain quits [Remote host closed the connection] |
| 08:07:47 | | BigBrain (bigbrain) joins |
| 08:09:46 | <@arkiver> | OrIdow6: best practice is to send outlinks to #// ! |
| 08:10:03 | <@arkiver> | just add them to a table and i'll add a backfeed key for it |
| 08:14:06 | | nighthnh099_ joins |
| 08:14:07 | <@OrIdow6> | arkiver: Alright, thanks |
| 08:16:50 | <nighthnh099_> | what would be the best wget settings for warc files? I'm trying to mirror something that needs cookies, and I want to test out mirroring with warc so I can go through URLs faster than just submitting them here later |
| 08:23:10 | <masterx244|m> | nighthnh099_: use grab-site, wget is bugged |
| 08:26:54 | <nighthnh099_> | oh okay, from what I'm reading this is entirely warc? I'll stick to wget for personal mirrors I guess |
| 08:27:56 | <masterx244|m> | yes, but you can extract WARCs later or use a tool like replayweb.page. wpull/grabsite has advantages over wget on queue management, too |
| 08:28:22 | <masterx244|m> | wget does its retries immediately when a request fails, wpull puts them at the end (useful if something 500s and some time waiting unsticks it) |
| 08:29:03 | <masterx244|m> | and wget keeps the entire queue in memory, have fun with large forum crawls with many URLs, that can get a few gigabytes of RAM usage just for the queue. wpull uses a sqlite db for that |
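The on-disk queue with deferred retries that masterx244|m describes can be sketched roughly like this; the schema, class, and method names here are illustrative, not wpull's actual internals:

```python
# Sketch: a sqlite-backed crawl queue that defers retries to the back of
# the queue (the wpull-style behaviour) instead of retrying immediately
# (the wget behaviour). Names and schema are hypothetical.
import sqlite3

class CrawlQueue:
    def __init__(self, path=":memory:"):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS queue ("
            " id INTEGER PRIMARY KEY AUTOINCREMENT,"
            " url TEXT UNIQUE,"
            " status TEXT DEFAULT 'todo')"
        )

    def add(self, url):
        self.db.execute("INSERT OR IGNORE INTO queue (url) VALUES (?)", (url,))

    def pop(self):
        # Oldest pending item first; keeping the queue on disk instead of
        # in RAM is what keeps huge forum crawls cheap on memory.
        return self.db.execute(
            "SELECT id, url FROM queue WHERE status='todo' ORDER BY id LIMIT 1"
        ).fetchone()

    def mark_done(self, item_id):
        self.db.execute("UPDATE queue SET status='done' WHERE id=?", (item_id,))

    def retry_later(self, item_id, url):
        # Re-insert at the back of the queue, so a transient 500 gets time
        # to clear before the URL comes around again.
        self.db.execute("DELETE FROM queue WHERE id=?", (item_id,))
        self.db.execute("INSERT INTO queue (url) VALUES (?)", (url,))
```

Because ids are AUTOINCREMENT, a retried URL always sorts after everything already queued, which is the whole point of the deferred-retry approach.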
| 08:50:40 | <@arkiver> | hmm |
| 08:51:14 | <@arkiver> | JAA: masterx244|m: do we have a list anywhere of what would make Wget-AT more usable for the regular user outside of Warrior projects? |
| 08:51:48 | <@arkiver> | I could try to make changes to Wget-AT to support it, or see if we can make some general Lua script that can support this. |
| 08:52:03 | <@arkiver> | I read: |
| 08:52:11 | <@arkiver> | - retries should be different |
| 08:52:21 | <@arkiver> | - wget should not keep entire queue in memory |
| 08:52:33 | <@arkiver> | perhaps we can compile a list of this? |
| 08:56:11 | <masterx244|m> | not sure if wget has the ignores-from-files feature, too |
| 08:56:41 | <masterx244|m> | being able to adjust ignores on-the-fly is really useful when you see a rabbithole that appeared mid-crawl like a buggy link-extraction |
| 08:57:16 | <masterx244|m> | had a forum once where a :// was goofed up and that caused grab-site to produce an endless link mess until i dished out some well-defined ignores |
| 08:57:33 | <masterx244|m> | arkiver: ^ |
| 08:58:24 | <masterx244|m> | the queue-as-db, with ignored urls in it too, is also useful for when you manually need to do some url extraction. |
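The adjust-ignores-mid-crawl feature masterx244|m is describing amounts to re-reading a pattern file between batches; a minimal sketch (file layout and function names are assumptions, not grab-site's implementation):

```python
# Sketch: reload ignore regexes from a file before each queue batch, so
# patterns added while the crawl runs take effect immediately.
import re

def load_ignores(path):
    # One regex per line; blank lines and '#' comments are skipped.
    patterns = []
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if line and not line.startswith("#"):
                patterns.append(re.compile(line))
    return patterns

def filter_queue(urls, ignores):
    # Drop any URL matching an ignore pattern. Calling load_ignores()
    # again before each batch picks up on-the-fly edits.
    return [u for u in urls if not any(p.search(u) for p in ignores)]
```

A pattern like `https?://forum\.example/.*://.*` would, for instance, cut off the kind of goofed-up `://` rabbithole described above.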
| 09:03:00 | <@arkiver> | thanks a lot masterx244|m |
| 09:03:14 | <@arkiver> | yeah any ideas anyone have, please dump them here! we might compile them into a document later on |
| 09:05:37 | <masterx244|m> | the grab-site interface is also useful once you've got multiple crawls running; switching between screen sessions on linux is annoying. |
| 09:17:15 | | nicolas17 quits [Client Quit] |
| 09:21:22 | | lennier2 joins |
| 09:22:26 | | lennier1 quits [Ping timeout: 252 seconds] |
| 09:22:32 | | lennier2 is now known as lennier1 |
| 09:34:28 | | beario joins |
| 10:00:02 | | railen63 quits [Remote host closed the connection] |
| 10:00:18 | | railen63 joins |
| 10:15:19 | | Naruyoko quits [Remote host closed the connection] |
| 10:15:40 | | Naruyoko joins |
| 10:34:49 | | Naruyoko5 joins |
| 10:37:24 | | Naruyoko quits [Client Quit] |
| 10:38:24 | <@arkiver> | OrIdow6: not sure what you qualify as "outlinks", but I usually take any correct URL found that we would not get in the specific warrior, and queue that to #// |
| 10:38:24 | | railen63 quits [Remote host closed the connection] |
| 10:38:41 | | railen63 joins |
| 10:38:42 | <@arkiver> | that might also include certain embeds, etc., that we would not get in the project itself |
| 10:43:45 | <h2ibot> | Arkiver edited GitLab (+724, Merge edit by [[Special:Contributions/Nemo…): https://wiki.archiveteam.org/?diff=49879&oldid=48787 |
| 10:43:55 | <@arkiver> | finally fixed that merge conflict |
| 10:45:45 | <h2ibot> | Arkiver edited Deathwatch (+131, Merge edit by [[Special:Contributions/Taka|Taka]]): https://wiki.archiveteam.org/?diff=49880&oldid=49877 |
| 10:46:45 | <masterx244|m> | caught some outdated data on reddit project page |
| 10:46:46 | <h2ibot> | MasterX244 edited Reddit (+26, Resync'd archiving status): https://wiki.archiveteam.org/?diff=49881&oldid=49878 |
| 10:47:11 | <masterx244|m> | not sure if we should switch reddit to "endangered" on the wiki |
| 10:47:46 | <h2ibot> | Arkiver edited Zippyshare (+1109, Fix conflict): https://wiki.archiveteam.org/?diff=49882&oldid=49671 |
| 10:48:06 | <@arkiver> | masterx244|m: i'd say it's no endangered |
| 10:48:10 | <@arkiver> | not* |
| 10:49:18 | <@arkiver> | masterx244|m: you're now an automoderated user, your edits will automatically be applied |
| 10:49:46 | <h2ibot> | Arkiver changed the user rights of User:MasterX244 |
| 10:50:02 | <masterx244|m> | we don't know what's happening after the blackout... and fingers crossed that we don't run into target limits on reddit |
| 10:50:31 | <@arkiver> | some of the content on reddit is endangered yes |
| 10:50:45 | <@arkiver> | but reddit itself, I'm not sure, I don't think they're close to running out of money |
| 10:51:26 | <@arkiver> | they seem to just be trying to increase revenue ahead of the IPO |
| 10:51:55 | <@arkiver> | unlike twitter - where there were serious money concerns, as also noted by messages from elon musk |
| 10:54:44 | <masterx244|m> | reddit description on frontpage is outdated, too. it still shows "planned" on old reddit posts |
| 11:04:06 | | decky_e quits [Remote host closed the connection] |
| 11:39:13 | | Ruthalas5 quits [Client Quit] |
| 11:39:29 | | Ruthalas5 (Ruthalas) joins |
| 11:41:52 | | VerifiedJ quits [Quit: The Lounge - https://thelounge.chat] |
| 11:42:18 | | VerifiedJ (VerifiedJ) joins |
| 12:01:50 | | eroc1990 quits [Quit: The Lounge - https://thelounge.chat] |
| 12:02:28 | | eroc1990 (eroc1990) joins |
| 12:16:45 | | Mateon1 quits [Remote host closed the connection] |
| 12:16:48 | | Mateon1 joins |
| 12:36:52 | | sonick quits [Client Quit] |
| 12:55:42 | | Unholy23611 (Unholy2361) joins |
| 12:58:56 | | lennier2 joins |
| 12:59:08 | | Unholy2361 quits [Ping timeout: 252 seconds] |
| 12:59:08 | | Unholy23611 is now known as Unholy2361 |
| 13:03:06 | | lennier1 quits [Ping timeout: 265 seconds] |
| 13:03:09 | | lennier2 is now known as lennier1 |
| 13:21:42 | | Naruyoko joins |
| 13:24:51 | | Naruyoko5 quits [Ping timeout: 259 seconds] |
| 13:39:17 | | AmAnd0A quits [Ping timeout: 252 seconds] |
| 13:39:28 | | AmAnd0A joins |
| 14:22:34 | | icedice quits [Ping timeout: 252 seconds] |
| 14:35:11 | | AlsoHP_Archivist quits [Client Quit] |
| 14:35:32 | | HP_Archivist (HP_Archivist) joins |
| 14:39:50 | | spirit quits [Quit: Leaving] |
| 14:55:02 | | icedice (icedice) joins |
| 15:28:21 | | BigBrain quits [Ping timeout: 245 seconds] |
| 15:33:33 | | Megame (Megame) joins |
| 15:36:13 | | Island joins |
| 15:43:20 | | decky_e (decky_e) joins |
| 15:49:05 | | BigBrain (bigbrain) joins |
| 15:51:18 | | nighthnh099_ quits [Client Quit] |
| 15:52:23 | | Naruyoko quits [Ping timeout: 252 seconds] |
| 15:54:14 | | decky_e quits [Ping timeout: 252 seconds] |
| 15:54:36 | | decky_e (decky_e) joins |
| 16:16:39 | | imer quits [Client Quit] |
| 16:27:48 | | imer (imer) joins |
| 16:38:06 | <@JAA> | arkiver: Two more things come to mind: setting and adjusting request delays, and easier compilation including different zstd versions (which I know you're already aware of). |
| 16:39:29 | <@JAA> | Actually, I guess delays can already be done via the existing hooks. |
| 16:40:16 | <@JAA> | But if we count that, reading and applying ignores from a file is also possible. |
| 16:50:13 | <@arkiver> | JAA: on zstd versions - just install a different zstd version and compile against that? |
| 16:50:20 | <@arkiver> | with* |
| 16:50:59 | <@JAA> | arkiver: Probably yes, but needs auditing that it actually works correctly across some reasonable range of versions. |
| 16:54:46 | <@arkiver> | right yeah |
| 16:55:05 | <@JAA> | (Ideally with a proper test suite on the upcoming CI.) |
| 16:55:10 | <@arkiver> | well i made this yesterday https://github.com/ArchiveTeam/wget-lua/issues/15 |
| 16:55:25 | <@JAA> | Yeah, I saw, that's why I added those brackets above. :-) |
| 16:55:38 | <@arkiver> | test suites are not my strong suit |
| 16:55:51 | <@arkiver> | but yeah i guess |
| 16:56:46 | <@JAA> | Yeah, don't worry about that yet, CI needs to be running first anyway. It shouldn't be too difficult to just do some simple integration tests on it, retrieving a couple pages with and without a custom dict, verifying that the produced file has one frame per record etc. |
| 16:57:04 | <@arkiver> | yep |
| 16:57:07 | <@JAA> | But it would allow us to continuously test against newly released zstd versions, which would be nice. |
| 16:57:07 | <@arkiver> | that sounds good |
| 16:57:13 | <@arkiver> | indeed! |
| 16:57:38 | <@arkiver> | i've never looked much into all the github automation (testing/building/whatever), so some examples and help there would be welcome |
| 16:58:00 | <@JAA> | GitHub Actions is meh, we'll have something self-hosted soon. |
| 16:58:13 | <@arkiver> | (github automation is the wrong word - i mean git repo hosting services automation i guess) |
| 16:58:27 | <@JAA> | The logs are not publicly accessible and get wiped after a couple months. |
| 16:58:39 | <@arkiver> | right |
| 16:59:06 | <@JAA> | Very annoying when you come across an old open issue about 'something went wrong in this run: <link>' which just goes 404. |
| 17:00:43 | <@JAA> | Summary of how that automation works: webhook notifies the CI, which then pulls from GitHub and runs whatever's configured, reporting the status back to GitHub so it can display pass/fail. |
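The webhook half of the flow JAA summarizes can be sketched as a small payload parser; the field names below follow GitHub's push-event shape, but treat that as an assumption here rather than a spec reference:

```python
# Sketch: what a self-hosted CI does with an incoming push webhook --
# extract enough to pull the repo and build the right commit, then the CI
# reports pass/fail status back so GitHub can display it.
def parse_push_event(payload: dict) -> tuple[str, str]:
    # Where to clone from, and which commit to check out.
    clone_url = payload["repository"]["clone_url"]
    commit = payload["after"]
    return clone_url, commit
```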
| 17:01:37 | <@JAA> | Anyway, soon™! |
| 17:01:49 | <@arkiver> | well let's get something up when we reach that point |
| 17:02:02 | <@arkiver> | meanwhile i still have to release proper FTP archiving support for Wget-AT |
| 17:02:26 | <@arkiver> | ... which is only stuck on that FTP conversation record order thing |
| 17:02:36 | <@arkiver> | so will make a decision on that this week and just push it out |
| 17:03:37 | <@arkiver> | in short - going purely with WARC specs we'll need a third record to note order of FTP conversation records, if we don't go with the WARC specs we'll only need a new header |
| 17:03:53 | <@arkiver> | i'm leaning towards going with following WARC specs |
| 17:04:02 | <@arkiver> | (not making up a new header) |
| 17:04:09 | <@JAA> | -dev? |
| 17:04:43 | <@arkiver> | copied to there |
| 17:12:39 | | decky_e quits [Remote host closed the connection] |
| 17:19:08 | | railen63 quits [Remote host closed the connection] |
| 17:22:25 | | railen63 joins |
| 17:24:08 | <fireonlive> | ooh there's a -dev |
| 17:24:12 | <fireonlive> | sounds nerdy and fun |
| 17:25:10 | | AmAnd0A quits [Ping timeout: 252 seconds] |
| 17:25:25 | | AmAnd0A joins |
| 17:26:11 | | Arachnophine quits [Remote host closed the connection] |
| 17:27:07 | | Arachnophine (Arachnophine) joins |
| 17:28:32 | | Arachnophine quits [Remote host closed the connection] |
| 17:30:27 | | Arachnophine (Arachnophine) joins |
| 17:54:51 | | Gereon quits [Quit: The Lounge - https://thelounge.chat] |
| 18:08:22 | | decky_e (decky_e) joins |
| 18:10:07 | | Gereon (Gereon) joins |
| 18:20:10 | | decky_e quits [Ping timeout: 252 seconds] |
| 18:20:50 | | decky_e (decky_e) joins |
| 18:55:49 | | Naruyoko joins |
| 19:01:37 | | nicolas17 joins |
| 19:13:24 | | that_lurker quits [Quit: my throat's getting sore from humming modem tones into my phone] |
| 19:18:52 | | that_lurker (that_lurker) joins |
| 19:29:38 | | decky_e quits [Ping timeout: 252 seconds] |
| 19:30:06 | | decky_e (decky_e) joins |
| 19:31:52 | | AmAnd0A quits [Read error: Connection reset by peer] |
| 19:32:09 | | AmAnd0A joins |
| 19:57:47 | | HP_Archivist quits [Client Quit] |
| 20:00:37 | | railen63 quits [Remote host closed the connection] |
| 20:04:05 | | railen63 joins |
| 20:05:02 | | railen63 quits [Remote host closed the connection] |
| 20:05:15 | | railen63 joins |
| 20:10:53 | | Megame quits [Ping timeout: 252 seconds] |
| 20:35:01 | | beario quits [Remote host closed the connection] |
| 20:53:22 | | dumbgoy_ joins |
| 20:54:47 | | hitgrr8 quits [Client Quit] |
| 21:16:36 | | icedice quits [Client Quit] |
| 21:17:06 | | rageear quits [Quit: Leaving] |
| 21:40:41 | | Naruyoko quits [Read error: Connection reset by peer] |
| 21:41:04 | | Naruyoko joins |
| 21:41:43 | | icedice (icedice) joins |
| 21:52:15 | | decky_e quits [Read error: Connection reset by peer] |
| 21:52:44 | | decky_e (decky_e) joins |
| 21:58:50 | | decky_e quits [Read error: Connection reset by peer] |
| 21:59:22 | | decky_e (decky_e) joins |
| 22:02:22 | | bf_ quits [Ping timeout: 252 seconds] |
| 22:05:58 | <h2ibot> | TheTechRobo edited ArchiveTeam Warrior (+81, Clarify that "Project code is out of date"…): https://wiki.archiveteam.org/?diff=49883&oldid=49631 |
| 22:10:01 | | BPCZ quits [Quit: eh???] |
| 22:12:26 | | AmAnd0A quits [Ping timeout: 252 seconds] |
| 22:12:51 | | AmAnd0A joins |
| 22:21:06 | | EvanBoehs|m joins |
| 22:30:09 | <EvanBoehs|m> | Heya. Many months ago, I downloaded about a million posts and 500k comments from an old, public forum (est. 2008) whose admin hasn't been active for years. It's exhaustive as of 3 months ago. I've been storing it in a local PgSQL database on my computer, but I don't trust myself to do that responsibly... My computer's falling apart. What is the proper way to offload this (where can it be stored, what format should it be stored in, etc.) |
| 22:34:01 | | AmAnd0A quits [Read error: Connection reset by peer] |
| 22:34:20 | | AmAnd0A joins |
| 22:37:00 | <pokechu22> | Probably the best thing to do is upload whatever you have to archive.org, even if it's not the most convenient format, so that even if something goes wrong at least something's available |
| 22:39:32 | <pokechu22> | as in, just upload the database to archive.org and put it in the community data collection (I assume PgSQL database files are fairly portable?) |
| 22:40:37 | <EvanBoehs|m> | pokechu22: No, but conversion to sqlite is possible, and I wouldn't upload it raw anyway because the API is horrendous and exposes IPs |
| 22:40:56 | <EvanBoehs|m> | * the API that I pulled from is horrendous |
| 22:42:08 | <nicolas17> | also is the forum still up? |
| 22:42:24 | <EvanBoehs|m> | nicolas17: Yes it is, I was just worried about it |
| 22:42:33 | <pokechu22> | Oh, one thing that would also be useful is if you could extract imgur links for our imgur archival project (see https://wiki.archiveteam.org/index.php/Imgur) - if you can get a list of things that look vaguely like imgur links then we can feed them to a bot that will queue valid links for archival |
| 22:42:50 | <EvanBoehs|m> | pokechu22: I can do that for you :) |
| 22:43:08 | <EvanBoehs|m> | Though I doubt there are many imgur links |
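The "things that look vaguely like imgur links" extraction pokechu22 asks for can be done with a deliberately loose regex over the stored post text; this sketch is illustrative, not the project's actual tooling:

```python
# Sketch: pull candidate imgur URLs out of post/comment text so they can
# be fed to the Imgur project's queue bot. Loose on purpose -- invalid
# candidates get filtered downstream.
import re

IMGUR_RE = re.compile(r"https?://(?:[a-z]+\.)?imgur\.com/[^\s\"'<>)\]]+", re.I)

def extract_imgur_links(text):
    # Return unique matches, preserving first-seen order.
    seen = []
    for match in IMGUR_RE.findall(text):
        if match not in seen:
            seen.append(match)
    return seen
```

Note this only catches links with an explicit scheme; bare `imgur.com/...` mentions would need a looser pattern.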
| 22:43:21 | <nicolas17> | EvanBoehs|m: in that case, in addition to your download, maybe we can archive it (again) in website form so it's usable on the wayback machine |
| 22:43:35 | <pokechu22> | There's also a mediafire project (https://wiki.archiveteam.org/index.php/MediaFire) |
| 22:44:49 | <EvanBoehs|m> | nicolas17: Hmm. Maybe... Is it easier just to reconstruct the page for each URL? That would be trivial enough |
| 22:45:22 | <nicolas17> | no, wayback machine wants the exact http response from the server for it to count as preservation |
| 22:46:06 | <EvanBoehs|m> | Got it. And as a result of this preservation, anyone can enter URLs into the wayback machine? |
| 22:46:12 | <EvanBoehs|m> | Like the "save page" does but in bulk |
| 22:46:38 | <EvanBoehs|m> | * wayback machine to see the historic pages? |
| 22:47:38 | <pokechu22> | We've got archivebot which lets us recurse over pages on a site and then everything that saves ends up in the wayback machine |
| 22:48:26 | <pokechu22> | but, files (in the WARC format) saved by random people generally aren't added to web.archive.org. If you've got a list of URLs though there are tools that can save each URL in the list and then that will end up on the wayback machine |
| 22:48:52 | <EvanBoehs|m> | pokechu22: Oh interesting, so you guys have special permission |
| 22:48:53 | <EvanBoehs|m> | * special permission? |
| 22:49:18 | <pokechu22> | Yeah |
| 22:51:50 | <EvanBoehs|m> | I see. It's probably easier because I have a list of all the URLs with content (like the imgur project), as opposed to recursively downloading something like 910000 pages. I'd quite like them all in the wayback machine, so would the best first step be to program a warrior or... |
| 22:55:26 | | BigBrain quits [Ping timeout: 245 seconds] |
| 22:55:57 | | BigBrain (bigbrain) joins |
| 23:04:36 | | BlueMaxima joins |
| 23:05:49 | | BPCZ (BPCZ) joins |
| 23:09:16 | | gjhgf joins |
| 23:09:45 | | gjhgf quits [Remote host closed the connection] |
| 23:18:38 | | AmAnd0A quits [Read error: Connection reset by peer] |
| 23:18:40 | | AmAnd0A joins |
| 23:33:17 | | AmAnd0A quits [Read error: Connection reset by peer] |
| 23:34:26 | | AmAnd0A joins |
| 23:37:41 | | wessel1512 quits [Ping timeout: 252 seconds] |
| 23:46:25 | | AmAnd0A quits [Ping timeout: 265 seconds] |
| 23:46:48 | | AmAnd0A joins |
| 23:59:29 | | Iki1 quits [Read error: Connection reset by peer] |
| 23:59:32 | | benjins2_ quits [Read error: Connection reset by peer] |
| 23:59:35 | | mr_sarge quits [Read error: Connection reset by peer] |
| 23:59:38 | | benjins quits [Read error: Connection reset by peer] |
| 23:59:47 | | Justin[home] joins |
| 23:59:47 | | Justin[home] is now authenticated as DopefishJustin |
| 23:59:51 | | Iki1 joins |
| 23:59:58 | | BlueMaxima_ joins |