| 00:00:10 | | Unholy2361 (Unholy2361) joins |
| 00:01:44 | | AmAnd0A quits [Ping timeout: 252 seconds] |
| 00:02:14 | | AmAnd0A joins |
| 00:03:53 | | sonick (sonick) joins |
| 00:04:20 | | AmAnd0A quits [Read error: Connection reset by peer] |
| 00:04:37 | | AmAnd0A joins |
| 00:05:23 | | bf_ quits [Ping timeout: 265 seconds] |
| 00:07:14 | <@JAA> | nicolas17: Sure it can. --warcdedupe |
| 00:07:34 | <@JAA> | I don't remember why it's not enabled by default. |
| 00:07:45 | <@JAA> | However, it only dedupes within one process. |
| 00:08:32 | | jtagcat quits [Quit: Bye!] |
| 00:08:55 | | jtagcat (jtagcat) joins |
| 00:11:38 | | fullpwnmedia joins |
| 00:11:40 | | AmAnd0A quits [Ping timeout: 265 seconds] |
| 00:13:38 | | AmAnd0A joins |
| 00:14:44 | | AmAnd0A quits [Read error: Connection reset by peer] |
| 00:15:02 | | AmAnd0A joins |
| 00:32:43 | <nicolas17> | JAA: I tried that and it deduplicated same payload across different URLs, works great |
| 00:33:09 | <nicolas17> | but I mean like, rerun it tomorrow to request the same URLs, and deduplicate if same URL gives same payload |
| 00:33:50 | | AmAnd0A quits [Ping timeout: 252 seconds] |
| 00:34:00 | <@JAA> | Yeah, that's not currently supported. |
| 00:34:37 | | AmAnd0A joins |
| 00:35:02 | <@JAA> | You could insert a blocking item that only finishes tomorrow and have one long-running process, but even then, eventually you'll hit the memory issues (probably heap fragmentation) and have to restart the process. |
| 00:35:54 | <@JAA> | I've had such constructs for continuously archiving closing forums until minutes before their shutdown, but that's a limited time frame still, even though some of these ran for months. |
| 00:36:22 | <@JAA> | Also, if you have any large downloads, you'll hit the memory issues much sooner. |
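The deduplication being discussed works by payload digest: the first capture of a given payload is stored in full, and later identical payloads are recorded as lightweight revisits pointing at the original. A minimal sketch of the idea (an illustration only, not warcdedupe's actual implementation):

```python
import hashlib

def dedupe(records):
    """records: iterable of (url, payload_bytes) pairs.

    Yields (url, kind, ref): kind is "response" for a full capture
    or "revisit" with ref naming the URL of the original capture.
    """
    seen = {}  # payload digest -> URL of the first capture
    for url, payload in records:
        digest = "sha1:" + hashlib.sha1(payload).hexdigest()
        if digest in seen:
            # Identical payload already stored: emit a revisit instead.
            yield (url, "revisit", seen[digest])
        else:
            seen[digest] = url
            yield (url, "response", None)
```

Cross-run deduplication of the kind nicolas17 asks about would require persisting the `seen` map (e.g. in SQLite) between processes; keeping it only in memory is why dedupe here stops at the process boundary.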
| 00:40:05 | <mgrandi> | https://www.irccloud.com/pastebin/OiNSTRkz |
| 00:40:31 | <mgrandi> | Oh I guess it did a paste whoops, tldr blaseball is ending, I'll post a list of websites involved tonight |
| 01:11:01 | | Mateon2 joins |
| 01:12:41 | | Mateon1 quits [Ping timeout: 252 seconds] |
| 01:12:41 | | Mateon2 is now known as Mateon1 |
| 01:25:33 | | Jonimus quits [Quit: WeeChat 3.3] |
| 01:33:31 | | Mateon2 joins |
| 01:34:08 | | Mateon1 quits [Ping timeout: 252 seconds] |
| 01:34:08 | | Mateon2 is now known as Mateon1 |
| 01:37:41 | | pabs ... at #archiveteam - several million .ga domains to be deleted on June 7 |
| 01:37:57 | | tbc1887 (tbc1887) joins |
| 01:38:31 | <pabs> | perhaps AT should consider all ccTLDs as endangered :) |
| 01:45:48 | | za3k quits [Client Quit] |
| 01:48:52 | <anarcat> | what's happening in gabon? |
| 01:51:59 | <pabs> | <benjins> .ga domain names control switching over to ANINF on June 7: https://www.afnic.fr/wp-media/uploads/2023/05/ga-domain-names-soon-to-return-to-Gabonese-management-1.pdf |
| 01:51:59 | <pabs> | <benjins> "As part of this |
| 01:51:59 | <pabs> | <benjins> switch-over operation, several million domain names will be deleted as the |
| 01:51:59 | <pabs> | <benjins> previous operator has not provided the data that concern them" |
| 01:52:08 | <pabs> | (pasted from #archiveteam) |
| 01:52:18 | | dumbgoy_ quits [Ping timeout: 252 seconds] |
| 01:52:18 | | dumbgoy__ joins |
| 01:52:20 | <nicolas17> | :| |
| 01:59:46 | | AmAnd0A quits [Read error: Connection reset by peer] |
| 02:02:08 | | AmAnd0A joins |
| 02:12:02 | <anarcat> | damn |
| 02:12:11 | <anarcat> | DNS is such a brittle thing |
| 02:12:34 | <anarcat> | https://en.wikipedia.org/wiki/.ga doesn't seem aware of this |
| 02:12:35 | <tomodachi94> | Well... that's an... _interesting_ move. |
| 02:12:57 | <tomodachi94> | @anarcat I'm going to update it as soon as I find a reliable source |
| 02:15:02 | <@JAA> | http://aninf.ga/communique-nouvelle-gestion-internationale-du-domaine-de-premier-niveau-ga/ |
| 02:18:19 | | lflare quits [Remote host closed the connection] |
| 02:28:04 | | icedice quits [Client Quit] |
| 02:28:28 | | icedice (icedice) joins |
| 02:34:01 | <tomodachi94> | @JAA @anarcat: updated the article. |
| 02:40:33 | <pabs> | ah, its Freenom that isn't handing over the data :( |
| 03:03:48 | | systwi quits [Ping timeout: 252 seconds] |
| 03:05:31 | <anarcat> | cool |
| 03:12:52 | | systwi (systwi) joins |
| 03:13:13 | <benjins> | Last year I dumped a bunch of domains from the SSL Cert Transparency logs. Here's a list of all the .ga domains in it: https://transfer.archivete.am/pC3HI/bns_gabon_domains_ct_partial_dump_01.txt |
| 03:13:37 | <benjins> | Note that a manual check of some of them shows that a lot of them are dead, and what remains is mostly spam |
| 03:18:17 | <pabs> | throw them in #// ? |
| 04:00:01 | | treora quits [Client Quit] |
| 04:01:31 | | treora joins |
| 04:04:35 | | BigBrain quits [Remote host closed the connection] |
| 04:04:54 | | BigBrain (bigbrain) joins |
| 04:06:34 | | Naruyoko joins |
| 05:17:26 | | Megame quits [Client Quit] |
| 06:01:59 | | JackThompson3 quits [Ping timeout: 252 seconds] |
| 06:19:03 | | hitgrr8 joins |
| 07:16:41 | | c3manu (c3manu) joins |
| 07:48:10 | | zhongfu quits [Quit: cya losers] |
| 07:49:56 | | Island quits [Read error: Connection reset by peer] |
| 07:52:29 | | zhongfu (zhongfu) joins |
| 07:56:02 | | zhongfu quits [Client Quit] |
| 07:58:02 | | tbc1887 quits [Ping timeout: 252 seconds] |
| 07:59:59 | | tbc1887 (tbc1887) joins |
| 08:00:11 | | zhongfu (zhongfu) joins |
| 08:11:47 | | decky_e quits [Ping timeout: 252 seconds] |
| 08:12:13 | | decky_e (decky_e) joins |
| 08:23:16 | <Hans5958> | Can anyone please accept my edits on URLTeam? I have to fix a syntax error that made the tables of the official shorteners go funky |
| 08:25:32 | | decky_e quits [Ping timeout: 252 seconds] |
| 08:26:00 | | decky_e (decky_e) joins |
| 08:46:12 | | lumidify quits [Quit: leaving] |
| 08:48:33 | | spirit quits [Client Quit] |
| 08:59:02 | | lumidify (lumidify) joins |
| 09:01:24 | | Naruyoko5 joins |
| 09:03:08 | | Naruyoko quits [Ping timeout: 252 seconds] |
| 09:19:25 | | BlueMaxima joins |
| 09:37:37 | | railen63 quits [Remote host closed the connection] |
| 09:37:53 | | railen63 joins |
| 09:46:50 | | dumbgoy__ quits [Ping timeout: 265 seconds] |
| 10:13:29 | | BlueMaxima quits [Read error: Connection reset by peer] |
| 10:23:41 | | coro joins |
| 10:35:12 | | systwi__ (systwi) joins |
| 10:35:39 | | systwi quits [Ping timeout: 265 seconds] |
| 10:55:39 | | decky_e quits [Remote host closed the connection] |
| 12:05:04 | | Ruthalas5 quits [Ping timeout: 265 seconds] |
| 12:11:30 | | Ruthalas5 (Ruthalas) joins |
| 12:32:52 | | icedice quits [Client Quit] |
| 12:45:29 | | spirit joins |
| 12:46:27 | | icedice (icedice) joins |
| 12:52:57 | | icedice quits [Client Quit] |
| 12:55:31 | | icedice (icedice) joins |
| 13:28:09 | | HP_Archivist (HP_Archivist) joins |
| 14:22:50 | <c3manu> | I think the Bang Face website should be archived. It contains a good bunch of photos and info about their past events and goes back 20 years. Would that be too large for the bot? https://bangface.com/ |
| 14:41:04 | | za3k joins |
| 14:47:03 | | sec^nd quits [Remote host closed the connection] |
| 14:47:25 | | sec^nd (second) joins |
| 15:00:03 | | BigBrain quits [Remote host closed the connection] |
| 15:00:19 | | BigBrain (bigbrain) joins |
| 15:01:27 | | lflare (lflare) joins |
| 15:49:12 | | HP_Archivist quits [Read error: Connection reset by peer] |
| 15:49:51 | | HP_Archivist (HP_Archivist) joins |
| 15:55:32 | | dumbgoy__ joins |
| 16:05:53 | | AmAnd0A quits [Read error: Connection reset by peer] |
| 16:07:37 | | AmAnd0A joins |
| 16:12:19 | <icedice> | c3manu: Depends on how much free space is on the HDD that the archiving job uses |
| 16:13:23 | <icedice> | I once tried archiving a site that had collected every Sailor Moon manga, anime, live action, radio drama, and so on, both in Japanese and English iirc |
| 16:13:53 | <icedice> | Got 800GB out of that site before the HDD the archiving job was using filled up and the job ended |
| 16:14:48 | <icedice> | If it's just a few hundred GB it should be fine |
| 16:15:32 | <c3manu> | Is there a way I can make a reasonable estimate? |
| 16:31:55 | <h2ibot> | Yts98 edited URLTeam (+2806, Fix formatting, add links, and remove redundant…): https://wiki.archiveteam.org/?diff=49871&oldid=49865 |
| 16:31:56 | <h2ibot> | Yts98 edited ArchiveBot/Educational institutions/list (+3675): https://wiki.archiveteam.org/?diff=49872&oldid=49663 |
| 16:40:38 | | hackbug quits [Quit: Lost terminal] |
| 16:41:18 | <icedice> | Probably not |
| 16:41:41 | <icedice> | Not without downloading it yourself using grab-site |
| 16:42:02 | <icedice> | Which I don't think is Wayback Machine-approved |
| 16:42:06 | <icedice> | Not sure though |
| 16:44:51 | | hackbug (hackbug) joins |
| 16:50:52 | <c3manu> | wait is there a list of wayback machine approved formats or sth like that? |
| 16:51:25 | <c3manu> | i mean the photos are not really high resolution either, i wouldn’t think it could be that much |
| 17:06:18 | <icedice> | WARC is what they want |
| 17:06:32 | <icedice> | grab-site produces WARCs |
| 17:06:48 | <icedice> | I'm just not sure if it has all the metadata and whatnot |
| 17:06:59 | <icedice> | It might be fine, I just haven't looked into it |
| 17:14:27 | <c3manu> | i thought WARC uploads as some regular user don’t automatically end up in the WBM anyways and require manual admin approval or something |
| 17:16:27 | <spirit> | if you are not looking for your archive to end up in the wayback machine, do anything you like. keep each item at ~=50 GB to make it easy to host for IA |
| 17:17:37 | <c3manu> | i do, which is why i would like to submit it via archivebot :) |
| 17:41:57 | | AlsoHP_Archivist joins |
| 17:41:57 | | HP_Archivist quits [Ping timeout: 265 seconds] |
| 17:48:34 | | AmAnd0A quits [Ping timeout: 252 seconds] |
| 17:49:03 | | AmAnd0A joins |
| 17:55:45 | | spirit quits [Client Quit] |
| 18:02:52 | | spirit joins |
| 18:05:16 | | thuban quits [Read error: Connection reset by peer] |
| 18:05:48 | | thuban joins |
| 18:07:29 | <fireonlive> | freenom seems like such a good and well run company |
| 18:31:13 | <joepie91|m> | fireonlive: here, you dropped a /s |
| 18:31:17 | <joepie91|m> | :p |
| 18:31:21 | <fireonlive> | :p |
| 18:42:58 | <@JAA> | icedice, c3manu: Total data size doesn't matter much, number of URLs does. |
| 18:43:13 | <@JAA> | Re ArchiveBot limits etc. |
| 18:44:21 | <h2ibot> | AvantApres edited Andriasang (+139, Added note about Wikipedia): https://wiki.archiveteam.org/?diff=49873&oldid=49732 |
| 18:44:34 | <@JAA> | Hans5958: I rejected the edit a couple hours ago as it conflicted with Yts98's, which had already fixed that error in another way earlier. |
| 18:50:22 | <h2ibot> | Pokechu22 edited Deathwatch (-1, stray bracket): https://wiki.archiveteam.org/?diff=49874&oldid=49868 |
| 19:02:01 | | za3k quits [Client Quit] |
| 19:03:51 | | decky_e (decky_e) joins |
| 19:18:25 | | that_lurker quits [Client Quit] |
| 19:18:44 | | that_lurker (that_lurker) joins |
| 19:38:53 | | Kraken joins |
| 19:47:39 | | railen63 quits [Remote host closed the connection] |
| 19:47:52 | | railen63 joins |
| 19:51:44 | | Megame (Megame) joins |
| 19:52:17 | | jtagcat quits [Client Quit] |
| 19:59:46 | | jtagcat (jtagcat) joins |
| 20:05:22 | | cdreimanu (c3manu) joins |
| 20:05:36 | | cdreimanu quits [Remote host closed the connection] |
| 20:07:55 | | c3manu quits [Ping timeout: 265 seconds] |
| 20:12:25 | | sonick quits [Client Quit] |
| 20:12:31 | | c3manu (c3manu) joins |
| 20:36:23 | <TheTechRobo> | icedice: I think grab-site works fine for the WBM. The problem is that items need to go into special collections to be ingested into the WBM. Because allowing anyone to ingest WARCs allows anyone to fake WBM snapshots, you need special permissions for those collections. |
| 20:36:27 | <TheTechRobo> | At least thats how I understand it. |
| 20:47:46 | | Kraken quits [Client Quit] |
| 20:59:09 | <@JAA> | Correct |
| 21:00:47 | <masterX244> | Luckily WARCs are still useful even outside the WBM for anyone who digs further, so WARC is better than "loose files" |
| 21:01:59 | <@JAA> | Absolutely |
| 21:03:07 | <@JAA> | And they can also be added to the WBM later after vetting etc. My first WARC uploads were just random items, then they got moved to the right place and I got the relevant access to upload there directly. |
| 21:05:13 | <masterX244> | got a fresh batch for moving to WARCzone waiting btw |
| 21:05:52 | <masterX244> | (for context: WARCzone is where any external warcs that got noticed are stored) |
| 21:07:26 | <c3manu> | what’s your preferred way to create WARCs? just wget or something special? I like to use the browsertrix-crawler which creates WACZ files, but I have no idea whether IA likes those or not |
| 21:08:37 | <masterX244> | NO WGET! wpull (used by grab-site) is better; Wget has WARC bugs |
| 21:09:16 | <masterX244> | browsertrix is from the webrecorder devs, there are WARC bugs, too |
| 21:09:28 | <masterX244> | their code is only good for reading/viewing WARCs |
| 21:10:08 | <icedice> | Ah, ok |
| 21:12:12 | <nicolas17> | "please upload WARCs, but 90% of tools are buggy and you shouldn't use them" |
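For context on what these tools are producing: a WARC record is just a small block of named headers followed by the captured bytes. A minimal sketch of serializing one "response" record (field names per the WARC spec; real writers add more headers such as WARC-Record-ID and payload digests, which is exactly where buggy tools go wrong):

```python
from datetime import datetime, timezone

def warc_response_record(url, http_bytes):
    """Serialize one minimal WARC 'response' record.

    http_bytes is the raw HTTP response (status line, headers, body).
    Deliberately incomplete: real writers also emit WARC-Record-ID,
    WARC-Payload-Digest, etc.
    """
    date = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    headers = (
        "WARC/1.0\r\n"
        "WARC-Type: response\r\n"
        f"WARC-Target-URI: {url}\r\n"
        f"WARC-Date: {date}\r\n"
        "Content-Type: application/http; msgtype=response\r\n"
        f"Content-Length: {len(http_bytes)}\r\n"
        "\r\n"
    ).encode()
    # Record block, then the two blank lines that terminate a record.
    return headers + http_bytes + b"\r\n\r\n"
```

Getting details like Content-Length, line endings, and digests exactly right across millions of records is why the channel steers people toward battle-tested writers (wpull/grab-site) rather than rolling their own.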
| 21:16:08 | | pokechu22 quits [Quit: Performing electrical work, will be back in a few hours probably] |
| 21:16:16 | <c3manu> | masterX244: haha thanks, i’m only 1-2 weeks deep into this archiving rabbit hole ^^ |
| 21:29:53 | <masterX244> | c3manu: wget also has the disadvantage that it does its retries immediately and keeps its todo list in RAM. Sometimes crawls get really long with long lists (had one recently where I had a 10GB wpull sqlite), and retries at the back are useful if something was gummed up, since they can unstick it if it was a temporary thing |
| 21:30:16 | <masterX244> | sucks when you get a page that does fuckery on pagination though where it desyncs the URL from the real page |
| 21:32:50 | <fireonlive> | nicolas17: garbage in, garbage out |
| 21:32:53 | <fireonlive> | just ask my parents |
| 21:32:58 | <nicolas17> | x_x |
| 21:33:49 | <c3manu> | i’m definitely playing around with wpull/grab-site next :) |
| 21:34:10 | <masterX244> | i wonder if there are other sites than only planetminecraft that don't allow jumping pagination pages by tweaking the parameter in the URL |
| 21:36:46 | <fireonlive> | there's a wget2 that's certainly 2 times better, right? |
| 21:37:14 | <@JAA> | No WARC support in Wget2. |
| 21:37:25 | <fireonlive> | ah boo |
| 21:37:34 | <masterX244> | wget upstream also didn't take the pull requests for wget1 on WARC compliance |
| 21:38:20 | <@JAA> | Well, the whole story's a bit more complicated, but yeah, still hasn't been fixed. |
| 21:38:27 | <@JAA> | Might get done soonish though. |
| 21:38:54 | <@JAA> | We're working on getting our changes merged upstream, and they're open to accepting them in general. |
| 21:39:04 | <fireonlive> | :) |
| 21:39:26 | <fireonlive> | ignore my knock by the way; was seeing if that still existed lol |
| 21:39:33 | | hitgrr8 quits [Client Quit] |
| 21:39:45 | | sec^nd quits [Remote host closed the connection] |
| 21:40:23 | <masterX244> | the limitation of RAM only still stays (and the wpull db keeping even ignored URLs is useful for when you need to postprocess some URLs) |
| 21:41:20 | <masterX244> | (or when you fuck up a first sweep with a overly broad ignore) |
| 21:42:06 | <fireonlive> | i thought i had to append like .* to the end of every ignore :| |
| 21:42:11 | <fireonlive> | apparently not! |
| 21:42:32 | <fireonlive> | /groups/ vs /groups/.* |
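The behaviour fireonlive noticed is what you'd expect if, as grab-site-style ignore lists are commonly understood to work (an assumption worth checking against the tool's docs), each pattern is a regex matched unanchored against the full URL, so a trailing `.*` adds nothing:

```python
import re

def is_ignored(url, patterns):
    # Unanchored search: the pattern may match anywhere in the URL,
    # mirroring how grab-site-style ignore lists are understood to
    # behave (assumption; verify against the tool's documentation).
    return any(re.search(p, url) for p in patterns)
```

With unanchored matching, `/groups/` and `/groups/.*` ignore exactly the same URLs; the extra `.*` only matters for tools that anchor the pattern or require a full match.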
| 21:52:44 | | Megame quits [Client Quit] |
| 21:54:27 | | c3manu quits [Client Quit] |
| 21:55:55 | | sec^nd (second) joins |
| 22:49:50 | | decky_e quits [Ping timeout: 265 seconds] |
| 22:50:26 | | decky_e (decky_e) joins |
| 23:08:04 | | Miori quits [Remote host closed the connection] |
| 23:19:31 | | AlsoHP_Archivist quits [Client Quit] |
| 23:20:03 | | HP_Archivist (HP_Archivist) joins |
| 23:38:05 | | sonick (sonick) joins |
| 23:39:07 | | BlueMaxima joins |
| 23:50:44 | | decky_e quits [Ping timeout: 265 seconds] |
| 23:51:12 | | decky_e (decky_e) joins |