00:00:10Unholy2361 (Unholy2361) joins
00:01:44AmAnd0A quits [Ping timeout: 252 seconds]
00:02:14AmAnd0A joins
00:03:53sonick (sonick) joins
00:04:20AmAnd0A quits [Read error: Connection reset by peer]
00:04:37AmAnd0A joins
00:05:23bf_ quits [Ping timeout: 265 seconds]
00:07:14<@JAA>nicolas17: Sure it can. --warcdedupe
00:07:34<@JAA>I don't remember why it's not enabled by default.
00:07:45<@JAA>However, it only dedupes within one process.
00:08:32jtagcat quits [Quit: Bye!]
00:08:55jtagcat (jtagcat) joins
00:11:38fullpwnmedia joins
00:11:40AmAnd0A quits [Ping timeout: 265 seconds]
00:13:38AmAnd0A joins
00:14:44AmAnd0A quits [Read error: Connection reset by peer]
00:15:02AmAnd0A joins
00:32:43<nicolas17>JAA: I tried that and it deduplicated the same payload across different URLs, works great
00:33:09<nicolas17>but I mean like, rerun it tomorrow to request the same URLs, and deduplicate if same URL gives same payload
00:33:50AmAnd0A quits [Ping timeout: 252 seconds]
00:34:00<@JAA>Yeah, that's not currently supported.
00:34:37AmAnd0A joins
00:35:02<@JAA>You could insert a blocking item that only finishes tomorrow and have one long-running process, but even then, eventually you'll hit the memory issues (probably heap fragmentation) and have to restart the process.
00:35:54<@JAA>I've had such constructs for continuously archiving closing forums until minutes before their shutdown, but that's a limited time frame still, even though some of these ran for months.
00:36:22<@JAA>Also, if you have any large downloads, you'll hit the memory issues much sooner.
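[Illustrative aside] The cross-run, same-URL dedup nicolas17 asks about isn't something wpull's --warcdedupe provides, per JAA above, but the WARC format itself supports the idea via revisit records. Below is a minimal sketch of that idea, assuming the requests and warcio libraries and a hypothetical digest_index.json carried over from the previous run; the URL and filenames are placeholders, and this is not how wpull actually works.

    import base64
    import hashlib
    import json
    from io import BytesIO

    import requests
    from warcio.statusandheaders import StatusAndHeaders
    from warcio.warcwriter import WARCWriter

    # digest_index.json is a hypothetical file written at the end of yesterday's run,
    # mapping URL -> {"digest": payload digest, "date": WARC-Date of the original capture}.
    with open("digest_index.json") as f:
        index = json.load(f)

    url = "https://example.com/page"  # placeholder URL
    resp = requests.get(url)
    digest = "sha1:" + base64.b32encode(hashlib.sha1(resp.content).digest()).decode()

    with open("today.warc.gz", "wb") as out:
        writer = WARCWriter(out, gzip=True)
        prev = index.get(url)
        if prev and prev["digest"] == digest:
            # Same URL, same payload as yesterday: write a revisit record pointing
            # back at the earlier capture instead of storing the body again.
            record = writer.create_revisit_record(
                url, digest=digest,
                refers_to_uri=url, refers_to_date=prev["date"])
        else:
            http_headers = StatusAndHeaders(f"{resp.status_code} {resp.reason}",
                                            list(resp.headers.items()),
                                            protocol="HTTP/1.1")
            record = writer.create_warc_record(url, "response",
                                               payload=BytesIO(resp.content),
                                               http_headers=http_headers)
        writer.write_record(record)

The digest index would still have to be persisted and reloaded between runs, which is exactly the part that the long-running-process workaround JAA describes tries to avoid.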
00:40:05<mgrandi>https://www.irccloud.com/pastebin/OiNSTRkz
00:40:31<mgrandi>Oh I guess it did a paste whoops, tldr blaseball is ending, I'll post a list of websites involved tonight
01:11:01Mateon2 joins
01:12:41Mateon1 quits [Ping timeout: 252 seconds]
01:12:41Mateon2 is now known as Mateon1
01:25:33Jonimus quits [Quit: WeeChat 3.3]
01:33:31Mateon2 joins
01:34:08Mateon1 quits [Ping timeout: 252 seconds]
01:34:08Mateon2 is now known as Mateon1
01:37:41pabs ... at #archiveteam - several million .ga domains to be deleted on June 7
01:37:57tbc1887 (tbc1887) joins
01:38:31<pabs>perhaps AT should consider all ccTLDs as endangered :)
01:45:48za3k quits [Client Quit]
01:48:52<anarcat>what's happening in gabon?
01:51:59<pabs><benjins> .ga domain names control switching over to ANINF on June 7: https://www.afnic.fr/wp-media/uploads/2023/05/ga-domain-names-soon-to-return-to-Gabonese-management-1.pdf
01:51:59<pabs><benjins> "As part of this
01:51:59<pabs><benjins> switch-over operation, several million domain names will be deleted as the
01:51:59<pabs><benjins> previous operator has not provided the data that concern them"
01:52:08<pabs>(pasted from #archiveteam)
01:52:18dumbgoy_ quits [Ping timeout: 252 seconds]
01:52:18dumbgoy__ joins
01:52:20<nicolas17>:|
01:59:46AmAnd0A quits [Read error: Connection reset by peer]
02:02:08AmAnd0A joins
02:12:02<anarcat>damn
02:12:11<anarcat>DNS is such a brittle thing
02:12:34<anarcat>https://en.wikipedia.org/wiki/.ga doesn't seem aware of this
02:12:35<tomodachi94>Well... that's an... _interesting_ move.
02:12:57<tomodachi94>@anarcat I'm going to update it as soon as I find a reliable source
02:15:02<@JAA>http://aninf.ga/communique-nouvelle-gestion-internationale-du-domaine-de-premier-niveau-ga/
02:18:19lflare quits [Remote host closed the connection]
02:28:04icedice quits [Client Quit]
02:28:28icedice (icedice) joins
02:34:01<tomodachi94>@JAA @anarcat: updated the article.
02:40:33<pabs>ah, it's Freenom that isn't handing over the data :(
03:03:48systwi quits [Ping timeout: 252 seconds]
03:05:31<anarcat>cool
03:12:52systwi (systwi) joins
03:13:13<benjins>Last year I dumped a bunch of domains from the SSL Cert Transparency logs. Here's a list of all the .ga domains in it: https://transfer.archivete.am/pC3HI/bns_gabon_domains_ct_partial_dump_01.txt
03:13:37<benjins>Note that a manual check of some of them shows that a lot of them are dead, and what remains is mostly spam
03:18:17<pabs>throw them in #// ?
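[Illustrative aside] For anyone curious how domains can be pulled out of Certificate Transparency logs (the approach benjins describes), here is a rough sketch using the RFC 6962 get-entries endpoint and the cryptography library. The log URL and entry range are placeholders, precertificate entries are skipped for brevity, and this is an illustration, not benjins's actual tooling.

    import base64
    import json
    import struct
    import urllib.request

    from cryptography import x509

    # Placeholder log; any RFC 6962 log exposes the same /ct/v1/get-entries endpoint.
    LOG = "https://ct.googleapis.com/logs/argon2023"

    def get_entries(start, end):
        url = f"{LOG}/ct/v1/get-entries?start={start}&end={end}"
        with urllib.request.urlopen(url) as resp:
            return json.load(resp)["entries"]

    def cert_from_leaf(leaf_b64):
        leaf = base64.b64decode(leaf_b64)
        # MerkleTreeLeaf: version(1) + leaf_type(1) + timestamp(8) + entry_type(2) + entry
        entry_type = struct.unpack(">H", leaf[10:12])[0]
        if entry_type != 0:  # 0 = x509_entry; precert entries skipped for brevity
            return None
        der_len = int.from_bytes(leaf[12:15], "big")  # 3-byte length prefix, then DER cert
        return x509.load_der_x509_certificate(leaf[15:15 + der_len])

    def dns_names(cert):
        try:
            san = cert.extensions.get_extension_for_class(x509.SubjectAlternativeName)
            return san.value.get_values_for_type(x509.DNSName)
        except x509.ExtensionNotFound:
            return []

    for entry in get_entries(0, 255):  # a tiny slice; a real dump walks the whole log
        cert = cert_from_leaf(entry["leaf_input"])
        if cert is None:
            continue
        for name in dns_names(cert):
            if name.endswith(".ga"):
                print(name)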
04:00:01treora quits [Client Quit]
04:01:31treora joins
04:04:35BigBrain quits [Remote host closed the connection]
04:04:54BigBrain (bigbrain) joins
04:06:34Naruyoko joins
05:17:26Megame quits [Client Quit]
06:01:59JackThompson3 quits [Ping timeout: 252 seconds]
06:19:03hitgrr8 joins
07:16:41c3manu (c3manu) joins
07:48:10zhongfu quits [Quit: cya losers]
07:49:56Island quits [Read error: Connection reset by peer]
07:52:29zhongfu (zhongfu) joins
07:56:02zhongfu quits [Client Quit]
07:58:02tbc1887 quits [Ping timeout: 252 seconds]
07:59:59tbc1887 (tbc1887) joins
08:00:11zhongfu (zhongfu) joins
08:11:47decky_e quits [Ping timeout: 252 seconds]
08:12:13decky_e (decky_e) joins
08:23:16<Hans5958>Can anyone please accept my edits on URLTeam? I have to fix a syntax error that makes the tables on the official shorteners go funky
08:25:32decky_e quits [Ping timeout: 252 seconds]
08:26:00decky_e (decky_e) joins
08:46:12lumidify quits [Quit: leaving]
08:48:33spirit quits [Client Quit]
08:59:02lumidify (lumidify) joins
09:01:24Naruyoko5 joins
09:03:08Naruyoko quits [Ping timeout: 252 seconds]
09:19:25BlueMaxima joins
09:37:37railen63 quits [Remote host closed the connection]
09:37:53railen63 joins
09:46:50dumbgoy__ quits [Ping timeout: 265 seconds]
10:13:29BlueMaxima quits [Read error: Connection reset by peer]
10:23:41coro joins
10:35:12systwi__ (systwi) joins
10:35:39systwi quits [Ping timeout: 265 seconds]
10:55:39decky_e quits [Remote host closed the connection]
12:05:04Ruthalas5 quits [Ping timeout: 265 seconds]
12:11:30Ruthalas5 (Ruthalas) joins
12:32:52icedice quits [Client Quit]
12:45:29spirit joins
12:46:27icedice (icedice) joins
12:52:57icedice quits [Client Quit]
12:55:31icedice (icedice) joins
13:28:09HP_Archivist (HP_Archivist) joins
14:22:50<c3manu>I think the Bang Face website should be archived. It contains a good bunch of photos and info about their past events and goes back 20 years. Would that be too large for the bot? https://bangface.com/
14:41:04za3k joins
14:47:03sec^nd quits [Remote host closed the connection]
14:47:25sec^nd (second) joins
15:00:03BigBrain quits [Remote host closed the connection]
15:00:19BigBrain (bigbrain) joins
15:01:27lflare (lflare) joins
15:49:12HP_Archivist quits [Read error: Connection reset by peer]
15:49:51HP_Archivist (HP_Archivist) joins
15:55:32dumbgoy__ joins
16:05:53AmAnd0A quits [Read error: Connection reset by peer]
16:07:37AmAnd0A joins
16:12:19<icedice>c3manu: Depends on how much free space is on the HDD that the archiving job uses
16:13:23<icedice>I once tried archiving a site that had collected every Sailor Moon manga, anime, live action, radio drama, and so on, both in Japanese and English iirc
16:13:53<icedice>Got 800GB out of that site before the HDD the archiving job was using filled up and the job ended
16:14:48<icedice>If it's just a few hundred GB it should be fine
16:15:32<c3manu>Is there a way I can make a reasonable estimate?
16:31:55<h2ibot>Yts98 edited URLTeam (+2806, Fix formatting, add links, and remove redundant…): https://wiki.archiveteam.org/?diff=49871&oldid=49865
16:31:56<h2ibot>Yts98 edited ArchiveBot/Educational institutions/list (+3675): https://wiki.archiveteam.org/?diff=49872&oldid=49663
16:40:38hackbug quits [Quit: Lost terminal]
16:41:18<icedice>Probably not
16:41:41<icedice>Not without downloading it yourself using grab-site
16:42:02<icedice>Which I don't think is Wayback Machine-approved
16:42:06<icedice>Not sure though
16:44:51hackbug (hackbug) joins
16:50:52<c3manu>wait is there a list of wayback machine approved formats or sth like that?
16:51:25<c3manu>i mean the photos are not really high resolution either, i wouldn’t think it could be that much
17:06:18<icedice>WARC is what they want
17:06:32<icedice>grab-site produces WARCs
17:06:48<icedice>I'm just not sure if it has all the metadata and whatnot
17:06:59<icedice>It might be fine, I just haven't looked into it
17:14:27<c3manu>i thought WARC uploads as some regular user don’t automatically end up in the WBM anyways and require manual admin approval or something
17:16:27<spirit>if you are not looking for your archive to end up in the wayback machine, do anything you like. keep each item at ~=50 GB to make it easy to host for IA
17:17:37<c3manu>i do, which is why i would like to submit it via archivebot :)
17:41:57AlsoHP_Archivist joins
17:41:57HP_Archivist quits [Ping timeout: 265 seconds]
17:48:34AmAnd0A quits [Ping timeout: 252 seconds]
17:49:03AmAnd0A joins
17:55:45spirit quits [Client Quit]
18:02:52spirit joins
18:05:16thuban quits [Read error: Connection reset by peer]
18:05:48thuban joins
18:07:29<fireonlive>freenom seems like such a good and well run company
18:31:13<joepie91|m>fireonlive: here, you dropped a /s
18:31:17<joepie91|m>:p
18:31:21<fireonlive>:p
18:42:58<@JAA>icedice, c3manu: Total data size doesn't matter much, number of URLs does.
18:43:13<@JAA>Re ArchiveBot limits etc.
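[Illustrative aside] Since the ArchiveBot limits are about URL count rather than bytes, per JAA, one rough way to estimate is to count the <loc> entries in the site's sitemap, assuming it publishes one (it may not). The sitemap path below is a guess, and if it turns out to be a sitemap index, the entries counted are sub-sitemaps rather than pages.

    import urllib.request
    import xml.etree.ElementTree as ET

    # Hypothetical path; bangface.com may or may not expose a sitemap here.
    SITEMAP = "https://bangface.com/sitemap.xml"

    with urllib.request.urlopen(SITEMAP) as resp:
        tree = ET.parse(resp)

    ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
    locs = tree.findall(".//sm:loc", ns)
    print(f"{len(locs)} <loc> entries listed in the sitemap")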
18:44:21<h2ibot>AvantApres edited Andriasang (+139, Added note about Wikipedia): https://wiki.archiveteam.org/?diff=49873&oldid=49732
18:44:34<@JAA>Hans5958: I rejected the edit a couple hours ago as it conflicted with Yts98's, which had already fixed that error in another way earlier.
18:50:22<h2ibot>Pokechu22 edited Deathwatch (-1, stray bracket): https://wiki.archiveteam.org/?diff=49874&oldid=49868
19:02:01za3k quits [Client Quit]
19:03:51decky_e (decky_e) joins
19:18:25that_lurker quits [Client Quit]
19:18:44that_lurker (that_lurker) joins
19:38:53Kraken joins
19:47:39railen63 quits [Remote host closed the connection]
19:47:52railen63 joins
19:51:44Megame (Megame) joins
19:52:17jtagcat quits [Client Quit]
19:59:46jtagcat (jtagcat) joins
20:05:22cdreimanu (c3manu) joins
20:05:36cdreimanu quits [Remote host closed the connection]
20:07:55c3manu quits [Ping timeout: 265 seconds]
20:12:25sonick quits [Client Quit]
20:12:31c3manu (c3manu) joins
20:36:23<TheTechRobo>icedice: I think grab-site works fine for the WBM. The problem is that items need to go into special collections to be ingested into the WBM. Because allowing anyone to ingest WARCs allows anyone to fake WBM snapshots, you need special permissions for those collections.
20:36:27<TheTechRobo>At least that's how I understand it.
20:47:46Kraken quits [Client Quit]
20:59:09<@JAA>Correct
21:00:47<masterX244>Luckily, WARCs are still useful even if not in the WBM for those who search further, so WARC is better than "loose files"
21:01:59<@JAA>Absolutely
21:03:07<@JAA>And they can also be added to the WBM later after vetting etc. My first WARC uploads were just random items, then they got moved to the right place and I got the relevant access to upload there directly.
21:05:13<masterX244>got a fresh batch for moving to WARCzone waiting btw
21:05:52<masterX244>(for context: WARCzone is where any external warcs that got noticed are stored)
21:07:26<c3manu>what’s your preferred way to create WARCs? just wget or something special? I like to use the browsertrix-crawler which creates WACZ files, but I have no idea whether IA likes those or not
21:08:37<masterX244>NO WGET! wpull (used by grab-site) is better, Wget has bugs
21:09:16<masterX244>browsertrix is from the webrecorder devs, there are WARC bugs, too
21:09:28<masterX244>their code is only good for reading/viewing WARCs
21:10:08<icedice>Ah, ok
21:12:12<nicolas17>"please upload WARCs, but 90% of tools are buggy and you shouldn't use them"
21:16:08pokechu22 quits [Quit: Performing electrical work, will be back in a few hours probably]
21:16:16<c3manu>masterX244: haha thanks, i’m only 1-2 weeks deep into this archiving rabbit hole ^^
21:29:53<masterX244>c3manu: wget also has the disadvantage that it does its retries immediately and keeps its todo list in RAM. Sometimes crawls get really long with long lists (had one recently where I had a 10GB wpull sqlite), and retries at the back are useful if something got gummed up, since they can unstick it if it was a temporary thing
21:30:16<masterX244>sucks when you get a page that does fuckery on pagination though where it desyncs the URL from the real page
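[Illustrative aside] The deferred-retry behaviour masterX244 describes (failures go to the back of the queue instead of being retried immediately, so a temporary hiccup has time to clear) can be sketched generically. This is only an illustration of the idea, not wpull's actual implementation; handle() is a placeholder for "save the response somewhere".

    from collections import deque

    import requests

    def handle(url, resp):
        # Placeholder for writing the response to a WARC or to disk.
        print(url, len(resp.content), "bytes")

    def crawl(start_urls, max_attempts=3):
        # Each queue entry is (url, attempts_so_far).
        todo = deque((url, 0) for url in start_urls)
        failed = []
        while todo:
            url, attempts = todo.popleft()
            try:
                resp = requests.get(url, timeout=30)
                resp.raise_for_status()
                handle(url, resp)
            except requests.RequestException:
                if attempts + 1 < max_attempts:
                    # Push the failure to the *back* of the queue: by the time the
                    # crawler gets around to it again, a temporary outage may be over.
                    todo.append((url, attempts + 1))
                else:
                    failed.append(url)
        return failed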
21:32:50<fireonlive>nicolas17: garbage in, garbage out
21:32:53<fireonlive>just ask my parents
21:32:58<nicolas17>x_x
21:33:49<c3manu>i’m definitely playing around with wpull/grab-site next :)
21:34:10<masterX244>i wonder if there are sites other than planetminecraft that don't allow jumping between pagination pages by tweaking the parameter in the URL
21:36:46<fireonlive>there's a wget2 that's certainly 2 times better, right?
21:37:14<@JAA>No WARC support in Wget2.
21:37:25<fireonlive>ah boo
21:37:34<masterX244>wget also didn't take the pull requests for wget1 on WARC compliance
21:38:20<@JAA>Well, the whole story's a bit more complicated, but yeah, still hasn't been fixed.
21:38:27<@JAA>Might get done soonish though.
21:38:54<@JAA>We're working on getting our changes merged upstream, and they're open to accepting them in general.
21:39:04<fireonlive>:)
21:39:26<fireonlive>ignore my knock by the way; was seeing if that still existed lol
21:39:33hitgrr8 quits [Client Quit]
21:39:45sec^nd quits [Remote host closed the connection]
21:40:23<masterX244>the RAM-only limitation still stays (and the wpull db keeping even ignored URLs is useful for when you need to postprocess some URLs)
21:41:20<masterX244>(or when you fuck up a first sweep with an overly broad ignore)
21:42:06<fireonlive>i thought i had to append like .* to the end of every ignore :|
21:42:11<fireonlive>apparently not!
21:42:32<fireonlive>/groups/ vs /groups/.*
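[Illustrative aside] The reason the trailing .* isn't needed, as fireonlive's test suggests, is that the ignore patterns are regular expressions matched anywhere in the URL rather than against the full string. A quick illustration with Python's re module; the URL is made up.

    import re

    url = "https://example.com/groups/123/photos"  # made-up URL

    print(bool(re.search(r"/groups/", url)))     # True: no trailing .* needed
    print(bool(re.search(r"/groups/.*", url)))   # also True, the .* adds nothing here
    print(bool(re.fullmatch(r"/groups/", url)))  # False: only full-string matching would need it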
21:52:44Megame quits [Client Quit]
21:54:27c3manu quits [Client Quit]
21:55:55sec^nd (second) joins
22:49:50decky_e quits [Ping timeout: 265 seconds]
22:50:26decky_e (decky_e) joins
23:08:04Miori quits [Remote host closed the connection]
23:19:31AlsoHP_Archivist quits [Client Quit]
23:20:03HP_Archivist (HP_Archivist) joins
23:38:05sonick (sonick) joins
23:39:07BlueMaxima joins
23:50:44decky_e quits [Ping timeout: 265 seconds]
23:51:12decky_e (decky_e) joins