00:00:10Unholy2361 (Unholy2361) joins
00:01:44AmAnd0A quits [Ping timeout: 252 seconds]
00:02:14AmAnd0A joins
00:03:53sonick (sonick) joins
00:04:20AmAnd0A quits [Read error: Connection reset by peer]
00:04:37AmAnd0A joins
00:05:23bf_ quits [Ping timeout: 265 seconds]
00:07:14<@JAA>nicolas17: Sure it can. --warcdedupe
00:07:34<@JAA>I don't remember why it's not enabled by default.
00:07:45<@JAA>However, it only dedupes within one process.
00:08:32jtagcat quits [Quit: Bye!]
00:08:55jtagcat (jtagcat) joins
00:11:38fullpwnmedia joins
00:11:40AmAnd0A quits [Ping timeout: 265 seconds]
00:13:38AmAnd0A joins
00:14:44AmAnd0A quits [Read error: Connection reset by peer]
00:15:02AmAnd0A joins
00:32:43<nicolas17>JAA: I tried that and it deduplicated the same payload across different URLs, works great
00:33:09<nicolas17>but I mean like, rerun it tomorrow to request the same URLs, and deduplicate if same URL gives same payload
00:33:50AmAnd0A quits [Ping timeout: 252 seconds]
00:34:00<@JAA>Yeah, that's not currently supported.
00:34:37AmAnd0A joins
00:35:02<@JAA>You could insert a blocking item that only finishes tomorrow and have one long-running process, but even then, eventually you'll hit the memory issues (probably heap fragmentation) and have to restart the process.
00:35:54<@JAA>I've had such constructs for continuously archiving closing forums until minutes before their shutdown, but that's a limited time frame still, even though some of these ran for months.
00:36:22<@JAA>Also, if you have any large downloads, you'll hit the memory issues much sooner.
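[Illustrative aside] The cross-run, same-URL dedup nicolas17 asks about isn't something wpull's --warcdedupe provides, per JAA above, but the WARC format itself supports the idea via revisit records. Below is a minimal sketch of that idea, assuming the requests and warcio libraries and a hypothetical digest_index.json carried over from the previous run; the URL and filenames are placeholders, and this is not how wpull actually works.

    import base64
    import hashlib
    import json
    from io import BytesIO

    import requests
    from warcio.statusandheaders import StatusAndHeaders
    from warcio.warcwriter import WARCWriter

    # digest_index.json is a hypothetical file written at the end of yesterday's run,
    # mapping URL -> {"digest": payload digest, "date": WARC-Date of the original capture}.
    with open("digest_index.json") as f:
        index = json.load(f)

    url = "https://example.com/page"  # placeholder URL
    resp = requests.get(url)
    digest = "sha1:" + base64.b32encode(hashlib.sha1(resp.content).digest()).decode()

    with open("today.warc.gz", "wb") as out:
        writer = WARCWriter(out, gzip=True)
        prev = index.get(url)
        if prev and prev["digest"] == digest:
            # Same URL, same payload as yesterday: write a revisit record pointing
            # back at the earlier capture instead of storing the body again.
            record = writer.create_revisit_record(
                url, digest=digest,
                refers_to_uri=url, refers_to_date=prev["date"])
        else:
            http_headers = StatusAndHeaders(f"{resp.status_code} {resp.reason}",
                                            list(resp.headers.items()),
                                            protocol="HTTP/1.1")
            record = writer.create_warc_record(url, "response",
                                               payload=BytesIO(resp.content),
                                               http_headers=http_headers)
        writer.write_record(record)

The digest index would still have to be persisted and reloaded between runs, which is exactly the part that the long-running-process workaround JAA describes tries to avoid.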
00:40:05<mgrandi>https://www.irccloud.com/pastebin/OiNSTRkz
00:40:31<mgrandi>Oh I guess it did a paste whoops, tldr blaseball is ending, I'll post a list of websites involved tonight
01:11:01Mateon2 joins
01:12:41Mateon1 quits [Ping timeout: 252 seconds]
01:12:41Mateon2 is now known as Mateon1
01:25:33Jonimus quits [Quit: WeeChat 3.3]
01:33:31Mateon2 joins
01:34:08Mateon1 quits [Ping timeout: 252 seconds]
01:34:08Mateon2 is now known as Mateon1
01:37:41pabs ... at #archiveteam - several million .ga domains to be deleted on June 7
01:37:57tbc1887 (tbc1887) joins
01:38:31<pabs>perhaps AT should consider all ccTLDs as endangered :)
01:45:48za3k quits [Client Quit]
01:48:52<anarcat>what's happening in gabon?
01:51:59<pabs><benjins> .ga domain names control switching over to ANINF on June 7: https://www.afnic.fr/wp-media/uploads/2023/05/ga-domain-names-soon-to-return-to-Gabonese-management-1.pdf
01:51:59<pabs><benjins> "As part of this
01:51:59<pabs><benjins> switch-over operation, several million domain names will be deleted as the
01:51:59<pabs><benjins> previous operator has not provided the data that concern them"
01:52:08<pabs>(pasted from #archiveteam)
01:52:18dumbgoy_ quits [Ping timeout: 252 seconds]
01:52:18dumbgoy__ joins
01:52:20<nicolas17>:|
01:59:46AmAnd0A quits [Read error: Connection reset by peer]
02:02:08AmAnd0A joins
02:12:02<anarcat>damn
02:12:11<anarcat>DNS is such a brittle thing
02:12:34<anarcat>https://en.wikipedia.org/wiki/.ga doesn't seem aware of this
02:12:35<tomodachi94>Well... that's an... _interesting_ move.
02:12:57<tomodachi94>@anarcat I'm going to update it as soon as I find a reliable source
02:15:02<@JAA>http://aninf.ga/communique-nouvelle-gestion-internationale-du-domaine-de-premier-niveau-ga/
02:18:19lflare quits [Remote host closed the connection]
02:28:04icedice quits [Client Quit]
02:28:28icedice (icedice) joins
02:34:01<tomodachi94>@JAA @anarcat: updated the article.
02:40:33<pabs>ah, it's Freenom that isn't handing over the data :(
03:03:48systwi quits [Ping timeout: 252 seconds]
03:05:31<anarcat>cool
03:12:52systwi (systwi) joins
03:13:13<benjins>Last year I dumped a bunch of domains from the SSL Cert Transparency logs. Here's a list of all the .ga domains in it: https://transfer.archivete.am/pC3HI/bns_gabon_domains_ct_partial_dump_01.txt
03:13:37<benjins>Note that a manual check of some of them shows that a lot of them are dead, and what remains is mostly spam
03:18:17<pabs>throw them in #// ?
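[Illustrative aside] For anyone curious how domains can be pulled out of Certificate Transparency logs (the approach benjins describes), here is a rough sketch using the RFC 6962 get-entries endpoint and the cryptography library. The log URL and entry range are placeholders, precertificate entries are skipped for brevity, and this is an illustration, not benjins's actual tooling.

    import base64
    import json
    import struct
    import urllib.request

    from cryptography import x509

    # Placeholder log; any RFC 6962 log exposes the same /ct/v1/get-entries endpoint.
    LOG = "https://ct.googleapis.com/logs/argon2023"

    def get_entries(start, end):
        url = f"{LOG}/ct/v1/get-entries?start={start}&end={end}"
        with urllib.request.urlopen(url) as resp:
            return json.load(resp)["entries"]

    def cert_from_leaf(leaf_b64):
        leaf = base64.b64decode(leaf_b64)
        # MerkleTreeLeaf: version(1) + leaf_type(1) + timestamp(8) + entry_type(2) + entry
        entry_type = struct.unpack(">H", leaf[10:12])[0]
        if entry_type != 0:  # 0 = x509_entry; precert entries skipped for brevity
            return None
        der_len = int.from_bytes(leaf[12:15], "big")  # 3-byte length prefix, then DER cert
        return x509.load_der_x509_certificate(leaf[15:15 + der_len])

    def dns_names(cert):
        try:
            san = cert.extensions.get_extension_for_class(x509.SubjectAlternativeName)
            return san.value.get_values_for_type(x509.DNSName)
        except x509.ExtensionNotFound:
            return []

    for entry in get_entries(0, 255):  # a tiny slice; a real dump walks the whole log
        cert = cert_from_leaf(entry["leaf_input"])
        if cert is None:
            continue
        for name in dns_names(cert):
            if name.endswith(".ga"):
                print(name)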
04:00:01treora quits [Client Quit]
04:01:31treora joins
04:04:35BigBrain quits [Remote host closed the connection]
04:04:54BigBrain (bigbrain) joins
04:06:34Naruyoko joins
05:17:26Megame quits [Client Quit]
06:01:59JackThompson3 quits [Ping timeout: 252 seconds]
06:19:03hitgrr8 joins
07:16:41c3manu (c3manu) joins
07:48:10zhongfu quits [Quit: cya losers]
07:49:56Island quits [Read error: Connection reset by peer]
07:52:29zhongfu (zhongfu) joins
07:56:02zhongfu quits [Client Quit]
07:58:02tbc1887 quits [Ping timeout: 252 seconds]
07:59:59tbc1887 (tbc1887) joins
08:00:11zhongfu (zhongfu) joins
08:11:47decky_e quits [Ping timeout: 252 seconds]
08:12:13decky_e (decky_e) joins
08:23:16<Hans5958>Can anyone please accept my edits on URLTeam? I have to fix a syntax error that makes the tables on the official shorteners go funky
08:25:32decky_e quits [Ping timeout: 252 seconds]
08:26:00decky_e (decky_e) joins
08:46:12lumidify quits [Quit: leaving]
08:48:33spirit quits [Client Quit]
08:59:02lumidify (lumidify) joins
09:01:24Naruyoko5 joins
09:03:08Naruyoko quits [Ping timeout: 252 seconds]
09:19:25BlueMaxima joins
09:37:37railen63 quits [Remote host closed the connection]
09:37:53railen63 joins
09:46:50dumbgoy__ quits [Ping timeout: 265 seconds]
10:13:29BlueMaxima quits [Read error: Connection reset by peer]
10:23:41coro joins
10:35:12systwi__ (systwi) joins
10:35:39systwi quits [Ping timeout: 265 seconds]
10:55:39decky_e quits [Remote host closed the connection]
12:05:04Ruthalas5 quits [Ping timeout: 265 seconds]
12:11:30Ruthalas5 (Ruthalas) joins
12:32:52icedice quits [Client Quit]
12:45:29spirit joins
12:46:27icedice (icedice) joins
12:52:57icedice quits [Client Quit]
12:55:31icedice (icedice) joins
13:28:09HP_Archivist (HP_Archivist) joins
14:22:50<c3manu>I think the Bang Face website should be archived. It contains a good bunch of photos and info about their past events and goes back 20 years. Would that be too large for the bot? https://bangface.com/
14:41:04za3k joins
14:47:03sec^nd quits [Remote host closed the connection]
14:47:25sec^nd (second) joins
15:00:03BigBrain quits [Remote host closed the connection]
15:00:19BigBrain (bigbrain) joins
15:01:27lflare (lflare) joins
15:49:12HP_Archivist quits [Read error: Connection reset by peer]
15:49:51HP_Archivist (HP_Archivist) joins
15:55:32dumbgoy__ joins
16:05:53AmAnd0A quits [Read error: Connection reset by peer]
16:07:37AmAnd0A joins
16:12:19<icedice>c3manu: Depends on how much free space is on the HDD that the archiving job uses
16:13:23<icedice>I once tried archiving a site that had collected every Sailor Moon manga, anime, live action, radio drama, and so on, both in Japanese and English iirc
16:13:53<icedice>Got 800GB out of that site before the HDD the archiving job was using filled up and the job ended
16:14:48<icedice>If it's just a few hundred GB it should be fine
16:15:32<c3manu>Is there a way I can make a reasonable estimate?
16:31:55<h2ibot>Yts98 edited URLTeam (+2806, Fix formatting, add links, and remove redundant…): https://wiki.archiveteam.org/?diff=49871&oldid=49865
16:31:56<h2ibot>Yts98 edited ArchiveBot/Educational institutions/list (+3675): https://wiki.archiveteam.org/?diff=49872&oldid=49663
16:40:38hackbug quits [Quit: Lost terminal]
16:41:18<icedice>Probably not
16:41:41<icedice>Not without downloading it yourself using grab-site
16:42:02<icedice>Which I don't think is Wayback Machine-approved
16:42:06<icedice>Not sure though
16:44:51hackbug (hackbug) joins
16:50:52<c3manu>wait is there a list of wayback machine approved formats or sth like that?
16:51:25<c3manu>i mean the photos are not really high resolution either, i wouldn’t think it could be that much
17:06:18<icedice>WARC is what they want
17:06:32<icedice>grab-site produces WARCs
17:06:48<icedice>I'm just not sure if it has all the metadata and whatnot
17:06:59<icedice>It might be fine, I just haven't looked into it
17:14:27<c3manu>i thought WARC uploads as some regular user don’t automatically end up in the WBM anyways and require manual admin approval or something
17:16:27<spirit>if you are not looking for your archive to end up in the wayback machine, do anything you like. keep each item at ~=50 GB to make it easy to host for IA
17:17:37<c3manu>i do, which is why i would like to submit it via archivebot :)
17:41:57AlsoHP_Archivist joins
17:41:57HP_Archivist quits [Ping timeout: 265 seconds]
17:48:34AmAnd0A quits [Ping timeout: 252 seconds]
17:49:03AmAnd0A joins
17:55:45spirit quits [Client Quit]
18:02:52spirit joins
18:05:16thuban quits [Read error: Connection reset by peer]
18:05:48thuban joins
18:07:29<fireonlive>freenom seems like such a good and well run company
18:31:13<joepie91|m>fireonlive: here, you dropped a /s
18:31:17<joepie91|m>:p
18:31:21<fireonlive>:p
18:42:58<@JAA>icedice, c3manu: Total data size doesn't matter much, number of URLs does.
18:43:13<@JAA>Re ArchiveBot limits etc.
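[Illustrative aside] Since the ArchiveBot limits are about URL count rather than bytes, per JAA, one rough way to estimate is to count the <loc> entries in the site's sitemap, assuming it publishes one (it may not). The sitemap path below is a guess, and if it turns out to be a sitemap index, the entries counted are sub-sitemaps rather than pages.

    import urllib.request
    import xml.etree.ElementTree as ET

    # Hypothetical path; bangface.com may or may not expose a sitemap here.
    SITEMAP = "https://bangface.com/sitemap.xml"

    with urllib.request.urlopen(SITEMAP) as resp:
        tree = ET.parse(resp)

    ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
    locs = tree.findall(".//sm:loc", ns)
    print(f"{len(locs)} <loc> entries listed in the sitemap")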
18:44:21<h2ibot>AvantApres edited Andriasang (+139, Added note about Wikipedia): https://wiki.archiveteam.org/?diff=49873&oldid=49732
18:44:34<@JAA>Hans5958: I rejected the edit a couple hours ago as it conflicted with Yts98's, which had already fixed that error in another way earlier.
18:50:22<h2ibot>Pokechu22 edited Deathwatch (-1, stray bracket): https://wiki.archiveteam.org/?diff=49874&oldid=49868
19:02:01za3k quits [Client Quit]
19:03:51decky_e (decky_e) joins
19:18:25that_lurker quits [Client Quit]
19:18:44that_lurker (that_lurker) joins
19:38:53Kraken joins
19:47:39railen63 quits [Remote host closed the connection]
19:47:52railen63 joins
19:51:44Megame (Megame) joins
19:52:17jtagcat quits [Client Quit]
19:59:46jtagcat (jtagcat) joins
20:05:22cdreimanu (c3manu) joins
20:05:36cdreimanu quits [Remote host closed the connection]
20:07:55c3manu quits [Ping timeout: 265 seconds]
20:12:25sonick quits [Client Quit]
20:12:31c3manu (c3manu) joins
20:36:23<TheTechRobo>icedice: I think grab-site works fine for the WBM. The problem is that items need to go into special collections to be ingested into the WBM. Because allowing anyone to ingest WARCs allows anyone to fake WBM snapshots, you need special permissions for those collections.
20:36:27<TheTechRobo>At least that's how I understand it.
20:47:46Kraken quits [Client Quit]
20:59:09<@JAA>Correct
21:00:47<masterX244>Luckily, WARCs are still useful even if not in the WBM for those who search further, so WARC is better than "loose files"
21:01:59<@JAA>Absolutely
21:03:07<@JAA>And they can also be added to the WBM later after vetting etc. My first WARC uploads were just random items, then they got moved to the right place and I got the relevant access to upload there directly.
21:05:13<masterX244>got a fresh batch for moving to WARCzone waiting btw
21:05:52<masterX244>(for context: WARCzone is where any external warcs that got noticed are stored)
21:07:26<c3manu>what’s your preferred way to create WARCs? just wget or something special? I like to use the browsertrix-crawler which creates WACZ files, but I have no idea whether IA likes those or not
21:08:37<masterX244>NO WGET! wpull (used by grab-site) is better, Wget has bugs
21:09:16<masterX244>browsertrix is from the webrecorder devs, there are WARC bugs, too
21:09:28<masterX244>their code is only good for reading/viewing WARCs
21:10:08<icedice>Ah, ok
21:12:12<nicolas17>"please upload WARCs, but 90% of tools are buggy and you shouldn't use them"
21:16:08pokechu22 quits [Quit: Performing electrical work, will be back in a few hours probably]
21:16:16<c3manu>masterX244: haha thanks, i’m only 1-2 weeks deep into this archiving rabbit hole ^^
21:29:53<masterX244>c3manu: wget also has the disadvantage that it does its retries immediately and keeps its todo list in RAM. Sometimes crawls get really long with long lists (had one recently where I had a 10GB wpull sqlite), and retries at the back are useful if something got gummed up, since they can unstick it if it was a temporary thing
21:30:16<masterX244>sucks when you get a page that does fuckery on pagination though where it desyncs the URL from the real page
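[Illustrative aside] The deferred-retry behaviour masterX244 describes (failures go to the back of the queue instead of being retried immediately, so a temporary hiccup has time to clear) can be sketched generically. This is only an illustration of the idea, not wpull's actual implementation; handle() is a placeholder for "save the response somewhere".

    from collections import deque

    import requests

    def handle(url, resp):
        # Placeholder for writing the response to a WARC or to disk.
        print(url, len(resp.content), "bytes")

    def crawl(start_urls, max_attempts=3):
        # Each queue entry is (url, attempts_so_far).
        todo = deque((url, 0) for url in start_urls)
        failed = []
        while todo:
            url, attempts = todo.popleft()
            try:
                resp = requests.get(url, timeout=30)
                resp.raise_for_status()
                handle(url, resp)
            except requests.RequestException:
                if attempts + 1 < max_attempts:
                    # Push the failure to the *back* of the queue: by the time the
                    # crawler gets around to it again, a temporary outage may be over.
                    todo.append((url, attempts + 1))
                else:
                    failed.append(url)
        return failed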
21:32:50<fireonlive>nicolas17: garbage in, garbage out
21:32:53<fireonlive>just ask my parents
21:32:58<nicolas17>x_x
21:33:49<c3manu>i’m definitely playing around with wpull/grab-site next :)
21:34:10<masterX244>i wonder if there are sites other than planetminecraft that don't allow jumping between pagination pages by tweaking the parameter in the URL
21:36:46<fireonlive>there's a wget2 that's certainly 2 times better, right?
21:37:14<@JAA>No WARC support in Wget2.
21:37:25<fireonlive>ah boo
21:37:34<masterX244>wget also didn't take the pull requests for wget1 on WARC compliance
21:38:20<@JAA>Well, the whole story's a bit more complicated, but yeah, still hasn't been fixed.
21:38:27<@JAA>Might get done soonish though.
21:38:54<@JAA>We're working on getting our changes merged upstream, and they're open to accepting them in general.
21:39:04<fireonlive>:)
21:39:26<fireonlive>ignore my knock by the way; was seeing if that still existed lol
21:39:33hitgrr8 quits [Client Quit]
21:39:45sec^nd quits [Remote host closed the connection]
21:40:23<masterX244>the RAM-only limitation still stays (and the wpull db keeping even ignored URLs is useful for when you need to postprocess some URLs)
21:41:20<masterX244>(or when you fuck up a first sweep with an overly broad ignore)
21:42:06<fireonlive>i thought i had to append like .* to the end of every ignore :|
21:42:11<fireonlive>apparently not!
21:42:32<fireonlive>/groups/ vs /groups/.*
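[Illustrative aside] The reason the trailing .* isn't needed, as fireonlive's test suggests, is that the ignore patterns are regular expressions matched anywhere in the URL rather than against the full string. A quick illustration with Python's re module; the URL is made up.

    import re

    url = "https://example.com/groups/123/photos"  # made-up URL

    print(bool(re.search(r"/groups/", url)))     # True: no trailing .* needed
    print(bool(re.search(r"/groups/.*", url)))   # also True, the .* adds nothing here
    print(bool(re.fullmatch(r"/groups/", url)))  # False: only full-string matching would need it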
21:52:44Megame quits [Client Quit]
21:54:27c3manu quits [Client Quit]
21:55:55sec^nd (second) joins
22:49:50decky_e quits [Ping timeout: 265 seconds]
22:50:26decky_e (decky_e) joins
23:08:04Miori quits [Remote host closed the connection]
23:19:31AlsoHP_Archivist quits [Client Quit]
23:20:03HP_Archivist (HP_Archivist) joins
23:38:05sonick (sonick) joins
23:39:07BlueMaxima joins
23:50:44decky_e quits [Ping timeout: 265 seconds]
23:51:12decky_e (decky_e) joins