| 00:18:42 | | Wohlstand (Wohlstand) joins |
| 00:23:43 | | Wohlstand quits [Ping timeout: 272 seconds] |
| 00:48:06 | <klea> | btw, it'd be neat to have a system to track planned and unplanned outages in systems that AT uses. |
| 00:54:14 | <klea> | /cc DigitalDragons ^ |
| 01:06:23 | | pika joins |
| 01:09:58 | | SootBector quits [Remote host closed the connection] |
| 01:11:04 | | SootBector (SootBector) joins |
| 01:16:55 | | pika quits [Ping timeout: 272 seconds] |
| 01:17:38 | | Suika_ quits [Ping timeout: 256 seconds] |
| 01:24:35 | | pika joins |
| 01:25:41 | | pokechu22 quits [Quit: WeeChat 4.7.1] |
| 01:26:08 | | pokechu22 (pokechu22) joins |
| 01:26:33 | | Suika joins |
| 01:29:35 | | pika quits [Ping timeout: 272 seconds] |
| 01:37:50 | | pika joins |
| 01:42:53 | | pika quits [Ping timeout: 272 seconds] |
| 01:49:15 | | Sk1d quits [Read error: Connection reset by peer] |
| 01:51:30 | | pika joins |
| 01:51:51 | | pika leaves |
| 01:59:47 | | cyan_box joins |
| 02:04:06 | | cyanbox_ quits [Ping timeout: 256 seconds] |
| 02:36:51 | | MrMcNuggets (MrMcNuggets) joins |
| 03:06:39 | | etnguyen03 (etnguyen03) joins |
| 03:18:10 | | etnguyen03 quits [Remote host closed the connection] |
| 03:19:47 | | etnguyen03 (etnguyen03) joins |
| 03:21:38 | | etnguyen03 quits [Remote host closed the connection] |
| 03:22:47 | | etnguyen03 (etnguyen03) joins |
| 03:38:21 | | etnguyen03 quits [Client Quit] |
| 03:46:06 | | CYBERDEV quits [Quit: Leaving] |
| 03:46:16 | | etnguyen03 (etnguyen03) joins |
| 03:54:55 | | CYBERDEV joins |
| 03:59:21 | | etnguyen03 quits [Remote host closed the connection] |
| 04:27:37 | | chrismeller3 quits [Quit: chrismeller3] |
| 05:02:38 | | Webuser290291 joins |
| 05:02:55 | | Webuser290291 quits [Client Quit] |
| 05:04:14 | | wotd quits [Remote host closed the connection] |
| 05:04:18 | | n9nes quits [Ping timeout: 256 seconds] |
| 05:04:47 | | wotd joins |
| 05:07:05 | | n9nes joins |
| 05:14:25 | | chunkynutz60 quits [Ping timeout: 272 seconds] |
| 05:17:20 | | ThetaDev quits [Quit: https://quassel-irc.org - Chat comfortably. Anywhere.] |
| 05:18:00 | | ThetaDev joins |
| 05:22:08 | <HP_Archivist> | I just noticed this: https://www.reddit.com/r/Archivists/comments/1q9n5nt/nara_is_shutting_down_history_hub_for_citizen/ |
| 05:22:26 | <HP_Archivist> | https://historyhub.history.gov/citizen_archivists/f/discussions |
| 05:22:29 | <HP_Archivist> | Can we grab it? |
| 05:23:12 | <pokechu22> | "On January 15, 2026 the History Hub site will be “frozen in time.” The site will remain available for reference until February 13, 2026." |
| 05:23:57 | <HP_Archivist> | Does that mean it will be frozen in time and online or? |
| 05:24:16 | | Webuser025436 joins |
| 05:24:18 | <pokechu22> | Presumably that means they made it read-only yesterday and it'll be online for a month before they close it fully |
| 05:24:19 | <nicolas17> | sounds like it will be frozen and online between jan 15 and feb 13, and offline after feb 13 |
| 05:24:29 | <pokechu22> | I'm seeing incapsula on it |
| 05:24:34 | <@arkiver> | i've been working on a job in AB |
| 05:24:34 | <nicolas17> | pls add to deathwatch |
| 05:24:38 | <@arkiver> | but not sure if it went well |
| 05:24:57 | <pokechu22> | It didn't work |
| 05:26:00 | <HP_Archivist> | Hm. Throw it into AB now? |
| 05:26:13 | <pokechu22> | It seems like the incapsula JS challenge would need to be solved, and I don't know how long that lasts |
| 05:26:48 | <Webuser025436> | Hi. Ming Pao Canada, a hk-based newspaper that has a Canada edition and daily newspaper, announced they are shutting down as of today. [1] Notably, they have an archive of all articles Ming Pao Hong Kong (the HK edition). All of Ming Pao HK's articles pre-2021 have been removed a long time ago from the internet because of the media situation in HK. |
| 05:26:49 | <Webuser025436> | Is it possible to archive? https://www.mingpaocanada.com/ |
| 05:26:49 | <Webuser025436> | [1]: https://www.cp24.com/news/canada/2026/01/13/ex-journalists-lament-closure-of-ming-pao-canadas-last-chinese-language-daily-paper/ |
| 05:28:13 | <h2ibot> | Pokechu22 edited Deathwatch (+237, /* 2026-02 */ https://historyhub.history.gov/): https://wiki.archiveteam.org/?diff=60210&oldid=60209 |
| 05:28:30 | <pokechu22> | Webuser025436: I believe we've already started an archivebot job for that; I'm going to double-check the status of it |
| 05:29:59 | <pokechu22> | Webuser025436: It's currently running, together with http://mingshengbao.com/ - see http://archivebot.com/?initialFilter=mingpaocanada#log-container-6cnonr9bi6enxit9na57wckan |
| 05:30:55 | <Webuser025436> | Many thanks 🙏 |
| 05:31:15 | <pokechu22> | Where is the archive of the pre-2021 Hong Kong articles? I can't read Chinese so I'd like to make sure we're saving that too |
| 05:32:54 | <nicolas17> | pokechu22: wonder if we should speed up that job |
| 05:33:28 | <pokechu22> | It closes at the end of January; today was the date of the last edition being published |
| 05:33:44 | <pokechu22> | (at least according to https://www.cp24.com/news/canada/2026/01/13/ex-journalists-lament-closure-of-ming-pao-canadas-last-chinese-language-daily-paper/) |
| 05:37:03 | <Webuser025436> | pokechu22 All HK articles for a given day are shown here: https://www.mingpaocanada.com/tor/htm/News/YYYYMMDD/HK-GAindex_r.htm |
| 05:37:03 | <Webuser025436> | So for example: https://www.mingpaocanada.com/tor/htm/News/20140710/HK-GAindex_r.htm |
| 05:37:03 | <Webuser025436> | The earliest AFAICT is 20140710. |
| 05:38:53 | <pokechu22> | Is there a page that lists all of the previous ones? I assume there must be because archivebot has found https://www.mingpaocanada.com/tor/htm/News/20220429/tam1_r.htm but I don't know exactly where it came from |
| 05:40:12 | <pokechu22> | (there's https://www.mingpaocanada.com/tor/htm/Responsive/archiveList.cfm but that seems to only directly show the last week) |
| 05:40:19 | <nicolas17> | pokechu22: also I'm seeing many requests like https://www.mingpaocanada.com/tor/htm/News/20220815/TD/TD/tdc1.txt that redirect to an error page, might be a crawling glitch finding garbage in JS or something? |
| 05:41:58 | <nicolas17> | ah yes, docPath: "HK-GA/gc/gcc1.txt" |
| 05:41:59 | <pokechu22> | Yeah, looks like that comes from https://www.mingpaocanada.com/tor/htm/News/20220815/HK-gaa1_r.htm containing a POST to /Tor/cfc/popular_addone.cfc with HK-GA/ga/gaa1.txt as a parameter |
| 05:42:44 | <nicolas17> | not sure how to avoid this, excluding *.txt feels too broad |
| 05:43:39 | <pokechu22> | It's probably fine to just leave them as-is since there's 1 per article and most articles have several images as well |
| 05:44:25 | <nicolas17> | well it's also following the redirect and saving errorpage.html every single time |
| 05:45:21 | <pokechu22> | Looks like that's not a new issue: https://web.archive.org/web/20260901000000*/https://www.mingpaocanada.com/errorpage.html :) |
| 05:45:36 | <pokechu22> | ... ok, though 10660 snapshots on January 16 is probably still excessive |
| 05:46:26 | <nicolas17> | pain |
| 05:50:09 | <pokechu22> | I guess I can check what dates it's already found using ab2f |
| 05:55:16 | <nicolas17> | I was thinking something like tor/htm/News/[0-9]{8}/[A-Z]+/[A-Z]+/[a-z]+[0-9]\.txt |
| 05:55:27 | <nicolas17> | but that's not exhaustive, will need a few more patterns |
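nicolas17's candidate ignore pattern above can be sanity-checked against the URLs from the discussion; a minimal sketch (the helper name is mine, and this is not an actual job ignore config):

```python
import re

# The pattern proposed above; not exhaustive, since some docPath values
# (e.g. "HK-GA/ga/gaa1.txt") contain hyphens and lowercase segments.
PATTERN = re.compile(r"tor/htm/News/[0-9]{8}/[A-Z]+/[A-Z]+/[a-z]+[0-9]\.txt")

def is_spurious_txt(url: str) -> bool:
    """Return True if the URL matches the garbage .txt shape seen in the job."""
    return PATTERN.search(url) is not None

# The example URL from the channel matches:
print(is_spurious_txt("https://www.mingpaocanada.com/tor/htm/News/20220815/TD/TD/tdc1.txt"))  # True
# ...but a path built from docPath "HK-GA/ga/gaa1.txt" does not,
# so a few more patterns would indeed be needed:
print(is_spurious_txt("https://www.mingpaocanada.com/tor/htm/News/20220815/HK-GA/ga/gaa1.txt"))  # False
```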
| 05:56:12 | | sec^nd quits [Remote host closed the connection] |
| 05:56:38 | | sec^nd (second) joins |
| 05:59:31 | | evergreen56 joins |
| 06:02:06 | | evergreen5 quits [Ping timeout: 256 seconds] |
| 06:02:06 | | evergreen56 is now known as evergreen5 |
| 06:04:34 | <pokechu22> | The oldest archivebot has found so far is http://www.mingpaocanada.com/TOR/htm/News/20220319/HK-GAindex_r.htm |
| 06:05:09 | <pokechu22> | JAA: can you trace http://www.mingpaocanada.com/TOR/htm/News/20220319/HK-GAindex_r.htm on 6cnonr9bi6enxit9na57wckan please? |
| 06:06:09 | <Webuser025436> | is the link i provided not good enough above? this link contains outlinks to all hk articles for a given day: https://www.mingpaocanada.com/tor/htm/News/20140710/HK-GAindex_r.htm |
| 06:07:00 | <pokechu22> | It is, but now I'm trying to figure out if archivebot will have already found those or if I need to start the job in a way that will discover those |
| 06:11:25 | <pokechu22> | (there isn't any good way to add urls to an existing archivebot job, but I could start a new one with a list of those pages for all days since 2014 or similar, along with https://www.mingpaocanada.com/van/htm/News/20260116/VAindex_r.htm and https://www.mingpaocanada.com/tor/htm/News/20260116/TAindex_r.htm) |
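The seed list described above (one HK index page per day since the earliest known edition) can be generated mechanically; a sketch, assuming the 20140710 start date given earlier and using the final-edition date as the cutoff:

```python
from datetime import date, timedelta

# One index page per day, following the URL scheme shown earlier:
# https://www.mingpaocanada.com/tor/htm/News/YYYYMMDD/HK-GAindex_r.htm
TEMPLATE = "https://www.mingpaocanada.com/tor/htm/News/{:%Y%m%d}/HK-GAindex_r.htm"

def daily_index_urls(start=date(2014, 7, 10), end=date(2026, 1, 16)):
    """Yield the daily HK article-index URL for every date in [start, end]."""
    d = start
    while d <= end:
        yield TEMPLATE.format(d)
        d += timedelta(days=1)

urls = list(daily_index_urls())
print(urls[0])  # https://www.mingpaocanada.com/tor/htm/News/20140710/HK-GAindex_r.htm
```

The same loop would work for the `van`/`tor` edition index pages mentioned above by swapping the template.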
| 06:14:47 | | chunkynutz60 joins |
| 06:17:24 | | nexussfan quits [Quit: Konversation terminated!] |
| 07:32:26 | | chrismeller3 (chrismeller) joins |
| 09:00:02 | | midou quits [Ping timeout: 256 seconds] |
| 09:20:45 | | midou joins |
| 09:27:48 | | midou quits [Ping timeout: 256 seconds] |
| 09:41:56 | | midou joins |
| 09:46:45 | | midou quits [Ping timeout: 272 seconds] |
| 09:52:45 | | midou joins |
| 10:23:29 | | midou quits [Ping timeout: 272 seconds] |
| 10:43:05 | <h2ibot> | KleaBot made 2 bot changes: https://wiki.archiveteam.org/index.php?title=Special:Contributions/KleaBot&offset=20260117104222&limit=2&namespace=2&wpfilters[]=nsInvert&wpfilters[]=associated |
| 10:43:34 | <klea> | mhmm |
| 10:44:14 | <klea> | oh i love that my terminal thinks the url is shorter. |
| 10:44:37 | <klea> | so my browser opened <https://wiki.archiveteam.org/index.php?title=Special:Contributions/KleaBot&offset=20260117104222&limit=2&namespace=2&wpfilters> instead |
| 10:45:30 | | Dada joins |
| 11:14:24 | | midou joins |
| 11:20:00 | <alexlehm> | the same happens in my irc client, it does not consider [] as a valid url character |
| 11:22:15 | <klea> | time to urlencode it, or wrap it in <> |
| 11:22:25 | <klea> | alexlehm: how does <https://wiki.archiveteam.org/index.php?title=Special:Contributions/KleaBot&offset=20260117104222&limit=2&namespace=2&wpfilters[]=nsInvert&wpfilters[]=associated> open? |
| 11:23:28 | <alexlehm> | "https://wiki.archiveteam.org/index.php?title=Special:Contributions/KleaBot&offset=20260117104222&limit=2&namespace=2&wpfilters" |
| 11:24:14 | <klea> | alexlehm: and KleaBot made 2 bot changes: https://wiki.archiveteam.org/index.php?title=Special:Contributions/KleaBot&offset=20260117104222&limit=2&namespace=2&wpfilters%5B%5D=nsInvert&wpfilters%5B%5D=associated |
| 11:24:42 | <alexlehm> | it would probably work with https://wiki.archiveteam.org/index.php?title=Special:Contributions/KleaBot&offset=20260117104222&limit=2&namespace=2&wpfilters%5B%5D=nsInvert&wpfilters%5B%5D=associated |
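The percent-encoded form klea pasted can be produced with the standard library; a sketch using `urllib.parse.quote` with a `safe` set that leaves the rest of the query string untouched:

```python
from urllib.parse import quote

# Square brackets aren't in many IRC clients' URL character sets, so
# percent-encoding them keeps the whole link clickable.
raw = ("https://wiki.archiveteam.org/index.php?title=Special:Contributions/KleaBot"
       "&offset=20260117104222&limit=2&namespace=2"
       "&wpfilters[]=nsInvert&wpfilters[]=associated")

# Keep the URL delimiters as-is; only [ and ] get encoded (as %5B and %5D).
encoded = quote(raw, safe=":/?&=%")
print(encoded)
```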
| 11:30:46 | | ArchivalEfforts quits [Ping timeout: 256 seconds] |
| 11:30:58 | | ArchivalEfforts joins |
| 11:31:51 | | midou quits [Read error: Connection reset by peer] |
| 11:34:08 | | HP_Archivist quits [Quit: Leaving] |
| 11:35:05 | <Juest> | hexchat processes the url fine |
| 11:38:51 | | Doomaholic quits [Ping timeout: 272 seconds] |
| 11:39:07 | <alexlehm> | : could also be a stop character |
| 11:48:57 | | Hackerpcs_1 (Hackerpcs) joins |
| 11:49:28 | | Hackerpcs quits [Ping timeout: 256 seconds] |
| 12:00:02 | | Bleo182600722719623455222 quits [Quit: The Lounge - https://thelounge.chat] |
| 12:02:48 | | Bleo182600722719623455222 joins |
| 12:11:18 | | midou joins |
| 12:13:41 | | szczot3k quits [Ping timeout: 272 seconds] |
| 12:13:43 | | szczot3k_ (szczot3k) joins |
| 12:14:23 | | szczot3k_ is now known as szczot3k |
| 12:18:32 | | szczot3k quits [Remote host closed the connection] |
| 12:18:58 | | szczot3k (szczot3k) joins |
| 12:21:28 | | Doomaholic (Doomaholic) joins |
| 12:30:51 | | HP_Archivist (HP_Archivist) joins |
| 12:31:55 | | midou quits [Read error: Connection reset by peer] |
| 12:38:49 | | szczot3k quits [Remote host closed the connection] |
| 12:41:18 | | szczot3k (szczot3k) joins |
| 12:45:07 | | szczot3k quits [Remote host closed the connection] |
| 12:46:57 | | Webuser989898 joins |
| 12:47:20 | | Webuser989898 quits [Client Quit] |
| 12:47:35 | | szczot3k (szczot3k) joins |
| 12:49:37 | | szczot3k quits [Remote host closed the connection] |
| 12:52:02 | | szczot3k (szczot3k) joins |
| 12:52:51 | | midou joins |
| 13:03:35 | | cyan_box quits [Read error: Connection reset by peer] |
| 13:04:21 | | midou quits [Ping timeout: 272 seconds] |
| 13:11:17 | | szczot3k_ (szczot3k) joins |
| 13:12:35 | | szczot3k quits [Ping timeout: 272 seconds] |
| 13:12:35 | | szczot3k_ is now known as szczot3k |
| 13:14:24 | | midou joins |
| 13:15:00 | | Marie0 joins |
| 13:17:00 | | szczot3k quits [Remote host closed the connection] |
| 13:19:26 | | szczot3k (szczot3k) joins |
| 13:21:38 | <Marie0> | Sorry for getting to this so late, but I think we should archive some Honduran government websites before the inauguration on the 27th. The current president is a leftist and the new one is a Trump ally promising all kinds of austerity measures, so I expect the web presence of the government will completely change fairly quickly |
| 13:21:45 | | midou quits [Read error: Connection reset by peer] |
| 13:21:50 | | nine quits [Quit: See ya!] |
| 13:22:03 | | nine joins |
| 13:22:03 | | nine is now authenticated as nine |
| 13:22:03 | | nine quits [Changing host] |
| 13:22:03 | | nine (nine) joins |
| 13:23:38 | <Marie0> | On the bright side, Honduras is a small country and their internet is actually pretty modern. I think if we start immediately, we could easily do this |
| 13:24:54 | | etnguyen03 (etnguyen03) joins |
| 13:30:18 | | szczot3k quits [Remote host closed the connection] |
| 13:32:44 | | szczot3k (szczot3k) joins |
| 13:34:51 | <Marie0> | I'm new here btw. I have some experience doing small scrapes with cURL but have never collaborated on one. I was initially going to do that for the Honduras thing, but I was in over my head. Huge fan of Archiveteam and your work! |
| 13:35:17 | | midou joins |
| 14:00:04 | <justauser> | Marie0: Do you have a list? |
| 14:13:58 | | midou quits [Ping timeout: 256 seconds] |
| 14:23:13 | | midou joins |
| 14:51:24 | | etnguyen03 quits [Client Quit] |
| 14:51:32 | | the joins |
| 15:05:09 | | nexussfan (nexussfan) joins |
| 15:05:13 | | etnguyen03 (etnguyen03) joins |
| 15:11:54 | | the quits [Client Quit] |
| 15:21:08 | <Marie0> | justauser: Sort of. There's a general directory of all government agencies that I extracted the links from, but it's just the front page of each agency. A lot of them run other websites for specific offices/services. These are usually conspicuously linked somewhere near the front page of the agency, but I haven't actually compiled them into a list |
| 15:21:10 | | nexussfan quits [Client Quit] |
| 15:22:31 | | nexussfan (nexussfan) joins |
| 15:27:17 | <klea> | Marie0: are they subdomains or are they subpages of the domain? |
| 15:27:29 | | IDK quits [Quit: Updating details, brb] |
| 15:28:59 | | IDK (IDK) joins |
| 15:29:49 | <h2ibot> | Klea edited Dealing with Cloudflare (+28, /* Scenario 2 - TLS fingerprinting */…): https://wiki.archiveteam.org/?diff=60214&oldid=58744 |
| 15:31:02 | | midou quits [Ping timeout: 256 seconds] |
| 15:37:39 | | DogsRNice joins |
| 15:38:16 | | szczot3k quits [Remote host closed the connection] |
| 15:38:35 | <Marie0> | klea: In some cases neither. For example, presidencia.gob.hn is obviously right there, but it has a section "periódico impreso" that links to poderpopular.hn, which isn't on the list but is still run by the government, and since it's straight up party propaganda it'll almost certainly be taken down |
| 15:39:30 | <klea> | aaaaa, that'd require getting all sublinks and archiving more content |
| 15:40:25 | <klea> | i guess it's possible by dumping the dbs after having made an AB job with all initial domains, and getting more domains, is it so justauser, or is that a bad idea? |
| 15:40:42 | | szczot3k (szczot3k) joins |
| 15:40:59 | <Juest> | try grabbing the sitemaps? |
| 15:41:06 | | ATinySpaceMarine quits [Quit: https://quassel-irc.org - Chat comfortably. Anywhere.] |
| 15:41:27 | <Marie0> | There's not that many root links to begin with so it's not THAT bad. I think I could do it by hand in an afternoon if needed. |
| 15:42:09 | <Marie0> | Here's the list btw: https://transfer.archivete.am/xbYNo/websites.txt |
| 15:42:10 | <eggdrop> | inline (for browser viewing): https://transfer.archivete.am/inline/xbYNo/websites.txt |
| 15:42:38 | <Marie0> | Ah, thank you |
| 15:43:04 | | ATinySpaceMarine joins |
| 15:44:58 | <justauser> | Started jobs for the two you mentioned first for now. |
| 15:45:40 | <justauser> | Some pages return normal contents but with HTTP 500. |
| 15:45:52 | <klea> | !kfind protocol incompliant |
| 15:45:53 | <eggdrop> | [karma] 'protocol incompliant' not found. |
| 15:45:59 | <klea> | shitty :( |
| 15:46:12 | <justauser> | Example https://www.poderpopular.hn/vernoticias.php?id_noticia=25018 |
| 15:46:49 | <klea> | > Expired website :( < https://www.partidoliberal.hn/ |
| 15:48:03 | <Marie0> | justauser: idk but it's working fine in my browser |
| 15:48:17 | <klea> | > Domain for sale: https://partidonacional.hn/ |
| 15:48:17 | <klea> | Unable to resolve: https://mamsurpazhn.com/ |
| 15:48:18 | <klea> | 500 https://fonac.hn/ |
| 15:48:25 | <justauser> | Exactly. It works fine while claiming an error on lower level. |
| 15:48:39 | <justauser> | It confuses our machinery. |
| 15:48:49 | <klea> | 403 https://portalunico.iaip.gob.hn/ |
| 15:48:59 | <klea> | Loading page even with js on? https://www.dpr.gob.hn/ |
| 15:49:05 | <klea> | maybe we should make a channel for this? |
| 15:49:14 | <justauser> | That's #vooterbooter |
| 15:49:16 | <klea> | oh |
| 15:49:18 | <klea> | sorry |
| 15:49:55 | <justauser> | Tries to load script from some CDN and fails. |
| 15:50:42 | <klea> | i've moved my part of the discussion there. |
| 15:59:38 | | alexlehm quits [Remote host closed the connection] |
| 15:59:41 | | alexlehm (alexlehm) joins |
| 16:01:08 | | midou joins |
| 16:07:08 | <nulldata> | https://learn.microsoft.com/en-us/troubleshoot/mem/configmgr/mdt/mdt-retirement |
| 16:07:32 | <nulldata> | "MDT download packages might be removed or deprecated from official distribution channels." |
| 16:14:03 | <nulldata> | Here's an old version https://www.microsoft.com/en-us/download/details.aspx?id=57917 |
| 16:14:16 | <nulldata> | Newest one seems to be pulled already |
| 16:17:32 | | midou quits [Read error: Connection reset by peer] |
| 16:20:20 | | BluRaf quits [Quit: WeeChat 3.8] |
| 16:20:25 | | BluRaf (BluRaf) joins |
| 16:22:32 | | Dada quits [Remote host closed the connection] |
| 16:22:45 | | Dada joins |
| 16:26:38 | | midou joins |
| 16:39:03 | | midou quits [Ping timeout: 272 seconds] |
| 16:48:02 | | midou joins |
| 16:58:26 | | deafmute joins |
| 17:03:35 | <deafmute> | Hello everyone. Are there any plans to archive cosplay.com galleries? The site has been undead and dysfunctional for a long time |
| 17:03:51 | | Marie0 quits [Quit: Ooops, wrong browser tab.] |
| 17:08:03 | <h2ibot> | Klea created ArchiveBot/2025 Honduran General Election/list (+2609, Created page with "https://congresonacional.hn/…): https://wiki.archiveteam.org/?title=ArchiveBot/2025%20Honduran%20General%20Election/list |
| 17:39:40 | <that_lurker> | https://console.cloud.google.com/storage/browser/net-ntlmv1-tables/tables;tab=objects?pageState=(%22StorageObjectListTable%22:(%22f%22:%22%255B%255D%22))&prefix=&forceOnObjectsSortingFiltering=false |
| 17:40:30 | <that_lurker> | "Google just dropped 1.1 QUADRILLION pre-computed passwords (aka rainbowtable) for NetNTLMv1." |
| 17:40:41 | <that_lurker> | so about 8,6TB |
| 17:40:46 | <that_lurker> | ref. https://www.linkedin.com/posts/benjamin-iheukumere_google-just-dropped-11-quadrillion-pre-computed-activity-7418215510380802048-9NIK/ |
| 17:41:23 | | oxtyped quits [Read error: Connection reset by peer] |
| 17:41:35 | | oxtyped joins |
| 17:42:01 | <Hans5958> | That LinkedIn post is so engagement bait-y |
| 17:42:02 | <Hans5958> | https://cloud.google.com/blog/topics/threat-intelligence/net-ntlmv1-deprecation-rainbow-tables |
| 17:42:32 | <Hans5958> | https://x.com/Mandiant/status/2012268623662874906 |
| 17:42:32 | <eggdrop> | nitter: https://nitter.net/Mandiant/status/2012268623662874906 |
| 17:43:04 | <that_lurker> | ohh nice did not know there was an article on that |
| 17:43:28 | <that_lurker> | I don't have enough space to download all of that and then push it to IA, but someone here might :-P |
| 18:17:35 | <katia> | 👀 |
| 18:18:32 | | Ajay quits [Quit: Bridge terminating on SIGTERM] |
| 18:18:32 | | @Sanqui|m quits [Quit: Bridge terminating on SIGTERM] |
| 18:18:32 | | britmob|m quits [Quit: Bridge terminating on SIGTERM] |
| 18:18:33 | | anon00001|m quits [Quit: Bridge terminating on SIGTERM] |
| 18:18:33 | | xxia|m quits [Quit: Bridge terminating on SIGTERM] |
| 18:18:33 | | mind_combatant quits [Quit: Bridge terminating on SIGTERM] |
| 18:18:33 | | x9fff00 quits [Quit: Bridge terminating on SIGTERM] |
| 18:18:33 | | DigitalDragon quits [Quit: Bridge terminating on SIGTERM] |
| 18:18:33 | | gamer191-1|m quits [Quit: Bridge terminating on SIGTERM] |
| 18:18:33 | | theblazehen|m quits [Quit: Bridge terminating on SIGTERM] |
| 18:18:33 | | igneousx quits [Quit: Bridge terminating on SIGTERM] |
| 18:18:34 | | audrooku|m quits [Quit: Bridge terminating on SIGTERM] |
| 18:18:34 | | alexshpilkin quits [Quit: Bridge terminating on SIGTERM] |
| 18:18:34 | | flashfire42|m quits [Quit: Bridge terminating on SIGTERM] |
| 18:18:34 | | Minkafighter|m quits [Quit: Bridge terminating on SIGTERM] |
| 18:18:34 | | tomodachi94 quits [Quit: Bridge terminating on SIGTERM] |
| 18:18:34 | | Vokun quits [Quit: Bridge terminating on SIGTERM] |
| 18:18:35 | | Tom|m1 quits [Quit: Bridge terminating on SIGTERM] |
| 18:18:35 | | Hans5958 quits [Quit: Bridge terminating on SIGTERM] |
| 18:18:35 | | mpeter|m quits [Quit: Bridge terminating on SIGTERM] |
| 18:18:35 | | rewby|m quits [Quit: Bridge terminating on SIGTERM] |
| 18:18:35 | | MaxG quits [Quit: Bridge terminating on SIGTERM] |
| 18:18:35 | | Exorcism quits [Quit: Bridge terminating on SIGTERM] |
| 18:18:35 | | nyuuzyou quits [Quit: Bridge terminating on SIGTERM] |
| 18:18:35 | | masterx244|m quits [Quit: Bridge terminating on SIGTERM] |
| 18:18:35 | | nstrom|m quits [Quit: Bridge terminating on SIGTERM] |
| 18:18:36 | | ram|m quits [Quit: Bridge terminating on SIGTERM] |
| 18:18:36 | | cruller quits [Quit: Bridge terminating on SIGTERM] |
| 18:18:36 | | aaq|m quits [Quit: Bridge terminating on SIGTERM] |
| 18:18:36 | | MinePlayersPEMyNey|m quits [Quit: Bridge terminating on SIGTERM] |
| 18:18:36 | | Thibaultmol quits [Quit: Bridge terminating on SIGTERM] |
| 18:18:37 | | Fletcher quits [Quit: Bridge terminating on SIGTERM] |
| 18:18:39 | | akaibu|m quits [Quit: Bridge terminating on SIGTERM] |
| 18:18:40 | | tech234a quits [Quit: Bridge terminating on SIGTERM] |
| 18:18:41 | | Ruk8 quits [Quit: Bridge terminating on SIGTERM] |
| 18:18:41 | | hlgs|m quits [Quit: Bridge terminating on SIGTERM] |
| 18:18:41 | | EvanBoehs|m quits [Quit: Bridge terminating on SIGTERM] |
| 18:18:42 | | Alienmaster|m quits [Quit: Bridge terminating on SIGTERM] |
| 18:18:42 | | schwarzkatz|m quits [Quit: Bridge terminating on SIGTERM] |
| 18:18:43 | | moe-a-m|m quits [Quit: Bridge terminating on SIGTERM] |
| 18:18:43 | | vexr quits [Quit: Bridge terminating on SIGTERM] |
| 18:18:43 | | andrewvieyra|m quits [Quit: Bridge terminating on SIGTERM] |
| 18:18:43 | | yzqzss quits [Quit: Bridge terminating on SIGTERM] |
| 18:18:44 | | that_lurker|m quits [Quit: Bridge terminating on SIGTERM] |
| 18:18:44 | | GhostIsBeHere|m quits [Quit: Bridge terminating on SIGTERM] |
| 18:18:44 | | mikolaj|m quits [Quit: Bridge terminating on SIGTERM] |
| 18:18:45 | | upperbody321|m quits [Quit: Bridge terminating on SIGTERM] |
| 18:19:30 | | finalti|m quits [Quit: Bridge terminating on SIGTERM] |
| 18:19:30 | | tech234a|m-backup quits [Quit: Bridge terminating on SIGTERM] |
| 18:19:30 | | Roki_100|m quits [Quit: Bridge terminating on SIGTERM] |
| 18:19:30 | | e2mau|m quits [Quit: Bridge terminating on SIGTERM] |
| 18:19:30 | | qyxojzh|m quits [Quit: Bridge terminating on SIGTERM] |
| 18:19:30 | | victor_vaughn|m quits [Quit: Bridge terminating on SIGTERM] |
| 18:19:30 | | th3z0l4|m quits [Quit: Bridge terminating on SIGTERM] |
| 18:19:31 | | coro quits [Quit: Bridge terminating on SIGTERM] |
| 18:19:31 | | triplecamera|m quits [Quit: Bridge terminating on SIGTERM] |
| 18:19:31 | | thermospheric quits [Quit: Bridge terminating on SIGTERM] |
| 18:19:31 | | phaeton quits [Quit: Bridge terminating on SIGTERM] |
| 18:19:31 | | wrangle|m quits [Quit: Bridge terminating on SIGTERM] |
| 18:19:31 | | s-crypt|m|m quits [Quit: Bridge terminating on SIGTERM] |
| 18:19:31 | | hexagonwin|m quits [Quit: Bridge terminating on SIGTERM] |
| 18:19:31 | | froxcey|m quits [Quit: Bridge terminating on SIGTERM] |
| 18:19:31 | | Video quits [Quit: Bridge terminating on SIGTERM] |
| 18:19:31 | | iCesenberk|m quits [Quit: Bridge terminating on SIGTERM] |
| 18:19:31 | | madpro|m quits [Quit: Bridge terminating on SIGTERM] |
| 18:19:39 | | GRBaset quits [Quit: Bridge terminating on SIGTERM] |
| 18:19:39 | | pannekoek11|m quits [Quit: Bridge terminating on SIGTERM] |
| 18:19:39 | | superusercode quits [Quit: Bridge terminating on SIGTERM] |
| 18:19:39 | | octylFractal|m quits [Quit: Bridge terminating on SIGTERM] |
| 18:19:39 | | lasdkfj|m quits [Quit: Bridge terminating on SIGTERM] |
| 18:19:39 | | CrispyAlice2 quits [Quit: Bridge terminating on SIGTERM] |
| 18:19:39 | | jevinskie quits [Quit: Bridge terminating on SIGTERM] |
| 18:19:39 | | jackt1365|m quits [Quit: Bridge terminating on SIGTERM] |
| 18:19:39 | | jwoglom|m quits [Quit: Bridge terminating on SIGTERM] |
| 18:19:39 | | noobirc|m quits [Quit: Bridge terminating on SIGTERM] |
| 18:19:39 | | cmostracker|m quits [Quit: Bridge terminating on SIGTERM] |
| 18:19:39 | | Cydog|m quits [Quit: Bridge terminating on SIGTERM] |
| 18:19:40 | | ragu|m quits [Quit: Bridge terminating on SIGTERM] |
| 18:19:40 | | nosamu|m quits [Quit: Bridge terminating on SIGTERM] |
| 18:19:40 | | joepie91|m quits [Quit: Bridge terminating on SIGTERM] |
| 18:19:40 | | spearcat|m quits [Quit: Bridge terminating on SIGTERM] |
| 18:19:40 | | ax|m quits [Quit: Bridge terminating on SIGTERM] |
| 18:19:40 | | katia|m quits [Quit: Bridge terminating on SIGTERM] |
| 18:19:40 | | l0rd_enki|m quits [Quit: Bridge terminating on SIGTERM] |
| 18:19:40 | | bogsen quits [Quit: Bridge terminating on SIGTERM] |
| 18:19:40 | | Adamvoltagex|m quits [Quit: Bridge terminating on SIGTERM] |
| 18:19:40 | | osiride|m quits [Quit: Bridge terminating on SIGTERM] |
| 18:19:40 | | trumad|m quits [Quit: Bridge terminating on SIGTERM] |
| 18:19:40 | | vics quits [Quit: Bridge terminating on SIGTERM] |
| 18:19:40 | | NickS|m quits [Quit: Bridge terminating on SIGTERM] |
| 18:19:40 | | v1cs quits [Quit: Bridge terminating on SIGTERM] |
| 18:19:40 | | its_notjack quits [Quit: Bridge terminating on SIGTERM] |
| 18:19:40 | | haha-whered-it-go|m quits [Quit: Bridge terminating on SIGTERM] |
| 18:19:40 | | supermariofan67|m quits [Quit: Bridge terminating on SIGTERM] |
| 18:19:40 | | Fijxu|m quits [Quit: Bridge terminating on SIGTERM] |
| 18:19:40 | | nano412510 quits [Quit: Bridge terminating on SIGTERM] |
| 18:19:40 | | mister_x quits [Quit: Bridge terminating on SIGTERM] |
| 18:19:40 | | Misty|m quits [Quit: Bridge terminating on SIGTERM] |
| 18:19:40 | | Valkum|m quits [Quit: Bridge terminating on SIGTERM] |
| 18:19:40 | | kaz__|m quits [Quit: Bridge terminating on SIGTERM] |
| 18:19:40 | | rain|m quits [Quit: Bridge terminating on SIGTERM] |
| 18:19:40 | | mat|m1 quits [Quit: Bridge terminating on SIGTERM] |
| 18:19:40 | | miksters|m quits [Quit: Bridge terminating on SIGTERM] |
| 18:19:40 | | PhoHale|m quits [Quit: Bridge terminating on SIGTERM] |
| 18:19:40 | | will|m quits [Quit: Bridge terminating on SIGTERM] |
| 18:19:40 | | yetanotherarchiver|m quits [Quit: Bridge terminating on SIGTERM] |
| 18:19:40 | | Passiing|m quits [Quit: Bridge terminating on SIGTERM] |
| 18:19:40 | | username675f|m quits [Quit: Bridge terminating on SIGTERM] |
| 18:19:40 | | Tyrasuki|m quits [Quit: Bridge terminating on SIGTERM] |
| 18:19:40 | | gareth48|m quits [Quit: Bridge terminating on SIGTERM] |
| 18:19:40 | | Nulo|m quits [Quit: Bridge terminating on SIGTERM] |
| 18:19:40 | | nightpool quits [Quit: Bridge terminating on SIGTERM] |
| 18:19:40 | | noxious quits [Quit: Bridge terminating on SIGTERM] |
| 18:19:40 | | ampdot|m quits [Quit: Bridge terminating on SIGTERM] |
| 18:19:40 | | Claire|m quits [Quit: Bridge terminating on SIGTERM] |
| 18:19:41 | | saouroun|m quits [Quit: Bridge terminating on SIGTERM] |
| 18:19:41 | | hillow596|m quits [Quit: Bridge terminating on SIGTERM] |
| 18:19:41 | | yarnover|m quits [Quit: Bridge terminating on SIGTERM] |
| 18:19:41 | | Cronfox|m quits [Quit: Bridge terminating on SIGTERM] |
| 18:19:41 | | Joy|m quits [Quit: Bridge terminating on SIGTERM] |
| 18:19:41 | | IceCodeNew|m quits [Quit: Bridge terminating on SIGTERM] |
| 18:19:42 | | M--mlv|m quits [Quit: Bridge terminating on SIGTERM] |
| 18:19:42 | | justauser|m quits [Quit: Bridge terminating on SIGTERM] |
| 18:20:25 | <klea> | that_lurker: just don't download it :p, slowly download chunks and do chunked uploads to IA :p |
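klea's relay idea (never holding the whole 8.6 TB locally, just streaming chunks through) boils down to pairing HTTP Range reads with chunked uploads; a minimal sketch of the range arithmetic only, with made-up sizes and no actual transfer code:

```python
def byte_ranges(total_size: int, chunk_size: int):
    """Yield (start, end) pairs covering total_size bytes, each usable
    as an HTTP request header like 'Range: bytes=start-end' (inclusive)."""
    for start in range(0, total_size, chunk_size):
        end = min(start + chunk_size, total_size) - 1
        yield start, end

# Each pair would drive one download request, whose body is then
# uploaded before the next chunk is fetched, keeping local storage
# bounded by chunk_size rather than total_size.
print(list(byte_ranges(10, 4)))  # [(0, 3), (4, 7), (8, 9)]
```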
| 18:20:38 | | rewby|m (rewby) joins |
| 18:20:38 | | @ChanServ sets mode: +o rewby|m |
| 18:22:04 | <that_lurker> | :-P |
| 18:22:17 | <that_lurker> | If someone does that please request a collection for it first :-) |
| 18:22:28 | | ram|m joins |
| 18:22:28 | | pannekoek11|m joins |
| 18:22:28 | | MinePlayersPEMyNey|m joins |
| 18:22:28 | | Ruk8 (Ruk8) joins |
| 18:22:41 | <that_lurker> | and maybe get arkiver or IA's ok to dump that much data :-P |
| 18:22:55 | | Cronfox|m joins |
| 18:22:55 | | xxia|m joins |
| 18:22:55 | | andrewvieyra|m joins |
| 18:22:55 | | nstrom|m joins |
| 18:23:37 | | Dada quits [Remote host closed the connection] |
| 18:23:49 | | Dada joins |
| 18:23:51 | | saouroun|m joins |
| 18:23:51 | | haha-whered-it-go|m joins |
| 18:23:52 | | schwarzkatz|m joins |
| 18:23:52 | | jackt1365|m joins |
| 18:23:52 | | x9fff00 (x9fff00) joins |
| 18:23:53 | | madpro|m joins |
| 18:23:53 | | DigitalDragon joins |
| 18:23:53 | | ragu|m joins |
| 18:23:53 | | audrooku|m joins |
| 18:23:53 | | Ajay joins |
| 18:23:53 | | igneousx (igneousx) joins |
| 18:23:53 | | akaibu|m joins |
| 18:23:53 | | masterx244|m (masterx244|m) joins |
| 18:24:01 | | superusercode joins |
| 18:24:01 | | Cydog|m joins |
| 18:24:01 | | trumad|m joins |
| 18:24:01 | | cmostracker|m joins |
| 18:24:01 | | aaq|m joins |
| 18:24:01 | | mikolaj|m joins |
| 18:24:01 | | upperbody321|m joins |
| 18:24:01 | | nyuuzyou joins |
| 18:24:01 | | spearcat|m joins |
| 18:24:01 | | hexagonwin|m joins |
| 18:24:01 | | th3z0l4|m joins |
| 18:24:01 | | victor_vaughn|m joins |
| 18:24:01 | | ax|m joins |
| 18:24:01 | | e2mau|m joins |
| 18:24:01 | | moe-a-m|m joins |
| 18:24:01 | | Sanqui|m (Sanqui) joins |
| 18:24:01 | | joepie91|m joins |
| 18:24:01 | | tech234a|m-backup (tech234a) joins |
| 18:24:01 | | mpeter|m joins |
| 18:24:01 | | @ChanServ sets mode: +o Sanqui|m |
| 18:24:01 | | jwoglom|m joins |
| 18:24:01 | | thermospheric joins |
| 18:24:01 | | theblazehen|m joins |
| 18:24:01 | | britmob|m joins |
| 18:24:01 | | octylFractal|m joins |
| 18:24:01 | | GRBaset (GRBaset) joins |
| 18:24:01 | | Minkafighter|m joins |
| 18:24:01 | | NickS|m joins |
| 18:24:01 | | hlgs|m joins |
| 18:24:01 | | Roki_100|m joins |
| 18:24:01 | | lasdkfj|m joins |
| 18:24:01 | | finalti|m joins |
| 18:24:01 | | Thibaultmol joins |
| 18:24:01 | | Fletcher (Fletcher) joins |
| 18:24:01 | | MaxG joins |
| 18:24:01 | | yzqzss (yzqzss) joins |
| 18:24:01 | | CrispyAlice2 joins |
| 18:24:01 | | jevinskie joins |
| 18:24:01 | | wrangle|m joins |
| 18:24:01 | | vexr joins |
| 18:24:01 | | nosamu|m joins |
| 18:24:01 | | noobirc|m joins |
| 18:24:01 | | phaeton (phaeton) joins |
| 18:24:01 | | coro joins |
| 18:24:01 | | alexshpilkin joins |
| 18:24:01 | | v1cs joins |
| 18:24:01 | | flashfire42|m (flashfire42) joins |
| 18:24:01 | | EvanBoehs|m joins |
| 18:24:01 | | iCesenberk|m joins |
| 18:24:01 | | qyxojzh|m joins |
| 18:24:01 | | s-crypt|m|m joins |
| 18:24:01 | | l0rd_enki|m joins |
| 18:24:01 | | Tom|m1 joins |
| 18:24:01 | | gamer191-1|m joins |
| 18:24:01 | | that_lurker|m joins |
| 18:24:01 | | bogsen (bogsen) joins |
| 18:24:02 | | Vokun (Vokun) joins |
| 18:24:02 | | supermariofan67|m joins |
| 18:24:02 | | vics joins |
| 18:24:02 | | GhostIsBeHere|m joins |
| 18:24:02 | | Adamvoltagex|m joins |
| 18:24:02 | | anon00001|m joins |
| 18:24:02 | | cruller joins |
| 18:24:02 | | tomodachi94 (tomodachi94) joins |
| 18:24:02 | | triplecamera|m joins |
| 18:24:02 | | froxcey|m joins |
| 18:24:02 | | tech234a (tech234a) joins |
| 18:24:02 | | katia|m joins |
| 18:24:02 | | Fijxu|m joins |
| 18:24:02 | | Video joins |
| 18:24:02 | | its_notjack (its_notjack) joins |
| 18:24:02 | | osiride|m joins |
| 18:24:02 | | Hans5958 joins |
| 18:24:02 | | justauser|m joins |
| 18:24:02 | | mind_combatant (mind_combatant) joins |
| 18:24:02 | | Alienmaster|m joins |
| 18:24:02 | | Exorcism (exorcism) joins |
| 18:24:18 | <@arkiver> | !remindme 8h that_lurker password dump google |
| 18:24:18 | <eggdrop> | [remind] ok, i'll remind you at 2026-01-18T02:24:18Z |
| 18:25:27 | | username675f|m joins |
| 18:26:17 | | mister_x joins |
| 18:26:18 | | PhoHale|m joins |
| 18:26:19 | | yetanotherarchiver|m joins |
| 18:26:19 | | hillow596|m joins |
| 18:26:19 | | Joy|m joins |
| 18:26:19 | | nano412510 (nano412510) joins |
| 18:26:19 | | Tyrasuki|m joins |
| 18:26:20 | | yarnover|m joins |
| 18:26:20 | | mat|m1 joins |
| 18:26:20 | | gareth48|m joins |
| 18:26:20 | | kaz__|m joins |
| 18:26:20 | | Passiing|m joins |
| 18:26:20 | | noxious joins |
| 18:26:22 | | will|m joins |
| 18:26:22 | | Misty|m joins |
| 18:26:22 | | Nulo|m joins |
| 18:26:22 | | ampdot|m joins |
| 18:26:22 | | Valkum|m joins |
| 18:26:22 | | Claire|m joins |
| 18:26:22 | | IceCodeNew|m joins |
| 18:26:22 | | rain|m joins |
| 18:26:22 | | nightpool (nightpool) joins |
| 18:26:23 | | miksters|m joins |
| 18:30:11 | <klea> | that_lurker: nah, it seems IA has tooling to move things from search results to collections (or given item ids put them in collections), i suppose that's why archiveteam_inbox exists. |
| 18:30:43 | <klea> | oh wait, will arkiver store the dump on ia themselves? |
| 18:31:31 | <that_lurker> | It's easier to just push to a collection directly than ask to move items later also can let them know how much data there will be :-P |
| 18:31:38 | <klea> | oh |
| 18:31:52 | <that_lurker> | them = a_rkive :-P |
| 18:32:14 | <klea> | smh https://console.cloud.google.com/storage/browser/net-ntlmv1-tables requires login :( |
| 18:34:19 | | Gadelhas562873784438 joins |
| 18:34:45 | <klea> | it seems to be split into 2.1G chunks: https://console.cloud.google.com/storage/browser/net-ntlmv1-tables/tables?pageState=(%22StorageObjectListTable%22:(%22f%22:%22%255B%255D%22)) |
| 18:35:37 | | M--mlv|m joins |
| 18:36:18 | <klea> | that_lurker: do you know the total size? |
| 18:36:39 | <that_lurker> | 8.6TB was mentioned in a few places |
| 18:37:03 | <klea> | ouch |
| 18:39:28 | <klea> | i wonder if IA could lend me access to FOS. :p http://fos.textfiles.com/pipeline.html#:~:text=%2Fdev%2Fmd1%2013T%202%2E2T%2011T%2018%25%20%2F2 |
| 18:39:38 | <Juest> | is there a direct contact with internet archive around here by chance? |
| 18:39:47 | <klea> | Juest: wdym by DC? |
| 18:39:48 | <klea> | oh also |
| 18:40:03 | <justauser> | #internetarchive has an employee. |
| 18:40:19 | <justauser> | But that's only for small bug reports and status updates. |
| 18:40:36 | <klea> | yeah |
| 18:40:52 | <Juest> | archive team is independent and not affiliated with IA, but works very closely with them, despite not seeming like much? |
| 18:41:03 | <Juest> | excuse my weirdness |
| 18:41:41 | <klea> | kind of |
| 18:42:07 | <justauser> | AT founder is an IA employee. |
| 18:42:24 | <klea> | yes |
| 18:42:29 | <justauser> | Current team lead has some extended access. |
| 18:42:31 | <Juest> | also i very much prefer chat, so apologies for not researching much with the wiki |
| 18:42:54 | <klea> | btw, afaik AT is more of an IRC-first, docs-later kind of place, from what i've seen. |
| 18:48:10 | | etnguyen03 quits [Quit: Konversation terminated!] |
| 18:55:19 | <@JAA> | pokechu22: https://transfer.archivete.am/inline/WB59p/6cnonr9bi6enxit9na57wckan-trace.txt |
| 18:56:38 | | colla quits [Quit: StoCa!] |
| 18:58:27 | | colla joins |
| 18:59:35 | <klea> | wait |
| 18:59:38 | <klea> | google's bucket is s3 |
| 18:59:53 | <klea> | compatible. |
| 19:00:09 | <klea> | so, what happens if we ask someone in #archivebot to queue it |
| 19:06:15 | <klea> | fun idea |
| 19:06:25 | <klea> | project specifically for archival of s3-like buckets. |
| 19:08:24 | <nicolas17> | what's the size limit of an IA item? |
| 19:10:03 | <justauser> | 1TB unless they changed it and forgot to tell. |
| 19:10:07 | <klea> | nicolas17: https://irclogs.archivete.am/internetarchive/2026-01-07#l43630241 |
| 19:10:30 | <nicolas17> | klea: google cloud console needs login |
| 19:10:34 | <nicolas17> | but the bucket is open https://storage.googleapis.com/net-ntlmv1-tables |
| 19:10:39 | <klea> | yeah, i noticed later. |
| 19:10:58 | <nicolas17> | having a # in the filename is insane |
| 19:11:42 | <klea> | that's the %23 right? |
| 19:11:59 | <nicolas17> | yes |
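The open listing endpoint mentioned above (`https://storage.googleapis.com/net-ntlmv1-tables`) returns S3-style `ListBucketResult` XML, which is what a bucket lister has to parse. A minimal, namespace-agnostic parser sketch; the endpoint is from the discussion, and the pagination note reflects standard S3/GCS XML-API behaviour (up to 1000 keys per page, continued via `?marker=`):

```python
import xml.etree.ElementTree as ET

def parse_listing(xml_text):
    """Extract (key, size) pairs from an S3/GCS ListBucketResult document.
    Tag names are matched by suffix so it works whether or not the provider
    emits an XML namespace on the root element."""
    entries = []
    for elem in ET.fromstring(xml_text).iter():
        if elem.tag.endswith("Contents"):
            key = size = None
            for field in elem:
                if field.tag.endswith("Key"):
                    key = field.text
                elif field.tag.endswith("Size"):
                    size = int(field.text)
            entries.append((key, size))
    return entries
```

To walk the whole bucket, fetch the listing URL, parse it, and re-request with `?marker=<last key>` until the response's `IsTruncated` element is `false`.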
| 19:12:16 | <klea> | also, im stupid |
| 19:12:23 | <klea> | s3 exposes two paths per item. |
| 19:12:46 | | etnguyen03 (etnguyen03) joins |
| 19:12:52 | <klea> | so how do we get WBM to load the same warc files later for net-ntlmv1-tables.storage.googleapis.com/? |
| 19:13:19 | <nicolas17> | oh ew |
| 19:14:18 | <justauser> | No way? No way! |
| 19:14:25 | <klea> | ? |
| 19:15:09 | <nicolas17> | using other tools (like wget-lua) you can produce deduplicated WARCs, but the whole point of using archivebot would have been to avoid downloading+uploading yourself |
| 19:15:56 | <klea> | nicolas17: i believe we can later churn out some warcs that just contain deduplication entries? |
| 19:17:16 | <justauser> | Producing WARCs based on unproven guesses is discouraged. |
| 19:17:23 | <klea> | yeah :p |
| 19:17:38 | <@JAA> | s/discouraged/illegal/ |
| 19:17:48 | <klea> | justauser: im thinking of a separate idea |
| 19:17:54 | <klea> | someone uploads all those warcs to ia |
| 19:18:06 | <klea> | wait, AB doesn't yet support deduplication between jobs right? |
| 19:18:16 | <@JAA> | AB doesn't dedupe at all. |
| 19:18:20 | <klea> | oh |
| 19:18:21 | <klea> | lovely :( |
| 19:19:11 | <nicolas17> | if AB supported deduplication (even if only within a job), we could just give it https://storage.googleapis.com/net-ntlmv1-tables/whatever and https://net-ntlmv1-tables.storage.googleapis.com/whatever in the same job |
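The two URL forms nicolas17 pairs up above follow a mechanical pattern: S3-compatible stores expose each object both path-style (bucket in the path) and virtual-hosted-style (bucket in the hostname). A small sketch for generating both, with a hypothetical helper name:

```python
def s3_url_pair(host, bucket, key):
    """Return the (path-style, virtual-hosted-style) URL pair for one object.
    Both URLs serve the same bytes on S3-compatible storage."""
    return (
        f"https://{host}/{bucket}/{key}",        # path-style
        f"https://{bucket}.{host}/{key}",        # virtual-hosted-style
    )

pair = s3_url_pair("storage.googleapis.com", "net-ntlmv1-tables", "LICENSE")
```

Feeding both members of each pair into the same job is what would let a digest-based deduplicator collapse the second fetch into a revisit record.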
| 19:19:19 | <klea> | i wonder how much time it'd take to make tooling to archive s3-like buckets, including but not limited to storage.googleapis.com, s3.amazonaws.com, s3.archive.org, etc. |
| 19:19:45 | <justauser> | little-things has a lister, and it's usually sufficient. |
| 19:19:49 | <klea> | oh wait, archiving IA inside of IA is not a good idea, and probably illegal. |
| 19:19:50 | | colla quits [Client Quit] |
| 19:19:53 | <@JAA> | I have tooling for the listing, and the rest is just URLs. |
| 19:20:07 | <klea> | nono, i mean having a DPoS project for those :p |
| 19:20:23 | <@JAA> | Mhm |
| 19:20:24 | <klea> | that would deduplicate given some config that'd make it get two files, and dedupe |
| 19:20:43 | <nicolas17> | I'd rather have a more generic DPoS job for dedup purposes |
| 19:20:52 | | klea nods |
| 19:21:11 | <klea> | i have a silly idea |
| 19:21:18 | <klea> | but it's probably still illegal. |
| 19:21:54 | <klea> | things provide etag headers |
| 19:21:56 | | colla joins |
| 19:22:33 | <klea> | so if you do a request to check the etag, and it tells you it's not been modified since, can you assume it's the same file (supposing it's serviced by the same provider)? |
| 19:23:48 | <katia> | No. |
| 19:24:03 | <klea> | why? |
| 19:24:37 | <nicolas17> | some cloud storage systems provide hashes in the listing, or sometimes we can just "know" (or guess) in some other way that multiple URLs will produce the same file |
| 19:24:41 | <katia> | Assuming makes an ass out of u and me |
| 19:24:49 | <klea> | i guess because I didn't ask the server for the original files, so i can't verify |
| 19:25:10 | <katia> | WARCs are raw data that goes over the wire. Not assumptions based on a header |
| 19:25:47 | <klea> | wait, another silly and stupid question, if i were to actually do real requests for the files, get the other warc, and notice they are duplicates (for whichever files are duplicates, hopefully all), could i then write warc duplication records for that other warc that's in another archive.org item? |
| 19:26:08 | <nicolas17> | what we need is a DPoS project where the worker can be told to download https://net-ntlmv1-tables.storage.googleapis.com/LICENSE and https://storage.googleapis.com/net-ntlmv1-tables/LICENSE as part of the same item, and dedup if the data is the same (it already does that dedup) |
| 19:26:31 | | klea agrees on what nicolas17 just said. |
| 19:26:51 | <nicolas17> | we can't shove a dozen "these are probably the same file" URLs into the item name though, so it could be trickier |
| 19:27:08 | <klea> | silly idea |
| 19:27:22 | <klea> | i'm sure the 3 letter archivist here that has two duplicate letters will love this |
| 19:27:33 | <nicolas17> | he's already raising an eyebrow at you |
| 19:27:34 | <klea> | transfer.archivete.am link with "probably same file" urls |
| 19:27:53 | <klea> | then queue that to the tracker |
| 19:28:20 | <klea> | worker gets that item, downloads it (with the added side effect of archival since we're running wget-at i believe on them), then tries to get every single url in that file, and then deduplicates. |
| 19:29:11 | <klea> | if you're silly and stupid like me you make batches instead of uploading only one, which means the worker will effectively write more than one response since i am silly and give each worker more than one task at a time :p |
| 19:30:27 | <klea> | > we can't shove a dozen "these are probably the same file" URLs into the item name though, so it could be trickier <- wait, what prevents us from shoving those urls into the item name, is there some limit length on the tracker? |
| 19:31:27 | <nicolas17> | uhh what did rclone or azure or whatever do here? |
| 19:31:31 | <nicolas17> | {"Path":"NU5T-14J002-JAD_1667587484000.VBF","Name":"NU5T-14J002-JAD_1667587484000.VBF","Size":270478638,"MimeType":"application/octet-stream","ModTime":"2022-12-08T20:05:42.000000000Z","IsDir":false,"Hashes":{"md5":"0000bd0fadf8df707aeb6d790020fdd79ec5d75e76ec5d7b"}} |
| 19:31:35 | <nicolas17> | that's not an md5 T_T (48 hex chars, not 32) |
| 19:34:04 | <nicolas17> | but yeah |
| 19:34:08 | <nicolas17> | I have a few extreme cases of duplication |
| 19:34:13 | <nicolas17> | like this https://transfer.archivete.am/inline/1t46z/f7cf43ec508103ad7d1350450c1e36004f42085e80177135.txt |
| 19:34:51 | <klea> | so i guess nicolas17 agrees on my idea of shoving urls into very small files on transfer, and then shoving that down tracker, and then shoving that to a custom DPoS |
| 19:35:39 | | deafmute quits [Client Quit] |
| 19:36:50 | <@JAA> | The ETag thing doesn't work because it's specified to be an opaque identifier for the same representation of the same resource only. |
| 19:37:39 | <@JAA> | So you can do rearchivals that way, but you can't do revisit records for a different URL. |
| 19:37:46 | <klea> | oh. |
| 19:37:55 | <nicolas17> | we can't use it for deduplication but we can use it to make the URL lists, if we're wrong and think two files are the same when we're not, then wget-at will simply not dedup them |
| 19:37:59 | <klea> | > So you can do rearchivals that way, but you can't do revisit records for a different URL. <- that also i suppose relies on the remote system not changing ownership. |
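For reference, the conditional-request mechanism under discussion looks like this; a minimal sketch using Python's urllib (the function names are illustrative). As JAA notes, a 304 only asserts that *this URL's* representation is unchanged, since ETags are opaque and scoped to a single resource:

```python
import urllib.error
import urllib.request

def build_conditional_request(url, etag):
    """Build a GET carrying If-None-Match with a previously seen ETag."""
    return urllib.request.Request(url, headers={"If-None-Match": etag})

def conditional_get(url, etag):
    """Returns (status, body). A 304 means the same URL's representation is
    unchanged -- it proves nothing about a different URL serving the same
    bytes, which is why this can back rearchivals but not cross-URL revisits."""
    try:
        with urllib.request.urlopen(build_conditional_request(url, etag)) as resp:
            return resp.status, resp.read()
    except urllib.error.HTTPError as err:
        if err.code == 304:
            return 304, None
        raise
```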
| 19:38:06 | | Dango3607 (Dango360) joins |
| 19:38:15 | <klea> | wait, wget-at does dedupe, or is that in need of implementation? |
| 19:38:48 | <nicolas17> | wget-at does dedupe (if the requests are done as part of the same DPoS item) |
| 19:39:06 | <klea> | and wpull doesn't? |
| 19:39:09 | <@JAA> | There's a way to load CDXs as well, but I'm not sure how well-tested that is. |
| 19:41:27 | | Dango360 quits [Ping timeout: 272 seconds] |
| 19:41:27 | | Dango3607 is now known as Dango360 |
| 19:41:28 | <@JAA> | nicolas17: To be clear, technically, it would work, and it might even be fine in this particular case. But it definitely doesn't generalise. |
| 19:41:43 | <@JAA> | From a pure HTTP client spec perspective, it's still a violation. |
| 19:42:03 | <katia> | This incident will be reported. |
| 19:42:11 | <@JAA> | oh no |
| 19:42:22 | <klea> | oh no |
| 19:42:37 | <nicolas17> | https://xkcd.com/838/ |
| 19:42:49 | <katia> | Hehe |
| 19:42:55 | <@JAA> | wget-at doesn't support that dedupe profile, by the way. |
| 19:43:14 | <klea> | how is data deduped by AT then? |
| 19:43:19 | <@JAA> | It'd be this: https://iipc.github.io/warc-specifications/specifications/warc-format/warc-1.1/#profile-server-not-modified |
| 19:43:23 | <klea> | or does AT dump data without even dedup checking it? |
| 19:43:34 | <@JAA> | By payload digest, usually only within a single process. |
| 19:44:23 | <nicolas17> | wget-at downloads the whole thing and checks if the sha1 of the response matches a previously-downloaded response |
| 19:44:54 | <@JAA> | Yep |
| 19:45:04 | <@JAA> | And that's what I've done with S3 stuff before as well, just with qwarc. |
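The payload-digest dedup described above (wget-at hashes each response and collapses repeats within one process) can be sketched in a few lines. This is a Python model of the mechanism, not wget-at's actual C implementation; the class name is hypothetical:

```python
import hashlib

class DedupDB:
    """In-memory payload-digest dedup, per-process: the first response with a
    given SHA-1 is stored in full, and later identical payloads would be
    written as WARC 'revisit' records pointing back at the original."""

    def __init__(self):
        self.seen = {}  # digest -> (original url, original record id)

    def check(self, url, payload, record_id):
        digest = "sha1:" + hashlib.sha1(payload).hexdigest()
        if digest in self.seen:
            orig_url, orig_id = self.seen[digest]
            return ("revisit", orig_url, orig_id)   # duplicate payload
        self.seen[digest] = (url, record_id)
        return ("response", None, None)             # first sighting: keep full body
```

Because the table only lives for the lifetime of the process, fetching both URL forms of an object in the same job (or item) is what makes the second one dedupe.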
| 19:45:51 | <klea> | oh, so if i want s3 repos archived, i should contact JAA with a big list of urls that has both formats, and then JAA will run qwarc and upload to IA for me? |
| 19:46:16 | <klea> | should i also give the archive-s3-repo-with-dedup-on-both-urls-service-that-JAA-provides my S3 creds too? |
| 19:46:26 | <klea> | i suppose not since then it doesn't get in the web category. |
| 19:58:16 | | asie (asie) joins |
| 20:02:51 | | FiTheArchiver joins |
| 20:03:47 | | PaiMei quits [Quit: Wololo] |
| 20:04:06 | | PaiMei (PaiMei) joins |
| 20:05:31 | | n9nes quits [Ping timeout: 272 seconds] |
| 20:08:05 | | n9nes joins |
| 20:11:33 | | FiTheArchiver quits [Remote host closed the connection] |
| 20:11:54 | | FiTheArchiver joins |
| 20:14:26 | | deafmute joins |
| 20:14:54 | | Dada quits [Remote host closed the connection] |
| 20:15:08 | | Dada joins |
| 20:16:35 | <h2ibot> | Haiseiko edited URLTeam/Warrior (+183, /* Warrior projects */): https://wiki.archiveteam.org/?diff=60216&oldid=53418 |
| 20:16:36 | <h2ibot> | Brad edited Deathwatch (+692, Added Meta Horizon managed services and added…): https://wiki.archiveteam.org/?diff=60217&oldid=60210 |
| 20:16:37 | <h2ibot> | Vi edited Deathwatch (+285, /* 2026-03 */ added The Anime Network): https://wiki.archiveteam.org/?diff=60218&oldid=60217 |
| 20:16:38 | <h2ibot> | John123521 edited Deathwatch (+179, move Tom Lehrer and TUYU to Frozen Solid): https://wiki.archiveteam.org/?diff=60219&oldid=60218 |
| 20:19:35 | <h2ibot> | JustAnotherArchivist edited URLTeam/Warrior (-18, Fix shor.kr addition): https://wiki.archiveteam.org/?diff=60220&oldid=60216 |
| 20:22:36 | <h2ibot> | JustAnotherArchivist edited URLTeam/Warrior (-165, Reverted; there has been and is no shor-kr…): https://wiki.archiveteam.org/?diff=60221&oldid=60220 |
| 20:24:36 | <h2ibot> | JustAnotherArchivist edited Deathwatch (+160, Restore Eshizuoka entry removed by…): https://wiki.archiveteam.org/?diff=60222&oldid=60219 |
| 20:25:34 | | FiTheArchiver quits [Remote host closed the connection] |
| 20:25:36 | <h2ibot> | JustAnotherArchivist edited Deathwatch (-135, Remove empty year sections for 2030s; we can…): https://wiki.archiveteam.org/?diff=60223&oldid=60222 |
| 20:25:55 | | FiTheArchiver joins |
| 20:27:45 | | Aurora quits [Quit: Ooops, wrong browser tab.] |
| 20:30:50 | <klea> | i think i'll make a subforum or subservice for a small thing that will close in 2500, that way the wiki needs more sections :p |
| 20:44:28 | | HP_Archivist quits [Quit: Leaving] |
| 20:44:44 | | HP_Archivist joins |
| 20:56:33 | | FiTheArchiver quits [Remote host closed the connection] |
| 20:57:47 | | FiTheArchiver joins |
| 21:25:02 | | nine quits [Quit: See ya!] |
| 21:25:19 | | nine joins |
| 21:31:33 | | FiTheArchiver quits [Remote host closed the connection] |
| 21:31:56 | | FiTheArchiver joins |
| 21:50:58 | | Webuser025436 quits [Quit: Ooops, wrong browser tab.] |
| 22:02:33 | | FiTheArchiver quits [Remote host closed the connection] |
| 22:03:09 | | FiTheArchiver joins |
| 22:03:10 | | ScarlettStunningSpace quits [Ping timeout: 256 seconds] |
| 22:05:32 | | Karlett joins |
| 22:11:51 | <h2ibot> | Klea edited Phorge/uncategorized (+66, Added git.kolab.org): https://wiki.archiveteam.org/?diff=60224&oldid=60124 |
| 22:12:51 | <h2ibot> | Klea edited Phorge/uncategorized (+135, Add forge.softwareheritage.org): https://wiki.archiveteam.org/?diff=60225&oldid=60224 |
| 22:34:11 | | FiTheArchiver quits [Read error: Connection reset by peer] |
| 22:37:00 | <pokechu22> | deafmute: I'm not aware of an existing project for https://cosplay.com/ - do you have an estimate for how big it is? |
| 22:37:34 | <pokechu22> | It looks like it's not too scripty so archivebot probably could be used for it |
| 22:42:55 | <h2ibot> | Nintendofan885 edited Namuwiki (+40, fix infobox name): https://wiki.archiveteam.org/?diff=60226&oldid=60200 |
| 22:51:09 | <deafmute> | pokechu22: difficult to estimate. judging from official numbers maybe 100-150k photos, which would be 100GB-1TB I guess. But the site is broken and it doesn't show nearly the same amount of images it claims to have, so probably less than that. |
| 22:51:48 | <deafmute> | and I haven't found a way to see the full res photos, just thumbnails |
| 22:53:51 | | deafmute quits [Client Quit] |
| 22:54:07 | | deafmute joins |
| 22:54:17 | <klea> | deafmute: it seems https://s3.amazonaws.com/cosplay-cdn/large/3d8bc518-ade8-4cad-9729-032b37331052.jpg they're on s3 but that bucket isn't open |
| 22:57:51 | | Webuser239629 joins |
| 22:58:00 | | Webuser239629 quits [Client Quit] |
| 23:00:01 | | SootBector quits [Remote host closed the connection] |
| 23:01:09 | | SootBector (SootBector) joins |
| 23:06:18 | <pokechu22> | hmm, yeah. https://cosplay.com/series/reborn says https://cosplay.com/character/gokudera-hayato has 233 costumes 1023 photos, but only lists 12 photos on https://cosplay.com/character/gokudera-hayato |
| 23:06:48 | <pokechu22> | (and guessing https://cosplay.com/character/gokudera-hayato?page=2 doesn't do anything either) |
| 23:08:19 | <pokechu22> | ... and https://cosplay.com/?page=3 and beyond requires logging in, so it can't be discovered that way either (/series doesn't have that limitation though) |
| 23:08:29 | <klea> | pokechu22: they're categories, so they have photos inside? |
| 23:09:14 | <pokechu22> | I misspoke - https://cosplay.com/character/gokudera-hayato only lists 12 costumes and like a hundred photos |
| 23:09:33 | <klea> | oh |
| 23:10:17 | <nicolas17> | https://cosplay.com/s/092dder49 -> https://s3.amazonaws.com/photo.cosplay.com/143783/1800163.jpg |
| 23:10:29 | <nicolas17> | however the .jpg only loads with Referer: https://cosplaxy.com/ |
| 23:10:59 | <deafmute> | I tried logging in with a bugmenot account, it doesn't seem to change anything regarding how many images are shown |
| 23:11:12 | <pokechu22> | Huh, I didn't know s3 could do referer checks |
| 23:11:36 | <klea> | that confused me |
| 23:11:39 | <pokechu22> | but archivebot generates referers properly so that probably won't be an issue |
| 23:11:47 | <nicolas17> | bucket policy can do many things |
| 23:12:01 | <pokechu22> | Does the bugmenot account make e.g. https://cosplay.com/?page=1000 work? |
| 23:12:08 | <klea> | there's also a forum |
| 23:12:13 | | nicolas17 wonders if you can make a bucket policy that only allows access for 5 minutes of every hour |
| 23:12:35 | <klea> | pokechu22: logging in changes the ui |
| 23:12:52 | <pokechu22> | hmm |
| 23:13:24 | <pokechu22> | Oh, https://cosplay.com/member/206560 and https://cosplay.com/member/206559 both exist; we could maybe enumerate those |
| 23:13:26 | <deafmute> | pokechu22: no, still the same page for me |
| 23:13:40 | <klea> | for me it seems to be a doomscrollable interface. |
| 23:14:03 | <deafmute> | nicolas17: how did you find that /s/092dder49 page? |
| 23:14:05 | <klea> | which pokes <https://cosplay.com/livewire/message/status-list> with a servermemo |
| 23:14:26 | <klea> | deafmute: clicking on profile for user |
| 23:14:30 | <pokechu22> | Those /s/ ones are just linked on the home page too |
| 23:14:41 | <klea> | also there seems to be a forum. |
| 23:15:30 | <deafmute> | oh okay. I thought you found a way to get those pages for old posts like that sub-zero in my example |
| 23:15:33 | <pokechu22> | Huh, and https://cosplay.com/member/26559 also works, so they really do have 200k users maybe? |
| 23:15:50 | <pokechu22> | that links old posts like https://cosplay.com/s/ynz11e8m9 |
| 23:16:32 | <pokechu22> | https://cosplay.com/member/1 -> https://cosplay.com/s/b934yzmvn apparently 3 years ago |
| 23:17:05 | | Dada quits [Remote host closed the connection] |
| 23:17:16 | <klea> | https://cosplay.com/livewire/livewire.js |
| 23:17:17 | | Dada joins |
| 23:18:11 | <pokechu22> | I think bruteforcing the user list would be the easiest way to discover all of the images since it paginates all the way to https://cosplay.com/member/1?page=604 (and also *stops* paginating there, unlike some other more annoying sites...) |
| 23:18:44 | <klea> | pokechu22: we'd need to get to crawl depth two which i think AB does right? |
| 23:19:01 | <pokechu22> | What do you mean? |
| 23:19:19 | <klea> | it has to go to https://cosplay.com/s/yxny028e9 to be able to get the full picture too, not just the thumbnail. |
| 23:19:34 | <pokechu22> | I'm specifically thinking of a recursive job as an !a < list job with a custom sitemap (so it behaves like !a https://cosplay.com/ but also gets seeded with all of the member URLs) |
| 23:19:46 | <pokechu22> | that should go to those /s/ URLs and everything else on the site that's linked normally |
| 23:20:14 | <klea> | they seem to have up to user https://cosplay.com/member/402093 |
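Generating the seed list for the brute-force approach above is mechanical: sequential member IDs from 1 up to the highest observed (402093, per klea). A sketch, with a hypothetical function name, producing a list suitable for an `!a < list` job:

```python
def member_seed_urls(max_id, base="https://cosplay.com/member/"):
    """Yield seed URLs for sequential member IDs 1..max_id.
    Nonexistent IDs just 404 and cost one request each."""
    for member_id in range(1, max_id + 1):
        yield f"{base}{member_id}"

def write_seed_file(path, max_id):
    """Write one URL per line, the format archivebot list jobs expect."""
    with open(path, "w") as fh:
        for url in member_seed_urls(max_id):
            fh.write(url + "\n")
```

Recursion from each member page then picks up the paginated photo listings and the `/s/` post URLs linked from them.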
| 23:21:12 | | nine quits [Client Quit] |
| 23:21:25 | | nine joins |
| 23:23:35 | <pokechu22> | huh, looking at https://cosplay.com/series?page=50 the first row of the table claims 327 costumes, 33 photos, but https://cosplay.com/series/animamundi-dark says it's the other way around |
| 23:25:05 | <pokechu22> | it also says there are 6 characters and then lists 13 of them, with a sum of 21 costumes and 327 photos |
| 23:25:30 | <deafmute> | advanced cosplay maths |
| 23:30:12 | <deafmute> | Now I finally know how to navigate on this broken site and actually find what I was looking for, many thanks for that. |
| 23:31:09 | <deafmute> | can't really help with the technical stuff however, I was just curious |
| 23:38:02 | <h2ibot> | Nintendofan885 edited Qwarc (+4, link [[WARC]]): https://wiki.archiveteam.org/?diff=60227&oldid=54296 |
| 23:40:58 | | deafmute quits [Client Quit] |
| 23:42:19 | <klea> | JAA: could you put every item that you made using qwarc have subject qwarc? |
| 23:42:50 | | deafmute joins |
| 23:46:26 | <pokechu22> | I wrote a quick script to sum the data on https://cosplay.com/series: https://transfer.archivete.am/inline/KPGxg/cosplay.com_count_photos.py which says 1608146 photos, 333780 costumes |
| 23:47:04 | <pokechu22> | That's roughly the same size as refsheet.net, which is definitely doable |
| 23:47:20 | <@JAA> | klea: I could, yes. |
| 23:47:38 | <klea> | mostly because then finding qwarc scripts would be significantly easier. |
| 23:48:09 | <pokechu22> | That's roughly the same size as refsheet.net (1445504 images, 408063 characters, 105078 users), which is definitely doable |
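The counting script itself is only available via the transfer link above; the idea can be re-sketched as summing "N costumes, M photos" occurrences across the `/series` listing pages. The regex is an assumption about the page text (the log shows both comma and no-comma forms), not the actual markup:

```python
import re

# Matches e.g. "233 costumes 1023 photos" and "327 costumes, 33 photos".
COUNT_RE = re.compile(r"(\d+)\s+costumes?,?\s+(\d+)\s+photos?", re.I)

def sum_counts(page_texts):
    """Sum costume and photo counts over an iterable of page texts,
    returning (total_costumes, total_photos)."""
    costumes = photos = 0
    for text in page_texts:
        for c, p in COUNT_RE.findall(text):
            costumes += int(c)
            photos += int(p)
    return costumes, photos
```

Fetching all `/series?page=N` pages and feeding them through this is what yields the ~1.6M photo / ~334k costume totals quoted above.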
| 23:49:03 | <@JAA> | Mhm |
| 23:49:15 | <deafmute> | oh that's like 10x my guess lol |
| 23:50:33 | | Dada quits [Remote host closed the connection] |