00:18:42Wohlstand (Wohlstand) joins
00:23:43Wohlstand quits [Ping timeout: 272 seconds]
00:48:06<klea>btw, it'd be neat to have a system to track planned and unplanned outages in systems that AT uses.
00:54:14<klea>/cc DigitalDragons ^
01:06:23pika joins
01:09:58SootBector quits [Remote host closed the connection]
01:11:04SootBector (SootBector) joins
01:16:55pika quits [Ping timeout: 272 seconds]
01:17:38Suika_ quits [Ping timeout: 256 seconds]
01:24:35pika joins
01:25:41pokechu22 quits [Quit: WeeChat 4.7.1]
01:26:08pokechu22 (pokechu22) joins
01:26:33Suika joins
01:29:35pika quits [Ping timeout: 272 seconds]
01:37:50pika joins
01:42:53pika quits [Ping timeout: 272 seconds]
01:49:15Sk1d quits [Read error: Connection reset by peer]
01:51:30pika joins
01:51:51pika leaves
01:59:47cyan_box joins
02:04:06cyanbox_ quits [Ping timeout: 256 seconds]
02:36:51MrMcNuggets (MrMcNuggets) joins
03:06:39etnguyen03 (etnguyen03) joins
03:18:10etnguyen03 quits [Remote host closed the connection]
03:19:47etnguyen03 (etnguyen03) joins
03:21:38etnguyen03 quits [Remote host closed the connection]
03:22:47etnguyen03 (etnguyen03) joins
03:38:21etnguyen03 quits [Client Quit]
03:46:06CYBERDEV quits [Quit: Leaving]
03:46:16etnguyen03 (etnguyen03) joins
03:54:55CYBERDEV joins
03:59:21etnguyen03 quits [Remote host closed the connection]
04:27:37chrismeller3 quits [Quit: chrismeller3]
05:02:38Webuser290291 joins
05:02:55Webuser290291 quits [Client Quit]
05:04:14wotd quits [Remote host closed the connection]
05:04:18n9nes quits [Ping timeout: 256 seconds]
05:04:47wotd joins
05:07:05n9nes joins
05:14:25chunkynutz60 quits [Ping timeout: 272 seconds]
05:17:20ThetaDev quits [Quit: https://quassel-irc.org - Chat comfortably. Anywhere.]
05:18:00ThetaDev joins
05:22:08<HP_Archivist>I just noticed this: https://www.reddit.com/r/Archivists/comments/1q9n5nt/nara_is_shutting_down_history_hub_for_citizen/
05:22:26<HP_Archivist>https://historyhub.history.gov/citizen_archivists/f/discussions
05:22:29<HP_Archivist>Can we grab it?
05:23:12<pokechu22>"On January 15, 2026 the History Hub site will be “frozen in time.” The site will remain available for reference until February 13, 2026."
05:23:57<HP_Archivist>Does that mean it will be frozen in time and online or?
05:24:16Webuser025436 joins
05:24:18<pokechu22>Presumably that means they made it read-only yesterday and it'll be online for a month before they close it fully
05:24:19<nicolas17>sounds like it will be frozen and online between jan 15 and feb 13, and offline after feb 13
05:24:29<pokechu22>I'm seeing incapsula on it
05:24:34<@arkiver>i've been working on a job in AB
05:24:34<nicolas17>pls add to deathwatch
05:24:38<@arkiver>but not sure if it went well
05:24:57<pokechu22>It didn't work
05:26:00<HP_Archivist>Hm. Throw it into AB now?
05:26:13<pokechu22>It seems like the incapsula JS challenge would need to be solved, and I don't know how long that lasts
05:26:48<Webuser025436>Hi. Ming Pao Canada, the Canadian edition and daily newspaper of the HK-based Ming Pao, announced it is shutting down as of today. [1] Notably, it hosts an archive of all articles from Ming Pao Hong Kong (the HK edition). All of Ming Pao HK's pre-2021 articles were removed from the internet a long time ago because of the media situation in HK.
05:26:49<Webuser025436>Is it possible to archive? https://www.mingpaocanada.com/
05:26:49<Webuser025436>[1]: https://www.cp24.com/news/canada/2026/01/13/ex-journalists-lament-closure-of-ming-pao-canadas-last-chinese-language-daily-paper/
05:28:13<h2ibot>Pokechu22 edited Deathwatch (+237, /* 2026-02 */ https://historyhub.history.gov/): https://wiki.archiveteam.org/?diff=60210&oldid=60209
05:28:30<pokechu22>Webuser025436: I believe we've already started an archivebot job for that; I'm going to double-check the status of it
05:29:59<pokechu22>Webuser025436: It's currently running, together with http://mingshengbao.com/ - see http://archivebot.com/?initialFilter=mingpaocanada#log-container-6cnonr9bi6enxit9na57wckan
05:30:55<Webuser025436>Many thanks 🙏
05:31:15<pokechu22>Where is the archive of the pre-2021 Hong Kong articles? I can't read Chinese so I'd like to make sure we're saving that too
05:32:54<nicolas17>pokechu22: wonder if we should speed up that job
05:33:28<pokechu22>It closes at the end of January; today was the date the last edition was published
05:33:44<pokechu22>(at least according to https://www.cp24.com/news/canada/2026/01/13/ex-journalists-lament-closure-of-ming-pao-canadas-last-chinese-language-daily-paper/)
05:37:03<Webuser025436>pokechu22 All HK articles for a given day are shown here: https://www.mingpaocanada.com/tor/htm/News/YYYYMMDD/HK-GAindex_r.htm
05:37:03<Webuser025436>So for example: https://www.mingpaocanada.com/tor/htm/News/20140710/HK-GAindex_r.htm
05:37:03<Webuser025436>The earliest AFAICT is 20140710.
05:38:53<pokechu22>Is there a page that lists all of the previous ones? I assume there must be because archivebot has found https://www.mingpaocanada.com/tor/htm/News/20220429/tam1_r.htm but I don't know exactly where it came from
05:40:12<pokechu22>(there's https://www.mingpaocanada.com/tor/htm/Responsive/archiveList.cfm but that seems to only directly show the last week)
05:40:19<nicolas17>pokechu22: also I'm seeing many requests like https://www.mingpaocanada.com/tor/htm/News/20220815/TD/TD/tdc1.txt that redirect to an error page, might be a crawling glitch finding garbage in JS or something?
05:41:58<nicolas17>ah yes, docPath: "HK-GA/gc/gcc1.txt"
05:41:59<pokechu22>Yeah, looks like that comes from https://www.mingpaocanada.com/tor/htm/News/20220815/HK-gaa1_r.htm containing a POST to /Tor/cfc/popular_addone.cfc with HK-GA/ga/gaa1.txt as a parameter
05:42:44<nicolas17>not sure how to avoid this, excluding *.txt feels too broad
05:43:39<pokechu22>It's probably fine to just leave them as-is since there's 1 per article and most articles have several images as well
05:44:25<nicolas17>well it's also following the redirect and saving errorpage.html every single time
05:45:21<pokechu22>Looks like that's not a new issue: https://web.archive.org/web/20260901000000*/https://www.mingpaocanada.com/errorpage.html :)
05:45:36<pokechu22>... ok, though 10660 snapshots on January 16 is probably still excessive
05:46:26<nicolas17>pain
05:50:09<pokechu22>I guess I can check what dates it's already found using ab2f
05:55:16<nicolas17>I was thinking something like tor/htm/News/[0-9]{8}/[A-Z]+/[A-Z]+/[a-z]+[0-9]\.txt
05:55:27<nicolas17>but that's not exhaustive, will need a few more patterns
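A minimal sketch, outside the job itself, of sanity-checking nicolas17's candidate ignore pattern above against URLs seen in this discussion before handing it to ArchiveBot (the sample URLs and verdicts are taken from the conversation; additional patterns would be tested the same way):

    import re

    # nicolas17's proposed pattern for the garbage docPath .txt requests
    candidate = re.compile(r"tor/htm/News/[0-9]{8}/[A-Z]+/[A-Z]+/[a-z]+[0-9]\.txt")

    samples = {
        # redirects to errorpage.html, should be ignored
        "https://www.mingpaocanada.com/tor/htm/News/20220815/TD/TD/tdc1.txt": True,
        # real pages, must not be ignored
        "https://www.mingpaocanada.com/tor/htm/News/20220815/HK-gaa1_r.htm": False,
        "https://www.mingpaocanada.com/tor/htm/News/20140710/HK-GAindex_r.htm": False,
    }

    for url, should_ignore in samples.items():
        hit = bool(candidate.search(url))
        print("ok" if hit == should_ignore else "MISMATCH", url)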
05:56:12sec^nd quits [Remote host closed the connection]
05:56:38sec^nd (second) joins
05:59:31evergreen56 joins
06:02:06evergreen5 quits [Ping timeout: 256 seconds]
06:02:06evergreen56 is now known as evergreen5
06:04:34<pokechu22>The oldest archivebot has found so far is http://www.mingpaocanada.com/TOR/htm/News/20220319/HK-GAindex_r.htm
06:05:09<pokechu22>JAA: can you trace http://www.mingpaocanada.com/TOR/htm/News/20220319/HK-GAindex_r.htm on 6cnonr9bi6enxit9na57wckan please?
06:06:09<Webuser025436>is the link i provided not good enough above? this link contains outlinks to all hk articles for a given day: https://www.mingpaocanada.com/tor/htm/News/20140710/HK-GAindex_r.htm
06:07:00<pokechu22>It is, but now I'm trying to figure out if archivebot will have already found those or if I need to start the job in a way that will discover those
06:11:25<pokechu22>(there isn't any good way to add urls to an existing archivebot job, but I could start a new one with a list of those pages for all days since 2014 or similar, along with https://www.mingpaocanada.com/van/htm/News/20260116/VAindex_r.htm and https://www.mingpaocanada.com/tor/htm/News/20260116/TAindex_r.htm)
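A short sketch of building the seed list described here: one HK index page per day from the earliest known edition through the last one, plus the two current Toronto/Vancouver indexes mentioned above (whether every date actually exists is untested; missing days would simply 404):

    from datetime import date, timedelta

    start = date(2014, 7, 10)   # earliest HK index Webuser025436 found
    end = date(2026, 1, 16)     # date of the final edition

    urls = []
    day = start
    while day <= end:
        ymd = day.strftime("%Y%m%d")
        urls.append(f"https://www.mingpaocanada.com/tor/htm/News/{ymd}/HK-GAindex_r.htm")
        day += timedelta(days=1)
    urls.append("https://www.mingpaocanada.com/tor/htm/News/20260116/TAindex_r.htm")
    urls.append("https://www.mingpaocanada.com/van/htm/News/20260116/VAindex_r.htm")

    with open("mingpao_index_urls.txt", "w") as f:
        f.write("\n".join(urls) + "\n")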
06:14:47chunkynutz60 joins
06:17:24nexussfan quits [Quit: Konversation terminated!]
07:32:26chrismeller3 (chrismeller) joins
09:00:02midou quits [Ping timeout: 256 seconds]
09:20:45midou joins
09:27:48midou quits [Ping timeout: 256 seconds]
09:41:56midou joins
09:46:45midou quits [Ping timeout: 272 seconds]
09:52:45midou joins
10:23:29midou quits [Ping timeout: 272 seconds]
10:43:05<h2ibot>KleaBot made 2 bot changes: https://wiki.archiveteam.org/index.php?title=Special:Contributions/KleaBot&offset=20260117104222&limit=2&namespace=2&wpfilters[]=nsInvert&wpfilters[]=associated
10:43:34<klea>mhmm
10:44:14<klea>oh i love that my terminal thinks the url is shorter.
10:44:37<klea>so my browser opened <https://wiki.archiveteam.org/index.php?title=Special:Contributions/KleaBot&offset=20260117104222&limit=2&namespace=2&wpfilters> instead
10:45:30Dada joins
11:14:24midou joins
11:20:00<alexlehm>the same happens in my irc client, it does not consider [] as a valid url character
11:22:15<klea>time to urlencode it, or wrap it in <>
11:22:25<klea>alexlehm: how does <https://wiki.archiveteam.org/index.php?title=Special:Contributions/KleaBot&offset=20260117104222&limit=2&namespace=2&wpfilters[]=nsInvert&wpfilters[]=associated> open?
11:23:28<alexlehm>"https://wiki.archiveteam.org/index.php?title=Special:Contributions/KleaBot&offset=20260117104222&limit=2&namespace=2&wpfilters"
11:24:14<klea>alexlehm: and KleaBot made 2 bot changes: https://wiki.archiveteam.org/index.php?title=Special:Contributions/KleaBot&offset=20260117104222&limit=2&namespace=2&wpfilters%5B%5D=nsInvert&wpfilters%5B%5D=associated
11:24:42<alexlehm>it would probably work with https://wiki.archiveteam.org/index.php?title=Special:Contributions/KleaBot&offset=20260117104222&limit=2&namespace=2&wpfilters%5B%5D=nsInvert&wpfilters%5B%5D=associated
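A tiny illustration of the "urlencode it" fix discussed above: percent-encoding the square brackets produces the form that pastes cleanly into IRC (the safe set below is a judgement call, not a standard):

    from urllib.parse import quote

    url = ("https://wiki.archiveteam.org/index.php?title=Special:Contributions/KleaBot"
           "&offset=20260117104222&limit=2&namespace=2"
           "&wpfilters[]=nsInvert&wpfilters[]=associated")
    print(quote(url, safe=":/?&="))
    # ...&wpfilters%5B%5D=nsInvert&wpfilters%5B%5D=associated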
11:30:46ArchivalEfforts quits [Ping timeout: 256 seconds]
11:30:58ArchivalEfforts joins
11:31:51midou quits [Read error: Connection reset by peer]
11:34:08HP_Archivist quits [Quit: Leaving]
11:35:05<Juest>hexchat processes the url fine
11:38:51Doomaholic quits [Ping timeout: 272 seconds]
11:39:07<alexlehm>: could also be a stop character
11:48:57Hackerpcs_1 (Hackerpcs) joins
11:49:28Hackerpcs quits [Ping timeout: 256 seconds]
12:00:02Bleo182600722719623455222 quits [Quit: The Lounge - https://thelounge.chat]
12:02:48Bleo182600722719623455222 joins
12:11:18midou joins
12:13:41szczot3k quits [Ping timeout: 272 seconds]
12:13:43szczot3k_ (szczot3k) joins
12:14:23szczot3k_ is now known as szczot3k
12:18:32szczot3k quits [Remote host closed the connection]
12:18:58szczot3k (szczot3k) joins
12:21:28Doomaholic (Doomaholic) joins
12:30:51HP_Archivist (HP_Archivist) joins
12:31:55midou quits [Read error: Connection reset by peer]
12:38:49szczot3k quits [Remote host closed the connection]
12:41:18szczot3k (szczot3k) joins
12:45:07szczot3k quits [Remote host closed the connection]
12:46:57Webuser989898 joins
12:47:20Webuser989898 quits [Client Quit]
12:47:35szczot3k (szczot3k) joins
12:49:37szczot3k quits [Remote host closed the connection]
12:52:02szczot3k (szczot3k) joins
12:52:51midou joins
13:03:35cyan_box quits [Read error: Connection reset by peer]
13:04:21midou quits [Ping timeout: 272 seconds]
13:11:17szczot3k_ (szczot3k) joins
13:12:35szczot3k quits [Ping timeout: 272 seconds]
13:12:35szczot3k_ is now known as szczot3k
13:14:24midou joins
13:15:00Marie0 joins
13:17:00szczot3k quits [Remote host closed the connection]
13:19:26szczot3k (szczot3k) joins
13:21:38<Marie0>Sorry for getting to this so late, but I think we should archive some Honduran government websites before the inauguration on the 27th. The current president is a leftist and the new one is a Trump ally promising all kinds of austerity measures, so I expect the web presence of the government will completely change fairly quickly
13:21:45midou quits [Read error: Connection reset by peer]
13:21:50nine quits [Quit: See ya!]
13:22:03nine joins
13:22:03nine quits [Changing host]
13:22:03nine (nine) joins
13:23:38<Marie0>On the bright side, Honduras is a small country and their internet is actually pretty modern. I think if we start immediately, we could easily do this
13:24:54etnguyen03 (etnguyen03) joins
13:30:18szczot3k quits [Remote host closed the connection]
13:32:44szczot3k (szczot3k) joins
13:34:51<Marie0>I'm new here btw. I have some experience doing small scrapes with cURL but have never collaborated on one. I was initially going to do that for the Honduras thing, but I was in over my head. Huge fan of Archiveteam and your work!
13:35:17midou joins
14:00:04<justauser>Marie0: Do you have a list?
14:13:58midou quits [Ping timeout: 256 seconds]
14:23:13midou joins
14:51:24etnguyen03 quits [Client Quit]
14:51:32the joins
15:05:09nexussfan (nexussfan) joins
15:05:13etnguyen03 (etnguyen03) joins
15:11:54the quits [Client Quit]
15:21:08<Marie0>justauser: Sort of. There's a general directory of all government agencies that I extracted the links from, but it's just the front page of each agency. A lot of them run other websites for specific offices/services. These are usually conspicuously linked somewhere near the front page of the agency, but I haven't actually compiled them into a list
15:21:10nexussfan quits [Client Quit]
15:22:31nexussfan (nexussfan) joins
15:27:17<klea>Marie0: are they subdomains or are they subpages of the domain?
15:27:29IDK quits [Quit: Updating details, brb]
15:28:59IDK (IDK) joins
15:29:49<h2ibot>Klea edited Dealing with Cloudflare (+28, /* Scenario 2 - TLS fingerprinting */…): https://wiki.archiveteam.org/?diff=60214&oldid=58744
15:31:02midou quits [Ping timeout: 256 seconds]
15:37:39DogsRNice joins
15:38:16szczot3k quits [Remote host closed the connection]
15:38:35<Marie0>klea: In some cases neither. For example, presidencia.gob.hn is obviously right there, but it has a section "periódico impreso" that links to poderpopular.hn, which isn't on the list but is still run by the government, and since it's straight up party propaganda it'll almost certainly be taken down
15:39:30<klea>aaaaa, that'd require getting all sublinks and archiving more content
15:40:25<klea>i guess it's possible by dumping the dbs after having made an AB job with all the initial domains, and getting more domains from that; is that right, justauser, or is that a bad idea?
15:40:42szczot3k (szczot3k) joins
15:40:59<Juest>try grabbing the sitemaps?
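A quick sketch of this sitemap idea for the sites in Marie0's list: check each domain's robots.txt for Sitemap: lines and fall back to /sitemap.xml (whether these particular sites publish sitemaps at all is untested):

    import requests

    def find_sitemaps(site: str) -> list[str]:
        site = site.rstrip("/")
        sitemaps = []
        try:
            robots = requests.get(f"{site}/robots.txt", timeout=30).text
            sitemaps = [line.split(":", 1)[1].strip()
                        for line in robots.splitlines()
                        if line.lower().startswith("sitemap:")]
        except requests.RequestException:
            pass
        return sitemaps or [f"{site}/sitemap.xml"]

    with open("websites.txt") as f:   # the list Marie0 posts below
        for site in filter(None, map(str.strip, f)):
            print(site, find_sitemaps(site))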
15:41:06ATinySpaceMarine quits [Quit: https://quassel-irc.org - Chat comfortably. Anywhere.]
15:41:27<Marie0>There's not that many root links to begin with so it's not THAT bad. I think I could do it by hand in an afternoon if needed.
15:42:09<Marie0>Here's the list btw: https://transfer.archivete.am/xbYNo/websites.txt
15:42:10<eggdrop>inline (for browser viewing): https://transfer.archivete.am/inline/xbYNo/websites.txt
15:42:38<Marie0>Ah, thank you
15:43:04ATinySpaceMarine joins
15:44:58<justauser>Started jobs for the two you mentioned first for now.
15:45:40<justauser>Some pages return normal contents but with HTTP 500.
15:45:52<klea>!kfind protocol incompliant
15:45:53<eggdrop>[karma] 'protocol incompliant' not found.
15:45:59<klea>shitty :(
15:46:12<justauser>Example https://www.poderpopular.hn/vernoticias.php?id_noticia=25018
15:46:49<klea>> Expired website :( < https://www.partidoliberal.hn/
15:48:03<Marie0>justauser: idk but it's working fine in my browser
15:48:17<klea>> Domain for sale: https://partidonacional.hn/
15:48:17<klea>Unable to resolve: https://mamsurpazhn.com/
15:48:18<klea>500 https://fonac.hn/
15:48:25<justauser>Exactly. It works fine while claiming an error on lower level.
15:48:39<justauser>It confuses our machinery.
15:48:49<klea>403 https://portalunico.iaip.gob.hn/
15:48:59<klea>Loading page even with js on? https://www.dpr.gob.hn/
15:49:05<klea>maybe we should make a channel for this?
15:49:14<justauser>That's #vooterbooter
15:49:16<klea>oh
15:49:18<klea>sorry
15:49:55<justauser>Tries to load script from some CDN and fails.
15:50:42<klea>i've moved my part of the discussion there.
15:59:38alexlehm quits [Remote host closed the connection]
15:59:41alexlehm (alexlehm) joins
16:01:08midou joins
16:07:08<nulldata>https://learn.microsoft.com/en-us/troubleshoot/mem/configmgr/mdt/mdt-retirement
16:07:32<nulldata>"MDT download packages might be removed or deprecated from official distribution channels."
16:14:03<nulldata>Here's an old version https://www.microsoft.com/en-us/download/details.aspx?id=57917
16:14:16<nulldata>Newest one seems to be pulled already
16:17:32midou quits [Read error: Connection reset by peer]
16:20:20BluRaf quits [Quit: WeeChat 3.8]
16:20:25BluRaf (BluRaf) joins
16:22:32Dada quits [Remote host closed the connection]
16:22:45Dada joins
16:26:38midou joins
16:39:03midou quits [Ping timeout: 272 seconds]
16:48:02midou joins
16:58:26deafmute joins
17:03:35<deafmute>Hello everyone. Are there any plans to archive cosplay.com galleries? The site has been undead and dysfunctional for a long time
17:03:51Marie0 quits [Quit: Ooops, wrong browser tab.]
17:08:03<h2ibot>Klea created ArchiveBot/2025 Honduran General Election/list (+2609, Created page with "https://congresonacional.hn/…): https://wiki.archiveteam.org/?title=ArchiveBot/2025%20Honduran%20General%20Election/list
17:39:40<that_lurker>https://console.cloud.google.com/storage/browser/net-ntlmv1-tables/tables;tab=objects?pageState=(%22StorageObjectListTable%22:(%22f%22:%22%255B%255D%22))&prefix=&forceOnObjectsSortingFiltering=false
17:40:30<that_lurker>"Google just dropped 1.1 QUADRILLION pre-computed passwords (aka rainbowtable) for NetNTLMv1."
17:40:41<that_lurker>so about 8.6 TB
17:40:46<that_lurker>ref. https://www.linkedin.com/posts/benjamin-iheukumere_google-just-dropped-11-quadrillion-pre-computed-activity-7418215510380802048-9NIK/
17:41:23oxtyped quits [Read error: Connection reset by peer]
17:41:35oxtyped joins
17:42:01<Hans5958>That LinkedIn post is so engagement bait-y
17:42:02<Hans5958>https://cloud.google.com/blog/topics/threat-intelligence/net-ntlmv1-deprecation-rainbow-tables
17:42:32<Hans5958>https://x.com/Mandiant/status/2012268623662874906
17:42:32<eggdrop>nitter: https://nitter.net/Mandiant/status/2012268623662874906
17:43:04<that_lurker>ohh nice did not know there was an article on that
17:43:28<that_lurker>I don't have enough space to download all of that and then push it to IA, but someone here might :-P
18:17:35<katia>👀
18:18:32Ajay quits [Quit: Bridge terminating on SIGTERM]
18:18:32@Sanqui|m quits [Quit: Bridge terminating on SIGTERM]
18:18:32britmob|m quits [Quit: Bridge terminating on SIGTERM]
18:18:33anon00001|m quits [Quit: Bridge terminating on SIGTERM]
18:18:33xxia|m quits [Quit: Bridge terminating on SIGTERM]
18:18:33mind_combatant quits [Quit: Bridge terminating on SIGTERM]
18:18:33x9fff00 quits [Quit: Bridge terminating on SIGTERM]
18:18:33DigitalDragon quits [Quit: Bridge terminating on SIGTERM]
18:18:33gamer191-1|m quits [Quit: Bridge terminating on SIGTERM]
18:18:33theblazehen|m quits [Quit: Bridge terminating on SIGTERM]
18:18:33igneousx quits [Quit: Bridge terminating on SIGTERM]
18:18:34audrooku|m quits [Quit: Bridge terminating on SIGTERM]
18:18:34alexshpilkin quits [Quit: Bridge terminating on SIGTERM]
18:18:34flashfire42|m quits [Quit: Bridge terminating on SIGTERM]
18:18:34Minkafighter|m quits [Quit: Bridge terminating on SIGTERM]
18:18:34tomodachi94 quits [Quit: Bridge terminating on SIGTERM]
18:18:34Vokun quits [Quit: Bridge terminating on SIGTERM]
18:18:35Tom|m1 quits [Quit: Bridge terminating on SIGTERM]
18:18:35Hans5958 quits [Quit: Bridge terminating on SIGTERM]
18:18:35mpeter|m quits [Quit: Bridge terminating on SIGTERM]
18:18:35rewby|m quits [Quit: Bridge terminating on SIGTERM]
18:18:35MaxG quits [Quit: Bridge terminating on SIGTERM]
18:18:35Exorcism quits [Quit: Bridge terminating on SIGTERM]
18:18:35nyuuzyou quits [Quit: Bridge terminating on SIGTERM]
18:18:35masterx244|m quits [Quit: Bridge terminating on SIGTERM]
18:18:35nstrom|m quits [Quit: Bridge terminating on SIGTERM]
18:18:36ram|m quits [Quit: Bridge terminating on SIGTERM]
18:18:36cruller quits [Quit: Bridge terminating on SIGTERM]
18:18:36aaq|m quits [Quit: Bridge terminating on SIGTERM]
18:18:36MinePlayersPEMyNey|m quits [Quit: Bridge terminating on SIGTERM]
18:18:36Thibaultmol quits [Quit: Bridge terminating on SIGTERM]
18:18:37Fletcher quits [Quit: Bridge terminating on SIGTERM]
18:18:39akaibu|m quits [Quit: Bridge terminating on SIGTERM]
18:18:40tech234a quits [Quit: Bridge terminating on SIGTERM]
18:18:41Ruk8 quits [Quit: Bridge terminating on SIGTERM]
18:18:41hlgs|m quits [Quit: Bridge terminating on SIGTERM]
18:18:41EvanBoehs|m quits [Quit: Bridge terminating on SIGTERM]
18:18:42Alienmaster|m quits [Quit: Bridge terminating on SIGTERM]
18:18:42schwarzkatz|m quits [Quit: Bridge terminating on SIGTERM]
18:18:43moe-a-m|m quits [Quit: Bridge terminating on SIGTERM]
18:18:43vexr quits [Quit: Bridge terminating on SIGTERM]
18:18:43andrewvieyra|m quits [Quit: Bridge terminating on SIGTERM]
18:18:43yzqzss quits [Quit: Bridge terminating on SIGTERM]
18:18:44that_lurker|m quits [Quit: Bridge terminating on SIGTERM]
18:18:44GhostIsBeHere|m quits [Quit: Bridge terminating on SIGTERM]
18:18:44mikolaj|m quits [Quit: Bridge terminating on SIGTERM]
18:18:45upperbody321|m quits [Quit: Bridge terminating on SIGTERM]
18:19:30finalti|m quits [Quit: Bridge terminating on SIGTERM]
18:19:30tech234a|m-backup quits [Quit: Bridge terminating on SIGTERM]
18:19:30Roki_100|m quits [Quit: Bridge terminating on SIGTERM]
18:19:30e2mau|m quits [Quit: Bridge terminating on SIGTERM]
18:19:30qyxojzh|m quits [Quit: Bridge terminating on SIGTERM]
18:19:30victor_vaughn|m quits [Quit: Bridge terminating on SIGTERM]
18:19:30th3z0l4|m quits [Quit: Bridge terminating on SIGTERM]
18:19:31coro quits [Quit: Bridge terminating on SIGTERM]
18:19:31triplecamera|m quits [Quit: Bridge terminating on SIGTERM]
18:19:31thermospheric quits [Quit: Bridge terminating on SIGTERM]
18:19:31phaeton quits [Quit: Bridge terminating on SIGTERM]
18:19:31wrangle|m quits [Quit: Bridge terminating on SIGTERM]
18:19:31s-crypt|m|m quits [Quit: Bridge terminating on SIGTERM]
18:19:31hexagonwin|m quits [Quit: Bridge terminating on SIGTERM]
18:19:31froxcey|m quits [Quit: Bridge terminating on SIGTERM]
18:19:31Video quits [Quit: Bridge terminating on SIGTERM]
18:19:31iCesenberk|m quits [Quit: Bridge terminating on SIGTERM]
18:19:31madpro|m quits [Quit: Bridge terminating on SIGTERM]
18:19:39GRBaset quits [Quit: Bridge terminating on SIGTERM]
18:19:39pannekoek11|m quits [Quit: Bridge terminating on SIGTERM]
18:19:39superusercode quits [Quit: Bridge terminating on SIGTERM]
18:19:39octylFractal|m quits [Quit: Bridge terminating on SIGTERM]
18:19:39lasdkfj|m quits [Quit: Bridge terminating on SIGTERM]
18:19:39CrispyAlice2 quits [Quit: Bridge terminating on SIGTERM]
18:19:39jevinskie quits [Quit: Bridge terminating on SIGTERM]
18:19:39jackt1365|m quits [Quit: Bridge terminating on SIGTERM]
18:19:39jwoglom|m quits [Quit: Bridge terminating on SIGTERM]
18:19:39noobirc|m quits [Quit: Bridge terminating on SIGTERM]
18:19:39cmostracker|m quits [Quit: Bridge terminating on SIGTERM]
18:19:39Cydog|m quits [Quit: Bridge terminating on SIGTERM]
18:19:40ragu|m quits [Quit: Bridge terminating on SIGTERM]
18:19:40nosamu|m quits [Quit: Bridge terminating on SIGTERM]
18:19:40joepie91|m quits [Quit: Bridge terminating on SIGTERM]
18:19:40spearcat|m quits [Quit: Bridge terminating on SIGTERM]
18:19:40ax|m quits [Quit: Bridge terminating on SIGTERM]
18:19:40katia|m quits [Quit: Bridge terminating on SIGTERM]
18:19:40l0rd_enki|m quits [Quit: Bridge terminating on SIGTERM]
18:19:40bogsen quits [Quit: Bridge terminating on SIGTERM]
18:19:40Adamvoltagex|m quits [Quit: Bridge terminating on SIGTERM]
18:19:40osiride|m quits [Quit: Bridge terminating on SIGTERM]
18:19:40trumad|m quits [Quit: Bridge terminating on SIGTERM]
18:19:40vics quits [Quit: Bridge terminating on SIGTERM]
18:19:40NickS|m quits [Quit: Bridge terminating on SIGTERM]
18:19:40v1cs quits [Quit: Bridge terminating on SIGTERM]
18:19:40its_notjack quits [Quit: Bridge terminating on SIGTERM]
18:19:40haha-whered-it-go|m quits [Quit: Bridge terminating on SIGTERM]
18:19:40supermariofan67|m quits [Quit: Bridge terminating on SIGTERM]
18:19:40Fijxu|m quits [Quit: Bridge terminating on SIGTERM]
18:19:40nano412510 quits [Quit: Bridge terminating on SIGTERM]
18:19:40mister_x quits [Quit: Bridge terminating on SIGTERM]
18:19:40Misty|m quits [Quit: Bridge terminating on SIGTERM]
18:19:40Valkum|m quits [Quit: Bridge terminating on SIGTERM]
18:19:40kaz__|m quits [Quit: Bridge terminating on SIGTERM]
18:19:40rain|m quits [Quit: Bridge terminating on SIGTERM]
18:19:40mat|m1 quits [Quit: Bridge terminating on SIGTERM]
18:19:40miksters|m quits [Quit: Bridge terminating on SIGTERM]
18:19:40PhoHale|m quits [Quit: Bridge terminating on SIGTERM]
18:19:40will|m quits [Quit: Bridge terminating on SIGTERM]
18:19:40yetanotherarchiver|m quits [Quit: Bridge terminating on SIGTERM]
18:19:40Passiing|m quits [Quit: Bridge terminating on SIGTERM]
18:19:40username675f|m quits [Quit: Bridge terminating on SIGTERM]
18:19:40Tyrasuki|m quits [Quit: Bridge terminating on SIGTERM]
18:19:40gareth48|m quits [Quit: Bridge terminating on SIGTERM]
18:19:40Nulo|m quits [Quit: Bridge terminating on SIGTERM]
18:19:40nightpool quits [Quit: Bridge terminating on SIGTERM]
18:19:40noxious quits [Quit: Bridge terminating on SIGTERM]
18:19:40ampdot|m quits [Quit: Bridge terminating on SIGTERM]
18:19:40Claire|m quits [Quit: Bridge terminating on SIGTERM]
18:19:41saouroun|m quits [Quit: Bridge terminating on SIGTERM]
18:19:41hillow596|m quits [Quit: Bridge terminating on SIGTERM]
18:19:41yarnover|m quits [Quit: Bridge terminating on SIGTERM]
18:19:41Cronfox|m quits [Quit: Bridge terminating on SIGTERM]
18:19:41Joy|m quits [Quit: Bridge terminating on SIGTERM]
18:19:41IceCodeNew|m quits [Quit: Bridge terminating on SIGTERM]
18:19:42M--mlv|m quits [Quit: Bridge terminating on SIGTERM]
18:19:42justauser|m quits [Quit: Bridge terminating on SIGTERM]
18:20:25<klea>that_lurker: just don't download it :p, slowly download chunks and do chunked uploads to IA :p
18:20:38rewby|m (rewby) joins
18:20:38@ChanServ sets mode: +o rewby|m
18:22:04<that_lurker>:-P
18:22:17<that_lurker>If someone does that please request a collection for it first :-)
18:22:28ram|m joins
18:22:28pannekoek11|m joins
18:22:28MinePlayersPEMyNey|m joins
18:22:28Ruk8 (Ruk8) joins
18:22:41<that_lurker>and maybe get arkiver's or IA's ok to dump that much data :-P
18:22:55Cronfox|m joins
18:22:55xxia|m joins
18:22:55andrewvieyra|m joins
18:22:55nstrom|m joins
18:23:37Dada quits [Remote host closed the connection]
18:23:49Dada joins
18:23:51saouroun|m joins
18:23:51haha-whered-it-go|m joins
18:23:52schwarzkatz|m joins
18:23:52jackt1365|m joins
18:23:52x9fff00 (x9fff00) joins
18:23:53madpro|m joins
18:23:53DigitalDragon joins
18:23:53ragu|m joins
18:23:53audrooku|m joins
18:23:53Ajay joins
18:23:53igneousx (igneousx) joins
18:23:53akaibu|m joins
18:23:53masterx244|m (masterx244|m) joins
18:24:01superusercode joins
18:24:01Cydog|m joins
18:24:01trumad|m joins
18:24:01cmostracker|m joins
18:24:01aaq|m joins
18:24:01mikolaj|m joins
18:24:01upperbody321|m joins
18:24:01nyuuzyou joins
18:24:01spearcat|m joins
18:24:01hexagonwin|m joins
18:24:01th3z0l4|m joins
18:24:01victor_vaughn|m joins
18:24:01ax|m joins
18:24:01e2mau|m joins
18:24:01moe-a-m|m joins
18:24:01Sanqui|m (Sanqui) joins
18:24:01joepie91|m joins
18:24:01tech234a|m-backup (tech234a) joins
18:24:01mpeter|m joins
18:24:01@ChanServ sets mode: +o Sanqui|m
18:24:01jwoglom|m joins
18:24:01thermospheric joins
18:24:01theblazehen|m joins
18:24:01britmob|m joins
18:24:01octylFractal|m joins
18:24:01GRBaset (GRBaset) joins
18:24:01Minkafighter|m joins
18:24:01NickS|m joins
18:24:01hlgs|m joins
18:24:01Roki_100|m joins
18:24:01lasdkfj|m joins
18:24:01finalti|m joins
18:24:01Thibaultmol joins
18:24:01Fletcher (Fletcher) joins
18:24:01MaxG joins
18:24:01yzqzss (yzqzss) joins
18:24:01CrispyAlice2 joins
18:24:01jevinskie joins
18:24:01wrangle|m joins
18:24:01vexr joins
18:24:01nosamu|m joins
18:24:01noobirc|m joins
18:24:01phaeton (phaeton) joins
18:24:01coro joins
18:24:01alexshpilkin joins
18:24:01v1cs joins
18:24:01flashfire42|m (flashfire42) joins
18:24:01EvanBoehs|m joins
18:24:01iCesenberk|m joins
18:24:01qyxojzh|m joins
18:24:01s-crypt|m|m joins
18:24:01l0rd_enki|m joins
18:24:01Tom|m1 joins
18:24:01gamer191-1|m joins
18:24:01that_lurker|m joins
18:24:01bogsen (bogsen) joins
18:24:02Vokun (Vokun) joins
18:24:02supermariofan67|m joins
18:24:02vics joins
18:24:02GhostIsBeHere|m joins
18:24:02Adamvoltagex|m joins
18:24:02anon00001|m joins
18:24:02cruller joins
18:24:02tomodachi94 (tomodachi94) joins
18:24:02triplecamera|m joins
18:24:02froxcey|m joins
18:24:02tech234a (tech234a) joins
18:24:02katia|m joins
18:24:02Fijxu|m joins
18:24:02Video joins
18:24:02its_notjack (its_notjack) joins
18:24:02osiride|m joins
18:24:02Hans5958 joins
18:24:02justauser|m joins
18:24:02mind_combatant (mind_combatant) joins
18:24:02Alienmaster|m joins
18:24:02Exorcism (exorcism) joins
18:24:18<@arkiver>!remindme 8h that_lurker password dump google
18:24:18<eggdrop>[remind] ok, i'll remind you at 2026-01-18T02:24:18Z
18:25:27username675f|m joins
18:26:17mister_x joins
18:26:18PhoHale|m joins
18:26:19yetanotherarchiver|m joins
18:26:19hillow596|m joins
18:26:19Joy|m joins
18:26:19nano412510 (nano412510) joins
18:26:19Tyrasuki|m joins
18:26:20yarnover|m joins
18:26:20mat|m1 joins
18:26:20gareth48|m joins
18:26:20kaz__|m joins
18:26:20Passiing|m joins
18:26:20noxious joins
18:26:22will|m joins
18:26:22Misty|m joins
18:26:22Nulo|m joins
18:26:22ampdot|m joins
18:26:22Valkum|m joins
18:26:22Claire|m joins
18:26:22IceCodeNew|m joins
18:26:22rain|m joins
18:26:22nightpool (nightpool) joins
18:26:23miksters|m joins
18:30:11<klea>that_lurker: nah, it seems IA has tooling to move things from search results to collections (or given item ids put them in collections), i suppose that's why archiveteam_inbox exists.
18:30:43<klea>oh wait, will arkiver store the dump on ia themselves?
18:31:31<that_lurker>It's easier to just push to a collection directly than ask to move items later; also, you can let them know how much data there will be :-P
18:31:38<klea>oh
18:31:52<that_lurker>them = a_rkive :-P
18:32:14<klea>smh https://console.cloud.google.com/storage/browser/net-ntlmv1-tables requires login :(
18:34:19Gadelhas562873784438 joins
18:34:45<klea>it seems to be split into 2.1G chunks: https://console.cloud.google.com/storage/browser/net-ntlmv1-tables/tables?pageState=(%22StorageObjectListTable%22:(%22f%22:%22%255B%255D%22))
18:35:37M--mlv|m joins
18:36:18<klea>that_lurker: do you know the total size?
18:36:39<that_lurker>8.6 TB was mentioned in a few places
18:37:03<klea>ouch
18:39:28<klea>i wonder if IA could lend me access to FOS. :p http://fos.textfiles.com/pipeline.html#:~:text=%2Fdev%2Fmd1%2013T%202%2E2T%2011T%2018%25%20%2F2
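A rough sketch of klea's download-chunks-then-upload idea using the internetarchive Python library, assuming a pre-approved collection and a hypothetical item naming scheme; each ~2.1 GB object is streamed to disk, pushed to IA, then deleted, so only one chunk is local at a time (sharding across items to stay under the per-item size limit is left out):

    import os
    import requests
    import internetarchive
    from urllib.parse import quote

    BUCKET = "https://storage.googleapis.com/net-ntlmv1-tables"

    def mirror_object(key: str, item_id: str) -> None:
        local = os.path.basename(key)
        with requests.get(f"{BUCKET}/{quote(key)}", stream=True, timeout=60) as r:
            r.raise_for_status()
            with open(local, "wb") as f:
                for chunk in r.iter_content(chunk_size=1 << 20):
                    f.write(chunk)
        internetarchive.upload(item_id, files={key: local})  # hypothetical item id
        os.remove(local)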
18:39:38<Juest>is there a direct contact with internet archive around here by chance?
18:39:47<klea>Juest: wdym by DC?
18:39:48<klea>oh also
18:40:03<justauser>#internetarchive has an employee.
18:40:19<justauser>But that's only for small bug reports and status updates.
18:40:36<klea>yeah
18:40:52<Juest>Archive Team is independent and not affiliated with IA, but works very closely with it, despite not seeming like much?
18:41:03<Juest>excuse my weirdness
18:41:41<klea>kind of
18:42:07<justauser>AT founder is an IA employee.
18:42:24<klea>yes
18:42:29<justauser>Current team lead has some extended access.
18:42:31<Juest>also i very much prefer chat, so apologies for not researching much with the wiki
18:42:54<klea>btw, afaik AT is more of an IRC-first, docs-later kind of thing, from what i've seen.
18:48:10etnguyen03 quits [Quit: Konversation terminated!]
18:55:19<@JAA>pokechu22: https://transfer.archivete.am/inline/WB59p/6cnonr9bi6enxit9na57wckan-trace.txt
18:56:38colla quits [Quit: StoCa!]
18:58:27colla joins
18:59:35<klea>wait
18:59:38<klea>google's bucket is s3
18:59:53<klea>compatible.
19:00:09<klea>so, what happens if we ask someone in #archivebot to queue it
19:06:15<klea>fun idea
19:06:25<klea>project specifically for archival of s3-like buckets.
19:08:24<nicolas17>what's the size limit of an IA item?
19:10:03<justauser>1TB unless they changed it and forgot to tell.
19:10:07<klea>nicolas17: https://irclogs.archivete.am/internetarchive/2026-01-07#l43630241
19:10:30<nicolas17>klea: google cloud console needs login
19:10:34<nicolas17>but the bucket is open https://storage.googleapis.com/net-ntlmv1-tables
19:10:39<klea>yeah, i noticed later.
19:10:58<nicolas17>having a # in the filename is insane
19:11:42<klea>that's the %23 right?
19:11:59<nicolas17>yes
19:12:16<klea>also, im stupid
19:12:23<klea>s3 exposes two paths per item.
19:12:46etnguyen03 (etnguyen03) joins
19:12:52<klea>so how do we get WBM to load the same warc files later for net-ntlmv1-tables.storage.googleapis.com/?
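For reference, a minimal sketch of listing the open bucket over the S3-style XML API and emitting both URL forms for every object, which is the raw material for the dedup question here (anonymous listing and marker pagination are assumptions; keys are percent-encoded because at least one filename contains a '#'):

    import xml.etree.ElementTree as ET
    from urllib.parse import quote

    import requests

    BUCKET = "net-ntlmv1-tables"

    def list_keys():
        marker = ""
        while True:
            r = requests.get(f"https://storage.googleapis.com/{BUCKET}",
                             params={"marker": marker}, timeout=60)
            r.raise_for_status()
            root = ET.fromstring(r.content)
            # namespace-agnostic extraction of <Key> entries
            keys = [el.text for el in root.iter() if el.tag.endswith("Key")]
            yield from keys
            truncated = next((el.text for el in root.iter()
                              if el.tag.endswith("IsTruncated")), "false")
            if truncated != "true" or not keys:
                break
            marker = keys[-1]

    for key in list_keys():
        print(f"https://storage.googleapis.com/{BUCKET}/{quote(key)}")
        print(f"https://{BUCKET}.storage.googleapis.com/{quote(key)}")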
19:13:19<nicolas17>oh ew
19:14:18<justauser>No way? No way!
19:14:25<klea>?
19:15:09<nicolas17>using other tools (like wget-lua) you can produce deduplicated WARCs, but the whole point of using archivebot would have been to avoid downloading+uploading yourself
19:15:56<klea>nicolas17: i believe we can later churn out some warcs that just contain deduplication entries?
19:17:16<justauser>Producing WARCs based on unproven guesses is discouraged.
19:17:23<klea>yeah :p
19:17:38<@JAA>s/discouraged/illegal/
19:17:48<klea>justauser: im thinking of a separate idea
19:17:54<klea>someone uploads all those warcs to ia
19:18:06<klea>wait, AB doesn't yet support deduplication between jobs right?
19:18:16<@JAA>AB doesn't dedupe at all.
19:18:20<klea>oh
19:18:21<klea>lovely :(
19:19:11<nicolas17>if AB supported deduplication (even if only within a job), we could just give it https://storage.googleapis.com/net-ntlmv1-tables/whatever and https://net-ntlmv1-tables.storage.googleapis.com/whatever in the same job
19:19:19<klea>i wonder how much time it'd take to make tooling to archive s3-like buckets, including but not limited to storage.googleapis.com, s3.amazonaws.com, s3.archive.org, etc.
19:19:45<justauser>little-things has a lister, and it's usually sufficient.
19:19:49<klea>oh wait, archiving IA inside of IA is not a good idea, and probably illegal.
19:19:50colla quits [Client Quit]
19:19:53<@JAA>I have tooling for the listing, and the rest is just URLs.
19:20:07<klea>nono, i mean having a DPoS project for those :p
19:20:23<@JAA>Mhm
19:20:24<klea>that would deduplicate given some config that'd make it get two files, and dedupe
19:20:43<nicolas17>I'd rather have a more generic DPoS job for dedup purposes
19:20:52klea nods
19:21:11<klea>i have a silly idea
19:21:18<klea>but it's probably still illegal.
19:21:54<klea>things provide etag headers
19:21:56colla joins
19:22:33<klea>so if you do a request to check the etag, and it tells you it's not been modified since, can you assume it's the same file (supposing it's serviced by the same provider)?
19:23:48<katia>No.
19:24:03<klea>why?
19:24:37<nicolas17>some cloud storage systems provide hashes in the listing, or sometimes we can just "know" (or guess) in some other way that multiple URLs will produce the same file
19:24:41<katia>Assuming makes an ass out of u and me
19:24:49<klea>i guess because I didn't ask the server for the original files, so i can't verify
19:25:10<katia>WARCs are raw data that goes over the wire. Not assumptions based on a header
19:25:47<klea>wait, another silly and stupid question: if i were to actually do real requests for the files, get the other warc, and notice they are duplicates (for whichever files are duplicates, hopefully all), could i then write warc revisit records pointing at that other warc that's in another archive.org item?
19:26:08<nicolas17>what we need is a DPoS project where the worker can be told to download https://net-ntlmv1-tables.storage.googleapis.com/LICENSE and https://storage.googleapis.com/net-ntlmv1-tables/LICENSE as part of the same item, and dedup if the data is the same (it already does that dedup)
19:26:31klea agrees on what nicolas17 just said.
19:26:51<nicolas17>we can't shove a dozen "these are probably the same file" URLs into the item name though, so it could be trickier
19:27:08<klea>silly idea
19:27:22<klea>i'm sure the 3 letter archivist here that has two duplicate letters will love this
19:27:33<nicolas17>he's already raising an eyebrow at you
19:27:34<klea>transfer.archivete.am link with "probably same file" urls
19:27:53<klea>then queue that to the tracker
19:28:20<klea>worker gets that item, downloads it (with the added side effect of archival since we're running wget-at i believe on them), then tries to get every single url in that file, and then deduplicates.
19:29:11<klea>if you're silly and stupid like me you make batches instead of uploading only one, which means the worker will effectively write more than one response since i am silly and give each worker more than one task at a time :p
19:30:27<klea>> we can't shove a dozen "these are probably the same file" URLs into the item name though, so it could be trickier <- wait, what prevents us from shoving those urls into the item name, is there some limit length on the tracker?
19:31:27<nicolas17>uhh what did rclone or azure or whatever do here?
19:31:31<nicolas17>{"Path":"NU5T-14J002-JAD_1667587484000.VBF","Name":"NU5T-14J002-JAD_1667587484000.VBF","Size":270478638,"MimeType":"application/octet-stream","ModTime":"2022-12-08T20:05:42.000000000Z","IsDir":false,"Hashes":{"md5":"0000bd0fadf8df707aeb6d790020fdd79ec5d75e76ec5d7b"}}
19:31:35<nicolas17>that's not an md5 T_T
19:34:04<nicolas17>but yeah
19:34:08<nicolas17>I have a few extreme cases of duplication
19:34:13<nicolas17>like this https://transfer.archivete.am/inline/1t46z/f7cf43ec508103ad7d1350450c1e36004f42085e80177135.txt
19:34:51<klea>so i guess nicolas17 agrees with my idea of shoving urls into very small files on transfer, then shoving those down the tracker, and then shoving that into a custom DPoS
19:35:39deafmute quits [Client Quit]
19:36:50<@JAA>The ETag thing doesn't work because it's specified to be an opaque identifier for the same representation of the same resource only.
19:37:39<@JAA>So you can do rearchivals that way, but you can't do revisit records for a different URL.
19:37:46<klea>oh.
19:37:55<nicolas17>we can't use it for deduplication but we can use it to make the URL lists, if we're wrong and think two files are the same when we're not, then wget-at will simply not dedup them
19:37:59<klea>> So you can do rearchivals that way, but you can't do revisit records for a different URL. <- that also i suppose relies on the remote system not changing ownership.
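A small sketch of the approach nicolas17 just described: use ETags only to pair up "probably the same file" URLs for the list, and leave the actual dedup to the crawler's payload-digest check (the LICENSE key comes from the earlier example; everything else is illustrative):

    import requests

    def etag(url: str):
        r = requests.head(url, timeout=60, allow_redirects=True)
        r.raise_for_status()
        return r.headers.get("ETag")

    def candidate_pair(bucket: str, key: str):
        a = f"https://storage.googleapis.com/{bucket}/{key}"
        b = f"https://{bucket}.storage.googleapis.com/{key}"
        ta, tb = etag(a), etag(b)
        # matching ETags only nominate the pair; a wrong guess just means
        # the crawler stores both payloads instead of deduping
        return (a, b) if ta and ta == tb else None

    print(candidate_pair("net-ntlmv1-tables", "LICENSE"))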
19:38:06Dango3607 (Dango360) joins
19:38:15<klea>wait, wget-at does dedupe, or is that in need of implementation?
19:38:48<nicolas17>wget-at does dedupe (if the requests are done as part of the same DPoS item)
19:39:06<klea>and wpull doesn't?
19:39:09<@JAA>There's a way to load CDXs as well, but I'm not sure how well-tested that is.
19:41:27Dango360 quits [Ping timeout: 272 seconds]
19:41:27Dango3607 is now known as Dango360
19:41:28<@JAA>nicolas17: To be clear, technically, it would work, and it might even be fine in this particular case. But it definitely doesn't generalise.
19:41:43<@JAA>From a pure HTTP client spec perspective, it's still a violation.
19:42:03<katia>This incident will be reported.
19:42:11<@JAA>oh no
19:42:22<klea>oh no
19:42:37<nicolas17>https://xkcd.com/838/
19:42:49<katia>Hehe
19:42:55<@JAA>wget-at doesn't support that dedupe profile, by the way.
19:43:14<klea>how is data deduped by AT then?
19:43:19<@JAA>It'd be this: https://iipc.github.io/warc-specifications/specifications/warc-format/warc-1.1/#profile-server-not-modified
19:43:23<klea>or does AT dump data without even dedup checking it?
19:43:34<@JAA>By payload digest, usually only within a single process.
19:44:23<nicolas17>wget-at downloads the whole thing and checks if the sha1 of the response matches a previously-downloaded response
19:44:54<@JAA>Yep
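And an illustration of that payload-digest comparison (not wget-at itself): fetch both URL forms and check whether the bodies hash to the same SHA-1, which is the condition under which the second response could become a revisit record:

    import hashlib
    import requests

    def payload_sha1(url: str) -> str:
        h = hashlib.sha1()
        with requests.get(url, stream=True, timeout=60) as r:
            r.raise_for_status()
            for chunk in r.iter_content(chunk_size=1 << 20):
                h.update(chunk)
        return h.hexdigest()

    a = "https://storage.googleapis.com/net-ntlmv1-tables/LICENSE"
    b = "https://net-ntlmv1-tables.storage.googleapis.com/LICENSE"
    print("identical" if payload_sha1(a) == payload_sha1(b) else "different")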
19:45:04<@JAA>And that's what I've done with S3 stuff before as well, just with qwarc.
19:45:51<klea>oh, so if i want s3 repos archived, i should contact JAA with a big list of urls that has both formats, and then JAA will run qwarc and upload to IA for me?
19:46:16<klea>should i also give the archive-s3-repo-with-dedup-on-both-urls-service-that-JAA-provides my S3 creds too?
19:46:26<klea>i suppose not since then it doesn't get in the web category.
19:58:16asie (asie) joins
20:02:51FiTheArchiver joins
20:03:47PaiMei quits [Quit: Wololo]
20:04:06PaiMei (PaiMei) joins
20:05:31n9nes quits [Ping timeout: 272 seconds]
20:08:05n9nes joins
20:11:33FiTheArchiver quits [Remote host closed the connection]
20:11:54FiTheArchiver joins
20:14:26deafmute joins
20:14:54Dada quits [Remote host closed the connection]
20:15:08Dada joins
20:16:35<h2ibot>Haiseiko edited URLTeam/Warrior (+183, /* Warrior projects */): https://wiki.archiveteam.org/?diff=60216&oldid=53418
20:16:36<h2ibot>Brad edited Deathwatch (+692, Added Meta Horizon managed services and added…): https://wiki.archiveteam.org/?diff=60217&oldid=60210
20:16:37<h2ibot>Vi edited Deathwatch (+285, /* 2026-03 */ added The Anime Network): https://wiki.archiveteam.org/?diff=60218&oldid=60217
20:16:38<h2ibot>John123521 edited Deathwatch (+179, move Tom Lehrer and TUYU to Frozen Solid): https://wiki.archiveteam.org/?diff=60219&oldid=60218
20:19:35<h2ibot>JustAnotherArchivist edited URLTeam/Warrior (-18, Fix shor.kr addition): https://wiki.archiveteam.org/?diff=60220&oldid=60216
20:22:36<h2ibot>JustAnotherArchivist edited URLTeam/Warrior (-165, Reverted; there has been and is no shor-kr…): https://wiki.archiveteam.org/?diff=60221&oldid=60220
20:24:36<h2ibot>JustAnotherArchivist edited Deathwatch (+160, Restore Eshizuoka entry removed by…): https://wiki.archiveteam.org/?diff=60222&oldid=60219
20:25:34FiTheArchiver quits [Remote host closed the connection]
20:25:36<h2ibot>JustAnotherArchivist edited Deathwatch (-135, Remove empty year sections for 2030s; we can…): https://wiki.archiveteam.org/?diff=60223&oldid=60222
20:25:55FiTheArchiver joins
20:27:45Aurora quits [Quit: Ooops, wrong browser tab.]
20:30:50<klea>i think i'll make a subforum or subservice for a small thing that will close in 2500, that way the wiki needs more sections :p
20:44:28HP_Archivist quits [Quit: Leaving]
20:44:44HP_Archivist joins
20:56:33FiTheArchiver quits [Remote host closed the connection]
20:57:47FiTheArchiver joins
21:25:02nine quits [Quit: See ya!]
21:25:19nine joins
21:31:33FiTheArchiver quits [Remote host closed the connection]
21:31:56FiTheArchiver joins
21:50:58Webuser025436 quits [Quit: Ooops, wrong browser tab.]
22:02:33FiTheArchiver quits [Remote host closed the connection]
22:03:09FiTheArchiver joins
22:03:10ScarlettStunningSpace quits [Ping timeout: 256 seconds]
22:05:32Karlett joins
22:11:51<h2ibot>Klea edited Phorge/uncategorized (+66, Added git.kolab.org): https://wiki.archiveteam.org/?diff=60224&oldid=60124
22:12:51<h2ibot>Klea edited Phorge/uncategorized (+135, Add forge.softwareheritage.org): https://wiki.archiveteam.org/?diff=60225&oldid=60224
22:34:11FiTheArchiver quits [Read error: Connection reset by peer]
22:37:00<pokechu22>deafmute: I'm not aware of an existing project for https://cosplay.com/ - do you have an estimate for how big it is?
22:37:34<pokechu22>It looks like it's not too scripty so archivebot probably could be used for it
22:42:55<h2ibot>Nintendofan885 edited Namuwiki (+40, fix infobox name): https://wiki.archiveteam.org/?diff=60226&oldid=60200
22:51:09<deafmute>pokechu22: difficult to estimate. judging from official numbers maybe 100-150k photos, which would be 100GB-1TB I guess. But the site is broken and it doesn't show nearly the same number of images it claims to have, so probably less than that.
22:51:48<deafmute>and I haven't found a way to see the full res photos, just thumbnails
22:53:51deafmute quits [Client Quit]
22:54:07deafmute joins
22:54:17<klea>deafmute: it seems https://s3.amazonaws.com/cosplay-cdn/large/3d8bc518-ade8-4cad-9729-032b37331052.jpg they're on s3 but that bucket isn't open
22:57:51Webuser239629 joins
22:58:00Webuser239629 quits [Client Quit]
23:00:01SootBector quits [Remote host closed the connection]
23:01:09SootBector (SootBector) joins
23:06:18<pokechu22>hmm, yeah. https://cosplay.com/series/reborn says https://cosplay.com/character/gokudera-hayato has 233 costumes 1023 photos, but only lists 12 photos on https://cosplay.com/character/gokudera-hayato
23:06:48<pokechu22>(and guessing https://cosplay.com/character/gokudera-hayato?page=2 doesn't do anything either)
23:08:19<pokechu22>... and https://cosplay.com/?page=3 and beyond requires logging in, so it can't be discovered that way either (/series doesn't have that limitation though)
23:08:29<klea>pokechu22: they're categories, so they have photos inside?
23:09:14<pokechu22>I misspoke - https://cosplay.com/character/gokudera-hayato only lists 12 costumes and like a hundred photos
23:09:33<klea>oh
23:10:17<nicolas17>https://cosplay.com/s/092dder49 -> https://s3.amazonaws.com/photo.cosplay.com/143783/1800163.jpg
23:10:29<nicolas17>however the .jpg only loads with Referer: https://cosplaxy.com/
23:10:59<deafmute>I tried logging in with a bugmenot account, it doesn't seem to change anything regarding how many images are shown
23:11:12<pokechu22>Huh, I didn't know s3 could do referer checks
23:11:36<klea>that confused me
23:11:39<pokechu22>but archivebot generates referers properly so that probably won't be an issue
23:11:47<nicolas17>bucket policy can do many things
23:12:01<pokechu22>Does the bugmenot account make e.g. https://cosplay.com/?page=1000 work?
23:12:08<klea>there's also a forum
23:12:13nicolas17 wonders if you can make a bucket policy that only allows access for 5 minutes of every hour
23:12:35<klea>pokechu22: logging in changes the ui
23:12:52<pokechu22>hmm
23:13:24<pokechu22>Oh, https://cosplay.com/member/206560 and https://cosplay.com/member/206559 both exist; we could maybe enumerate those
23:13:26<deafmute>pokechu22: no, still the same page for me
23:13:40<klea>for me it seems to be a doomscrollable interface.
23:14:03<deafmute>nicolas17: how did you find that /s/092dder49 page?
23:14:05<klea>which pokes <https://cosplay.com/livewire/message/status-list> with a servermemo
23:14:26<klea>deafmute: clicking on profile for user
23:14:30<pokechu22>Those /s/ ones are just linked on the home page too
23:14:41<klea>also there seems to be a forum.
23:15:30<deafmute>oh okay. I thought you found a way to get those pages for old posts like that sub-zero in my example
23:15:33<pokechu22>Huh, and https://cosplay.com/member/26559 also works, so they really do have 200k users maybe?
23:15:50<pokechu22>that links old posts like https://cosplay.com/s/ynz11e8m9
23:16:32<pokechu22>https://cosplay.com/member/1 -> https://cosplay.com/s/b934yzmvn apparently 3 years ago
23:17:05Dada quits [Remote host closed the connection]
23:17:16<klea>https://cosplay.com/livewire/livewire.js
23:17:17Dada joins
23:18:11<pokechu22>I think bruteforcing the user list would be the easiest way to discover all of the images since it paginates all the way to https://cosplay.com/member/1?page=604 (and also *stops* paginating there, unlike some other more annoying sites...)
23:18:44<klea>pokechu22: we'd need to get to crawl depth two which i think AB does right?
23:19:01<pokechu22>What do you mean?
23:19:19<klea>it has to go to https://cosplay.com/s/yxny028e9 to be able to get the full picture too, not just the thumbnail.
23:19:34<pokechu22>I'm specifically thinking of a recursive job as an !a < list job with a custom sitemap (so it behaves like !a https://cosplay.com/ but also gets seeded with all of the member URLs)
23:19:46<pokechu22>that should go to those /s/ URLs and everything else on the site that's linked normally
23:20:14<klea>they seem to have up to user https://cosplay.com/member/402093
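A sketch of generating the seed list for the "!a < list" job pokechu22 describes: one URL per member ID up to the highest ID spotted above (most gaps will just 404, and deeper pagination is discovered by recursion once the job runs):

    MAX_MEMBER_ID = 402093  # highest member ID observed so far

    with open("cosplay_members.txt", "w") as f:
        f.write("https://cosplay.com/\n")
        for member_id in range(1, MAX_MEMBER_ID + 1):
            f.write(f"https://cosplay.com/member/{member_id}\n")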
23:21:12nine quits [Client Quit]
23:21:25nine joins
23:23:35<pokechu22>huh, looking at https://cosplay.com/series?page=50 the first row of the table claims 327 costumes, 33 photos, but https://cosplay.com/series/animamundi-dark says it's the other way around
23:25:05<pokechu22>it also says there are 6 characters and then lists 13 of them, with a sum of 21 costumes and 327 photos
23:25:30<deafmute>advanced cosplay maths
23:30:12<deafmute>Now I finally know how to navigate on this broken site and actually find what I was looking for, many thanks for that.
23:31:09<deafmute>can't really help with the technical stuff however, I was just curious
23:38:02<h2ibot>Nintendofan885 edited Qwarc (+4, link [[WARC]]): https://wiki.archiveteam.org/?diff=60227&oldid=54296
23:40:58deafmute quits [Client Quit]
23:42:19<klea>JAA: could you put every item that you made using qwarc have subject qwarc?
23:42:50deafmute joins
23:46:26<pokechu22>I wrote a quick script to sum the data on https://cosplay.com/series: https://transfer.archivete.am/inline/KPGxg/cosplay.com_count_photos.py which says 1608146 photos, 333780 costumes
23:47:04<pokechu22>That's roughly the same size as refsheet.net, which is definitely doable
23:47:20<@JAA>klea: I could, yes.
23:47:38<klea>mostly because then finding qwarc scripts would be significantly easier.
23:48:09<pokechu22>That's roughly the same size as refsheet.net (1445504 images, 408063 characters, 105078 users), which is definitely doable
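For context, a hedged sketch of the kind of tally pokechu22's script performs (his actual script is at the transfer link above; the "N photos" / "N costumes" text patterns and the stop condition are guesses about the /series listing):

    import re
    import requests

    photos = costumes = 0
    page = 1
    while True:
        html = requests.get("https://cosplay.com/series",
                            params={"page": page}, timeout=60).text
        p = [int(n.replace(",", "")) for n in re.findall(r"([\d,]+)\s+photos", html)]
        c = [int(n.replace(",", "")) for n in re.findall(r"([\d,]+)\s+costumes", html)]
        if not p and not c:
            break
        photos += sum(p)
        costumes += sum(c)
        page += 1

    print(photos, "photos,", costumes, "costumes")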
23:49:03<@JAA>Mhm
23:49:15<deafmute>oh that's like 10x my guess lol
23:50:33Dada quits [Remote host closed the connection]