| 00:18:42 | | Wohlstand (Wohlstand) joins |
| 00:23:43 | | Wohlstand quits [Ping timeout: 272 seconds] |
| 00:48:06 | <klea> | btw, it'd be neat to have a system to track planned and unplanned outages in systems that AT uses. |
| 00:54:14 | <klea> | /cc DigitalDragons ^ |
| 01:06:23 | | pika joins |
| 01:09:58 | | SootBector quits [Remote host closed the connection] |
| 01:11:04 | | SootBector (SootBector) joins |
| 01:16:55 | | pika quits [Ping timeout: 272 seconds] |
| 01:17:38 | | Suika_ quits [Ping timeout: 256 seconds] |
| 01:24:35 | | pika joins |
| 01:25:41 | | pokechu22 quits [Quit: WeeChat 4.7.1] |
| 01:26:08 | | pokechu22 (pokechu22) joins |
| 01:26:33 | | Suika joins |
| 01:29:35 | | pika quits [Ping timeout: 272 seconds] |
| 01:37:50 | | pika joins |
| 01:42:53 | | pika quits [Ping timeout: 272 seconds] |
| 01:49:15 | | Sk1d quits [Read error: Connection reset by peer] |
| 01:51:30 | | pika joins |
| 01:51:51 | | pika leaves |
| 01:59:47 | | cyan_box joins |
| 02:04:06 | | cyanbox_ quits [Ping timeout: 256 seconds] |
| 02:36:51 | | MrMcNuggets (MrMcNuggets) joins |
| 03:06:39 | | etnguyen03 (etnguyen03) joins |
| 03:18:10 | | etnguyen03 quits [Remote host closed the connection] |
| 03:19:47 | | etnguyen03 (etnguyen03) joins |
| 03:21:38 | | etnguyen03 quits [Remote host closed the connection] |
| 03:22:47 | | etnguyen03 (etnguyen03) joins |
| 03:38:21 | | etnguyen03 quits [Client Quit] |
| 03:46:06 | | CYBERDEV quits [Quit: Leaving] |
| 03:46:16 | | etnguyen03 (etnguyen03) joins |
| 03:54:55 | | CYBERDEV joins |
| 03:59:21 | | etnguyen03 quits [Remote host closed the connection] |
| 04:27:37 | | chrismeller3 quits [Quit: chrismeller3] |
| 05:02:38 | | Webuser290291 joins |
| 05:02:55 | | Webuser290291 quits [Client Quit] |
| 05:04:14 | | wotd quits [Remote host closed the connection] |
| 05:04:18 | | n9nes quits [Ping timeout: 256 seconds] |
| 05:04:47 | | wotd joins |
| 05:07:05 | | n9nes joins |
| 05:14:25 | | chunkynutz60 quits [Ping timeout: 272 seconds] |
| 05:17:20 | | ThetaDev quits [Quit: https://quassel-irc.org - Chat comfortably. Anywhere.] |
| 05:18:00 | | ThetaDev joins |
| 05:22:08 | <HP_Archivist> | I just noticed this: https://www.reddit.com/r/Archivists/comments/1q9n5nt/nara_is_shutting_down_history_hub_for_citizen/ |
| 05:22:26 | <HP_Archivist> | https://historyhub.history.gov/citizen_archivists/f/discussions |
| 05:22:29 | <HP_Archivist> | Can we grab it? |
| 05:23:12 | <pokechu22> | "On January 15, 2026 the History Hub site will be “frozen in time.” The site will remain available for reference until February 13, 2026." |
| 05:23:57 | <HP_Archivist> | Does that mean it will be frozen in time and online or? |
| 05:24:16 | | Webuser025436 joins |
| 05:24:18 | <pokechu22> | Presumably that means they made it read-only yesterday and it'll be online for a month before they close it fully |
| 05:24:19 | <nicolas17> | sounds like it will be frozen and online between jan 15 and feb 13, and offline after feb 13 |
| 05:24:29 | <pokechu22> | I'm seeing incapsula on it |
| 05:24:34 | <@arkiver> | i've been working on a job in AB |
| 05:24:34 | <nicolas17> | pls add to deathwatch |
| 05:24:38 | <@arkiver> | but not sure if it went well |
| 05:24:57 | <pokechu22> | It didn't work |
| 05:26:00 | <HP_Archivist> | Hm. Throw it into AB now? |
| 05:26:13 | <pokechu22> | It seems like the incapsula JS challenge would need to be solved, and I don't know how long that lasts |
| 05:26:48 | <Webuser025436> | Hi. Ming Pao Canada, a hk-based newspaper that has a Canada edition and daily newspaper, announced they are shutting down as of today. [1] Notably, they have an archive of all articles Ming Pao Hong Kong (the HK edition). All of Ming Pao HK's articles pre-2021 have been removed a long time ago from the internet because of the media situation in HK. |
| 05:26:49 | <Webuser025436> | Is it possible to archive? https://www.mingpaocanada.com/ |
| 05:26:49 | <Webuser025436> | [1]: https://www.cp24.com/news/canada/2026/01/13/ex-journalists-lament-closure-of-ming-pao-canadas-last-chinese-language-daily-paper/ |
| 05:28:13 | <h2ibot> | Pokechu22 edited Deathwatch (+237, /* 2026-02 */ https://historyhub.history.gov/): https://wiki.archiveteam.org/?diff=60210&oldid=60209 |
| 05:28:30 | <pokechu22> | Webuser025436: I believe we've already started an archivebot job for that; I'm going to double-check the status of it |
| 05:29:59 | <pokechu22> | Webuser025436: It's currently running, together with http://mingshengbao.com/ - see http://archivebot.com/?initialFilter=mingpaocanada#log-container-6cnonr9bi6enxit9na57wckan |
| 05:30:55 | <Webuser025436> | Many thanks 🙏 |
| 05:31:15 | <pokechu22> | Where is the archive of the pre-2021 Hong Kong articles? I can't read Chinese so I'd like to make sure we're saving that too |
| 05:32:54 | <nicolas17> | pokechu22: wonder if we should speed up that job |
| 05:33:28 | <pokechu22> | It closes at the end of January; today was the date of the last edition being published |
| 05:33:44 | <pokechu22> | (at least according to https://www.cp24.com/news/canada/2026/01/13/ex-journalists-lament-closure-of-ming-pao-canadas-last-chinese-language-daily-paper/) |
| 05:37:03 | <Webuser025436> | pokechu22 All HK articles for a given day are shown here: https://www.mingpaocanada.com/tor/htm/News/YYYYMMDD/HK-GAindex_r.htm |
| 05:37:03 | <Webuser025436> | So for example: https://www.mingpaocanada.com/tor/htm/News/20140710/HK-GAindex_r.htm |
| 05:37:03 | <Webuser025436> | The earliest AFAICT is 20140710. |
| 05:38:53 | <pokechu22> | Is there a page that lists all of the previous ones? I assume there must be because archivebot has found https://www.mingpaocanada.com/tor/htm/News/20220429/tam1_r.htm but I don't know exactly where it came from |
| 05:40:12 | <pokechu22> | (there's https://www.mingpaocanada.com/tor/htm/Responsive/archiveList.cfm but that seems to only directly show the last week) |
| 05:40:19 | <nicolas17> | pokechu22: also I'm seeing many requests like https://www.mingpaocanada.com/tor/htm/News/20220815/TD/TD/tdc1.txt that redirect to an error page, might be a crawling glitch finding garbage in JS or something? |
| 05:41:58 | <nicolas17> | ah yes, docPath: "HK-GA/gc/gcc1.txt" |
| 05:41:59 | <pokechu22> | Yeah, looks like that comes from https://www.mingpaocanada.com/tor/htm/News/20220815/HK-gaa1_r.htm containing a POST to /Tor/cfc/popular_addone.cfc with HK-GA/ga/gaa1.txt as a parameter |
| 05:42:44 | <nicolas17> | not sure how to avoid this, excluding *.txt feels too broad |
| 05:43:39 | <pokechu22> | It's probably fine to just leave them as-is since there's 1 per article and most articles have several images as well |
| 05:44:25 | <nicolas17> | well it's also following the redirect and saving errorpage.html every single time |
| 05:45:21 | <pokechu22> | Looks like that's not a new issue: https://web.archive.org/web/20260901000000*/https://www.mingpaocanada.com/errorpage.html :) |
| 05:45:36 | <pokechu22> | ... ok, though 10660 snapshots on January 16 is probably still excessive |
| 05:46:26 | <nicolas17> | pain |
| 05:50:09 | <pokechu22> | I guess I can check what dates it's already found using ab2f |
| 05:55:16 | <nicolas17> | I was thinking something like tor/htm/News/[0-9]{8}/[A-Z]+/[A-Z]+/[a-z]+[0-9]\.txt |
| 05:55:27 | <nicolas17> | but that's not exhaustive, will need a few more patterns |
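nicolas17's candidate ignore pattern above can be sanity-checked against the URLs from the discussion; a minimal sketch (the helper name is mine, and this is not an actual job ignore config):

```python
import re

# The pattern proposed above; not exhaustive, since some docPath values
# (e.g. "HK-GA/ga/gaa1.txt") contain hyphens and lowercase segments.
PATTERN = re.compile(r"tor/htm/News/[0-9]{8}/[A-Z]+/[A-Z]+/[a-z]+[0-9]\.txt")

def is_spurious_txt(url: str) -> bool:
    """Return True if the URL matches the garbage .txt shape seen in the job."""
    return PATTERN.search(url) is not None

# The example URL from the channel matches:
print(is_spurious_txt("https://www.mingpaocanada.com/tor/htm/News/20220815/TD/TD/tdc1.txt"))  # True
# ...but a path built from docPath "HK-GA/ga/gaa1.txt" does not,
# so a few more patterns would indeed be needed:
print(is_spurious_txt("https://www.mingpaocanada.com/tor/htm/News/20220815/HK-GA/ga/gaa1.txt"))  # False
```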
| 05:56:12 | | sec^nd quits [Remote host closed the connection] |
| 05:56:38 | | sec^nd (second) joins |
| 05:59:31 | | evergreen56 joins |
| 06:02:06 | | evergreen5 quits [Ping timeout: 256 seconds] |
| 06:02:06 | | evergreen56 is now known as evergreen5 |
| 06:04:34 | <pokechu22> | The oldest archivebot has found so far is http://www.mingpaocanada.com/TOR/htm/News/20220319/HK-GAindex_r.htm |
| 06:05:09 | <pokechu22> | JAA: can you trace http://www.mingpaocanada.com/TOR/htm/News/20220319/HK-GAindex_r.htm on 6cnonr9bi6enxit9na57wckan please? |
| 06:06:09 | <Webuser025436> | is the link i provided not good enough above? this link contains outlinks to all hk articles for a given day: https://www.mingpaocanada.com/tor/htm/News/20140710/HK-GAindex_r.htm |
| 06:07:00 | <pokechu22> | It is, but now I'm trying to figure out if archivebot will have already found those or if I need to start the job in a way that will discover those |
| 06:11:25 | <pokechu22> | (there isn't any good way to add urls to an existing archivebot job, but I could start a new one with a list of those pages for all days since 2014 or similar, along with https://www.mingpaocanada.com/van/htm/News/20260116/VAindex_r.htm and https://www.mingpaocanada.com/tor/htm/News/20260116/TAindex_r.htm) |
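The seed list described above (one HK index page per day since the earliest known edition) can be generated mechanically; a sketch, assuming the 20140710 start date given earlier and using the final-edition date as the cutoff:

```python
from datetime import date, timedelta

# One index page per day, following the URL scheme shown earlier:
# https://www.mingpaocanada.com/tor/htm/News/YYYYMMDD/HK-GAindex_r.htm
TEMPLATE = "https://www.mingpaocanada.com/tor/htm/News/{:%Y%m%d}/HK-GAindex_r.htm"

def daily_index_urls(start=date(2014, 7, 10), end=date(2026, 1, 16)):
    """Yield the daily HK article-index URL for every date in [start, end]."""
    d = start
    while d <= end:
        yield TEMPLATE.format(d)
        d += timedelta(days=1)

urls = list(daily_index_urls())
print(urls[0])  # https://www.mingpaocanada.com/tor/htm/News/20140710/HK-GAindex_r.htm
```

The same loop would work for the `van`/`tor` edition index pages mentioned above by swapping the template.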
| 06:14:47 | | chunkynutz60 joins |
| 06:17:24 | | nexussfan quits [Quit: Konversation terminated!] |
| 07:32:26 | | chrismeller3 (chrismeller) joins |
| 09:00:02 | | midou quits [Ping timeout: 256 seconds] |
| 09:20:45 | | midou joins |
| 09:27:48 | | midou quits [Ping timeout: 256 seconds] |
| 09:41:56 | | midou joins |
| 09:46:45 | | midou quits [Ping timeout: 272 seconds] |
| 09:52:45 | | midou joins |
| 10:23:29 | | midou quits [Ping timeout: 272 seconds] |
| 10:43:05 | <h2ibot> | KleaBot made 2 bot changes: https://wiki.archiveteam.org/index.php?title=Special:Contributions/KleaBot&offset=20260117104222&limit=2&namespace=2&wpfilters[]=nsInvert&wpfilters[]=associated |
| 10:43:34 | <klea> | mhmm |
| 10:44:14 | <klea> | oh i love that my terminal thinks the url is shorter. |
| 10:44:37 | <klea> | so my browser opened <https://wiki.archiveteam.org/index.php?title=Special:Contributions/KleaBot&offset=20260117104222&limit=2&namespace=2&wpfilters> instead |
| 10:45:30 | | Dada joins |
| 11:14:24 | | midou joins |
| 11:20:00 | <alexlehm> | the same happens in my irc client, it does not consider [] as a valid url character |
| 11:22:15 | <klea> | time to urlencode it, or wrap it in <> |
| 11:22:25 | <klea> | alexlehm: how does <https://wiki.archiveteam.org/index.php?title=Special:Contributions/KleaBot&offset=20260117104222&limit=2&namespace=2&wpfilters[]=nsInvert&wpfilters[]=associated> open? |
| 11:23:28 | <alexlehm> | "https://wiki.archiveteam.org/index.php?title=Special:Contributions/KleaBot&offset=20260117104222&limit=2&namespace=2&wpfilters" |
| 11:24:14 | <klea> | alexlehm: and KleaBot made 2 bot changes: https://wiki.archiveteam.org/index.php?title=Special:Contributions/KleaBot&offset=20260117104222&limit=2&namespace=2&wpfilters%5B%5D=nsInvert&wpfilters%5B%5D=associated |
| 11:24:42 | <alexlehm> | it would probably work with https://wiki.archiveteam.org/index.php?title=Special:Contributions/KleaBot&offset=20260117104222&limit=2&namespace=2&wpfilters%5B%5D=nsInvert&wpfilters%5B%5D=associated |
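The percent-encoded form klea pasted can be produced with the standard library; a sketch using `urllib.parse.quote` with a `safe` set that leaves the rest of the query string untouched:

```python
from urllib.parse import quote

# Square brackets aren't in many IRC clients' URL character sets, so
# percent-encoding them keeps the whole link clickable.
raw = ("https://wiki.archiveteam.org/index.php?title=Special:Contributions/KleaBot"
       "&offset=20260117104222&limit=2&namespace=2"
       "&wpfilters[]=nsInvert&wpfilters[]=associated")

# Keep the URL delimiters as-is; only [ and ] get encoded (as %5B and %5D).
encoded = quote(raw, safe=":/?&=%")
print(encoded)
```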
| 11:30:46 | | ArchivalEfforts quits [Ping timeout: 256 seconds] |
| 11:30:58 | | ArchivalEfforts joins |
| 11:31:51 | | midou quits [Read error: Connection reset by peer] |
| 11:34:08 | | HP_Archivist quits [Quit: Leaving] |
| 11:35:05 | <Juest> | hexchat processes the url fine |
| 11:38:51 | | Doomaholic quits [Ping timeout: 272 seconds] |
| 11:39:07 | <alexlehm> | : could also be a stop character |
| 11:48:57 | | Hackerpcs_1 (Hackerpcs) joins |
| 11:49:28 | | Hackerpcs quits [Ping timeout: 256 seconds] |
| 12:00:02 | | Bleo182600722719623455222 quits [Quit: The Lounge - https://thelounge.chat] |
| 12:02:48 | | Bleo182600722719623455222 joins |
| 12:11:18 | | midou joins |
| 12:13:41 | | szczot3k quits [Ping timeout: 272 seconds] |
| 12:13:43 | | szczot3k_ (szczot3k) joins |
| 12:14:23 | | szczot3k_ is now known as szczot3k |
| 12:18:32 | | szczot3k quits [Remote host closed the connection] |
| 12:18:58 | | szczot3k (szczot3k) joins |
| 12:21:28 | | Doomaholic (Doomaholic) joins |
| 12:30:51 | | HP_Archivist (HP_Archivist) joins |
| 12:31:55 | | midou quits [Read error: Connection reset by peer] |
| 12:38:49 | | szczot3k quits [Remote host closed the connection] |
| 12:41:18 | | szczot3k (szczot3k) joins |
| 12:45:07 | | szczot3k quits [Remote host closed the connection] |
| 12:46:57 | | Webuser989898 joins |
| 12:47:20 | | Webuser989898 quits [Client Quit] |
| 12:47:35 | | szczot3k (szczot3k) joins |
| 12:49:37 | | szczot3k quits [Remote host closed the connection] |
| 12:52:02 | | szczot3k (szczot3k) joins |
| 12:52:51 | | midou joins |
| 13:03:35 | | cyan_box quits [Read error: Connection reset by peer] |
| 13:04:21 | | midou quits [Ping timeout: 272 seconds] |
| 13:11:17 | | szczot3k_ (szczot3k) joins |
| 13:12:35 | | szczot3k quits [Ping timeout: 272 seconds] |
| 13:12:35 | | szczot3k_ is now known as szczot3k |
| 13:14:24 | | midou joins |
| 13:15:00 | | Marie0 joins |
| 13:17:00 | | szczot3k quits [Remote host closed the connection] |
| 13:19:26 | | szczot3k (szczot3k) joins |
| 13:21:38 | <Marie0> | Sorry for getting to this so late, but I think we should archive some Honduran government websites before the inauguration on the 27th. The current president is a leftist and the new one is a Trump ally promising all kinds of austerity measures, so I expect the web presence of the government will completely change fairly quickly |
| 13:21:45 | | midou quits [Read error: Connection reset by peer] |
| 13:21:50 | | nine quits [Quit: See ya!] |
| 13:22:03 | | nine joins |
| 13:22:03 | | nine is now authenticated as nine |
| 13:22:03 | | nine quits [Changing host] |
| 13:22:03 | | nine (nine) joins |
| 13:23:38 | <Marie0> | On the bright side, Honduras is a small country and their internet is actually pretty modern. I think if we start immediately, we could easily do this |
| 13:24:54 | | etnguyen03 (etnguyen03) joins |
| 13:30:18 | | szczot3k quits [Remote host closed the connection] |
| 13:32:44 | | szczot3k (szczot3k) joins |
| 13:34:51 | <Marie0> | I'm new here btw. I have some experience doing small scrapes with cURL but have never collaborated on one. I was initially going to do that for the Honduras thing, but I was in over my head. Huge fan of Archiveteam and your work! |
| 13:35:17 | | midou joins |
| 14:00:04 | <justauser> | Marie0: Do you have a list? |
| 14:13:58 | | midou quits [Ping timeout: 256 seconds] |
| 14:23:13 | | midou joins |
| 14:51:24 | | etnguyen03 quits [Client Quit] |
| 14:51:32 | | the joins |
| 15:05:09 | | nexussfan (nexussfan) joins |
| 15:05:13 | | etnguyen03 (etnguyen03) joins |
| 15:11:54 | | the quits [Client Quit] |
| 15:21:08 | <Marie0> | justauser: Sort of. There's a general directory of all government agencies that I extracted the links from, but it's just the front page of each agency. A lot of them run other websites for specific offices/services. These are usually conspicuously linked somewhere near the front page of the agency, but I haven't actually compiled them into a list |
| 15:21:10 | | nexussfan quits [Client Quit] |
| 15:22:31 | | nexussfan (nexussfan) joins |
| 15:27:17 | <klea> | Marie0: are they subdomains or are they subpages of the domain? |
| 15:27:29 | | IDK quits [Quit: Updating details, brb] |
| 15:28:59 | | IDK (IDK) joins |
| 15:29:49 | <h2ibot> | Klea edited Dealing with Cloudflare (+28, /* Scenario 2 - TLS fingerprinting */…): https://wiki.archiveteam.org/?diff=60214&oldid=58744 |
| 15:31:02 | | midou quits [Ping timeout: 256 seconds] |
| 15:37:39 | | DogsRNice joins |
| 15:38:16 | | szczot3k quits [Remote host closed the connection] |
| 15:38:35 | <Marie0> | klea: In some cases neither. For example, presidencia.gob.hn is obviously right there, but it has a section "periódico impreso" that links to poderpopular.hn, which isn't on the list but is still run by the government, and since it's straight up party propaganda it'll almost certainly be taken down |
| 15:39:30 | <klea> | aaaaa, that'd require getting all sublinks and archiving more content |
| 15:40:25 | <klea> | i guess it's possible by dumping the dbs after having made an AB job with all initial domains, and getting more domains, is it so justauser, or is that a bad idea? |
| 15:40:42 | | szczot3k (szczot3k) joins |
| 15:40:59 | <Juest> | try grabbing the sitemaps? |
| 15:41:06 | | ATinySpaceMarine quits [Quit: https://quassel-irc.org - Chat comfortably. Anywhere.] |
| 15:41:27 | <Marie0> | There's not that many root links to begin with so it's not THAT bad. I think I could do it by hand in an afternoon if needed. |
| 15:42:09 | <Marie0> | Here's the list btw: https://transfer.archivete.am/xbYNo/websites.txt |
| 15:42:10 | <eggdrop> | inline (for browser viewing): https://transfer.archivete.am/inline/xbYNo/websites.txt |
| 15:42:38 | <Marie0> | Ah, thank you |
| 15:43:04 | | ATinySpaceMarine joins |
| 15:44:58 | <justauser> | Started jobs for the two you mentioned first for now. |
| 15:45:40 | <justauser> | Some pages return normal contents but with HTTP 500. |
| 15:45:52 | <klea> | !kfind protocol incompliant |
| 15:45:53 | <eggdrop> | [karma] 'protocol incompliant' not found. |
| 15:45:59 | <klea> | shitty :( |
| 15:46:12 | <justauser> | Example https://www.poderpopular.hn/vernoticias.php?id_noticia=25018 |
| 15:46:49 | <klea> | > Expired website :( < https://www.partidoliberal.hn/ |
| 15:48:03 | <Marie0> | justauser: idk but it's working fine in my browser |
| 15:48:17 | <klea> | > Domain for sale: https://partidonacional.hn/ |
| 15:48:17 | <klea> | Unable to resolve: https://mamsurpazhn.com/ |
| 15:48:18 | <klea> | 500 https://fonac.hn/ |
| 15:48:25 | <justauser> | Exactly. It works fine while claiming an error on lower level. |
| 15:48:39 | <justauser> | It confuses our machinery. |
| 15:48:49 | <klea> | 403 https://portalunico.iaip.gob.hn/ |
| 15:48:59 | <klea> | Loading page even with js on? https://www.dpr.gob.hn/ |
| 15:49:05 | <klea> | maybe we should make a channel for this? |
| 15:49:14 | <justauser> | That's #vooterbooter |
| 15:49:16 | <klea> | oh |
| 15:49:18 | <klea> | sorry |
| 15:49:55 | <justauser> | Tries to load script from some CDN and fails. |
| 15:50:42 | <klea> | i've moved my part of the discussion there. |
| 15:59:38 | | alexlehm quits [Remote host closed the connection] |
| 15:59:41 | | alexlehm (alexlehm) joins |
| 16:01:08 | | midou joins |
| 16:07:08 | <nulldata> | https://learn.microsoft.com/en-us/troubleshoot/mem/configmgr/mdt/mdt-retirement |
| 16:07:32 | <nulldata> | "MDT download packages might be removed or deprecated from official distribution channels." |
| 16:14:03 | <nulldata> | Here's an old version https://www.microsoft.com/en-us/download/details.aspx?id=57917 |
| 16:14:16 | <nulldata> | Newest one seems to be pulled already |
| 16:17:32 | | midou quits [Read error: Connection reset by peer] |
| 16:20:20 | | BluRaf quits [Quit: WeeChat 3.8] |
| 16:20:25 | | BluRaf (BluRaf) joins |
| 16:22:32 | | Dada quits [Remote host closed the connection] |
| 16:22:45 | | Dada joins |
| 16:26:38 | | midou joins |
| 16:39:03 | | midou quits [Ping timeout: 272 seconds] |
| 16:48:02 | | midou joins |
| 16:58:26 | | deafmute joins |
| 17:03:35 | <deafmute> | Hello everyone. Are there any plans to archive cosplay.com galleries? The site has been undead and dysfunctional for a long time |
| 17:03:51 | | Marie0 quits [Quit: Ooops, wrong browser tab.] |
| 17:08:03 | <h2ibot> | Klea created ArchiveBot/2025 Honduran General Election/list (+2609, Created page with "https://congresonacional.hn/…): https://wiki.archiveteam.org/?title=ArchiveBot/2025%20Honduran%20General%20Election/list |
| 17:39:40 | <that_lurker> | https://console.cloud.google.com/storage/browser/net-ntlmv1-tables/tables;tab=objects?pageState=(%22StorageObjectListTable%22:(%22f%22:%22%255B%255D%22))&prefix=&forceOnObjectsSortingFiltering=false |
| 17:40:30 | <that_lurker> | "Google just dropped 1.1 QUADRILLION pre-computed passwords (aka rainbowtable) for NetNTLMv1." |
| 17:40:41 | <that_lurker> | so about 8,6TB |
| 17:40:46 | <that_lurker> | ref. https://www.linkedin.com/posts/benjamin-iheukumere_google-just-dropped-11-quadrillion-pre-computed-activity-7418215510380802048-9NIK/ |
| 17:41:23 | | oxtyped quits [Read error: Connection reset by peer] |
| 17:41:35 | | oxtyped joins |
| 17:42:01 | <Hans5958> | That LinkedIn post is so engagement bait-y |
| 17:42:02 | <Hans5958> | https://cloud.google.com/blog/topics/threat-intelligence/net-ntlmv1-deprecation-rainbow-tables |
| 17:42:32 | <Hans5958> | https://x.com/Mandiant/status/2012268623662874906 |
| 17:42:32 | <eggdrop> | nitter: https://nitter.net/Mandiant/status/2012268623662874906 |
| 17:43:04 | <that_lurker> | ohh nice did not know there was an article on that |
| 17:43:28 | <that_lurker> | I don't have enough space to download all of that and then push it to IA, but someone here might :-P |
| 18:17:35 | <katia> | 👀 |
| 18:18:32 | | Ajay quits [Quit: Bridge terminating on SIGTERM] |
| 18:18:32 | | @Sanqui|m quits [Quit: Bridge terminating on SIGTERM] |
| 18:18:32 | | britmob|m quits [Quit: Bridge terminating on SIGTERM] |
| 18:18:33 | | anon00001|m quits [Quit: Bridge terminating on SIGTERM] |
| 18:18:33 | | xxia|m quits [Quit: Bridge terminating on SIGTERM] |
| 18:18:33 | | mind_combatant quits [Quit: Bridge terminating on SIGTERM] |
| 18:18:33 | | x9fff00 quits [Quit: Bridge terminating on SIGTERM] |
| 18:18:33 | | DigitalDragon quits [Quit: Bridge terminating on SIGTERM] |
| 18:18:33 | | gamer191-1|m quits [Quit: Bridge terminating on SIGTERM] |
| 18:18:33 | | theblazehen|m quits [Quit: Bridge terminating on SIGTERM] |
| 18:18:33 | | igneousx quits [Quit: Bridge terminating on SIGTERM] |
| 18:18:34 | | audrooku|m quits [Quit: Bridge terminating on SIGTERM] |
| 18:18:34 | | alexshpilkin quits [Quit: Bridge terminating on SIGTERM] |
| 18:18:34 | | flashfire42|m quits [Quit: Bridge terminating on SIGTERM] |
| 18:18:34 | | Minkafighter|m quits [Quit: Bridge terminating on SIGTERM] |
| 18:18:34 | | tomodachi94 quits [Quit: Bridge terminating on SIGTERM] |
| 18:18:34 | | Vokun quits [Quit: Bridge terminating on SIGTERM] |
| 18:18:35 | | Tom|m1 quits [Quit: Bridge terminating on SIGTERM] |
| 18:18:35 | | Hans5958 quits [Quit: Bridge terminating on SIGTERM] |
| 18:18:35 | | mpeter|m quits [Quit: Bridge terminating on SIGTERM] |
| 18:18:35 | | rewby|m quits [Quit: Bridge terminating on SIGTERM] |
| 18:18:35 | | MaxG quits [Quit: Bridge terminating on SIGTERM] |
| 18:18:35 | | Exorcism quits [Quit: Bridge terminating on SIGTERM] |
| 18:18:35 | | nyuuzyou quits [Quit: Bridge terminating on SIGTERM] |
| 18:18:35 | | masterx244|m quits [Quit: Bridge terminating on SIGTERM] |
| 18:18:35 | | nstrom|m quits [Quit: Bridge terminating on SIGTERM] |
| 18:18:36 | | ram|m quits [Quit: Bridge terminating on SIGTERM] |
| 18:18:36 | | cruller quits [Quit: Bridge terminating on SIGTERM] |
| 18:18:36 | | aaq|m quits [Quit: Bridge terminating on SIGTERM] |
| 18:18:36 | | MinePlayersPEMyNey|m quits [Quit: Bridge terminating on SIGTERM] |
| 18:18:36 | | Thibaultmol quits [Quit: Bridge terminating on SIGTERM] |
| 18:18:37 | | Fletcher quits [Quit: Bridge terminating on SIGTERM] |
| 18:18:39 | | akaibu|m quits [Quit: Bridge terminating on SIGTERM] |
| 18:18:40 | | tech234a quits [Quit: Bridge terminating on SIGTERM] |
| 18:18:41 | | Ruk8 quits [Quit: Bridge terminating on SIGTERM] |
| 18:18:41 | | hlgs|m quits [Quit: Bridge terminating on SIGTERM] |
| 18:18:41 | | EvanBoehs|m quits [Quit: Bridge terminating on SIGTERM] |
| 18:18:42 | | Alienmaster|m quits [Quit: Bridge terminating on SIGTERM] |
| 18:18:42 | | schwarzkatz|m quits [Quit: Bridge terminating on SIGTERM] |
| 18:18:43 | | moe-a-m|m quits [Quit: Bridge terminating on SIGTERM] |
| 18:18:43 | | vexr quits [Quit: Bridge terminating on SIGTERM] |
| 18:18:43 | | andrewvieyra|m quits [Quit: Bridge terminating on SIGTERM] |
| 18:18:43 | | yzqzss quits [Quit: Bridge terminating on SIGTERM] |
| 18:18:44 | | that_lurker|m quits [Quit: Bridge terminating on SIGTERM] |
| 18:18:44 | | GhostIsBeHere|m quits [Quit: Bridge terminating on SIGTERM] |
| 18:18:44 | | mikolaj|m quits [Quit: Bridge terminating on SIGTERM] |
| 18:18:45 | | upperbody321|m quits [Quit: Bridge terminating on SIGTERM] |
| 18:19:30 | | finalti|m quits [Quit: Bridge terminating on SIGTERM] |
| 18:19:30 | | tech234a|m-backup quits [Quit: Bridge terminating on SIGTERM] |
| 18:19:30 | | Roki_100|m quits [Quit: Bridge terminating on SIGTERM] |
| 18:19:30 | | e2mau|m quits [Quit: Bridge terminating on SIGTERM] |
| 18:19:30 | | qyxojzh|m quits [Quit: Bridge terminating on SIGTERM] |
| 18:19:30 | | victor_vaughn|m quits [Quit: Bridge terminating on SIGTERM] |
| 18:19:30 | | th3z0l4|m quits [Quit: Bridge terminating on SIGTERM] |
| 18:19:31 | | coro quits [Quit: Bridge terminating on SIGTERM] |
| 18:19:31 | | triplecamera|m quits [Quit: Bridge terminating on SIGTERM] |
| 18:19:31 | | thermospheric quits [Quit: Bridge terminating on SIGTERM] |
| 18:19:31 | | phaeton quits [Quit: Bridge terminating on SIGTERM] |
| 18:19:31 | | wrangle|m quits [Quit: Bridge terminating on SIGTERM] |
| 18:19:31 | | s-crypt|m|m quits [Quit: Bridge terminating on SIGTERM] |
| 18:19:31 | | hexagonwin|m quits [Quit: Bridge terminating on SIGTERM] |
| 18:19:31 | | froxcey|m quits [Quit: Bridge terminating on SIGTERM] |
| 18:19:31 | | Video quits [Quit: Bridge terminating on SIGTERM] |
| 18:19:31 | | iCesenberk|m quits [Quit: Bridge terminating on SIGTERM] |
| 18:19:31 | | madpro|m quits [Quit: Bridge terminating on SIGTERM] |
| 18:19:39 | | GRBaset quits [Quit: Bridge terminating on SIGTERM] |
| 18:19:39 | | pannekoek11|m quits [Quit: Bridge terminating on SIGTERM] |
| 18:19:39 | | superusercode quits [Quit: Bridge terminating on SIGTERM] |
| 18:19:39 | | octylFractal|m quits [Quit: Bridge terminating on SIGTERM] |
| 18:19:39 | | lasdkfj|m quits [Quit: Bridge terminating on SIGTERM] |
| 18:19:39 | | CrispyAlice2 quits [Quit: Bridge terminating on SIGTERM] |
| 18:19:39 | | jevinskie quits [Quit: Bridge terminating on SIGTERM] |
| 18:19:39 | | jackt1365|m quits [Quit: Bridge terminating on SIGTERM] |
| 18:19:39 | | jwoglom|m quits [Quit: Bridge terminating on SIGTERM] |
| 18:19:39 | | noobirc|m quits [Quit: Bridge terminating on SIGTERM] |
| 18:19:39 | | cmostracker|m quits [Quit: Bridge terminating on SIGTERM] |
| 18:19:39 | | Cydog|m quits [Quit: Bridge terminating on SIGTERM] |
| 18:19:40 | | ragu|m quits [Quit: Bridge terminating on SIGTERM] |
| 18:19:40 | | nosamu|m quits [Quit: Bridge terminating on SIGTERM] |
| 18:19:40 | | joepie91|m quits [Quit: Bridge terminating on SIGTERM] |
| 18:19:40 | | spearcat|m quits [Quit: Bridge terminating on SIGTERM] |
| 18:19:40 | | ax|m quits [Quit: Bridge terminating on SIGTERM] |
| 18:19:40 | | katia|m quits [Quit: Bridge terminating on SIGTERM] |
| 18:19:40 | | l0rd_enki|m quits [Quit: Bridge terminating on SIGTERM] |
| 18:19:40 | | bogsen quits [Quit: Bridge terminating on SIGTERM] |
| 18:19:40 | | Adamvoltagex|m quits [Quit: Bridge terminating on SIGTERM] |
| 18:19:40 | | osiride|m quits [Quit: Bridge terminating on SIGTERM] |
| 18:19:40 | | trumad|m quits [Quit: Bridge terminating on SIGTERM] |
| 18:19:40 | | vics quits [Quit: Bridge terminating on SIGTERM] |
| 18:19:40 | | NickS|m quits [Quit: Bridge terminating on SIGTERM] |
| 18:19:40 | | v1cs quits [Quit: Bridge terminating on SIGTERM] |
| 18:19:40 | | its_notjack quits [Quit: Bridge terminating on SIGTERM] |
| 18:19:40 | | haha-whered-it-go|m quits [Quit: Bridge terminating on SIGTERM] |
| 18:19:40 | | supermariofan67|m quits [Quit: Bridge terminating on SIGTERM] |
| 18:19:40 | | Fijxu|m quits [Quit: Bridge terminating on SIGTERM] |
| 18:19:40 | | nano412510 quits [Quit: Bridge terminating on SIGTERM] |
| 18:19:40 | | mister_x quits [Quit: Bridge terminating on SIGTERM] |
| 18:19:40 | | Misty|m quits [Quit: Bridge terminating on SIGTERM] |
| 18:19:40 | | Valkum|m quits [Quit: Bridge terminating on SIGTERM] |
| 18:19:40 | | kaz__|m quits [Quit: Bridge terminating on SIGTERM] |
| 18:19:40 | | rain|m quits [Quit: Bridge terminating on SIGTERM] |
| 18:19:40 | | mat|m1 quits [Quit: Bridge terminating on SIGTERM] |
| 18:19:40 | | miksters|m quits [Quit: Bridge terminating on SIGTERM] |
| 18:19:40 | | PhoHale|m quits [Quit: Bridge terminating on SIGTERM] |
| 18:19:40 | | will|m quits [Quit: Bridge terminating on SIGTERM] |
| 18:19:40 | | yetanotherarchiver|m quits [Quit: Bridge terminating on SIGTERM] |
| 18:19:40 | | Passiing|m quits [Quit: Bridge terminating on SIGTERM] |
| 18:19:40 | | username675f|m quits [Quit: Bridge terminating on SIGTERM] |
| 18:19:40 | | Tyrasuki|m quits [Quit: Bridge terminating on SIGTERM] |
| 18:19:40 | | gareth48|m quits [Quit: Bridge terminating on SIGTERM] |
| 18:19:40 | | Nulo|m quits [Quit: Bridge terminating on SIGTERM] |
| 18:19:40 | | nightpool quits [Quit: Bridge terminating on SIGTERM] |
| 18:19:40 | | noxious quits [Quit: Bridge terminating on SIGTERM] |
| 18:19:40 | | ampdot|m quits [Quit: Bridge terminating on SIGTERM] |
| 18:19:40 | | Claire|m quits [Quit: Bridge terminating on SIGTERM] |
| 18:19:41 | | saouroun|m quits [Quit: Bridge terminating on SIGTERM] |
| 18:19:41 | | hillow596|m quits [Quit: Bridge terminating on SIGTERM] |
| 18:19:41 | | yarnover|m quits [Quit: Bridge terminating on SIGTERM] |
| 18:19:41 | | Cronfox|m quits [Quit: Bridge terminating on SIGTERM] |
| 18:19:41 | | Joy|m quits [Quit: Bridge terminating on SIGTERM] |
| 18:19:41 | | IceCodeNew|m quits [Quit: Bridge terminating on SIGTERM] |
| 18:19:42 | | M--mlv|m quits [Quit: Bridge terminating on SIGTERM] |
| 18:19:42 | | justauser|m quits [Quit: Bridge terminating on SIGTERM] |
| 18:20:25 | <klea> | that_lurker: just don't download it :p, slowly download chunks and do chunked uploads to IA :p |
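klea's relay idea (never holding the whole 8.6 TB locally, just streaming chunks through) boils down to pairing HTTP Range reads with chunked uploads; a minimal sketch of the range arithmetic only, with made-up sizes and no actual transfer code:

```python
def byte_ranges(total_size: int, chunk_size: int):
    """Yield (start, end) pairs covering total_size bytes, each usable
    as an HTTP request header like 'Range: bytes=start-end' (inclusive)."""
    for start in range(0, total_size, chunk_size):
        end = min(start + chunk_size, total_size) - 1
        yield start, end

# Each pair would drive one download request, whose body is then
# uploaded before the next chunk is fetched, keeping local storage
# bounded by chunk_size rather than total_size.
print(list(byte_ranges(10, 4)))  # [(0, 3), (4, 7), (8, 9)]
```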
| 18:20:38 | | rewby|m (rewby) joins |
| 18:20:38 | | @ChanServ sets mode: +o rewby|m |
| 18:22:04 | <that_lurker> | :-P |
| 18:22:17 | <that_lurker> | If someone does that please request a collection for it first :-) |
| 18:22:28 | | ram|m joins |
| 18:22:28 | | pannekoek11|m joins |
| 18:22:28 | | MinePlayersPEMyNey|m joins |
| 18:22:28 | | Ruk8 (Ruk8) joins |
| 18:22:41 | <that_lurker> | and maybe get arkiver or IA's ok to dump that much data :-P |
| 18:22:55 | | Cronfox|m joins |
| 18:22:55 | | xxia|m joins |
| 18:22:55 | | andrewvieyra|m joins |
| 18:22:55 | | nstrom|m joins |
| 18:23:37 | | Dada quits [Remote host closed the connection] |
| 18:23:49 | | Dada joins |
| 18:23:51 | | saouroun|m joins |
| 18:23:51 | | haha-whered-it-go|m joins |
| 18:23:52 | | schwarzkatz|m joins |
| 18:23:52 | | jackt1365|m joins |
| 18:23:52 | | x9fff00 (x9fff00) joins |
| 18:23:53 | | madpro|m joins |
| 18:23:53 | | DigitalDragon joins |
| 18:23:53 | | ragu|m joins |
| 18:23:53 | | audrooku|m joins |
| 18:23:53 | | Ajay joins |
| 18:23:53 | | igneousx (igneousx) joins |
| 18:23:53 | | akaibu|m joins |
| 18:23:53 | | masterx244|m (masterx244|m) joins |
| 18:24:01 | | superusercode joins |
| 18:24:01 | | Cydog|m joins |
| 18:24:01 | | trumad|m joins |
| 18:24:01 | | cmostracker|m joins |
| 18:24:01 | | aaq|m joins |
| 18:24:01 | | mikolaj|m joins |
| 18:24:01 | | upperbody321|m joins |
| 18:24:01 | | nyuuzyou joins |
| 18:24:01 | | spearcat|m joins |
| 18:24:01 | | hexagonwin|m joins |
| 18:24:01 | | th3z0l4|m joins |
| 18:24:01 | | victor_vaughn|m joins |
| 18:24:01 | | ax|m joins |
| 18:24:01 | | e2mau|m joins |
| 18:24:01 | | moe-a-m|m joins |
| 18:24:01 | | Sanqui|m (Sanqui) joins |
| 18:24:01 | | joepie91|m joins |
| 18:24:01 | | tech234a|m-backup (tech234a) joins |
| 18:24:01 | | mpeter|m joins |
| 18:24:01 | | @ChanServ sets mode: +o Sanqui|m |
| 18:24:01 | | jwoglom|m joins |
| 18:24:01 | | thermospheric joins |
| 18:24:01 | | theblazehen|m joins |
| 18:24:01 | | britmob|m joins |
| 18:24:01 | | octylFractal|m joins |
| 18:24:01 | | GRBaset (GRBaset) joins |
| 18:24:01 | | Minkafighter|m joins |
| 18:24:01 | | NickS|m joins |
| 18:24:01 | | hlgs|m joins |
| 18:24:01 | | Roki_100|m joins |
| 18:24:01 | | lasdkfj|m joins |
| 18:24:01 | | finalti|m joins |
| 18:24:01 | | Thibaultmol joins |
| 18:24:01 | | Fletcher (Fletcher) joins |
| 18:24:01 | | MaxG joins |
| 18:24:01 | | yzqzss (yzqzss) joins |
| 18:24:01 | | CrispyAlice2 joins |
| 18:24:01 | | jevinskie joins |
| 18:24:01 | | wrangle|m joins |
| 18:24:01 | | vexr joins |
| 18:24:01 | | nosamu|m joins |
| 18:24:01 | | noobirc|m joins |
| 18:24:01 | | phaeton (phaeton) joins |
| 18:24:01 | | coro joins |
| 18:24:01 | | alexshpilkin joins |
| 18:24:01 | | v1cs joins |
| 18:24:01 | | flashfire42|m (flashfire42) joins |
| 18:24:01 | | EvanBoehs|m joins |
| 18:24:01 | | iCesenberk|m joins |
| 18:24:01 | | qyxojzh|m joins |
| 18:24:01 | | s-crypt|m|m joins |
| 18:24:01 | | l0rd_enki|m joins |
| 18:24:01 | | Tom|m1 joins |
| 18:24:01 | | gamer191-1|m joins |
| 18:24:01 | | that_lurker|m joins |
| 18:24:01 | | bogsen (bogsen) joins |
| 18:24:02 | | Vokun (Vokun) joins |
| 18:24:02 | | supermariofan67|m joins |
| 18:24:02 | | vics joins |
| 18:24:02 | | GhostIsBeHere|m joins |
| 18:24:02 | | Adamvoltagex|m joins |
| 18:24:02 | | anon00001|m joins |
| 18:24:02 | | cruller joins |
| 18:24:02 | | tomodachi94 (tomodachi94) joins |
| 18:24:02 | | triplecamera|m joins |
| 18:24:02 | | froxcey|m joins |
| 18:24:02 | | tech234a (tech234a) joins |
| 18:24:02 | | katia|m joins |
| 18:24:02 | | Fijxu|m joins |
| 18:24:02 | | Video joins |
| 18:24:02 | | its_notjack (its_notjack) joins |
| 18:24:02 | | osiride|m joins |
| 18:24:02 | | Hans5958 joins |
| 18:24:02 | | justauser|m joins |
| 18:24:02 | | mind_combatant (mind_combatant) joins |
| 18:24:02 | | Alienmaster|m joins |
| 18:24:02 | | Exorcism (exorcism) joins |
| 18:24:18 | <@arkiver> | !remindme 8h that_lurker password dump google |
| 18:24:18 | <eggdrop> | [remind] ok, i'll remind you at 2026-01-18T02:24:18Z |
| 18:25:27 | | username675f|m joins |
| 18:26:17 | | mister_x joins |
| 18:26:18 | | PhoHale|m joins |
| 18:26:19 | | yetanotherarchiver|m joins |
| 18:26:19 | | hillow596|m joins |
| 18:26:19 | | Joy|m joins |
| 18:26:19 | | nano412510 (nano412510) joins |
| 18:26:19 | | Tyrasuki|m joins |
| 18:26:20 | | yarnover|m joins |
| 18:26:20 | | mat|m1 joins |
| 18:26:20 | | gareth48|m joins |
| 18:26:20 | | kaz__|m joins |
| 18:26:20 | | Passiing|m joins |
| 18:26:20 | | noxious joins |
| 18:26:22 | | will|m joins |
| 18:26:22 | | Misty|m joins |
| 18:26:22 | | Nulo|m joins |
| 18:26:22 | | ampdot|m joins |
| 18:26:22 | | Valkum|m joins |
| 18:26:22 | | Claire|m joins |
| 18:26:22 | | IceCodeNew|m joins |
| 18:26:22 | | rain|m joins |
| 18:26:22 | | nightpool (nightpool) joins |
| 18:26:23 | | miksters|m joins |
| 18:30:11 | <klea> | that_lurker: nah, it seems IA has tooling to move things from search results to collections (or given item ids put them in collections), i suppose that's why archiveteam_inbox exists. |
| 18:30:43 | <klea> | oh wait, will arkiver store the dump on ia themselves? |
| 18:31:31 | <that_lurker> | It's easier to just push to a collection directly than ask to move items later also can let them know how much data there will be :-P |
| 18:31:38 | <klea> | oh |
| 18:31:52 | <that_lurker> | them = a_rkive :-P |
| 18:32:14 | <klea> | smh https://console.cloud.google.com/storage/browser/net-ntlmv1-tables requires login :( |
| 18:34:19 | | Gadelhas562873784438 joins |
| 18:34:45 | <klea> | it seems to be split into 2.1G chunks: https://console.cloud.google.com/storage/browser/net-ntlmv1-tables/tables?pageState=(%22StorageObjectListTable%22:(%22f%22:%22%255B%255D%22)) |
| 18:35:37 | | M--mlv|m joins |
| 18:36:18 | <klea> | that_lurker: do you know the total size? |
| 18:36:39 | <that_lurker> | 8.6TB was mentioned in a few places |
| 18:37:03 | <klea> | ouch |
| 18:39:28 | <klea> | i wonder if IA could lend me access to FOS. :p http://fos.textfiles.com/pipeline.html#:~:text=%2Fdev%2Fmd1%2013T%202%2E2T%2011T%2018%25%20%2F2 |
| 18:39:38 | <Juest> | is there a direct contact with internet archive around here by chance? |
| 18:39:47 | <klea> | Juest: wdym by DC? |
| 18:39:48 | <klea> | oh also |
| 18:40:03 | <justauser> | #internetarchive has an employee. |
| 18:40:19 | <justauser> | But that's only for small bug reports and status updates. |
| 18:40:36 | <klea> | yeah |
| 18:40:52 | <Juest> | archive team is independent and not affiliated with IA, but works very closely with them, despite not seeming like much? |
| 18:41:03 | <Juest> | excuse my weirdness |
| 18:41:41 | <klea> | kind of |
| 18:42:07 | <justauser> | AT founder is an IA employee. |
| 18:42:24 | <klea> | yes |
| 18:42:29 | <justauser> | Current team lead has some extended access. |
| 18:42:31 | <Juest> | also i very much prefer chat, so apologies for not researching much with the wiki |
| 18:42:54 | <klea> | btw, afaik AT is more of an IRC-first, docs-later kind of place, from what i've seen. |
| 18:48:10 | | etnguyen03 quits [Quit: Konversation terminated!] |
| 18:55:19 | <@JAA> | pokechu22: https://transfer.archivete.am/inline/WB59p/6cnonr9bi6enxit9na57wckan-trace.txt |
| 18:56:38 | | colla quits [Quit: StoCa!] |
| 18:58:27 | | colla joins |
| 18:59:35 | <klea> | wait |
| 18:59:38 | <klea> | google's bucket is s3 |
| 18:59:53 | <klea> | compatible. |
| 19:00:09 | <klea> | so, what happens if we ask someone in #archivebot to queue it |
| 19:06:15 | <klea> | fun idea |
| 19:06:25 | <klea> | project specifically for archival of s3-like buckets. |
| 19:08:24 | <nicolas17> | what's the size limit of an IA item? |
| 19:10:03 | <justauser> | 1TB unless they changed it and forgot to tell. |
| 19:10:07 | <klea> | nicolas17: https://irclogs.archivete.am/internetarchive/2026-01-07#l43630241 |
| 19:10:30 | <nicolas17> | klea: google cloud console needs login |
| 19:10:34 | <nicolas17> | but the bucket is open https://storage.googleapis.com/net-ntlmv1-tables |
| 19:10:39 | <klea> | yeah, i noticed later. |
| 19:10:58 | <nicolas17> | having a # in the filename is insane |
| 19:11:42 | <klea> | that's the %23 right? |
| 19:11:59 | <nicolas17> | yes |
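The open listing endpoint mentioned above (`https://storage.googleapis.com/net-ntlmv1-tables`) returns S3-style `ListBucketResult` XML, which is what a bucket lister has to parse. A minimal, namespace-agnostic parser sketch; the endpoint is from the discussion, and the pagination note reflects standard S3/GCS XML-API behaviour (up to 1000 keys per page, continued via `?marker=`):

```python
import xml.etree.ElementTree as ET

def parse_listing(xml_text):
    """Extract (key, size) pairs from an S3/GCS ListBucketResult document.
    Tag names are matched by suffix so it works whether or not the provider
    emits an XML namespace on the root element."""
    entries = []
    for elem in ET.fromstring(xml_text).iter():
        if elem.tag.endswith("Contents"):
            key = size = None
            for field in elem:
                if field.tag.endswith("Key"):
                    key = field.text
                elif field.tag.endswith("Size"):
                    size = int(field.text)
            entries.append((key, size))
    return entries
```

To walk the whole bucket, fetch the listing URL, parse it, and re-request with `?marker=<last key>` until the response's `IsTruncated` element is `false`.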
| 19:12:16 | <klea> | also, im stupid |
| 19:12:23 | <klea> | s3 exposes two paths per item. |
| 19:12:46 | | etnguyen03 (etnguyen03) joins |
| 19:12:52 | <klea> | so how do we get WBM to load the same warc files later for net-ntlmv1-tables.storage.googleapis.com/? |
| 19:13:19 | <nicolas17> | oh ew |
| 19:14:18 | <justauser> | No way? No way! |
| 19:14:25 | <klea> | ? |
| 19:15:09 | <nicolas17> | using other tools (like wget-lua) you can produce deduplicated WARCs, but the whole point of using archivebot would have been to avoid downloading+uploading yourself |
| 19:15:56 | <klea> | nicolas17: i believe we can later churn out some warcs that just contain deduplication entries? |
| 19:17:16 | <justauser> | Producing WARCs based on unproven guesses is discouraged. |
| 19:17:23 | <klea> | yeah :p |
| 19:17:38 | <@JAA> | s/discouraged/illegal/ |
| 19:17:48 | <klea> | justauser: im thinking of a separate idea |
| 19:17:54 | <klea> | someone uploads all those warcs to ia |
| 19:18:06 | <klea> | wait, AB doesn't yet support deduplication between jobs right? |
| 19:18:16 | <@JAA> | AB doesn't dedupe at all. |
| 19:18:20 | <klea> | oh |
| 19:18:21 | <klea> | lovely :( |
| 19:19:11 | <nicolas17> | if AB supported deduplication (even if only within a job), we could just give it https://storage.googleapis.com/net-ntlmv1-tables/whatever and https://net-ntlmv1-tables.storage.googleapis.com/whatever in the same job |
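The two URL forms nicolas17 pairs up above follow a mechanical pattern: S3-compatible stores expose each object both path-style (bucket in the path) and virtual-hosted-style (bucket in the hostname). A small sketch for generating both, with a hypothetical helper name:

```python
def s3_url_pair(host, bucket, key):
    """Return the (path-style, virtual-hosted-style) URL pair for one object.
    Both URLs serve the same bytes on S3-compatible storage."""
    return (
        f"https://{host}/{bucket}/{key}",        # path-style
        f"https://{bucket}.{host}/{key}",        # virtual-hosted-style
    )

pair = s3_url_pair("storage.googleapis.com", "net-ntlmv1-tables", "LICENSE")
```

Feeding both members of each pair into the same job is what would let a digest-based deduplicator collapse the second fetch into a revisit record.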
| 19:19:19 | <klea> | i wonder how much time it'd take to make tooling to archive s3-like buckets, including but not limited to storage.googleapis.com, s3.amazonaws.com, s3.archive.org, etc. |
| 19:19:45 | <justauser> | little-things has a lister, and it's usually sufficient. |
| 19:19:49 | <klea> | oh wait, archiving IA inside of IA is not a good idea, and probably illegal. |
| 19:19:50 | | colla quits [Client Quit] |
| 19:19:53 | <@JAA> | I have tooling for the listing, and the rest is just URLs. |
| 19:20:07 | <klea> | nono, i mean having a DPoS project for those :p |
| 19:20:23 | <@JAA> | Mhm |
| 19:20:24 | <klea> | that would deduplicate given some config that'd make it get two files, and dedupe |
| 19:20:43 | <nicolas17> | I'd rather have a more generic DPoS job for dedup purposes |
| 19:20:52 | | klea nods |
| 19:21:11 | <klea> | i have a silly idea |
| 19:21:18 | <klea> | but it's probably still illegal. |
| 19:21:54 | <klea> | things provide etag headers |
| 19:21:56 | | colla joins |
| 19:22:33 | <klea> | so if you do a request to check the etag, and it tells you it's not been modified since, can you assume it's the same file (supposing it's serviced by the same provider)? |
| 19:23:48 | <katia> | No. |
| 19:24:03 | <klea> | why? |
| 19:24:37 | <nicolas17> | some cloud storage systems provide hashes in the listing, or sometimes we can just "know" (or guess) in some other way that multiple URLs will produce the same file |
| 19:24:41 | <katia> | Assuming makes an ass out of u and me |
| 19:24:49 | <klea> | i guess because I didn't ask the server for the original files, so i can't verify |
| 19:25:10 | <katia> | WARCs are raw data that goes over the wire. Not assumptions based on a header |
| 19:25:47 | <klea> | wait, another silly and stupid question, if i were to actually do real requests for the files, get the other warc, and notice they are duplicates (for whichever files are duplicates, hopefully all), could i then write warc duplication records for that other warc that's in another archive.org item? |
| 19:26:08 | <nicolas17> | what we need is a DPoS project where the worker can be told to download https://net-ntlmv1-tables.storage.googleapis.com/LICENSE and https://storage.googleapis.com/net-ntlmv1-tables/LICENSE as part of the same item, and dedup if the data is the same (it already does that dedup) |
| 19:26:31 | | klea agrees on what nicolas17 just said. |
| 19:26:51 | <nicolas17> | we can't shove a dozen "these are probably the same file" URLs into the item name though, so it could be trickier |
| 19:27:08 | <klea> | silly idea |
| 19:27:22 | <klea> | i'm sure the 3 letter archivist here that has two duplicate letters will love this |
| 19:27:33 | <nicolas17> | he's already raising an eyebrow at you |
| 19:27:34 | <klea> | transfer.archivete.am link with "probably same file" urls |
| 19:27:53 | <klea> | then queue that to the tracker |
| 19:28:20 | <klea> | worker gets that item, downloads it (with the added side effect of archival since we're running wget-at i believe on them), then tries to get every single url in that file, and then deduplicates. |
| 19:29:11 | <klea> | if you're silly and stupid like me you make batches instead of uploading only one, which means the worker will effectively write more than one response since i am silly and give each worker more than one task at a time :p |
| 19:30:27 | <klea> | > we can't shove a dozen "these are probably the same file" URLs into the item name though, so it could be trickier <- wait, what prevents us from shoving those urls into the item name, is there some limit length on the tracker? |
| 19:31:27 | <nicolas17> | uhh what did rclone or azure or whatever do here? |
| 19:31:31 | <nicolas17> | {"Path":"NU5T-14J002-JAD_1667587484000.VBF","Name":"NU5T-14J002-JAD_1667587484000.VBF","Size":270478638,"MimeType":"application/octet-stream","ModTime":"2022-12-08T20:05:42.000000000Z","IsDir":false,"Hashes":{"md5":"0000bd0fadf8df707aeb6d790020fdd79ec5d75e76ec5d7b"}} |
| 19:31:35 | <nicolas17> | that's not an md5 T_T (48 hex chars, not 32) |
| 19:34:04 | <nicolas17> | but yeah |
| 19:34:08 | <nicolas17> | I have a few extreme cases of duplication |
| 19:34:13 | <nicolas17> | like this https://transfer.archivete.am/inline/1t46z/f7cf43ec508103ad7d1350450c1e36004f42085e80177135.txt |
| 19:34:51 | <klea> | so i guess nicolas17 agrees on my idea of shoving urls into very small files on transfer, and then shoving that down tracker, and then shoving that to a custom DPoS |
| 19:35:39 | | deafmute quits [Client Quit] |
| 19:36:50 | <@JAA> | The ETag thing doesn't work because it's specified to be an opaque identifier for the same representation of the same resource only. |
| 19:37:39 | <@JAA> | So you can do rearchivals that way, but you can't do revisit records for a different URL. |
| 19:37:46 | <klea> | oh. |
| 19:37:55 | <nicolas17> | we can't use it for deduplication but we can use it to make the URL lists, if we're wrong and think two files are the same when we're not, then wget-at will simply not dedup them |
| 19:37:59 | <klea> | > So you can do rearchivals that way, but you can't do revisit records for a different URL. <- that also i suppose relies on the remote system not changing ownership. |
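For reference, the conditional-request mechanism under discussion looks like this; a minimal sketch using Python's urllib (the function names are illustrative). As JAA notes, a 304 only asserts that *this URL's* representation is unchanged, since ETags are opaque and scoped to a single resource:

```python
import urllib.error
import urllib.request

def build_conditional_request(url, etag):
    """Build a GET carrying If-None-Match with a previously seen ETag."""
    return urllib.request.Request(url, headers={"If-None-Match": etag})

def conditional_get(url, etag):
    """Returns (status, body). A 304 means the same URL's representation is
    unchanged -- it proves nothing about a different URL serving the same
    bytes, which is why this can back rearchivals but not cross-URL revisits."""
    try:
        with urllib.request.urlopen(build_conditional_request(url, etag)) as resp:
            return resp.status, resp.read()
    except urllib.error.HTTPError as err:
        if err.code == 304:
            return 304, None
        raise
```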
| 19:38:06 | | Dango3607 (Dango360) joins |
| 19:38:15 | <klea> | wait, wget-at does dedupe, or is that in need of implementation? |
| 19:38:48 | <nicolas17> | wget-at does dedupe (if the requests are done as part of the same DPoS item) |
| 19:39:06 | <klea> | and wpull doesn't? |
| 19:39:09 | <@JAA> | There's a way to load CDXs as well, but I'm not sure how well-tested that is. |
| 19:41:27 | | Dango360 quits [Ping timeout: 272 seconds] |
| 19:41:27 | | Dango3607 is now known as Dango360 |
| 19:41:28 | <@JAA> | nicolas17: To be clear, technically, it would work, and it might even be fine in this particular case. But it definitely doesn't generalise. |
| 19:41:43 | <@JAA> | From a pure HTTP client spec perspective, it's still a violation. |
| 19:42:03 | <katia> | This incident will be reported. |
| 19:42:11 | <@JAA> | oh no |
| 19:42:22 | <klea> | oh no |
| 19:42:37 | <nicolas17> | https://xkcd.com/838/ |
| 19:42:49 | <katia> | Hehe |
| 19:42:55 | <@JAA> | wget-at doesn't support that dedupe profile, by the way. |
| 19:43:14 | <klea> | how is data deduped by AT then? |
| 19:43:19 | <@JAA> | It'd be this: https://iipc.github.io/warc-specifications/specifications/warc-format/warc-1.1/#profile-server-not-modified |
| 19:43:23 | <klea> | or does AT dump data without even dedup checking it? |
| 19:43:34 | <@JAA> | By payload digest, usually only within a single process. |
| 19:44:23 | <nicolas17> | wget-at downloads the whole thing and checks if the sha1 of the response matches a previously-downloaded response |
| 19:44:54 | <@JAA> | Yep |
| 19:45:04 | <@JAA> | And that's what I've done with S3 stuff before as well, just with qwarc. |
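The payload-digest dedup described above (wget-at hashes each response and collapses repeats within one process) can be sketched in a few lines. This is a Python model of the mechanism, not wget-at's actual C implementation; the class name is hypothetical:

```python
import hashlib

class DedupDB:
    """In-memory payload-digest dedup, per-process: the first response with a
    given SHA-1 is stored in full, and later identical payloads would be
    written as WARC 'revisit' records pointing back at the original."""

    def __init__(self):
        self.seen = {}  # digest -> (original url, original record id)

    def check(self, url, payload, record_id):
        digest = "sha1:" + hashlib.sha1(payload).hexdigest()
        if digest in self.seen:
            orig_url, orig_id = self.seen[digest]
            return ("revisit", orig_url, orig_id)   # duplicate payload
        self.seen[digest] = (url, record_id)
        return ("response", None, None)             # first sighting: keep full body
```

Because the table only lives for the lifetime of the process, fetching both URL forms of an object in the same job (or item) is what makes the second one dedupe.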
| 19:45:51 | <klea> | oh, so if i want s3 repos archived, i should contact JAA with a big list of urls that has both formats, and then JAA will run qwarc and upload to IA for me? |
| 19:46:16 | <klea> | should i also give the archive-s3-repo-with-dedup-on-both-urls-service-that-JAA-provides my S3 creds too? |
| 19:46:26 | <klea> | i suppose not since then it doesn't get in the web category. |
| 19:58:16 | | asie (asie) joins |
| 20:02:51 | | FiTheArchiver joins |
| 20:03:47 | | PaiMei quits [Quit: Wololo] |
| 20:04:06 | | PaiMei (PaiMei) joins |
| 20:05:31 | | n9nes quits [Ping timeout: 272 seconds] |
| 20:08:05 | | n9nes joins |
| 20:11:33 | | FiTheArchiver quits [Remote host closed the connection] |
| 20:11:54 | | FiTheArchiver joins |
| 20:14:26 | | deafmute joins |
| 20:14:54 | | Dada quits [Remote host closed the connection] |
| 20:15:08 | | Dada joins |
| 20:16:35 | <h2ibot> | Haiseiko edited URLTeam/Warrior (+183, /* Warrior projects */): https://wiki.archiveteam.org/?diff=60216&oldid=53418 |
| 20:16:36 | <h2ibot> | Brad edited Deathwatch (+692, Added Meta Horizon managed services and added…): https://wiki.archiveteam.org/?diff=60217&oldid=60210 |
| 20:16:37 | <h2ibot> | Vi edited Deathwatch (+285, /* 2026-03 */ added The Anime Network): https://wiki.archiveteam.org/?diff=60218&oldid=60217 |
| 20:16:38 | <h2ibot> | John123521 edited Deathwatch (+179, move Tom Lehrer and TUYU to Frozen Solid): https://wiki.archiveteam.org/?diff=60219&oldid=60218 |
| 20:19:35 | <h2ibot> | JustAnotherArchivist edited URLTeam/Warrior (-18, Fix shor.kr addition): https://wiki.archiveteam.org/?diff=60220&oldid=60216 |
| 20:22:36 | <h2ibot> | JustAnotherArchivist edited URLTeam/Warrior (-165, Reverted; there has been and is no shor-kr…): https://wiki.archiveteam.org/?diff=60221&oldid=60220 |
| 20:24:36 | <h2ibot> | JustAnotherArchivist edited Deathwatch (+160, Restore Eshizuoka entry removed by…): https://wiki.archiveteam.org/?diff=60222&oldid=60219 |
| 20:25:34 | | FiTheArchiver quits [Remote host closed the connection] |
| 20:25:36 | <h2ibot> | JustAnotherArchivist edited Deathwatch (-135, Remove empty year sections for 2030s; we can…): https://wiki.archiveteam.org/?diff=60223&oldid=60222 |
| 20:25:55 | | FiTheArchiver joins |
| 20:27:45 | | Aurora quits [Quit: Ooops, wrong browser tab.] |
| 20:30:50 | <klea> | i think i'll make a subforum or subservice for a small thing that will close in 2500, that way the wiki needs more sections :p |
| 20:44:28 | | HP_Archivist quits [Quit: Leaving] |
| 20:44:44 | | HP_Archivist joins |
| 20:56:33 | | FiTheArchiver quits [Remote host closed the connection] |
| 20:57:47 | | FiTheArchiver joins |
| 21:25:02 | | nine quits [Quit: See ya!] |
| 21:25:19 | | nine joins |
| 21:31:33 | | FiTheArchiver quits [Remote host closed the connection] |
| 21:31:56 | | FiTheArchiver joins |
| 21:50:58 | | Webuser025436 quits [Quit: Ooops, wrong browser tab.] |
| 22:02:33 | | FiTheArchiver quits [Remote host closed the connection] |
| 22:03:09 | | FiTheArchiver joins |
| 22:03:10 | | ScarlettStunningSpace quits [Ping timeout: 256 seconds] |
| 22:05:32 | | Karlett joins |
| 22:11:51 | <h2ibot> | Klea edited Phorge/uncategorized (+66, Added git.kolab.org): https://wiki.archiveteam.org/?diff=60224&oldid=60124 |
| 22:12:51 | <h2ibot> | Klea edited Phorge/uncategorized (+135, Add forge.softwareheritage.org): https://wiki.archiveteam.org/?diff=60225&oldid=60224 |
| 22:34:11 | | FiTheArchiver quits [Read error: Connection reset by peer] |
| 22:37:00 | <pokechu22> | deafmute: I'm not aware of an existing project for https://cosplay.com/ - do you have an estimate for how big it is? |
| 22:37:34 | <pokechu22> | It looks like it's not too scripty so archivebot probably could be used for it |
| 22:42:55 | <h2ibot> | Nintendofan885 edited Namuwiki (+40, fix infobox name): https://wiki.archiveteam.org/?diff=60226&oldid=60200 |
| 22:51:09 | <deafmute> | pokechu22: difficult to estimate. judging from official numbers maybe 100-150k photos, which would be 100GB-1TB I guess. But the site is broken and it doesn't show nearly the same amount of images it claims to have, so probably less than that. |
| 22:51:48 | <deafmute> | and I haven't found a way to see the full res photos, just thumbnails |
| 22:53:51 | | deafmute quits [Client Quit] |
| 22:54:07 | | deafmute joins |
| 22:54:17 | <klea> | deafmute: it seems https://s3.amazonaws.com/cosplay-cdn/large/3d8bc518-ade8-4cad-9729-032b37331052.jpg they're on s3 but that bucket isn't open |
| 22:57:51 | | Webuser239629 joins |
| 22:58:00 | | Webuser239629 quits [Client Quit] |
| 23:00:01 | | SootBector quits [Remote host closed the connection] |
| 23:01:09 | | SootBector (SootBector) joins |
| 23:06:18 | <pokechu22> | hmm, yeah. https://cosplay.com/series/reborn says https://cosplay.com/character/gokudera-hayato has 233 costumes 1023 photos, but only lists 12 photos on https://cosplay.com/character/gokudera-hayato |
| 23:06:48 | <pokechu22> | (and guessing https://cosplay.com/character/gokudera-hayato?page=2 doesn't do anything either) |
| 23:08:19 | <pokechu22> | ... and https://cosplay.com/?page=3 and beyond requires logging in, so it can't be discovered that way either (/series doesn't have that limitation though) |
| 23:08:29 | <klea> | pokechu22: they're categories, so they have photos inside? |
| 23:09:14 | <pokechu22> | I misspoke - https://cosplay.com/character/gokudera-hayato only lists 12 costumes and like a hundred photos |
| 23:09:33 | <klea> | oh |
| 23:10:17 | <nicolas17> | https://cosplay.com/s/092dder49 -> https://s3.amazonaws.com/photo.cosplay.com/143783/1800163.jpg |
| 23:10:29 | <nicolas17> | however the .jpg only loads with Referer: https://cosplaxy.com/ |
| 23:10:59 | <deafmute> | I tried logging in with a bugmenot account, it doesn't seem to change anything regarding how many images are shown |
| 23:11:12 | <pokechu22> | Huh, I didn't know s3 could do referer checks |
| 23:11:36 | <klea> | that confused me |
| 23:11:39 | <pokechu22> | but archivebot generates referers properly so that probably won't be an issue |
| 23:11:47 | <nicolas17> | bucket policy can do many things |
| 23:12:01 | <pokechu22> | Does the bugmenot account make e.g. https://cosplay.com/?page=1000 work? |
| 23:12:08 | <klea> | there's also a forum |
| 23:12:13 | | nicolas17 wonders if you can make a bucket policy that only allows access for 5 minutes of every hour |
| 23:12:35 | <klea> | pokechu22: logging in changes the ui |
| 23:12:52 | <pokechu22> | hmm |
| 23:13:24 | <pokechu22> | Oh, https://cosplay.com/member/206560 and https://cosplay.com/member/206559 both exist; we could maybe enumerate those |
| 23:13:26 | <deafmute> | pokechu22: no, still the same page for me |
| 23:13:40 | <klea> | for me it seems to be a doomscrollable interface. |
| 23:14:03 | <deafmute> | nicolas17: how did you find that /s/092dder49 page? |
| 23:14:05 | <klea> | which pokes <https://cosplay.com/livewire/message/status-list> with a servermemo |
| 23:14:26 | <klea> | deafmute: clicking on profile for user |
| 23:14:30 | <pokechu22> | Those /s/ ones are just linked on the home page too |
| 23:14:41 | <klea> | also there seems to be a forum. |
| 23:15:30 | <deafmute> | oh okay. I thought you found a way to get those pages for old posts like that sub-zero in my example |
| 23:15:33 | <pokechu22> | Huh, and https://cosplay.com/member/26559 also works, so they really do have 200k users maybe? |
| 23:15:50 | <pokechu22> | that links old posts like https://cosplay.com/s/ynz11e8m9 |
| 23:16:32 | <pokechu22> | https://cosplay.com/member/1 -> https://cosplay.com/s/b934yzmvn apparently 3 years ago |
| 23:17:05 | | Dada quits [Remote host closed the connection] |
| 23:17:16 | <klea> | https://cosplay.com/livewire/livewire.js |
| 23:17:17 | | Dada joins |
| 23:18:11 | <pokechu22> | I think bruteforcing the user list would be the easiest way to discover all of the images since it paginates all the way to https://cosplay.com/member/1?page=604 (and also *stops* paginating there, unlike some other more annoying sites...) |
| 23:18:44 | <klea> | pokechu22: we'd need to get to crawl depth two which i think AB does right? |
| 23:19:01 | <pokechu22> | What do you mean? |
| 23:19:19 | <klea> | it has to go to https://cosplay.com/s/yxny028e9 to be able to get the full picture too, not just the thumbnail. |
| 23:19:34 | <pokechu22> | I'm specifically thinking of a recursive job as an !a < list job with a custom sitemap (so it behaves like !a https://cosplay.com/ but also gets seeded with all of the member URLs) |
| 23:19:46 | <pokechu22> | that should go to those /s/ URLs and everything else on the site that's linked normally |
| 23:20:14 | <klea> | they seem to have up to user https://cosplay.com/member/402093 |
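Generating the seed list for the brute-force approach above is mechanical: sequential member IDs from 1 up to the highest observed (402093, per klea). A sketch, with a hypothetical function name, producing a list suitable for an `!a < list` job:

```python
def member_seed_urls(max_id, base="https://cosplay.com/member/"):
    """Yield seed URLs for sequential member IDs 1..max_id.
    Nonexistent IDs just 404 and cost one request each."""
    for member_id in range(1, max_id + 1):
        yield f"{base}{member_id}"

def write_seed_file(path, max_id):
    """Write one URL per line, the format archivebot list jobs expect."""
    with open(path, "w") as fh:
        for url in member_seed_urls(max_id):
            fh.write(url + "\n")
```

Recursion from each member page then picks up the paginated photo listings and the `/s/` post URLs linked from them.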
| 23:21:12 | | nine quits [Client Quit] |
| 23:21:25 | | nine joins |
| 23:23:35 | <pokechu22> | huh, looking at https://cosplay.com/series?page=50 the first row of the table claims 327 costumes, 33 photos, but https://cosplay.com/series/animamundi-dark says it's the other way around |
| 23:25:05 | <pokechu22> | it also says there are 6 characters and then lists 13 of them, with a sum of 21 costumes and 327 photos |
| 23:25:30 | <deafmute> | advanced cosplay maths |
| 23:30:12 | <deafmute> | Now I finally know how to navigate on this broken site and actually find what I was looking for, many thanks for that. |
| 23:31:09 | <deafmute> | can't really help with the technical stuff however, I was just curious |
| 23:38:02 | <h2ibot> | Nintendofan885 edited Qwarc (+4, link [[WARC]]): https://wiki.archiveteam.org/?diff=60227&oldid=54296 |
| 23:40:58 | | deafmute quits [Client Quit] |
| 23:42:19 | <klea> | JAA: could you put every item that you made using qwarc have subject qwarc? |
| 23:42:50 | | deafmute joins |
| 23:46:26 | <pokechu22> | I wrote a quick script to sum the data on https://cosplay.com/series: https://transfer.archivete.am/inline/KPGxg/cosplay.com_count_photos.py which says 1608146 photos, 333780 costumes |
| 23:47:04 | <pokechu22> | That's roughly the same size as refsheet.net, which is definitely doable |
| 23:47:20 | <@JAA> | klea: I could, yes. |
| 23:47:38 | <klea> | mostly because then finding qwarc scripts would be significantly easier. |
| 23:48:09 | <pokechu22> | That's roughly the same size as refsheet.net (1445504 images, 408063 characters, 105078 users), which is definitely doable |
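The counting script itself is only available via the transfer link above; the idea can be re-sketched as summing "N costumes, M photos" occurrences across the `/series` listing pages. The regex is an assumption about the page text (the log shows both comma and no-comma forms), not the actual markup:

```python
import re

# Matches e.g. "233 costumes 1023 photos" and "327 costumes, 33 photos".
COUNT_RE = re.compile(r"(\d+)\s+costumes?,?\s+(\d+)\s+photos?", re.I)

def sum_counts(page_texts):
    """Sum costume and photo counts over an iterable of page texts,
    returning (total_costumes, total_photos)."""
    costumes = photos = 0
    for text in page_texts:
        for c, p in COUNT_RE.findall(text):
            costumes += int(c)
            photos += int(p)
    return costumes, photos
```

Fetching all `/series?page=N` pages and feeding them through this is what yields the ~1.6M photo / ~334k costume totals quoted above.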
| 23:49:03 | <@JAA> | Mhm |
| 23:49:15 | <deafmute> | oh that's like 10x my guess lol |
| 23:50:33 | | Dada quits [Remote host closed the connection] |