00:00:44<pokechu22>The meta one is the job log and should be uploaded. The .cdx file is normally derived from the WARC by IA itself, though I don't know if that always happens or only happens for items that get indexed by web.archive.org.
00:00:47tekulvw quits [Ping timeout: 272 seconds]
00:04:14<klea>I think it only happens for items that get indexed by web.archive.org? https://archive.org/download/limewire.com_d_7xNKB_NfXjrIqBWo
00:04:32<klea>Tho, maybe it was me not running the derive thing after every file
00:04:35<klea>lemme make it derive.
00:04:43<klea>(if i remember howto)
00:06:24<klea>It seems if you have IA derive (which is the default I believe on the web uploader?), it will make a cdx. <https://archive.org/log/5191197716> claims it will do a CDXIndex.
00:06:54<klea>Huh
00:06:58<klea>[ PST: 2026-02-16 16:05:08 ] Executing: ulimit -v 1048576 && PYTHONPATH=/petabox/sw/lib/python timeout 600 /petabox/sw/bin/cdx_writer.pex 'WARCPROX-20260216205304743-00000-y1i40ow9.warc.gz' --file-prefix='limewire.com_d_7xNKB_NfXjrIqBWo' --exclude-list='/petabox/sw/wayback/web_excludes.txt' --stats-file='/f/_limewire.com_d_7xNKB_NfXjrIqBWo/cdxstats.json'>
00:06:58<klea>'/t/_limewire.com_d_7xNKB_NfXjrIqBWo/cdx.txt'
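For reference, a derive can also be queued without the web UI via IA's Tasks API. A minimal sketch in Python, assuming the documented tasks.php endpoint and payload shape (the identifier is the one from the log; the S3-style keys are placeholders, and the API docs at archive.org/developers should be checked before relying on this):

    import requests

    # Queue a derive task for an existing item via the IA Tasks API.
    # ACCESS/SECRET are placeholder archive.org S3 keys.
    ACCESS, SECRET = "YOUR_ACCESS_KEY", "YOUR_SECRET_KEY"
    resp = requests.post(
        "https://archive.org/services/tasks.php",
        json={"identifier": "limewire.com_d_7xNKB_NfXjrIqBWo", "cmd": "derive.php"},
        headers={"Authorization": f"LOW {ACCESS}:{SECRET}"},
    )
    resp.raise_for_status()
    print(resp.json())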
00:07:09<klea>Wait a second.
00:07:35<klea>Couldn't that be a way to bulk check lots of urls by making a warc with records of lots of data, and then getting the cdx and seeing what apparently is missing?
00:07:50<klea>Then you'd request deletion of all that crap, because nobody wants it.
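A rough sketch of that bulk-check idea, assuming the derived cdx.txt uses the usual space-separated layout with a leading " CDX ..." header line that names the field order ('a' being the original URL); field order can vary between writers, hence reading it from the header:

    # urls.txt: one URL per line of everything you expect to be present.
    wanted = set(open("urls.txt").read().split())
    captured = set()
    with open("cdx.txt") as f:
        header = f.readline().split()    # e.g. ['CDX', 'N', 'b', 'a', 'm', ...]
        url_idx = header.index("a") - 1  # -1 because 'CDX' itself is not a data field
        for line in f:
            captured.add(line.split(" ")[url_idx])
    print("apparently missing:", wanted - captured)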
00:19:07nine quits [Quit: See ya!]
00:19:20nine joins
00:19:20nine quits [Changing host]
00:19:20nine (nine) joins
00:23:56<cruller>TheoH7: I uploaded the entire output directory. https://archive.org/details/community.jisc.ac.uk-2026-02-16-35e53623-00000
00:32:50etnguyen03 quits [Client Quit]
00:40:14SootBector quits [Remote host closed the connection]
00:41:22SootBector (SootBector) joins
00:42:30<TheoH7>cruller: Thanks, have downloaded it.
00:43:16<TheoH7>It looks like I also managed to do one where hard-coded links to https://community.ja.net (old address from years ago) are clickable in the WARC. I will upload that to IA likely in a few hours.
00:44:16<TheoH7>To upload the whole directory, is the best way to zip and upload, or can you select a whole folder for upload?
00:48:36<pokechu22>You can upload multiple files at once within a directory (uploading directories/subdirectories might also be possible but I think is more complicated?)
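One way to handle a whole directory is the internetarchive Python library; a sketch, with the identifier and directory name as placeholders (check the library docs for how relative paths map to remote filenames):

    from pathlib import Path
    from internetarchive import upload

    # Collect every file under the grab-site output directory and
    # upload them to one item in a single call.
    files = [str(p) for p in Path("grab-site-output").rglob("*") if p.is_file()]
    upload("my-item-identifier", files=files)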
00:50:57<TheoH7>pokechu22: Great, will do that.
00:51:48<TheoH7>Seems one of my crawls has somehow managed to start crawling old versions of this site stored on the Wayback Machine, which is odd. I've added the pattern to ignores but am just curious how grab-site would've found and started crawling such URLs.
00:52:11<TheoH7>I do already have 1 crawl without that done though, and will only upload the 2nd one if it contains materially more content
01:02:20tekulvw (tekulvw) joins
01:03:29ducky quits [Ping timeout: 272 seconds]
01:04:19etnguyen03 (etnguyen03) joins
01:07:21tekulvw quits [Ping timeout: 268 seconds]
01:21:59tekulvw (tekulvw) joins
01:25:36Webuser614729 joins
01:26:29Webuser614729 quits [Client Quit]
01:27:43wotd joins
01:41:28pokechu22 quits [Quit: System maintenance]
02:22:10sec^nd quits [Remote host closed the connection]
02:22:35sec^nd (second) joins
02:36:40<nexussfan>There's a site dedicated to archiving Iranian series and films <https://nostalgik-tv.com/> which says they have 4 terabytes of videos. Would it be a good idea to archive it, or not now?
02:44:47APOLLO03 quits [Ping timeout: 268 seconds]
02:47:44ducky (ducky) joins
02:56:13nine quits [Ping timeout: 272 seconds]
02:58:38nine joins
02:58:40nine quits [Changing host]
02:58:40nine (nine) joins
03:13:09iPwnedYourIOTSmartdog quits [Ping timeout: 268 seconds]
03:13:46iPwnedYourIOTSmartdog joins
04:01:44etnguyen03 quits [Remote host closed the connection]
04:11:07tekulvw quits [Ping timeout: 268 seconds]
04:14:04tekulvw (tekulvw) joins
04:23:37tekulvw quits [Ping timeout: 272 seconds]
04:26:20Island quits [Read error: Connection reset by peer]
04:28:43tekulvw (tekulvw) joins
04:33:45tekulvw quits [Ping timeout: 272 seconds]
05:04:47n9nes quits [Ping timeout: 272 seconds]
05:08:15n9nes joins
05:14:14tekulvw (tekulvw) joins
05:24:25tekulvw quits [Ping timeout: 272 seconds]
05:32:43sec^nd quits [Remote host closed the connection]
05:33:05sec^nd (second) joins
05:42:42<ericgallager>I forget, did this make it here? https://www.theregister.com/2026/02/12/polyglot_notebooks_deprecation/
05:51:53tekulvw (tekulvw) joins
05:56:34tekulvw quits [Ping timeout: 268 seconds]
05:56:53tekulvw (tekulvw) joins
06:01:30tekulvw quits [Ping timeout: 268 seconds]
06:16:19nexussfan quits [Quit: Konversation terminated!]
06:17:13ArchivalEfforts quits [Quit: https://quassel-irc.org - Chat comfortably. Anywhere.]
06:17:22ArchivalEfforts joins
06:22:50tekulvw (tekulvw) joins
06:27:45tekulvw quits [Ping timeout: 272 seconds]
06:57:16tekulvw (tekulvw) joins
07:05:37pokechu22 (pokechu22) joins
08:52:56ducky quits [Ping timeout: 268 seconds]
08:53:09ducky (ducky) joins
08:54:25Dango360 quits [Quit: The Lounge - https://thelounge.chat]
09:29:34TheEnbyperor_ quits [Read error: Connection reset by peer]
09:30:09cipherrot quits [Ping timeout: 272 seconds]
09:30:09TheEnbyperor quits [Ping timeout: 272 seconds]
09:37:48Snivy quits [Quit: The Lounge - https://thelounge.chat]
09:38:00TheEnbyperor joins
09:38:11petrichor (petrichor) joins
09:38:17Snivy (Snivy) joins
09:38:23Snivy quits [Remote host closed the connection]
09:38:36TheEnbyperor_ (TheEnbyperor) joins
09:39:42Snivy (Snivy) joins
09:42:53tekulvw quits [Ping timeout: 268 seconds]
10:03:54rohvani quits [Quit: The Lounge - https://thelounge.chat]
10:09:54@arkiver quits [Remote host closed the connection]
10:10:21arkiver (arkiver) joins
10:10:21@ChanServ sets mode: +o arkiver
10:14:12fireatseaparks quits [Remote host closed the connection]
10:14:48fireatseaparks (fireatseaparks) joins
10:26:37APOLLO03 joins
10:47:11Webuser505408 joins
10:47:32Webuser505408 quits [Client Quit]
11:37:03tekulvw (tekulvw) joins
11:39:37Cornelius7 (Cornelius) joins
11:41:15Cornelius quits [Ping timeout: 272 seconds]
11:41:15Cornelius7 is now known as Cornelius
11:41:53tekulvw quits [Ping timeout: 272 seconds]
11:47:35irisfreckles13 joins
11:58:52APOLLO03 quits [Read error: Connection reset by peer]
11:59:51APOLLO03 joins
12:00:03Bleo1826007227196234552220 quits [Quit: The Lounge - https://thelounge.chat]
12:02:44Bleo1826007227196234552220 joins
12:37:37petrichor quits [Client Quit]
12:51:47<irisfreckles13>how do i request yt video to be archived?
12:51:47<irisfreckles13>possible?
12:53:37<klea>irisfreckles13: Depends if it's in scope, see https://wiki.archiveteam.org/index.php/YouTube#Scope and if it's in scope you can queue it in #down-the-tube
13:02:28<h2ibot>Bear created Philips (+1114, Philips - more like Sorryps): https://wiki.archiveteam.org/?oldid=60489
13:17:48Shard111 quits [Quit: Im doing something rq. Il brb]
13:19:14Shard1115 (Shard) joins
13:23:15petrichor (petrichor) joins
13:37:50Arcorann quits [Ping timeout: 268 seconds]
13:38:29<justauser>ericgallager: Doesn't look too actionable...
13:41:34Webuser660697 joins
14:03:07irisfreckles13 quits [Ping timeout: 272 seconds]
14:12:44Dada joins
14:18:20Dango360 (Dango360) joins
14:25:45<h2ibot>Justauser edited Discourse/active (+148, Added https://forums.kicksecure.com/…): https://wiki.archiveteam.org/?diff=60490&oldid=60465
14:31:32tekulvw (tekulvw) joins
14:36:25tekulvw quits [Ping timeout: 268 seconds]
14:52:01<@arkiver>imer: are you able to see something in your logs that is queuing the googleapis.com URLs?
14:54:54<@imer>arkiver: (assuming #//) no, don't think its related to the other spam though
14:56:09<@arkiver>right, sorry, this was for #//
15:09:01irisfreckles13 joins
15:11:35Webuser116786 joins
15:11:58Webuser116786 quits [Client Quit]
16:02:33tekulvw (tekulvw) joins
16:07:15tekulvw quits [Ping timeout: 272 seconds]
16:13:57Island joins
16:28:01<h2ibot>Bear edited Mortis (+17, Provided by [[User:BouleBoule]] but not…): https://wiki.archiveteam.org/?diff=60491&oldid=58254
16:30:01<h2ibot>Bear edited Mortis (-3, misplaced pipes): https://wiki.archiveteam.org/?diff=60492&oldid=60491
16:37:02<h2ibot>Bear edited List of websites excluded from the Wayback Machine (+356, More details on Philips.com ([[Philips]])): https://wiki.archiveteam.org/?diff=60493&oldid=60371
16:40:37Goofybally9 quits [Quit: The Lounge - https://thelounge.chat]
16:41:23Goofybally joins
16:42:12Goofybally quits [Client Quit]
16:42:43Goofybally (Goofybally) joins
16:46:03<h2ibot>Bear edited List of websites excluded from the Wayback Machine (+181, steampunkal.com excluded between 2013-11-12 and…): https://wiki.archiveteam.org/?diff=60494&oldid=60493
16:52:03DogsRNice joins
17:03:31tekulvw (tekulvw) joins
17:08:07tekulvw quits [Ping timeout: 268 seconds]
17:26:02<HP_Archivist>RE: WARC captures. JAA, apologies, I'm just now responding to this. But thank you. I have used Webrecorder for captures before, a few years ago. SingleFileZ was merged into just SingleFile now, I think.
17:26:39<HP_Archivist>I used Webrecorder for these individual captures https://archive.org/details/@archivist_goals?query=warc
17:27:04<HP_Archivist>But going forward, I am leaning on trying browsertrix, to do things right.
17:27:56<HP_Archivist>Oh, but you said warcprox, too. Hm.
17:30:11<HP_Archivist>SingleFile's options are a little obtuse, though I remember using that too a few years back.
17:33:50tekulvw (tekulvw) joins
17:42:02tekulvw quits [Ping timeout: 268 seconds]
17:44:31<justauser>Fun https://pomf.lain.la/robots.txt
17:45:04<justauser>Apparently Google interpreted it as Disallow: /, but whatever is behind DDG didn't.
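One way to see how a given parser reads that file is Python's stdlib robotparser; different engines resolve malformed robots.txt files differently, which would explain the Google/DDG split. A minimal sketch (the sample file URL is hypothetical):

    from urllib import robotparser

    rp = robotparser.RobotFileParser()
    rp.set_url("https://pomf.lain.la/robots.txt")
    rp.read()
    # True/False depending on how this particular parser resolves the rules.
    print(rp.can_fetch("*", "https://pomf.lain.la/f/example.txt"))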
17:54:22corentin quits [Ping timeout: 268 seconds]
17:54:27tekulvw (tekulvw) joins
17:59:01<justauser>https://transfer.archivete.am/IjrDe/pomf.lain.la_ddg_nitter.txt
17:59:02<eggdrop>inline (for browser viewing): https://transfer.archivete.am/inline/IjrDe/pomf.lain.la_ddg_nitter.txt
17:59:19<justauser>Google, Yandex: nothing; Bing: unrelated websites.
17:59:21tekulvw quits [Ping timeout: 272 seconds]
18:07:57<klea>justauser: did you get pomf*.lain.la too? iirc I've seen some pomf URLs with pomf2 instead.
18:08:18<justauser>GitHub code, CDX: nothing
18:08:27<justauser>No, will check.
18:08:47<klea>ok, I think pomf3 and maybe check pomf[0-9] too I guess.
18:09:32<klea>apparently only pomf2 has valid tls.
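A quick sketch to verify that observation across the pomfN.lain.la hosts (plain stdlib; the hostnames beyond pomf2 are guesses per the discussion above):

    import socket, ssl

    ctx = ssl.create_default_context()
    for n in ["", "2", "3"]:
        host = f"pomf{n}.lain.la"
        try:
            with ctx.wrap_socket(socket.create_connection((host, 443), timeout=10),
                                 server_hostname=host) as s:
                print(host, "valid cert:", s.getpeercert()["subject"])
        except Exception as e:
            print(host, "failed:", e)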
18:09:56Webuser660697 quits [Quit: Ooops, wrong browser tab.]
18:10:35Webuser810542 joins
18:13:36<justauser>pomf2 is so much more abundant on Nitter, but only 1 link in DDG.
18:14:38<klea>huh
18:27:13@rewby quits [Ping timeout: 272 seconds]
18:29:22rewby (rewby) joins
18:29:22@ChanServ sets mode: +o rewby
19:04:40APOLLO03 quits [Ping timeout: 268 seconds]
19:04:43APOLLO03 joins
19:05:32tekulvw (tekulvw) joins
19:17:37Wohlstand1 (Wohlstand) joins
19:18:14tekulvw quits [Ping timeout: 268 seconds]
19:19:59Wohlstand1 is now known as Wohlstand
19:27:56<justauser>https://transfer.archivete.am/51vfP/pomf.lain.la_ddg_nitter_gharchive_2.txt
19:27:56<eggdrop>inline (for browser viewing): https://transfer.archivete.am/inline/51vfP/pomf.lain.la_ddg_nitter_gharchive_2.txt
19:29:49<justauser>pomf.lain.la is Wayback-excluded, but pomf2 is not.
19:30:20<justauser>So I suggest saving everything as pomf2 no matter which URL it has originally?
19:31:31<justauser>pomf2.lain.la has huge CDX records, in fact. Does it make sense to scrape them?
19:31:50<justauser>One URL = one file, so everything available in CDX is already saved.
19:32:17<justauser>But making a copy as AB WARC could help if pomf2 gets excluded too.
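Scraping those records is straightforward with the Wayback CDX server API; a sketch (parameter names per the CDX server docs, and a host this large will need paging beyond the first batch):

    import requests

    r = requests.get(
        "https://web.archive.org/cdx/search/cdx",
        params={
            "url": "pomf2.lain.la/*",
            "output": "json",
            "fl": "original,timestamp,statuscode",
            "collapse": "urlkey",
            "limit": "1000",
        },
        timeout=60,
    )
    rows = r.json()
    print(len(rows) - 1, "unique URLs in the first batch")  # row 0 is the header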
19:32:28<klea>yeah.
19:32:59<klea>Also, whilst at it, it might be neat to archive philips stuff (re Bear's wiki page)
19:33:02Wohlstand quits [Ping timeout: 268 seconds]
19:33:43twiswist quits [Ping timeout: 272 seconds]
19:54:53APOLLO03a joins
19:57:42APOLLO03 quits [Ping timeout: 268 seconds]
20:07:02tekulvw (tekulvw) joins
20:10:38Doranwen quits [Read error: Connection reset by peer]
20:11:01Doranwen (Doranwen) joins
20:11:05rohvani joins
20:11:43tekulvw quits [Ping timeout: 272 seconds]
20:14:01<TheoH7>I'm in the process of uploading one of the crawls I did of https://community.jisc.ac.uk to IA. I can't see the websites collection. Is the best match "Community data" or do I need to request permissions?
20:15:44<@JAA>HP_Archivist: Browsertrix has the same issues of writing incorrect WARCs as far as I know. Haven't actually tested it though.
20:17:10<@JAA>(The readme of Browsertrix Crawler explicitly mentions capturing data through CDP, which can't possibly be correct because CDP doesn't expose the necessary data to write valid WARCs.)
20:21:51iPwnedYourIOTSmartdog quits [Ping timeout: 272 seconds]
20:32:34<h2ibot>Cooljeanius edited ArchiveBot/Ignore (+4, /* Substack */ red link to create article from): https://wiki.archiveteam.org/?diff=60495&oldid=59094
20:34:34<h2ibot>Cooljeanius created Substack (+154, Created page with "{{stub}} '''Substack''' is…): https://wiki.archiveteam.org/?oldid=60496
20:41:53<nicolas17>TheoH7: make sure you set mediatype to "web"
20:42:10<nicolas17>someone with more permissions can change the collection later
20:42:19<nicolas17>it still won't appear in wayback machine
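Setting that at upload time, as a sketch with the internetarchive Python library (identifier, filename, and title are placeholders):

    from internetarchive import upload

    upload(
        "community.jisc.ac.uk-crawl-example",
        files=["crawl.warc.gz"],
        metadata={"mediatype": "web", "title": "community.jisc.ac.uk crawl"},
    )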
20:49:10DlugasnyPL joins
20:50:28<DlugasnyPL>Hi, does Archive Team have any team which downloads rendered pages?
20:51:15Wohlstand1 (Wohlstand) joins
20:53:37Wohlstand1 is now known as Wohlstand
20:58:09Webuser081957 joins
20:58:26Webuser081957 quits [Client Quit]
20:58:49twiswist (twiswist) joins
21:02:27<pokechu22>What do you mean by "rendered pages"?
21:04:06tekulvw (tekulvw) joins
21:11:15tekulvw quits [Ping timeout: 272 seconds]
21:11:37<DlugasnyPL>As I see in the documentation, the warrior uses wget to download pages. wget is OK, but cannot execute JavaScript, for example. Rendered pages = Browsertrix?
21:13:08<DlugasnyPL>For a few months I've been creating archives using Browsertrix. That's why I'm asking if you have any group here which is working with this kind of archiving
21:18:08<pokechu22>Yes, https://wiki.archiveteam.org/index.php/User:TheTechRobo/Mnbot (though this is not suitable for super large sites). Warrior projects also use lua so they can do things slightly smarter (e.g. look for specific strings in the page source and generate new requests off of those), though for particularly complicated sites that's insufficient.
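To illustrate that idea (in Python rather than the actual wget-at Lua hooks, and with a made-up JSON key and API path): project scripts scan fetched bodies for site-specific strings and emit requests that plain link-following would never find:

    import re

    def extra_urls(body: str) -> list[str]:
        # IDs embedded in inline JSON never show up as <a href> links,
        # so a dumb recursive crawler would miss the corresponding pages.
        return [
            f"https://example.com/api/video/{vid}"
            for vid in re.findall(r'"video_id"\s*:\s*"(\w+)"', body)
        ]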
21:25:01<DlugasnyPL>Interesting. But as I see, it's still in the development phase. I will keep an eye on it.
21:25:21Island_ joins
21:25:49Ryz quits [Ping timeout: 272 seconds]
21:25:53Ryz (Ryz) joins
21:27:14<DlugasnyPL>I have started my first Docker container with the warrior on one of my servers. Is there any parameter to increase parallel downloads/uploads? I have a lot of resources, but I do not know how to set it up smartly.
21:28:58Island quits [Ping timeout: 268 seconds]
21:29:51<klea>--concurrent iirc, but also keep in mind if the limit you set is too high the site may ban you.
21:41:00tekulvw (tekulvw) joins
21:41:13<DlugasnyPL>I do not want to open parallel connections to one site. I would like to increase the number of pages which my warrior will crawl - 1-2 connections per page, multiple different domains - how do I set that up?
21:42:16<DlugasnyPL>1-2 connections per domain, just to avoid a ban, multiple domains in parallel
21:45:37tekulvw quits [Ping timeout: 268 seconds]
21:45:43<klea>You probably want to run project workers directly instead of the docker warrior then.
21:45:47<klea>I believe?
21:46:14<klea>If you haven't set a choice, IIRC AT's default choice is normally the Telegram project.
21:48:48<DlugasnyPL>project workers - yes, that sounds better ;)
21:50:34<klea>I don't know how people typically automate that process tho.
21:54:28<DlugasnyPL>I have asked ChatGPT but it also does not know
21:56:56Dada quits [Remote host closed the connection]
22:03:30DlugasnyPL quits [Remote host closed the connection]
22:11:52Dada joins
22:44:28Dada quits [Remote host closed the connection]
22:47:19DlugasnyPL joins
22:50:30<DlugasnyPL>If I create multiple instances of the warrior on my server, you said that I will end up with multiple Telegram crawlers... is there any chance to download a list of my own pages using this tool?
22:53:55<DlugasnyPL>In my opinion the end user should have a choice of what his instance will archive. I know that Archive Team is an organised group with a process, but is there any chance to do archiving (properly) as a "lone wolf", using a standalone version of the warrior with the user's own list of domains?
22:56:52<klea>You can choose the project, you can't choose what specific items you'll get for a project.
22:59:33APOLLO03a quits [Ping timeout: 272 seconds]
22:59:33^ quits [Ping timeout: 272 seconds]
23:00:39<DlugasnyPL>I do not understand the context of this word "project". What does it mean? Does it mean that somebody must set up a project for a specified domain? For example for telegram.org? This is a huge site, so I believe the task must be divided into small pieces and delegated to the end users. A project is something like the configuration required to download a specified page, correct?
23:01:37tekulvw (tekulvw) joins
23:02:06<klea>A project is normally, but not always, a specific website, or a combination of websites that work in the same manner.
23:02:49<klea>The website https://tracker.archiveteam.org/ shows some projects which are on the list of things people running Warriors can choose to do.
23:03:26<klea>Things like URLTeam2 or YouTube have some counter-recommendations against running them at home due to the fact they leak IP addresses.
23:03:56^ (^) joins
23:04:46Wohlstand quits [Client Quit]
23:04:53nexussfan (nexussfan) joins
23:06:23<DlugasnyPL>Leak IP? What do you mean?
23:06:24tekulvw quits [Ping timeout: 268 seconds]
23:06:56<DlugasnyPL>Is the Archive Team system trying to hide end users' IPs?
23:07:01<klea>URLTeam2 also does lots of requests per second to urls that may not be fully vetted against.
23:07:56<DlugasnyPL>That's why you are spreading the job across multiple end users
23:07:58<DlugasnyPL>ok
23:08:48<klea>no, also I'm not doing things right now.
23:09:21<@JAA>klea is confusing URLTeam with URLs again.
23:09:34<klea>Sorry.
23:09:40<klea>The webui is confusing.
23:09:48<klea>it'd be neat to show the warrior readmes on the webpage too.
23:10:23<klea>s/URLTeam2 or/URLs or/
23:11:07<DlugasnyPL>that would be perfect
23:11:40<DlugasnyPL>but discussion here is also nice :)
23:13:51APOLLO03 joins
23:14:26Dada joins
23:15:52<DlugasnyPL>I have a list of 36,000 domains not archived yet, Polish sites. I come from Poland. Is there any chance to create a project for them?
23:17:11<klea>depends on what kind of sites, and how they work.
23:17:34<klea>it could be done slowly on #archivebot most likely if they're not using javascript too much.
23:17:49<DlugasnyPL>they are using a lot of JS
23:18:02<DlugasnyPL>forms, search pages
23:18:04<DlugasnyPL>etc. etc.
23:18:51Arcorann (Arcorann) joins
23:19:11<DlugasnyPL>we have started archiving most of those sites once per month using Browsertrix with crawl depth control, but even with this it is a very time-consuming process
23:19:57<@JAA>Maybe it could slowly be run through #jseater but we don't currently have a distributed setup for JS-heavy things.
23:20:13<@JAA>No recursion there though.
23:22:10<klea>Should I add my jseater url list thingy to the wiki page?
23:22:20etnguyen03 (etnguyen03) joins
23:22:23<klea>maybe I should make it not give who queued stuff?
23:22:59<DlugasnyPL>do you know what exactly the problem is that discredits Browsertrix?
23:24:02<@JAA>Same problem as ArchiveWeb.Page and anything else that uses Chrome Debugging Protocol to produce WARCs: it can't accurately capture the actual HTTP data received from the server, only a parsed version of it.
23:24:11<@JAA>See https://wiki.archiveteam.org/index.php/The_WARC_Ecosystem
23:28:04<DlugasnyPL>Parsed version - you mean the output from the browser?
23:28:50<@JAA>Parsed representation of the HTTP responses, e.g. key-value pair of headers instead of the raw bytes.
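The distinction in concrete terms: at the socket level the exact wire bytes are still available, which is what a WARC response record is supposed to contain; after a browser (or CDP) has parsed the response into header key/value pairs, they are not. A minimal sketch:

    import socket

    req = b"GET / HTTP/1.1\r\nHost: example.com\r\nConnection: close\r\n\r\n"
    with socket.create_connection(("example.com", 80)) as s:
        s.sendall(req)
        raw = b""
        while chunk := s.recv(4096):
            raw += chunk
    # 'raw' holds the status line, headers, and body exactly as the server
    # sent them - byte for byte what belongs in a WARC response record.
    print(raw[:80])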
23:33:57<DlugasnyPL>What kind of impact does it have on the final result? When a user opens an archive from Browsertrix versus wget-at, what will be the difference? For example, in a WARC from the warrior, a search field with JS will not work and will not display any items, correct? Browsertrix will show the "rendered" JS page and wget-at will not, but wget-at will have something which is not visible to an end user but important for the WARC standard, correct?
23:35:12Dada quits [Remote host closed the connection]
23:35:16<klea>WARC is a raw capture of the raw bytes of the HTTP requests and responses, not of rendered webpages.
23:35:57<DlugasnyPL>just trying to understand the idea of WARC - is it a visual artifact, an archive for people to browse, or a dry standard for collecting raw data from HTTP servers?
23:36:34<DlugasnyPL>ok
23:37:02<@JAA>Capturing the data and playing back pages from it are two entirely separate issues.
23:38:22<DlugasnyPL>So if data saved in a WARC is friendly to the human eye but not friendly to machines, then it is not the right standard. So you are trying to record the raw output from the servers and write it in a form which can be used afterwards to "render" the page?
23:47:07<DlugasnyPL>I think that's all for today. I will keep the warrior running and help you crawl US pages, but anyway I would like to start an archiving process for plenty of Polish pages. Thank you for this nice introduction.
23:48:31<@JAA>WARC is both very human- and machine-readable, in the same way as HTTP.
23:50:07<@JAA>If you capture all HTTP requests/responses that occur during a page load, it should be possible to play that back again later as well. There are a lot of edge cases though.
23:50:22<@JAA>You can still store a screenshot or a DOM dump in the WARC as well if you want to do that.
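Storing such extras is a normal WARC operation; a sketch with the warcio library (the filenames and the urn scheme are illustrative conventions, not requirements of the WARC spec):

    import io
    from warcio.warcwriter import WARCWriter

    # Append a screenshot as a 'resource' record alongside the captured
    # HTTP responses already in the WARC.
    png_bytes = open("screenshot.png", "rb").read()
    with open("capture.warc.gz", "ab") as out:
        writer = WARCWriter(out, gzip=True)
        record = writer.create_warc_record(
            "urn:screenshot:https://example.com/",
            "resource",
            payload=io.BytesIO(png_bytes),
            warc_content_type="image/png",
        )
        writer.write_record(record)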
23:50:55<klea>It'd be fun to make something to take tcpdump/wireshark output and make warcs out of it, but I don't have time.
23:51:51<DlugasnyPL>one more question before I go to sleep - where does the warrior upload all the files? Directly to archive.org?