| 00:00:44 | <pokechu22> | The meta one is the job log and should be uploaded. The .cdx file is normally derived from the WARC by IA itself, though I don't know if that always happens or only happens for items that get indexed by web.archive.org. |
| 00:00:47 | | tekulvw quits [Ping timeout: 272 seconds] |
| 00:04:14 | <klea> | I think it only happens for items that get indexed by web.archive.org? https://archive.org/download/limewire.com_d_7xNKB_NfXjrIqBWo |
| 00:04:32 | <klea> | Tho, maybe it was me not running the derive thing after every file |
| 00:04:35 | <klea> | lemme make it derive. |
| 00:04:43 | <klea> | (if i remember howto) |
| 00:06:24 | <klea> | It seems if you have IA derive (which is the default I believe on the web uploader?), it will make a cdx. <https://archive.org/log/5191197716> claims it will do a CDXIndex. |
| 00:06:54 | <klea> | Huh |
| 00:06:58 | <klea> | [ PST: 2026-02-16 16:05:08 ] Executing: ulimit -v 1048576 && PYTHONPATH=/petabox/sw/lib/python timeout 600 /petabox/sw/bin/cdx_writer.pex 'WARCPROX-20260216205304743-00000-y1i40ow9.warc.gz' --file-prefix='limewire.com_d_7xNKB_NfXjrIqBWo' --exclude-list='/petabox/sw/wayback/web_excludes.txt' --stats-file='/f/_limewire.com_d_7xNKB_NfXjrIqBWo/cdxstats.json' > '/t/_limewire.com_d_7xNKB_NfXjrIqBWo/cdx.txt' |
| 00:07:09 | <klea> | Wait a second. |
| 00:07:35 | <klea> | Couldn't that be a way to bulk check lots of urls by making a warc with records of lots of data, and then getting the cdx and seeing what apparently is missing? |
| 00:07:50 | <klea> | Then you'd request deletion of all that crap, because nobody wants it. |
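The bulk-check idea above (derive a CDX from a WARC, then read off which URLs were captured) can be sketched with a small CDX parser. This is a hedged sketch assuming the common space-separated CDX layout in which the third field is the original URL; real cdx_writer output may use a different field order, so check the format header line first.

```python
# Sketch: list the original URLs recorded in a derived CDX file,
# assuming the common layout where the third field is the URL.
def cdx_urls(lines):
    urls = []
    for line in lines:
        if line.startswith(" CDX") or not line.strip():
            continue  # skip the format header and blank lines
        fields = line.split()
        if len(fields) >= 3:
            urls.append(fields[2])  # third field: original URL
    return urls

# Illustrative input only; field values are made up.
sample = [
    " CDX N b a m s k r M S V g",
    "com,example)/ 20260216205304 http://example.com/ text/html 200 AAAA - - 1234 0 test.warc.gz",
]
print(cdx_urls(sample))  # -> ['http://example.com/']
```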
| 00:19:07 | | nine quits [Quit: See ya!] |
| 00:19:20 | | nine joins |
| 00:19:20 | | nine is now authenticated as nine |
| 00:19:20 | | nine quits [Changing host] |
| 00:19:20 | | nine (nine) joins |
| 00:23:56 | <cruller> | TheoH7: I uploaded the entire output directory. https://archive.org/details/community.jisc.ac.uk-2026-02-16-35e53623-00000 |
| 00:32:50 | | etnguyen03 quits [Client Quit] |
| 00:40:14 | | SootBector quits [Remote host closed the connection] |
| 00:41:22 | | SootBector (SootBector) joins |
| 00:42:30 | <TheoH7> | cruller: Thanks, have downloaded it. |
| 00:43:16 | <TheoH7> | It looks like I also managed to do one where hard-coded links to https://community.ja.net (old address from years ago) are clickable in the WARC. I will upload that to IA likely in a few hours. |
| 00:44:16 | <TheoH7> | To upload the whole directory, is the best way to zip and upload, or can you select a whole folder for upload? |
| 00:48:36 | <pokechu22> | You can upload multiple files at once within a directory (uploading directories/subdirectories might also be possible but I think is more complicated?) |
| 00:50:57 | <TheoH7> | pokechu22: Great, will do that. |
| 00:51:48 | <TheoH7> | Seems one of my crawls has somehow managed to start crawling old versions of this site stored on the Wayback Machine, which is odd. I've added the pattern to ignores, but I'm just curious how grab-site would've found and started crawling such URLs. |
| 00:52:11 | <TheoH7> | I do already have 1 crawl without that done though, and will only upload the 2nd one if it contains materially more content |
| 01:02:20 | | tekulvw (tekulvw) joins |
| 01:03:29 | | ducky quits [Ping timeout: 272 seconds] |
| 01:04:19 | | etnguyen03 (etnguyen03) joins |
| 01:07:21 | | tekulvw quits [Ping timeout: 268 seconds] |
| 01:21:59 | | tekulvw (tekulvw) joins |
| 01:25:36 | | Webuser614729 joins |
| 01:26:29 | | Webuser614729 quits [Client Quit] |
| 01:27:43 | | wotd joins |
| 01:41:28 | | pokechu22 quits [Quit: System maintenance] |
| 02:22:10 | | sec^nd quits [Remote host closed the connection] |
| 02:22:35 | | sec^nd (second) joins |
| 02:36:40 | <nexussfan> | There's a site dedicated to archiving Iranian series and films <https://nostalgik-tv.com/> which says they have 4 terabytes of videos. Would it be a good idea to archive it, or not now? |
| 02:44:47 | | APOLLO03 quits [Ping timeout: 268 seconds] |
| 02:47:44 | | ducky (ducky) joins |
| 02:56:13 | | nine quits [Ping timeout: 272 seconds] |
| 02:58:38 | | nine joins |
| 02:58:40 | | nine is now authenticated as nine |
| 02:58:40 | | nine quits [Changing host] |
| 02:58:40 | | nine (nine) joins |
| 03:13:09 | | iPwnedYourIOTSmartdog quits [Ping timeout: 268 seconds] |
| 03:13:46 | | iPwnedYourIOTSmartdog joins |
| 04:01:44 | | etnguyen03 quits [Remote host closed the connection] |
| 04:11:07 | | tekulvw quits [Ping timeout: 268 seconds] |
| 04:14:04 | | tekulvw (tekulvw) joins |
| 04:23:37 | | tekulvw quits [Ping timeout: 272 seconds] |
| 04:26:20 | | Island quits [Read error: Connection reset by peer] |
| 04:28:43 | | tekulvw (tekulvw) joins |
| 04:33:45 | | tekulvw quits [Ping timeout: 272 seconds] |
| 05:04:47 | | n9nes quits [Ping timeout: 272 seconds] |
| 05:08:15 | | n9nes joins |
| 05:14:14 | | tekulvw (tekulvw) joins |
| 05:24:25 | | tekulvw quits [Ping timeout: 272 seconds] |
| 05:32:43 | | sec^nd quits [Remote host closed the connection] |
| 05:33:05 | | sec^nd (second) joins |
| 05:42:42 | <ericgallager> | I forget, did this make it here? https://www.theregister.com/2026/02/12/polyglot_notebooks_deprecation/ |
| 05:51:53 | | tekulvw (tekulvw) joins |
| 05:56:34 | | tekulvw quits [Ping timeout: 268 seconds] |
| 05:56:53 | | tekulvw (tekulvw) joins |
| 06:01:30 | | tekulvw quits [Ping timeout: 268 seconds] |
| 06:16:19 | | nexussfan quits [Quit: Konversation terminated!] |
| 06:17:13 | | ArchivalEfforts quits [Quit: https://quassel-irc.org - Chat comfortably. Anywhere.] |
| 06:17:22 | | ArchivalEfforts joins |
| 06:22:50 | | tekulvw (tekulvw) joins |
| 06:27:45 | | tekulvw quits [Ping timeout: 272 seconds] |
| 06:57:16 | | tekulvw (tekulvw) joins |
| 07:05:37 | | pokechu22 (pokechu22) joins |
| 08:52:56 | | ducky quits [Ping timeout: 268 seconds] |
| 08:53:09 | | ducky (ducky) joins |
| 08:54:25 | | Dango360 quits [Quit: The Lounge - https://thelounge.chat] |
| 09:29:34 | | TheEnbyperor_ quits [Read error: Connection reset by peer] |
| 09:30:09 | | cipherrot quits [Ping timeout: 272 seconds] |
| 09:30:09 | | TheEnbyperor quits [Ping timeout: 272 seconds] |
| 09:37:48 | | Snivy quits [Quit: The Lounge - https://thelounge.chat] |
| 09:38:00 | | TheEnbyperor joins |
| 09:38:11 | | petrichor (petrichor) joins |
| 09:38:17 | | Snivy (Snivy) joins |
| 09:38:23 | | Snivy quits [Remote host closed the connection] |
| 09:38:36 | | TheEnbyperor_ (TheEnbyperor) joins |
| 09:39:42 | | Snivy (Snivy) joins |
| 09:42:53 | | tekulvw quits [Ping timeout: 268 seconds] |
| 10:03:54 | | rohvani quits [Quit: The Lounge - https://thelounge.chat] |
| 10:09:54 | | @arkiver quits [Remote host closed the connection] |
| 10:10:21 | | arkiver (arkiver) joins |
| 10:10:21 | | @ChanServ sets mode: +o arkiver |
| 10:14:12 | | fireatseaparks quits [Remote host closed the connection] |
| 10:14:48 | | fireatseaparks (fireatseaparks) joins |
| 10:26:37 | | APOLLO03 joins |
| 10:47:11 | | Webuser505408 joins |
| 10:47:32 | | Webuser505408 quits [Client Quit] |
| 11:37:03 | | tekulvw (tekulvw) joins |
| 11:39:37 | | Cornelius7 (Cornelius) joins |
| 11:41:15 | | Cornelius quits [Ping timeout: 272 seconds] |
| 11:41:15 | | Cornelius7 is now known as Cornelius |
| 11:41:53 | | tekulvw quits [Ping timeout: 272 seconds] |
| 11:47:35 | | irisfreckles13 joins |
| 11:58:52 | | APOLLO03 quits [Read error: Connection reset by peer] |
| 11:59:51 | | APOLLO03 joins |
| 12:00:03 | | Bleo1826007227196234552220 quits [Quit: The Lounge - https://thelounge.chat] |
| 12:02:44 | | Bleo1826007227196234552220 joins |
| 12:37:37 | | petrichor quits [Client Quit] |
| 12:51:47 | <irisfreckles13> | how do i request yt video to be archived? |
| 12:51:47 | <irisfreckles13> | possible? |
| 12:53:37 | <klea> | irisfreckles13: Depends if it's in scope, see https://wiki.archiveteam.org/index.php/YouTube#Scope and if it's in scope you can query it to #down-the-tube |
| 13:02:28 | <h2ibot> | Bear created Philips (+1114, Philips - more like Sorryps): https://wiki.archiveteam.org/?oldid=60489 |
| 13:17:48 | | Shard111 quits [Quit: Im doing something rq. Il brb] |
| 13:19:14 | | Shard1115 (Shard) joins |
| 13:23:15 | | petrichor (petrichor) joins |
| 13:37:50 | | Arcorann quits [Ping timeout: 268 seconds] |
| 13:38:29 | <justauser> | ericgallager: Doesn't look too actionable... |
| 13:41:34 | | Webuser660697 joins |
| 14:03:07 | | irisfreckles13 quits [Ping timeout: 272 seconds] |
| 14:12:44 | | Dada joins |
| 14:18:20 | | Dango360 (Dango360) joins |
| 14:25:45 | <h2ibot> | Justauser edited Discourse/active (+148, Added https://forums.kicksecure.com/…): https://wiki.archiveteam.org/?diff=60490&oldid=60465 |
| 14:31:32 | | tekulvw (tekulvw) joins |
| 14:36:25 | | tekulvw quits [Ping timeout: 268 seconds] |
| 14:52:01 | <@arkiver> | imer: are you able to see something in your logs that is queuing the googleapis.com URLs? |
| 14:54:54 | <@imer> | arkiver: (assuming #//) no, don't think its related to the other spam though |
| 14:56:09 | <@arkiver> | right, sorry, this was for #// |
| 15:09:01 | | irisfreckles13 joins |
| 15:11:35 | | Webuser116786 joins |
| 15:11:58 | | Webuser116786 quits [Client Quit] |
| 16:02:33 | | tekulvw (tekulvw) joins |
| 16:07:15 | | tekulvw quits [Ping timeout: 272 seconds] |
| 16:13:57 | | Island joins |
| 16:28:01 | <h2ibot> | Bear edited Mortis (+17, Provided by [[User:BouleBoule]] but not…): https://wiki.archiveteam.org/?diff=60491&oldid=58254 |
| 16:30:01 | <h2ibot> | Bear edited Mortis (-3, misplaced pipes): https://wiki.archiveteam.org/?diff=60492&oldid=60491 |
| 16:37:02 | <h2ibot> | Bear edited List of websites excluded from the Wayback Machine (+356, More details on Philips.com ([[Philips]])): https://wiki.archiveteam.org/?diff=60493&oldid=60371 |
| 16:40:37 | | Goofybally9 quits [Quit: The Lounge - https://thelounge.chat] |
| 16:41:23 | | Goofybally joins |
| 16:42:12 | | Goofybally quits [Client Quit] |
| 16:42:43 | | Goofybally (Goofybally) joins |
| 16:46:03 | <h2ibot> | Bear edited List of websites excluded from the Wayback Machine (+181, steampunkal.com excluded between 2013-11-12 and…): https://wiki.archiveteam.org/?diff=60494&oldid=60493 |
| 16:52:03 | | DogsRNice joins |
| 17:03:31 | | tekulvw (tekulvw) joins |
| 17:08:07 | | tekulvw quits [Ping timeout: 268 seconds] |
| 17:26:02 | <HP_Archivist> | RE: WARC captures. JAA apologies I'm just now responding to this. But thank you. I have used webrecorder for captures before, a few years ago. SingleFilez was merged into just SingleFile now, I think. |
| 17:26:39 | <HP_Archivist> | I used Webrecorder for these individual captures https://archive.org/details/@archivist_goals?query=warc |
| 17:27:04 | <HP_Archivist> | But going forward, I am leaning on trying browsertrix, to do things right. |
| 17:27:56 | <HP_Archivist> | Oh, but you said warcprox, too. Hm. |
| 17:30:11 | <HP_Archivist> | SingleFile's options are a little obtuse, though I remember using that too a few years back. |
| 17:33:50 | | tekulvw (tekulvw) joins |
| 17:42:02 | | tekulvw quits [Ping timeout: 268 seconds] |
| 17:44:31 | <justauser> | Fun https://pomf.lain.la/robots.txt |
| 17:45:04 | <justauser> | Apparently Google interpreted it as Disallow: /, but whatever is behind DDG didn't. |
| 17:54:22 | | corentin quits [Ping timeout: 268 seconds] |
| 17:54:27 | | tekulvw (tekulvw) joins |
| 17:59:01 | <justauser> | https://transfer.archivete.am/IjrDe/pomf.lain.la_ddg_nitter.txt |
| 17:59:02 | <eggdrop> | inline (for browser viewing): https://transfer.archivete.am/inline/IjrDe/pomf.lain.la_ddg_nitter.txt |
| 17:59:19 | <justauser> | Google, Yandex: nothing; Bing: unrelated websites. |
| 17:59:21 | | tekulvw quits [Ping timeout: 272 seconds] |
| 18:07:57 | <klea> | justauser: did you get pomf*.lain.la too?, iirc i've seen some pomf urls with pomf2 instead. |
| 18:08:18 | <justauser> | GitHub code, CDX: nothing |
| 18:08:27 | <justauser> | No, will check. |
| 18:08:47 | <klea> | ok, I think pomf3 and maybe check pomf[0-9] too I guess. |
| 18:09:32 | <klea> | apparently only pomf2 has valid tls. |
| 18:09:56 | | Webuser660697 quits [Quit: Ooops, wrong browser tab.] |
| 18:10:35 | | Webuser810542 joins |
| 18:13:36 | <justauser> | pomf2 is so much more abundant on Nitter, but only 1 link in DDG. |
| 18:14:38 | <klea> | huh |
| 18:27:13 | | @rewby quits [Ping timeout: 272 seconds] |
| 18:29:22 | | rewby (rewby) joins |
| 18:29:22 | | @ChanServ sets mode: +o rewby |
| 19:04:40 | | APOLLO03 quits [Ping timeout: 268 seconds] |
| 19:04:43 | | APOLLO03 joins |
| 19:05:32 | | tekulvw (tekulvw) joins |
| 19:17:37 | | Wohlstand1 (Wohlstand) joins |
| 19:18:14 | | tekulvw quits [Ping timeout: 268 seconds] |
| 19:19:59 | | Wohlstand1 is now known as Wohlstand |
| 19:27:56 | <justauser> | https://transfer.archivete.am/51vfP/pomf.lain.la_ddg_nitter_gharchive_2.txt |
| 19:27:56 | <eggdrop> | inline (for browser viewing): https://transfer.archivete.am/inline/51vfP/pomf.lain.la_ddg_nitter_gharchive_2.txt |
| 19:29:49 | <justauser> | pomf.lain.la is Wayback-excluded, but pomf2 is not. |
| 19:30:20 | <justauser> | So I suggest saving everything as pomf2 no matter which URL it has originally? |
| 19:31:31 | <justauser> | pomf2.lain.la has huge CDX records, in fact. Does it make sense to scrape them? |
| 19:31:50 | <justauser> | One URL = one file, so everything available in CDX is already saved. |
| 19:32:17 | <justauser> | But making a copy as AB WARC could help if pomf2 gets excluded too. |
| 19:32:28 | <klea> | yeah. |
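The "huge CDX records" mentioned above can be enumerated through the public Wayback CDX server API. A minimal sketch building such a query (the endpoint and parameters are the documented CDX API; the host prefix is from the discussion, and an excluded host typically returns an error or nothing rather than records):

```python
# Sketch: build a Wayback CDX API query for all captures under a prefix.
from urllib.parse import urlencode

def cdx_query(prefix, limit=1000):
    params = {
        "url": prefix,
        "matchType": "prefix",  # everything under this URL prefix
        "output": "json",
        "limit": str(limit),
        "fl": "timestamp,original,statuscode",  # fields to return
    }
    return "https://web.archive.org/cdx/search/cdx?" + urlencode(params)

print(cdx_query("pomf2.lain.la/"))
```

Fetching that URL (e.g. with urllib) returns a JSON array of capture rows; paging via `limit` plus resume keys is advisable for very large hosts.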
| 19:32:59 | <klea> | Also, whilst at it, it might be neat to archive philips stuff (re Bear's wiki page) |
| 19:33:02 | | Wohlstand quits [Ping timeout: 268 seconds] |
| 19:33:43 | | twiswist quits [Ping timeout: 272 seconds] |
| 19:54:53 | | APOLLO03a joins |
| 19:57:42 | | APOLLO03 quits [Ping timeout: 268 seconds] |
| 20:07:02 | | tekulvw (tekulvw) joins |
| 20:10:38 | | Doranwen quits [Read error: Connection reset by peer] |
| 20:11:01 | | Doranwen (Doranwen) joins |
| 20:11:05 | | rohvani joins |
| 20:11:43 | | tekulvw quits [Ping timeout: 272 seconds] |
| 20:14:01 | <TheoH7> | I'm in the process of uploading one of the crawls I did of https://community.jisc.ac.uk to IA. I can't see the websites collection. Is the best match "Community data" or do I need to request permissions? |
| 20:15:44 | <@JAA> | HP_Archivist: Browsertrix has the same issues of writing incorrect WARCs as far as I know. Haven't actually tested it though. |
| 20:17:10 | <@JAA> | (The readme of Browsertrix Crawler explicitly mentions capturing data through CDP, which can't possibly be correct because CDP doesn't expose the necessary data to write valid WARCs.) |
| 20:21:51 | | iPwnedYourIOTSmartdog quits [Ping timeout: 272 seconds] |
| 20:32:34 | <h2ibot> | Cooljeanius edited ArchiveBot/Ignore (+4, /* Substack */ red link to create article from): https://wiki.archiveteam.org/?diff=60495&oldid=59094 |
| 20:34:34 | <h2ibot> | Cooljeanius created Substack (+154, Created page with "{{stub}} '''Substack''' is…): https://wiki.archiveteam.org/?oldid=60496 |
| 20:41:53 | <nicolas17> | TheoH7: make sure you set mediatype to "web" |
| 20:42:10 | <nicolas17> | someone with more permissions can change the collection later |
| 20:42:19 | <nicolas17> | it still won't appear in wayback machine |
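nicolas17's advice (set mediatype to "web") can also be done from the `internetarchive` Python library rather than the web uploader. A hedged sketch; the identifier, filename, and title below are placeholders, not real items:

```python
# Sketch: upload a WARC to archive.org with mediatype "web", using the
# internetarchive library (pip install internetarchive).
metadata = {
    "mediatype": "web",                      # makes IA treat it as a web archive
    "title": "community.jisc.ac.uk crawl",   # illustrative placeholder
}

def do_upload():
    # Deferred import so the metadata above can be inspected without
    # the library installed; upload() is the library's documented entry point.
    from internetarchive import upload
    upload("some-identifier",                # placeholder identifier
           files=["crawl-00000.warc.gz"],    # placeholder filename
           metadata=metadata)

print(metadata["mediatype"])
```

As noted above, mediatype alone does not put the item into the Wayback Machine; that requires separate indexing on IA's side.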
| 20:49:10 | | DlugasnyPL joins |
| 20:50:28 | <DlugasnyPL> | HI, does archiveteam has any team which downloading rendered pages ? |
| 20:51:15 | | Wohlstand1 (Wohlstand) joins |
| 20:53:37 | | Wohlstand1 is now known as Wohlstand |
| 20:58:09 | | Webuser081957 joins |
| 20:58:26 | | Webuser081957 quits [Client Quit] |
| 20:58:49 | | twiswist (twiswist) joins |
| 21:02:27 | <pokechu22> | What do you mean by "rendered pages"? |
| 21:04:06 | | tekulvw (tekulvw) joins |
| 21:11:15 | | tekulvw quits [Ping timeout: 272 seconds] |
| 21:11:37 | <DlugasnyPL> | As I see in the documentation, the Warrior uses wget to download pages. wget is OK, but it cannot execute JavaScript, for example. Rendered pages = Browsertrix? |
| 21:13:08 | <DlugasnyPL> | For a few months I've been creating archives using Browsertrix. That's why I'm asking if you have any group here working with this kind of archiving |
| 21:18:08 | <pokechu22> | Yes, https://wiki.archiveteam.org/index.php/User:TheTechRobo/Mnbot (though this is not suitable for super large sites). Warrior projects also use lua so they can do things slightly smarter (e.g. look for specific strings in the page source and generate new requests off of those), though for particularly complicated sites that's insufficient. |
| 21:25:01 | <DlugasnyPL> | Interesting. But as I see, it's still in the development phase. I will keep an eye on it. |
| 21:25:21 | | Island_ joins |
| 21:25:49 | | Ryz quits [Ping timeout: 272 seconds] |
| 21:25:53 | | Ryz (Ryz) joins |
| 21:27:14 | <DlugasnyPL> | I have started my first Docker container with the Warrior on one of my servers. Is there any parameter to increase parallel downloads/uploads? I have a lot of resources, but I do not know how to set it up smartly. |
| 21:28:58 | | Island quits [Ping timeout: 268 seconds] |
| 21:29:51 | <klea> | --concurrent iirc, but also keep in mind if the limit you set is too high the site may ban you. |
| 21:41:00 | | tekulvw (tekulvw) joins |
| 21:41:13 | <DlugasnyPL> | I do not want to open parallel connections to one site. I would like to increase the number of pages my Warrior will crawl: 1-2 connections per site, multiple different domains. How do I set that up? |
| 21:42:16 | <DlugasnyPL> | 1-2 connections per domain, just to avoid a ban, multiple domains in parallel |
| 21:45:37 | | tekulvw quits [Ping timeout: 268 seconds] |
| 21:45:43 | <klea> | You probably want to run project workers directly instead of the docker warrior then. |
| 21:45:47 | <klea> | I believe? |
| 21:46:14 | <klea> | If you haven't set a choice, IIRC AT's default choice is normally telegram. |
| 21:48:48 | <DlugasnyPL> | project workers - yes, that sounds better ;) |
| 21:50:34 | <klea> | I don't know how people typically automate that process tho. |
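Running project workers directly, as suggested above, is usually done with the seesaw toolkit rather than the Warrior container. A hedged command sketch: the repository name is a placeholder, some projects use `run-pipeline3` instead of `run-pipeline`, and each project's README is authoritative.

```shell
# Sketch: run an ArchiveTeam project's pipeline directly with seesaw.
pip install seesaw
git clone https://github.com/ArchiveTeam/example-grab   # placeholder repo
cd example-grab
# --concurrent controls how many items this worker processes in parallel;
# keep it low to avoid getting banned by the target site.
run-pipeline pipeline.py YOURNICKNAME --concurrent 2
```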
| 21:54:28 | <DlugasnyPL> | I have asked ChatGPT but it also does not know |
| 21:56:56 | | Dada quits [Remote host closed the connection] |
| 22:03:30 | | DlugasnyPL quits [Remote host closed the connection] |
| 22:11:52 | | Dada joins |
| 22:44:28 | | Dada quits [Remote host closed the connection] |
| 22:47:19 | | DlugasnyPL joins |
| 22:50:30 | <DlugasnyPL> | If I create multiple instances of the Warrior on my server, you said I will end up with multiple Telegram crawlers... Is there any chance to download a list of my pages using this tool? |
| 22:53:55 | <DlugasnyPL> | In my opinion the end user should have a choice of what his instance will archive. I know that Archive Team is an organised group with a process, but is there any chance to do archiving (properly) as a "lone wolf", using a standalone version of the Warrior with a user-supplied list of domains? |
| 22:56:52 | <klea> | You can choose the project, you can't choose what specific items you'll get for a project. |
| 22:59:33 | | APOLLO03a quits [Ping timeout: 272 seconds] |
| 22:59:33 | | ^ quits [Ping timeout: 272 seconds] |
| 23:00:39 | <DlugasnyPL> | I do not understand the context of the word "project". What does it mean? Does it mean that somebody must set up a project for a specified domain, for example for telegram.org? That is a huge site, so I believe the task must be divided into small pieces and delegated to the end users. A project is something like the configuration required to download a specified site, correct? |
| 23:01:37 | | tekulvw (tekulvw) joins |
| 23:02:06 | <klea> | A project is normally, but not always, a specific website, or a combination of websites that work in the same manner. |
| 23:02:49 | <klea> | The website https://tracker.archiveteam.org/ shows some projects which are on the list of things people running Warriors can choose to do. |
| 23:03:26 | <klea> | Things like URLTeam2 or YouTube have some recommendations against running them at home because they leak IP addresses. |
| 23:03:56 | | ^ (^) joins |
| 23:04:46 | | Wohlstand quits [Client Quit] |
| 23:04:53 | | nexussfan (nexussfan) joins |
| 23:06:23 | <DlugasnyPL> | leak IP ? what do You mean ? |
| 23:06:24 | | tekulvw quits [Ping timeout: 268 seconds] |
| 23:06:56 | <DlugasnyPL> | archive team system trying to hide end user ips ? |
| 23:07:01 | <klea> | URLTeam2 also does lots of requests per second to urls that may not be fully vetted against. |
| 23:07:56 | <DlugasnyPL> | That's why you are spreading the job across multiple end users |
| 23:07:58 | <DlugasnyPL> | ok |
| 23:08:48 | <klea> | no, also I'm not doing things right now. |
| 23:09:21 | <@JAA> | klea is confusing URLTeam with URLs again. |
| 23:09:34 | <klea> | Sorry. |
| 23:09:40 | <klea> | The webui is confusing. |
| 23:09:48 | <klea> | it'd be neat to show the warrior readmes on the webpage too. |
| 23:10:23 | <klea> | s/URLTeam2 or/URLs or/ |
| 23:11:07 | <DlugasnyPL> | that would be perfect |
| 23:11:40 | <DlugasnyPL> | but discussion here is also nice :) |
| 23:13:51 | | APOLLO03 joins |
| 23:14:26 | | Dada joins |
| 23:15:52 | <DlugasnyPL> | I have a list of 36,000 domains not archived yet, Polish sites. I come from Poland. Is there any chance to create a project for them? |
| 23:17:11 | <klea> | depends on what kind of sites, and how they work. |
| 23:17:34 | <klea> | it could be done slowly on #archivebot most likely if they're not using javascript too much. |
| 23:17:49 | <DlugasnyPL> | they are using a lot js |
| 23:18:02 | <DlugasnyPL> | formulars, search pages |
| 23:18:04 | <DlugasnyPL> | etc.etc. |
| 23:18:51 | | Arcorann (Arcorann) joins |
| 23:19:11 | <DlugasnyPL> | We have started archiving once per month on most of those sites using Browsertrix with crawl depth control, but even with this it is a very time-consuming process |
| 23:19:57 | <@JAA> | Maybe it could slowly be run through #jseater but we don't currently have a distributed setup for JS-heavy things. |
| 23:20:13 | <@JAA> | No recursion there though. |
| 23:22:10 | <klea> | Should I add my jseater url list thingy to the wiki page? |
| 23:22:20 | | etnguyen03 (etnguyen03) joins |
| 23:22:23 | <klea> | maybe I should make it not give who queued stuff? |
| 23:22:59 | <DlugasnyPL> | Do you know what exactly, which problem, discredits Browsertrix? |
| 23:24:02 | <@JAA> | Same problem as ArchiveWeb.Page and anything else that uses Chrome Debugging Protocol to produce WARCs: it can't accurately capture the actual HTTP data received from the server, only a parsed version of it. |
| 23:24:11 | <@JAA> | See https://wiki.archiveteam.org/index.php/The_WARC_Ecosystem |
| 23:28:04 | <DlugasnyPL> | parsed version, you mean output from browser ? |
| 23:28:50 | <@JAA> | Parsed representation of the HTTP responses, e.g. key-value pair of headers instead of the raw bytes. |
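JAA's point can be illustrated with stdlib Python: once a DevTools-style API hands you headers as parsed key-value pairs, the original wire bytes (casing, ordering, exact spacing) are gone, so a naive re-serialization cannot reproduce what the server actually sent. The response below is made up for illustration.

```python
# Illustration: parsed header representation vs. raw HTTP bytes.
raw = (b"HTTP/1.1 200 OK\r\n"
       b"content-TYPE: text/html\r\n"  # unusual casing, as a server might send
       b"X-First: 1\r\n"
       b"x-second: 2\r\n"
       b"\r\n"
       b"<html></html>")

# Parse into a key-value mapping, roughly as a debugging API exposes it.
head, _, body = raw.partition(b"\r\n\r\n")
status, *header_lines = head.split(b"\r\n")
parsed = {}
for line in header_lines:
    k, _, v = line.partition(b": ")
    parsed[k.title()] = v  # casing normalized; original form is lost

# Naive re-serialization from the parsed form:
rebuilt = status + b"\r\n" + b"\r\n".join(
    k + b": " + v for k, v in parsed.items()) + b"\r\n\r\n" + body

print(rebuilt != raw)  # -> True: the exact wire bytes cannot be recovered
```

A WARC writer that only sees `parsed` therefore records a reconstruction, not the true response, which is why a proxy like warcprox (which sees the raw bytes) produces more faithful WARCs.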
| 23:33:57 | <DlugasnyPL> | What kind of impact does it have on the final result? When a user opens an archive from Browsertrix and one from wget-at, what will be the difference? For example, in a WARC from the Warrior a search field with JS will not work and will not display any items, correct? Browsertrix will show the "rendered" JS page, right, and wget-at will not, but wget-at will have something which is not visible to the end user yet important for the WARC standard, correct? |
| 23:35:12 | | Dada quits [Remote host closed the connection] |
| 23:35:16 | <klea> | WARC is a raw capture of the raw bytes of the HTTP requests and responses, not of rendered webpages. |
| 23:35:57 | <DlugasnyPL> | just trying to understand the idea of WARC - visual effect, archive to browse by people or dry standard to collect raw data from the http servers |
| 23:36:34 | <DlugasnyPL> | ok |
| 23:37:02 | <@JAA> | Capturing the data and playing back pages from it are two entirely separate issues. |
| 23:38:22 | <DlugasnyPL> | So if data saved in a WARC is friendly to the human eye but not friendly for machines, then it is not the right standard. So you are trying to record the raw output from the servers and write it in a form which can be used afterwards to "render" the page? |
| 23:47:07 | <DlugasnyPL> | I think that's all for today. I will keep my Warrior running and help you crawl US pages, but anyway I would like to start the archiving process for plenty of Polish pages. Thank you for this nice introduction. |
| 23:48:31 | <@JAA> | WARC is both very human- and machine-readable, in the same way as HTTP. |
| 23:50:07 | <@JAA> | If you capture all HTTP requests/responses that occur during a page load, it should be possible to play that back again later as well. There are a lot of edge cases though. |
| 23:50:22 | <@JAA> | You can still store a screenshot or a DOM dump in the WARC as well if you want to do that. |
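To make JAA's "human- and machine-readable" point concrete, a single WARC response record is just named headers followed by the captured HTTP bytes. The values below are illustrative (the UUID is elided), and the Content-Length values assume CRLF line endings in the HTTP block:

```
WARC/1.1
WARC-Type: response
WARC-Target-URI: http://example.com/
WARC-Date: 2026-02-16T20:53:04Z
WARC-Record-ID: <urn:uuid:...>
Content-Type: application/http; msgtype=response
Content-Length: 77

HTTP/1.1 200 OK
Content-Type: text/html
Content-Length: 13

<html></html>
```

A WARC file is a sequence of such records (plus request, metadata, and warcinfo records), usually gzip-compressed per record.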
| 23:50:55 | <klea> | It'd be fun to make something to take tcpdump/wireshark output and make warcs out of it, but I don't have time. |
| 23:51:51 | <DlugasnyPL> | One more question before I go to sleep: where does the Warrior upload all the files? Directly to archive.org? |