00:01:29 | | TheTechRobo quits [Read error: Connection reset by peer] |
00:01:29 | | ScenarioPlanet quits [Read error: Connection reset by peer] |
00:01:29 | | Pedrosso quits [Read error: Connection reset by peer] |
00:01:54 | | Pedrosso joins |
00:01:59 | | ScenarioPlanet (ScenarioPlanet) joins |
00:02:11 | | TheTechRobo (TheTechRobo) joins |
00:27:37 | | sec^nd quits [Remote host closed the connection] |
00:27:59 | | sec^nd (second) joins |
00:40:47 | | etnguyen03 quits [Client Quit] |
00:41:10 | | lunik1 quits [Ping timeout: 255 seconds] |
00:48:06 | | NIC007a83 quits [Client Quit] |
01:02:21 | | etnguyen03 (etnguyen03) joins |
01:20:50 | | Arcorann (Arcorann) joins |
01:45:17 | <plcp> | thuban: I'll ask him, brb soon™ |
02:26:38 | | etnguyen03 quits [Client Quit] |
02:45:36 | | etnguyen03 (etnguyen03) joins |
02:49:47 | | p666v joins |
02:51:55 | | p666v quits [Client Quit] |
03:20:41 | | DogsRNice joins |
03:27:44 | | icedice quits [Client Quit] |
03:59:53 | | pillopillo joins |
04:00:51 | | pillopillo quits [Client Quit] |
04:08:49 | <thuban> | plcp: cool, thanks! |
04:22:25 | | deadorbit quits [Client Quit] |
04:22:26 | | Unholy2361 quits [Client Quit] |
04:22:59 | | Unholy2361 (Unholy2361) joins |
04:25:53 | | SootBector quits [Remote host closed the connection] |
04:27:00 | | SootBector (SootBector) joins |
04:29:28 | | etnguyen03 quits [Client Quit] |
04:30:44 | | etnguyen03 (etnguyen03) joins |
04:48:30 | | kiryu (kiryu) joins |
04:50:27 | | midou quits [Remote host closed the connection] |
04:50:38 | | midou joins |
04:59:53 | | etnguyen03 quits [Remote host closed the connection] |
05:11:37 | | beastbg8 (beastbg8) joins |
05:29:28 | | datechnoman quits [Quit: The Lounge - https://thelounge.chat] |
05:36:45 | | datechnoman (datechnoman) joins |
06:16:10 | | DogsRNice quits [Read error: Connection reset by peer] |
07:05:03 | | Unholy2361 quits [Remote host closed the connection] |
07:06:10 | | Unholy2361 (Unholy2361) joins |
07:09:43 | | BlueMaxima quits [Read error: Connection reset by peer] |
07:24:06 | | sec^nd quits [Ping timeout: 255 seconds] |
08:10:37 | | arjie joins |
08:17:28 | <arjie> | Hey, guys, I've got an instance of the Warrior running. What I was hoping to do is to target it at some URL and have it spider that domain from there. Is there a feasible way to configure it to do this? I assume in the best case this is something like: |
08:17:29 | <arjie> | 1. Run the unofficial tracker Docker container https://wiki.archiveteam.org/index.php/Dev/Tracker |
08:17:29 | <arjie> | 2. Figure out how to modify that to have a custom project file |
08:17:30 | <arjie> | 3. Point my Warrior at that tracker |
08:17:30 | <arjie> | But just in case I'm overthinking this, is there a straightforward way for me to just point whatever smart rate-limited crawling the Warrior does at some particular URL and spider out from there? |
08:22:02 | <thuban> | arjie: you probably want to look at https://github.com/ArchiveTeam/grab-site/ instead |
08:22:37 | | Island quits [Read error: Connection reset by peer] |
08:23:43 | <arjie> | Oh that sounds _exactly_ like what I want. Is there something I can pair it with to submit the pages to the Internet Archive at archive.org as well? |
08:26:14 | <arjie> | Ah I've found https://gist.github.com/Asparagirl/6206247 that helps upload a WARC. I'm going to try this out. |
08:26:49 | <thuban> | arjie: you can upload the results to the internet archive yourself, but that won't make them available in the wayback machine; only whitelisted accounts are trusted sources for the wbm. |
08:27:53 | <thuban> | the archiveteam account is whitelisted, though, so if we crawl it with archivebot it will show up. care to share the domain? |
08:28:09 | <arjie> | Ah I see. For good reason, I suppose. I'll just leave the warrior running on auto then. I was hoping to spider pages that are part of the longer tail of Internet websites that don't get much SEO. |
08:29:14 | <arjie> | Sort of struck me when I saw this comment https://news.ycombinator.com/item?id=40020345 on the Hacker News post for http://ascii.textfiles.com/archives/5591 |
08:30:37 | <thuban> | well, feel free to make suggestions in #archivebot for full sites, or #// if you have big lists of individual pages |
08:31:06 | <arjie> | Okay, thank you! I imagine the usual ones like https://github.com/kagisearch/smallweb are already in there? |
08:33:59 | <thuban> | it's definitely been mentioned, let me see if we have it covered |
08:34:47 | <fireonlive> | we don't seem to have it in https://github.com/ArchiveTeam/urls-sources |
08:37:21 | <fireonlive> | welcome arjie :) |
08:37:29 | <thuban> | right, although it's been suggested. we have done a one-time capture of all the indexed feeds/homepages, though https://archive.fart.website/archivebot/viewer/?q=kagisearch |
08:38:42 | <arjie> | Thank you, fireonlive :) |
08:38:42 | <arjie> | Good to see it's already covered, thuban! |
09:00:03 | | Bleo182600 quits [Client Quit] |
09:01:22 | | Bleo182600 joins |
09:29:46 | | grid joins |
09:41:55 | | Larsenv quits [Quit: The Lounge - https://thelounge.chat] |
09:59:42 | | driib quits [Client Quit] |
10:01:36 | | tony joins |
10:05:53 | | driib (driib) joins |
10:22:56 | | bladem quits [Read error: Connection reset by peer] |
10:30:40 | | tony quits [Client Quit] |
11:16:07 | | lunik1 joins |
11:49:36 | | grid quits [Client Quit] |
12:11:59 | | eightthree quits [Ping timeout: 272 seconds] |
12:13:01 | | eightthree joins |
12:19:20 | | fuzzy8021 quits [Read error: Connection reset by peer] |
12:19:53 | | decky joins |
12:21:02 | | fuzzy8021 (fuzzy8021) joins |
12:22:45 | | decky_e quits [Ping timeout: 272 seconds] |
12:31:25 | | etnguyen03 (etnguyen03) joins |
12:53:28 | | Bleo182600 quits [Client Quit] |
12:54:08 | | Notrealname1234 (Notrealname1234) joins |
12:54:41 | | Bleo182600 joins |
12:57:24 | | Notrealname1234 quits [Client Quit] |
13:20:02 | | Jackster joins |
13:22:06 | <Jackster> | Anyone got a heritrix3 config that is reasonably fast at archiving? Got a site with 2m URLs and it is doing it at 0.25 per second atm. I don't fancy waiting 10 years :p |
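For scale, the figures Jackster quotes can be turned into a rough time estimate. The sketch below assumes that "2m" means 2,000,000 URLs, that the 0.25 URLs per second rate stays constant, and that no new URLs are discovered during the crawl. The helper name `crawl_eta_days` is hypothetical and is not part of heritrix3.

```python
SECONDS_PER_DAY = 86_400


def crawl_eta_days(total_urls: int, urls_per_second: float) -> float:
    """Estimate crawl duration in days, assuming a constant fetch rate."""
    if urls_per_second <= 0:
        raise ValueError("urls_per_second must be positive")
    # Total seconds needed at the given rate, converted to days.
    return (total_urls / urls_per_second) / SECONDS_PER_DAY


# Figures from the message above (assumed: 2m = 2,000,000 URLs).
eta = crawl_eta_days(2_000_000, 0.25)
print(f"{eta:.1f} days")  # prints "92.6 days"
```

Under these assumptions the crawl would take roughly three months rather than years. A real crawl would likely differ, since the queue usually grows as new links are found.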
13:22:32 | | etnguyen03 quits [Client Quit] |
13:25:13 | | sec^nd (second) joins |
14:04:45 | | pixel (pixel) joins |
14:06:37 | | Arcorann quits [Ping timeout: 272 seconds] |
14:10:38 | | grid joins |
14:30:22 | | Notrealname1234 (Notrealname1234) joins |
14:34:10 | | Notrealname1234 quits [Client Quit] |
14:41:03 | | nicholl joins |
14:42:42 | <nicholl> | I want imagsrc pics |
14:43:47 | | Jackster quits [Client Quit] |
14:53:16 | | decky_e joins |
14:55:16 | | decky quits [Ping timeout: 255 seconds] |
15:19:29 | | nicholl quits [Ping timeout: 265 seconds] |
15:22:59 | | etnguyen03 (etnguyen03) joins |
15:33:30 | | Guest quits [Ping timeout: 265 seconds] |
15:48:51 | | Guest joins |
16:08:01 | | blue_0000ff is now authenticated as blue_0000ff |
16:25:25 | | Wohlstand (Wohlstand) joins |
16:27:26 | | Bleo182600 quits [Client Quit] |
16:27:46 | | Bleo182600 joins |
16:39:55 | | DogsRNice joins |
16:43:10 | | knecht4 quits [Client Quit] |
16:46:02 | | knecht4 joins |
17:03:52 | | Deewiant quits [Remote host closed the connection] |
17:04:59 | | Deewiant (Deewiant) joins |
17:09:38 | | BearFortress quits [Quit: https://quassel-irc.org - Chat comfortably. Anywhere.] |
17:10:28 | | grid quits [Client Quit] |
17:15:00 | | etnguyen03 quits [Client Quit] |
17:22:13 | | grid joins |
18:04:38 | | etnguyen03 (etnguyen03) joins |
18:17:03 | | etnguyen03 quits [Client Quit] |
18:25:20 | | Wohlstand quits [Client Quit] |
18:33:14 | | etnguyen03 (etnguyen03) joins |
18:39:56 | | BearFortress joins |
18:49:10 | | Wohlstand (Wohlstand) joins |
18:51:48 | | Island joins |
19:02:05 | | ^ quits [Remote host closed the connection] |
19:02:20 | | ^ (^) joins |
19:09:58 | | zhongfu quits [Ping timeout: 255 seconds] |
19:10:37 | | lflare quits [Ping timeout: 272 seconds] |
19:21:02 | | zhongfu (zhongfu) joins |
19:40:28 | | grid quits [Client Quit] |
19:43:33 | | lflare (lflare) joins |
19:51:21 | | etnguyen03 quits [Client Quit] |
20:00:25 | | etnguyen03 (etnguyen03) joins |
20:04:18 | | zhongfu quits [Client Quit] |
20:07:12 | | zhongfu (zhongfu) joins |
20:22:25 | | etnguyen03 quits [Client Quit] |
20:23:55 | | jerm joins |
20:24:28 | | jerm quits [Client Quit] |
20:40:06 | | etnguyen03 (etnguyen03) joins |
21:00:00 | | tapos joins |
21:00:40 | | andrew7 (andrew) joins |
21:01:29 | | andrew quits [Killed (NickServ (GHOST command used by andrew7))] |
21:01:31 | | andrew7 is now known as andrew |
21:03:52 | | Unholy2361 quits [Client Quit] |
21:05:25 | | Unholy2361 (Unholy2361) joins |
21:24:28 | | andrew1 (andrew) joins |
21:26:19 | | andrew quits [Ping timeout: 255 seconds] |
21:26:19 | | andrew1 is now known as andrew |
22:26:09 | | BlueMaxima joins |
23:13:59 | | arjie quits [Client Quit] |
23:38:14 | | benjinsm joins |
23:41:41 | | benjins quits [Ping timeout: 272 seconds] |
23:50:05 | | Larsenv (Larsenv) joins |
23:50:05 | | Larsenv quits [Client Quit] |
23:51:39 | | atphoenix (atphoenix) joins |
23:51:45 | | Larsenv (Larsenv) joins |