00:00:06<sudesu>ah, thank you! but I need historical data ^_^;
00:03:37<sudesu>I was hoping to find an updated version of https://boards.4chan.org/t/thread/1153106#p1153107
00:08:06<sudesu>are there any guides to running your own 4chan archive around here? I have the storage and resources to do so
00:27:13etnguyen03 quits [Client Quit]
00:27:36arch quits [Ping timeout: 256 seconds]
00:36:50Terbium quits [Quit: No Ping reply in 180 seconds.]
00:38:43<@JAA>There's a large page about 4chan on the wiki.
00:38:55<@JAA>Including mentions of mirroring tools etc., IIRC.
00:42:57arch (arch) joins
00:45:12Terbium joins
00:51:47opl3 (opl) joins
00:53:29opl quits [Ping timeout: 272 seconds]
00:53:29opl3 is now known as opl
00:58:37<klea>afaik it's not tools but AT archiving archives :p
00:58:59<sudesu>anon it's really hard to read through the wiki, I don't even know what to search for, I am just looking for a data dump (*/_\)
00:59:01<klea>oh there are tools sorry
00:59:18<klea>afaik you'd have to make it yourself?
00:59:43<sudesu>surely every archive isn't making their own dataset, right? @_@
01:01:33<klea>https://github.com/eksopl/fuuka?
01:01:38etnguyen03 (etnguyen03) joins
01:02:09Webuser874861 joins
01:02:24Webuser874861 quits [Client Quit]
01:02:29<sudesu>mostly I am just looking for some tar file with all the posts ever posted to 4chan, so I can then search through it
01:09:41<hexagonwin>sudesu: maybe https://archive.4plebs.org/_/articles/credits/ can help?
01:10:36<hexagonwin>4plebs has data dumps on IA so at least you can get /pol/ i believe
01:27:24nimaje1 joins
01:27:57nimaje quits [Read error: Connection reset by peer]
01:30:29<nicolas17>I wouldn't even expect "archive of all posts ever posted to 4chan" to be a thing that exists
01:32:12<nicolas17>I remember some website that archived /b/ threads (individual threads specifically requested by users) which had special measures to refresh the archive every *second* when the thread was close to expiration in order to catch the last posts
01:37:11<pabs>arkiver: btw, would getting wiki diff traffic on project channels be feasible/good?
01:37:15<pabs>guess the script would have to parse the template, and somehow make sure not to spam channels it shouldn't
01:40:07<nicolas17>I'm getting another file listing of the parti bucket to diff it later
01:40:58<nicolas17>so far it looks like the channels directory had many new videos as people keep streaming and many deleted videos as they expire, but the ivs directory (listing still in progress) had zero changes
01:44:50khaoohs joins
01:48:35Island joins
02:05:55guest1 joins
02:09:13<sudesu>hexagonwin: thanks for the link
02:09:13<sudesu>I was hoping there was something more recently updated ^_^;
02:09:13<sudesu>I found https://archive.org/details/4plebsimagedump has "data dump 2024"
02:09:13<sudesu>I suppose the best way to get my hands on the data is just asking the people running the archives?
02:09:13<sudesu>nicolas17: tbh I was expecting all the posts to already be concatenated into a single archive @_@
02:09:13<sudesu>is the current state of 4chan archives really so uncoordinated?
02:10:58<nicolas17>¯\_(ツ)_/¯ I didn't even know of the ones listed in the archiveteam wiki
02:11:24<nicolas17>and there doesn't seem to be any current 4chan archival project as part of archiveteam?
02:12:36<nicolas17>seems kind of hard with our infrastructure given the thread expiration thing
02:13:55arch quits [Ping timeout: 272 seconds]
02:13:58<nicolas17>with a normal forum, when we archive everything, years-old threads aren't gonna change anymore and we save them in full, and recent threads may get new posts after archiving and we'll lose those posts
02:14:34<nicolas17>with 4chan, archiving a thread too early would miss new posts but *also* archiving too late means it could expire
02:15:08<nicolas17>and our DPoS stuff has no guarantees of how long it may take for something to get archived once added to the queue...
02:16:33<nicolas17>doing "continuous" archival of 4chan would take quite some effort
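A toy illustration of the timing problem described above, using 4chan's public read-only JSON API (a.4cdn.org); board, thread number and polling interval are placeholders, and real archives do this continuously across whole boards rather than per thread:

    # Sketch: keep refetching one thread until it expires, keeping the last
    # snapshot seen. Poll too slowly and the final posts are missed.
    import time
    import requests

    def follow_thread(board, thread_no, interval=60):
        last_snapshot = None
        while True:
            r = requests.get(f"https://a.4cdn.org/{board}/thread/{thread_no}.json", timeout=30)
            if r.status_code == 404:   # thread pruned or archived; no more chances
                return last_snapshot
            if r.status_code == 200:
                last_snapshot = r.json()
            time.sleep(interval)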
02:19:07<sudesu>ah, I see... sorry, I didn't realize archiveteam had so much on its plate. I honestly thought it was only focused on 4chan ^_^; my bad lol
02:23:57arch (arch) joins
02:26:42hexagonwin_ joins
02:28:18hexagonwin quits [Ping timeout: 256 seconds]
02:29:35<klea>it'd be interesting to try to make the warrior thingy allow working on more than one project at a time, with some kind of language for delegating tasks to specific systems based on things like whether they have a lot of storage, where they're going out to the internet from, etc (basically allow individually picking warriors to use for specific projects)
03:00:42<nicolas17>sudesu: #archivebot is a fun place to watch for a while :P
03:01:49<Guest>adding to this ^^ - i think the simplest way is running a cronjob every x hours to update "warrior characteristics" (what you're describing), and having it pull the latest AT warrior allocation characteristics ("WAC" i guess). then, the warrior filters through all of those and decides the best project to work on. so most of the work is client-side besides fetching and maintaining the WAC from AT servers. cc: klea
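A rough sketch of what that client-side selection could look like, assuming a hypothetical WAC JSON endpoint and made-up field names:

    # Hypothetical: the warrior periodically fetches a "warrior allocation
    # characteristics" (WAC) file and filters projects against its own
    # characteristics. URL, schema and field names are all invented here.
    import requests

    LOCAL = {"free_disk_gb": 500, "country": "DE"}

    def pick_project(wac_url="https://tracker.example.org/wac.json"):  # placeholder URL
        wac = requests.get(wac_url, timeout=30).json()
        for project in wac.get("projects", []):
            needs = project.get("requirements", {})
            if (needs.get("min_free_disk_gb", 0) <= LOCAL["free_disk_gb"]
                    and LOCAL["country"] not in needs.get("excluded_countries", [])):
                return project["name"]
        return None  # fall back to whatever the tracker default is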
03:02:31<Guest>WAC is a cool name
03:02:46<nicolas17>wireless auto config
03:03:50<Guest>isn't everything wireless now
03:04:10<nicolas17>>wireless device
03:04:11<nicolas17>>look inside
03:04:13<nicolas17>>wires
03:05:09guest1 quits [Client Quit]
03:05:12<Guest>or "warrior auto config", but it isn't necessarily a config
03:23:00Guest58 quits [Quit: My Mac has gone to sleep. ZZZzzz…]
03:30:56<klea>it's even more fun to connect to ws://archivebot.archivingyoursh.it/stream and see it move
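A minimal way to watch that stream from a script, assuming it speaks plain WebSocket with no authentication (Python, `websockets` package):

    # Connect to the ArchiveBot dashboard firehose and dump incoming events.
    import asyncio
    import websockets  # pip install websockets

    async def watch():
        async with websockets.connect("ws://archivebot.archivingyoursh.it/stream") as ws:
            async for message in ws:
                print(message[:300])  # one dashboard event per message

    asyncio.run(watch())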
03:31:54etnguyen03 quits [Client Quit]
03:37:29<that_lurker>katia++
03:37:30<eggdrop>[karma] 'katia' now has 111 karma!
03:37:48etnguyen03 (etnguyen03) joins
03:51:21<h2ibot>Cooljeanius edited Adobe Aero (+41, /* Archival progress */ linkify): https://wiki.archiveteam.org/?diff=58460&oldid=58226
03:53:02etnguyen03 quits [Remote host closed the connection]
04:13:17Island quits [Read error: Connection reset by peer]
04:15:23Wohlstand (Wohlstand) joins
04:38:06<nicolas17>parti bucket grew from 197M files to 208M files
04:38:55DogsRNice quits [Read error: Connection reset by peer]
04:41:50Guest58 joins
04:47:33Guest58 quits [Client Quit]
05:05:30andrewnyr quits [Quit: Ping timeout (120 seconds)]
05:07:03Guest58 joins
05:08:19cyanbox joins
05:10:12andrewnyr joins
05:14:42Guest58 quits [Client Quit]
05:31:35Guest58 joins
05:33:27Guest58 quits [Client Quit]
05:34:04Guest58 joins
05:40:10Guest58 quits [Client Quit]
05:42:30Guest58 joins
05:43:56HackMii (hacktheplanet) joins
05:47:17SootBector quits [Remote host closed the connection]
05:48:24SootBector (SootBector) joins
05:51:25Guest58 quits [Client Quit]
05:53:53Guest58 joins
05:57:39ericgallager joins
05:58:07cooljeanius quits [Ping timeout: 272 seconds]
06:07:25Guest58 quits [Client Quit]
06:24:28Guest58 joins
06:27:20Guest58 quits [Client Quit]
06:31:19nexussfan quits [Quit: Konversation terminated!]
06:31:19Guest58 joins
06:33:33Guest58 quits [Client Quit]
06:34:22Guest58 joins
06:36:07agtsmith quits [Ping timeout: 272 seconds]
06:36:22Guest58 quits [Client Quit]
06:40:04Guest58 joins
06:43:00Guest58 quits [Client Quit]
06:44:59ericgallager quits [Ping timeout: 272 seconds]
06:50:53Guest58 joins
06:51:34agtsmith joins
06:54:49Guest58 quits [Client Quit]
06:58:33Guest58 joins
07:00:16Guest58 quits [Client Quit]
07:29:26benjins3 quits [Read error: Connection reset by peer]
08:08:07Boppen_ quits [Read error: Connection reset by peer]
08:12:17HackMii quits [Remote host closed the connection]
08:12:34HackMii (hacktheplanet) joins
08:13:28Boppen (Boppen) joins
08:30:59Guest58 joins
08:40:21PredatorIWD25 quits [Read error: Connection reset by peer]
08:40:32PredatorIWD25 joins
08:56:25Webuser445398 joins
08:59:29Dada joins
09:47:06Webuser445398 leaves
10:51:43BornOn420_ (BornOn420) joins
10:51:59BornOn420 quits [Ping timeout: 272 seconds]
10:56:42<cruller>https://archiveready.com/ 's concept shares similarities with that of https://wiki.archiveteam.org/index.php/Obstacles
10:57:21<cruller>I guess this tool focuses on obstacles not intentionally created by the site owner and is similar in functionality to an SEO checker.
11:15:45tertu (tertu) joins
11:18:16Webuser356513 joins
11:18:42tertu2 quits [Ping timeout: 256 seconds]
11:19:10Webuser356513 quits [Client Quit]
11:51:38<h2ibot>Manu edited Distributed recursive crawls (+70, Candidates: Add www.artinliverpool.com): https://wiki.archiveteam.org/?diff=58462&oldid=58218
11:57:41sudesu quits [Quit: Ooops, wrong browser tab.]
12:14:55<cruller>Probably a stupid question: Why are custom downloaders like "foo-grab" usually executed via DPoS?
12:17:56<cruller>There are many tasks that don't require DPoS manpower but require custom downloaders.
12:36:31benjins3 joins
12:46:42HP_Archivist quits [Read error: Connection reset by peer]
13:31:32<masterx244|m>those that require custom stuff but not a DPoS are usually run by core AT members directly. reference any [J]AA qwarc shenanigans
13:52:43Shyy46 quits [Quit: The Lounge - https://thelounge.chat]
13:52:56Shyy46 joins
13:53:21Shyy46 quits [Client Quit]
13:53:39Shyy46 joins
13:58:47Shyy46 quits [Client Quit]
13:59:01Shyy46 joins
14:34:43Wohlstand quits [Quit: Wohlstand]
14:34:49sec^nd quits [Ping timeout: 276 seconds]
14:40:33sec^nd (second) joins
14:48:28sec^nd quits [Ping timeout: 276 seconds]
14:58:30<cruller>masterx244: Yeah, I know a few such examples too. Those make sense.
15:01:58TheEnbyperor quits [Ping timeout: 256 seconds]
15:02:02<cruller>To be honest, I myself don't have a clear idea about how such tasks should be handled...
15:02:47TheEnbyperor_ quits [Ping timeout: 272 seconds]
15:03:01sec^nd (second) joins
15:03:35<justauser>https://archiveready.com/ and the owner's other sites look pretty abandoned...
15:03:57<justauser>With a bit of irony, I should probably feed them to AB.
15:09:25Cuphead2527480 (Cuphead2527480) joins
15:13:36<cruller>koichi: When discussing facts rather than ideals, I guess many people archive them independently of AT.
15:19:11<cruller>justauser: Fun fact: The owner, Vangelis Banos, is the author of SPN2 API Docs.
15:23:19TheEnbyperor joins
15:23:30TheEnbyperor_ (TheEnbyperor) joins
15:24:10gosc joins
15:38:49Webuser380713 joins
15:39:10Webuser380713 quits [Client Quit]
15:39:34nathang2184 quits [Quit: The Lounge - https://thelounge.chat]
15:54:58Webuser748015 joins
15:56:25Webuser748015 quits [Client Quit]
15:58:40HackMii quits [Ping timeout: 276 seconds]
15:59:58SootBector quits [Ping timeout: 276 seconds]
16:00:14HackMii (hacktheplanet) joins
16:00:43SootBector (SootBector) joins
16:07:16nathang2184 joins
16:09:16<@arkiver>pabs: that would be nice!
16:09:22<@arkiver>if it can be easily done
16:12:06ATinySpaceMarine quits [Quit: https://quassel-irc.org - Chat comfortably. Anywhere.]
16:12:40ATinySpaceMarine joins
16:14:12ATinySpaceMarine quits [Client Quit]
16:15:57ATinySpaceMarine joins
16:50:33<h2ibot>OrIdow6 edited Microsoft Download Center (+193): https://wiki.archiveteam.org/?diff=58463&oldid=58403
16:59:05Wohlstand (Wohlstand) joins
17:01:01Radzig quits [Quit: ZNC 1.10.1 - https://znc.in]
17:05:14Radzig joins
17:11:31Wohlstand quits [Client Quit]
17:12:01Wohlstand (Wohlstand) joins
17:19:11Cuphead2527480 quits [Client Quit]
17:21:56devkev0 quits [Ping timeout: 256 seconds]
17:33:54<nulldata>Speaking of which - looks like the Windows Update grab files are locked :/
17:44:33hackbug quits [Remote host closed the connection]
17:46:49<pokechu22>cruller: for smaller custom jobs, I usually write a script that lists all of the URLs needed, and then run that list in archivebot. (Generally I'll only locally download pagination, while the list contains both pagination and file URLs that wouldn't be needed to discover more stuff)
18:06:20<nicolas17>same
18:06:45<nicolas17>and then I can get away with the scripts still needing a bunch of manual steps
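A minimal version of that workflow, with an entirely hypothetical site layout: crawl the pagination locally, collect both the page URLs and the file links, and write one URL per line for an ArchiveBot list job:

    # Sketch only; files.example.com, the ?page= parameter and the
    # "a.download" selector are stand-ins for whatever the real site uses.
    import requests
    from bs4 import BeautifulSoup

    urls = []
    page = 1
    while True:
        page_url = f"https://files.example.com/browse?page={page}"
        r = requests.get(page_url, timeout=30)
        if r.status_code != 200:
            break
        soup = BeautifulSoup(r.text, "html.parser")
        file_links = [a["href"] for a in soup.select("a.download")]
        if not file_links:
            break
        urls.append(page_url)       # pagination we fetched locally anyway
        urls.extend(file_links)     # file URLs we never need to fetch ourselves
        page += 1

    with open("urls.txt", "w") as f:
        f.write("\n".join(urls) + "\n")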
18:07:51Sanqui_ is now known as Sanqui
18:07:51Sanqui quits [Changing host]
18:07:51Sanqui (Sanqui) joins
18:07:51@ChanServ sets mode: +o Sanqui
18:09:39cyanbox quits [Read error: Connection reset by peer]
18:12:02unlobito quits [Remote host closed the connection]
18:13:01unlobito (unlobito) joins
18:33:41sg72 quits [Remote host closed the connection]
18:37:12sg72 joins
18:38:40<cruller>I generally prefer to send all HTTP requests from a single computer, but is there any benefit to combining a local scraper (or ArchiveBot) with a remote/SaaS-like scraper such as Scraperr or RSSHub?
18:43:44<cruller>For example, if AB GETs https://selfhostedscraper.example.com?url=target.example.com&extractor={extractor_id}, it will receive a URL list and automatically grab those URLs.
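The scraper side of that idea could be as small as this sketch, assuming a Flask app and an invented extractor registry; it just returns a plain-text URL list for whatever fetches it:

    # Hypothetical endpoint: GET /?url=<target>&extractor=<id> returns one
    # discovered URL per line.
    from flask import Flask, Response, request

    app = Flask(__name__)

    EXTRACTORS = {
        # invented example extractor: derive a couple of URLs from the target
        "example": lambda target: [target.rstrip("/") + "/feed",
                                   target.rstrip("/") + "/sitemap.xml"],
    }

    @app.route("/")
    def listing():
        target = request.args["url"]
        extractor = EXTRACTORS[request.args["extractor"]]
        return Response("\n".join(extractor(target)) + "\n", mimetype="text/plain")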
18:44:29<justauser>If there is a public instance, it might make life simpler for folks.
18:45:08<justauser>Otherwise, the only reason I can see is to evade ratelimits.
18:59:38<cruller>I regret that HTTP communications between the remote scraper and the target site aren't recorded.
19:07:59<justauser>If you need it, it's not that hard.
19:08:14Webuser938602 joins
19:08:16<justauser>Use warcprox or, when ready, warc-for-humans.
19:09:39<cruller>The local scraper normally fetches it again, though.
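If the remote scraper's own outbound requests are pointed at a warcprox instance, they do get recorded, since warcprox is a MITM proxy that writes its traffic to WARCs; the port and target URL here are placeholders:

    # Assumes warcprox is already running locally (listening on port 8000)
    # and writing WARCs to its configured directory.
    import requests

    proxies = {"http": "http://localhost:8000", "https": "http://localhost:8000"}
    r = requests.get("https://target.example.com/page", proxies=proxies,
                     verify=False)  # or point verify= at warcprox's CA cert to keep TLS checks
    print(r.status_code, len(r.content))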
19:16:57whimsysciences (whimsysciences) joins
19:22:36NF885 (NF885) joins
19:23:03<cruller>warcprox++
19:23:03<eggdrop>[karma] 'warcprox' now has 1 karma!
19:32:01NF885 quits [Client Quit]
19:39:16gosc quits [Quit: Leaving]
19:46:36Webuser938602 quits [Client Quit]
20:14:09<h2ibot>Klea created Gitlab (+20, Create lowercase version redirect): https://wiki.archiveteam.org/?title=Gitlab
20:16:09<h2ibot>Klea edited Discourse/uncategorized (+49, Add https://discourse.nixos.org/): https://wiki.archiveteam.org/?diff=58465&oldid=58316
20:17:12sg72 quits [Remote host closed the connection]
20:18:21sg72 joins
20:32:49Hackerpcs quits [Quit: Hackerpcs]
20:44:39HP_Archivist (HP_Archivist) joins
20:47:14<h2ibot>Nintendofan885 edited Wikibot (+12, link back to wiki software pages): https://wiki.archiveteam.org/?diff=58466&oldid=56529
20:48:30Wohlstand quits [Quit: Wohlstand]
20:50:14<h2ibot>Nintendofan885 edited MediaWiki (+160, archiving): https://wiki.archiveteam.org/?diff=58467&oldid=42151
20:51:14<h2ibot>Nintendofan885 edited MediaWiki (-8, duplicate word): https://wiki.archiveteam.org/?diff=58468&oldid=58467
20:52:15<h2ibot>Nintendofan885 edited MediaWiki (+21, see [[WikiTeam]] to avoid duplication): https://wiki.archiveteam.org/?diff=58469&oldid=58468
21:07:03<pabs>arkiver: is the h2ibot config in a repo somewhere? /cc JAA
21:07:17<h2ibot>Nintendofan885 edited Distributed Preservation of Service (+46, link to the talk): https://wiki.archiveteam.org/?diff=58470&oldid=52450
21:10:32<klea>pabs: i believe not since it contains secrets
21:10:48<klea>its source code is at https://gitea.arpa.li/ArchiveTeam/http2irc tho
21:11:22<@arkiver>pabs: h2ibot is run by JAA
21:11:41<@arkiver>i just use h2ibot (the GET and POST endpoint to communicate over IRC with qubert)
21:12:01pabs assumes secrets would not be recorded in the git repo but in external configs :)
21:12:50<klea>well, the config is a toml file
21:12:59<pabs>hmm I'm assuming the wiki diffs don't come from the h2ibot code though? but some wiki extension?
21:13:16<klea>oh if you mean that, i think it's from JAA's jaabot thingy
21:13:29<klea>https://github.com/JustAnotherArchivist/atwikibot
21:13:52<pabs>klea++
21:13:52<eggdrop>[karma] 'klea' now has 2 karma!
21:13:53<klea>specifically recentchanges.py
21:18:23<pabs>thanks. looks like the IRC part isn't in that repo though
21:20:06<pabs>looks like the mediawiki URL it uses has no page content
21:20:18<klea>i suppose it's a pipe like ./recentchanges.py | while IFS= read -r line; do printf '%s' "$line" | curl "https://username:password@endpoint/channel" --data-binary @-; done
21:20:38<klea>yeah, it doesn't need the page content to create the diffs if it has the change count
21:24:37<pabs>hmmm the mediawiki rc API doesn't offer the text content, only the sha1 https://www.mediawiki.org/wiki/API:Recentchanges
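For reference, roughly what that query looks like against the AT wiki (the api.php path is assumed); it returns title, revision ids, size delta, comment and sha1, but no revision text:

    # Query MediaWiki's recent changes API; rcprop can include sha1,
    # but there is no option to return the page text itself.
    import requests

    params = {
        "action": "query",
        "list": "recentchanges",
        "rcprop": "title|ids|sizes|user|timestamp|comment|sha1",
        "rclimit": 50,
        "format": "json",
    }
    r = requests.get("https://wiki.archiveteam.org/api.php", params=params, timeout=30)
    for rc in r.json()["query"]["recentchanges"]:
        print(rc["timestamp"], rc["title"], rc.get("sha1", ""))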
21:27:31<pabs>so might require a mediawiki plugin
21:28:48cyanbox joins
21:32:47<pabs>PHP--
21:32:48<eggdrop>[karma] 'PHP' now has -2 karma!
21:43:02<pabs>looks like there is no extension point for this, so might need an MW patch... https://www.mediawiki.org/wiki/Special:MyLanguage/Manual:Extension.json/Schema
21:56:43<@JAA>pabs: That repo is not public.
21:57:14<@JAA>Yes, h2ibot is just the messenger; the wiki edits use the messenger but don't originate from that code.
21:58:49HP_Archivist quits [Client Quit]
22:36:05beastbg8 (beastbg8) joins
22:38:47beastbg8__ quits [Ping timeout: 272 seconds]
22:40:37etnguyen03 (etnguyen03) joins