00:08:47 | | kuroger quits [Quit: ZNC 1.9.1 - https://znc.in] |
00:11:04 | | nyakase quits [Remote host closed the connection] |
00:11:13 | | kuroger (kuroger) joins |
00:12:43 | | nyakase (nyakase) joins |
00:16:55 | | wickedplayer494 is now authenticated as wickedplayer494 |
00:23:46 | | kuroger quits [Client Quit] |
00:27:13 | | kuroger (kuroger) joins |
00:34:01 | | egallager joins |
00:34:01 | | etnguyen03 quits [Client Quit] |
00:47:45 | | FiTheArchiver quits [Read error: Connection reset by peer] |
00:58:20 | | lennier2 quits [Read error: Connection reset by peer] |
00:58:36 | | lennier2 joins |
01:09:33 | | Bleo18260072271962345 quits [Quit: Ping timeout (120 seconds)] |
01:09:47 | | Bleo18260072271962345 joins |
01:13:48 | | adamus1red quits [Quit: SigTerm] |
01:15:18 | | etnguyen03 (etnguyen03) joins |
01:17:33 | | adamus1red (adamus1red) joins |
01:20:33 | | BornOn420 quits [Remote host closed the connection] |
01:21:09 | | BornOn420 (BornOn420) joins |
01:30:50 | | Ryz quits [Read error: Connection reset by peer] |
01:31:40 | | Ryz (Ryz) joins |
01:40:51 | | @Fusl quits [Ping timeout: 260 seconds] |
01:53:31 | | Fusl (Fusl) joins |
01:53:31 | | @ChanServ sets mode: +o Fusl |
02:12:57 | | notarobot1 joins |
02:17:08 | | gust quits [Read error: Connection reset by peer] |
02:19:55 | | etnguyen03 quits [Client Quit] |
02:33:06 | | etnguyen03 (etnguyen03) joins |
02:38:34 | <Gareth48> | pokechu22 any way around the issue? Can archivebot be reconfigured to go depth first or can we possibly tweak its methodology to grab images? |
02:38:51 | <Gareth48> | Since I agree getting the images off this site is fairly important |
02:39:12 | <pokechu22> | There isn't any good way of doing that with archivebot. Not sure if JAA can do it with pull or something |
02:40:09 | <Gareth48> | I guess it's good I'm attempting my own scrape, though I'd much rather add what I have to a Wayback Machine compatible one than have a huge dump on my computer in an unusable format |
02:40:59 | <Gareth48> | Thanks TheTechRobo I'll take a look into Matrix since I'm going to need the chat history visible |
02:48:02 | | etnguyen03 quits [Remote host closed the connection] |
02:51:34 | | gareth48|m joins |
02:51:46 | <gareth48|m> | Okay this should be from my Matrix clien |
02:51:57 | <gareth48|m> | s/clien/client/ |
02:52:59 | <gareth48|m> | Okay note to self: not all features are going to be supported. |
03:00:03 | <gareth48|m> | In theory if you all ping me and I'm offline I'll see it now, going to test that real quick |
03:00:48 | <Gareth48> | gareth48|m Hello from the real IRC client |
03:01:24 | <gareth48|m> | That works, awesome! Okay now anyone can ping me and I'll see it overnight. Thanks for the recommendation TheTechRobo |
03:16:10 | | Webuser457589 joins |
03:16:25 | | Webuser457589 quits [Client Quit] |
03:44:00 | | kuroger quits [Quit: ZNC 1.9.1 - https://znc.in] |
03:49:57 | | kuroger (kuroger) joins |
04:10:46 | | Island quits [Read error: Connection reset by peer] |
04:13:50 | | Naruyoko5 joins |
04:17:46 | | Naruyoko quits [Ping timeout: 260 seconds] |
05:27:34 | <@arkiver> | projects are suffering currently due to a target not functioning correctly |
05:27:55 | | Naruyoko joins |
05:31:16 | | Naruyoko5 quits [Ping timeout: 260 seconds] |
05:32:10 | | Naruyoko5 joins |
05:34:56 | | Naruyoko quits [Ping timeout: 250 seconds] |
05:40:31 | | ThetaDev quits [Quit: https://quassel-irc.org - Chat comfortably. Anywhere.] |
05:40:55 | | ThetaDev joins |
05:47:56 | | BlueMaxima quits [Ping timeout: 250 seconds] |
06:11:47 | | egallager quits [Quit: This computer has gone to sleep] |
06:37:06 | <datechnoman> | Thanks for the update |
06:37:16 | <datechnoman> | Was thinking it was IA. Not them for once :P |
06:46:46 | | Naruyoko joins |
06:49:26 | | Naruyoko5 quits [Ping timeout: 260 seconds] |
07:03:59 | | loug83181422 joins |
07:16:55 | | loug83181422 quits [Read error: Connection reset by peer] |
07:17:20 | | loug83181422 joins |
07:17:28 | | Gareth48 quits [Quit: Ooops, wrong browser tab.] |
07:29:01 | | Naruyoko5 joins |
07:33:11 | | Naruyoko quits [Ping timeout: 260 seconds] |
07:48:21 | | Shyy4 quits [Ping timeout: 260 seconds] |
08:16:44 | | Ketchup901 quits [Remote host closed the connection] |
08:17:18 | | Ketchup901 (Ketchup901) joins |
08:23:43 | <@JAA> | pokechu22, gareth48|m: Hmm, unless there's heavy rate limiting or similar, we should be able to do the <10k item pages quickly enough to not have images expire, maybe? Or do it in a couple chunks. That won't help with the shop and category listings etc., but at least it'd cover the critical part. |
09:20:55 | | hexa- quits [Ping timeout: 272 seconds] |
09:36:37 | | hexa- (hexa-) joins |
09:48:47 | | hexa- quits [Ping timeout: 272 seconds] |
10:11:39 | | FiTheArchiver joins |
10:15:10 | | hexa (hexa-) joins |
10:47:06 | | dendory3 joins |
10:49:06 | | dendory quits [Ping timeout: 250 seconds] |
10:49:06 | | dendory3 is now known as dendory |
10:58:53 | | Webuser959856 joins |
10:59:50 | | Webuser959856 quits [Client Quit] |
11:00:04 | | Bleo18260072271962345 quits [Quit: The Lounge - https://thelounge.chat] |
11:02:54 | | Bleo18260072271962345 joins |
11:03:31 | | BornOn420 quits [Ping timeout: 276 seconds] |
11:13:46 | | BornOn420 (BornOn420) joins |
11:16:36 | | @imer quits [Ping timeout: 260 seconds] |
11:31:44 | | imer (imer) joins |
11:31:44 | | @ChanServ sets mode: +o imer |
11:34:20 | | @imer quits [Killed (NickServ (GHOST command used by imer6))] |
11:34:32 | | imer (imer) joins |
11:34:32 | | @ChanServ sets mode: +o imer |
11:58:37 | | egallager joins |
13:16:19 | | snel joins |
13:54:02 | | ikki quits [Quit: Going offline, see ya! (www.adiirc.com)] |
13:55:37 | <h2ibot> | Liuxinyu970226 edited Deathwatch (+696, /* 2019 */): https://wiki.archiveteam.org/?diff=55001&oldid=54975 |
13:58:28 | <@arkiver> | bzc6p is handling indafoto |
14:00:36 | <gareth48|m> | JAA: There isn't a huge rate limit that I know of, however the tokens expire 10 minutes from their point of generation. If ArchiveBot is breadth-first, by the time it gets to 90% of the images it'll already be too late to download them. It would need to prioritize the images on a per-page basis as it goes, versus creating a giant cache of pages to dump and iterating on that (which is how I've been told it currently works). If you look |
14:00:36 | <gareth48|m> | at the https://store.vket.com/en job you'll notice ~50% or so of all requests are failing and it's basically all images. If ArchiveBot can prioritize per page like that in any way, that'll be the strategy. |
14:00:36 | <gareth48|m> | Another issue that might be relevant is that most of the carousel images on the shop page are added to the DOM after the site loads (unpacked from a weird data block). This was a problem that tripped my own webscraper up until I modified it to extract the tokens from the block and handle them before the images were properly initialized. Not sure if the way you all are downloading sites will have the same weakness but I figured I'd |
14:00:36 | <gareth48|m> | mention it. Let me know your thoughts! |
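
The ordering concern raised above can be illustrated with a minimal sketch. It only compares two request orders for a set of pages and their images; it is not a description of how ArchiveBot schedules its queue internally. The function names and the `(page_url, [image_urls])` input shape are assumptions made for illustration.

```python
def breadth_first_order(pages):
    """Request every page first, then every image.

    With short-lived image tokens, the images of the first pages are
    requested long after their tokens were generated.
    """
    order = [page for page, _images in pages]
    for _page, images in pages:
        order.extend(images)
    return order


def per_page_order(pages):
    """Request each page immediately followed by its own images,
    so each token is used shortly after it was issued."""
    order = []
    for page, images in pages:
        order.append(page)
        order.extend(images)
    return order


# Hypothetical example data: two pages with their tokenised images.
pages = [
    ("page1", ["img1a", "img1b"]),
    ("page2", ["img2a"]),
]
```

In the breadth-first order the gap between fetching `page1` and `img1a` grows with the number of pages, which is where a 10-minute token lifetime becomes a problem.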
14:06:38 | <@JAA> | gareth48|m: Yeah, my point is that grabbing a few thousand pages should be feasible within 10 minutes, so the images wouldn't be expired by the time it finishes with those. Prioritisation isn't currently possible with AB. |
14:07:30 | <@JAA> | This would be a separate !ao < job specifically for the item pages and their page requisites only. |
14:07:57 | <@JAA> | Well, or jobs, depending on whether we can safely do the 10k quickly enough or need to split it up. |
14:08:50 | <@JAA> | I can take a look at this several hours from now (unless pokechu22 does it earlier). |
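
The "one job or several" question above comes down to simple arithmetic: how many item pages can be fetched before the 10-minute token lifetime runs out. A rough sketch follows, assuming a measured fetch rate in pages per second and a safety factor to leave room for page requisites. The function name, the `safety` parameter, and the example rate are all hypothetical and not part of any ArchiveBot interface.

```python
import math


def chunk_urls(urls, pages_per_second, token_lifetime_s=600, safety=0.5):
    """Split urls into chunks that should each finish within the token lifetime.

    pages_per_second: assumed sustained fetch rate (must be measured).
    token_lifetime_s: 600 s matches the 10-minute expiry mentioned above.
    safety: fraction of the lifetime budgeted for pages, leaving the rest
            for images and other requisites (an arbitrary assumption).
    """
    if pages_per_second <= 0:
        raise ValueError("pages_per_second must be positive")
    max_per_chunk = max(1, math.floor(pages_per_second * token_lifetime_s * safety))
    return [urls[i:i + max_per_chunk] for i in range(0, len(urls), max_per_chunk)]


# Hypothetical example: 10,000 item pages at 5 pages/s.
chunks = chunk_urls([f"item-{i}" for i in range(10_000)], pages_per_second=5)
```

Each resulting chunk could then be submitted as its own job, if a single job turns out to be too slow.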
14:10:54 | <gareth48|m> | JAA: Okay that makes sense, sounds like the way to go, thanks for breaking it down for me. I agree, if you focused just on archiving a few sets of pages, i.e. the gallery pages for default tags and all 8000 product pages, that would get probably 90% of the way there. Appreciate y'all looking into this. I'll keep monitoring the jobs as they go and let you know what is and isn't working. Feel free to ping me when the jobs start, I have |
14:10:54 | <gareth48|m> | element watching this channel so I'll see it. |
14:11:08 | <gareth48|m> | * way there in 10 minutes. Appreciate |
14:16:32 | | VoynichCR (VoynichCR) joins |
14:17:02 | | SootBector quits [Remote host closed the connection] |
14:17:25 | | SootBector (SootBector) joins |
14:21:29 | <@arkiver> | i don't think arzon.jp is going to finish in time |
14:21:40 | <@arkiver> | JAA: it's sequential IDs mostly, is it something for qwarc perhaps? |
14:21:44 | <@arkiver> | else i'll set a project up for it |
14:26:42 | <h2ibot> | VoynichCr created WikiBot (+21, Redirected page to [[Wikibot]]): https://wiki.archiveteam.org/?title=WikiBot |
15:04:29 | | ducky quits [Ping timeout: 260 seconds] |
15:05:32 | | ducky (ducky) joins |
15:14:19 | | DopefishJustin quits [Remote host closed the connection] |
15:23:27 | <nulldata> | arkiver - did you see my message here a few days ago regarding indafoto.hu? We may need a project for it too as the AB job isn't going to finish in time at its current rate. |
15:24:29 | | snel quits [Client Quit] |
15:24:48 | <nulldata> | Oh never mind - according to the wiki bzc6p is grabbing them |
15:25:14 | | DopefishJustin joins |
15:25:14 | | DopefishJustin is now authenticated as DopefishJustin |
15:25:55 | <@arkiver> | nulldata: yeah bzc6p is working on archiving it |
15:26:08 | <@arkiver> | i'm in contact with them, will check in tomorrow with them and see if we do need a project for it |
15:26:27 | <nulldata> | Thanks! :) |
15:26:51 | | VoynichCR quits [Client Quit] |
15:27:31 | <@arkiver> | !remindme 10h indafoto bzc6p |
15:27:32 | <eggdrop> | [remind] ok, i'll remind you at 2025-03-21T01:27:32Z |
15:37:51 | <nyuuzyou> | nuum.ru (ex wasd.tv) shuts down on June 1 - https://www.rbc.ru/technology_and_media/20/03/2025/67daa7c39a79472d3feee484 |
16:45:23 | | sparky14925 (sparky1492) joins |
16:47:26 | | kuroger quits [Quit: ZNC 1.9.1 - https://znc.in] |
16:48:46 | | sparky1492 quits [Ping timeout: 250 seconds] |
16:48:47 | | sparky14925 is now known as sparky1492 |
16:55:55 | | kuroger (kuroger) joins |
16:56:14 | | kansei quits [Quit: ZNC 1.9.1 - https://znc.in] |
17:08:41 | | kansei (kansei) joins |
17:13:35 | | sparky14922 (sparky1492) joins |
17:13:43 | | NeonGlitch (NeonGlitch) joins |
17:17:41 | | sparky1492 quits [Ping timeout: 260 seconds] |
17:17:41 | | sparky14922 is now known as sparky1492 |
18:06:53 | | ATWF_notcarl joins |
18:24:43 | | VoynichCR (VoynichCR) joins |
18:49:00 | | Island joins |
18:49:57 | | FiTheArchiver quits [Quit: Leaving] |
19:18:01 | | nyakase quits [Remote host closed the connection] |
19:20:28 | | nyakase (nyakase) joins |
19:25:12 | | BennyOtt quits [Ping timeout: 250 seconds] |
19:27:08 | | BennyOtt (BennyOtt) joins |
19:42:00 | | Jake quits [Quit: Leaving for a bit!] |
19:45:14 | | Webuser161207 quits [Quit: Ooops, wrong browser tab.] |
19:45:56 | | sec^nd quits [Remote host closed the connection] |
19:46:10 | | sec^nd (second) joins |
19:50:46 | | Webuser603791 joins |
19:51:20 | <pokechu22> | VoynichCR: IIRC pabs ran a bunch of tuxfamily stuff in #archivebot |
19:52:03 | <VoynichCR> | nice |
19:57:26 | | SkilledAlpaca418962 quits [Quit: SkilledAlpaca418962] |
20:06:30 | | SkilledAlpaca418962 joins |
20:23:03 | | nyakase quits [Remote host closed the connection] |
20:27:09 | | nyakase (nyakase) joins |
20:27:34 | | nyakase quits [Remote host closed the connection] |
20:30:28 | | nyakase (nyakase) joins |
20:34:54 | <h2ibot> | Tech234a edited Deathwatch (+352, /* 2025 */ Add Chrome Web Store Manifest V2): https://wiki.archiveteam.org/?diff=55004&oldid=55001 |
20:36:54 | <h2ibot> | Tech234a edited Chrome Web Store (+114, Minor updates to timeline): https://wiki.archiveteam.org/?diff=55005&oldid=52369 |
20:36:55 | <steering> | VoynichCR: https://lwn.net/Articles/1004988/ |
20:37:37 | <steering> | looks like a bunch was done in jan-feb around it |
20:39:12 | <tech234a> | re Chrome Web Store: as a note https://chrome-stats.com/manifest-v3-migration indicates that about 1/3 of extensions on the Chrome Web Store still haven't been migrated to Manifest V3 |
20:39:13 | <steering> | (AB and down-the-tube mostly) |
20:42:20 | | ThreeHM quits [Ping timeout: 250 seconds] |
20:44:12 | | ThreeHM (ThreeHeadedMonkey) joins |
20:52:50 | | SkilledAlpaca418962 quits [Client Quit] |
21:01:36 | | SkilledAlpaca418962 joins |
21:05:28 | | VoynichCR quits [Client Quit] |
21:06:30 | <@JAA> | arkiver: I can take a look at Arzon tomorrow. You say 'mostly' sequential; what does that mean? Is there anything important to grab beyond /item_${id}.html + the images? |
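
For sequential IDs of the `/item_${id}.html` form, a URL list can be generated directly. This is a minimal sketch only: the base URL (including the `www.` prefix) and the ID range are assumptions, since the log gives neither, and gaps in the "mostly" sequential IDs are not handled.

```python
def item_urls(first_id, last_id, base="https://www.arzon.jp"):
    """Build item page URLs for an inclusive ID range.

    base is an assumed host; the actual ID bounds are unknown here.
    """
    return [f"{base}/item_{i}.html" for i in range(first_id, last_id + 1)]


# Hypothetical range, for illustration only.
urls = item_urls(1, 3)
```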
21:32:42 | | gust joins |
21:42:02 | | gust quits [Remote host closed the connection] |
21:42:21 | | gust joins |
21:50:42 | | chrismeller quits [Quit: chrismeller] |
21:51:14 | | chrismeller (chrismeller) joins |
22:06:26 | | tek_dmn quits [Ping timeout: 260 seconds] |
22:12:58 | | NeonGlitch quits [Quit: My Mac Mini has gone to sleep. ZZZzzz…] |
22:16:31 | | tek_dmn (tek_dmn) joins |
22:22:31 | | etnguyen03 (etnguyen03) joins |
22:23:35 | <joepie91|m> | errrr. wget-at is probably not supposed to be segfaulting...? |
22:23:39 | <joepie91|m> | (in the context of the warrior docker container) |
22:23:51 | <joepie91|m> | or dumping core at least |
22:24:22 | <@JAA> | That's not very typical, I'd like to make that point. |
22:24:53 | <@JAA> | There have been reports of it before, but it's not supposed to happen, yeah. |
22:26:57 | <joepie91|m> | the stacktrace is incredibly unhelpful but it is constantly coredumping on at least one VPS of mine |
22:27:03 | <joepie91|m> | #0 0x00007f8e8f6b7ebc n/a (/usr/lib/x86_64-linux-gnu/libc.so.6 + 0x8aebc) |
22:27:05 | <joepie91|m> | that's the only line in the stacktrace |
22:35:16 | <@JAA> | Hmm |
22:36:02 | <@JAA> | Well, if it's reproducible, that's a start at least. |
22:36:39 | | ATinySpaceMarine quits [Quit: https://quassel-irc.org - Chat comfortably. Anywhere.] |
22:37:28 | | ATinySpaceMarine joins |
22:40:44 | <@JAA> | Could be worth a debugging session (without communication with the real tracker etc., of course). |
22:40:47 | <@JAA> | Cc arkiver |
22:58:26 | | lennier2_ joins |
23:01:00 | | lennier2 quits [Ping timeout: 250 seconds] |
23:13:30 | | etnguyen03 quits [Client Quit] |
23:48:16 | | etnguyen03 (etnguyen03) joins |