00:04:36 | <thuban> | ok, i'm currently scraping all the page images, and i'll feed them back into archivebot once i've got the list |
00:04:45 | <thuban> | (it'll take a little while, i'm being gentle) |
00:07:32 | <h2ibot> | JustAnotherArchivist edited Deathwatch (+159, /* 2023 */ Add OneHallyu): https://wiki.archiveteam.org/?diff=51153&oldid=51152 |
00:08:12 | <vokunal|m> | It looks to be about 1.1M posts |
00:08:21 | <vokunal|m> | nvm |
00:09:18 | <thuban> | Peroniko: did you want any advice about converting to pdf and/or uploading to ia? |
00:09:20 | <thuban> | the former is easy, but the latter will probably require some manual work due to the lack or inconsistent formatting of metadata |
00:10:14 | <vokunal|m> | ~11.7M posts on OneHallyu |
00:10:33 | <@JAA> | Yeah, 11.8 million is what the homepage says. |
00:11:38 | <vokunal|m> | someday i'll learn. First time i helped with a forum, I didn't see any number; the second time, I counted it manually. Fool me twice, I'd better learn the third time |
00:11:45 | <@JAA> | :-) |
00:11:46 | <Peroniko> | thuban: I have uploaded some things to IA, but if there is a guide for better metadata it would be helpful. |
00:11:55 | <Peroniko> | I've uploaded this for example: https://archive.org/details/arhitektura-graficki-dio |
00:13:11 | <thuban> | Peroniko: the general metadata documentation is here: https://archive.org/developers/metadata-schema/index.html |
00:19:01 | | Peroniko quits [Client Quit] |
00:19:25 | | Peroniko (Peroniko) joins |
00:19:26 | | Peroniko quits [Max SendQ exceeded] |
00:19:52 | | Peroniko (Peroniko) joins |
00:24:25 | <@JAA> | OneHallyu is running through AB now. We'll see how that goes. |
00:24:32 | | Peroniko quits [Client Quit] |
00:24:40 | <@JAA> | Buttflare is involved. |
00:25:05 | | Peroniko (Peroniko) joins |
00:35:17 | | Peroniko quits [Ping timeout: 272 seconds] |
00:35:52 | | Peroniko (Peroniko) joins |
00:39:57 | <kpcyrd> | from -ot: how do I archive videos hosted on sharepoint? it's going to be deleted in a few days: https://kth-my.sharepoint.com/:v:/g/personal/longz_ug_kth_se/EesSEHqiHHtQabKFAAAx5PsB-5r8MVtnp5NECOtKN-YGsA?e=8s0kaj |
00:42:55 | <nicolas17> | it's probably some temporary signed URL that will change on every load and can't be archived in a way that lets the original link work in WBM |
00:44:58 | <nicolas17> | oof that seems to be DASH even |
00:50:33 | <thuban> | kpcyrd: i don't think there's anything reliably plug-and-play for sharepoint |
00:50:41 | <nicolas17> | transcoded on the fly from an original .mp4 that seems impossible to access |
00:51:11 | <thuban> | you could try https://github.com/kylon/Sharedown or some of the workarounds suggested in sharepoint-related issues at https://github.com/snobu/destreamer |
00:51:26 | <thuban> | (or punt and screen-record it) |
00:52:19 | <nulldata> | There's an open PR for yt-dlp to add SharePoint. https://github.com/yt-dlp/yt-dlp/pull/6531 |
00:52:52 | <nicolas17> | I *could* grab the DASH but it sucks that I can't access the original .mp4 :/ |
00:54:11 | <kpcyrd> | I'm trying to get it into the wayback machine specifically: |
00:54:13 | <kpcyrd> | https://web.archive.org/web/20231116002236/https://kth-my.sharepoint.com/personal/longz_ug_kth_se/_layouts/15/stream.aspx?id=%2Fpersonal%2Flongz%5Fug%5Fkth%5Fse%2FDocuments%2Fbox%5Ffiles%2FKTH%20SR%20Meetup%2F2020%2D11%2D24%2013%2E06%2E13%20Localization%20of%20Unreproducible%20Builds%2F2020%2D11%2D24%2013%2E06%2E13%20Localization%20of%20Unreproducible%20Builds%20%2D%20Jifeng%20Xuan%2Emp4&ga=1 |
00:54:19 | <nicolas17> | that's not going to work |
00:54:25 | <kpcyrd> | rip |
00:54:26 | | etnguyen03 quits [Ping timeout: 265 seconds] |
00:54:32 | <nicolas17> | there are timestamped, signed URLs that change every time you load the page |
00:55:04 | <thuban> | ia item not adequate? |
00:55:41 | <nicolas17> | ffmpeg doesn't do parallel requests (in fact I'm not sure if it does proper HTTP keepalive) so this DASH remux is taking me forever |
01:04:37 | | etnguyen03 (etnguyen03) joins |
01:07:35 | | igloo22225 quits [Ping timeout: 272 seconds] |
01:09:30 | <nicolas17> | oh great I got some 503 Service Unavailable too |
01:12:49 | | lunik173 quits [Client Quit] |
01:18:45 | | Peroniko quits [Remote host closed the connection] |
01:19:49 | | andrew (andrew) joins |
01:22:22 | | Peroniko (Peroniko) joins |
01:28:25 | | Peroniko quits [Client Quit] |
01:49:41 | | qwertyasdfuiopghjkl quits [Remote host closed the connection] |
01:56:22 | | rohvani5 joins |
01:56:32 | | andrew5 (andrew) joins |
01:56:46 | | TheTechRobo9 (TheTechRobo) joins |
01:57:25 | | monoxane quits [Quit: estoy fuera] |
01:57:37 | | CraftByte quits [Client Quit] |
01:57:37 | | andrew quits [Client Quit] |
01:57:37 | | rohvani quits [Client Quit] |
01:57:37 | | TheTechRobo quits [Client Quit] |
01:57:37 | | andrew5 is now known as andrew |
01:57:38 | | rohvani5 is now known as rohvani |
01:57:38 | | TheTechRobo9 is now known as TheTechRobo |
01:57:38 | | h3ndr1k quits [Client Quit] |
01:57:52 | | h3ndr1k (h3ndr1k) joins |
01:58:04 | | monoxane (monoxane) joins |
01:58:09 | | monoxane1 (monoxane) joins |
01:58:09 | | monoxane4 (monoxane) joins |
01:58:14 | | monoxane quits [Remote host closed the connection] |
01:58:14 | | monoxane4 quits [Remote host closed the connection] |
01:58:14 | | monoxane1 is now known as monoxane |
02:12:55 | <nulldata> | GitLab is now requiring new users to verify using a phone number or credit card, or the account will be deleted. So far it only seems to apply to new accounts, but it's something to keep an eye on in case they expand it to existing accounts. https://lemmy.world/post/8297909 |
02:23:49 | <flashfire42|m> | https://www.androidpolice.com/ensuring-high-quality-apps-on-google-play/ |
02:26:41 | <flashfire42|m> | https://www.animenewsnetwork.com/news/2023-11-11/crunchyroll-ends-digital-manga-app-on-mobile-web-on-december-11/ |
02:28:41 | | wyatt8740 quits [Remote host closed the connection] |
02:28:42 | <@JAA> | Correct link for the latter: https://www.animenewsnetwork.com/news/2023-11-11/crunchyroll-ends-digital-manga-app-on-mobile-web-on-december-11/.204339 |
02:32:33 | | wyatt8740 joins |
02:40:06 | | wyatt8740 quits [Remote host closed the connection] |
02:45:12 | | wyatt8740 joins |
03:34:31 | | yano quits [Ping timeout: 272 seconds] |
03:48:35 | | yano (yano) joins |
04:10:49 | | Wohlstand quits [Client Quit] |
04:39:14 | | DogsRNice_ quits [Read error: Connection reset by peer] |
04:44:11 | | dumbgoy__ quits [Ping timeout: 272 seconds] |
04:48:05 | | BlueMaxima quits [Read error: Connection reset by peer] |
05:00:32 | <h2ibot> | JAABot edited CurrentWarriorProject (-4): https://wiki.archiveteam.org/?diff=51154&oldid=51143 |
05:32:57 | | etnguyen03 quits [Ping timeout: 272 seconds] |
05:37:09 | | etnguyen03 (etnguyen03) joins |
05:40:53 | | wickedplayer494 is now authenticated as wickedplayer494 |
05:58:48 | | etnguyen03 quits [Client Quit] |
06:09:03 | | icedice2 quits [Ping timeout: 272 seconds] |
06:13:26 | | Hackerpcs quits [Ping timeout: 265 seconds] |
06:14:19 | | Hackerpcs (Hackerpcs) joins |
06:18:10 | | atphoenix__ (atphoenix) joins |
06:21:05 | | atphoenix_ quits [Ping timeout: 272 seconds] |
06:21:15 | | superkuh_ joins |
06:22:12 | | atphoenix_ (atphoenix) joins |
06:23:42 | | nicolas17 quits [Client Quit] |
06:24:15 | | atphoenix__ quits [Ping timeout: 272 seconds] |
06:24:15 | | superkuh quits [Ping timeout: 272 seconds] |
06:38:46 | | Island quits [Read error: Connection reset by peer] |
06:40:27 | | Arcorann (Arcorann) joins |
06:45:48 | | atphoenix_ quits [Remote host closed the connection] |
06:45:48 | | superkuh_ quits [Remote host closed the connection] |
06:45:55 | | superkuh_ joins |
06:46:12 | | atphoenix_ (atphoenix) joins |
07:02:57 | | mindstrut joins |
07:03:01 | | mindstrut quits [Remote host closed the connection] |
07:53:10 | <that_lurker> | https://bird.makeup would be a nice alternative way to grab twitter(x) stuff. They create a mastodon account where all the tweets are posted. |
07:54:10 | <that_lurker> | also means there is no rate limiting |
07:54:26 | <that_lurker> | other than on their end of course |
09:25:34 | | superkuh__ joins |
09:25:42 | | wyatt8740 quits [Client Quit] |
09:25:42 | | andrew quits [Client Quit] |
09:25:42 | | Pedrosso quits [Client Quit] |
09:25:42 | | TheTechRobo quits [Client Quit] |
09:25:42 | | superkuh_ quits [Remote host closed the connection] |
09:25:48 | | Pedrosso joins |
09:25:54 | | wyatt8740 joins |
09:26:01 | | andrew (andrew) joins |
09:26:12 | | TheTechRobo (TheTechRobo) joins |
09:35:53 | | jacksonchen666 (jacksonchen666) joins |
09:37:59 | | TheTechRobo quits [Client Quit] |
09:38:25 | | TheTechRobo (TheTechRobo) joins |
09:40:11 | | TheTechRobo quits [Excess Flood] |
09:40:40 | | TheTechRobo (TheTechRobo) joins |
09:43:40 | | monoxane3 (monoxane) joins |
09:43:45 | | TheTechRobo quits [Excess Flood] |
09:43:45 | | andrew quits [Client Quit] |
09:43:45 | | monoxane quits [Read error: Connection reset by peer] |
09:43:45 | | Pedrosso quits [Read error: Connection reset by peer] |
09:43:45 | | monoxane3 is now known as monoxane |
09:43:50 | | Pedrosso joins |
09:43:53 | | andrew (andrew) joins |
09:44:13 | | TheTechRobo (TheTechRobo) joins |
09:44:21 | | Peroniko (Peroniko) joins |
09:51:28 | | Pedrosso quits [Client Quit] |
09:51:28 | | Peroniko quits [Remote host closed the connection] |
09:51:32 | | Pedrosso joins |
10:00:01 | | Bleo1 quits [Client Quit] |
10:01:24 | | Bleo18 joins |
10:14:27 | | jacksonchen666 quits [Client Quit] |
10:36:23 | | icedice (icedice) joins |
10:48:07 | | icedice quits [Client Quit] |
11:33:59 | | Megame (Megame) joins |
11:45:26 | | sec^nd quits [Ping timeout: 245 seconds] |
11:51:12 | | Megame1_ (Megame) joins |
11:51:44 | | TheTechRobo quits [Client Quit] |
11:51:44 | | Pedrosso quits [Client Quit] |
11:51:44 | | Megame quits [Remote host closed the connection] |
11:51:48 | | Pedrosso joins |
11:52:09 | | TheTechRobo (TheTechRobo) joins |
11:52:34 | | sec^nd (second) joins |
11:58:30 | | Megame1_ is now known as Megame |
12:15:05 | | BornOn420_ (BornOn420) joins |
12:18:55 | | BornOn420 quits [Ping timeout: 272 seconds] |
13:12:12 | | jodizzle quits [Remote host closed the connection] |
13:12:57 | | jodizzle (jodizzle) joins |
13:16:03 | | ScenarioPlanet (ScenarioPlanet) joins |
13:16:33 | | Arcorann quits [Ping timeout: 272 seconds] |
13:26:41 | | benjinsm quits [Ping timeout: 272 seconds] |
13:28:29 | <null> | https://blog.opensubtitles.com/opensubtitles/saying-goodbye-to-opensubtitles-org-api-embrace-the-20-black-friday-treat |
13:28:35 | | null is now known as rawktucc |
13:28:47 | | rawktucc quits [Client Quit] |
13:29:41 | | rktk (rktk) joins |
13:29:54 | <rktk> | stupid sexy nickserv |
13:29:56 | <rktk> | https://blog.opensubtitles.com/opensubtitles/saying-goodbye-to-opensubtitles-org-api-embrace-the-20-black-friday-treat |
13:30:03 | <rktk> | Does anyone know of a full dump or half dump of open subtitles |
13:30:06 | <rktk> | this is a real slap in the face |
13:34:18 | <h2ibot> | MasterX244 edited List of websites excluded from the Wayback Machine (+28): https://wiki.archiveteam.org/?diff=51155&oldid=51036 |
13:42:27 | | lunik173 joins |
13:54:32 | | lunik173 quits [Ping timeout: 265 seconds] |
13:55:27 | | lunik173 joins |
14:00:23 | <h2ibot> | JAABot edited List of websites excluded from the Wayback Machine (+0): https://wiki.archiveteam.org/?diff=51156&oldid=51155 |
14:01:35 | | etnguyen03 (etnguyen03) joins |
14:08:33 | | ScenarioPlanet quits [Client Quit] |
14:36:10 | | benjins joins |
14:57:12 | | lunik1731 joins |
14:57:12 | | benjins quits [Remote host closed the connection] |
14:57:12 | | TheTechRobo quits [Client Quit] |
14:57:12 | | lunik173 quits [Client Quit] |
14:57:12 | | Pedrosso quits [Client Quit] |
14:57:12 | | lunik1731 is now known as lunik173 |
14:57:16 | | Pedrosso joins |
14:57:17 | | benjins joins |
14:57:36 | | TheTechRobo (TheTechRobo) joins |
14:59:33 | | TheTechRobo quits [Excess Flood] |
14:59:33 | | benjins quits [Remote host closed the connection] |
14:59:36 | | benjins joins |
15:00:09 | | TheTechRobo (TheTechRobo) joins |
15:00:55 | | TheTechRobo quits [Excess Flood] |
15:01:37 | | TheTechRobo (TheTechRobo) joins |
15:12:24 | | automato83 quits [Read error: Connection reset by peer] |
15:19:33 | <@arkiver> | opensubtitles closing themselves? |
15:23:00 | | null joins |
15:23:21 | | wyatt8750 joins |
15:23:31 | | rktk quits [Remote host closed the connection] |
15:23:31 | | aismallard quits [Remote host closed the connection] |
15:23:31 | | h3ndr1k quits [Remote host closed the connection] |
15:23:31 | | Pedrosso quits [Client Quit] |
15:23:31 | | TheTechRobo quits [Client Quit] |
15:23:31 | | JensRex quits [Remote host closed the connection] |
15:23:31 | | wyatt8740 quits [Client Quit] |
15:23:35 | | Pedrosso joins |
15:23:57 | | TheTechRobo (TheTechRobo) joins |
15:24:26 | | aismallard joins |
15:24:29 | | h3ndr1k (h3ndr1k) joins |
15:24:44 | | JensRex (JensRex) joins |
15:31:23 | | null quits [Client Quit] |
15:40:41 | | CraftByte (DragonSec|CraftByte) joins |
15:41:02 | | xkey quits [Remote host closed the connection] |
15:41:12 | | xkey (xkey) joins |
15:44:11 | | xkey quits [Remote host closed the connection] |
15:44:18 | | xkey (xkey) joins |
15:45:03 | | xkey quits [Remote host closed the connection] |
15:45:10 | | xkey (xkey) joins |
16:08:34 | | Island joins |
16:18:07 | | Wohlstand (Wohlstand) joins |
16:19:11 | | lader joins |
16:19:33 | | lader quits [Remote host closed the connection] |
16:20:53 | | Naruyoko5 quits [Quit: Leaving] |
16:31:29 | <Hans5958> | Has anyone backed this up yet? https://pabio.com/blog/company/bankruptcy/ |
16:43:55 | <murb> | oh talking of which https://www.bleed-clothing.com/de/info # "Wir sind insolvent." |
16:45:33 | | etnguyen03 quits [Ping timeout: 272 seconds] |
16:52:12 | | dumbgoy__ joins |
16:55:53 | | dumbgoy joins |
16:58:41 | | dumbgoy__ quits [Ping timeout: 265 seconds] |
17:06:49 | | Dango360 (Dango360) joins |
17:09:55 | | Dango360 quits [Read error: Connection reset by peer] |
17:10:42 | | BearFortress quits [Client Quit] |
17:12:09 | <@JAA> | arkiver: 'Only' the API, as I understand it? |
17:12:27 | | icedice (icedice) joins |
17:13:06 | | Dango360 (Dango360) joins |
17:20:39 | <Megame> | Hans5958, murb I threw them in AB |
17:21:06 | <murb> | ta |
17:44:55 | | BearFortress joins |
17:47:28 | | TheTechRobo quits [Client Quit] |
17:48:02 | | TheTechRobo (TheTechRobo) joins |
17:55:25 | | CraftByte quits [Client Quit] |
17:55:25 | | icedice quits [Remote host closed the connection] |
17:55:31 | | CraftByte (DragonSec|CraftByte) joins |
17:55:37 | | icedice (icedice) joins |
17:56:49 | | TheTechRobo quits [Client Quit] |
17:57:24 | | Pedrosso4 joins |
17:57:31 | | Pedrosso quits [Client Quit] |
17:57:31 | | Pedrosso4 is now known as Pedrosso |
17:57:45 | | TheTechRobo (TheTechRobo) joins |
17:59:43 | | icedice quits [Remote host closed the connection] |
17:59:56 | | icedice (icedice) joins |
18:10:37 | | Megame quits [Client Quit] |
18:16:40 | <fireonlive> | here's the forum post about it: https://forum.opensubtitles.org/viewtopic.php?t=17930#p47873 |
18:17:16 | <fireonlive> | looks like the 'new rest api' still has a free tier: https://opensubtitles.stoplight.io/docs/opensubtitles-api/a7d25b650b784-api-subscription-prices |
18:17:42 | <fireonlive> | though those prices are.. hm. |
18:17:54 | <@JAA> | XML-RPC... Ok, yeah, I agree that needs to die already. |
18:19:02 | <fireonlive> | https://blog.opensubtitles.com/opensubtitles/saying-goodbye-to-opensubtitles-org-api-embrace-the-20-black-friday-treat posted earlier says "This decision, initially disclosed in a forum post, will primarily affect non-VIP users, while VIP members will continue to enjoy access to the API." so i guess they're keeping it around for VIP people for a bit longer at least? |
18:19:12 | <fireonlive> | and yeah it does haha |
18:55:53 | | rktk (rktk) joins |
19:04:15 | | aninternettroll quits [Ping timeout: 272 seconds] |
19:08:02 | | icedice quits [Client Quit] |
19:13:26 | | Naruyoko joins |
19:25:31 | | parfait (kdqep) joins |
19:36:42 | | aninternettroll (aninternettroll) joins |
19:49:53 | | qwertyasdfuiopghjkl (qwertyasdfuiopghjkl) joins |
19:52:15 | | nicolas17 joins |
19:58:19 | | DogsRNice joins |
20:15:40 | <anarchat> | so uh |
20:16:24 | <anarchat> | apparently, i have a blogspot blog, two actually... and i learned this because google/blogger.com wrote me to tell me i haven't logged in since 2007 and so they will delete my shit... i wonder if we need to do something about this |
20:16:36 | <anarchat> | my two blogs are totally irrelevant and empty, but there might be others facing destruction out there |
20:18:50 | <nicolas17> | anarchat: it has been discussed before; how do we find "all blogs"? |
20:20:17 | <anarchat> | i have no idea |
20:30:16 | <fireonlive> | #frogger :) |
20:31:00 | | DogsRNice_ joins |
20:31:10 | | DogsRNice quits [Remote host closed the connection] |
20:31:10 | | parfait quits [Remote host closed the connection] |
20:31:10 | | Naruyoko quits [Remote host closed the connection] |
20:31:10 | | qwertyasdfuiopghjkl quits [Remote host closed the connection] |
20:31:10 | | Naruyoko5 joins |
20:31:25 | | parfait (kdqep) joins |
20:33:33 | <nicolas17> | kpcyrd: https://data.nicolas17.xyz/localization-unreproducible-builds.mp4 this is from the DASH stream on sharepoint |
20:34:08 | <nicolas17> | it wasn't easy to get because if I just fed the DASH manifest to ffmpeg, one or two segments would randomly give a "503 service unavailable" and ffmpeg doesn't retry |
20:34:17 | <nicolas17> | so I got a gap in the video |
20:34:36 | <nicolas17> | I had to download all segments and rewrite the manifest to use those local files |
20:35:21 | <nicolas17> | take it and figure out what to do with it; archive.org item or whatever :P |
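[Editor's sketch] The workaround nicolas17 describes (fetch every DASH segment separately, retrying the ones that return 503, then point the manifest at the local copies) could look roughly like the following. This is a minimal sketch, not the script that was actually used: the helper names `http_get` and `fetch_with_retry`, the retry count, and the delay are all assumptions, and the manifest rewriting step is not shown.

```python
import time
import urllib.error
import urllib.request


def http_get(url, timeout=60):
    """Fetch one URL and return the response body as bytes."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read()


def fetch_with_retry(url, fetch=http_get, attempts=5, delay=2.0):
    """Retry a segment download when the server answers with a 5xx error.

    `fetch` is injectable so the retry logic can be exercised without
    network access. Non-5xx errors are re-raised immediately.
    """
    for attempt in range(1, attempts + 1):
        try:
            return fetch(url)
        except urllib.error.HTTPError as err:
            if err.code < 500 or attempt == attempts:
                raise
            # Back off a little longer after each failed attempt.
            time.sleep(delay * attempt)
```

Each segment URL from the manifest would be passed through `fetch_with_retry` and written to disk, so a transient 503 no longer leaves a gap in the remuxed video.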
20:41:54 | | qwertyasdfuiopghjkl (qwertyasdfuiopghjkl) joins |
20:46:41 | <Pedrosso> | https://transfer.archivete.am/J2GVQ/sporeforums1.txt does this list of largely unarchived spore forums have any URLs that the bot wouldn't be able to archive properly? In either case could the viable URLs be fed to AB? |
20:48:17 | <@JAA> | The ones that aren't entire domains could be problematic recursion-wise. |
20:48:30 | <Pedrosso> | Problematic in what way? |
20:48:31 | <@JAA> | Other than that, not sure. |
20:49:18 | <@JAA> | Not recursing properly. If you !a https://example.org/foo/ and it has a link to /bar, that won't be followed. |
20:49:31 | <Pedrosso> | not even with offsite links allowed? |
20:49:46 | <@JAA> | No, because they're not offsite. |
20:50:21 | <@JAA> | Offsite = different host |
20:50:54 | <Pedrosso> | I getcha, had hoped it was just offsite (named after different host) = outside recursion |
20:51:40 | <@JAA> | For example, it wouldn't recurse anywhere useful from https://www.mobygames.com/forum/game/36030/spore/ because those URLs aren't in .../spore/. |
20:52:11 | <@JAA> | In that case, !a https://www.mobygames.com/forum/game/36030/spore would work though. |
20:52:22 | <@JAA> | But yeah, each of those needs to be looked at individually. |
20:52:31 | <@JAA> | And some might simply not be possible. |
20:52:42 | <Pedrosso> | what does the / at the end do? |
20:53:25 | <@JAA> | It's a path segment delimiter. For the purpose of AB, the last slash in the path part of the URL determines where it'll recurse onsite. |
20:53:43 | | parfait quits [Remote host closed the connection] |
20:53:43 | | CraftByte quits [Client Quit] |
20:53:48 | | CraftByte (DragonSec|CraftByte) joins |
20:53:50 | <@JAA> | From https://www.mobygames.com/forum/game/36030/spore, it would recurse to any link starting with https://www.mobygames.com/forum/game/36030/ . |
20:53:57 | | parfait (kdqep) joins |
20:54:00 | <Pedrosso> | I see |
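[Editor's sketch] The onsite recursion rule JAA describes above can be modelled in a few lines. This is only an illustration of the rule as stated in the chat (everything up to and including the last slash of the start URL's path is the recursion prefix); it is not ArchiveBot's actual code, the function names are made up, and offsite handling is ignored.

```python
from urllib.parse import urlsplit


def recursion_prefix(start_url: str) -> str:
    """Return the URL prefix under which links count as onsite recursion."""
    parts = urlsplit(start_url)
    path = parts.path or "/"
    # Keep everything up to and including the last slash of the path.
    prefix_path = path[: path.rfind("/") + 1]
    return f"{parts.scheme}://{parts.netloc}{prefix_path}"


def would_recurse(start_url: str, link: str) -> bool:
    """True if `link` falls under the recursion prefix of `start_url`."""
    return link.startswith(recursion_prefix(start_url))
```

Under this model, starting from https://www.mobygames.com/forum/game/36030/spore gives the prefix https://www.mobygames.com/forum/game/36030/ , while starting from https://example.org/foo/ does not follow a link to https://example.org/bar .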
20:56:05 | <Pedrosso> | Should I send a transfer.archivete.am link with just the full-domain ones then? |
20:57:17 | <@JAA> | No need, this one is fine. |
20:58:28 | <Pedrosso> | Alright. Thanks a lot then |
21:02:41 | | lennier2_ quits [Ping timeout: 272 seconds] |
21:04:48 | <@JAA> | I wonder whether https://blog.seamonkey-project.org/2023/11/14/migrating-off-archive-mozilla-org/ only applies to SeaMonkey or also to other projects or even the entire archive.mozilla.org. |
21:04:56 | <@JAA> | (It's already running through AB courtesy of arkiver.) |
21:05:22 | <@JAA> | Cc pabs ^ |
21:07:15 | | lennier2_ joins |
21:12:50 | <@JAA> | I'm currently listing all of archive.mozilla.org. It's ... large. |
21:13:09 | <@JAA> | I'll have a size estimate later. |
21:20:27 | <@arkiver> | JAA: maybe it's too large for ArchiveBot, i wonder how large it is. hope we can archive it entirely |
21:24:12 | <@JAA> | I'm already up to over 1.2 million *directories* after only processing 17k. |
21:24:15 | <@JAA> | So yeah... |
21:29:16 | <@JAA> | To rephrase it a bit more clearly: I've processed 17k directories and discovered over 1.2 million directories from those. I'm recursing through the dir tree, obviously. |
21:30:07 | <@JAA> | And those numbers are now at 32k done, 2.1M discovered. |
21:30:13 | <@JAA> | It'll be a while... |
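[Editor's sketch] The "done" versus "discovered" counters JAA quotes are what a queue-based walk of a directory tree naturally produces. The sketch below assumes a hypothetical `list_subdirs` callback that returns the subdirectory URLs found in one listing; it is not how qwarc is implemented, and the `seen` set only stops re-visiting identical URLs (a symlink loop that yields ever-longer paths would not be caught).

```python
from collections import deque


def walk_listing(root, list_subdirs):
    """Breadth-first walk over directory listings.

    Returns (directories processed, directories discovered).
    """
    seen = {root}
    queue = deque([root])
    done = 0
    while queue:
        current = queue.popleft()
        for sub in list_subdirs(current):
            if sub not in seen:
                seen.add(sub)
                queue.append(sub)
        done += 1
    return done, len(seen)
```

While the walk is running, `done` lags far behind `len(seen)` whenever directories are wide, which matches the 17k processed versus 1.2 million discovered figures above.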
21:31:00 | <Pedrosso> | How long will it stay up? Assuming it has any sort of shutdown date |
21:31:55 | <@JAA> | See link above |
21:32:43 | <@JAA> | Beware of https://archive.mozilla.org/pub/firefox/tinderbox-builds/ , those subdirs are *massive*. Like, 100 MB dir listings massive. |
21:33:48 | <@JAA> | There's also at least one which doesn't finish loading within a minute. |
21:34:13 | <Pedrosso> | does AB ignore something if it doesn't load within a minute? |
21:34:14 | <project10> | mod_autoindex like 😰 |
21:34:39 | <@JAA> | It's complicated. |
21:35:27 | <@JAA> | AB expects the HTTP headers within 20 seconds and the complete response within 6 hours, but slow processing of parallel requests (such as link extraction or compressing for WARC) can break the retrieval. |
21:35:45 | <@JAA> | I bet most of the dirs in there were not listed correctly on the first attempt by AB. |
21:36:54 | <@JAA> | The 1 minute timeout is the default in qwarc, which I'm using for listing this more efficiently. |
21:37:09 | <nicolas17> | 100MB *listings*? |
21:38:29 | <@JAA> | Running into a problem, will need to restart the listing. |
21:40:52 | <Pedrosso> | I bid you good luck with this, lookin' forward to seeing just how big the listing file will be. |
21:41:50 | <@JAA> | nicolas17: Yes, autoland-linux64 is that one, it contains 195k entries. |
21:42:37 | <@JAA> | autoland-macosx64-debug times out on the server side after a bit over a minute with a 502. |
21:50:49 | <@JAA> | Listing restarted, going more faster now. |
21:51:27 | | dumbgoy quits [Ping timeout: 272 seconds] |
21:54:51 | <@JAA> | (I hope there are no loops via symlinks.) |
21:59:36 | <@JAA> | Oh, this time autoland-linux64 repeatedly timed out as well, yay. |
22:02:12 | <@JAA> | I think I'm running into SQLite lock contention at this point. But processing 7-9k dirs per minute isn't bad. |
22:08:45 | | qwertyasdfuiopghjkl quits [Remote host closed the connection] |
22:09:40 | <Pedrosso> | As for what I sent of spore forums, here are a few archive-related comments about the domains of the few that weren't directly in the domain https://transfer.archivete.am/mawW8/sporeforums%20addendum.txt |
22:14:22 | | etnguyen03 (etnguyen03) joins |
22:16:14 | | nicolas17 quits [Ping timeout: 265 seconds] |
22:17:30 | <Pedrosso> | An addendum to that addendum; https://gamefaqs.gamespot.com/ has an archive but https://gamefaqs.gamespot.com/boards/926714-spore/72994456 (posted before the archive) is missing (https://gamefaqs.gamespot.com/boards/926714-spore has 1 archive from ArchiveTeam though) |
22:19:54 | | nicolas17 joins |
22:20:32 | <nicolas17> | my modem rebooted... maybe because of telegrab at high concurrency /o\ |
22:21:09 | <@JAA> | I don't think we ever fully archived GameFAQs. I believe there were unsuccessful/incomplete attempts only. |
22:21:19 | <nicolas17> | (reddit is much more prone to doing that) |
22:22:31 | <nicolas17> | does wget-at use keepalive? |
22:25:45 | <Pedrosso> | Ah, I see. |
22:32:34 | | Arcorann (Arcorann) joins |
22:50:33 | <@JAA> | Now doing over 10k dirs per minute. Brrrrr |
22:50:55 | <@JAA> | Still going to take at least 3 hours to get through the remaining queue. lol |
22:51:06 | <@JAA> | So yes, it is marginally too big for AB. :-P |
22:52:14 | <Pedrosso> | What alternatives are there then? |
22:52:46 | <@JAA> | It does depend a bit on how many files there are and how large they are. |
22:52:53 | <@JAA> | DPoS would be an option. |
22:53:13 | <@JAA> | Or maybe it can be done with AB with a few !ao < jobs rather than one big recursive one. |
22:53:42 | <@JAA> | The listings I've retrieved so far are already over 1 GiB of WARC, i.e. after compression. |
22:54:04 | <Pedrosso> | o_o |
22:54:10 | <Pedrosso> | how many would "a few" be? |
22:54:11 | <h2ibot> | Arkiver uploaded File:Blogger-icon.png: https://wiki.archiveteam.org/?title=File%3ABlogger-icon.png |
22:54:45 | <@JAA> | And arkiver spoke: 'let there be an icon!', and there was an icon. |
22:54:57 | <Pedrosso> | That is how it be |
22:55:17 | <fireonlive> | and it was glorious |
22:55:56 | <Flashfire42> | And the administrators of those websites said "Did anybody hear that? Must have been the wind" |
22:56:25 | | parfait quits [Remote host closed the connection] |
22:56:25 | | CraftByte quits [Client Quit] |
22:56:30 | | CraftByte (DragonSec|CraftByte) joins |
22:56:41 | | parfait (kdqep) joins |
22:57:57 | <Flashfire42> | Whenever a new project is about to start I always imagine some kind of eldritch abomination machine just slowly whirring to life. Eyes of red blink into existence and it starts a march towards its target |
22:58:22 | | parfait quits [Remote host closed the connection] |
22:58:44 | | parfait (kdqep) joins |
22:59:13 | | Wohlstand quits [Ping timeout: 272 seconds] |
22:59:17 | | Wohlstand1 (Wohlstand) joins |
23:00:08 | <@JAA> | Pedrosso: 'A few' would be more than 'a couple' but not 'many'. :-P I don't know, it depends on the output of the listing. |
23:00:12 | <h2ibot> | JAABot edited CurrentWarriorProject (+4): https://wiki.archiveteam.org/?diff=51159&oldid=51154 |
23:01:42 | | Wohlstand1 is now known as Wohlstand |
23:01:44 | | parfait quits [Remote host closed the connection] |
23:01:44 | | CraftByte quits [Client Quit] |
23:01:49 | | CraftByte (DragonSec|CraftByte) joins |
23:02:57 | <@arkiver> | yep :) |
23:03:19 | <@arkiver> | JAA: what is your opinion on already writing a WARC-TLS-Cipher-Suite field before it's standardised? |
23:03:31 | <@arkiver> | (related to that issue on the warc specs github repo) |
23:04:49 | <@arkiver> | or actually |
23:05:05 | <@arkiver> | WARC-Cipher-Suite (the value starting with TLS_ already makes it clear it's for TLS) |
23:05:21 | <fireonlive> | (thank you for not calling it SSL) |
23:06:01 | <@arkiver> | i'm glad i made your day fireonlive :) |
23:06:07 | <fireonlive> | :D |
23:11:28 | <Pedrosso> | JAA: oh, well it's nice that there are such convenient solutions |
23:26:09 | <pabs> | JAA: I expect archive.mozilla.org has a lot of stuff that isn't that useful to archive, like millions of test results :) |
23:27:50 | <@JAA> | arkiver: Fine with me, it's not a violation of the spec to write fields that aren't specified. Might be worth leaving a comment about the intent on https://github.com/iipc/warc-specifications/issues/86 though and seeing if anyone else has concerns about that. |
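[Editor's sketch] To make the idea concrete: a WARC record header is a version line followed by `Name: value` fields and a blank line, so an extra field is just one more line. The sketch below is an assumption-laden illustration, not wget-at's or any other tool's implementation. `WARC-Cipher-Suite` is the not-yet-standardised field name proposed in the chat, the cipher suite value is only an example, and the header is deliberately incomplete (no payload, no digests).

```python
import uuid
from datetime import datetime, timezone


def build_warc_header(record_type, content_length, extra_fields=None):
    """Build a minimal WARC/1.1 record header block as bytes."""
    fields = [
        ("WARC-Type", record_type),
        ("WARC-Record-ID", f"<urn:uuid:{uuid.uuid4()}>"),
        ("WARC-Date", datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")),
        ("Content-Length", str(content_length)),
    ]
    # Extra fields, such as the proposed WARC-Cipher-Suite, are appended as-is.
    fields.extend((extra_fields or {}).items())
    lines = ["WARC/1.1"] + [f"{name}: {value}" for name, value in fields]
    return ("\r\n".join(lines) + "\r\n\r\n").encode("utf-8")


header = build_warc_header(
    "response",
    0,
    {
        "WARC-Target-URI": "https://example.org/",
        "WARC-Cipher-Suite": "TLS_AES_128_GCM_SHA256",  # example value only
    },
)
```

A reader that does not know the extra field can simply skip it, which is why writing it ahead of standardisation does not conflict with the spec, as JAA notes.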
23:29:22 | <@JAA> | pabs: Yeah, I'm sure there are more and less useful parts to it. |
23:29:40 | <@JAA> | Have you possibly seen another announcement from Mozilla themselves about it? |
23:29:54 | <pabs> | not yet, but I did just wake up :) |
23:30:46 | <pabs> | nothing on https://planet.mozilla.org/ |
23:31:22 | <pabs> | nothing on https://blog.thunderbird.net/ either |
23:31:32 | <@JAA> | Ah yes, time zones. :-) |
23:31:35 | <pabs> | maybe it is only ex-Mozilla projects moving? |
23:33:26 | | dumbgoy joins |
23:49:23 | <thuban> | repeating some requests related to old.dlib.me here, since they got lost in #archivebot: |
23:49:33 | <thuban> | https://transfer.archivete.am/AGArb/www.old.dlib.me-document-viewers-nom - yet another slightly different viewer url |
23:49:45 | <thuban> | https://transfer.archivete.am/ej3GO/www.old.dlib.me-item-pdfs - a small number of items available as pdf rather than through the document viewer |
23:50:04 | <thuban> | https://transfer.archivete.am/y7EDo/www.old.dlib.me-item-info-byname - item info pages, as linked from the library index (extracted from post xhr--not that we can duplicate that, but it's what external links are likely to be). media items like photos and videos are included in page assets |
23:50:19 | <thuban> | https://transfer.archivete.am/157ht6/www.old.dlib.me-item-info-byid - item info pages, by document id (this is the only way to see metadata for some items, mostly newspapers) |
23:50:39 | <thuban> | i believe that's everything that will actually work |
23:56:23 | <@JAA> | I'll run them shortly. |
23:57:12 | | Wohlstand1 (Wohlstand) joins |
23:58:29 | | nicolas17 quits [Read error: Connection reset by peer] |
23:59:03 | | nicolas17 joins |
23:59:23 | | Wohlstand quits [Ping timeout: 272 seconds] |
23:59:23 | | Wohlstand1 is now known as Wohlstand |