#archiveteam-bs log for 2021-05-24

00:20:24	<thuban>	ah yeah, i heard something about it being moved to the same servers as the sister site fictionpress
00:23:00	<thuban>	man, it looks like https://github.com/ArchiveTeam/ffnet-grab wasn't even warrior-based?
00:24:12	<Doranwen>	Fanfiction.net is a massive bit of history for fandom, so that's a huge priority for us
00:28:05	<thuban>	looks like there's never been a scrape of the forums, either
00:42:01		KRG (KRG) joins
00:43:28	<@OrIdow6>	This does sound to me like another panic-site-is-shutting-down-over-a-single-social-media-post
00:44:17	<thuban>	not really
00:44:28	<@OrIdow6>	Obviously from how it's going, it seems less than likely it'll last another year or so, but AFAICT the thing in the last ~2 days does not reflect a days-left state before shutdown?
00:45:55	<thuban>	the thing is that most people have _already_ fled ffn, because it's done multiple rounds of purges in the past and it's no longer trusted in fandom circles--that post isn't so much a news announcement as a psa that gets repeated occasionally
00:46:04	<Doranwen>	no, I don't think it's imminent - but it's not looking great, and fandom would not like to take any chances
00:46:05		Doranwen nods
00:46:42	<thuban>	anyway: sequential user and story ids, max ~15m (!) and ~14m respectively, can be accessed without slugs (not a redirect, just same page content). forums are user-created and do seem to require slugs; topics don't require slugs but do include the forum id in the url and appear to use a single counter across all forums, so both would probably have to be enumerated through the pagination
00:46:46	<Doranwen>	there's still a lot of very good stuff on there - and the reviews are very valuable too
00:51:42	<thuban>	"communities" are just user-curated story collections; there's a separate beta-reader profile system that isn't linked from the main user profiles (not all users are "registered beta readers") but uses the same ids. i think that's about it
01:02:32		dm4v quits [Read error: Connection reset by peer]
01:02:55		dm4v joins
01:02:57		dm4v is now authenticated as dm4v
01:02:57		dm4v quits [Changing host]
01:02:57		dm4v (dm4v) joins
01:09:28		hexa- quits [Quit: WeeChat 3.1]
01:11:00		hexa- (hexa-) joins
01:13:41	<cm>	I know this isn't archive.org, but does anyone know how to determine the size of a page that is saved on the wayback machine?
01:14:25	<Doranwen>	this is more the channel for questions like that, though I can't answer it, hopefully someone can!
01:14:41	<cm>	with a live web page you can look at the bandwidth usage in firefox dev tools, but on the wayback machine you also pick up stuff from archive.org
01:16:30	<@JAA>	cm: Define 'size'? Byte size of the original HTML? Total size with images etc.? Something else?
01:18:13	<cm>	JAA: the bandwidth it takes to load the page completely
01:20:21	<@JAA>	In the WBM? Then you need to include the WBM's own things. And it rewrites links and references as well, which will also change the size.
01:20:48	<@JAA>	What are you trying to do?
01:21:19	<cm>	compare an updated site to an older version of the same site
01:21:56	<@JAA>	If it's just about the difference, including the WBM's scripts etc. shouldn't matter.
01:22:13	<@JAA>	Since they'll be included in both.
01:25:09	<cm>	ah yeah I can just use the wayback machine version of both
01:25:16	<cm>	good idea
01:26:29	<cm>	ah that doesn't work though, since the wayback machine is now IP blocked by the site I want to measure :(
02:51:42		KRG quits [Client Quit]
02:58:45	<@JAA>	hook54321: https://lists.mozilla.org/ was discontinued last month, apparently. Found that while looking into news.mozilla.org. :-|
02:59:14	<@hook54321>	i thought we got that one
02:59:18	<@JAA>	At least all the content's still there it seems.
02:59:19	<@JAA>	Did we?
02:59:35	<@JAA>	Grepping my logs only yielded an AB job from 2017.
02:59:54	<@hook54321>	ah
03:00:00	<@hook54321>	must be that i'm thinking of
03:00:05	<webdownload>	Why would Mozilla do such a thing?
03:00:39	<@JAA>	They've been doing this for at least a couple years now.
03:01:17	<webdownload>	They don't seem like the type.
03:01:45	<@JAA>	Inb4 Google Groups shuts down and they move back to their own infra.
03:02:26	<@JAA>	This is where I found out about it, by the way: https://groups.google.com/g/alt.comp.freeware/c/FWKDhmVClv0
03:03:42		BlueMaxima_ joins
03:05:16		Viniter quits [Ping timeout: 250 seconds]
03:06:30		Viniter (Viniter) joins
03:07:37		BlueMaxima quits [Ping timeout: 258 seconds]
03:12:41		BlueMaxima__ joins
03:15:14		Viniter quits [Ping timeout: 250 seconds]
03:15:37		Viniter (Viniter) joins
03:16:32		BlueMaxima_ quits [Ping timeout: 250 seconds]
03:16:45		Viniter7 (Viniter) joins
03:20:16		Viniter quits [Ping timeout: 258 seconds]
03:20:16		Viniter7 is now known as Viniter
03:28:42		qw3rty__ quits [Ping timeout: 258 seconds]
04:03:20		BlueMaxima_ joins
04:07:25		BlueMaxima__ quits [Ping timeout: 258 seconds]
04:09:15		NIC007a83 joins
04:09:23		BlueMaxima__ joins
04:10:26		NIC007a83 quits [Client Quit]
04:12:29	<Doranwen>	thuban: also a note - the default index pages for each category exclude the M-rated fics
04:12:34	<Doranwen>	for fanfiction.net
04:12:50	<Doranwen>	one has to change that setting to see all of them
04:13:10		BlueMaxima_ quits [Ping timeout: 258 seconds]
04:13:32	<Doranwen>	it's a simple string that is added to the URL, so that's not hard to apply, but it's a consideration, because it'd be easy to just go through the default ones and miss those in the index pages
04:15:01	<@JAA>	But those stories themselves are accessible normally, right? It's just excluded from the lists?
04:15:01	<thuban>	not relevant for enumeration over fic ids
04:15:15	<thuban>	yeah
04:16:43	<thuban>	do we usually get that sort of pagination data?
04:16:47	<@JAA>	Does something similar exist in the forums, where we can't just enumerate topics?
04:17:57	<@JAA>	Usually not, no. It'd be nice to preserve the entire website 'experience', but it's not easily possible often, and the unique content is way more important obviously.
04:18:14	<Doranwen>	yeah, it's only really a consideration for the WBM, I think
04:20:17	<thuban>	^^ site policy is "All forum posts must be suitable for teens", and topics don't have ratings, so i presume not
04:20:21	<Doranwen>	that's what I saw on a Reddit thread discussing whether this latest round of paranoia has any substance - and the general consensus was something like "ff.n has been dying for ages, it's not imminent at all but eventually it may go", but there was one fandom user (who, incidentally, helped us with the Yahoo Groups project) that was really bothered that the WBM never got those fics in the index pages
04:21:35	<@JAA>	The index can always be rebuilt from the stories anyway.
04:21:59	<@JAA>	Pagination is particularly horrible to properly archive on websites that are still live.
04:22:14	<thuban>	yeah, was thinking that myself
04:22:28	<@JAA>	You virtually always end up with an incomplete list because things get shifted around while you're iterating through the pages.
04:22:44	<@JAA>	So some stories would appear twice and others would be missing.
04:24:03	<Doranwen>	yeah, if there was a way to reverse the order so the oldest appeared first, one can set it to sort by publish date instead of update date - but any new story posted will throw it off
04:24:12	<Doranwen>	AO3 is nicer in that you can set that
04:24:44	<@JAA>	Stories sometimes get deleted, and then it'll shift everything anyway.
04:25:05	<Doranwen>	oof, yeah
04:25:08	<@JAA>	Offset-based pagination always has that problem.
04:25:45	<@JAA>	You need cursor-based pagination instead, but that's messier to implement, so many smaller sites don't use it.
04:26:35	<@JAA>	What you do is grab all the stories and then generate an independent index from that.
04:27:22	<@JAA>	Anyway, the primary concern is making sure that the unique data, i.e. the actual stories, are safe.
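
The drift JAA describes with offset-based pagination can be reproduced in a few lines. This is only a sketch: the story IDs are made up and `paginate_by_offset` is a hypothetical helper, not anything taken from the site. It shows one story appearing twice and another never appearing because the listing shifted between page requests.

```python
def paginate_by_offset(items, page_size, page):
    """Return one page of a listing using offset-based pagination."""
    start = page * page_size
    return items[start:start + page_size]


# Hypothetical listing, newest first (IDs are invented for illustration).
listing = [106, 105, 104, 103, 102, 101]

seen = []
seen += paginate_by_offset(listing, 3, 0)  # first page

# A new story is published between two page requests and pushes
# every existing entry down by one position.
listing.insert(0, 107)

seen += paginate_by_offset(listing, 3, 1)  # second page
seen += paginate_by_offset(listing, 3, 2)  # third page

duplicates = sorted({i for i in seen if seen.count(i) > 1})
missing = sorted(set(listing) - set(seen))
print("seen:", seen)
print("duplicates:", duplicates)
print("missing:", missing)
```

Enumerating stories by ID and generating the listing afterwards avoids this, since the set of IDs does not depend on the order in which the site happens to present them.
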
04:27:36	<thuban>	yeah. (i don't think ffn will show you the exact timestamp, unfortunately)
04:28:38	<@JAA>	It doesn't display it, but it's in the HTML in a data-xutime attribute.
04:28:50	<thuban>	oh! i should have checked
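
Pulling the `data-xutime` attribute out of a saved page could look roughly like this. The sample HTML is hypothetical, and treating the value as Unix epoch seconds is an assumption, not something confirmed in the log.

```python
from datetime import datetime, timezone
from html.parser import HTMLParser


class XutimeParser(HTMLParser):
    """Collect the numeric value of every data-xutime attribute."""

    def __init__(self):
        super().__init__()
        self.timestamps = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "data-xutime" and value and value.isdigit():
                self.timestamps.append(int(value))


# Hypothetical markup; the real page structure may differ.
sample = '<span data-xutime="1621830000">5m ago</span>'

parser = XutimeParser()
parser.feed(sample)

# Assumption: the value is seconds since the Unix epoch.
for ts in parser.timestamps:
    print(datetime.fromtimestamp(ts, tz=timezone.utc).isoformat())
```
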
04:30:44	<@JAA>	What is the URL tweak needed for the M-rated stories in lists?
04:32:20	<thuban>	it's the 'r' parameter
04:32:21	<@JAA>	Ah, found it in the filters. r=10 param
04:32:31	<thuban>	if you click the "Filters"--yeah
04:33:43	<@JAA>	Looks like https://www.fanfiction.net/j/0/0/0/ always includes them. That's good.
04:33:58	<@JAA>	Thinking about how to continuously cover the site until it inevitably dies.
04:38:52	<@JAA>	Hmm, shouldn't all entries on https://www.fanfiction.net/j/0/2/0/ ('Updated Stories') also be in https://www.fanfiction.net/j/0/0/0/ ('All Types')?
04:39:52	<thuban>	i think "Updated Stories" excludes newly published one-shots
04:41:01	<thuban>	wait, no it doesn't.
04:41:25	<@JAA>	Hmm, I'm sensing some caching bugs there.
04:42:14	<@JAA>	But I suppose the best strategy would be to retrieve all five of the 'Just In' pages regularly to collect story IDs that need to be regrabbed.
04:45:40	<thuban>	the first entry in "Updated Stories" as i'm looking at it right now says "Updated: Oct 17, 2020" but actually clicking on the fic shows "5m ago"
04:45:46	<thuban>	caching bugs indeed :)
04:49:18	<@JAA>	Yeah
04:49:52	<@JAA>	But there's a 'Ghost of Love (Reylo Fanfic)' on Updated now that doesn't appear on All while the two around it do.
04:50:12		nothere quits [Quit: Leaving]
04:51:36	<thuban>	there are also rss feeds, but i think they're only per-category and they appear to be equally flaky
04:52:46	<Doranwen>	yeah, you can customize them quite a bit, I think
04:52:53		nothere joins
04:53:03	<Doranwen>	like, I think they generate a feed based on whatever filter you have set to browse with
04:56:00	<nerdguy1138>	JAA: i'm actually working on this! i can send you a list of inprogress fics
04:56:31	<nerdguy1138>	ive been archiving fics for years now
05:04:12	<nerdguy1138>	fanfiction.net has really been amping up the cloudflare blocking recently. ive moved on to AO3, wattpad, and quotev.
05:12:04	<@JAA>	That's disappointing. What kind of rate limit are we looking at there?
05:15:49	<nerdguy1138>	somewhere between 5-10 seconds, the script i was using just completely gave up, and honestly if they want to consign themselves to the dustbin of history that badly, i'm inclined to let them. i already have millions of stories from there.
05:16:05	<nerdguy1138>	they only started doing this in the last few months, afaik
05:19:57	<@JAA>	Like, one request per 5-10 seconds? Ew.
05:22:22	<@OrIdow6>	Question is whether it works based on total load
05:22:36	<@OrIdow6>	Though of course that does mean that it's out of QWarc's reach
05:23:21	<nerdguy1138>	i was focused on saving as many stories as i could, i got almost 9 million, so i think i did pretty well.
05:23:43	<@JAA>	OrIdow6: Nothing is out of qwarc's reach. :-)
05:24:01	<@JAA>	All it needs is a bunch of IPs and a bit of iptables magic. :-P
05:27:32		rsn_ quits [Ping timeout: 258 seconds]
05:43:09	<@OrIdow6>	JAA: Oh, I suppose
05:43:15	<@OrIdow6>	QWarc does seem pretty nice
05:43:42	<@JAA>	Thanks, just don't look too closely or you'll see the nasty intestines. :-P
05:44:27	<@OrIdow6>	Well, that's all useful software
05:45:07	<@OrIdow6>	*That's a feature of all useful software (original makes sense pronounced, not written)
05:45:47	<@JAA>	Guess so, but this is particularly bad. Monkey-patching internal parts of a third-party package. :-P
05:47:42	<@JAA>	(Also, I'm a bit picky in that regard, but the correct spelling of the name is qwarc, not QWarc.)
05:52:35	<Ryz>	Oh good, that means I can continue pronouncing it as 'qwark' x3
05:57:30	<@JAA>	While I'm at it, the correct pronunciation is exactly like quark (the subatomic particle). :-)
06:10:00		AnotherIki joins
06:13:55		Iki1 quits [Ping timeout: 258 seconds]
06:21:56		TigerbotHesh quits [Quit: ZNC - https://znc.in]
06:22:04		TigerbotHesh joins
06:30:01		rsn joins
06:49:57		AnotherIki quits [Ping timeout: 258 seconds]
06:51:34		Viniter quits [Client Quit]
06:51:50		Viniter (Viniter) joins
07:34:05		TigerbotHesh quits [Client Quit]
07:39:19		TigerbotHesh joins
08:31:47	<rewby>	JAA: re "All it needs is a bunch of IPs" HCross or me can likely assist with that if you need
08:32:01	<@HCross>	not for a few days
08:32:15	<@HCross>	that box is going in for open-server-surgery later tonight
08:33:35	<rewby>	Ah, well I have a /24 on hand anyway
08:34:19	<@HCross>	(the fear of dread dawns as I realise it involves a fight with _that_ KVM)
08:34:29	<rewby>	Need me to help?
08:34:33	<@HCross>	Potentially tomorrow
08:34:39	<rewby>	Lmk
08:36:38	<@JAA>	rewby: Thanks, might get back to one of you about that sometime. Scanning all FFN stories at one request per 10 seconds per IP from a /24 would take about a week. Some would have pagination of course, but seems feasible enough.
08:37:01	<rewby>	Aight! Let us know.
08:47:23	<@EggplantN>	I can provide lots of IPs JAA
08:47:35	<@EggplantN>	If needed before HCross’s box is back
08:48:53	<@JAA>	Not particularly urgent and needs some more investigation and code writing first anyway, so we'll see. But sounds great, thanks.
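
JAA's "about a week" figure can be checked with some quick arithmetic. The sketch below takes the ~14m story ID ceiling mentioned earlier in the log and assumes all 256 addresses of a /24 are usable, one request per story, and no paginated chapters; `scan_days` is a hypothetical helper.

```python
def scan_days(total_items, ip_count, seconds_per_request):
    """Days needed to request every item once when each IP is
    rate-limited independently to one request per interval."""
    requests_per_second = ip_count / seconds_per_request
    return total_items / requests_per_second / 86_400


stories = 14_000_000              # rough story ID ceiling from the log
ips_in_slash_24 = 2 ** (32 - 24)  # 256; assumes every address is usable

days = scan_days(stories, ips_in_slash_24, 10)
print(round(days, 1))  # prints 6.3
```

Multi-chapter stories would add further requests on top of this, which matches the caveat about pagination.
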
09:26:44		Wayward quits [Ping timeout: 258 seconds]
09:50:32	<s-crypt>	Is there a way to download quite a few larger videos from dropbox? I cannot download all of them at once because it says "the zip is too large"
09:53:13	<s-crypt>	I have this that would be nice to archive but I really don't want to go and download all of them manually
09:53:13	<s-crypt>	https://transfer.notkiska.pw/jwU2x/Wintergatan%20Video%20Masters%20Archive%20Link%20V2.pdf
09:57:13		BlueMaxima__ quits [Read error: Connection reset by peer]
10:43:31		Wayward (wayward) joins
10:51:17		Daloader joins
12:32:53		AnotherIki joins
12:44:23		qw3rty joins
13:40:20		Nick joins
13:40:38		Nick quits [Remote host closed the connection]
15:04:36		betamax quits [Ping timeout: 250 seconds]
15:06:14		betamax (betamax) joins
16:39:56		nerdguy1138 quits [Ping timeout: 250 seconds]
16:55:01		nerdguy1138 (nerdguy1138) joins
16:56:38		HiccupJul (HiccupJul) joins
16:58:09		spirit joins
17:01:12	<HiccupJul>	does archivebot function differently to the wayback machine's "save page"/built-in scraping features?
17:01:46	<HiccupJul>	if so, how are those two datasets reconciled when an archivebot warc is added to the wayback machine?
17:03:46	<masterX244>	Wayback uses the closest WARC record to the given timestamp from a WARC file
17:05:57	<HiccupJul>	so if there was one made by the wayback machine itself, that maybe missed some sort of dynamic content, and then one a few seconds afterwards made with archivebot, which might have better support for some dynamic content (i guess), then changing the datetime you are currently viewing would switch between a "complete" and "incomplete" page?
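
The "closest record to the given timestamp" behaviour masterX244 describes can be illustrated with a simplified model. This is not the Wayback Machine's actual implementation; the capture list, the `source` labels, and the `closest_capture` helper are all hypothetical. Only the 14-digit `YYYYMMDDhhmmss` timestamp format follows the one used in Wayback Machine URLs.

```python
from datetime import datetime


def parse_ts(ts):
    """Parse a 14-digit YYYYMMDDhhmmss timestamp."""
    return datetime.strptime(ts, "%Y%m%d%H%M%S")


def closest_capture(captures, requested):
    """Pick the capture whose timestamp is nearest to the requested one."""
    target = parse_ts(requested)
    return min(captures, key=lambda c: abs(parse_ts(c["timestamp"]) - target))


# Two hypothetical captures of the same URL, five seconds apart.
captures = [
    {"timestamp": "20210524170000", "source": "save-page-now"},
    {"timestamp": "20210524170005", "source": "archivebot"},
]

early = closest_capture(captures, "20210524170001")["source"]
late = closest_capture(captures, "20210524170004")["source"]
print(early, late)
```

Under this model, nudging the requested timestamp by a few seconds is enough to switch between the two captures, which is the effect HiccupJul asks about.
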
17:08:36	<HiccupJul>	also, i want to archive a reddit thread, but neither archive.org nor archive.is gets anything apart from the top 100 or so comments. what's a good tool for archiving a whole thread?
17:13:41	<@EggplantN>	maybe #archivebot ?
17:14:32	<@EggplantN>	#shreddit may have already archived it
17:20:53	<thuban>	s-crypt: script https://github.com/dropbox/dbxcli to get them one at a time automatically?
17:23:14		Vukky joins
17:23:29	<HiccupJul>	i'll try archivebot
17:24:57	<HiccupJul>	is there a known reason why archivebot works better on some pages? is it a purposeful decision by the internet archive to limit the wayback machine? (i know Archive Team isn't the wayback machine, but IA stopped responding to emails)
17:27:56	<thuban>	what do you mean exactly by "works better"?
17:32:03	<HiccupJul>	i think i misremembered something about archivebot working better on dynamic content.
17:32:24	<HiccupJul>	*possibly misremembered
17:35:29	<thuban>	i think so. "Save Page Now" is based on brozzler, a browser emulator (which can execute javascript), while archivebot is based on wpull (which can't), so spn is generically going to be better for sites that depend on dynamic content. (it's not perfect, though, so sites that serve a dynamic version to spn and a non-dynamic version (eg because of the user-agent) to archivebot may work better in the latter)
17:35:47		godane quits [Quit: Leaving]
17:36:16		Wayward quits [Ping timeout: 250 seconds]
17:36:48	<thuban>	warcs from archiveteam _projects_ do include dynamic content, since they're manually scripted to do so
17:37:12	<HiccupJul>	ah i see
17:37:23		Wayward (wayward) joins
17:37:50	<HiccupJul>	thanks, that makes things much clearer
17:43:38	<thuban>	so no, i don't think archivebot would get you the entire thread automatically. i was going to suggest that you peek at the 'more comments' network requests and submit all of them to ab, since that should make them play back properly in the wbm, but it seems that they're POSTs. iirc wbm can handle those but archivebot can't
17:49:04	<HiccupJul>	the more comment links seem to use form data and/or cookies
17:49:11	<HiccupJul>	so it doesn't seem like the url is enough
17:49:28	<HiccupJul>	url is just "https://www.reddit.com/api/morechildren" for expanding sub-comments
17:51:26	<thuban>	that's what i said, yes :)
17:54:30	<HiccupJul>	ah right "Form Data" = POST
18:28:14		HP_Archivist (HP_Archivist) joins
18:28:21	<wessel1512>	is there a project that can help me archive the home.xs4all.nl homepages
18:29:11	<thuban>	can you describe the problem in more detail? (do you already have a list of urls?)
18:29:37	<wessel1512>	a small first one yes
18:31:43	<thuban>	how big? do the urls represent entire "homepages" or just the front pages of small sites? what would you need to do to get a more complete list?
18:32:00	<wessel1512>	i believe that there are probably upwards of half a million sites
18:32:28	<thuban>	in total or in your list
18:32:43	<wessel1512>	in total
18:32:56	<wessel1512>	just began scraping
18:33:23	<wessel1512>	my first list has 350 urls
18:33:35		HiccupJul quits [Client Quit]
18:33:54	<wessel1512>	the sites can be very big
18:34:14	<wessel1512>	but most of them are pretty small
18:35:38	<wessel1512>	example http://apperljh.home.xs4all.nl
18:36:08	<thuban>	ok. that puts it out of scope for urls (no recursion) but it should be doable with archivebot. is that right, JAA?
18:37:22	<wessel1512>	archivebot is off limits for this project
18:37:48	<thuban>	why's that?
18:38:59	<wessel1512>	probably because it would clog the system
18:39:36	<thuban>	i'm not sure what you mean. did someone tell you that?
18:40:16	<wessel1512>	yeah JAA said no
18:41:15	<@JAA>	I'm sure 350 is nowhere near the actual number.
18:41:35	<@JAA>	Also, let's take this to #webroasting because that's exactly what the channel exists for.
18:52:32		ThreeHeadedMonkey quits [Ping timeout: 258 seconds]
18:54:13		nertzy quits [Quit: Leaving]
18:54:23		ThreeHeadedMonkey (ThreeHeadedMonkey) joins
19:40:27		KRG (KRG) joins
19:48:26		Daloader quits [Ping timeout: 250 seconds]
20:33:17		Catyak096 joins
20:36:58		Catyak096 quits [Remote host closed the connection]
21:02:42		holbrooke quits [Quit: ZNC 1.6.6+deb1ubuntu0.2 - http://znc.in]
21:08:59		etnguyen03 (etnguyen03) joins
21:10:52		Jake quits [Quit: Leaving for a bit!]
21:11:41		Jake (Jake) joins
21:20:24		spirit quits [Client Quit]
21:29:56		Jake quits [Client Quit]
21:33:35		Jake (Jake) joins
21:43:02		yano quits [Quit: WeeChat, the better IRC client, https://weechat.org/]
21:43:13		yanome quits [Quit: The Lounge - https://thelounge.chat]
21:43:19		dm4v quits [Client Quit]
21:44:25		yanome (yano) joins
21:44:55		yano (yano) joins
21:45:46		dm4v joins
21:45:48		dm4v is now authenticated as dm4v
21:45:48		dm4v quits [Changing host]
21:45:48		dm4v (dm4v) joins
21:45:55		Jake quits [Client Quit]
21:49:19		Jake (Jake) joins
22:21:37		webdownload quits [Remote host closed the connection]
22:24:07		@Kaz quits [Remote host closed the connection]
22:32:06		Kaz (Kaz) joins
22:32:06		@ChanServ sets mode: +o Kaz
22:38:17		JensRex quits [.net .split]
22:38:19		HP_Archivist quits [Ping timeout: 258 seconds]
22:38:31		JensRex (JensRex) joins
22:52:46		second (second) joins
22:56:14		Kaz1 (Kaz) joins
22:56:14		@ChanServ sets mode: +o Kaz1
22:56:16		sec^nd quits [Ping timeout: 255 seconds]
22:56:16		second is now known as sec^nd
22:59:51		@Kaz1 leaves [The Lounge - https://thelounge.chat]
23:03:37		AnotherIki quits [Ping timeout: 258 seconds]
23:03:44		lennier1 quits [Client Quit]
23:04:20		lennier1 (lennier1) joins
23:09:21		Kaz1 (Kaz) joins
23:09:21		@ChanServ sets mode: +o Kaz1
23:11:30		BlueMaxima joins
23:14:31		HP_Archivist (HP_Archivist) joins
23:42:21		balrog quits [Quit: Bye]
23:53:10		balrog joins
23:53:11		balrog is now authenticated as balrog
