#archiveteam-bs log for 2023-05-18

00:10:11		eroc1990 (eroc1990) joins
00:13:35	<pabs>	JAA: how does one use socialbot to grab tweet threads? I looked at the code and didn't see a command for it
00:14:16	<@JAA>	pabs: socialbot doesn't support it.
00:14:56	<@JAA>	It only implements a subset of snscrape's scrapers.
00:15:41	<pabs>	ah, so how would we do twitter threads then?
00:16:11	<pabs>	these ones in particular https://transfer.archivete.am/EZdNi/cve-twitter-refs.txt
00:16:29	<pabs>	just AB?
00:17:33	<@JAA>	Not really possible I guess. AB doesn't grab tweets correctly, and only socialbot feeds into the background thing that does.
00:19:04	<pabs>	we/I/someone could add a twitter-thread < command though?
00:20:32	<pabs>	or we can just twitter-profile every user in the list and hope for the best :)
00:21:47	<pabs>	todb: cve-refs.txt AB job running now. look at archivebot.com or archivebot.com/3 to watch the progress
00:22:20	<@JAA>	I don't think a recursive tweet list command in socialbot is a good idea. Recursive scrapes can take a very long time, and lists make that worse. socialbot isn't really meant for that.
00:24:25	<pabs>	ok. I guess twitter-profile is the only option then
00:31:44		tbc1887 (tbc1887) joins
00:40:03		tbc1887_ joins
00:41:05		eroc1990 quits [Client Quit]
00:42:06		tbc1887 quits [Ping timeout: 265 seconds]
00:48:24		eroc1990 (eroc1990) joins
00:50:13		birdjj quits [Client Quit]
00:50:19		Somebody2 quits [Ping timeout: 265 seconds]
00:50:30		birdjj joins
00:57:47		cascode quits [Ping timeout: 252 seconds]
01:00:13		birdjj quits [Read error: Connection reset by peer]
01:00:18		birdjj joins
01:00:49		cascode joins
01:02:49		birdjj3 joins
01:02:57		birdjj quits [Read error: Connection reset by peer]
01:02:57		birdjj3 is now known as birdjj
01:04:10		Somebody2 joins
01:18:20		cascode quits [Ping timeout: 252 seconds]
01:19:28		cascode joins
01:44:33		icedice quits [Client Quit]
01:51:55		needlenose quits [Quit: Leaving]
02:05:20		cascode quits [Read error: Connection reset by peer]
02:05:33		cascode joins
02:10:02		cascode quits [Ping timeout: 252 seconds]
02:11:05		cascode joins
02:20:49	<andrew>	I finally got grab-site running and wow it's quite a bit faster than wget
02:25:03		cascode quits [Ping timeout: 265 seconds]
02:25:17		cascode joins
02:31:12	<andrew>	how many millions of pages is too many pages for grab-site?
02:37:51	<fireonlive>	https://i.postimg.cc/8cRLSG9h/image.png hopefully not 1
02:37:57	<fireonlive>	though am only getting like 3r/s :(
02:41:36		cascode quits [Read error: Connection reset by peer]
02:41:53		cascode joins
02:43:38	<fireonlive>	oops forgot to lower con
02:56:12	<andrew>	hopefully it can handle 60 million pages (my rough estimate for this crawl)
03:06:08		dumbgoy quits [Ping timeout: 265 seconds]
03:11:35	<fireonlive>	what's your r/s aobut?
03:11:38	<fireonlive>	abouts*
03:12:12	<andrew>	fireonlive: it says 4.5 r/s
03:12:17	<andrew>	at 6 concurrency
03:12:19	<fireonlive>	oh ntb
03:12:28	<andrew>	but it wildly varies because the server isn't particularly fast on some pages
03:12:30	<fireonlive>	i'm at 3.9 with 5 atm
03:12:40	<fireonlive>	ye i guess it depends on server, backend, etc
03:12:46	<fireonlive>	grab-site is very neat!
03:13:15	<fireonlive>	also uh thanks to everyone here for not minding the stuff i add in lol
03:13:29	<fireonlive>	or "waste my time with" to add
03:17:30	<fireonlive>	oh i thought this was imgone
03:26:00		pikablu joins
03:28:59		pikablu quits [Client Quit]
04:21:18	<todb>	pabs: thanks so much for running the AB job!
04:26:03	<pabs>	todb: once the job completes, you will be able to look at what URLs worked/didn't by finding the job on https://archive.fart.website/archivebot/viewer/
04:45:23		decay quits []
04:58:20		cascode quits [Read error: Connection reset by peer]
04:59:09		cascode joins
05:09:40		Jake quits [Quit: Leaving for a bit!]
05:10:06		Jake (Jake) joins
05:15:40		Jake quits [Client Quit]
05:16:00		Jake (Jake) joins
05:19:41		lexikiq quits [Client Quit]
05:20:20		Jake quits [Client Quit]
05:24:01		Jake (Jake) joins
05:25:03		decky_e quits [Remote host closed the connection]
05:25:20		decky_e joins
05:27:36		Jake quits [Client Quit]
05:28:10		Jake (Jake) joins
06:08:44		MrTumnus quits [Ping timeout: 252 seconds]
06:15:22		parfait (kdqep) joins
06:23:49		BlueMaxima quits [Read error: Connection reset by peer]
06:29:49		sec^nd quits [Remote host closed the connection]
06:30:19		sec^nd (second) joins
06:30:24		Arachnophine quits [Remote host closed the connection]
06:31:59		Arachnophine (Arachnophine) joins
06:34:09		Island quits [Read error: Connection reset by peer]
06:46:31		ave quits [Read error: Connection reset by peer]
06:46:34		igloo222252 joins
06:46:38		lun42 (lun4) joins
06:46:52		ave (ave) joins
06:48:20		lun4 quits [Ping timeout: 252 seconds]
06:48:20		lun42 is now known as lun4
06:48:42		igloo22225 quits [Ping timeout: 252 seconds]
06:48:42		igloo222252 is now known as igloo22225
06:52:09		nicolas17 joins
06:53:55		nicolas17 quits [Client Quit]
07:08:07		parfait quits [Client Quit]
07:28:17	<h2ibot>	PaulWise edited ArchiveBot/CVE (+692, add a few issues): https://wiki.archiveteam.org/?diff=49798&oldid=49796
07:28:33	<pabs>	todb: ^
07:29:17	<h2ibot>	PaulWise edited ArchiveBot/CVE (-9, drop http): https://wiki.archiveteam.org/?diff=49799&oldid=49798
07:30:17	<h2ibot>	PaulWise edited ArchiveBot/CVE (+28, reference URLs project): https://wiki.archiveteam.org/?diff=49800&oldid=49799
07:31:17	<h2ibot>	PaulWise edited ArchiveBot/CVE (+69, mention AB job): https://wiki.archiveteam.org/?diff=49801&oldid=49800
07:31:50		ave quits [Client Quit]
07:32:09		ave (ave) joins
07:45:57		Arcorann (Arcorann) joins
07:48:11		decky_e quits [Read error: Connection reset by peer]
07:48:53		Somebody2 quits [Ping timeout: 265 seconds]
08:02:04		Ivan226 joins
11:22:47		AnotherIki joins
11:26:26		Iki1 quits [Ping timeout: 252 seconds]
11:31:30		Somebody2 joins
11:46:27	<todb>	Ty for minding that and noting the failures, pabs !
12:01:11	<JTL>	Am I misremembering things, or did I see a project somewhere (might've been HN or Reddit) that provided an interface to do semi automated bulk archival of YouTube channels using the usual projects? (youtube-dl/yt-dlp)
12:01:13		M--mlv|m joins
12:03:37	<pabs>	there is #down-the-tube (things end up in archive.org) and #youtubearchive (things end up in a private archive IIRC)
12:03:56	<JTL>	Not quite what I was looking for, but thanks
12:04:20	<JTL>	well, should've asked in -ot because I was referring to a frontend type project, not AT, sorry!
12:10:28		sonick quits [Client Quit]
12:12:58	<pabs>	ah ok
12:41:40		icedice (icedice) joins
12:45:04		that_lurker quits [Client Quit]
12:45:48		that_lurker (that_lurker) joins
13:19:36		dumbgoy joins
13:41:10	<todb>	pabs: so how do I link archivebot job 10p5imv06zrzlrxm7wcneyhw0 to something searchable on https://archive.fart.website/archivebot/viewer/ ? The JobID syntax is different there, so I'm not sure how to discover it there.
13:46:46		ThetaDev joins
14:02:04	<pabs>	todb: the job isn't completed yet so it won't show up, but when it is, you can truncate the jobid and look it up on the viewer
14:17:45	<todb>	pabs: ah gotcha. So how do you do things like notice those marc.info errors? I can see the job scrolling on http://archivebot.com/ but it's not obvious to me how you see those errors. If you don't mind me asking.
14:21:50		Arcorann quits [Ping timeout: 265 seconds]
14:23:41		sonick (sonick) joins
14:26:28	<pabs>	todb: I saw it in the scrolling, I guess it got past marc.info by now
14:26:55	<pabs>	and you can get the full log from the meta warc after it is done
14:29:35	<todb>	ah ok. So if you happen to see it on the scroll, you're good, otherwise you have to wait until the end
14:37:19	<pabs>	yep
14:44:36		hitgrr8 joins
14:53:37		michaelblob_ (michaelblob) joins
14:54:53		michaelblob quits [Ping timeout: 252 seconds]
15:49:11		Island joins
15:56:18		Icyelut quits [Read error: Connection reset by peer]
15:56:37		Icyelut (Icyelut) joins
16:01:04		decky_e (decky_e) joins
16:10:46	<h2ibot>	Entartet edited List of websites excluded from the Wayback Machine (+19, Added krot.org.): https://wiki.archiveteam.org/?diff=49802&oldid=49727
17:00:20		killsushi joins
17:21:26		nicolas17 joins
17:25:38		spirit joins
17:35:30		decky_e quits [Read error: Connection reset by peer]
17:36:01		decky_e (decky_e) joins
18:23:57	<lennier1>	JAA: My abandoned verified users Twitter scrape is complete. Format current_username, old_username, user_id. https://transfer.archivete.am/rXjVK/twitter_legacy_verified_abandoned_users_cutoff_2022-06-01.txt
18:24:08	<lennier1>	The list must have had accounts verified earlier at the end, because the rate went up as it got further down the list. I found 33938 users using the tweet cutoff date 2022-06-01. I saved the .json output as well, so could get a different cutoff date in the future. A year is pretty arbitrary. The Twitter TOS recommends logging in every 30 days. Who knows what they'll actually do.
18:24:16	<lennier1>	Additionally, rocketdive requested 10 users that didn't show up in my list because they were unverified or had tweeted somewhat later than the cutoff date. Some are more prominent than others. https://transfer.archivete.am/8TVfu/twitter_unverified_abandoned_users_from_rocketdive.txt
18:27:40	<icedice>	JAA: Were Imgur links on pokemon-trainer.com grabbed? It's a tiny forum, so they might have been. If not, that WARC would be nice to scrape after Bulbagarden Forums and Serebii Forums if there's time
18:32:15	<spirit>	https://github.com/ArchiveTeam/ArchiveBot/blob/master/INSTALL.pipeline says to mail yipdw, could someone privately msg me their mail address? i dont see them in the irc channels anymore
18:33:05	<pokechu22>	Pipelines from other people generally aren't accepted nowadays because of complications with setup and stuff
18:33:15	<pokechu22>	JAA can probably explain better
18:34:24	<spirit>	ah
18:34:42	<spirit>	i should have asked before ordering and trying to setup a server...
18:34:50	<spirit>	seems broken with python 3.9 anyways
18:37:14	<xkey>	have you seen this https://twitter.com/elonmusk/status/1659255118196355073
18:37:27	<xkey>	textfiles quite angry obv https://twitter.com/textfiles/status/1659259005888282624
18:40:23	<pokechu22>	Yeah, unfortunately there's a lot of legacy stuff with it (and very outdated/missing documentation) :|
18:56:49	<pokechu22>	spirit: I don't think you actually ever mentioned what site you were interested in having archived via archivebot
18:57:34	<spirit>	it's https://www.artdoxa.com/ , one of the operators contacted me
18:57:56	<spirit>	i should have good ignore rules ready but it will be more than 500G for sure, maybe 1TB
18:58:18	<spirit>	should be something like: --no-offsite-links --large --explain "Requested by operators before shutdown, contact via spirit@irc"
18:58:21	<spirit>	!ig TODO \?(random|return_to)=
18:58:25	<spirit>	!ig TODO \/(search|recent_activities)\?
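
A minimal sketch of how the two ignore patterns above would behave, assuming they are applied as unanchored regex searches against each URL (an assumption about the matching, not something confirmed in this log) and that the alternation is a plain `|`. The helper `is_ignored` and the `/search?q=example` URL are hypothetical; the other URLs are quoted elsewhere in this log.

```python
import re

# Patterns from spirit's two !ig lines (TODO there stands in for the job ID).
IGNORE_PATTERNS = [
    r"\?(random|return_to)=",
    r"\/(search|recent_activities)\?",
]


def is_ignored(url: str) -> bool:
    """Return True if any ignore pattern matches anywhere in the URL."""
    return any(re.search(pattern, url) for pattern in IGNORE_PATTERNS)


# Login redirect carrying return_to: caught by the first pattern.
login_url = "https://www.artdoxa.com/session/new?return_to=%2Fusers%2F10972%2Fartworks%2F203664"
# Hypothetical search URL: caught by the second pattern.
search_url = "https://www.artdoxa.com/search?q=example"
# Ordinary artwork page and listing page: neither pattern applies.
artwork_url = "https://www.artdoxa.com/users/matl1/artworks/203685"
listing_url = "https://artdoxa.com/artworks?sort=most_recent"

for url in (login_url, search_url, artwork_url, listing_url):
    print(is_ignored(url), url)
```

Under these assumptions the `/session/new?return_to=...` links are skipped by the first pattern alone, without a separate rule for `/session/new`.
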
18:58:57	<pokechu22>	Oh yeah, JAA said this after you went offline last time: 20:46 <@JAA> spirit, pokechu22: Total size rarely matters; number of URLs and size of the largest files does.
18:59:07	<spirit>	ah nice
18:59:14		Megame (Megame) joins
18:59:18	<spirit>	let me calculate some estimate for the URLs
18:59:34	<spirit>	the biggest file should be below 100MB unless someone managed to upload a huuuuge image :)
19:00:34	<pokechu22>	Yeah, that should definitely be safe - things pause if the overall disk space gets below 5GB which is generally safe unless there's a bunch of things downloading somewhat big files at once or a single thing tries to download a multi-gigabyte file
19:03:16	<h2ibot>	Pokechu22 edited Deathwatch (+52, /* 2023 */ ARTDOXA): https://wiki.archiveteam.org/?diff=49803&oldid=49795
19:03:29	<pokechu22>	When not signed in https://www.artdoxa.com/Thomas_Sowa/large?page=1#203664 gives e.g. https://www.artdoxa.com/session/new?return_to=%2Fusers%2F10972%2Fartworks%2F203664 for "add tags", probably we want to ignore /session/new in general
19:03:49	<pokechu22>	and /users/new, too, it seems
19:04:31	<spirit>	my first ignore regex matches those
19:04:51	<pokechu22>	ah, via return_to, that works too
19:05:24	<pokechu22>	I also see that the pages embed https://artdoxa-images.s3.amazonaws.com/uploads/artwork/image/203683/normal_bild_1117.jpg but link to https://artdoxa-images.s3.amazonaws.com/uploads/artwork/image/203683/watermark_bild_1117.jpg -- probably we don't want --no-offsite-links in that case if we want to save the full sized ones (which seems reasonable)
19:05:38	<spirit>	oh it needs https://artdoxa-images.s3.amazonaws.com
19:05:42	<spirit>	haha, same thought
19:05:45	<spirit>	there are ~150,000 artworks, i think there are 4 image sizes so ~600,000, surely <1,000,000 for images
19:06:42	<spirit>	there are ~6,000 users, who have a) avatar images in several sizes, that's maybe 25000 URLs and b) favorite pages which might be 5 pages on average maybe?, that's maybe 30000 URLs. So for that another maybe 100,000 URLs
19:07:21	<spirit>	some artists have catalog pages but that should also be a lowish number, surely no 100,000 URLs
19:07:37	<spirit>	i would be surprised if it is more than 2 million URLs in total
19:08:17	<spirit>	forgot artwork and contact pages for artists, maybe another 100,000 in total
19:09:02	<spirit>	maybe twice that for thumbnail and large image previews in pagination
19:10:16	<pokechu22>	Yeah, that should be fine overall, we've definitely done much bigger. And the images being hosted on amazon should mean that things will be fairly stable
19:12:13	<pokechu22>	Where do you get ~150,000 artworks? The most recent stuff on the main page gives https://www.artdoxa.com/users/matl1/artworks/203685, https://www.artdoxa.com/users/matl1/artworks/203684 which would make me think there's 203685 (but the one after that is https://www.artdoxa.com/users/johann_i/artworks/174344 so maybe it's not as incremental as it appears?)
19:13:28	<spirit>	40 images per page on https://artdoxa.com/artworks?sort=most_recent with 3268 pages
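
A rough sketch of the arithmetic behind that estimate, using only figures quoted in this log: 40 artworks per listing page, 3268 listing pages, and spirit's guess of 4 image sizes per artwork. The variable names are illustrative, and the result is an upper bound that assumes every listing page is full.

```python
# Figures quoted in the channel; the 4 image sizes are spirit's guess.
ARTWORKS_PER_PAGE = 40
LISTING_PAGES = 3268
IMAGE_SIZES_PER_ARTWORK = 4

# Upper bound on artworks, assuming every listing page is full.
max_artworks = ARTWORKS_PER_PAGE * LISTING_PAGES

# Upper bound on image URLs if each artwork exists in every size.
max_image_urls = max_artworks * IMAGE_SIZES_PER_ARTWORK

print(f"artworks   <= {max_artworks:,}")
print(f"image URLs <= {max_image_urls:,}")
```

This comes out somewhat below the rounded ~150,000 artworks and ~600,000 image URLs mentioned earlier, and well under the 1,000,000 ceiling spirit gave for images.
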
19:13:58	<spirit>	phew the site is slow atm
19:14:53	<pokechu22>	Interesting, https://artdoxa.com/hanskostercom/large?page=1#203643 is a YouTube video instead
19:15:48	<spirit>	yeah i found some but i'd say they are not too important
19:16:18	<spirit>	if it's easy to include them, nice, but if not, no issue
19:17:04	<pokechu22>	Looking at https://artdoxa.com/pages/imprint it seems like https://www.artdoxa.com/ is probably preferable over https://artdoxa.com/ (both serve the same content and use relative links, but the imprint links to the www version)
19:21:37	<spirit>	yeah that would be the canonical one
19:28:16		Somebody2 quits [Ping timeout: 265 seconds]
19:35:02		cascode quits [Ping timeout: 252 seconds]
19:35:51		cascode joins
20:08:18		cascode quits [Read error: Connection reset by peer]
20:08:22	<@JAA>	spirit: Re AB pipelines: https://wiki.archiveteam.org/index.php/ArchiveBot#Volunteer_to_run_a_Pipeline And yes, only up to Py 3.6 supported. But thanks for pointing out INSTALL.pipeline; that doesn't belong there.
20:08:40		cascode joins
20:08:57	<@JAA>	I've been trying to separate 'ArchiveBot the software' from 'ArchiveBot the AT instance of the software'.
20:17:11		Mateon1 quits [Ping timeout: 252 seconds]
20:18:17		Island_ joins
20:18:24		sonick quits [Client Quit]
20:18:50		Island quits [Ping timeout: 252 seconds]
20:19:03		Mateon1 joins
20:32:14	<spirit>	JAA: cheers! i'd be happy to donate a server for a month to get this site archived but paying for longer would not be a good plan for me. should i or someone else just throw this job into the existing archivebot "pool" with --large?
20:35:23	<pokechu22>	I can probably just throw it in. --large isn't actually a thing anymore, but I can manually pick a pipeline that works for it. I'll probably do it in 30 minutes-an hour, eating right now
20:35:43	<@JAA>	spirit: Yeah, doesn't make sense to add a server for a month. --large should simply be purged; it hasn't been used in years.
20:35:57	<@JAA>	AB is down right now though due to an issue with the control node.
20:41:11	<spirit>	=)
20:41:23	<spirit>	pokechu22: thanks!
20:41:49	<spirit>	no rush, it will be up for a month at least
21:07:12		decky_e quits [Read error: Connection reset by peer]
21:07:40		decky_e (decky_e) joins
21:19:37		hitgrr8 quits [Client Quit]
21:28:20		cascode quits [Read error: Connection reset by peer]
21:35:41		Somebody2 joins
21:54:14		nicolas17 quits [Ping timeout: 265 seconds]
21:56:07	<Barto>	https://twitter.com/elonmusk/status/1659255118196355073 lol btw
21:56:48	<Barto>	complaining about deleting the past when he's doing everything in his power to screw snscrape and delete twitter accounts :p
22:21:29		Megame quits [Ping timeout: 252 seconds]
22:23:39		MrTumnus joins
22:30:27		MrTumnus quits [Client Quit]
22:30:53		MrTumnus joins
22:54:08		killsushi quits [Ping timeout: 252 seconds]
22:58:32		decky_e quits [Ping timeout: 252 seconds]
22:59:10		decky_e (decky_e) joins
23:02:52		MrTumnus quits [Ping timeout: 265 seconds]
23:05:48		MrTumnus joins
23:07:59		MrTumnus quits [Client Quit]
23:26:56		Dango360 quits [Ping timeout: 252 seconds]
23:31:31		Dango360 (Dango360) joins
23:34:17	<andrew>	What the fuck is going on with Elon/the alt-right/Internet Archive now
23:34:26	<andrew>	I can't even
23:52:10		Arcorann (Arcorann) joins
23:54:30		nicolas17 joins