00:10:11eroc1990 (eroc1990) joins
00:13:35<pabs>JAA: how does one use socialbot to grab tweet threads? I looked at the code and didn't see a command for it
00:14:16<@JAA>pabs: socialbot doesn't support it.
00:14:56<@JAA>It only implements a subset of snscrape's scrapers.
00:15:41<pabs>ah, so how would we do twitter threads then?
00:16:11<pabs>these ones in particular https://transfer.archivete.am/EZdNi/cve-twitter-refs.txt
00:16:29<pabs>just AB?
00:17:33<@JAA>Not really possible I guess. AB doesn't grab tweets correctly, and only socialbot feeds into the background thing that does.
00:19:04<pabs>we/I/someone could add a twitter-thread < command though?
00:20:32<pabs>or we can just twitter-profile every user in the list and hope for the best :)
00:21:47<pabs>todb: cve-refs.txt AB job running now. look at archivebot.com or archivebot.com/3 to watch the progress
00:22:20<@JAA>I don't think a recursive tweet list command in socialbot is a good idea. Recursive scrapes can take a very long time, and lists make that worse. socialbot isn't really meant for that.
00:24:25<pabs>ok. I guess twitter-profile is the only option then
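The fallback pabs settles on — running twitter-profile for every account in the list — needs the unique usernames extracted from the tweet URLs. A minimal sketch of that step (the URLs and usernames here are made up; the real list is in the transfer.archivete.am paste above):

```python
import re

def usernames_from_tweet_urls(urls):
    """Extract the unique usernames from twitter.com status URLs,
    preserving first-seen order."""
    seen = []
    for url in urls:
        m = re.match(r"https?://(?:www\.)?twitter\.com/([^/]+)/status/\d+", url)
        if m and m.group(1) not in seen:
            seen.append(m.group(1))
    return seen

urls = [
    "https://twitter.com/alice/status/123",
    "https://twitter.com/bob/status/456",
    "https://twitter.com/alice/status/789",
]
print(usernames_from_tweet_urls(urls))  # → ['alice', 'bob']
```

Each resulting username would then be fed to socialbot's twitter-profile command, with no guarantee that every tweet in the threads gets picked up.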
00:31:44tbc1887 (tbc1887) joins
00:40:03tbc1887_ joins
00:41:05eroc1990 quits [Client Quit]
00:42:06tbc1887 quits [Ping timeout: 265 seconds]
00:48:24eroc1990 (eroc1990) joins
00:50:13birdjj quits [Client Quit]
00:50:19Somebody2 quits [Ping timeout: 265 seconds]
00:50:30birdjj joins
00:57:47cascode quits [Ping timeout: 252 seconds]
01:00:13birdjj quits [Read error: Connection reset by peer]
01:00:18birdjj joins
01:00:49cascode joins
01:02:49birdjj3 joins
01:02:57birdjj quits [Read error: Connection reset by peer]
01:02:57birdjj3 is now known as birdjj
01:04:10Somebody2 joins
01:18:20cascode quits [Ping timeout: 252 seconds]
01:19:28cascode joins
01:44:33icedice quits [Client Quit]
01:51:55needlenose quits [Quit: Leaving]
02:05:20cascode quits [Read error: Connection reset by peer]
02:05:33cascode joins
02:10:02cascode quits [Ping timeout: 252 seconds]
02:11:05cascode joins
02:20:49<andrew>I finally got grab-site running and wow it's quite a bit faster than wget
02:25:03cascode quits [Ping timeout: 265 seconds]
02:25:17cascode joins
02:31:12<andrew>how many millions of pages is too many pages for grab-site?
02:37:51<fireonlive>https://i.postimg.cc/8cRLSG9h/image.png hopefully not 1
02:37:57<fireonlive>though I'm only getting like 3 r/s :(
02:41:36cascode quits [Read error: Connection reset by peer]
02:41:53cascode joins
02:43:38<fireonlive>oops, forgot to lower concurrency
02:56:12<andrew>hopefully it can handle 60 million pages (my rough estimate for this crawl)
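At the request rates mentioned in this exchange, andrew's 60-million-page estimate implies a very long crawl; a quick back-of-envelope check (rates taken from the conversation):

```python
pages = 60_000_000   # andrew's rough estimate for the crawl
rate = 4.5           # requests/second andrew reports from grab-site
seconds_per_day = 86_400

days = pages / rate / seconds_per_day
print(f"~{days:.0f} days")  # → ~154 days
```

So at a steady 4.5 r/s the crawl would run for roughly five months, which is why the server-dependent rate variation matters.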
03:06:08dumbgoy quits [Ping timeout: 265 seconds]
03:11:35<fireonlive>what's your r/s abouts?
03:12:12<andrew>fireonlive: it says 4.5 r/s
03:12:17<andrew>at 6 concurrency
03:12:19<fireonlive>oh, not too bad
03:12:28<andrew>but it wildly varies because the server isn't particularly fast on some pages
03:12:30<fireonlive>i'm at 3.9 with 5 atm
03:12:40<fireonlive>ye i guess it depends on server, backend, etc
03:12:46<fireonlive>grab-site is very neat!
03:13:15<fireonlive>also uh thanks to everyone here for not minding the stuff i add in lol
03:13:29<fireonlive>or "waste my time with" to add
03:17:30<fireonlive>oh i thought this was imgone
03:26:00pikablu joins
03:28:59pikablu quits [Client Quit]
04:21:18<todb>pabs: thanks so much for running the AB job!
04:26:03<pabs>todb: once the job completes, you will be able to look at what URLs worked/didn't by finding the job on https://archive.fart.website/archivebot/viewer/
04:45:23decay quits []
04:58:20cascode quits [Read error: Connection reset by peer]
04:59:09cascode joins
05:09:40Jake quits [Quit: Leaving for a bit!]
05:10:06Jake (Jake) joins
05:15:40Jake quits [Client Quit]
05:16:00Jake (Jake) joins
05:19:41lexikiq quits [Client Quit]
05:20:20Jake quits [Client Quit]
05:24:01Jake (Jake) joins
05:25:03decky_e quits [Remote host closed the connection]
05:25:20decky_e joins
05:27:36Jake quits [Client Quit]
05:28:10Jake (Jake) joins
06:08:44MrTumnus quits [Ping timeout: 252 seconds]
06:15:22parfait (kdqep) joins
06:23:49BlueMaxima quits [Read error: Connection reset by peer]
06:29:49sec^nd quits [Remote host closed the connection]
06:30:19sec^nd (second) joins
06:30:24Arachnophine quits [Remote host closed the connection]
06:31:59Arachnophine (Arachnophine) joins
06:34:09Island quits [Read error: Connection reset by peer]
06:46:31ave quits [Read error: Connection reset by peer]
06:46:34igloo222252 joins
06:46:38lun42 (lun4) joins
06:46:52ave (ave) joins
06:48:20lun4 quits [Ping timeout: 252 seconds]
06:48:20lun42 is now known as lun4
06:48:42igloo22225 quits [Ping timeout: 252 seconds]
06:48:42igloo222252 is now known as igloo22225
06:52:09nicolas17 joins
06:53:55nicolas17 quits [Client Quit]
07:08:07parfait quits [Client Quit]
07:28:17<h2ibot>PaulWise edited ArchiveBot/CVE (+692, add a few issues): https://wiki.archiveteam.org/?diff=49798&oldid=49796
07:28:33<pabs>todb: ^
07:29:17<h2ibot>PaulWise edited ArchiveBot/CVE (-9, drop http): https://wiki.archiveteam.org/?diff=49799&oldid=49798
07:30:17<h2ibot>PaulWise edited ArchiveBot/CVE (+28, reference URLs project): https://wiki.archiveteam.org/?diff=49800&oldid=49799
07:31:17<h2ibot>PaulWise edited ArchiveBot/CVE (+69, mention AB job): https://wiki.archiveteam.org/?diff=49801&oldid=49800
07:31:50ave quits [Client Quit]
07:32:09ave (ave) joins
07:45:57Arcorann (Arcorann) joins
07:48:11decky_e quits [Read error: Connection reset by peer]
07:48:53Somebody2 quits [Ping timeout: 265 seconds]
08:02:04Ivan226 joins
11:22:47AnotherIki joins
11:26:26Iki1 quits [Ping timeout: 252 seconds]
11:31:30Somebody2 joins
11:46:27<todb>Ty for minding that and noting the failures, pabs !
12:01:11<JTL>Am I misremembering things, or did I see a project somewhere (might've been HN or Reddit) that provided an interface to do semi automated bulk archival of YouTube channels using the usual projects? (youtube-dl/yt-dlp)
12:01:13M--mlv|m joins
12:03:37<pabs>there is #down-the-tube (things end up in archive.org) and #youtubearchive (things end up in a private archive IIRC)
12:03:56<JTL>Not quite what I was looking for, but thanks
12:04:20<JTL>well, should've asked in -ot because I was referring to a frontend type project, not AT, sorry!
12:10:28sonick quits [Client Quit]
12:12:58<pabs>ah ok
12:41:40icedice (icedice) joins
12:45:04that_lurker quits [Client Quit]
12:45:48that_lurker (that_lurker) joins
13:19:36dumbgoy joins
13:41:10<todb>pabs: so how do I link archivebot job 10p5imv06zrzlrxm7wcneyhw0 to something searchable on https://archive.fart.website/archivebot/viewer/ ? The JobID syntax is different there, so I'm not sure how to discover it there.
13:46:46ThetaDev joins
14:02:04<pabs>todb: the job isn't completed yet so it won't show up, but when it is, you can truncate the jobid and look it up on the viewer
14:17:45<todb>pabs: ah gotcha. So how do you do things like notice those marc.info errors? I can see the job scrolling on http://archivebot.com/ but it's not obvious to me how you see those errors. If you don't mind me asking.
14:21:50Arcorann quits [Ping timeout: 265 seconds]
14:23:41sonick (sonick) joins
14:26:28<pabs>todb: I saw it in the scrolling, I guess it got past marc.info by now
14:26:55<pabs>and you can get the full log from the meta warc after it is done
14:29:35<todb>ah ok. So if you happen to see it on the scroll, you're good, otherwise you have to wait until the end
14:37:19<pabs>yep
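pabs's earlier tip — "truncate the jobid and look it up on the viewer" — can be sketched as follows. The viewer's filenames embed a shortened form of the job id; the prefix length used here is an assumption, not something confirmed in the log:

```python
def viewer_search_key(job_id, prefix_len=5):
    """Return a shortened job id to search for on the ArchiveBot viewer.

    prefix_len=5 is an assumed truncation length; adjust it if the
    viewer's filenames use a different one.
    """
    return job_id[:prefix_len]

# todb's job from earlier in the log:
print(viewer_search_key("10p5imv06zrzlrxm7wcneyhw0"))  # → 10p5i
```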
14:44:36hitgrr8 joins
14:53:37michaelblob_ (michaelblob) joins
14:54:53michaelblob quits [Ping timeout: 252 seconds]
15:49:11Island joins
15:56:18Icyelut quits [Read error: Connection reset by peer]
15:56:37Icyelut (Icyelut) joins
16:01:04decky_e (decky_e) joins
16:10:46<h2ibot>Entartet edited List of websites excluded from the Wayback Machine (+19, Added krot.org.): https://wiki.archiveteam.org/?diff=49802&oldid=49727
17:00:20killsushi joins
17:21:26nicolas17 joins
17:25:38spirit joins
17:35:30decky_e quits [Read error: Connection reset by peer]
17:36:01decky_e (decky_e) joins
18:23:57<lennier1>JAA: My abandoned verified users Twitter scrape is complete. Format current_username, old_username, user_id. https://transfer.archivete.am/rXjVK/twitter_legacy_verified_abandoned_users_cutoff_2022-06-01.txt
18:24:08<lennier1>The list must have had accounts verified earlier at the end, because the rate went up as it got further down the list. I found 33938 users using the tweet cutoff date 2022-06-01. I saved the .json output as well, so could get a different cutoff date in the future. A year is pretty arbitrary. The Twitter TOS recommends logging in every 30 days. Who knows what they'll actually do.
18:24:16<lennier1>Additionally, rocketdive requested 10 users that didn't show up in my list because they were unverified or had tweeted somewhat later than the cutoff date. Some are more prominent than others. https://transfer.archivete.am/8TVfu/twitter_unverified_abandoned_users_from_rocketdive.txt
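Since lennier1 kept the .json output, re-filtering with a different cutoff date later is straightforward. A sketch of that re-filter step — the record shape (a last-tweet date alongside the username fields) is an assumption, and the sample data is invented:

```python
from datetime import date

# Hypothetical records, roughly matching lennier1's format of
# current_username, old_username, user_id, plus a last-tweet date.
records = [
    {"current_username": "a", "old_username": "x", "user_id": 1, "last_tweet": "2022-03-01"},
    {"current_username": "b", "old_username": "y", "user_id": 2, "last_tweet": "2022-09-15"},
]

def abandoned_before(records, cutoff):
    """Return records whose last tweet predates the cutoff date."""
    return [r for r in records if date.fromisoformat(r["last_tweet"]) < cutoff]

result = abandoned_before(records, date(2022, 6, 1))
print([r["current_username"] for r in result])  # → ['a']
```

Moving the cutoff (say, to match Twitter's 30-day login recommendation) only changes the `date(...)` argument.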
18:27:40<icedice>JAA: Were Imgur links on pokemon-trainer.com grabbed? It's a tiny forum, so they might have been. If not, that WARC would be nice to scrape after Bulbagarden Forums and Serebii Forums if there's time
18:32:15<spirit>https://github.com/ArchiveTeam/ArchiveBot/blob/master/INSTALL.pipeline says to mail yipdw, could someone privately msg me their mail address? i dont see them in the irc channels anymore
18:33:05<pokechu22>Pipelines from other people generally aren't accepted nowadays because of complications with setup and stuff
18:33:15<pokechu22>JAA can probably explain better
18:34:24<spirit>ah
18:34:42<spirit>i should have asked before ordering and trying to setup a server...
18:34:50<spirit>seems broken with python 3.9 anyways
18:37:14<xkey>have you seen this https://twitter.com/elonmusk/status/1659255118196355073
18:37:27<xkey>textfiles quite angry obv https://twitter.com/textfiles/status/1659259005888282624
18:40:23<pokechu22>Yeah, unfortunately there's a lot of legacy stuff with it (and very outdated/missing documentation) :|
18:56:49<pokechu22>spirit: I don't think you actually ever mentioned what site you were interested in having archived via archivebot
18:57:34<spirit>it's https://www.artdoxa.com/ , one of the operators contacted me
18:57:56<spirit>i should have good ignore rules ready but it will be more than 500G for sure, maybe 1TB
18:58:18<spirit>should be something like: --no-offsite-links --large --explain "Requested by operators before shutdown, contact via spirit@irc"
18:58:21<spirit>!ig TODO \?(random|return_to)=
18:58:25<spirit>!ig TODO \/(search|recent_activities)\?
18:58:57<pokechu22>Oh yeah, JAA said this after you went offline last time: 20:46 <@JAA> spirit, pokechu22: Total size rarely matters; number of URLs and size of the largest files does.
18:59:07<spirit>ah nice
18:59:14Megame (Megame) joins
18:59:18<spirit>let me calculate some estimate for the URLs
18:59:34<spirit>the biggest file should be below 100MB unless someone managed to upload a huuuuge image :)
19:00:34<pokechu22>Yeah, that should definitely be safe - things pause if the overall disk space gets below 5GB which is generally safe unless there's a bunch of things downloading somewhat big files at once or a single thing tries to download a multi-gigabyte file
19:03:16<h2ibot>Pokechu22 edited Deathwatch (+52, /* 2023 */ ARTDOXA): https://wiki.archiveteam.org/?diff=49803&oldid=49795
19:03:29<pokechu22>When not signed in https://www.artdoxa.com/Thomas_Sowa/large?page=1#203664 gives e.g. https://www.artdoxa.com/session/new?return_to=%2Fusers%2F10972%2Fartworks%2F203664 for "add tags", probably we want to ignore /session/new in general
19:03:49<pokechu22>and /users/new, too, it seems
19:04:31<spirit>my first ignore regex matches those
19:04:51<pokechu22>ah, via return_to, that works too
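The two ignore patterns spirit proposed can be sanity-checked against the URLs pokechu22 found. A quick sketch (the second sample URL is from the log; the check itself mirrors how ignore regexes are matched against URLs, though ArchiveBot's actual matching internals are not shown here):

```python
import re

# spirit's proposed ignore patterns from the !ig lines above
ignores = [
    r"\?(random|return_to)=",
    r"\/(search|recent_activities)\?",
]

def is_ignored(url):
    """True if any ignore pattern matches somewhere in the URL."""
    return any(re.search(p, url) for p in ignores)

# The login redirect is caught by the first pattern, via return_to:
print(is_ignored("https://www.artdoxa.com/session/new?return_to=%2Fusers%2F10972"))  # → True
# Normal listing pages are left alone:
print(is_ignored("https://artdoxa.com/artworks?sort=most_recent"))                   # → False
```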
19:05:24<pokechu22>I also see that the pages embed https://artdoxa-images.s3.amazonaws.com/uploads/artwork/image/203683/normal_bild_1117.jpg but link to https://artdoxa-images.s3.amazonaws.com/uploads/artwork/image/203683/watermark_bild_1117.jpg -- probably we don't want --no-offsite-links in that case if we want to save the full sized ones (which seems reasonable)
19:05:38<spirit>oh it needs https://artdoxa-images.s3.amazonaws.com
19:05:42<spirit>haha, same thought
19:05:45<spirit>there are ~150,000 artworks, i think there are 4 image sizes so ~600,000, surely <1,000,000 for images
19:06:42<spirit>there are ~6,000 users, who have a) avatar images in several sizes, that's maybe 25000 URLs and b) favorite pages which might be 5 pages on average maybe?, that's maybe 30000 URLs. So for that another maybe 100,000 URLs
19:07:21<spirit>some artists have catalog pages but that should also be a lowish number, surely no 100,000 URLs
19:07:37<spirit>i would be surprised if it is more than 2 million URLs in total
19:08:17<spirit>forgot artwork and contact pages for artists, maybe another 100,000 in total
19:09:02<spirit>maybe twice that for thumbnail and large image previews in pagination
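spirit's running URL estimate adds up as follows; the figures are the rough ones from the conversation (avatar sizes and per-user favourite-page counts are spirit's guesses, not measured values):

```python
artworks      = 150_000
image_urls    = artworks * 4       # ~4 image sizes per artwork
users         = 6_000
avatar_urls   = users * 4          # avatars in several sizes (assumed ~4)
favorite_urls = users * 5          # ~5 favourite pages per user on average
artist_pages  = 100_000            # catalog, artwork, and contact pages, rough
pagination    = 2 * artist_pages   # thumbnail + large previews of listing pages

total = image_urls + avatar_urls + favorite_urls + artist_pages + pagination
print(total)  # → 954000
```

That lands under one million, comfortably inside spirit's "surprised if more than 2 million" ceiling.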
19:10:16<pokechu22>Yeah, that should be fine overall, we've definitely done much bigger. And the images being hosted on amazon should mean that things will be fairly stable
19:12:13<pokechu22>Where do you get ~150,000 artworks? The most recent stuff on the main page gives https://www.artdoxa.com/users/matl1/artworks/203685, https://www.artdoxa.com/users/matl1/artworks/203684 which would make me think there's 203685 (but the one after that is https://www.artdoxa.com/users/johann_i/artworks/174344 so maybe it's not as incremental as it appears?)
19:13:28<spirit>40 image per page on https://artdoxa.com/artworks?sort=most_recent with 3268 pages
19:13:58<spirit>phew the site is slow atm
19:14:53<pokechu22>Interesting, https://artdoxa.com/hanskostercom/large?page=1#203643 is a YouTube video instead
19:15:48<spirit>yeah i found some but i'd say they are not too important
19:16:18<spirit>if it's easy to include them, nice, but if not, no issue
19:17:04<pokechu22>Looking at https://artdoxa.com/pages/imprint it seems like https://www.artdoxa.com/ is probably preferable over https://artdoxa.com/ (both serve the same content and use relative links, but the imprint links to the www version)
19:21:37<spirit>yeah that would be the canonical one
19:28:16Somebody2 quits [Ping timeout: 265 seconds]
19:35:02cascode quits [Ping timeout: 252 seconds]
19:35:51cascode joins
20:08:18cascode quits [Read error: Connection reset by peer]
20:08:22<@JAA>spirit: Re AB pipelines: https://wiki.archiveteam.org/index.php/ArchiveBot#Volunteer_to_run_a_Pipeline And yes, only up to Py 3.6 supported. But thanks for pointing out INSTALL.pipeline; that doesn't belong there.
20:08:40cascode joins
20:08:57<@JAA>I've been trying to separate 'ArchiveBot the software' from 'ArchiveBot the AT instance of the software'.
20:17:11Mateon1 quits [Ping timeout: 252 seconds]
20:18:17Island_ joins
20:18:24sonick quits [Client Quit]
20:18:50Island quits [Ping timeout: 252 seconds]
20:19:03Mateon1 joins
20:32:14<spirit>JAA: cheers! i'd be happy to donate a server for a month to get this site archived but paying for longer would not be a good plan for me. should i or someone else just throw this job into the existing archivebot "pool" with --large?
20:35:23<pokechu22>I can probably just throw it in. --large isn't actually a thing anymore, but I can manually pick a pipeline that works for it. I'll probably do it in 30 minutes-an hour, eating right now
20:35:43<@JAA>spirit: Yeah, doesn't make sense to add a server for a month. --large should simply be purged; it hasn't been used in years.
20:35:57<@JAA>AB is down right now though due to an issue with the control node.
20:41:11<spirit>=)
20:41:23<spirit>pokechu22: thanks!
20:41:49<spirit>no rush, it will be up for a month at least
21:07:12decky_e quits [Read error: Connection reset by peer]
21:07:40decky_e (decky_e) joins
21:19:37hitgrr8 quits [Client Quit]
21:28:20cascode quits [Read error: Connection reset by peer]
21:35:41Somebody2 joins
21:54:14nicolas17 quits [Ping timeout: 265 seconds]
21:56:07<Barto>https://twitter.com/elonmusk/status/1659255118196355073 lol btw
21:56:48<Barto>complaining about deleting the past when he's doing everything in his power to screw snscrape and delete twitter accounts :p
22:21:29Megame quits [Ping timeout: 252 seconds]
22:23:39MrTumnus joins
22:30:27MrTumnus quits [Client Quit]
22:30:53MrTumnus joins
22:54:08killsushi quits [Ping timeout: 252 seconds]
22:58:32decky_e quits [Ping timeout: 252 seconds]
22:59:10decky_e (decky_e) joins
23:02:52MrTumnus quits [Ping timeout: 265 seconds]
23:05:48MrTumnus joins
23:07:59MrTumnus quits [Client Quit]
23:26:56Dango360 quits [Ping timeout: 252 seconds]
23:31:31Dango360 (Dango360) joins
23:34:17<andrew>What the fuck is going on with Elon/the alt-right/Internet Archive now
23:34:26<andrew>I can't even
23:52:10Arcorann (Arcorann) joins
23:54:30nicolas17 joins