| 00:10:11 | | eroc1990 (eroc1990) joins |
| 00:13:35 | <pabs> | JAA: how does one use socialbot to grab tweet threads? I looked at the code and didn't see a command for it |
| 00:14:16 | <@JAA> | pabs: socialbot doesn't support it. |
| 00:14:56 | <@JAA> | It only implements a subset of snscrape's scrapers. |
| 00:15:41 | <pabs> | ah, so how would we do twitter threads then? |
| 00:16:11 | <pabs> | these ones in particular https://transfer.archivete.am/EZdNi/cve-twitter-refs.txt |
| 00:16:29 | <pabs> | just AB? |
| 00:17:33 | <@JAA> | Not really possible I guess. AB doesn't grab tweets correctly, and only socialbot feeds into the background thing that does. |
| 00:19:04 | <pabs> | we/I/someone could add a twitter-thread < command though? |
| 00:20:32 | <pabs> | or we can just twitter-profile every user in the list and hope for the best :) |
| 00:21:47 | <pabs> | todb: cve-refs.txt AB job running now. look at archivebot.com or archivebot.com/3 to watch the progress |
| 00:22:20 | <@JAA> | I don't think a recursive tweet list command in socialbot is a good idea. Recursive scrapes can take a very long time, and lists make that worse. socialbot isn't really meant for that. |
| 00:24:25 | <pabs> | ok. I guess twitter-profile is the only option then |
| 00:31:44 | | tbc1887 (tbc1887) joins |
| 00:40:03 | | tbc1887_ joins |
| 00:41:05 | | eroc1990 quits [Client Quit] |
| 00:42:06 | | tbc1887 quits [Ping timeout: 265 seconds] |
| 00:48:24 | | eroc1990 (eroc1990) joins |
| 00:50:13 | | birdjj quits [Client Quit] |
| 00:50:19 | | Somebody2 quits [Ping timeout: 265 seconds] |
| 00:50:30 | | birdjj joins |
| 00:57:47 | | cascode quits [Ping timeout: 252 seconds] |
| 01:00:13 | | birdjj quits [Read error: Connection reset by peer] |
| 01:00:18 | | birdjj joins |
| 01:00:49 | | cascode joins |
| 01:02:49 | | birdjj3 joins |
| 01:02:57 | | birdjj quits [Read error: Connection reset by peer] |
| 01:02:57 | | birdjj3 is now known as birdjj |
| 01:04:10 | | Somebody2 joins |
| 01:18:20 | | cascode quits [Ping timeout: 252 seconds] |
| 01:19:28 | | cascode joins |
| 01:44:33 | | icedice quits [Client Quit] |
| 01:51:55 | | needlenose quits [Quit: Leaving] |
| 02:05:20 | | cascode quits [Read error: Connection reset by peer] |
| 02:05:33 | | cascode joins |
| 02:10:02 | | cascode quits [Ping timeout: 252 seconds] |
| 02:11:05 | | cascode joins |
| 02:20:49 | <andrew> | I finally got grab-site running and wow it's quite a bit faster than wget |
| 02:25:03 | | cascode quits [Ping timeout: 265 seconds] |
| 02:25:17 | | cascode joins |
| 02:31:12 | <andrew> | how many millions of pages is too many pages for grab-site? |
| 02:37:51 | <fireonlive> | https://i.postimg.cc/8cRLSG9h/image.png hopefully not 1 |
| 02:37:57 | <fireonlive> | though am only getting like 3r/s :( |
| 02:41:36 | | cascode quits [Read error: Connection reset by peer] |
| 02:41:53 | | cascode joins |
| 02:43:38 | <fireonlive> | oops forgot to lower con |
| 02:56:12 | <andrew> | hopefully it can handle 60 million pages (my rough estimate for this crawl) |
| 03:06:08 | | dumbgoy quits [Ping timeout: 265 seconds] |
| 03:11:35 | <fireonlive> | what's your r/s aobut? |
| 03:11:38 | <fireonlive> | abouts* |
| 03:12:12 | <andrew> | fireonlive: it says 4.5 r/s |
| 03:12:17 | <andrew> | at 6 concurrency |
| 03:12:19 | <fireonlive> | oh ntb |
| 03:12:28 | <andrew> | but it wildly varies because the server isn't particularly fast on some pages |
| 03:12:30 | <fireonlive> | i'm at 3.9 with 5 atm |
| 03:12:40 | <fireonlive> | ye i guess it depends on server, backend, etc |
| 03:12:46 | <fireonlive> | grab-site is very neat! |
| 03:13:15 | <fireonlive> | also uh thanks to everyone here for not minding the stuff i add in lol |
| 03:13:29 | <fireonlive> | or "waste my time with" to add |
| 03:17:30 | <fireonlive> | oh i thought this was imgone |
| 03:26:00 | | pikablu joins |
| 03:28:59 | | pikablu quits [Client Quit] |
| 04:21:18 | <todb> | pabs: thanks so much for running the AB job! |
| 04:26:03 | <pabs> | todb: once the job completes, you will be able to look at what URLs worked/didn't by finding the job on https://archive.fart.website/archivebot/viewer/ |
| 04:45:23 | | decay quits [] |
| 04:58:20 | | cascode quits [Read error: Connection reset by peer] |
| 04:59:09 | | cascode joins |
| 05:09:40 | | Jake quits [Quit: Leaving for a bit!] |
| 05:10:06 | | Jake (Jake) joins |
| 05:15:40 | | Jake quits [Client Quit] |
| 05:16:00 | | Jake (Jake) joins |
| 05:19:41 | | lexikiq quits [Client Quit] |
| 05:20:20 | | Jake quits [Client Quit] |
| 05:24:01 | | Jake (Jake) joins |
| 05:25:03 | | decky_e quits [Remote host closed the connection] |
| 05:25:20 | | decky_e joins |
| 05:27:36 | | Jake quits [Client Quit] |
| 05:28:10 | | Jake (Jake) joins |
| 06:08:44 | | MrTumnus quits [Ping timeout: 252 seconds] |
| 06:15:22 | | parfait (kdqep) joins |
| 06:23:49 | | BlueMaxima quits [Read error: Connection reset by peer] |
| 06:29:49 | | sec^nd quits [Remote host closed the connection] |
| 06:30:19 | | sec^nd (second) joins |
| 06:30:24 | | Arachnophine quits [Remote host closed the connection] |
| 06:31:59 | | Arachnophine (Arachnophine) joins |
| 06:34:09 | | Island quits [Read error: Connection reset by peer] |
| 06:46:31 | | ave quits [Read error: Connection reset by peer] |
| 06:46:34 | | igloo222252 joins |
| 06:46:38 | | lun42 (lun4) joins |
| 06:46:52 | | ave (ave) joins |
| 06:48:20 | | lun4 quits [Ping timeout: 252 seconds] |
| 06:48:20 | | lun42 is now known as lun4 |
| 06:48:42 | | igloo22225 quits [Ping timeout: 252 seconds] |
| 06:48:42 | | igloo222252 is now known as igloo22225 |
| 06:52:09 | | nicolas17 joins |
| 06:53:55 | | nicolas17 quits [Client Quit] |
| 07:08:07 | | parfait quits [Client Quit] |
| 07:28:17 | <h2ibot> | PaulWise edited ArchiveBot/CVE (+692, add a few issues): https://wiki.archiveteam.org/?diff=49798&oldid=49796 |
| 07:28:33 | <pabs> | todb: ^ |
| 07:29:17 | <h2ibot> | PaulWise edited ArchiveBot/CVE (-9, drop http): https://wiki.archiveteam.org/?diff=49799&oldid=49798 |
| 07:30:17 | <h2ibot> | PaulWise edited ArchiveBot/CVE (+28, reference URLs project): https://wiki.archiveteam.org/?diff=49800&oldid=49799 |
| 07:31:17 | <h2ibot> | PaulWise edited ArchiveBot/CVE (+69, mention AB job): https://wiki.archiveteam.org/?diff=49801&oldid=49800 |
| 07:31:50 | | ave quits [Client Quit] |
| 07:32:09 | | ave (ave) joins |
| 07:45:57 | | Arcorann (Arcorann) joins |
| 07:48:11 | | decky_e quits [Read error: Connection reset by peer] |
| 07:48:53 | | Somebody2 quits [Ping timeout: 265 seconds] |
| 08:02:04 | | Ivan226 joins |
| 11:22:47 | | AnotherIki joins |
| 11:26:26 | | Iki1 quits [Ping timeout: 252 seconds] |
| 11:31:30 | | Somebody2 joins |
| 11:46:27 | <todb> | Ty for minding that and noting the failures, pabs ! |
| 12:01:11 | <JTL> | Am I misremembering things, or did I see a project somewhere (might've been HN or Reddit) that provided an interface to do semi automated bulk archival of YouTube channels using the usual projects? (youtube-dl/yt-dlp) |
| 12:01:13 | | M--mlv|m joins |
| 12:03:37 | <pabs> | there is #down-the-tube (things end up in archive.org) and #youtubearchive (things end up in a private archive IIRC) |
| 12:03:56 | <JTL> | Not quite what I was looking for, but thanks |
| 12:04:20 | <JTL> | well, should've asked in -ot because I was referring to a frontend type project, not AT, sorry! |
| 12:10:28 | | sonick quits [Client Quit] |
| 12:12:58 | <pabs> | ah ok |
| 12:41:40 | | icedice (icedice) joins |
| 12:45:04 | | that_lurker quits [Client Quit] |
| 12:45:48 | | that_lurker (that_lurker) joins |
| 13:19:36 | | dumbgoy joins |
| 13:41:10 | <todb> | pabs: so how do I link archivebot job 10p5imv06zrzlrxm7wcneyhw0 to something searchable on https://archive.fart.website/archivebot/viewer/ ? The JobID syntax is different there, so I'm not sure how to discover it there. |
| 13:46:46 | | ThetaDev joins |
| 14:02:04 | <pabs> | todb: the job isn't completed yet so it won't show up, but when it is, you can truncate the jobid and look it up on the viewer |
| 14:17:45 | <todb> | pabs: ah gotcha. So how do you do things like notice those marc.info errors? I can see the job scrolling on http://archivebot.com/ but it's not obvious to me how you see those errors. If you don't mind me asking. |
| 14:21:50 | | Arcorann quits [Ping timeout: 265 seconds] |
| 14:23:41 | | sonick (sonick) joins |
| 14:26:28 | <pabs> | todb: I saw it in the scrolling, I guess it got past marc.info by now |
| 14:26:55 | <pabs> | and you can get the full log from the meta warc after it is done |
| 14:29:35 | <todb> | ah ok. So if you happen to see it on the scroll, you're good, otherwise you have to wait until the end |
| 14:37:19 | <pabs> | yep |
| 14:44:36 | | hitgrr8 joins |
| 14:53:37 | | michaelblob_ (michaelblob) joins |
| 14:54:53 | | michaelblob quits [Ping timeout: 252 seconds] |
| 15:49:11 | | Island joins |
| 15:56:18 | | Icyelut quits [Read error: Connection reset by peer] |
| 15:56:37 | | Icyelut (Icyelut) joins |
| 16:01:04 | | decky_e (decky_e) joins |
| 16:10:46 | <h2ibot> | Entartet edited List of websites excluded from the Wayback Machine (+19, Added krot.org.): https://wiki.archiveteam.org/?diff=49802&oldid=49727 |
| 17:00:20 | | killsushi joins |
| 17:21:26 | | nicolas17 joins |
| 17:25:38 | | spirit joins |
| 17:35:30 | | decky_e quits [Read error: Connection reset by peer] |
| 17:36:01 | | decky_e (decky_e) joins |
| 18:23:57 | <lennier1> | JAA: My abandoned verified users Twitter scrape is complete. Format current_username, old_username, user_id. https://transfer.archivete.am/rXjVK/twitter_legacy_verified_abandoned_users_cutoff_2022-06-01.txt |
| 18:24:08 | <lennier1> | The list must have had accounts verified earlier at the end, because the rate went up as it got further down the list. I found 33938 users using the tweet cutoff date 2022-06-01. I saved the .json output as well, so could get a different cutoff date in the future. A year is pretty arbitrary. The Twitter TOS recommends logging in every 30 days. Who knows what they'll actually do. |
| 18:24:16 | <lennier1> | Additionally, rocketdive requested 10 users that didn't show up in my list because they were unverified or had tweeted somewhat later than the cutoff date. Some are more prominent than others. https://transfer.archivete.am/8TVfu/twitter_unverified_abandoned_users_from_rocketdive.txt |
| 18:27:40 | <icedice> | JAA: Were Imgur links on pokemon-trainer.com grabbed? It's a tiny forum, so they might have been. If not, that WARC would be nice to scrape after Bulbagarden Forums and Serebii Forums if there's time |
| 18:32:15 | <spirit> | https://github.com/ArchiveTeam/ArchiveBot/blob/master/INSTALL.pipeline says to mail yipdw, could someone privately msg me their mail address? i dont see them in the irc channels anymore |
| 18:33:05 | <pokechu22> | Pipelines from other people generally aren't accepted nowadays because of complications with setup and stuff |
| 18:33:15 | <pokechu22> | JAA can probably explain better |
| 18:34:24 | <spirit> | ah |
| 18:34:42 | <spirit> | i should have asked before ordering and trying to set up a server... |
| 18:34:50 | <spirit> | seems broken with python 3.9 anyways |
| 18:37:14 | <xkey> | have you seen this https://twitter.com/elonmusk/status/1659255118196355073 |
| 18:37:27 | <xkey> | textfiles quite angry obv https://twitter.com/textfiles/status/1659259005888282624 |
| 18:40:23 | <pokechu22> | Yeah, unfortunately there's a lot of legacy stuff with it (and very outdated/missing documentation) :| |
| 18:56:49 | <pokechu22> | spirit: I don't think you actually ever mentioned what site you were interested in having archived via archivebot |
| 18:57:34 | <spirit> | it's https://www.artdoxa.com/ , one of the operators contacted me |
| 18:57:56 | <spirit> | i should have good ignore rules ready but it will be more than 500G for sure, maybe 1TB |
| 18:58:18 | <spirit> | should be something like: --no-offsite-links --large --explain "Requested by operators before shutdown, contact via spirit@irc" |
| 18:58:21 | <spirit> | !ig TODO \?(random|return_to)= |
| 18:58:25 | <spirit> | !ig TODO \/(search|recent_activities)\? |
| 18:58:57 | <pokechu22> | Oh yeah, JAA said this after you went offline last time: 20:46 <@JAA> spirit, pokechu22: Total size rarely matters; number of URLs and size of the largest files does. |
| 18:59:07 | <spirit> | ah nice |
| 18:59:14 | | Megame (Megame) joins |
| 18:59:18 | <spirit> | let me calculate some estimate for the URLs |
| 18:59:34 | <spirit> | the biggest file should be below 100MB unless someone managed to upload a huuuuge image :) |
| 19:00:34 | <pokechu22> | Yeah, that should definitely be safe - things pause if the overall disk space gets below 5GB which is generally safe unless there's a bunch of things downloading somewhat big files at once or a single thing tries to download a multi-gigabyte file |
| 19:03:16 | <h2ibot> | Pokechu22 edited Deathwatch (+52, /* 2023 */ ARTDOXA): https://wiki.archiveteam.org/?diff=49803&oldid=49795 |
| 19:03:29 | <pokechu22> | When not signed in https://www.artdoxa.com/Thomas_Sowa/large?page=1#203664 gives e.g. https://www.artdoxa.com/session/new?return_to=%2Fusers%2F10972%2Fartworks%2F203664 for "add tags", probably we want to ignore /session/new in general |
| 19:03:49 | <pokechu22> | and /users/new, too, it seems |
| 19:04:31 | <spirit> | my first ignore regex matches those |
| 19:04:51 | <pokechu22> | ah, via return_to, that works too |
| 19:05:24 | <pokechu22> | I also see that the pages embed https://artdoxa-images.s3.amazonaws.com/uploads/artwork/image/203683/normal_bild_1117.jpg but link to https://artdoxa-images.s3.amazonaws.com/uploads/artwork/image/203683/watermark_bild_1117.jpg -- probably we don't want --no-offsite-links in that case if we want to save the full sized ones (which seems reasonable) |
| 19:05:38 | <spirit> | oh it needs https://artdoxa-images.s3.amazonaws.com |
| 19:05:42 | <spirit> | haha, same thought |
| 19:05:45 | <spirit> | there are ~150,000 artworks, i think there are 4 image sizes so ~600,000, surely <1,000,000 for images |
| 19:06:42 | <spirit> | there are ~6,000 users, who have a) avatar images in several sizes, that's maybe 25000 URLs and b) favorite pages which might be 5 pages on average maybe?, that's maybe 30000 URLs. So for that another maybe 100,000 URLs |
| 19:07:21 | <spirit> | some artists have catalog pages but that should also be a lowish number, surely not 100,000 URLs |
| 19:07:37 | <spirit> | i would be surprised if it is more than 2 million URLs in total |
| 19:08:17 | <spirit> | forgot artwork and contact pages for artists, maybe another 100,000 in total |
| 19:09:02 | <spirit> | maybe twice that for thumbnail and large image previews in pagination |
| 19:10:16 | <pokechu22> | Yeah, that should be fine overall, we've definitely done much bigger. And the images being hosted on amazon should mean that things will be fairly stable |
| 19:12:13 | <pokechu22> | Where do you get ~150,000 artworks? The most recent stuff on the main page gives https://www.artdoxa.com/users/matl1/artworks/203685, https://www.artdoxa.com/users/matl1/artworks/203684 which would make me think there's 203685 (but the one after that is https://www.artdoxa.com/users/johann_i/artworks/174344 so maybe it's not as incremental as it appears?) |
| 19:13:28 | <spirit> | 40 images per page on https://artdoxa.com/artworks?sort=most_recent with 3268 pages |
| 19:13:58 | <spirit> | phew the site is slow atm |
| 19:14:53 | <pokechu22> | Interesting, https://artdoxa.com/hanskostercom/large?page=1#203643 is a YouTube video instead |
| 19:15:48 | <spirit> | yeah i found some but i'd say they are not too important |
| 19:16:18 | <spirit> | if it's easy to include them, nice, but if not, no issue |
| 19:17:04 | <pokechu22> | Looking at https://artdoxa.com/pages/imprint it seems like https://www.artdoxa.com/ is probably preferable over https://artdoxa.com/ (both serve the same content and use relative links, but the imprint links to the www version) |
| 19:21:37 | <spirit> | yeah that would be the canonical one |
| 19:28:16 | | Somebody2 quits [Ping timeout: 265 seconds] |
| 19:35:02 | | cascode quits [Ping timeout: 252 seconds] |
| 19:35:51 | | cascode joins |
| 20:08:18 | | cascode quits [Read error: Connection reset by peer] |
| 20:08:22 | <@JAA> | spirit: Re AB pipelines: https://wiki.archiveteam.org/index.php/ArchiveBot#Volunteer_to_run_a_Pipeline And yes, only up to Py 3.6 supported. But thanks for pointing out INSTALL.pipeline; that doesn't belong there. |
| 20:08:40 | | cascode joins |
| 20:08:57 | <@JAA> | I've been trying to separate 'ArchiveBot the software' from 'ArchiveBot the AT instance of the software'. |
| 20:17:11 | | Mateon1 quits [Ping timeout: 252 seconds] |
| 20:18:17 | | Island_ joins |
| 20:18:24 | | sonick quits [Client Quit] |
| 20:18:50 | | Island quits [Ping timeout: 252 seconds] |
| 20:19:03 | | Mateon1 joins |
| 20:32:14 | <spirit> | JAA: cheers! i'd be happy to donate a server for a month to get this site archived but paying for longer would not be a good plan for me. should i or someone else just throw this job into the existing archivebot "pool" with --large? |
| 20:35:23 | <pokechu22> | I can probably just throw it in. --large isn't actually a thing anymore, but I can manually pick a pipeline that works for it. I'll probably do it in 30 minutes-an hour, eating right now |
| 20:35:43 | <@JAA> | spirit: Yeah, doesn't make sense to add a server for a month. --large should simply be purged; it hasn't been used in years. |
| 20:35:57 | <@JAA> | AB is down right now though due to an issue with the control node. |
| 20:41:11 | <spirit> | =) |
| 20:41:23 | <spirit> | pokechu22: thanks! |
| 20:41:49 | <spirit> | no rush, it will be up for a month at least |
| 21:07:12 | | decky_e quits [Read error: Connection reset by peer] |
| 21:07:40 | | decky_e (decky_e) joins |
| 21:19:37 | | hitgrr8 quits [Client Quit] |
| 21:28:20 | | cascode quits [Read error: Connection reset by peer] |
| 21:35:41 | | Somebody2 joins |
| 21:54:14 | | nicolas17 quits [Ping timeout: 265 seconds] |
| 21:56:07 | <Barto> | https://twitter.com/elonmusk/status/1659255118196355073 lol btw |
| 21:56:48 | <Barto> | complaining about deleting the past when he's doing all in his power to screw snscrape and delete twitter accounts :p |
| 22:21:29 | | Megame quits [Ping timeout: 252 seconds] |
| 22:23:39 | | MrTumnus joins |
| 22:30:27 | | MrTumnus quits [Client Quit] |
| 22:30:53 | | MrTumnus joins |
| 22:54:08 | | killsushi quits [Ping timeout: 252 seconds] |
| 22:58:32 | | decky_e quits [Ping timeout: 252 seconds] |
| 22:59:10 | | decky_e (decky_e) joins |
| 23:02:52 | | MrTumnus quits [Ping timeout: 265 seconds] |
| 23:05:48 | | MrTumnus joins |
| 23:07:59 | | MrTumnus quits [Client Quit] |
| 23:26:56 | | Dango360 quits [Ping timeout: 252 seconds] |
| 23:31:31 | | Dango360 (Dango360) joins |
| 23:34:17 | <andrew> | What the fuck is going on with Elon/the alt-right/Internet Archive now |
| 23:34:26 | <andrew> | I can't even |
| 23:52:10 | | Arcorann (Arcorann) joins |
| 23:54:30 | | nicolas17 joins |
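
For reference, the two ignore patterns spirit proposed for the artdoxa.com job (the `!ig` lines at 18:58) can be sanity-checked offline against URLs quoted in the discussion. This is only a rough sketch: it assumes the patterns are applied as Python regular expressions searched anywhere in the URL, which is not confirmed in the log. The `is_ignored` helper and the `/search?q=example` URL are made up for illustration.

```python
import re

# Ignore patterns copied verbatim from the two "!ig" lines in the log.
IGNORE_PATTERNS = [
    r"\?(random|return_to)=",
    r"\/(search|recent_activities)\?",
]


def is_ignored(url, patterns=IGNORE_PATTERNS):
    """Return True if any pattern matches anywhere in the URL.

    Assumption: search (not full-match) semantics, as with re.search.
    """
    return any(re.search(pattern, url) for pattern in patterns)


# URLs quoted in the log, plus one hypothetical search URL.
samples = [
    "https://www.artdoxa.com/session/new?return_to=%2Fusers%2F10972%2Fartworks%2F203664",
    "https://www.artdoxa.com/users/matl1/artworks/203685",
    "https://artdoxa.com/artworks?sort=most_recent",
    "https://artdoxa-images.s3.amazonaws.com/uploads/artwork/image/203683/watermark_bild_1117.jpg",
    "https://www.artdoxa.com/search?q=example",  # hypothetical, not from the log
]

for url in samples:
    print("IGNORE" if is_ignored(url) else "keep  ", url)
```

Under these assumptions the `/session/new?return_to=...` login redirect is caught by the first pattern, as spirit said at 19:04, while the artwork pages, the paginated listing and the S3-hosted images are left alone.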