00:18:07lennier2_ joins
00:20:49lennier2 quits [Ping timeout: 260 seconds]
00:42:38beastbg8__ joins
00:43:17qwertyasdfuiopghjkl2 quits [Quit: Leaving.]
00:45:21sec^nd quits [Remote host closed the connection]
00:45:44sec^nd (second) joins
00:45:52<h2ibot>BlankEclair edited Anubis/uncategorized (+27, +https://bugs.winehq.org/): https://wiki.archiveteam.org/?diff=56276&oldid=56207
00:45:54beastbg8_ quits [Ping timeout: 260 seconds]
00:48:53dabs quits [Read error: Connection reset by peer]
00:56:52Mateon1 quits [Remote host closed the connection]
00:57:48Mateon1 joins
01:03:56lennier2_ quits [Ping timeout: 276 seconds]
01:03:59glassy quits [Ping timeout: 260 seconds]
01:04:38lennier2_ joins
01:05:00cuphead2527480 (Cuphead2527480) joins
01:08:40glassy joins
01:16:48etnguyen03 quits [Client Quit]
01:17:08etnguyen03 (etnguyen03) joins
01:19:27<pokechu22>re weebly, https://www.weebly.com/app/help/us/en/topics/weebly-website-builder-support-update now says they no longer have plans to discontinue it (https://web.archive.org/web/20250130035126/https://www.weebly.com/app/help/us/en/topics/weebly-website-builder-support-update before said they would maintain it "through at least July 2025"). So probably not at high risk anymore.
01:23:59<pabs>I'm running an enumeration of glitch.com projects, and soon users. will AB the API call URLs for this. there are over 1.5mil projects so far, so we probably need a DPoS given the 8-day deadline. /cc JAA arkiver
01:27:44etnguyen03 quits [Client Quit]
01:28:04etnguyen03 (etnguyen03) joins
01:29:11<pabs>all of the glitch.com URLs are JSy, and probably many of the $project.glitch.me URLs are JSy too
01:30:44<pabs>the former we might be able to get a defined set of behaviours/files from, but the latter will not be possible to archive properly, since they are individual user projects with individual JS behaviour
01:35:04<h2ibot>PaulWise edited Glitch (-5, fix url): https://wiki.archiveteam.org/?diff=56277&oldid=55836
01:37:04<h2ibot>PaulWise edited Glitch (+264, add discovery): https://wiki.archiveteam.org/?diff=56278&oldid=56277
01:37:05<h2ibot>PaulWise edited Glitch (+11, formatting): https://wiki.archiveteam.org/?diff=56279&oldid=56278
01:37:50etnguyen03 quits [Client Quit]
01:38:05<h2ibot>PaulWise edited Glitch (+125, add SWH archive effort): https://wiki.archiveteam.org/?diff=56280&oldid=56279
01:40:26<pabs>1mil users to 2021-01-26, 2.3mil projects to 2020-05-26
01:48:52etnguyen03 (etnguyen03) joins
01:49:06<h2ibot>PaulWise edited Glitch (+207, add Python scripts): https://wiki.archiveteam.org/?diff=56281&oldid=56280
01:49:45<pabs>anyone here got a gigabit conn and more disk? or an EC2 server close to api.glitch.com? you might be able to run the enumeration scripts ^ faster than me
01:54:07<h2ibot>PaulWise edited Glitch (+15, thumbnail properties use CDN too): https://wiki.archiveteam.org/?diff=56282&oldid=56281
01:55:51<TheTechRobo>How do I test if my connection is faster than yours?
01:56:08<h2ibot>PaulWise edited Glitch (+8, reference JSONL standard): https://wiki.archiveteam.org/?diff=56283&oldid=56282
02:01:57<TheTechRobo>(I'm at 2020-01 on users and 2018-07 for projects.)
02:02:08<h2ibot>PaulWise edited Glitch (+225, add note about assets deletion): https://wiki.archiveteam.org/?diff=56284&oldid=56283
02:02:09<h2ibot>PaulWise edited Glitch (-6, typo): https://wiki.archiveteam.org/?diff=56285&oldid=56284
02:02:16<pabs>I'm on 50Mbit
02:02:27<TheTechRobo>ah, I thought there was specific peering or something you were looking for
02:02:59<nicolas17>pabs: are you running it from home?
02:03:04<pabs>the server is on EC2 I noticed, so if you have one in the same location they do, it might be faster
02:03:05<pabs>yeah
02:03:13<nicolas17>aren't you in australia and hence with awful latency too? :P
02:03:20<pabs>yeah :)
02:03:33<nicolas17>TheTechRobo: it wouldn't take much to beat pabs's speeds
02:03:51<TheTechRobo>(Now on 2020-06 on users and 2018-11 for projects.)
02:04:11<pabs>2023-04-30 users, 2020-12-02 projects
02:04:18<pabs>:P
02:04:22<pabs>head start++
02:04:22<eggdrop>[karma] 'head start' now has 1 karma!
02:04:32<TheTechRobo>hah, we'll see who wins :P
02:06:15<pabs>arkiver: btw the Glitch DPoS looks like it will need `git clone` in it, to figure out which repos are accessible, clone/bundle them, and be able to get the URLs to the CDN assets from the clonable repos
02:06:52<nicolas17>pain
02:07:02<nicolas17>is it http(s) git at least?
02:07:14<TheTechRobo>I think this might be server-limited; I'm by no means saturating my connection
02:07:21<pabs>yep https://api.glitch.com/git/PROJECT-SLUG
02:09:17<pabs>hmm, my disk is getting closer to full...
02:09:20lemuria quits [Read error: Connection reset by peer]
02:10:34lemuria (lemuria) joins
02:13:25<nicolas17>do you have an example git repo link?
02:14:44<pabs>https://api.glitch.com/git/the-cube-puzzle
02:15:22<pabs>unfortunately they don't populate gitRepoUrl properly, so we have to DPoS the clone URLs :/
02:17:52<nicolas17>pabs: you can use this to know if a repo exists / is accessible https://api.glitch.com/git/the-cube-puzzle/info/refs?service=git-upload-pack
02:18:15<nicolas17>it seems somewhat slow tho
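The accessibility probe nicolas17 describes can be sketched in Python. The endpoint shape is from the chat; the helper names and the content-type check are assumptions based on git's smart-HTTP protocol (a successful ref advertisement means the repo exists and is clonable, without fetching any pack data):

```python
from urllib.error import HTTPError
from urllib.request import urlopen

API = "https://api.glitch.com/git"

def refs_url(slug):
    # Smart-HTTP ref advertisement endpoint, per the chat above.
    return f"{API}/{slug}/info/refs?service=git-upload-pack"

def repo_accessible(slug, timeout=30):
    # A 200 with the upload-pack advertisement content type means the
    # repo exists and is publicly clonable; 4xx/5xx means it isn't.
    try:
        with urlopen(refs_url(slug), timeout=timeout) as resp:
            ctype = resp.headers.get("Content-Type", "")
            return ctype.startswith(
                "application/x-git-upload-pack-advertisement")
    except HTTPError:
        return False
```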
02:21:17Lunarian1 quits [Ping timeout: 276 seconds]
02:24:22<nicolas17>ok uh
02:24:27<nicolas17>how do I list an azure blob storage bucket?
02:25:05<nicolas17>I can't figure out how to configure it in rclone
02:25:06<nicolas17>rclone asks for account name and auth key, but it's a public bucket so I only have the bucket name
02:27:59etnguyen03 quits [Remote host closed the connection]
02:30:48<nicolas17>ok got it
02:31:14<nicolas17>newer rclone and I can use 'rclone ls :azureblob,account=ACCOUNT:CONTAINER' without even having to configure a remote
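What rclone does here for a public container can also be done directly against the Azure Blob REST "List Blobs" operation, which needs no credentials when the container allows anonymous read. A minimal sketch; the account and container names are placeholders, and the pagination follows the documented `NextMarker` mechanism:

```python
import xml.etree.ElementTree as ET
from urllib.parse import quote
from urllib.request import urlopen

def list_url(account, container, marker=None):
    # Azure Blob "List Blobs" operation; anonymous for public containers.
    url = (f"https://{account}.blob.core.windows.net/{container}"
           "?restype=container&comp=list")
    if marker:
        url += "&marker=" + quote(marker)
    return url

def list_blobs(account, container):
    # Follow NextMarker pagination; yield (name, size) pairs.
    marker = None
    while True:
        with urlopen(list_url(account, container, marker)) as resp:
            tree = ET.parse(resp)
        for blob in tree.iter("Blob"):
            yield (blob.findtext("Name"),
                   int(blob.findtext("Properties/Content-Length")))
        marker = tree.findtext("NextMarker")
        if not marker:
            return
```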
02:31:24<TheTechRobo>ERROR: API error code: Expecting value: line 1 column 1 (char 0)
02:31:49<TheTechRobo>Both of them crashed with that ^
02:31:49<pabs>hmm, users enumeration crashed here too
02:31:55BitsNBytesNBagels (BitsNBytesNBagels) joins
02:32:02<pabs>projects still going
02:32:13<TheTechRobo>Any way to resume?
02:32:20<pabs>you can resume them by putting in the last URL to the path init
02:32:52<TheTechRobo>The last from api.glitch.com-v1-users.txt ?
02:33:36<TheTechRobo>so path = '/v1/users/?limit=1000&orderKey=createdAt&orderDirection=ASC&lastOrderValue=2023-05-30T16%3A41%3A14.159Z' in my case?
02:33:52<pabs>yeah, will need to set i too
02:34:12<pabs>probably should have used lastOrderValue instead of i...
02:34:41<TheTechRobo>what should i be? one after the largest filename?
02:35:18<TheTechRobo>my largest file is users.3669.json, so set i to 3670?
02:36:29<pabs>yeah
02:36:41<TheTechRobo>ok continuing I think
02:37:09<TheTechRobo>currently on 2023-06-04
02:45:47<pabs>updated scripts to drop the i thing:
02:45:49<pabs>https://transfer.archivete.am/YF1xj/api.glitch.com-v1-users-enumerator.py
02:45:49<eggdrop>inline (for browser viewing): https://transfer.archivete.am/inline/YF1xj/api.glitch.com-v1-users-enumerator.py
02:45:53<pabs>https://transfer.archivete.am/ApUdg/api.glitch.com-v1-projects-enumerator.py
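Dropping "the i thing" presumably means driving the loop purely off keyset pagination on createdAt, so resuming after a crash is just re-supplying the last cursor seen. A minimal sketch of that logic; the query parameters match the URLs quoted in the chat, but the response field names (`items`, `lastOrderValue`) are assumptions drawn from the discussion:

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

BASE = "https://api.glitch.com"

def next_path(kind, last_order_value=None):
    # kind is "users" or "projects"; no separate file counter needed.
    path = f"/v1/{kind}?limit=1000&orderKey=createdAt&orderDirection=ASC"
    if last_order_value:
        path += "&lastOrderValue=" + quote(last_order_value)
    return path

def enumerate_all(kind):
    # Walk the keyset cursor until the API returns an empty page.
    cursor = None
    while True:
        with urlopen(BASE + next_path(kind, cursor)) as resp:
            data = json.load(resp)
        if not data.get("items"):
            return
        yield from data["items"]
        cursor = data["lastOrderValue"]  # field name per the chat log
```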
02:47:31<TheTechRobo>I'll start running that if it crashes again
02:48:15<h2ibot>PaulWise edited Glitch (+0, better scripts): https://wiki.archiveteam.org/?diff=56286&oldid=56285
02:49:15<h2ibot>PaulWise edited Glitch (+6, fix formatting): https://wiki.archiveteam.org/?diff=56287&oldid=56286
02:50:02<pabs>I'm going to bow out of the race, don't have the disk for all this JSON
02:51:26<TheTechRobo>for projects too?
02:51:40<pabs>yeah
02:52:02<pabs>after you are done, the two .txt files need an AB !ao <
02:52:56<TheTechRobo>are the json files important?
02:53:28<pabs>yes, they contain all the data about projects/users, including the slugs, which are needed for getting all the URLs for projects/users
02:53:43<TheTechRobo>Ack
03:03:18<h2ibot>PaulWise edited Glitch (+335, add JS note): https://wiki.archiveteam.org/?diff=56288&oldid=56287
03:03:19<h2ibot>PaulWise edited Glitch (-1, typo): https://wiki.archiveteam.org/?diff=56289&oldid=56288
03:11:08gatagoto (gatagoto) joins
03:14:44cuphead2527480 quits [Client Quit]
03:19:20<h2ibot>PaulWise edited Glitch (+88, git repos files from project.glitch.me URLs): https://wiki.archiveteam.org/?diff=56290&oldid=56289
03:21:20<h2ibot>PaulWise edited Glitch (+37, no .glitch-assets on project hosting): https://wiki.archiveteam.org/?diff=56291&oldid=56290
03:34:12nine quits [Quit: See ya!]
03:34:24nine joins
03:34:24nine quits [Changing host]
03:34:24nine (nine) joins
03:44:24<h2ibot>PaulWise edited Glitch (+137, CDN availability): https://wiki.archiveteam.org/?diff=56292&oldid=56291
03:47:31<nicolas17>I found an open bucket with 8.85TiB of data, of which 965GiB is duplicated files, aaaaaa
03:47:50<nicolas17>we need a sane way to dedup
03:48:24<h2ibot>PaulWise edited Glitch (+93, custom domains): https://wiki.archiveteam.org/?diff=56293&oldid=56292
03:49:48<steering>psh, only 10% duplicate, its fine
03:53:01chunkynutz6 quits [Quit: The Lounge - https://thelounge.chat]
03:56:13<nicolas17>well, if we had a sane way to dedup I would also archive the two subdomains the bucket is accessible through :P
03:57:29<pabs>ISTR DPoS usually does dedup?
03:58:06<DigitalDragons>wget-at does dedup
03:58:38<nicolas17>pabs: DPoS does dedup if the duplicate URLs are retrieved as part of the same item
03:59:08<pabs>oh, complicates it a bit...
03:59:31<nicolas17>I know which URLs are duplicates, thanks to the file listing having hashes
04:00:00<pabs>if only there were a content-addressed protocol people could use instead of HTTP...
04:00:03<nicolas17>but I think that would need stuffing 10 URLs (or at least relative paths) into the item name
04:00:16<pabs>BitTorrent++
04:00:17<eggdrop>[karma] 'BitTorrent' now has 2 karma!
04:01:18<TheTechRobo>Could do it with warcprox manually, I guess.
04:02:05<TheTechRobo>(warcprox does dedupe.)
04:03:29LunarianBunny1147 (LunarianBunny1147) joins
04:03:33<nicolas17>well, "manually" I could use wget-at locally too
04:04:00<nicolas17>but I don't have that much disk space :p
04:05:50<nicolas17>https://paste.debian.net/plain/1383292 example of a file duplicated 10x
04:06:41<nicolas17>that one is 2MB, others are 600MB
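Since the bucket listing already exposes a hash per object, the dedup idea above (stuffing all duplicate paths into one item name so wget-at dedups them within a single item) can be sketched like this; the pair shape and the `|` separator are illustrative assumptions:

```python
from collections import defaultdict

def group_by_hash(listing):
    # listing: iterable of (path, digest) pairs from the bucket index.
    groups = defaultdict(list)
    for path, digest in listing:
        groups[digest].append(path)
    return groups

def item_names(groups, sep="|"):
    # One DPoS item per unique digest; every duplicate path rides along
    # in the item name, so they are fetched (and deduped) together.
    for digest, paths in groups.items():
        yield sep.join([digest] + sorted(paths))
```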
04:28:27qwertyasdfuiopghjkl2 (qwertyasdfuiopghjkl2) joins
04:29:38chunkynutz6 joins
05:10:36<TheTechRobo>I started getting 503s so I modified the scripts as such: https://transfer.archivete.am/inline/BRe4l/api.glitch.com-v1-users-enumerator.py https://transfer.archivete.am/inline/2wBh5/api.glitch.com-v1-projects-enumerator.py
05:11:28<TheTechRobo>Adding a small delay to users (projects is slow enough as is :/) and handling 503. The "continue" there should be fine to retry the current URL since `path` isn't updated until after.
05:12:11<TheTechRobo>Also noticed that data['lastOrderValue'] is not the same as the lastOrderValue in the URL, so the filename doesn't seem to match the URL. (No real harm as long as it's always unique, just confusing.)
05:13:42<TheTechRobo>Haven't run into any 503s in awhile so I guess it was temporary, but hopefully it won't crash overnight.
05:18:33<TheTechRobo>(Also changed the open(output, ...) to open with 'x' instead of 'w' so I can make sure the filenames really are unique.)
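The two fixes described here, retrying the same path on a 503 and opening output files exclusively, can be sketched as below. The wrapper name, the linear backoff, and the `(status, body)` fetch signature are assumptions; the key behavior is the one stated in the chat, that the cursor only advances after a success:

```python
import time

def fetch_with_retry(fetch, path, max_tries=5, delay=1):
    # `fetch` is any callable returning (status, body). On a 503 we
    # sleep and retry the *same* path, like the "continue" in the
    # modified scripts: `path` is only updated after a good response.
    for attempt in range(max_tries):
        status, body = fetch(path)
        if status == 503:
            time.sleep(delay * (attempt + 1))
            continue
        return body
    raise RuntimeError(f"still getting 503s for {path}")

def save(filename, text):
    # Mode "x" raises FileExistsError instead of silently overwriting,
    # which catches non-unique lastOrderValue filenames early.
    with open(filename, "x") as f:
        f.write(text)
```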
05:47:09gatagoto quits [Client Quit]
06:34:02BitsNBytesNBagels quits [Quit: BitsNBytesNBagels]
06:46:50<h2ibot>PaulWise edited Glitch (-3, TheTechRobo updates): https://wiki.archiveteam.org/?diff=56294&oldid=56293