00:18:07 | | lennier2_ joins |
00:20:49 | | lennier2 quits [Ping timeout: 260 seconds] |
00:42:38 | | beastbg8__ joins |
00:43:17 | | qwertyasdfuiopghjkl2 quits [Quit: Leaving.] |
00:45:21 | | sec^nd quits [Remote host closed the connection] |
00:45:44 | | sec^nd (second) joins |
00:45:52 | <h2ibot> | BlankEclair edited Anubis/uncategorized (+27, +https://bugs.winehq.org/): https://wiki.archiveteam.org/?diff=56276&oldid=56207 |
00:45:54 | | beastbg8_ quits [Ping timeout: 260 seconds] |
00:48:53 | | dabs quits [Read error: Connection reset by peer] |
00:56:52 | | Mateon1 quits [Remote host closed the connection] |
00:57:48 | | Mateon1 joins |
01:03:56 | | lennier2_ quits [Ping timeout: 276 seconds] |
01:03:59 | | glassy quits [Ping timeout: 260 seconds] |
01:04:38 | | lennier2_ joins |
01:05:00 | | cuphead2527480 (Cuphead2527480) joins |
01:08:40 | | glassy joins |
01:16:48 | | etnguyen03 quits [Client Quit] |
01:17:08 | | etnguyen03 (etnguyen03) joins |
01:19:27 | <pokechu22> | re weebly, https://www.weebly.com/app/help/us/en/topics/weebly-website-builder-support-update now says they no longer have plans to discontinue it (https://web.archive.org/web/20250130035126/https://www.weebly.com/app/help/us/en/topics/weebly-website-builder-support-update before said they would maintain it "through at least July 2025"). So probably not at high risk anymore. |
01:23:59 | <pabs> | I'm running an enumeration of glitch.com projects, and soon users. will AB the API call URLs for this. there are over 1.5mil projects so far, so probably we need a DPoS given the 8 day deadline. /cc JAA arkiver |
01:27:44 | | etnguyen03 quits [Client Quit] |
01:28:04 | | etnguyen03 (etnguyen03) joins |
01:29:11 | <pabs> | all of the glitch.com URLs are JSy, and probably many of the $project.glitch.me URLs are JSy though |
01:30:44 | <pabs> | the former we might be able to get a defined set of behaviours/files from, but the latter will not be possible to archive properly, since they are individual user projects with individual JS behaviour |
01:35:04 | <h2ibot> | PaulWise edited Glitch (-5, fix url): https://wiki.archiveteam.org/?diff=56277&oldid=55836 |
01:37:04 | <h2ibot> | PaulWise edited Glitch (+264, add discovery): https://wiki.archiveteam.org/?diff=56278&oldid=56277 |
01:37:05 | <h2ibot> | PaulWise edited Glitch (+11, formatting): https://wiki.archiveteam.org/?diff=56279&oldid=56278 |
01:37:50 | | etnguyen03 quits [Client Quit] |
01:38:05 | <h2ibot> | PaulWise edited Glitch (+125, add SWH archive effort): https://wiki.archiveteam.org/?diff=56280&oldid=56279 |
01:40:26 | <pabs> | 1mil users to 2021-01-26, 2.3mil projects to 2020-05-26 |
01:48:52 | | etnguyen03 (etnguyen03) joins |
01:49:06 | <h2ibot> | PaulWise edited Glitch (+207, add Python scripts): https://wiki.archiveteam.org/?diff=56281&oldid=56280 |
01:49:45 | <pabs> | anyone here got a gigabit conn and more disk? or an EC2 server close to api.glitch.com? you might be able to run the enumeration scripts ^ faster than me |
01:54:07 | <h2ibot> | PaulWise edited Glitch (+15, thumbnail properties use CDN too): https://wiki.archiveteam.org/?diff=56282&oldid=56281 |
01:55:51 | <TheTechRobo> | How do I test if my connection is faster than yours? |
01:56:08 | <h2ibot> | PaulWise edited Glitch (+8, reference JSONL standard): https://wiki.archiveteam.org/?diff=56283&oldid=56282 |
02:01:57 | <TheTechRobo> | (I'm at 2020-01 on users and 2018-07 for projects.) |
02:02:08 | <h2ibot> | PaulWise edited Glitch (+225, add note about assets deletion): https://wiki.archiveteam.org/?diff=56284&oldid=56283 |
02:02:09 | <h2ibot> | PaulWise edited Glitch (-6, typo): https://wiki.archiveteam.org/?diff=56285&oldid=56284 |
02:02:16 | <pabs> | I'm on 50Mbit |
02:02:27 | <TheTechRobo> | ah, I thought there was specific peering or something you were looking for |
02:02:59 | <nicolas17> | pabs: are you running it from home? |
02:03:04 | <pabs> | the server is on EC2 I noticed, so if you have one in the same location they do, it might be faster |
02:03:05 | <pabs> | yeah |
02:03:13 | <nicolas17> | aren't you in australia and hence with awful latency too? :P |
02:03:20 | <pabs> | yeah :) |
02:03:33 | <nicolas17> | TheTechRobo: it wouldn't take much to beat pabs's speeds |
02:03:51 | <TheTechRobo> | (Now on 2020-06 on users and 2018-11 for projects.) |
02:04:11 | <pabs> | 2023-04-30 users, 2020-12-02 projects |
02:04:18 | <pabs> | :P |
02:04:22 | <pabs> | head start++ |
02:04:22 | <eggdrop> | [karma] 'head start' now has 1 karma! |
02:04:32 | <TheTechRobo> | hah, we'll see who wins :P |
02:06:15 | <pabs> | arkiver: btw the Glitch DPoS looks like it will need `git clone` in it, to figure out which repos are accessible, clone/bundle them, and be able to get the URLs to the CDN assets from the clonable repos |
02:06:52 | <nicolas17> | pain |
02:07:02 | <nicolas17> | is it http(s) git at least? |
02:07:14 | <TheTechRobo> | think this might be server limited, I'm by no means saturating my connection |
02:07:21 | <pabs> | yep https://api.glitch.com/git/PROJECT-SLUG |
02:09:17 | <pabs> | hmm, my disk is getting closer to full... |
02:09:20 | | lemuria quits [Read error: Connection reset by peer] |
02:10:34 | | lemuria (lemuria) joins |
02:13:25 | <nicolas17> | do you have an example git repo link? |
02:14:44 | <pabs> | https://api.glitch.com/git/the-cube-puzzle |
02:15:22 | <pabs> | unfortunately they don't populate gitRepoUrl properly, so we have to DPoS the clone URLs :/ |
02:17:52 | <nicolas17> | pabs: you can use this to know if a repo exists / is accessible https://api.glitch.com/git/the-cube-puzzle/info/refs?service=git-upload-pack |
02:18:15 | <nicolas17> | it seems somewhat slow tho |
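The accessibility check nicolas17 describes could be scripted roughly as follows. This is a minimal sketch, not the script used in the channel: the function names are hypothetical, and it assumes the Glitch endpoint answers with the standard smart-HTTP advertisement content type for an accessible repo and an HTTP error otherwise.

```python
import urllib.error
import urllib.request
from urllib.parse import quote

GIT_BASE = "https://api.glitch.com/git/"
# Content type a smart-HTTP git server uses for the upload-pack ref advertisement.
ADVERT_TYPE = "application/x-git-upload-pack-advertisement"


def refs_url(slug):
    """Build the info/refs URL for a project slug (hypothetical helper)."""
    return f"{GIT_BASE}{quote(slug, safe='')}/info/refs?service=git-upload-pack"


def is_advertisement(status, content_type):
    """True if the response looks like a git ref advertisement."""
    return status == 200 and content_type.split(";")[0].strip() == ADVERT_TYPE


def repo_accessible(slug, timeout=30):
    """Assumption: an inaccessible repo yields an HTTP error or another content type."""
    try:
        with urllib.request.urlopen(refs_url(slug), timeout=timeout) as resp:
            return is_advertisement(resp.status, resp.headers.get("Content-Type", ""))
    except urllib.error.HTTPError:
        return False
```

Calling repo_accessible("the-cube-puzzle") would issue one request per slug; given the slowness noted above, a generous timeout is probably sensible.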
02:21:17 | | Lunarian1 quits [Ping timeout: 276 seconds] |
02:24:22 | <nicolas17> | ok uh |
02:24:27 | <nicolas17> | how do I list an azure blob storage bucket? |
02:25:05 | <nicolas17> | I can't figure out how to configure it in rclone |
02:25:06 | <nicolas17> | rclone asks for account name and auth key, it's a public bucket so I only have the bucket name |
02:27:59 | | etnguyen03 quits [Remote host closed the connection] |
02:30:48 | <nicolas17> | ok got it |
02:31:14 | <nicolas17> | newer rclone and I can use 'rclone ls :azureblob,account=ACCOUNT:CONTAINER' without even having to configure a remote |
02:31:24 | <TheTechRobo> | ERROR: API error code: Expecting value: line 1 column 1 (char 0) |
02:31:49 | <TheTechRobo> | Both of them crashed with that ^ |
02:31:49 | <pabs> | hmm, users enumeration crashed here too |
02:31:55 | | BitsNBytesNBagels (BitsNBytesNBagels) joins |
02:32:02 | <pabs> | projects still going |
02:32:13 | <TheTechRobo> | Any way to resume? |
02:32:20 | <pabs> | you can resume them by putting the last URL into the path initialization |
02:32:52 | <TheTechRobo> | The last from api.glitch.com-v1-users.txt ? |
02:33:36 | <TheTechRobo> | so path = '/v1/users/?limit=1000&orderKey=createdAt&orderDirection=ASC&lastOrderValue=2023-05-30T16%3A41%3A14.159Z' in my case? |
02:33:52 | <pabs> | yeah, will need to set i too |
02:34:12 | <pabs> | probably should have used lastOrderValue instead of i... |
02:34:41 | <TheTechRobo> | what should i be? one after the largest filename? |
02:35:18 | <TheTechRobo> | my largest file is users.3669.json, so set i to 3670? |
02:36:29 | <pabs> | yeah |
02:36:41 | <TheTechRobo> | ok continuing I think |
02:37:09 | <TheTechRobo> | currently on 2023-06-04 |
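The resume step discussed above can be sketched as follows. The helper names are hypothetical, and the sketch assumes the .txt file holds one full URL per line; the query parameters are the ones quoted in the conversation.

```python
from urllib.parse import urlencode, urlsplit


def users_path(last_order_value=None, limit=1000):
    """Build the /v1/users/ path, optionally continuing from a lastOrderValue."""
    params = {"limit": limit, "orderKey": "createdAt", "orderDirection": "ASC"}
    if last_order_value is not None:
        params["lastOrderValue"] = last_order_value
    return "/v1/users/?" + urlencode(params)


def resume_path(url_lines):
    """Return path plus query of the last non-empty URL line (assumed file format)."""
    last = [line.strip() for line in url_lines if line.strip()][-1]
    parts = urlsplit(last)
    return parts.path + "?" + parts.query
```

Resuming from the URL list rather than from a counter avoids having to work out the next file index by hand.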
02:45:47 | <pabs> | updated scripts to drop the i thing: |
02:45:49 | <pabs> | https://transfer.archivete.am/YF1xj/api.glitch.com-v1-users-enumerator.py |
02:45:49 | <eggdrop> | inline (for browser viewing): https://transfer.archivete.am/inline/YF1xj/api.glitch.com-v1-users-enumerator.py |
02:45:53 | <pabs> | https://transfer.archivete.am/ApUdg/api.glitch.com-v1-projects-enumerator.py |
02:47:31 | <TheTechRobo> | I'll start running that if it crashes again |
02:48:15 | <h2ibot> | PaulWise edited Glitch (+0, better scripts): https://wiki.archiveteam.org/?diff=56286&oldid=56285 |
02:49:15 | <h2ibot> | PaulWise edited Glitch (+6, fix formatting): https://wiki.archiveteam.org/?diff=56287&oldid=56286 |
02:50:02 | <pabs> | I'm going to bow out of the race, don't have the disk for all this JSON |
02:51:26 | <TheTechRobo> | for projects too? |
02:51:40 | <pabs> | yeah |
02:52:02 | <pabs> | after you are done, the two .txt files need an AB !ao < |
02:52:56 | <TheTechRobo> | are the json files important? |
02:53:28 | <pabs> | yes, they contain all the data about projects/users, including the slugs, which are needed for getting all the URLs for projects/users |
02:53:43 | <TheTechRobo> | Ack |
03:03:18 | <h2ibot> | PaulWise edited Glitch (+335, add JS note): https://wiki.archiveteam.org/?diff=56288&oldid=56287 |
03:03:19 | <h2ibot> | PaulWise edited Glitch (-1, typo): https://wiki.archiveteam.org/?diff=56289&oldid=56288 |
03:11:08 | | gatagoto (gatagoto) joins |
03:14:44 | | cuphead2527480 quits [Client Quit] |
03:19:20 | <h2ibot> | PaulWise edited Glitch (+88, git repos files from project.glitch.me URLs): https://wiki.archiveteam.org/?diff=56290&oldid=56289 |
03:21:20 | <h2ibot> | PaulWise edited Glitch (+37, no .glitch-assets on project hosting): https://wiki.archiveteam.org/?diff=56291&oldid=56290 |
03:34:12 | | nine quits [Quit: See ya!] |
03:34:24 | | nine joins |
03:34:24 | | nine is now authenticated as nine |
03:34:24 | | nine quits [Changing host] |
03:34:24 | | nine (nine) joins |
03:44:24 | <h2ibot> | PaulWise edited Glitch (+137, CDN availability): https://wiki.archiveteam.org/?diff=56292&oldid=56291 |
03:47:31 | <nicolas17> | I found an open bucket with 8.85TiB of data, of which 965GiB is duplicated files, aaaaaa |
03:47:50 | <nicolas17> | we need a sane way to dedup |
03:48:24 | <h2ibot> | PaulWise edited Glitch (+93, custom domains): https://wiki.archiveteam.org/?diff=56293&oldid=56292 |
03:49:48 | <steering> | psh, only 10% duplicate, it's fine |
03:53:01 | | chunkynutz6 quits [Quit: The Lounge - https://thelounge.chat] |
03:56:13 | <nicolas17> | well, if we had a sane way to dedup I would also archive the two subdomains the bucket is accessible through :P |
03:57:29 | <pabs> | ISTR DPoS usually does dedup? |
03:58:06 | <DigitalDragons> | wget-at does dedup |
03:58:38 | <nicolas17> | pabs: DPoS does dedup if the duplicate URLs are retrieved as part of the same item |
03:59:08 | <pabs> | oh, complicates it a bit... |
03:59:31 | <nicolas17> | I know which URLs are duplicates, thanks to the file listing having hashes |
04:00:00 | <pabs> | if only there were a content-addressed protocol people could use instead of HTTP... |
04:00:03 | <nicolas17> | but I think that would need stuffing 10 URLs (or at least relative paths) into the item name |
04:00:16 | <pabs> | BitTorrent++ |
04:00:17 | <eggdrop> | [karma] 'BitTorrent' now has 2 karma! |
04:01:18 | <TheTechRobo> | Could do it with warcprox manually, I guess. |
04:02:05 | <TheTechRobo> | (warcprox does dedupe.) |
04:03:29 | | LunarianBunny1147 (LunarianBunny1147) joins |
04:03:33 | <nicolas17> | well, "manually" I could use wget-at locally too |
04:04:00 | <nicolas17> | but I don't have that much disk space :p |
04:05:50 | <nicolas17> | https://paste.debian.net/plain/1383292 example of a file duplicated 10x |
04:06:41 | <nicolas17> | that one is 2MB, others are 600MB |
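Finding the duplicates from a hashed file listing, as nicolas17 describes, might look like the sketch below. The (path, size, digest) tuple format is an assumption for illustration; the actual listing format is not shown in the log.

```python
from collections import defaultdict


def group_duplicates(entries):
    """Group (path, size, digest) entries by digest, keeping only repeated digests."""
    by_hash = defaultdict(list)
    for path, size, digest in entries:
        by_hash[digest].append((path, size))
    return {d: items for d, items in by_hash.items() if len(items) > 1}


def wasted_bytes(groups):
    """Bytes taken by every copy beyond the first in each duplicate group."""
    return sum(size for items in groups.values() for _, size in items[1:])
```

Each group's paths are the candidates that would have to be fetched within the same item for the dedup described above to apply.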
04:28:27 | | qwertyasdfuiopghjkl2 (qwertyasdfuiopghjkl2) joins |
04:29:38 | | chunkynutz6 joins |
05:10:36 | <TheTechRobo> | I started getting 503s so I modified the scripts as such: https://transfer.archivete.am/inline/BRe4l/api.glitch.com-v1-users-enumerator.py https://transfer.archivete.am/inline/2wBh5/api.glitch.com-v1-projects-enumerator.py |
05:11:28 | <TheTechRobo> | Adding a small delay to users (projects is slow enough as is :/) and handling 503. The "continue" there should be fine to retry the current URL since `path` isn't updated until after. |
05:12:11 | <TheTechRobo> | Also noticed that data['lastOrderValue'] is not the same as the lastOrderValue in the URL. So the filename doesn't seem to match the URL. (No real harm (as long as it's always unique), just confusing). |
05:13:42 | <TheTechRobo> | Haven't run into any 503s in awhile so I guess it was temporary, but hopefully it won't crash overnight. |
05:18:33 | <TheTechRobo> | (Also changed the open(output, ...) to open with 'x' instead of 'w' so I can make sure of that.) |
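The changes TheTechRobo describes (retrying on 503 without advancing `path`, a small delay, and opening output files with 'x') can be sketched as below. This is not the linked script: `fetch`, `next_path` and `save` are hypothetical callables supplied by the caller, and the retry limit is an added assumption.

```python
import json
import time


def enumerate_pages(fetch, first_path, next_path, save, delay=0.5, max_retries=10):
    """Walk paginated results; fetch(path) must return (status, body)."""
    path = first_path
    retries = 0
    while path is not None:
        status, body = fetch(path)
        if status == 503:
            retries += 1
            if retries > max_retries:
                raise RuntimeError(f"still 503 after {max_retries} retries: {path}")
            time.sleep(delay)
            continue  # path is unchanged, so the same URL is requested again
        if status != 200:
            raise RuntimeError(f"unexpected HTTP {status} for {path}")
        retries = 0
        data = json.loads(body)
        save(path, data)
        path = next_path(data)  # only advance after the page was saved
        time.sleep(delay)


def write_new(filename, data):
    """Mode 'x' raises FileExistsError instead of overwriting an existing file."""
    with open(filename, "x", encoding="utf-8") as f:
        json.dump(data, f)
```

Checking the status before parsing also avoids the "Expecting value: line 1 column 1 (char 0)" failure, which is the message Python's json module gives for a body that is not JSON.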
05:47:09 | | gatagoto quits [Client Quit] |
06:34:02 | | BitsNBytesNBagels quits [Quit: BitsNBytesNBagels] |
06:46:50 | <h2ibot> | PaulWise edited Glitch (-3, TheTechRobo updates): https://wiki.archiveteam.org/?diff=56294&oldid=56293 |