00:10:45<h2ibot>Tech234a edited YouTube/Technical details (+67, /* Playlists */): https://wiki.archiveteam.org/?diff=49228&oldid=49175
00:10:58qwertyasdfuiopghjkl quits [Remote host closed the connection]
00:15:09<tech234a>^ adds mention of 2022 music recap playlist prefix
00:44:35qwertyasdfuiopghjkl joins
01:32:59<@OrIdow6>Back
01:33:38sepro quits [Ping timeout: 268 seconds]
01:38:58sepro (sepro) joins
01:41:18<@OrIdow6>What was the channel for that big Chinese forum again?
01:44:52sepro quits [Client Quit]
01:45:43sepro (sepro) joins
01:48:04<thuban>i don't think one was established.
01:48:42Arcorann quits [Ping timeout: 276 seconds]
01:50:45sepro quits [Ping timeout: 265 seconds]
01:50:55Arcorann (Arcorann) joins
01:52:14<@OrIdow6>Oh
02:04:56Iki1 joins
02:09:24Iki quits [Ping timeout: 268 seconds]
02:26:25Arcorann quits [Read error: Connection reset by peer]
02:30:48Arcorann (Arcorann) joins
02:35:55Arcorann quits [Ping timeout: 268 seconds]
02:37:26Arcorann (Arcorann) joins
02:38:36tzt quits [Ping timeout: 265 seconds]
02:40:00tzt (tzt) joins
03:00:34Ketchup901 quits [Remote host closed the connection]
03:01:01Ketchup901 (Ketchup901) joins
03:19:02@AlsoJAA quits [Quit: leaving]
03:42:05AlsoJAA (JAA) joins
03:42:05@ChanServ sets mode: +o AlsoJAA
03:50:59AnotherIki joins
03:54:48Iki1 quits [Ping timeout: 276 seconds]
04:40:26<h2ibot>Wickedplayer494 uploaded File:Surrender at 20 logo.png: https://wiki.archiveteam.org/?title=File%3ASurrender%20at%2020%20logo.png
04:40:27<h2ibot>Wickedplayer494 uploaded File:Surrender at 20 - 12-10-22.png: https://wiki.archiveteam.org/?title=File%3ASurrender%20at%2020%20-%2012-10-22.png
04:50:27<h2ibot>Jarshua edited Twitter (+241, /* Vital Signs */ uh oh): https://wiki.archiveteam.org/?diff=49231&oldid=49161
05:14:45Jonimus quits [Ping timeout: 276 seconds]
05:22:56lennier1 quits [Ping timeout: 265 seconds]
05:23:54mut4ntm0nkey quits [Remote host closed the connection]
05:24:18mut4ntm0nkey (mutantmonkey) joins
05:24:33<h2ibot>Wickedplayer494 created Surrender at 20 (+4018, Let's finally talk about this site we put…): https://wiki.archiveteam.org/?title=Surrender%20at%2020
05:24:55lennier1 (lennier1) joins
05:42:33Jonimus joins
05:55:43knecht420 quits [Ping timeout: 268 seconds]
06:01:15sonick quits [Client Quit]
06:01:54BlueMaxima quits [Read error: Connection reset by peer]
06:03:18Megame (Megame) joins
06:14:10knecht420 (knecht420) joins
06:20:24sec^nd quits [Remote host closed the connection]
06:21:05sec^nd (second) joins
06:35:11lennier1 quits [Ping timeout: 268 seconds]
06:36:40lennier1 (lennier1) joins
06:48:36sec^nd quits [Remote host closed the connection]
06:50:25sec^nd (second) joins
06:54:30Megame quits [Client Quit]
07:00:54<@JAA>I'm going to change my strategy on the Atom packages archival. It's not getting anywhere like this, and time's slowly running out.
07:03:48<@JAA>I've only managed to get under 3k packages (of 1.1M, although many of the not-so-old ones are spam), and a good portion of those are likely incompletely covered, too.
07:09:19<tech234a>Distributed Atom archival?
07:09:49<h2ibot>Tech234a edited Twitter (+7, /* Vital Signs */ Yes accounts will be purged…): https://wiki.archiveteam.org/?diff=49233&oldid=49231
07:13:32<@JAA>Hopefully won't be necessary. There are more important things for The Project Author to handle. :-)
08:01:50<@OrIdow6>I am probably free for at least a few days now, but it's not always writing the project code that's the overwhelming majority of the bottleneck
08:02:08hackbug quits [Ping timeout: 268 seconds]
08:02:58HackMii quits [Remote host closed the connection]
08:03:41HackMii (hacktheplanet) joins
08:17:19HackMii quits [Ping timeout: 248 seconds]
08:18:26HackMii (hacktheplanet) joins
08:30:35<qwertyasdfuiopghjkl>tech234a: Typo: "accounts that have with no Tweets" should be "accounts with no Tweets". (though Elon's tweet could also be interpreted as "accounts that have posted no tweets for years" rather than "accounts with no tweets at all")
08:32:42<tech234a>Corrected, thanks
08:33:01<h2ibot>Tech234a edited Twitter (+108, /* Vital Signs */ Typo correction and add…): https://wiki.archiveteam.org/?diff=49234&oldid=49233
08:33:50hitgrr8 joins
09:16:33driib quits [Ping timeout: 276 seconds]
09:40:12<h2ibot>Bzc6p edited EOldal (-50, fix launch year): https://wiki.archiveteam.org/?diff=49235&oldid=49198
09:47:41Island quits [Read error: Connection reset by peer]
09:53:45birdjj quits [Ping timeout: 268 seconds]
10:00:15<h2ibot>Bzc6p edited MyVIP (-10, /* Archiving */ Let's proclaim this done; new…): https://wiki.archiveteam.org/?diff=49236&oldid=48131
11:20:51jacksonchen666 (jacksonchen666) joins
11:41:26sonick (sonick) joins
12:14:56jacksonchen666 quits [Remote host closed the connection]
12:26:30jacksonchen666 (jacksonchen666) joins
12:31:02birdjj joins
12:34:42Arcorann quits [Ping timeout: 268 seconds]
12:36:06eroc1990 quits [Ping timeout: 276 seconds]
12:39:08<@arkiver>JAA: do you have a link to the Atom stuff?
12:39:37<@arkiver>for tianya we had a few suggestions
12:39:47<@arkiver>byenya from schwarzkatz|m, endoftheworldclub or endoftheendoftheworldclub by thuban
12:39:56<@arkiver>i kind of like endoftheendoftheworldclub, but byenya may be better
12:40:58<@arkiver>i'm currently somewhat less available due to travel, but would be great to have a project started for tianya next week
12:41:05<@arkiver>they're big and have a ton of interesting content
12:43:27<@arkiver>let's go with #byenya, the more sane name
12:44:13<@arkiver>(i'm noting endoftheendoftheworldclub as the more fun version)
12:53:28<@Sanqui>btw just some numbers - I collected 3 million @seznam.cz emails online and tried using them as sweb.cz domains (because they share usernames). 3589 URLs resolved with 200, and out of those, only 18 were new to me. looks like I've really hit diminishing returns lol
12:57:31hackbug (hackbug) joins
13:15:17birdjj quits [Client Quit]
13:16:40birdjj joins
13:17:18birdjj quits [Client Quit]
13:18:46birdjj joins
13:27:43sec^nd quits [Ping timeout: 248 seconds]
13:29:40sec^nd (second) joins
13:40:06jacksonchen666 quits [Remote host closed the connection]
13:40:37jacksonchen666 (jacksonchen666) joins
13:52:44Mateon1 quits [Remote host closed the connection]
13:53:10Mateon1 joins
13:57:57Megame (Megame) joins
14:35:57HackMii quits [Remote host closed the connection]
14:36:29HackMii (hacktheplanet) joins
14:43:05sepro (sepro) joins
14:52:18Mateon1 quits [Remote host closed the connection]
14:54:23eroc1990 (eroc1990) joins
15:14:45Mateon1 joins
15:16:26jacksonchen666 quits [Remote host closed the connection]
15:16:48jacksonchen666 (jacksonchen666) joins
15:31:59Megame quits [Client Quit]
16:01:25jacksonchen666 quits [Remote host closed the connection]
16:01:53jacksonchen666 (jacksonchen666) joins
16:04:37jacksonchen666 quits [Remote host closed the connection]
16:05:04jacksonchen666 (jacksonchen666) joins
16:13:49<@JAA>arkiver: https://github.blog/2022-06-08-sunsetting-atom/ https://atom.io/packages
16:14:39<@JAA>I'm only going after the API. Best data and should probably keep package management working through the WBM in the future with some tweaks to `apm`.
16:19:49<@JAA>Across the past week, about 80% of the responses I got were 500s.
16:24:22<@JAA>Blast cannon has been activated.
16:33:44jacksonchen666 quits [Remote host closed the connection]
16:34:06jacksonchen666 (jacksonchen666) joins
16:35:09<@JAA>Blast cannon seems to be too powerful. If I alone can already overload it, a distributed approach won't help.
16:56:24<@JAA>Oh, ok, it's just the pagination that gets slower and slower until it dies, apparently. Welp.
16:59:34<joepie91|m>good old OFFSET X LIMIT Y
17:00:25<@JAA>Either that or fetching all results and then slicing them in code.
17:01:00<joepie91|m>if it's getting slower as you delve into higher page numbers then it's definitely OFFSET, otherwise it'd be consistently slow
17:01:20<@JAA>Right
17:01:30<joepie91|m>if you have some way to obtain a subset of packages that's looked up by index, that'll probably be faster
17:01:42<@JAA>Although I have seen 'fetch page_number*page_size results, return last page_size ones' as well. :-|
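A toy cost model of the two backend patterns joepie91 and JAA describe above (my own illustration, not atom.io's actual implementation): both naive OFFSET/LIMIT and the "fetch everything up to the page, slice in code" variant do work proportional to the page number, which matches the observed slowdown on deep pages. Keyset pagination is shown for contrast.

```python
def offset_limit_cost(page, page_size):
    """Rows touched by OFFSET (page-1)*size LIMIT size: the database
    still produces and discards everything before the offset."""
    return (page - 1) * page_size + page_size

def slice_in_code_cost(page, page_size):
    """The 'fetch page*page_size results, return the last page_size'
    variant JAA mentions; the application discards the prefix instead
    of the database, but the total work is the same."""
    return page * page_size

def keyset_cost(page, page_size):
    """Keyset pagination (WHERE id > last_seen ORDER BY id LIMIT size)
    does constant work per page, regardless of depth."""
    return page_size
```

Both slow variants cost exactly `page * page_size` rows per request, so deep pages degrade linearly, while keyset stays flat.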
17:01:45<joepie91|m>eg. scrape each category individually, or search for trigrams or whatever
17:02:35<joepie91|m>I'd have a look at the options that the package registry offers, but it's kinda unreachable :p
17:03:09<@JAA>Here's the API: https://flight-manual.atom.io/atom-server-side-apis/sections/atom-package-server-api/
17:03:14<@JAA>There's only sorting and pagination.
17:03:34<@JAA>Well, and the search, but that requires a search term.
17:03:42<@JAA>No categories or similar, and no way to access packages by numeric ID.
17:04:27<joepie91|m>does it require whole words or does it do a substring search?
17:04:34<@JAA>There are 'keywords', which I guess are kind of like categories, but I haven't seen a way to list all of those.
17:04:50<joepie91|m>eg. does "oog" get you google-related stuff?
17:05:57<@JAA>Looks like it does, yeah.
17:06:02<joepie91|m>excellent
17:06:12<joepie91|m>then grab google's ngram dataset or whatever it's called
17:06:17<joepie91|m>order by frequency
17:06:22<joepie91|m>and search for each entry in order :p
17:07:01<joepie91|m>if the search works like I think it does, every individual search query should be very fast for their database
17:07:31<joepie91|m>and different ngrams will get you different subsets of packages
17:08:17<joepie91|m>since the search is also paginated, it's probably a good idea to only do the first N pages for each search query, the ones that are fast
17:09:49<@JAA>Hmm
17:12:25<joepie91|m>related: https://aarch64.com/about
17:12:25<madpro|m>If I may derail for a moment.
17:12:25<joepie91|m>doesn't seem to be much else interesting in that range
17:12:26<madpro|m>Twitter may have plans for purging inactive accounts, again
17:12:26<madpro|m>https://www.reddit.com/r/DataHoarder/comments/zhxv1x/twitter_to_begin_purging_accounts/
17:12:26<joepie91|m>JAA: an alternative option is to repurpose the search bruteforce thingem that I once wrote for... webshots? which adaptively refined its search query based on whether there seemed to be more results beyond the hard-coded X amount. but that's very much a sequential process, and will probably be slower than the ngram approach on average
17:12:27<@JAA>I wonder if it'd be more effective to just bruteforce through three-letter combinations. Or maybe even two letters with more pagination.
17:12:44<joepie91|m>that is effectively what the ngram approach is, just in order of frequency
17:13:22<joepie91|m>basically you'd be searching N-letter sequences in order of frequency of appearance in (eg.) English
17:13:32<joepie91|m>as a language
17:13:48<joepie91|m>that way you optimize for scraping the high-result-count ones first
17:14:10<joepie91|m>I believe that this is part of Google's n-gram dataset somewhere
17:14:16<@JAA>Ah, that's what you mean.
17:14:27<@JAA>Google's Ngram datasets are word combinations, not letter combinations.
17:15:17<joepie91|m>https://en.wikipedia.org/wiki/Trigram
17:15:23<joepie91|m>hm, I thought there was a letter dataset also
17:16:09<joepie91|m>(which afaik is used in google's language detection thingem)
17:18:59<@JAA>http://practicalcryptography.com/cryptanalysis/letter-frequencies-various-languages/english-letter-frequencies/ has one.
17:20:01<@JAA>Apparently there are 20 three-letter combinations that never appear in that data. Heh
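joepie91's plan, searching short letter sequences in order of how common they are in English so the high-yield queries run first, could be sketched as below. The frequency table here is a tiny illustrative stand-in, not the practicalcryptography.com dataset JAA linked; a real run would load that data and feed each trigram to the registry's search endpoint.

```python
from itertools import product
from string import ascii_lowercase

# Illustrative counts only (hypothetical values); a real run would load
# the English trigram counts from the dataset linked above.
TRIGRAM_COUNTS = {"the": 475, "ing": 385, "and": 303, "oog": 12, "zqx": 0}

def search_order(counts):
    """All 26^3 trigrams, most frequent first, dropping the ones that
    never occur in the frequency data (JAA counted 20 such combinations
    in the real dataset)."""
    all_trigrams = ("".join(t) for t in product(ascii_lowercase, repeat=3))
    ranked = [(counts.get(t, 0), t) for t in all_trigrams]
    return [t for n, t in sorted(ranked, key=lambda x: (-x[0], x[1])) if n > 0]
```

With the toy table, only the five listed trigrams have counts, so everything else is dropped; with real data, nearly all 17,576 combinations would survive, ordered so that the queries returning the most packages are issued first.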
17:21:20<@JAA>My other idea is to simply only do the first few hundred pages of the packages list for now. Nearly all of the other 37k ones are spam. There were only 12k packages as of June, now there are well over a million.
17:21:57<@JAA>The API doesn't return a package's creation date anywhere, but it does let you sort by it, so that's nice.
17:22:58<@JAA>Most of those spam packages can't be downloaded either because the GitHub repos are gone as well (if they ever existed in the first place, not sure how that works).
17:37:30<@JAA>Yeah, I'll do that for now since it's an easy adjustment. Should get 99.9% of all relevant data. I'll look into the other options while it's running.
17:55:22<@JAA>Maybe we should feed all those package repositories through #gitgud as well. I'm logging them, anyway.
17:56:15<@JAA>Running smoothly now, by the way. ETA for those oldest 15k packages is a couple hours. :-)
17:56:31<@arkiver>JAA: so only API data
17:56:39<@arkiver>are you planning to also get everything else related to them?
17:58:19<@JAA>I'll look into grabbing the package pages, yeah. Don't think there's anything beyond that.
17:59:18<@JAA>Archiving the website through AB went about as well as you might expect given the broken server.
17:59:31<@JAA>That includes the blog.
18:00:07<@JAA>The other remaining thing is those 'Deprecated redirects that supported downloading Electron symbols and headers'. I haven't looked much into that yet.
18:06:01<@JAA>Oh yeah, and the Atom installers, but I'm working on those.
18:18:30<schwarzkatz|m>are there even any good news regarding that site lately
19:09:03hackbug quits [Client Quit]
19:15:49hackbug (hackbug) joins
19:32:11<@JAA>joepie91|m: Response times are all over the place: https://transfer.archivete.am/inline/aDcev/atom_api_packages_response_times_only_successful.png https://transfer.archivete.am/inline/Gcc65/atom_api_packages_response_times_first_to_successful.png
19:33:02<@JAA>x-axis is page number, y-axis is time between request and response. First graph is only the successful request, second graph is first request to successful response.
19:33:52<@JAA>Seems to be slightly linear up to maybe page 100 or so, but I don't know what to read into this noise.
19:35:13<@JAA>The linear trend might also be due to how my crawl is starting. I'm not sending the first 100 requests at once, it's slowly ramping up.
19:55:01<@JAA>Atom installers are in AB now. Looks like they only have versions 1.58.0 and up on https://atom-installer.github.com/ now, but there's no actual list. I went with everything since 1.50.0 in the list just in case.
19:55:29<@JAA>The installers are in the repo anyway, and the announcement said that'll stay, so these shouldn't really be at risk.
20:11:11jacksonchen666_ (jacksonchen666) joins
20:14:07jacksonchen666 quits [Ping timeout: 248 seconds]
20:16:00jacksonchen666_ is now known as jacksonchen666
20:24:33jacksonchen666 quits [Remote host closed the connection]
20:25:14jacksonchen666 (jacksonchen666) joins
21:11:28lennier1 quits [Ping timeout: 268 seconds]
21:13:13lennier1 (lennier1) joins
21:22:55jacksonchen666 quits [Ping timeout: 248 seconds]
21:24:42jacksonchen666 (jacksonchen666) joins
21:28:43fishingforsoup_ quits [Read error: Connection reset by peer]
21:37:30<tech234a>JAA: 'Deprecated redirects that supported downloading Electron symbols and headers' is linked to https://www.electronjs.org/blog/s3-bucket-change in the announcement, looks like it's just an S3 bucket change
21:37:57<@JAA>tech234a: Yes, but I haven't been able to find a list of objects in that bucket. The redirects from the old URL will go away.
21:38:14<tech234a>good point
21:38:33<@JAA>I ended up at https://github.com/electron/electron/blob/main/script/release/release.js as a hint of what might be there, but it's a mess.
21:38:58<@JAA>Not to mention that it almost certainly changed over the years.
21:40:34<tech234a>some URL references here https://github.com/atom/apm/issues/322#issuecomment-90477573
21:41:15<tech234a>Google also pulls up this URL: https://gh-contractor-zcbenz.s3.amazonaws.com/atom-shell/dist/v9.0.0/SHASUMS256.txt
21:41:48<tech234a>you may be able to use the Electron version history to get the version numbers, then the SHASUMS256.txt file to list the files
21:41:58<@JAA>Ah, nice find!
21:49:42<tech234a>there's also https://gh-contractor-zcbenz.s3.amazonaws.com/atom-shell/dist/index.json
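tech234a's suggestion, a version list plus each version's SHASUMS256.txt to enumerate the bucket's files, could be wired up roughly like this. The base URL is the one quoted above; the parsing assumes the standard `sha256sum` output layout (digest, two spaces, filename), which is what these files normally contain, though I haven't verified every historical version follows it.

```python
BASE = "https://gh-contractor-zcbenz.s3.amazonaws.com/atom-shell/dist"

def shasums_url(version):
    """URL of the checksum manifest for one release, e.g. '9.0.0'."""
    return f"{BASE}/v{version}/SHASUMS256.txt"

def parse_shasums(text):
    """Map filename -> sha256 digest from sha256sum-style lines."""
    files = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        digest, _, name = line.partition("  ")
        files[name.lstrip("*")] = digest  # '*' marks binary mode
    return files

def dist_urls(version, shasums_text):
    """Full download URLs for every file listed in a manifest."""
    return [f"{BASE}/v{version}/{name}" for name in parse_shasums(shasums_text)]
```

Version numbers could come from the index.json above or the Electron release history; each manifest then yields the object names to feed into the archival queue.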
21:49:43BlueMaxima joins
21:50:39sec^nd quits [Ping timeout: 248 seconds]
21:50:58<tech234a>not sure where symbol (pdb) files are located though
21:51:56leo60228 quits [Quit: ZNC 1.8.2 - https://znc.in]
21:52:19leo60228 (leo60228) joins
21:53:31sec^nd (second) joins
21:55:16<schwarzkatz|m>why is it so awfully quiet here currently, where is everybody :c
21:55:43<@JAA>Can someone let schwarzkatz|m know that their Matrix homeserver is broken?
21:59:44<@JAA>At least these days, Electron has a 'symbols server' (copied from Microsoft), see e.g. https://github.com/electron/electron/issues/26961
22:00:13sec^nd quits [Remote host closed the connection]
22:00:36sec^nd (second) joins
22:04:49<TheTechRobo>JAA: I let them know
22:05:56schwarzkatz joins
22:07:56<@JAA>Thanks
22:11:15<@JAA>Oh right, I just believed them, but it isn't even a redirect (probably because AWS doesn't support it for whatever weird reason).
22:11:35<@JAA>Might grab it with qwarc then for dedupe with the new, canonical URL.
22:13:21<@JAA>Oh fun, there's an Atom package called 'search'. Want to guess what happens when you try to retrieve that over the API? Yep.
22:14:53<schwarzkatz>still fetching the packages? :D
22:14:57<@JAA>Apart from that, some other weird package names, and a few random errors, my 'oldest 15k packages' grab is done.
22:15:09<@JAA>Trying, anyway.
22:15:50<schwarzkatz>thanks for doing it, I totally forgot about all of that when I first got here
22:17:02<schwarzkatz>I did like the editor quite a lot in the past and still do (although I don't use it that often anymore)
22:17:36<schwarzkatz>in other news, we got about 1.42 TB of uploadir files to fetch (my messages to arkiver probably got lost in the limbo)
22:19:48<@JAA>There are a handful of package versions which appear to be impossible to download for whatever reason. My grab did 100 retries of the 500s, so I'll call that a good enough attempt.
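The retry loop JAA describes (up to 100 attempts per tarball against a server that mostly answers 500) can be sketched as below. This is not the actual script used; the fetch callable is injected so the sketch stays testable without touching the now-dead endpoint.

```python
import time

def fetch_with_retries(fetch, url, max_attempts=100, delay=0.0):
    """Retry a flaky endpoint that mostly returns 500s.

    `fetch` is any callable returning (status_code, body). Returns the
    body on the first 200, or None after max_attempts server errors.
    """
    for _ in range(max_attempts):
        status, body = fetch(url)
        if status == 200:
            return body
        if status >= 500:
            time.sleep(delay)  # pass a real backoff delay in practice
            continue
        raise RuntimeError(f"unretryable status {status} for {url}")
    return None  # give up, like the handful of impossible tarballs
```

Returning None rather than raising after exhausting retries matches the pragmatic call above: log the failure and move on, since a few tarballs simply never succeed.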
22:21:05<@JAA>These are the tarballs that failed: https://transfer.archivete.am/inline/KVrLl/atom.io-api-packages-500-failed-tarballs
22:21:49fishingforsoup joins
22:22:39<@JAA>And these are the four package names which failed: search https://github.com/agen-slot-online-gacor/slot-online-deposit-via-qris.git packages/agen-slot-linkaja https://atom.io/packages/daftar-judi-slot-online-deposit-via-pulsa-10rb
22:22:46<@JAA>Yes, these are package names, not URLs that failed.
22:23:00<@JAA>The first one is just a stupid API design failure, and the other three are spam.
22:24:58<schwarzkatz>hey, please don't miss my favorite slot online deposit via qris plugin :(
22:27:22fishingforsoup quits [Client Quit]
22:28:07<@JAA>So I'm considering the critical part of this complete. All packages of any significant value should be in these 35 GiB of WARCs, and this should work with `apm` and some environment variable tweak to point it at the WBM once the data's there.
22:29:54<@JAA>The response code stats don't even look that bad. Only 61% 500s.
22:31:18<schwarzkatz>nice
22:31:20fishingforsoup joins
22:33:07coro leaves
22:38:10<schwarzkatz>did someone start with a script for the uploadir POST downloads? I have no idea how wget-lua works to try something tbh
22:40:09hitgrr8 quits [Client Quit]
22:48:07Island joins
23:01:39katocala quits [Remote host closed the connection]
23:18:55lostbox joins
23:19:19<lostbox>Is IPFS a good medium for storing large files?
23:25:01schwarzkatz quits [Client Quit]
23:27:35ThreeHM quits [Quit: WeeChat 3.5]
23:28:54schwarzkatz_ joins
23:38:21Lord_Nightmare quits [Quit: ZNC - http://znc.in]
23:41:35sec^nd quits [Ping timeout: 248 seconds]
23:45:09ThreeHM (ThreeHeadedMonkey) joins
23:45:17Lord_Nightmare (Lord_Nightmare) joins
23:45:26fishingforsoup quits [Read error: Connection reset by peer]
23:46:29sec^nd (second) joins
23:52:06ThreeHM quits [Ping timeout: 276 seconds]