00:00:12tzt (tzt) joins
00:20:43yay joins
00:21:09<yay>(reposting) so http://glencoe.mheducation.com/ has a shutdown announcement for a month from now. One example of the sites is http://glencoe.mheducation.com/sites/0078807239/
00:21:38<Jake>Thanks, I know thuban looked into that a little bit, but we hadn't found a concrete method for discovering the websites.
00:21:55<yay>I was wondering what the id after sites/ was, but turns out it's the ISBN-10 of the published McGraw-Hill Education book
00:22:11<Jake>Yup.
00:22:13<thuban>yes, see previous discussion: https://hackint.logs.kiska.pw/archiveteam-bs/20220519#c318845
00:23:05<thuban>i don't think we came to a conclusion about whether this could be searched through #// or #Y.
00:23:56<yay>I've been manually (captcha) copying ISBNs from https://isbnsearch.org/search?s=McGraw-Hill+Education to check later
00:25:07<yay>also all of the ISBNs I've seen so far either start with 12 or 07, which might make a distributed double-check for sanity later on feasible
00:26:15<thuban>yay: unfortunately not all of the sites are associated with real isbns, and very very many of them are not in the 12 or 07 blocks
00:27:20<thuban>if we're not going to do a full distributed run, i think the best way forward would be for someone with a few ips to play with to finish off the brute-force (i've been probing them occasionally and i don't think they'll be too quick on the banhammer)--shouldn't require anything other than curl one-liners
00:28:23<yay>back-of-the-envelope calculation puts it at about a month to index at 300 requests per minute, ignoring the existence of the possible final 'X'
00:29:02<Jake>I'd be happy to help with my IP, thuban, as we know there are non-ISBN entries.
00:29:19<Jake>(If it was just ISBN entries, I'd say let's find a list, but it's not :) )
00:29:56thetechrobo_ is now known as TheTechRobo
00:31:00<thuban>the command i was using is `curl -w '%{url}\t%{http_code}\n' -so /dev/null --cookie "Sayonara=Dewa_mata" -IZ --parallel-max 300 'http://glencoe.mheducation.com/sites/[RANGE]/'`
00:31:33<yay>so 11 million possible ids... could we find enough people to do 2 million each?
00:32:27<TheTechRobo>thuban: how much storage space would be 2 million items?
00:32:34<@JAA>How do you get to 11 million?
00:32:35<TheTechRobo>i don't have much storage space or bandwidth, but i'd love to help
00:33:30<yay>JackThompson05 the ids are 10 digits, which gives us 10 million possible ids
00:33:39<yay>multiply that by 1.1x to account for a possible final 'X'
00:33:59<thuban>the cookie bypasses the shutdown notice, 300 is the maximum curl allows (this isn't mentioned in the man page), and RANGE is your block of choice--i was doing 1e9 at a time but you may need to adjust this to avoid ooming
00:34:07<yay>JAA sorry, wrong ping
00:34:16<thuban>i covered up to 1799999999 before i got b&
00:34:33<@JAA>Oh right, yeah.
00:34:37<@JAA>I miscounted there.
00:34:39<yay>that's a pretty slow banhammer :D
00:34:44<Jake>thuban: if we get some ranges assigned, as well as somewhere to report back matches, we can get something done.
00:35:23<thuban>TheTechRobo: using the output format of that curl command, 1e9 urls ~= 5G
00:35:37<thuban>Jake: i was just thinking of that; maybe an etherpad?
00:35:48<Jake>Sure, something like that.
00:36:30<yay>A secondary concern might be somehow preserving the functionality of the "Self Check Quizzes"
00:37:16<yay>but that comes later
00:37:22<@JAA>yay: Actually no, I didn't miscount. It's 11 billion, not million.
00:37:28<thuban>yay: iirc that shouldn't require special attention (assuming we feed the hits into archivebot)
00:38:08<yay>JAA now it's my turn to ask: how did you get 11 billion?
00:38:26<@JAA>10^10 = 10 billion
00:39:04<@JAA>Seven digits would be 10 million.
00:39:32<thuban>see this is why i just use scientific notation
00:40:17<@JAA>Yeah, 1.1e10 URLs to do, and you did 1.8e9.
00:40:43opsoyo|m quits [K-Lined]
00:40:43thibaultmol|m quits [K-Lined]
00:40:43ase|m quits [K-Lined]
00:40:43rewby|m quits [K-Lined]
00:40:43thermospheric quits [K-Lined]
00:40:43nyuuzyou quits [K-Lined]
00:40:43mpeter|m quits [K-Lined]
00:40:43Maakuth|m quits [K-Lined]
00:40:43audrooku|m quits [K-Lined]
00:40:43@Sanqui|m quits [K-Lined]
00:40:43DigitalDragon quits [K-Lined]
00:40:43Ajay quits [K-Lined]
00:40:43madpro|m quits [K-Lined]
00:40:43igneousx quits [K-Lined]
00:40:43britmob|m quits [K-Lined]
00:40:43applepiesavannah|m quits [K-Lined]
00:40:43tech234a|m quits [K-Lined]
00:40:48<yay>*1.1e9
00:40:57<yay>but yeah, I underestimated by a factor of 10
00:41:12<@JAA>No, 1.1e10.
00:41:23<@JAA>Unless there are some restrictions to the digits that I'm unaware of.
00:41:44<@JAA>^[0-9]{9}[0-9X]$, right?
00:41:57<thuban>as far as we know, right.
00:42:18<@JAA>That's 1.1e10.
00:42:40pabs quits [Quit: Don't rest until all the world is paved in moss and greenery.]
00:42:59<yay>that's 11 digits?
00:43:11<@JAA>No, ten.
00:43:11<yay>11,000,000,000
00:43:21<yay>according to https://coolconversion.com/math/scientific-notation-to-decimal/
00:43:40<@JAA>1.1e10 = 11'000'000'000
00:44:48<@JAA>Oh, 11 digits in the number, lol, yes.
00:44:56<@JAA>Anyway, 11 billion.
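[Editor's note: the ID-space size the channel converges on can be checked mechanically. A minimal shell sketch of the arithmetic, assuming the `^[0-9]{9}[0-9X]$` pattern discussed above is correct:]

```shell
# IDs match ^[0-9]{9}[0-9X]$: ten characters, where the last position may be
# a digit or a literal 'X'.
numeric=10000000000    # all-digit IDs: 10^10
with_x=1000000000      # nine digits followed by 'X': 10^9
total=$((numeric + with_x))
echo "$total"          # 11000000000, i.e. 1.1e10
```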
00:45:12<thuban>https://pad.notkiska.pw/p/glencoe
00:45:14<@JAA>thuban: What did you run those 1.8B with, and how long did it take?
00:45:31<@JAA>Network/hardware I mean.
00:46:16<thuban>JAA: a seedbox, and about a week (but i had some miscellaneous downtime so that's an overestimate).
00:46:27<yay>the main problem is that glencoe has `User-agent: * \n Disallow: /` in its robots.txt
00:46:44<@JAA>We don't care about robots.txt here. :-)
00:47:01<yay>that does mean that practically no search engines are going to be useful in discovering sites
00:47:10<@JAA>Yeah
00:47:16<@JAA>Although a few do show up.
00:51:02<Jake>my outstanding questions: how are we transferring matches? is there a "good" way of detecting the ban and stopping?
00:52:43<yay>also how do we search 11,000,000,000 possible numbers in a month?
00:53:06<@JAA>Given how small the results are, (compress and) upload to transfer?
00:53:27<thuban>Jake: presumably by ordinary file transfer to whoever is prepared to collate (i volunteer); i used `watch 'tail'` and kept an eye out for consistent 000s (which curl uses to indicate connection failure--the ban appears to take the form of a refusal to connect/timeout)
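[Editor's note: the ban check thuban describes can be sketched as a one-liner over the output file. curl writes `000` as the status code when the connection fails outright, so a run of `000` lines near the end of the output suggests a ban. The sample data below is fabricated for illustration; the real output is tab-separated, which awk's default field splitting also handles:]

```shell
# Count outright connection failures (status '000') in a results file.
cat > /tmp/glencoe_sample.txt <<'EOF'
http://glencoe.mheducation.com/sites/0000000001/ 404
http://glencoe.mheducation.com/sites/0000000002/ 000
http://glencoe.mheducation.com/sites/0000000003/ 000
EOF
failures=$(awk '$2 == "000"' /tmp/glencoe_sample.txt | wc -l)
echo "$failures"
```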
00:54:20<Jake>Great, We should split them into slightly smaller segments to take. 1.8B took... a week with one IP?
00:55:17<@JAA>thuban: How many sites did you find in those 1.8B?
00:55:20<TheTechRobo>thuban: what's your connection speed?
00:55:43<@JAA>Network location relative to the site might also matter.
00:55:56<TheTechRobo>yeah, that too. I'm in Ontario.
00:56:24<yay>I see no reason to send anything other than a list of valid matches
00:56:36<yay>which should be pretty trivially small
00:56:42<@JAA>Yeah
00:56:49<@JAA>Definitely not the whole output. :-)
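[Editor's note: a minimal filter in the spirit of this exchange: drop plain 404s and keep everything else, so both hits and `000` connection failures survive and failed spans can be rerun later. The sample lines and the `/tmp` paths are fabricated for illustration:]

```shell
# Keep all lines whose status code is not 404.
cat > /tmp/glencoe_out.txt <<'EOF'
http://glencoe.mheducation.com/sites/0078807239/ 200
http://glencoe.mheducation.com/sites/0000000002/ 404
http://glencoe.mheducation.com/sites/0000000003/ 000
EOF
awk '$2 != "404"' /tmp/glencoe_out.txt > /tmp/glencoe_filtered.txt
kept=$(wc -l < /tmp/glencoe_filtered.txt)
echo "$kept"
```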
00:56:58pabs (pabs) joins
00:57:29<thuban>Jake: about, yeah. i did that in blocks of 1e9 but, as i said, you may need to adjust this
00:57:37<thuban>JAA: grepping...
01:00:29<Jake>It appears to be "selfhosted" on their own ASN. I get 10ms in VA.
01:00:48<Jake>Based on traceroute, likely to be in NY.
01:01:03<yay>do you ever do some network trangulation :)
01:01:20<yay>**triangulation
01:02:58dm4v quits [Ping timeout: 265 seconds]
01:04:43dm4v joins
01:04:45dm4v quits [Changing host]
01:04:45dm4v (dm4v) joins
01:05:18<Jake>ping.pe confirms my priors, seems to be NY. https://jakel.rocks/up/6f2c2c29f0a7d2d6/image.png
01:09:40yay quits [Remote host closed the connection]
01:13:06yay joins
01:13:43<yay>hah, I tried to expand the restricted range of {1800000000..1899999999} and earlyoom killed everything
01:18:39<thuban>...shell expansion? you should be using curl's range syntax ('[1800000000-1899999999]')
01:19:10<yay>oops
01:19:11<yay>:p
01:23:54tbc1887 (tbc1887) joins
01:31:02<TheTechRobo>TIL
01:31:13<Jake>curl version minimum is 7.75.0 for --write-out url support.
01:31:36<@JAA>url_effective is equivalent since this isn't following redirects.
01:31:52<Jake>๐Ÿ‘just read the page, haha.
01:31:57<@JAA>:-)
01:32:49<yay>I'm compiling curl now
01:32:52<yay>at least it's pretty quick
01:33:22<@JAA>http://glencoe.mheducation.com/sites/2000000000/ Yeahhh, that looks like a valid ISBN. lol
01:35:44<Jake>heh.
01:36:15<yay>I like to imagine that the sysadmins got lazy
01:37:07<thuban>there are a lot of hits in the 00 block that just look like tests
01:38:53eroc19908 (eroc1990) joins
01:39:42eroc1990 quits [Ping timeout: 245 seconds]
01:40:30Arcorann (Arcorann) joins
01:43:30yay quits [Client Quit]
01:46:51<@JAA>Hmm, looks like there are also random errors. Only well under a per mille for me, but still, should be rerun.
01:48:00<Jake>We're probably hitting it much harder than during the first run.
01:54:31<@JAA>20 minutes for a bit over 6M here.
01:55:22<@JAA>Does curl store all range URLs in memory, or why does it eat so much? That'd be impressively dumb.
01:55:50<thuban>not sure exactly. i also found it suspicious
02:00:12<Jake>Has to be a leak of some kind... hopefully that's not intended behavior...
02:00:50<@JAA>Hmm yeah, memory usage is growing as well. Yikes.
02:03:08<@JAA>(It does not store all URLs in memory as far as I can tell.)
02:09:24<thuban>oh, almost forgot: JAA, i found 16297 sites in the 1.8e9 urls i tested (for a ~0.001% hit rate).
02:10:09<@JAA>Tremendous.
02:10:32<@JAA>We definitely don't need to worry about the size of the filtered files. :-)
02:12:49tech234a|m joins
02:14:33<Arcorann>What's going on?
02:14:37opsoyo|m joins
02:14:43<@JAA>Yeah, had to kill my process as it almost ran out of memory. It ate over 20 GiB after 11M URLs.
02:14:43madpro|m joins
02:14:49ase|m joins
02:14:49DigitalDragon (DigitalDragon) joins
02:14:54Maakuth|m joins
02:15:00britmob|m joins
02:15:00audrooku|m joins
02:15:06Ajay joins
02:15:09<Jake>We gotta rerun 000s as well at some point.
02:15:11mpeter|m joins
02:15:17igneousx (igneousx) joins
02:15:23rewby|m joins
02:15:24<thuban>Arcorann: we're brute-forcing the glencoe sites (see https://pad.notkiska.pw/p/glencoe)
02:15:29thermospheric (Thermospheric) joins
02:15:29Sanqui|m (Sanqui) joins
02:15:29@ChanServ sets mode: +o Sanqui|m
02:15:35applepiesavannah|m joins
02:15:41nyuuzyou (nyuuzyou) joins
02:17:48yay joins
02:36:11<@JAA>I'm pretty impressed though. 6-7k req/s on a single processor here. Not bad at all.
02:36:29<Jake>Just... memory intensive for seemingly no good reason.
02:37:27<@JAA>Yeah
02:37:32<@JAA>I'm doing it in 5M chunks now.
02:38:05<@JAA>Not feeling like trying to debug that and figuring out where all that memory goes.
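[Editor's note: the chunked approach JAA mentions can be sketched as a loop that splits a block into fixed-size sub-ranges and invokes curl once per chunk, so each process's memory use stays bounded. The block boundaries and chunk size below are illustrative; the curl invocation (thuban's command from earlier) is left commented out:]

```shell
# Split a 1e7 sub-block into 5M chunks and print the curl range for each.
start=1800000000
end=1809999999        # a 1e7 sub-block, chosen for illustration
chunk=5000000         # 5M URLs per curl invocation
lo=$start
n=0
while [ "$lo" -le "$end" ]; do
  hi=$((lo + chunk - 1))
  if [ "$hi" -gt "$end" ]; then hi=$end; fi
  echo "[$lo-$hi]"
  # curl -w '%{url}\t%{http_code}\n' -so /dev/null --cookie "Sayonara=Dewa_mata" \
  #   -IZ --parallel-max 300 "http://glencoe.mheducation.com/sites/[$lo-$hi]/"
  n=$((n + 1))
  lo=$((hi + 1))
done
echo "$n chunks"
```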
02:45:38<yay>I wonder what the cookie contents("Sayonara=Dewa_mata") mean
02:46:53<@JAA>According to a brief search, it's Japanese. Sayonara = Goodbye, and Dewa mata = See you later
02:59:19onetruth joins
03:05:11eroc19908 quits [Client Quit]
03:12:30eroc1990 (eroc1990) joins
03:59:44nerdguy1138 (nerdguy1138) joins
04:00:38NinCollin|1 quits [Remote host closed the connection]
04:15:49yay quits [Ping timeout: 265 seconds]
04:31:10nerdguy1138 quits [Client Quit]
04:31:42nerdguy1138 (nerdguy1138) joins
04:32:58nerdguy1138 quits [Client Quit]
04:34:04nerdguy1138 (nerdguy1138) joins
04:40:19nerdguy1138 quits [Client Quit]
04:41:39nerdguy1138 (nerdguy1138) joins
05:02:46yay joins
05:09:22DigitalDragon quits [Client Quit]
05:09:26DigitalDragon (DigitalDragon) joins
05:13:49yay quits [Ping timeout: 265 seconds]
05:35:07tbc1887 quits [Ping timeout: 245 seconds]
05:45:54Iki1 quits [Remote host closed the connection]
05:46:28Iki1 joins
06:11:04DiscantX joins
06:38:31nerdguy1138 quits [Client Quit]
06:39:23nerdguy1138 (nerdguy1138) joins
06:40:40nerdguy1138 quits [Client Quit]
06:42:27nerdguy1138 (nerdguy1138) joins
06:47:02nerdguy1138 quits [Client Quit]
06:48:36nerdguy1138 (nerdguy1138) joins
06:51:29eroc19905 (eroc1990) joins
06:51:35seednode4943 quits [Client Quit]
06:51:35eroc1990 quits [Client Quit]
06:51:35driib quits [Client Quit]
06:51:35TheTechRobo quits [Remote host closed the connection]
06:51:35onetruth quits [Remote host closed the connection]
06:51:35IDK_ quits [Client Quit]
06:51:38seednode4943 (seednode) joins
06:51:44thetechrobo_ joins
06:51:45onetruth joins
06:51:51IDK_ joins
06:52:07thetechrobo_ quits [Remote host closed the connection]
06:52:08driib (driib) joins
06:52:30thetechrobo_ joins
06:55:57nerdguy1138 quits [Read error: Connection reset by peer]
07:07:35JTL quits [Quit: WeeChat 2.9]
07:08:52JTL (jtl) joins
07:55:55chrismeller (chrismeller) joins
07:56:18chrismeller quits [Remote host closed the connection]
07:56:41chrismeller (chrismeller) joins
08:04:47<lennier1>arkiver: Three sets of App Store links extracted from the metadata -- https://transfer.archivete.am/8ARa0/developerLinksUnique.zip https://transfer.archivete.am/E1VaL/screenshotLinksUnique.zip https://transfer.archivete.am/xTYpY/artworkLinksUnique.zip
08:07:20<thuban>why in god's name doesn't curl have a global suppress output flag ._.
08:14:04<@OrIdow6>Anything that needs me? Can't stand reading logs
08:14:10<@OrIdow6>sonick: How did you generate your list?
08:22:41<thuban>all other 'local' options are implicitly applied to all urls (unless reset with --next), but -o has to be separately specified for every single url-pattern. (yes, this means that `curl http://example.com/1.html http://example.com/2.html` behaves differently from `curl http://example.com[1-2].html`.)
08:24:52<thuban>on the one hand, yes, it makes sense for -o to be special in this way. on the other the manpage is not terribly clear about it and it makes using -w perfectly wretched
08:26:56<thuban>(bonus: the manpage still claims to look for an rc file in "$XDG_CONFIG_HOME/.curlrc" instead of "$XDG_CONFIG_HOME/curlrc" like a sane program, even though this has in fact been fixed https://github.com/curl/curl/issues/8208)
08:36:16sonick quits [Client Quit]
09:25:22<systwi>Sanqui: re: library classification systems, yes I do also agree that that would be a great start for broad (and then gradually finer) topics on areas for archival.
09:26:40march_happy (march_happy) joins
10:37:16pabs quits [Client Quit]
10:43:28pabs (pabs) joins
11:00:57sec^nd quits [Ping timeout: 245 seconds]
11:06:10sec^nd (second) joins
11:06:47DiscantX quits [Ping timeout: 245 seconds]
11:27:38spirit quits [Quit: Leaving]
11:48:46Larsenv quits [Quit: ZNC 1.8.2+deb2build5 - https://znc.in]
12:00:24Larsenv (Larsenv) joins
12:06:11DiscantX joins
12:14:30<@arkiver>JAA: thanks for that app store channel! sometimes I miss a channel
12:18:51<@arkiver>JAA: Jake: I can probably get us a project going for those ISBNs
12:19:09<@arkiver>do we have any idea if the site can handle many requests or not?
12:19:38BlueMaxima quits [Read error: Connection reset by peer]
12:43:16<thuban>it seems to be holding up okay under several of us hammering it at a few thousand requests per second
12:57:20HP_Archivist (HP_Archivist) joins
13:26:12chrismeller quits [Ping timeout: 265 seconds]
13:54:55AlsoHP_Archivist joins
13:58:43HP_Archivist quits [Ping timeout: 265 seconds]
14:10:47thetechrobo_ is now known as TheTechRobo
15:00:57Arcorann quits [Ping timeout: 245 seconds]
15:20:03bonga quits [Remote host closed the connection]
15:22:11bonga joins
16:03:17eroc19905 quits [Ping timeout: 265 seconds]
16:06:22DiscantX quits [Ping timeout: 245 seconds]
16:07:25eroc1990 (eroc1990) joins
16:08:15qwertyasdfuiopghjkl quits [Ping timeout: 265 seconds]
16:19:51march_happy quits [Ping timeout: 265 seconds]
16:21:22march_happy (march_happy) joins
16:26:08march_happy quits [Ping timeout: 265 seconds]
16:26:53march_happy (march_happy) joins
16:33:27march_happy quits [Ping timeout: 245 seconds]
16:36:40<@JAA>The curl --parallel-max >300 bug is fixed in master now, by the way: https://github.com/curl/curl/pull/8930
16:37:30<@JAA>arkiver: I sent you an invite for it at the time, by the way. :-) I'd recommend at least logging those if your connection's still unreliable and randomly drops messages.
16:44:38T31M quits [Quit: ZNC - https://znc.in]
16:45:28T31M joins
16:46:09<Jake>Do you know if the weird memory leak is fixed in master?
16:46:18<@JAA>Nope
16:46:52<@JAA>I haven't tested master at all, just edited that file and sent the PR. :-)
16:47:03<Jake>Haha, alright. I'll give it a shot
16:47:45<@JAA>I got banned.
16:48:09<Jake>:(
16:49:46<@JAA>A bit over 300M requests in 14-ish hours did the trick.
16:50:04<Jake>I wonder if it's entirely manual.
16:50:48<@JAA>Did anyone else get banned as well? Seems unlikely that they'd only hit one of us if it's manual.
16:53:08<Jake>My IP isn't banned, but it looks like I forgot to start it again last night.
17:03:09AlsoHP_Archivist quits [Client Quit]
17:03:40<@JAA>thuban: ^
17:07:19<Jake>(The leak is better in master, but not fixed. 4 minutes of run time and I'm at half a gig and only going up.)
18:10:53<Jake>Did we ever decide on how we were rerunning 000s for glencoe?
18:12:22<@JAA>Yes, but the comments were removed from the pad. Just include them in the filtered output, and we'll deal with them in the end.
18:12:41<@JAA>That's why the filter commands only remove 404s.
18:13:02<Jake>👍
18:44:23<Jake>http://glencoe.mheducation.com/sites/1911124000/information_center_view0/ more test sites.
19:07:12LeGoupil joins
19:07:32yay joins
19:14:55<yay>I got ip banned for a bit from glencoe after a mere 500K requests
19:19:06<Jake>Very interesting. I'm at 2.9M and no ban yet.
19:26:29<yay>I'm also pretty sure my 2GB server crashed from OOM
19:26:33<yay>I can't SSH into it
19:34:37eroc19904 (eroc1990) joins
19:35:57eroc1990 quits [Ping timeout: 265 seconds]
19:37:14<@JAA>2 GB? If you ran anything over like a million URLs in a single process there, yeah, I'm sure it OOM'd.
19:37:26<@JAA>Probably less than that, even.
19:38:29Muad_Dib quits [Remote host closed the connection]
19:38:51Muad-Dib joins
20:01:47bonga quits [Ping timeout: 245 seconds]
20:03:10bonga joins
20:08:01DiscantX joins
20:39:48yay quits [Remote host closed the connection]
20:40:52yay joins
20:59:42DiscantX quits [Ping timeout: 265 seconds]
21:02:29<h2ibot>Usernam edited List of websites excluded from the Wayback Machine (+27): https://wiki.archiveteam.org/?diff=48652&oldid=48613
21:03:29<h2ibot>Systwi edited Template:Wikis (+39, Formatting correction.): https://wiki.archiveteam.org/?diff=48653&oldid=48639
21:16:34LeGoupil quits [Client Quit]
21:16:35<Jake>cc JAA: you made it to the curl guy's twitter https://twitter.com/bagder/status/1530658224617181185
21:16:55<@JAA>:-)
21:41:07<@JAA>I suggest adding --max-time (-m) to the curl command. I've had some connections hang for quite a while. Will result in a few more 000s of course.
21:59:48onetruth quits [Remote host closed the connection]
21:59:48TheTechRobo quits [Remote host closed the connection]
21:59:48yay quits [Remote host closed the connection]
21:59:48rmrm quits [Remote host closed the connection]
21:59:58onetruth joins
21:59:58TheTechRobo (TheTechRobo) joins
22:00:08yay joins
22:44:14thetechrobo_ joins
22:44:35thetechrobo_ quits [Remote host closed the connection]
22:44:59thetechrobo_ joins
22:46:53yay quits [Remote host closed the connection]
22:47:21TheTechRobo quits [Ping timeout: 265 seconds]
22:54:31march_happy (march_happy) joins
22:59:05march_happy quits [Ping timeout: 265 seconds]
22:59:06bonga quits [Read error: Connection reset by peer]
22:59:52march_happy (march_happy) joins
23:00:00bonga joins
23:31:34BlueMaxima joins
23:39:31<Jake>I appear to have been banned.