00:08:09nicolas17 quits [Ping timeout: 272 seconds]
00:12:38nicolas17 joins
00:19:28magmaus3 (magmaus3) joins
00:52:29jasons quits [Ping timeout: 272 seconds]
01:13:01Naruyoko5 quits [Remote host closed the connection]
01:13:20Naruyoko5 joins
01:41:15Barto quits [Ping timeout: 272 seconds]
01:47:35Barto (Barto) joins
01:55:33jasons (jasons) joins
02:35:43Megame quits [Client Quit]
02:55:50jasons quits [Ping timeout: 240 seconds]
03:00:50pseudorizer quits [Client Quit]
03:01:39pseudorizer (pseudorizer) joins
03:22:53pabs quits [Read error: Connection reset by peer]
03:23:42pabs (pabs) joins
03:30:53<@JAA>I've been battling with SAP's Q&A site that'll be taken down very soon. Their server is very broken. I kind of expected that, given it's SAP, but it's impressive just how bad it is.
03:31:34<@JAA>We got decent coverage from ArchiveBot already, but that can't have gotten everything. Among the things missed will be pagination of answers and attachments in answers or comments, I believe.
03:31:49<@JAA>https://answers.sap.com/ is the site.
03:32:16<@JAA>The good news is there don't seem to be any significant rate limits.
03:34:14<pabs>were they missed due to JS or?
03:34:46<@JAA>Yeah
03:35:25<@JAA>XHR that returns HTML in JSON in JSON for some data, and so on.
03:35:39<nicolas17>yo. dawg.
03:35:40<@JAA>Real pleasure to work with.
03:35:42<pabs>fuuuugly
03:36:46<pabs>hmm, image attachments look enumerable? https://answers.sap.com/storage/attachments/2234744-image.png
03:36:56<@JAA>The server also returns truncated responses and extra data after completed responses.
03:37:05<nicolas17>are they all called image.png?
03:37:08<@JAA>They are not.
03:37:44<pabs>the one I linked is from here btw https://answers.sap.com/questions/14034200/how-can-i-filter-a-listpicker-to-display-only-dist.html
03:37:54<@JAA>Earlier, I saw 404s on URLs that had been returning 302s just before. Shortly after writing about it in #archivebot, it started returning 302s again.
03:38:25<@JAA>I'll try my best, but this is a shitshow.
03:39:05<@JAA>Oh yeah, I'm seeing that 404 thing again right now.
03:39:13<@JAA>I'll probably get a lot of false 404s.
03:39:43<@JAA>It seems to only affect IDs that aren't questions anyway, so maybe that's 'fine', but yeah.
03:40:15<pabs>hmm, on that page above, the attachments are just hrefs
03:40:34<pabs>no JS needed...
03:40:51<nicolas17>I loaded that answer with NoScript enabled
03:40:56<nicolas17>I could see the inline image and the link
03:41:07<@JAA>Only for attachments in the question, not in answers.
03:41:12<pabs>ah
03:41:33<@JAA>Nor in comments on answers.
03:41:46<@JAA>The comments are where that JSON in JSON happens.
03:42:58<pabs>the answer attachments are in JSON in the HTML it seems
03:43:33<pabs>curl -s https://answers.sap.com/questions/14031315/dunning-notice-on-small-balance.html | grep /attachments/
03:44:10<@JAA>Yeah right, there's another layer of HTML around it for the first page of answers, too.
03:44:22<@JAA>So HTML in JSON in JSON in HTML :-)
03:44:25<nicolas17>/o\
03:44:41<pabs>| grep -oE '/storage/[^\"]+'
03:44:59<@JAA>Anyway, I've got all these things figured out already, the remaining problem is their server not cooperating at the HTTP level.
03:45:02<pabs>can just grep the WARC of the answer HTML :)
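As a rough illustration of the layering described above, here is a Python sketch combining pabs's grep approach with the unwrapping JAA's grab has to do; the comment-XHR field names ('body', 'html') are pure guesses, since the actual response structure isn't shown in this log:

    import json
    import re
    import requests  # assumption: any HTTP client works; the real grab uses qwarc/aiohttp

    ATTACHMENT_RE = re.compile(r'/storage/[^"\\]+')

    # Question pages: attachment URLs sit directly in the HTML (plus a JSON blob
    # embedded in that HTML for the first page of answers), so a regex over the
    # whole body finds them, same as the grep pipeline above.
    page = requests.get('https://answers.sap.com/questions/14031315/'
                        'dunning-notice-on-small-balance.html').text
    print(ATTACHMENT_RE.findall(page))

    def unwrap_comment_xhr(xhr_body: str) -> str:
        """Peel the HTML out of a comment XHR: HTML in JSON in JSON.
        'body' and 'html' are hypothetical field names."""
        outer = json.loads(xhr_body)       # first JSON layer
        inner = json.loads(outer['body'])  # second JSON layer, itself a JSON string
        return inner['html']               # the HTML payload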
03:45:19<pabs>ah, what's the issue there?
03:45:20<@JAA>I also liked this one:
03:45:22<@JAA>> const simplifiedQuestionView = JSON.parse("true");
03:45:26<nicolas17><JAA> Earlier, I saw 404s on URLs that had been returning 302s just before. Shortly after writing about it in #archivebot, it started returning 302s again.
03:45:51<@JAA>No, the truncated responses and extra data after responses. The latter is probably a wrong Content-Length header.
03:46:10<pabs>how are you detecting that?
03:46:22<@JAA>By getting JSON and HTTP parser errors.
03:47:04<pabs>oh geez...
03:47:32<fireonlive><nicolas17> are they all called image.png? < that's my power move
03:47:36<pabs>does retrying the failing ones help?
03:47:43<@JAA>As I wrote in #archivebot earlier, it takes some effort to fuck up this badly.
03:47:59<@JAA>> Request for https://answers.sap.com/users/login.html failed 721 times
03:48:00<@JAA>:-)
03:48:06<nicolas17>to err is human
03:48:11<nicolas17>to really fuck things up you need a computer
03:48:12<pabs>yeah, just saw that...
03:48:19<pabs>we don't need the login page though :)
03:48:44<@JAA>Yeah, that's a bug in my code. Already fixed that.
03:49:10<fireonlive>lord lol
03:49:12<@JAA>The failures are responses with (presumably) extra data at the end.
03:49:35<@JAA>> * Excess found in a non pipelined read: excess = 97, size = 3632, maxdownload = 3632, bytecount = 0
03:49:40<@JAA>That's what curl emits.
03:49:53<@JAA>And the response ends at a random point in the middle of the HTML.
03:50:02<@JAA>That's why I suspect they're sending the wrong Content-Length.
03:50:08<pabs>is it one particular IP address that is bad or all three of them?
03:50:24fireonlive sits back and wonders how
03:51:03<@JAA>I think all are affected, but I'm not currently logging that information on the errors, so I can't confirm.
03:51:19<@JAA>Highly unlikely it didn't hit all IPs on 721 attempts though.
03:51:53<pabs>and does --http1.1 help?
03:52:10<pabs>nope, it doesn't
03:52:21<@JAA>WARC can only store HTTP/1.1 anyway.
03:52:46<pabs>ah
03:53:17<fireonlive>modern web--
03:53:17<eggdrop>[karma] 'modern web' now has -1 karma!
03:53:22<@JAA>Oh yeah, there's a PROTOCOL_ERROR on HTTP/2 as well, right. :-)
03:53:32<pabs>s/-1/-Inf/
03:53:38<@JAA>SAP--
03:53:39<eggdrop>[karma] 'SAP' now has -1 karma!
03:54:26<fireonlive>SAP--
03:54:26<eggdrop>[karma] 'SAP' now has -2 karma!
03:54:29<pabs>plain http redirects to TLS too, so no way to check that
03:55:13<pabs>love it how even when you get the right Content-Length, the HTML is still broken, no </body></html>
03:55:38<pabs>doesn't even close the <form> tag
03:56:20<fireonlive>full, solid, enterprise software
03:56:33<@JAA>Confirmed I see the excess data error on all three IPs.
03:57:57<@JAA>One item was looping on the login retries, had to just ^C it after almost 10k attempts.
03:58:30<@JAA>Yep, 10k retries, all with that excess data stuff, apparently.
03:58:31<@JAA>> ClientResponseError("400, message='invalid constant string'",)
03:58:35<@JAA>That's how it shows up on qwarc.
03:58:35<fireonlive>wow
03:58:49<pabs>how does SPN do?
03:59:13<pabs>it's interesting, here I don't get the extra data every time, but I have once or twice
03:59:13<@JAA>It might be that the excess data is always there and it's a matter of timing whether your HTTP client considers the transaction done or actually emits an error.
03:59:23<@JAA>Or how are you testing exactly?
03:59:26jasons (jasons) joins
03:59:34<pabs>curl -v ...
03:59:51<@JAA>Yeah, same
04:00:00<pabs>I almost always see this tho: * Connection #0 to host answers.sap.com left intact
04:01:06<pabs>so I guess that means the server is always not closing the connection, and you're right with the timing comment
04:01:07<nicolas17>I think that means "not closing the connection since we may reuse it in the next request"
04:01:13<pabs>ah
04:01:31<nicolas17>* Connection #0 to host google.com left intact
04:01:59<fireonlive><resists joke>
04:02:27<nicolas17>if you do two URLs (same hostname) in the same curl command:
04:02:35<nicolas17>* Connection #0 to host google.com left intact
04:02:37<nicolas17>* Re-using existing connection! (#0) with host google.com
04:03:37<@JAA>Oh
04:03:39<@JAA>lol
04:04:03<@JAA>It seems to break at a hidden <input> containing a CSRF token!
04:04:13<fireonlive>ooh!
04:04:33<fireonlive>is there a token? or is it trying to make one and just catastrophically failing
04:04:33<@JAA>Usually, the response ends with the submit button.
04:04:40<@JAA>Immediately after that button would be the hidden token.
04:04:55<pabs>so the token is the extra data?
04:04:56<fireonlive>ah!
04:04:56<@JAA>But I only get that (and the closing </form></body></html>) sometimes.
04:05:33<nicolas17>JAA: is this on truncated responses or on extra-data responses?
04:05:36<pabs>so something is choking depending on what token gets generated?
04:06:06<pabs>is it only the login page that gets this extra data?
04:06:08<nicolas17>server asking the overkill HSM to encrypt and sign the CSRF token and failing (?)
04:06:17<@JAA>Example of a complete response: https://transfer.archivete.am/inline/rXyZ0/openssl-s_client-answers.sap.com-login-page
04:06:30<@JAA>`printf '%s\r\n' 'GET /users/login.html HTTP/1.1' 'User-Agent: curl/7.38.0' 'Host: answers.sap.com' 'Accept: */*' '' | openssl s_client -connect answers.sap.com:443 -ign_eof`
04:07:33<@JAA>There's an extra 96 bytes at the end, which is exactly that CSRF input tag plus the closing tags.
04:07:44<@JAA>Those are not accounted for in the Content-Length.
04:07:58<nicolas17>tried twice, once the response ended at </html>, next time the response ended at value="Submit" />
04:08:09<nicolas17>I didn't count bytes to see what the content-length included :P
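A Python version of that byte count: send the same raw request as JAA's openssl command and compare the declared Content-Length against what actually arrives. This is a sketch, not the actual tooling; it only applies to the non-chunked responses, and the random stall after the submit button is handled with a timeout:

    import socket
    import ssl

    HOST = 'answers.sap.com'
    request = ('GET /users/login.html HTTP/1.1\r\n'
               'User-Agent: curl/7.38.0\r\n'
               'Host: ' + HOST + '\r\n'
               'Accept: */*\r\n'
               'Connection: close\r\n\r\n')

    raw = b''
    ctx = ssl.create_default_context()
    with ctx.wrap_socket(socket.create_connection((HOST, 443), timeout=15),
                         server_hostname=HOST) as s:
        s.sendall(request.encode())
        try:
            while chunk := s.recv(4096):
                raw += chunk
        except socket.timeout:
            pass  # the server sometimes stalls mid-body instead of closing

    head, _, body = raw.partition(b'\r\n\r\n')
    for line in head.split(b'\r\n'):
        if line.lower().startswith(b'content-length:'):
            declared = int(line.split(b':', 1)[1])
            print(f'declared={declared} received={len(body)} '
                  f'excess={len(body) - declared}')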
04:08:37<pabs>does --ignore-content-length help? :)
04:08:58<@JAA>pabs: Not only the login page, I saw the same error elsewhere as well, all on 404 pages.
04:09:12<@JAA>Maybe it generates a token there as well for a 'report an error' thing.
04:09:40<@JAA>nicolas17: Extra data, I haven't looked at the truncation yet.
04:10:11<@JAA>The extra data makes qwarc very sad. The truncation just causes a small number of items to crash.
04:10:12<pabs>with --ignore-content-length I still see the truncation sometimes
04:10:32<@JAA>Yeah, try the OpenSSL command instead, it'll stall after the submit button randomly.
04:10:37<pabs>but --ignore-content-length fixes lack of body
04:11:07<pabs>some broken ass shit
04:11:13<@JAA>I can't ignore the Content-Length in qwarc anyway; that happens deep in a C library for HTTP parsing.
04:11:25<pabs>qwarc doesn't use curl?
04:11:35<@JAA>No
04:11:38<@JAA>aiohttp
04:11:48nullpeta joins
04:12:28<@JAA>I'm not aware of anyone having implemented WARC into curl.
04:12:32<fireonlive>nulldata you have competition
04:13:36<@JAA>Anyway, taking a bit of a break, then trying to fix the remaining issues and getting it properly started.
04:13:42<nulldata>:O
04:14:21<@JAA>First very rough size estimate is ~300 GiB.
04:15:03<nullpeta>Hi. The Japanese edition of Slashdot (https://srad.jp) will be shut down on 2024/01/31. Can anyone help archive it?
04:16:07pabs looks
04:16:18<nullpeta>shutdown notice: https://srad.jp/story/24/01/22/0311225/
04:16:28<pabs>lots of subdomains...
04:17:18<pabs>looks like the article subdomains are just categories, so subdomain stories are on the main domain too
04:18:31<pabs>comments are very JS-y like slashdot
04:18:38<nullpeta>According to the closure notice, OSDN.net (a Japanese GitHub-like site) may also be closed.
04:18:54<pabs>fuck
04:19:32<fireonlive>oh no :(
04:21:12Island quits [Read error: Connection reset by peer]
04:21:54<pabs>nullpeta: started a job for srad.jp, see archivebot.com to watch it run
04:22:24<pabs>osdn.net was ultra-broken a while back, wonder how it is now
04:23:18<nullpeta>pabs: Thank you very much!
04:23:45Island joins
04:23:53<pabs>don't think I will do article subdomains, that will be duplication I think
04:24:01<pabs>not sure how to deal with comments either
04:32:37<nullpeta>pabs: Subdomains are just categories, so they should be accessible from the main domain. If a story has over 50 comments, the comments beyond 50 are loaded later by JS. ;(
04:34:32<nullpeta>To get all the comments, we need to click the "すべてのコメントを取得" (Get all comments) button manually.
04:34:39<fireonlive>masterX244: do you know if there's a final/hard cutoff date set yet for discordcdn urls expiring/the signature&etc parameters being mandatory?
04:38:56<pabs>nullpeta: yeah, thats a POST request, which isn't archivable
04:40:04<h2ibot>Switchnode edited Deathwatch (+239, /* 2024 */ add srad.jp): https://wiki.archiveteam.org/?diff=51539&oldid=51537
04:40:41<pabs>hmm, the site is timing out for me now :(
04:40:46<pabs>and also in AB
04:40:55<pabs>!d 6p7aqfevk41es3iuyywvw68a7 1800000 1800000
04:41:38<pabs>nullpeta: re comments, they look enumerable https://srad.jp/comment/4597211
04:41:50<thuban>pabs: wrong channel
04:42:17<pabs>yeah :)
04:43:34<fireonlive>site is down for me as well
04:43:48<fireonlive>oh there it goes
04:43:50<fireonlive>jus very slow
04:44:20<nulldata>Yeah occasional 500s
04:44:25<fireonlive>the quote at the bottom was "人生unstable -- あるハッカー" which google converts to "Life is unstable -- a hacker"
04:44:26<fireonlive>suiting :)
04:46:39<nullpeta>pabs: So comments are archivable via the https://srad.jp/comment/* URIs? Good!
04:47:12<nulldata>Looks like OSDN's magazine site has been broken at least since November of last year. https://osdn.net/mag/
04:49:15<pabs>sadly SWH won't be able to save OSDN git/hg/svn repos due to the domains having expired certs
04:49:44<pabs>posted about that to #swh (Libera) and #codearchiver
04:49:53<pabs>and escalated within SWH
04:50:36<fireonlive>🤞
04:52:42<nullpeta>srad.jp is running on a very old system (Perl-based?). I guess it is not strong enough to handle a high load. ;(
04:53:26<fireonlive>Server: Apache/1.3.42 (Debian) mod_gzip/1.3.26.1a mod_perl/1.31
04:53:38<fireonlive>that sounds quite old
04:54:02<fireonlive>X-Fry: It's all there, in the macaroni.
04:54:03<fireonlive>X-Powered-By: Slash 2.005001
04:54:03<fireonlive>??
04:55:01<@JAA>Whew, that's ancient, yeah. Apache 1.x was the old thing ~20 years ago.
04:55:41jasons quits [Ping timeout: 272 seconds]
04:55:59<fireonlive>oh wow yeah not even apache2
04:56:13<@JAA>1.3.42 is from early 2010.
04:56:24<@JAA>At least it's the last 1.x version, but yeah.
04:57:07<fireonlive>ah ye was just looking that up
04:57:19<fireonlive>surprised it lasted that long
04:57:58<fireonlive>mod_perl 1.0: Version 1.31 - May 11, 2009 (also the current version of 1.0);(2.0: mod_perl 2.0: Version 2.0.13 - October 21, 2023)
05:03:01<@arkiver>proactive archiving is not always something we can do due to size.
05:03:22<fireonlive>arkiver: referring to furaffinity?
05:03:54<fireonlive>(also, hi :3)
05:05:11nic90701 quits [Ping timeout: 272 seconds]
05:05:12nic9070 (nic) joins
05:07:08<@JAA>More SAP fuckery: on some URLs, I sometimes get a 302 to the login page and sometimes a 200.
05:07:43<@JAA>Just happened on https://answers.sap.com/content/sapmetadata/24301/software-product-function.html which is the redirect target of https://answers.sap.com/questions/690021/index.html
05:08:52<@JAA>I wonder whether I should retry when I get a login redirect.
05:09:58<fireonlive>hmm, seems to be a blank post
05:10:06<fireonlive>but... probably :/
05:12:51BearFortress_ quits [Client Quit]
05:14:47<nullpeta>I found some OSDN-related docs which may help with archiving. shujisado is the former CEO of OSDN. https://gist.github.com/shujisado/2864e2475567fbbad8f8bacdb290d48a
05:15:46<fireonlive>has a comment from a minute ago too, very nice
05:21:09DogsRNice quits [Read error: Connection reset by peer]
05:24:53BearFortress joins
05:30:41<pabs>ah they have CVS too :(
05:33:32<@JAA>So the incomplete SAP responses are indeed truncated JSON. It simply 'forgets' to send the final }}}.
05:34:07<@JAA>But not with wrong Content-Length as in the other cases. This is chunked TE, and it sends the terminating zero-length chunk.
05:35:01<@JAA>I'd be surprised if I wasn't getting incomplete HTML as well.
05:35:25<@JAA>I'll retry if there is no </html> or if the JSON doesn't parse.
05:35:31<@JAA>THIS IS SUCH FUN!
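A minimal sketch of that completeness check (not qwarc's actual code; fetch() below is a hypothetical stand-in for the real request logic):

    import json

    def looks_complete(content_type: str, body: bytes) -> bool:
        # Heuristic: HTML must reach its closing tag, JSON must parse;
        # anything else is accepted as-is.
        if 'html' in content_type:
            return b'</html>' in body
        if 'json' in content_type:
            try:
                json.loads(body)
                return True
            except ValueError:
                return False
        return True

    # Usage sketch:
    # for attempt in range(max_tries):
    #     status, content_type, body = fetch(url)
    #     if looks_complete(content_type, body):
    #         break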
05:36:11<fireonlive>(╯°□°)╯︵ ┻━┻
05:37:59<@arkiver>fireonlive: somewhat
05:44:55<fireonlive>ah ok :)
05:47:42fishingforsoup joins
05:49:16<h2ibot>Pokechu22 edited ISP Hosting (+447, LaCoocan): https://wiki.archiveteam.org/?diff=51540&oldid=51360
05:54:16<h2ibot>Pokechu22 edited Deathwatch (+454, /* 2024 */ domain@nifty, not sure what actions…): https://wiki.archiveteam.org/?diff=51541&oldid=51539
05:57:59<@JAA>lol, nearly every response I get is truncated...
05:59:03jasons (jasons) joins
05:59:47<fireonlive>x_x
06:00:11<fireonlive>𝓺𝓾𝓪𝓵𝓲𝓽𝔂
06:08:48<@JAA>I had a bug in the check, but I was indeed getting lots of truncated responses, especially on 404s.
06:20:31Arcorann (Arcorann) joins
06:26:43<@JAA>Yeah, I won't retry on 404s. It's just too ridiculous.
06:28:31<@JAA>80+% of 404s get truncated.
06:30:05<fireonlive>oof yeah..
06:36:49Island_ joins
06:37:49<@JAA>15% of requests generate a warning, almost all of them about extra data after the response.
06:38:56<@JAA>Around 2k warnings per minute now. I think I'm on panel 7 or 8 of <this_is_fine.png>.
06:39:15<nullpeta>The former CEO of OSDN confirmed that OSDN.net will also close at the end of January. https://nitter.net/shujisado/status/1749300822691958969
06:39:36<@JAA>If you don't want to math, I'm doing around 200 req/s.
06:40:07<@JAA>ETA if that holds up: 33 hours
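Back-of-envelope check of those figures, assuming the 15% warning rate above still holds:

    2,000 warnings/min ÷ 0.15 ≈ 13,300 req/min ≈ 220 req/s
    200 req/s × 33 h × 3,600 s/h ≈ 24 million requests remaining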
06:40:11Island quits [Ping timeout: 272 seconds]
06:53:49<fireonlive>pabs: ^
06:54:45jasons quits [Ping timeout: 272 seconds]
06:56:01<pabs>thanks. all the git/hg repos are being saved by SWH, and #codearchiver after SWH is done
06:56:14<pabs>the svn and CVS repos I'm not sure about how to find them
06:56:29<pabs>and the site is in AB but times out a fucking lot
06:57:14<pabs>the osdn_mirror_contents_url.md gist is interesting, but it looks like none of the mirrors allow enumeration of projects
06:59:29<h2ibot>Switchnode edited Deathwatch (+67, /* 2024 */ cleanup): https://wiki.archiveteam.org/?diff=51542&oldid=51541
07:14:27Island_ quits [Read error: Connection reset by peer]
07:18:09<nullpeta>For srad.jp, https://srad.jp/journal/* and https://srad.jp/submission/* are also enumerable URIs. Could these be used as crawl seeds?
07:22:46<pabs>enumerable things can't be used as crawl seeds, you can only enumerate and save them
07:23:00<pabs>at least with archivebot right now
07:24:28<pabs>also, it looks like the main job is finding ~user/journal/1111 URLs, but not /journal/1111 URLs
07:24:31<pabs>nullpeta: ^
07:32:45<pabs>arkiver JAA - I think we should save all of https://dotsrc.dl.osdn.net/osdn/ (alias of http://mirrors.dotsrc.org/osdn/) because OSDN is going down and http://osdn.dl.osdn.net/ is not enumerable
07:33:15<nullpeta>pabs: Thanks. So that means some non-AB crawls are needed to save all the comments, etc.... Hmmm.
07:33:19<pabs>(the rest of the site is mirrors of other stuff)
07:33:52<pabs>nullpeta: no, the enumeration is saving all the comments, see 7veb8mluv16mtw6ezn2jqmfbv in AB
07:34:28<pabs>the issue is that the enumeration won't save anything else except the comments
07:34:45<pabs>the main job is saving everything found from the front page
07:34:55<pabs>that is 6p7aqfevk41es3iuyywvw68a7
07:35:04<pabs>in http://archivebot.com/
07:36:23<pabs>hmm, maybe I could use https://dotsrc.dl.osdn.net/osdn/ to enumerate and then translate http://osdn.dl.osdn.net/ URLs
07:38:01BlueMaxima quits [Read error: Connection reset by peer]
07:40:38<nullpeta>pabs: I see, thanks for the explanation.
07:49:47<@JAA>SAP slowed down massively a bit ago, and now I got banned. We'll see how long it lasts. The 403s arrived *very* fast though. I peaked at almost 1k req/s. lol
07:51:03qwertyasdfuiopghjkl quits [Remote host closed the connection]
07:57:36jasons (jasons) joins
08:10:38<nullpeta>https://srad.jp/story/24/ This page has links to each month's page for 2024, its child pages have links to each day's page for that month, and each day's page lists all the stories submitted that day. Each year's page also has a link to the previous year's page. Hope this helps crawl all the stories.
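If a crawler ever needs to generate those index pages directly rather than follow links, a sketch of enumerating them; the /story/YY/MM/DD/ day-page pattern is inferred from the story URLs above, and the start date is a guess at the site's launch era:

    from datetime import date, timedelta

    day = date(2001, 1, 1)   # assumption: roughly when Slashdot Japan launched
    end = date(2024, 1, 31)  # announced shutdown date
    while day <= end:
        print(f'https://srad.jp/story/{day:%y/%m/%d}/')
        day += timedelta(days=1)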
08:42:02nullpeta leaves
08:42:13nullpeta joins
09:26:39<c3manu>nullpeta: the www.goodsmile.info server responds fairly slowly, so AB might not be able to grab everything in time. but i see for now the website (and its suspension announcement) are still up, so fingers crossed they don't just relaunch it immediately
09:49:42qwertyasdfuiopghjkl (qwertyasdfuiopghjkl) joins
10:00:02Bleo18260 quits [Client Quit]
10:01:23Bleo18260 joins
10:22:06tzt (tzt) joins
10:43:23jasons quits [Ping timeout: 272 seconds]
11:03:26decky joins
11:03:28nullpeta quits [Remote host closed the connection]
11:03:29qwertyasdfuiopghjkl quits [Remote host closed the connection]
11:04:48katia quits [Remote host closed the connection]
11:06:11decky_e quits [Ping timeout: 272 seconds]
11:06:16katia (katia) joins
11:06:52katia quits [Remote host closed the connection]
11:07:41katia (katia) joins
11:08:36katia quits [Remote host closed the connection]
11:10:59katia (katia) joins
11:46:02jasons (jasons) joins
11:52:09wessel15127 joins
11:53:37panicman45 joins
11:53:41wessel1512 quits [Ping timeout: 272 seconds]
11:53:41wessel15127 is now known as wessel1512
11:54:22panicman45 quits [Remote host closed the connection]
11:56:29nullpeta joins
11:58:50tertu quits [Ping timeout: 240 seconds]
12:02:26<nullpeta>c3manu: Thank you for trying to archive goodsmile.info. According to the official website, the update has been postponed. As of now, the re-launch date has not been disclosed.
12:12:42<ScenarioPlanet>Maybe Spore backups should get their own collection to be mentioned on https://wiki.archiveteam.org/index.php/Spore ?
12:18:34lizardexile joins
12:44:23nullpeta quits [Remote host closed the connection]
12:45:37Arcorann quits [Ping timeout: 272 seconds]
12:46:15jasons quits [Ping timeout: 272 seconds]
13:05:26<Pedrosso>pabs: Did you convert the other lists into the static subdomain too?
13:07:20<Pedrosso>ScenarioPlanet: I think you're right. The wiki should note the AB job on spore.com as well as these new lists
13:08:17razul quits [Quit: Bye -]
13:09:19<pabs>Pedrosso: I only saw one spore job running in the AB dashboard
13:09:59<pabs>so I only replaced that one
13:10:14razul joins
13:11:15<Pedrosso>Alright, I'll update the other 5 lists (2 more png lists, 3 xml lists) to be static.spore.com
13:12:13<pabs>thanks
13:12:15<Pedrosso>The wiki should also mention said subdomain that I failed to recognize
13:14:42<ScenarioPlanet>It does
13:15:06<ScenarioPlanet>> Should be requested on static subdomain which uses CDN.
13:16:53<Pedrosso>If you're quoting from discord, I didn't notice that
13:19:17<Pedrosso>I overlooked it due to having /static/ in the url
13:20:36<Pedrosso>Thank you for the correction, this'll be much faster :)
13:22:17<Pedrosso>ooh
13:22:30<Pedrosso>You're quoting from the wiki nvm, they did say it on the discord server as well though
13:22:48<ScenarioPlanet>That's on https://wiki.archiveteam.org/index.php/Spore#:~:text=Should%20be%20requested%20on%20static%20subdomain%20which%20uses%20CDN
13:22:54<Pedrosso>What's that about postcards?
13:23:22<ScenarioPlanet>Postcards are exclusion here
13:23:40<ScenarioPlanet>Their web pages (/view/postcard/) use www.
13:23:45<Pedrosso>I see
13:24:07<Pedrosso>Spore randomly throws .jpgs into the mix
13:24:51<Pedrosso>> Note that all the non-ASCII symbols will be replaced with ? in the response.
13:24:51<Pedrosso>Does this apply when viewing the item as well?
13:28:09<ScenarioPlanet>Not viewing, only REST requests
13:29:12<Pedrosso>Well, that's a shame.
13:30:05<Pedrosso>So, what about the REST service? If we were able to grab creation data (like creator, subscribers, comments) via it, we'd be able to find all users and grab even more from there; notably sporecasts.
13:30:05<Pedrosso>How would REST deal with an AB job?
13:30:40<ScenarioPlanet>Mostly fine, if that's 1000ms
13:30:52<Pedrosso>shrug better than nothing
13:31:04<ScenarioPlanet>So we are not doing DDoS
13:31:39<Pedrosso>Wouldn't DDoS require c>1 ?
13:32:06<ScenarioPlanet>Yes, that's why it should be 1 or 2-3 at most
13:32:10<Pedrosso>I see
13:32:36<Pedrosso>DWR Interface, has anything more been done/researched regarding that since the wiki was last updated?
13:32:52<ScenarioPlanet>Also, we must not use REST as the only option to preserve creation metadata
13:33:02<ScenarioPlanet>ATOM and especially DWR too
13:34:08<Pedrosso>hm?
13:34:09<ScenarioPlanet>I could ask Kade to join Archiveteam's IRC. They know a lot about the DWR stuff.
13:34:22<Pedrosso>Yeah. But what did you mean just now?
13:35:11<ScenarioPlanet>We need to save ATOM responses too (see wiki page, "ATOM Feeds")
13:35:31<Pedrosso>So what you meant is that we should not solely save one? I agree
13:35:37<ScenarioPlanet>Sure
13:36:09<Pedrosso>Does DWR have something special, other than unicode?
13:36:25<ScenarioPlanet>Yes, POST requests
13:37:06<Pedrosso>I mean information-wise
13:37:17<Pedrosso>Something the others lack information-wise
13:38:11<ScenarioPlanet>It also holds adventure leaderboards, captain stats data and more
13:38:39<Pedrosso>Oooh, yeah that's good
13:39:42<ScenarioPlanet>Basically everything you can see on http://www.spore.com/sporepedia and can't find in any of the mentioned endpoints' responses
13:49:36jasons (jasons) joins
13:54:20godane1 quits [Ping timeout: 240 seconds]
14:03:49<Pedrosso>I'd say yeah, ask Kade to join
14:06:58<ScenarioPlanet>Done
14:23:00onetruth quits [Read error: Connection reset by peer]
14:34:09godane (godane) joins
14:50:23jasons quits [Ping timeout: 272 seconds]
14:52:41TempleOfGoo joins
14:54:16TempleOfGoo88 joins
14:57:57TempleOfGoo quits [Ping timeout: 265 seconds]
14:58:24<TempleOfGoo88>Hello. I've been getting an error several times when trying to upload something. Can anyone help?
15:04:17AlsoHP_Archivist joins
15:04:57HP_Archivist quits [Ping timeout: 272 seconds]
15:09:19<TempleOfGoo88>This is the error I'm getting:
15:09:22<TempleOfGoo88><?xml version='1.0' encoding='UTF-8'?><Error><Code>SlowDown</Code><Message>Please reduce your request rate.</Message><Resource>Your upload of goo-goo-dolls-all-that-you-are-live-at-the-late-late-show-19.12.2011 from username templeofgoo@gmail.com appears to be spam. If you believe this is a mistake, contact info@archive.org and include this entire
15:09:23<TempleOfGoo88>message in your email.</Resource><RequestId>6ee4aeaf-231a-4a93-bbb1-ace0055e9b4a</RequestId></Error>
15:10:01AlsoHP_Archivist quits [Ping timeout: 272 seconds]
15:10:12TempleOfGoo joins
15:10:45HP_Archivist (HP_Archivist) joins
15:11:55Wohlstand (Wohlstand) joins
15:14:23TempleOfGoo88 quits [Ping timeout: 265 seconds]
15:19:40AlsoHP_Archivist joins
15:21:25HP_Archivist quits [Ping timeout: 272 seconds]
15:22:13monoxane9 (monoxane) joins
15:23:19monoxane quits [Ping timeout: 272 seconds]
15:23:19monoxane9 is now known as monoxane
15:23:35TempleOfGoo is now known as TempleOfGoo88
15:23:51AlsoHP_Archivist quits [Remote host closed the connection]
15:24:17AlsoHP_Archivist joins
15:24:52<c3manu>TempleOfGoo88: i've had good results with following the instructions in the error message.
15:25:25<TempleOfGoo88>Well, the error message just says to send an email
15:25:59<TempleOfGoo88>Which I did. I'll just wait and see
15:27:36<c3manu>you shouldn't have to wait longer than 24h in my experience.
15:27:50<TempleOfGoo88>(y)
15:28:22<c3manu>"archiveteam" and "archive.org" are separate entities, so bothering us wouldn't have helped either ;)
15:28:54<TempleOfGoo88>Oh ok. thanks
15:29:04<c3manu>np :)
15:29:59TempleOfGoo88 quits [Remote host closed the connection]
15:30:17pedantic-darwin quits [Ping timeout: 272 seconds]
15:30:25AlsoHP_Archivist quits [Read error: Connection reset by peer]
15:30:40HP_Archivist (HP_Archivist) joins
15:31:27pedantic-darwin joins
15:53:42jasons (jasons) joins
15:56:15yawkat quits [Ping timeout: 272 seconds]
16:00:03HP_Archivist quits [Ping timeout: 272 seconds]
16:00:53HP_Archivist (HP_Archivist) joins
16:06:30tertu (tertu) joins
16:24:36<fireonlive>"Terraform Labs files for bankruptcy: Terraform Labs, the company behind the Terra blockchain, has filed for bankruptcy. Its flagship product, the Terra stablecoin and associated LUNA token, failed spectacularly in May 2022." https://web3isgoinggreat.com/single/terraform-labs-files-for-bankruptcy
16:24:42<fireonlive>(via #web3)
16:26:51<nulldata>Was just about to ask if someone could throw https://www.terra.money/ and https://medium.com/terra-money/ into AB :P
16:26:52pedantic-darwin1 joins
16:27:15qwertyasdfuiopghjkl (qwertyasdfuiopghjkl) joins
16:27:52<fireonlive>:P
16:29:04<fireonlive>there's also "CFTC files complaint against Debiex platform for using "romance scam tactics" to steal $2.3 million https://web3isgoinggreat.com/single/debiex-cftc-complaint" - which lists their domains; but I can only find https://www.debiex.com/wap/ that still works; everything else they seem to have 404'd
16:30:20pedantic-darwin quits [Ping timeout: 240 seconds]
16:30:20pedantic-darwin1 is now known as pedantic-darwin
16:30:28kuro68k joins
16:32:24<kuro68k>Hi guys. I came here because I heard about srag.jp going offline at the end of the month. It's not listed in Warrior but I'd like to help archive it if I can. Is it possible to contribute to that job, ideally via Warrior?
16:33:12<nulldata>kuro68k - you mean srad.jp ?
16:33:41pedantic-darwin2 joins
16:34:46<nulldata>So far we're able to grab it with #ArchiveBot , so it's not a Warrior project
16:35:07<nulldata>You can check the progress on http://archivebot.com/?showNicks=1
16:37:20pedantic-darwin quits [Ping timeout: 240 seconds]
16:37:20pedantic-darwin2 is now known as pedantic-darwin
16:37:23<fireonlive>threw in terra's two urls there (for medium I had to tack `archive` on the end)
16:45:47razul quits [Client Quit]
16:46:44razul joins
16:57:21<@JAA>Still banned from SAP
16:57:26<@JAA>All this effort for that...
16:59:33<kuro68k>Yes, srad.org, sorry typo
16:59:46<kuro68k>So does ArchiveBot mean you don't need any help?
16:59:49<fireonlive>:|
17:00:49<kuro68k>I worry about slashdot.org too. It's a hard site to archive properly, in a way that preserves all the links and conversations. I tried archiving it once, the results were not stellar.
17:01:04razul quits [Client Quit]
17:02:13razul joins
17:06:56yonerboner joins
17:09:05Wohlstand quits [Ping timeout: 272 seconds]
17:15:57kuro68k quits [Remote host closed the connection]
17:26:33<@arkiver>JAA: do we need anything Warrior for OSDN?
17:28:03<@arkiver>pabs: i put https://dotsrc.dl.osdn.net/osdn/ in AB
17:28:14<@arkiver>looking further into it as well
17:32:07lizardexile quits [Client Quit]
17:35:23<@JAA>arkiver: Most of the discussion has been happening in #codearchiver.
17:35:38<@arkiver>ah
17:41:37icedice quits [Client Quit]
17:42:48nicey666 joins
17:43:02nicey666 quits [Remote host closed the connection]
17:50:32Megame (Megame) joins
18:00:11rohvani joins
18:03:59<@JAA>I don't know what we can do about SAP Q&A.
18:04:05<@JAA>I suspect the ban is manual.
18:05:52<@JAA>arkiver: Do you want to try a DPoS? It'd have to start *very* soon. They intend to finish their migration (which I'm sure will lose data given it's SAP) on Wed.
18:06:58<@JAA>There is a lot of weirdness in this one. Incomplete responses, wrong Content-Length headers, etc. I intend to document it all on the wiki.
18:42:50jasons quits [Ping timeout: 240 seconds]
18:46:29<ScenarioPlanet>What should i do with these? https://storage.scenariopla.net/explorer_fRbGKbbgLh_2024-01-22_20.45.50.903-1705949150.png
18:46:43<ScenarioPlanet>(that's like 30% of everything)
18:57:40<@arkiver>ScenarioPlanet: what is this?
19:14:57Wohlstand (Wohlstand) joins
19:17:59Naruyoko5 quits [Client Quit]
19:18:58myself4 is now known as myself
19:30:55guest324 joins
19:31:20imer quits [Quit: Oh no]
19:31:45guest324 quits [Remote host closed the connection]
19:32:33imer (imer) joins
19:42:06imer quits [Client Quit]
19:45:16imer (imer) joins
19:46:38jasons (jasons) joins
19:48:31<ScenarioPlanet>arkiver: fullsized images lists (wordpress+drupal+google+wix)
19:51:51BearFortress quits [Ping timeout: 272 seconds]
20:00:46<fireonlive>so we don't just have the thumbnails from the AB jobs
20:00:48<fireonlive>AIUI
20:10:04beastbg8 joins
20:20:38Naruyoko joins
20:21:04Naruyoko5 joins
20:25:09<beastbg8>Hello. I would like to bring something to your attention. I hope I'm in the right place. In a month's time as of today, the largest local video portal in Bulgaria circa 2006, VBOX7 (very similar to the Hungarian "videa.hu"), is about to "hide" all user-uploaded content, which according to their devs is over 14M videos. They already did that last
20:25:09<beastbg8>week, but opened the gates for a final time after social media outrage. It contains a lot of rare, nowhere-to-be-found media specifically concerning Bulgaria, and quite a lot of otherwise "lost" media (foreign movies, TV series) that survives only there, with or without a dub. Is there something that can be done in such a narrow time frame?
20:25:25Naruyoko quits [Ping timeout: 272 seconds]
20:27:19<beastbg8>Currently these videos can only be accessed with Bulgarian IP. Only "partnered" videos (only content that will remain on the site starting 22 February 2024) can be watched from abroad.
20:29:19<@JAA>These appear to be the relevant announcements: https://blog.vbox7.com/video-access/ https://blog.vbox7.com/blagodarim-vi/
20:29:59<@JAA>The georestriction is going to be a pain.
20:35:20BlueMaxima joins
20:36:04<h2ibot>JustAnotherArchivist edited Deathwatch (+428, /* 2024 */ Add Vbox7): https://wiki.archiveteam.org/?diff=51543&oldid=51542
20:43:38<nulldata>Example of a working video in the US: https://www.vbox7.com/play:2e08276728 -> (the player loads this, which links to an mpd file whose URLs seem to be static - or at least they didn't change when accessed via a different connection and computer) https://www.vbox7.com/aj/player/item/options?vid=2e08276728
20:45:24<nulldata>And here's one that's not allowed in US. https://www.vbox7.com/play:b312ac1a7a -> https://www.vbox7.com/aj/player/item/options?vid=b312ac1a7a
20:47:20jasons quits [Ping timeout: 240 seconds]
20:55:35<beastbg8>Currently yt-dlp's extractor for VBOX7 is broken (it only extracts mpds from some videos), but a friend fixed it a while ago. Providing it with their consent. Please take a look. https://pastebin.com/raw/300v4NwC (vbox7.py)
20:55:51<nulldata>No geo restriction on the video server itself, it seems - if you grab the URL from the API response over a Bulgarian connection, it'll download no problem on a US connection
20:56:03<beastbg8>yep
20:56:40<nulldata>The API gives a mpd file, but you can change the extension to m3u8 and play with VLC. https://edge211.vbox7.com/sl/1iswCSTN-zXpz-8oOGnlrg/1706133600/b3/b312ac1a7a/b312ac1a7a.m3u8
20:58:36Barto quits [Client Quit]
20:58:44<nicolas17>oh joy they're using byte ranges
20:58:54<nicolas17>so it's not 3 million tiny files with 5 seconds of video each
20:59:16Barto (Barto) joins
21:00:54<@JAA>And DASH and HLS both reference the same MP4 file as well!
21:02:36<nulldata>Oh yeah I should've looked harder - the mpd file specifies the mp4 files in the BaseURL node. https://edge211.vbox7.com/sl/1iswCSTN-zXpz-8oOGnlrg/1706133600/b3/b312ac1a7a/b312ac1a7a_480_track1_dash.mp4
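Putting nulldata's observations together, a sketch of going from a video ID to its media URLs; the JSON path to the mpd link is a guess (the options response isn't shown here), and the request itself needs a Bulgarian IP, per the geoblock discussion:

    import requests

    def vbox7_media_urls(vid: str) -> dict:
        r = requests.get('https://www.vbox7.com/aj/player/item/options',
                         params={'vid': vid})
        r.raise_for_status()
        mpd = r.json()['options']['src']  # hypothetical field path to the .mpd URL
        return {
            'mpd': mpd,
            'm3u8': mpd.replace('.mpd', '.m3u8'),  # same content, HLS flavour
        }

    print(vbox7_media_urls('2e08276728'))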
21:07:03<nulldata>Question becomes - is there a nice way to enumerate all valid video ids lol
21:09:33<@JAA>424k requests per second could bruteforce the trillion possible [0-9a-f]{10} IDs in 30 days. That's not going to happen but less unreasonable than I expected.
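The arithmetic behind that, for the record:

    16^10 = 1,099,511,627,776 ≈ 1.1 × 10^12 possible IDs
    30 days = 30 × 86,400 s = 2,592,000 s
    1,099,511,627,776 ÷ 2,592,000 ≈ 424,000 req/s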
21:10:46<nicolas17>and how do you go from video ID to mpd URL?
21:11:14griz joins
21:13:58<nicolas17>hmm weird
21:14:08<nicolas17>you posted an mpd/m3u8 for b312ac1a7a
21:14:23<nicolas17>but https://www.vbox7.com/ajax/video/nextvideo.php?vid=b312ac1a7a returns a direct .mp4 instead of an mpd manifest
21:14:53<griz>old ones are mostly mp4
21:15:15<griz>some are only flv
21:15:24<nulldata>nicolas17 - I think the only way would be finding someone with a legit connection, or VPN, dedicated to grabbing the URLs from the API to feed to a tracker. The grabbers could be on any connection.
21:16:03<nicolas17>but where did you even get that mpd if the API returns mp4?
21:16:11<nulldata>That's because of the geoblock - if you access via Bulgaria it returns the mpd link
21:16:12<h2ibot>Pokechu22 edited List of website hosts (+471, XFree / Thin Cloud for Free): https://wiki.archiveteam.org/?diff=51544&oldid=51508
21:16:23<nicolas17>oh
21:16:36<nicolas17>I didn't realize that .mp4 was the "not available" placeholder >_>
21:22:01<griz>direct mp4 over https is the fastest route.. then HLS & MPD
21:28:39<@JAA>I mean, we don't need to refetch the video for HLS and MPD since it's all the same MP4 file behind it.
21:32:09Island joins
21:37:45sec^nd quits [Ping timeout: 255 seconds]
21:40:46sec^nd (second) joins
21:42:11sec^nd quits [Remote host closed the connection]
21:42:44sec^nd (second) joins
21:51:15jasons (jasons) joins
21:54:37Island quits [Read error: Connection reset by peer]
21:56:09Island joins
22:00:44ThetaDev quits [Quit: https://quassel-irc.org - Chat comfortably. Anywhere.]
22:00:52<ThreeHM>I'm able to bypass the georestriction by adding an X-Forwarded-For header with a Bulgarian IP to the API request
22:00:56ThetaDev joins
22:01:07<project10>lol
22:02:00<fireonlive>😏
22:03:29<ThreeHM>Even works if I use the IP that's behind vbox7.com
22:03:31<ThreeHM>lol
22:04:01<Barto>someone screwed their reverse proxy config
22:04:51nulldata quits [Ping timeout: 272 seconds]
22:05:43<fireonlive>excellent
22:06:10<griz>X-Forwarded-For is also used in the yt-dlp script, that's how it works
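A minimal sketch of the bypass ThreeHM describes, spoofing the client address via X-Forwarded-For; this uses the site's own IP, per the remark above, though presumably any Bulgarian address works:

    import socket
    import requests

    # The geo-check apparently trusts X-Forwarded-For, so claim a Bulgarian
    # source address; amusingly, vbox7's own IP does the job.
    bg_ip = socket.gethostbyname('www.vbox7.com')
    r = requests.get('https://www.vbox7.com/aj/player/item/options',
                     params={'vid': 'b312ac1a7a'},
                     headers={'X-Forwarded-For': bg_ip})
    print(r.json())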
22:07:34<Barto>:-) I wonder if the more 'recent' Forwarded header works
22:10:21<beastbg8>made an article on the wiki
22:10:22<kiska>When warrior go brrr? :D
22:10:34<kiska>Seems like that'll be patched out soon
22:11:17nulldata (nulldata) joins
22:13:10<beastbg8>i doubt they're checking their code base too closely
22:14:15<beastbg8>there's a large hole in their subtitle writing functionality, where videos can be obtained even if hidden, but it needs a user account
22:14:34<@JAA>Nice
22:14:39<@JAA>So how do we find video IDs?
22:14:48<kiska>Bruteforce?
22:14:52<beastbg8>https://www.vbox7.com/subrec:db089cbeef
22:14:58<@JAA>There's a trillion of them, kiska.
22:15:09<@JAA>16^10
22:15:17<kiska>Seems more doable than something like youtube :D
22:15:34<griz>the android app is interesting too, mobile api is very extensive
22:15:44<kiska>I suppose we could do discovery if they have "recommended" on the side
22:15:46<beastbg8> <div class="container">
22:16:00<@JAA>We're not going to do 425k req/s for a month though. lol
22:16:32<@JAA>But yeah, as I wrote above, I would've expected it to be worse.
22:16:36<kiska>I'd doubt they let us do that for a month :D
22:17:24<beastbg8>one way is through obtaining all user accounts through search keywords and scraping their channels
22:18:07<beastbg8>but that's kinda impractical
22:18:19<ThreeHM>They have recommendations, but we'd have to check if that gives you geoblocked videos
22:19:55<griz>top 100 videos of all time (v3 & v4 api only): https://api.vbox7.com/v4/?action=r_video_top100&app_token=vbox7_android_1.0.0_gT13mY
22:19:59<kiska>Do they have a sitemap :D
22:20:22<griz>https://api.vbox7.com/v5/?action=r_channels&app_token=vbox7_android_1.0.0_gT13mY
22:20:35<griz>random links from mobile app
22:20:40<beastbg8>also the subrec method (which will likely stay post feb 22) does not support .flv videos
22:21:00<beastbg8>it can be used as a last resort
22:21:14<beastbg8>btw hidden videos still retain their metadata
22:21:23<beastbg8>assuming you have the URL
22:23:53<beastbg8>https://www.vbox7.com/play:4a81cafe81 here is one such video. no playback endpoints but title, thumbnail, comments etc are all there
22:24:29<beastbg8>all videos were like this yesterday
22:24:39<nicolas17>crawl recommended videos recursively
22:26:20<nstrom|m>user pages also seem to have all the videos a user posted, so that probably helps w discovery a bit too. eg. https://www.vbox7.com/user:wochit_news has 626 pages of videos
22:26:39<nstrom|m>search by tag potentially useful https://www.vbox7.com/tag:video?ord=date&period=day&order=rate&page=20
22:26:59<griz>mobile api example: https://api.vbox7.com/v5/?action=r_user_videos&username=wochit_news&app_token=vbox7_android_1.0.0_gT13mY
22:27:17<griz>though the API bugs out after about 1000 results
22:31:44<griz>mobile api search video by query example: https://api.vbox7.com/v4/?action=r_video_search&query=preslava&page=1&app_token=vbox7_android_1.0.0_gT13mY
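For discovery, a sketch of paging through one user's uploads with that mobile endpoint; the 'page' parameter (borrowed from the search example) and the response field names are assumptions, and per griz the API misbehaves past ~1000 results:

    import requests

    API = 'https://api.vbox7.com/v5/'
    APP_TOKEN = 'vbox7_android_1.0.0_gT13mY'  # token as seen in the URLs above

    def user_videos(username: str):
        page = 1
        while True:
            data = requests.get(API, params={
                'action': 'r_user_videos',
                'username': username,
                'page': page,
                'app_token': APP_TOKEN,
            }).json()
            videos = data.get('videos') or []  # field name is a guess
            if not videos:
                break
            yield from videos
            page += 1

    for video in user_videos('wochit_news'):
        print(video)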
22:50:20jasons quits [Ping timeout: 240 seconds]
22:50:31griz leaves
22:54:33<beastbg8>if videos in subrec mode give a 404 error, renaming the extension from .mp4 (they're always passed as .mp4) to .flv might work, though not always
23:38:20pabs quits [Ping timeout: 240 seconds]
23:39:32pabs (pabs) joins
23:40:40missaustraliana joins
23:43:41<h2ibot>Missaustraliana edited Deathwatch (+4, Add 'was ' to Studio 10 as the time has passed.): https://wiki.archiveteam.org/?diff=51545&oldid=51543
23:43:57<missaustraliana>yuh\
23:50:37godane quits [Ping timeout: 272 seconds]
23:53:47jasons (jasons) joins
23:59:17nautical_jesus joins