00:08:09 | | nicolas17 quits [Ping timeout: 272 seconds] |
00:12:38 | | nicolas17 joins |
00:19:28 | | magmaus3 (magmaus3) joins |
00:52:29 | | jasons quits [Ping timeout: 272 seconds] |
01:13:01 | | Naruyoko5 quits [Remote host closed the connection] |
01:13:20 | | Naruyoko5 joins |
01:41:15 | | Barto quits [Ping timeout: 272 seconds] |
01:47:35 | | Barto (Barto) joins |
01:55:33 | | jasons (jasons) joins |
02:35:43 | | Megame quits [Client Quit] |
02:55:50 | | jasons quits [Ping timeout: 240 seconds] |
03:00:50 | | pseudorizer quits [Client Quit] |
03:01:39 | | pseudorizer (pseudorizer) joins |
03:22:53 | | pabs quits [Read error: Connection reset by peer] |
03:23:42 | | pabs (pabs) joins |
03:30:53 | <@JAA> | I've been battling with SAP's Q&A site that'll be taken down very soon. Their server is very broken. I kind of expected that, given it's SAP, but it's impressive just how bad it is. |
03:31:34 | <@JAA> | We got decent coverage from ArchiveBot already, but that can't have gotten everything. Among the things missed will be pagination of answers and attachments in answers or comments, I believe. |
03:31:49 | <@JAA> | https://answers.sap.com/ is the site. |
03:32:16 | <@JAA> | The good news is there don't seem to be any significant rate limits. |
03:34:14 | <pabs> | were they missed due to JS or? |
03:34:46 | <@JAA> | Yeah |
03:35:25 | <@JAA> | XHR that returns HTML in JSON in JSON for some data, and so on. |
03:35:39 | <nicolas17> | yo. dawg. |
03:35:40 | <@JAA> | Real pleasure to work with. |
03:35:42 | <pabs> | fuuuugly |
03:36:46 | <pabs> | hmm, image attachments look enumerable ? https://answers.sap.com/storage/attachments/2234744-image.png |
03:36:56 | <@JAA> | The server also returns truncated responses and extra data after completed responses. |
03:37:05 | <nicolas17> | are they all called image.png? |
03:37:08 | <@JAA> | They are not. |
03:37:44 | <pabs> | the one I linked is from here btw https://answers.sap.com/questions/14034200/how-can-i-filter-a-listpicker-to-display-only-dist.html |
03:37:54 | <@JAA> | Earlier, I saw 404s on URLs that had been returning 302s just before. Shortly after writing about it in #archivebot, it started returning 302s again. |
03:38:25 | <@JAA> | I'll try my best, but this is a shitshow. |
03:39:05 | <@JAA> | Oh yeah, I'm seeing that 404 thing again right now. |
03:39:13 | <@JAA> | I'll probably get a lot of false 404s. |
03:39:43 | <@JAA> | It seems to only affect IDs that aren't questions anyway, so maybe that's 'fine', but yeah. |
03:40:15 | <pabs> | hmm, on that page above, the attachments are just hrefs |
03:40:34 | <pabs> | no JS needed... |
03:40:51 | <nicolas17> | I loaded that answer with NoScript enabled |
03:40:56 | <nicolas17> | I could see the inline image and the link |
03:41:07 | <@JAA> | Only for attachments in the question, not in answers. |
03:41:12 | <pabs> | ah |
03:41:33 | <@JAA> | Nor in comments on answers. |
03:41:46 | <@JAA> | The comments are where that JSON in JSON happens. |
03:42:58 | <pabs> | the answer attachments are in JSON in the HTML it seems |
03:43:33 | <pabs> | curl -s https://answers.sap.com/questions/14031315/dunning-notice-on-small-balance.html | grep /attachments/ |
03:44:10 | <@JAA> | Yeah right, there's another layer of HTML around it for the first page of answers, too. |
03:44:22 | <@JAA> | So HTML in JSON in JSON in HTML :-) |
03:44:25 | <nicolas17> | /o\ |
03:44:41 | <pabs> | | grep -oE '/storage/[^\"]+' |
03:44:59 | <@JAA> | Anyway, I've got all these things figured out already, the remaining problem is their server not cooperating at the HTTP level. |
03:45:02 | <pabs> | can just grep the WARC of the answer HTML :) |
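The curl and grep pipeline pabs describes above can be sketched in Python for use on already-saved answer HTML. This is only an illustration: `find_storage_paths` and the sample markup are hypothetical, and the second path in the sample (`1-example.pdf`) is made up to show the escaped-JSON case. The character class mirrors the one in the grep command, so a match stops at either a double quote or a backslash.

```python
import re

# Same character class as the grep above: stop at a backslash or a double quote.
STORAGE_PATH = re.compile(r'/storage/[^\\"]+')


def find_storage_paths(html: str) -> list[str]:
    """Return unique /storage/ paths in first-seen order (hypothetical helper)."""
    return list(dict.fromkeys(STORAGE_PATH.findall(html)))


# Hypothetical sample: one plain href and one path inside escaped JSON in a script tag.
sample = (
    '<a href="/storage/attachments/2234744-image.png">image.png</a>'
    '<script>var data = "{\\"url\\":\\"/storage/attachments/1-example.pdf\\"}";</script>'
)

for path in find_storage_paths(sample):
    print(path)
```

Stopping at the backslash matters for the JSON-in-HTML case, where the closing quote of the URL is escaped as `\"`.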
03:45:19 | <pabs> | ah, what's the issue there? |
03:45:20 | <@JAA> | I also liked this one: |
03:45:22 | <@JAA> | > const simplifiedQuestionView = JSON.parse("true"); |
03:45:26 | <nicolas17> | <JAA> Earlier, I saw 404s on URLs that had been returning 302s just before. Shortly after writing about it in #archivebot, it started returning 302s again. |
03:45:51 | <@JAA> | No, the truncated responses and extra data after responses. The latter is probably a wrong Content-Length header. |
03:46:10 | <pabs> | how are you detecting that? |
03:46:22 | <@JAA> | By getting JSON and HTTP parser errors. |
03:47:04 | <pabs> | oh geez... |
03:47:32 | <fireonlive> | <nicolas17> are they all called image.png? < that's my power move |
03:47:36 | <pabs> | does retrying the failing ones help? |
03:47:43 | <@JAA> | As I wrote in #archivebot earlier, it takes some effort to fuck up this badly. |
03:47:59 | <@JAA> | > Request for https://answers.sap.com/users/login.html failed 721 times |
03:48:00 | <@JAA> | :-) |
03:48:06 | <nicolas17> | to err is human |
03:48:11 | <nicolas17> | to really fuck things up you need a computer |
03:48:12 | <pabs> | yeah, just saw that... |
03:48:19 | <pabs> | we don't need the login page though :) |
03:48:44 | <@JAA> | Yeah, that's a bug in my code. Already fixed that. |
03:49:10 | <fireonlive> | lord lol |
03:49:12 | <@JAA> | The failures are responses with (presumably) extra data at the end. |
03:49:35 | <@JAA> | > * Excess found in a non pipelined read: excess = 97, size = 3632, maxdownload = 3632, bytecount = 0 |
03:49:40 | <@JAA> | That's what curl emits. |
03:49:53 | <@JAA> | And the response ends at a random point in the middle of the HTML. |
03:50:02 | <@JAA> | That's why I suspect they're sending the wrong Content-Length. |
03:50:08 | <pabs> | is it one particular IP address that is bad or all three of them? |
03:50:24 | | fireonlive sits back and wonders how |
03:51:03 | <@JAA> | I think all are affected, but I'm not currently logging that information on the errors, so I can't confirm. |
03:51:19 | <@JAA> | Highly unlikely it didn't hit all IPs on 721 attempts though. |
03:51:53 | <pabs> | and does --http1.1 help? |
03:52:10 | <pabs> | nope, it doesn't |
03:52:21 | <@JAA> | WARC can only store HTTP/1.1 anyway. |
03:52:46 | <pabs> | ah |
03:53:17 | <fireonlive> | modern web-- |
03:53:17 | <eggdrop> | [karma] 'modern web' now has -1 karma! |
03:53:22 | <@JAA> | Oh yeah, there's a PROTOCOL_ERROR on HTTP/2 as well, right. :-) |
03:53:32 | <pabs> | s/-1/-Inf/ |
03:53:38 | <@JAA> | SAP-- |
03:53:39 | <eggdrop> | [karma] 'SAP' now has -1 karma! |
03:54:26 | <fireonlive> | SAP-- |
03:54:26 | <eggdrop> | [karma] 'SAP' now has -2 karma! |
03:54:29 | <pabs> | plain http redirects to TLS too, so no way to check that |
03:55:13 | <pabs> | love it how even when you get the right Content-Length, the HTML is still broken, no </body></html> |
03:55:38 | <pabs> | doesn't even close the <form> tag |
03:56:20 | <fireonlive> | full, solid, enterprise software |
03:56:33 | <@JAA> | Confirmed I see the excess data error on all three IPs. |
03:57:57 | <@JAA> | One item was looping on the login retries, had to just ^C it after almost 10k attempts. |
03:58:30 | <@JAA> | Yep, 10k retries, all with that excess data stuff, apparently. |
03:58:31 | <@JAA> | > ClientResponseError("400, message='invalid constant string'",) |
03:58:35 | <@JAA> | That's how it shows up on qwarc. |
03:58:35 | <fireonlive> | wow |
03:58:49 | <pabs> | how does SPN do? |
03:59:13 | <pabs> | it's interesting, here I don't get the extra data every time, but have once or twice |
03:59:13 | <@JAA> | It might be that the excess data is always there and it's a matter of timing whether your HTTP client considers the transaction done or actually emits an error. |
03:59:23 | <@JAA> | Or how are you testing exactly? |
03:59:26 | | jasons (jasons) joins |
03:59:34 | <pabs> | curl -v ... |
03:59:51 | <@JAA> | Yeah, same |
04:00:00 | <pabs> | I almost always see this tho: * Connection #0 to host answers.sap.com left intact |
04:01:06 | <pabs> | so I guess that means the server is always not closing the connection, and you're right with the timing comment |
04:01:07 | <nicolas17> | I think that means "not closing the connection since we may reuse it in the next request" |
04:01:13 | <pabs> | ah |
04:01:31 | <nicolas17> | * Connection #0 to host google.com left intact |
04:01:59 | <fireonlive> | <resists joke> |
04:02:27 | <nicolas17> | if you do two URLs (same hostname) in the same curl command: |
04:02:35 | <nicolas17> | * Connection #0 to host google.com left intact |
04:02:37 | <nicolas17> | * Re-using existing connection! (#0) with host google.com |
04:03:37 | <@JAA> | Oh |
04:03:39 | <@JAA> | lol |
04:04:03 | <@JAA> | It seems to break at a hidden <input> containing a CSRF token! |
04:04:13 | <fireonlive> | ooh! |
04:04:33 | <fireonlive> | is there a token? or is it trying to make one and just catastrophically failing |
04:04:33 | <@JAA> | Usually, the response ends with the submit button. |
04:04:40 | <@JAA> | Immediately after that button would be the hidden token. |
04:04:55 | <pabs> | so the token is the extra data? |
04:04:56 | <fireonlive> | ah! |
04:04:56 | <@JAA> | But I only get that (and the closing </form></body></html>) sometimes. |
04:05:33 | <nicolas17> | JAA: is this on truncated responses or on extra-data responses? |
04:05:36 | <pabs> | so something is choking depending on what token gets generated? |
04:06:06 | <pabs> | is it only the login page that gets this extra data? |
04:06:08 | <nicolas17> | server asking the overkill HSM to encrypt and sign the CSRF token and failing (?) |
04:06:17 | <@JAA> | Example of a complete response: https://transfer.archivete.am/inline/rXyZ0/openssl-s_client-answers.sap.com-login-page |
04:06:30 | <@JAA> | `printf '%s\r\n' 'GET /users/login.html HTTP/1.1' 'User-Agent: curl/7.38.0' 'Host: answers.sap.com' 'Accept: */*' '' | openssl s_client -connect answers.sap.com:443 -ign_eof` |
04:07:33 | <@JAA> | There's an extra 96 bytes at the end, which is exactly that CSRF input tag plus the closing tags. |
04:07:44 | <@JAA> | Those are not accounted for in the Content-Length. |
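The mismatch JAA describes (bytes on the wire beyond what Content-Length declares) can be measured from a raw capture. The following is a minimal sketch, assuming `raw` holds only the bytes of one non-chunked HTTP/1.1 response with any TLS handshake output already stripped; `content_length_excess` is a hypothetical helper and the synthetic response is made up for illustration.

```python
def content_length_excess(raw: bytes) -> int:
    """Return len(body) - declared Content-Length for one raw HTTP/1.1 response.

    Positive means extra data after the declared body, negative means truncation.
    Hypothetical helper; does not handle chunked transfer encoding.
    """
    head, sep, body = raw.partition(b"\r\n\r\n")
    if not sep:
        raise ValueError("no header/body separator found")
    declared = None
    # Skip the status line, then look for the Content-Length header.
    for line in head.split(b"\r\n")[1:]:
        name, _, value = line.partition(b":")
        if name.strip().lower() == b"content-length":
            declared = int(value.strip())
    if declared is None:
        raise ValueError("no Content-Length header")
    return len(body) - declared


# Synthetic example: the header declares 13 bytes, but 14 more bytes follow.
raw = (
    b"HTTP/1.1 200 OK\r\n"
    b"Content-Type: text/html\r\n"
    b"Content-Length: 13\r\n"
    b"\r\n"
    b"<form></form>"
    b"</body></html>"
)
print(content_length_excess(raw))
```

Under this reading, the login page capture discussed above would report an excess of 96.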
04:07:58 | <nicolas17> | tried twice, once the response ended at </html>, next time the response ended at value="Submit" /> |
04:08:09 | <nicolas17> | I didn't count bytes to see what the content-length included :P |
04:08:37 | <pabs> | does --ignore-content-length help? :) |
04:08:58 | <@JAA> | pabs: Not only the login page, I saw the same error elsewhere as well, all on 404 pages. |
04:09:12 | <@JAA> | Maybe it generates a token there as well for a 'report an error' thing. |
04:09:40 | <@JAA> | nicolas17: Extra data, I haven't looked at the truncation yet. |
04:10:11 | <@JAA> | The extra data makes qwarc very sad. The truncation just causes a small number of items to crash. |
04:10:12 | <pabs> | with --ignore-content-length I still see the truncation sometimes |
04:10:32 | <@JAA> | Yeah, try the OpenSSL command instead, it'll stall after the submit button randomly. |
04:10:37 | <pabs> | but --ignore-content-length fixes lack of body |
04:11:07 | <pabs> | some broken ass hit |
04:11:13 | <@JAA> | I can't ignore the Content-Length in qwarc anyway; that happens deep in a C library for HTTP parsing. |
04:11:25 | <pabs> | qwarc doesn't use curl? |
04:11:35 | <@JAA> | No |
04:11:38 | <@JAA> | aiohttp |
04:11:48 | | nullpeta joins |
04:12:28 | <@JAA> | I'm not aware of anyone having implemented WARC into curl. |
04:12:32 | <fireonlive> | nulldata you have competition |
04:13:36 | <@JAA> | Anyway, taking a bit of a break, then trying to fix the remaining issues and getting it properly started. |
04:13:42 | <nulldata> | :O |
04:14:21 | <@JAA> | First very rough size estimate is ~300 GiB. |
04:15:03 | <nullpeta> | Hi. The Japanese edition of Slashdot (https://srad.jp) will be shut down on 2024/01/31. Can anyone help archive it? |
04:16:07 | | pabs looks |
04:16:18 | <nullpeta> | shutdown notice: https://srad.jp/story/24/01/22/0311225/ |
04:16:28 | <pabs> | lots of subdomains... |
04:17:18 | <pabs> | looks like the article subdomains are just categories, so subdomain stories are on the main domain too |
04:18:31 | <pabs> | comments are very JS-y like slashdot |
04:18:38 | <nullpeta> | According to the closure notice, OSDN.net (a Japanese GitHub-like site) may also be closed. |
04:18:54 | <pabs> | fuck |
04:19:32 | <fireonlive> | oh no :( |
04:21:12 | | Island quits [Read error: Connection reset by peer] |
04:21:54 | <pabs> | nullpeta: started a job for srad.jp, see archivebot.com to watch it run |
04:22:24 | <pabs> | osdn.net was ultra-broken a while back, wonder how it is now |
04:23:18 | <nullpeta> | pabs: Thank you very much! |
04:23:45 | | Island joins |
04:23:53 | <pabs> | don't think I will do article subdomains, that will be duplication I think |
04:24:01 | <pabs> | not sure how to deal with comments either |
04:32:37 | <nullpeta> | pabs: Subdomains are just categories, so they should be accessible from the main domain. If a story has over 50 comments, the comments beyond the first 50 are loaded by JS later. ;( |
04:34:32 | <nullpeta> | To get all comments, we need to click the "すべてのコメントを取得" (Get all comments) button manually. |
04:34:39 | <fireonlive> | masterX244: do you know if there's a final/hard cutoff date set yet for discordcdn urls expiring/the signature&etc parameters being mandatory? |
04:38:56 | <pabs> | nullpeta: yeah, that's a POST request, which isn't archivable |
04:40:04 | <h2ibot> | Switchnode edited Deathwatch (+239, /* 2024 */ add srad.jp): https://wiki.archiveteam.org/?diff=51539&oldid=51537 |
04:40:41 | <pabs> | hmm, the site is timing out for me now :( |
04:40:46 | <pabs> | and also in AB |
04:40:55 | <pabs> | !d 6p7aqfevk41es3iuyywvw68a7 1800000 1800000 |
04:41:38 | <pabs> | nullpeta: re comments, they look enumerable https://srad.jp/comment/4597211 |
04:41:50 | <thuban> | pabs: wrong channel |
04:42:17 | <pabs> | yeah :) |
04:43:34 | <fireonlive> | site is down for me as well |
04:43:48 | <fireonlive> | oh there it goes |
04:43:50 | <fireonlive> | just very slow |
04:44:20 | <nulldata> | Yeah occasional 500s |
04:44:25 | <fireonlive> | the quote at the bottom was "人生unstable -- あるハッカー" which google converts to "Life is unstable -- a hacker" |
04:44:26 | <fireonlive> | suiting :) |
04:46:39 | <nullpeta> | pabs: So comments are archivable via https://srad.jp/comment/* URI? Good! |
04:47:12 | <nulldata> | Looks like OSDN's magazine site has been broken at least since November of last year. https://osdn.net/mag/ |
04:49:15 | <pabs> | sadly SWH won't be able to save OSDN git/hg/svn repos due to the domains having expired certs |
04:49:44 | <pabs> | posted about that to #swh (Libera) and #codearchiver |
04:49:53 | <pabs> | and escalated within SWH |
04:50:36 | <fireonlive> | 🤞 |
04:52:42 | <nullpeta> | srad.jp is running on a very old system (Perl-based?). I guess it is not strong enough to handle high load.;( |
04:53:26 | <fireonlive> | Server: Apache/1.3.42 (Debian) mod_gzip/1.3.26.1a mod_perl/1.31 |
04:53:38 | <fireonlive> | that sounds quite old |
04:54:02 | <fireonlive> | X-Fry: It's all there, in the macaroni. |
04:54:03 | <fireonlive> | X-Powered-By: Slash 2.005001 |
04:54:03 | <fireonlive> | ?? |
04:55:01 | <@JAA> | Whew, that's ancient, yeah. Apache 1.x was the old thing ~20 years ago. |
04:55:41 | | jasons quits [Ping timeout: 272 seconds] |
04:55:59 | <fireonlive> | oh wow yeah not even apache2 |
04:56:13 | <@JAA> | 1.3.42 is from early 2010. |
04:56:24 | <@JAA> | At least it's the last 1.x version, but yeah. |
04:57:07 | <fireonlive> | ah ye was just looking that up |
04:57:19 | <fireonlive> | surprised it lasted that long |
04:57:58 | <fireonlive> | mod_perl 1.0: Version 1.31 - May 11, 2009 (also the current version of 1.0);(2.0: mod_perl 2.0: Version 2.0.13 - October 21, 2023) |
05:03:01 | <@arkiver> | proactive archiving is not always something we can do due to size. |
05:03:22 | <fireonlive> | arkiver: referring to furaffinity? |
05:03:54 | <fireonlive> | (also, hi :3) |
05:05:11 | | nic90701 quits [Ping timeout: 272 seconds] |
05:05:12 | | nic9070 (nic) joins |
05:07:08 | <@JAA> | More SAP fuckery: on some URLs, I sometimes get a 302 to the login page and sometimes a 200. |
05:07:43 | <@JAA> | Just happened on https://answers.sap.com/content/sapmetadata/24301/software-product-function.html which is the redirect target of https://answers.sap.com/questions/690021/index.html |
05:08:52 | <@JAA> | I wonder whether I should retry when I get a login redirect. |
05:09:58 | <fireonlive> | hmm, seems to be a blank post |
05:10:06 | <fireonlive> | but... probably :/ |
05:12:51 | | BearFortress_ quits [Client Quit] |
05:14:47 | <nullpeta> | I found an OSDN-related doc which may help with archiving. shujisado is the former CEO of OSDN. https://gist.github.com/shujisado/2864e2475567fbbad8f8bacdb290d48a |
05:15:46 | <fireonlive> | has a comment from a minute ago too, very nice |
05:21:09 | | DogsRNice quits [Read error: Connection reset by peer] |
05:24:53 | | BearFortress joins |
05:30:41 | <pabs> | ah they have CVS too :( |
05:33:32 | <@JAA> | So the incomplete SAP responses are indeed truncated JSON. It simply 'forgets' to send the final }}}. |
05:34:07 | <@JAA> | But not with wrong Content-Length as in the other cases. This is chunked TE, and it sends the terminating zero-length chunk. |
05:35:01 | <@JAA> | I'd be surprised if I wasn't getting incomplete HTML as well. |
05:35:25 | <@JAA> | I'll retry if there is no </html> or if the JSON doesn't parse. |
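The retry rule JAA states here can be written as a small completeness check. This is a sketch under stated assumptions, not qwarc's actual code: `looks_complete` is a hypothetical name, the decision between JSON and HTML is made from the Content-Type string, and the HTML check is a plain substring test for the closing tag.

```python
import json


def looks_complete(body: bytes, content_type: str) -> bool:
    """Heuristic: JSON must parse, HTML must contain a closing </html> tag."""
    text = body.decode("utf-8", errors="replace")
    if "json" in content_type.lower():
        try:
            json.loads(text)
        except json.JSONDecodeError:
            return False
        return True
    return "</html>" in text.lower()


# A response missing its final closing braces should be retried.
print(looks_complete(b'{"a": {"b": {"c": 1', "application/json"))
print(looks_complete(b'{"a": {"b": {"c": 1}}}', "application/json"))
print(looks_complete(b"<html><body><form>", "text/html"))
```

A caller would retry whenever the function returns `False`, ideally with a cap on attempts so that a page which never completes cannot loop indefinitely.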
05:35:31 | <@JAA> | THIS IS SUCH FUN! |
05:36:11 | <fireonlive> | (╯°□°)╯︵ ┻━┻ |
05:37:59 | <@arkiver> | fireonlive: somewhat |
05:44:55 | <fireonlive> | ah ok :) |
05:47:42 | | fishingforsoup joins |
05:49:16 | <h2ibot> | Pokechu22 edited ISP Hosting (+447, LaCoocan): https://wiki.archiveteam.org/?diff=51540&oldid=51360 |
05:54:16 | <h2ibot> | Pokechu22 edited Deathwatch (+454, /* 2024 */ domain@nifty, not sure what actions…): https://wiki.archiveteam.org/?diff=51541&oldid=51539 |
05:57:59 | <@JAA> | lol, nearly every response I get is truncated... |
05:59:03 | | jasons (jasons) joins |
05:59:47 | <fireonlive> | x_x |
06:00:11 | <fireonlive> | 𝓺𝓾𝓪𝓵𝓲𝓽𝔂 |
06:08:48 | <@JAA> | I had a bug in the check, but I was indeed getting lots of truncated responses, especially on 404s. |
06:20:31 | | Arcorann (Arcorann) joins |
06:26:43 | <@JAA> | Yeah, I won't retry on 404s. It's just too ridiculous. |
06:28:31 | <@JAA> | 80+% of 404s get truncated. |
06:30:05 | <fireonlive> | oof yeah.. |
06:36:49 | | Island_ joins |
06:37:49 | <@JAA> | 15% of requests generate a warning, almost all of them about extra data after the response. |
06:38:56 | <@JAA> | Around 2k warnings per minute now. I think I'm on panel 7 or 8 of <this_is_fine.png>. |
06:39:15 | <nullpeta> | The former CEO of OSDN confirmed that OSDN.net will also close at the end of January. https://nitter.net/shujisado/status/1749300822691958969 |
06:39:36 | <@JAA> | If you don't want to math, I'm doing around 200 req/s. |
06:40:07 | <@JAA> | ETA if that holds up: 33 hours |
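The figures in the last few messages are consistent with each other, which a few lines of arithmetic can show. The inputs (200 req/s, 15% of requests producing a warning, 33 hours) are from the log; the remaining request count is only what those numbers imply, not something stated in the channel.

```python
rate_per_second = 200   # "around 200 req/s"
warning_percent = 15    # "15% of requests generate a warning"
eta_hours = 33          # "ETA if that holds up: 33 hours"

# Warnings per minute implied by the request rate and the warning share.
warnings_per_minute = rate_per_second * 60 * warning_percent // 100

# Requests still to be made if the rate holds for the whole ETA.
implied_remaining_requests = rate_per_second * 3600 * eta_hours

print(warnings_per_minute)
print(implied_remaining_requests)
```

That gives 1800 warnings per minute, matching "around 2k", and roughly 23.8 million requests left at that point.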
06:40:11 | | Island quits [Ping timeout: 272 seconds] |
06:53:49 | <fireonlive> | pabs: ^ |
06:54:45 | | jasons quits [Ping timeout: 272 seconds] |
06:56:01 | <pabs> | thanks. all the git/hg repos are being saved by SWH, and #codearchiver after SWH is done |
06:56:14 | <pabs> | the svn and CVS repos I'm not sure about how to find them |
06:56:29 | <pabs> | and the site is in AB but times out a fucking lot |
06:57:14 | <pabs> | the osdn_mirror_contents_url.md gist is interesting, but it looks like none of the mirrors allow enumeration of projects |
06:59:29 | <h2ibot> | Switchnode edited Deathwatch (+67, /* 2024 */ cleanup): https://wiki.archiveteam.org/?diff=51542&oldid=51541 |
07:14:27 | | Island_ quits [Read error: Connection reset by peer] |
07:18:09 | <nullpeta> | For srad.jp, https://srad.jp/journal/* and https://srad.jp/submission/* are also enumerable URIs. Could this be used for crawl seeds? |
07:22:46 | <pabs> | enumerable things can't be used as crawl seeds, you can only enumerate and save them |
07:23:00 | <pabs> | at least with archivebot right now |
07:24:28 | <pabs> | also, it looks like the main job is finding ~user/journal/1111 URLs, but not /journal/1111 URLs |
07:24:31 | <pabs> | nullpeta: ^ |
07:32:45 | <pabs> | arkiver JAA - I think we should save all of https://dotsrc.dl.osdn.net/osdn/ (alias of http://mirrors.dotsrc.org/osdn/) because OSDN is going down and http://osdn.dl.osdn.net/ is not enumerable |
07:33:15 | <nullpeta> | pabs: Thanks. So that means some non-AB crawls are needed to save all the comments, etc.... Hmmm. |
07:33:19 | <pabs> | (the rest of the site is mirrors of other stuff) |
07:33:52 | <pabs> | nullpeta: no, the enumeration is saving all the comments, see 7veb8mluv16mtw6ezn2jqmfbv in AB |
07:34:28 | <pabs> | the issue is that the enumeration won't save anything else except the comments |
07:34:45 | <pabs> | the main job is saving everything found from the front page |
07:34:55 | <pabs> | that is 6p7aqfevk41es3iuyywvw68a7 |
07:35:04 | <pabs> | in http://archivebot.com/ |
07:36:23 | <pabs> | hmm, maybe I could use https://dotsrc.dl.osdn.net/osdn/ to enumerate and then translate http://osdn.dl.osdn.net/ URLs |
07:38:01 | | BlueMaxima quits [Read error: Connection reset by peer] |
07:40:38 | <nullpeta> | pabs: I see, thanks for the explanation. |
07:49:47 | <@JAA> | SAP slowed down massively a bit ago, and now I got banned. We'll see how long it lasts. The 403s arrived *very* fast though. I peaked at almost 1k req/s. lol |
07:51:03 | | qwertyasdfuiopghjkl quits [Remote host closed the connection] |
07:57:36 | | jasons (jasons) joins |
08:10:38 | <nullpeta> | https://srad.jp/story/24/ This page has links to each month's page for 2024, its child pages have links to each day's page for that month, and each day's page has all the stories submitted for that day. Each year's page also has a link to the previous year's page. Hope this helps crawl all the stories. |
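Since ArchiveBot can save an enumerated list but not crawl from it, the date hierarchy nullpeta describes could be turned into a URL list ahead of time. The sketch below assumes that day pages live at `/story/YY/MM/DD/`, a pattern inferred from the year URL above and the story URL in the shutdown notice; it has not been verified against the site, and `day_index_urls` is a hypothetical helper.

```python
from datetime import date, timedelta

BASE = "https://srad.jp/story"


def day_index_urls(start: date, end: date) -> list[str]:
    """List assumed per-day index URLs from start to end, inclusive."""
    urls = []
    day = start
    while day <= end:
        # Assumed layout: two-digit year, month, and day as path segments.
        urls.append(f"{BASE}/{day:%y/%m/%d}/")
        day += timedelta(days=1)
    return urls


for url in day_index_urls(date(2024, 1, 22), date(2024, 1, 23)):
    print(url)
```

The earliest date to start from is not given in the log, so the range would have to be determined by following the previous-year links mentioned above.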
08:42:02 | | nullpeta leaves |
08:42:13 | | nullpeta joins |
09:26:39 | <c3manu> | nullpeta: the www.goodsmile.info server responds fairly slowly, so AB might not be able to grab everything in time. but i see for now the website (and its suspension announcement) are still up, so fingers crossed they don't just relaunch it immediately |
09:49:42 | | qwertyasdfuiopghjkl (qwertyasdfuiopghjkl) joins |
10:00:02 | | Bleo18260 quits [Client Quit] |
10:01:23 | | Bleo18260 joins |
10:22:06 | | tzt (tzt) joins |
10:43:23 | | jasons quits [Ping timeout: 272 seconds] |
11:03:26 | | decky joins |
11:03:28 | | nullpeta quits [Remote host closed the connection] |
11:03:29 | | qwertyasdfuiopghjkl quits [Remote host closed the connection] |
11:04:48 | | katia quits [Remote host closed the connection] |
11:06:11 | | decky_e quits [Ping timeout: 272 seconds] |
11:06:16 | | katia (katia) joins |
11:06:52 | | katia quits [Remote host closed the connection] |
11:07:41 | | katia (katia) joins |
11:08:36 | | katia quits [Remote host closed the connection] |
11:10:59 | | katia (katia) joins |
11:46:02 | | jasons (jasons) joins |
11:52:09 | | wessel15127 joins |
11:53:37 | | panicman45 joins |
11:53:41 | | wessel1512 quits [Ping timeout: 272 seconds] |
11:53:41 | | wessel15127 is now known as wessel1512 |
11:54:22 | | panicman45 quits [Remote host closed the connection] |
11:56:29 | | nullpeta joins |
11:58:50 | | tertu quits [Ping timeout: 240 seconds] |
12:02:26 | <nullpeta> | c3manu: Thank you for trying to archive goodsmile.info. According to the official website, the update has been postponed. As of now, the relaunch date has not been disclosed. |
12:12:42 | <ScenarioPlanet> | Maybe Spore backups should get their own collection to be mentioned on https://wiki.archiveteam.org/index.php/Spore ? |
12:18:34 | | lizardexile joins |
12:44:23 | | nullpeta quits [Remote host closed the connection] |
12:45:37 | | Arcorann quits [Ping timeout: 272 seconds] |
12:46:15 | | jasons quits [Ping timeout: 272 seconds] |
13:05:26 | <Pedrosso> | pabs: Did you convert the other lists into the static subdomain too? |
13:07:20 | <Pedrosso> | ScenarioPlanet: I think you're right. The wiki should note the AB job on spore.com as well as these new lists |
13:08:17 | | razul quits [Quit: Bye -] |
13:09:19 | <pabs> | Pedrosso: I only saw one spore job running in the AB dashboard |
13:09:59 | <pabs> | so I only replaced that one |
13:10:14 | | razul joins |
13:11:15 | <Pedrosso> | Alright, I'll update the other 5 lists (2 more png lists, 3 xml lists) to be static.spore.com |
13:12:13 | <pabs> | thanks |
13:12:15 | <Pedrosso> | The wiki should also mention said subdomain that I failed to recognize |
13:14:42 | <ScenarioPlanet> | It does |
13:15:06 | <ScenarioPlanet> | > Should be requested on static subdomain which uses CDN. |
13:16:53 | <Pedrosso> | If you're quoting from discord, I didn't notice that |
13:19:17 | <Pedrosso> | I overlooked it due to having /static/ in the url |
13:20:36 | <Pedrosso> | Thank you for the correction, this'll be much faster :) |
13:22:17 | <Pedrosso> | ooh |
13:22:30 | <Pedrosso> | You're quoting from the wiki nvm, they did say it on the discord server as well though |
13:22:48 | <ScenarioPlanet> | That's on https://wiki.archiveteam.org/index.php/Spore#:~:text=Should%20be%20requested%20on%20static%20subdomain%20which%20uses%20CDN |
13:22:54 | <Pedrosso> | What's that about postcards? |
13:23:22 | <ScenarioPlanet> | Postcards are an exception here |
13:23:40 | <ScenarioPlanet> | Their web pages (/view/postcard/) use www. |
13:23:45 | <Pedrosso> | I see |
13:24:07 | <Pedrosso> | Spore randomly throws .jpgs into the mix |
13:24:51 | <Pedrosso> | > Note that all the non-ASCII symbols will be replaced with ? in the response. |
13:24:51 | <Pedrosso> | Does this apply when viewing the item as well? |
13:28:09 | <ScenarioPlanet> | Not viewing, only REST requests |
13:29:12 | <Pedrosso> | Well, that's a shame. |
13:30:05 | <Pedrosso> | So, what of the REST service? If we'd be able to grab creation data (like creator, subscribers, comments) via those we'd be able to find all users and grab even more from there; notably sporecasts. |
13:30:05 | <Pedrosso> | How would REST deal with an AB job? |
13:30:40 | <ScenarioPlanet> | Mostly fine, if that's 1000ms |
13:30:52 | <Pedrosso> | shrug better than nothing |
13:31:04 | <ScenarioPlanet> | So we are not doing DDoS |
13:31:39 | <Pedrosso> | Wouldn't DDoS require c>1 ? |
13:32:06 | <ScenarioPlanet> | Yes, that's why it should be 1 or 2-3 at most |
13:32:10 | <Pedrosso> | I see |
13:32:36 | <Pedrosso> | DWR Interface, has anything more been done/researched regarding that since the wiki was last updated? |
13:32:52 | <ScenarioPlanet> | Also, we must not use REST as the only option to preserve creation metadata |
13:33:02 | <ScenarioPlanet> | ATOM and especially DWR too |
13:34:08 | <Pedrosso> | hm? |
13:34:09 | <ScenarioPlanet> | I could ask Kade to join Archiveteam's IRC. They know a lot about the DWR stuff. |
13:34:22 | <Pedrosso> | Yeah. But what did you mean just now? |
13:35:11 | <ScenarioPlanet> | We need to save ATOM responses too (see wiki page, "ATOM Feeds") |
13:35:31 | <Pedrosso> | So what you meant is that we should not solely save one? I agree |
13:35:37 | <ScenarioPlanet> | Sure |
13:36:09 | <Pedrosso> | Does DWR have something special, other than unicode? |
13:36:25 | <ScenarioPlanet> | Yes, POST requests |
13:37:06 | <Pedrosso> | I mean information-wise |
13:37:17 | <Pedrosso> | Something the others lack information-wise |
13:38:11 | <ScenarioPlanet> | It also holds adventure leaderboards, captain stats data and more |
13:38:39 | <Pedrosso> | Oooh, yeah that's good |
13:39:42 | <ScenarioPlanet> | Basically everything you can see on http://www.spore.com/sporepedia and can't find in any of the mentioned endpoints responses |
13:49:36 | | jasons (jasons) joins |
13:54:20 | | godane1 quits [Ping timeout: 240 seconds] |
14:03:49 | <Pedrosso> | I'd say yeah, ask Kade to join |
14:06:58 | <ScenarioPlanet> | Done |
14:23:00 | | onetruth quits [Read error: Connection reset by peer] |
14:34:09 | | godane (godane) joins |
14:50:23 | | jasons quits [Ping timeout: 272 seconds] |
14:52:41 | | TempleOfGoo joins |
14:54:16 | | TempleOfGoo88 joins |
14:57:57 | | TempleOfGoo quits [Ping timeout: 265 seconds] |
14:58:24 | <TempleOfGoo88> | Hello. I've been getting an error several times when trying to upload something. Can anyone help? |
15:04:17 | | AlsoHP_Archivist joins |
15:04:57 | | HP_Archivist quits [Ping timeout: 272 seconds] |
15:09:19 | <TempleOfGoo88> | This is the error I'm getting: |
15:09:22 | <TempleOfGoo88> | <?xml version='1.0' encoding='UTF-8'?><Error><Code>SlowDown</Code><Message>Please reduce your request rate.</Message><Resource>Your upload of goo-goo-dolls-all-that-you-are-live-at-the-late-late-show-19.12.2011 from username templeofgoo@gmail.com appears to be spam. If you believe this is a mistake, contact info@archive.org and include this entire |
15:09:23 | <TempleOfGoo88> | message in your email.</Resource><RequestId>6ee4aeaf-231a-4a93-bbb1-ace0055e9b4a</RequestId></Error> |
15:10:01 | | AlsoHP_Archivist quits [Ping timeout: 272 seconds] |
15:10:12 | | TempleOfGoo joins |
15:10:45 | | HP_Archivist (HP_Archivist) joins |
15:11:55 | | Wohlstand (Wohlstand) joins |
15:14:23 | | TempleOfGoo88 quits [Ping timeout: 265 seconds] |
15:19:40 | | AlsoHP_Archivist joins |
15:21:25 | | HP_Archivist quits [Ping timeout: 272 seconds] |
15:22:13 | | monoxane9 (monoxane) joins |
15:23:19 | | monoxane quits [Ping timeout: 272 seconds] |
15:23:19 | | monoxane9 is now known as monoxane |
15:23:35 | | TempleOfGoo is now known as TempleOfGoo88 |
15:23:51 | | AlsoHP_Archivist quits [Remote host closed the connection] |
15:24:17 | | AlsoHP_Archivist joins |
15:24:52 | <c3manu> | TempleOfGoo88: i've had good results with following the instructions in the error message. |
15:25:25 | <TempleOfGoo88> | Well, the error message just says to send an email |
15:25:59 | <TempleOfGoo88> | Which I did. I'll just wait and see |
15:27:36 | <c3manu> | you shouldn't have to wait longer than 24h in my experience. |
15:27:50 | <TempleOfGoo88> | (y) |
15:28:22 | <c3manu> | "archiveteam" and "archive.org" are separate entities, so bothering us wouldn't have helped either ;) |
15:28:54 | <TempleOfGoo88> | Oh ok. thanks |
15:29:04 | <c3manu> | np :) |
15:29:59 | | TempleOfGoo88 quits [Remote host closed the connection] |
15:30:17 | | pedantic-darwin quits [Ping timeout: 272 seconds] |
15:30:25 | | AlsoHP_Archivist quits [Read error: Connection reset by peer] |
15:30:40 | | HP_Archivist (HP_Archivist) joins |
15:31:27 | | pedantic-darwin joins |
15:53:42 | | jasons (jasons) joins |
15:56:15 | | yawkat quits [Ping timeout: 272 seconds] |
16:00:03 | | HP_Archivist quits [Ping timeout: 272 seconds] |
16:00:53 | | HP_Archivist (HP_Archivist) joins |
16:06:30 | | tertu (tertu) joins |
16:24:36 | <fireonlive> | "Terraform Labs files for bankruptcy: Terraform Labs, the company behind the Terra blockchain, has filed for bankruptcy. Its flagship product, the Terra stablecoin and associated LUNA token, failed spectacularly in May 2022." https://web3isgoinggreat.com/single/terraform-labs-files-for-bankruptcy |
16:24:42 | <fireonlive> | (via #web3) |
16:26:51 | <nulldata> | Was just about to ask if someone could throw https://www.terra.money/ and https://medium.com/terra-money/ into AB :P |
16:26:52 | | pedantic-darwin1 joins |
16:27:15 | | qwertyasdfuiopghjkl (qwertyasdfuiopghjkl) joins |
16:27:52 | <fireonlive> | :P |
16:29:04 | <fireonlive> | there's also "CFTC files complaint against Debiex platform for using "romance scam tactics" to steal $2.3 million https://web3isgoinggreat.com/single/debiex-cftc-complaint" - which lists their domains; but I can only find https://www.debiex.com/wap/ that still works; everything else they seem to have 404'd |
16:30:20 | | pedantic-darwin quits [Ping timeout: 240 seconds] |
16:30:20 | | pedantic-darwin1 is now known as pedantic-darwin |
16:30:28 | | kuro68k joins |
16:32:24 | <kuro68k> | Hi guys. I came here because I heard about srag.jp going offline at the end of the month. It's not listed in Warrior but I'd like to help archive it if I can. Is it possible to contribute to that job, ideally via Warrior? |
16:33:12 | <nulldata> | kuro68k - you mean srad.jp ? |
16:33:41 | | pedantic-darwin2 joins |
16:34:46 | <nulldata> | So far we're able to grab it with #ArchiveBot , so it's not a Warrior project |
16:35:07 | <nulldata> | You can check the progress on http://archivebot.com/?showNicks=1 |
16:37:20 | | pedantic-darwin quits [Ping timeout: 240 seconds] |
16:37:20 | | pedantic-darwin2 is now known as pedantic-darwin |
16:37:23 | <fireonlive> | threw in terra's two urls there (medium i had to tack on `archive` on the end) |
16:45:47 | | razul quits [Client Quit] |
16:46:44 | | razul joins |
16:57:21 | <@JAA> | Still banned from SAP |
16:57:26 | <@JAA> | All this effort for that... |
16:59:33 | <kuro68k> | Yes, srad.jp, sorry typo |
16:59:46 | <kuro68k> | So does ArchiveBot mean you don't need any help? |
16:59:49 | <fireonlive> | :| |
17:00:49 | <kuro68k> | I worry about slashdot.org too. It's a hard site to archive properly, in a way that preserves all the links and conversations. I tried archiving it once, the results were not stellar. |
17:01:04 | | razul quits [Client Quit] |
17:02:13 | | razul joins |
17:06:56 | | yonerboner joins |
17:09:05 | | Wohlstand quits [Ping timeout: 272 seconds] |
17:15:57 | | kuro68k quits [Remote host closed the connection] |
17:26:33 | <@arkiver> | JAA: do we need anything Warrior for OSDN? |
17:28:03 | <@arkiver> | pabs: i put https://dotsrc.dl.osdn.net/osdn/ in AB |
17:28:14 | <@arkiver> | looking further into it as well |
17:32:07 | | lizardexile quits [Client Quit] |
17:35:23 | <@JAA> | arkiver: Most of the discussion has been happening in #codearchiver. |
17:35:38 | <@arkiver> | ah |
17:41:37 | | icedice quits [Client Quit] |
17:42:48 | | nicey666 joins |
17:43:02 | | nicey666 quits [Remote host closed the connection] |
17:50:32 | | Megame (Megame) joins |
18:00:11 | | rohvani joins |
18:03:59 | <@JAA> | I don't know what we can do about SAP Q&A. |
18:04:05 | <@JAA> | I suspect the ban is manual. |
18:05:52 | <@JAA> | arkiver: Do you want to try a DPoS? It'd have to start *very* soon. They intend to finish their migration (which I'm sure will lose data given it's SAP) on Wed. |
18:06:58 | <@JAA> | There is a lot of weirdness in this one. Incomplete responses, wrong Content-Length headers, etc. I intend to document it all on the wiki. |
18:42:50 | | jasons quits [Ping timeout: 240 seconds] |
18:46:29 | <ScenarioPlanet> | What should i do with these? https://storage.scenariopla.net/explorer_fRbGKbbgLh_2024-01-22_20.45.50.903-1705949150.png |
18:46:43 | <ScenarioPlanet> | (that's like 30% of everything) |
18:57:40 | <@arkiver> | ScenarioPlanet: what is this? |
19:14:57 | | Wohlstand (Wohlstand) joins |
19:17:59 | | Naruyoko5 quits [Client Quit] |
19:18:58 | | myself4 is now known as myself |
19:30:55 | | guest324 joins |
19:31:20 | | imer quits [Quit: Oh no] |
19:31:45 | | guest324 quits [Remote host closed the connection] |
19:32:33 | | imer (imer) joins |
19:42:06 | | imer quits [Client Quit] |
19:45:16 | | imer (imer) joins |
19:46:38 | | jasons (jasons) joins |
19:48:31 | <ScenarioPlanet> | arkiver: lists of full-size images (wordpress+drupal+google+wix) |
19:51:51 | | BearFortress quits [Ping timeout: 272 seconds] |
20:00:46 | <fireonlive> | so we don't just have the thumbnails from the AB jobs |
20:00:48 | <fireonlive> | AIUI |
20:10:04 | | beastbg8 joins |
20:20:38 | | Naruyoko joins |
20:21:04 | | Naruyoko5 joins |
20:25:09 | <beastbg8> | Hello. I would like to bring something to your attention. I hope I'm in the right place. One month from today, the largest local video portal in Bulgaria circa 2006, VBOX7 (very similar to the Hungarian "videa.hu"), is about to "hide" all user-uploaded content, which according to their devs is over 14M videos. They already did that last |
20:25:09 | <beastbg8> | week, but opened the gates one final time after social media outrage. It contains a lot of rare, nowhere-to-be-found media, specifically concerning Bulgaria, and quite a lot of otherwise "lost" media (foreign movies, TV series) that survives only there, either with a dub or not. Is there something that can be done in such a narrow time frame? |
20:25:25 | | Naruyoko quits [Ping timeout: 272 seconds] |
20:27:19 | <beastbg8> | Currently these videos can only be accessed with a Bulgarian IP. Only "partnered" videos (the only content that will remain on the site starting 22 February 2024) can be watched from abroad. |
20:29:19 | <@JAA> | These appear to be the relevant announcements: https://blog.vbox7.com/video-access/ https://blog.vbox7.com/blagodarim-vi/ |
20:29:59 | <@JAA> | The georestriction is going to be a pain. |
20:35:20 | | BlueMaxima joins |
20:36:04 | <h2ibot> | JustAnotherArchivist edited Deathwatch (+428, /* 2024 */ Add Vbox7): https://wiki.archiveteam.org/?diff=51543&oldid=51542 |
20:43:38 | <nulldata> | Example of a working video in the US: https://www.vbox7.com/play:2e08276728 -> (player loads this which links to a mpd file, the URLs of which seem to be static - or at least didn't change when accessing via a different connection and computer) https://www.vbox7.com/aj/player/item/options?vid=2e08276728 |
20:45:24 | <nulldata> | And here's one that's not allowed in US. https://www.vbox7.com/play:b312ac1a7a -> https://www.vbox7.com/aj/player/item/options?vid=b312ac1a7a |
20:47:20 | | jasons quits [Ping timeout: 240 seconds] |
20:55:35 | <beastbg8> | Currently yt-dlp's extractor for VBOX7 is broken (it only extracts mpds from some videos), but a friend fixed it a while ago. Providing it with their consent. Please take a look. https://pastebin.com/raw/300v4NwC (vbox7.py) |
20:55:51 | <nulldata> | No geo restriction on the video server itself, it seems - if you have a Bulgarian connection to grab the URL from the API response, it'll download with no problem on a US connection |
20:56:03 | <beastbg8> | yep |
20:56:40 | <nulldata> | The API gives an mpd file, but you can change the extension to m3u8 and play it with VLC. https://edge211.vbox7.com/sl/1iswCSTN-zXpz-8oOGnlrg/1706133600/b3/b312ac1a7a/b312ac1a7a.m3u8 |
20:58:36 | | Barto quits [Client Quit] |
20:58:44 | <nicolas17> | oh joy they're using byte ranges |
20:58:54 | <nicolas17> | so it's not 3 million tiny files with 5 seconds of video each |
20:59:16 | | Barto (Barto) joins |
21:00:54 | <@JAA> | And DASH and HLS both reference the same MP4 file as well! |
21:02:36 | <nulldata> | Oh yeah I should've looked harder - the mpd file specifies the mp4 files in the BaseURL node. https://edge211.vbox7.com/sl/1iswCSTN-zXpz-8oOGnlrg/1706133600/b3/b312ac1a7a/b312ac1a7a_480_track1_dash.mp4 |
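The BaseURL lookup described above can be sketched roughly as follows. This is a minimal illustration, not the site's actual manifest: `SAMPLE_MPD`, the `media.example.invalid` host, and the `base_urls` helper are all made up for the example; only the standard DASH namespace and the BaseURL element are taken as given.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

# Standard namespace used by MPEG-DASH manifests.
DASH_NS = "urn:mpeg:dash:schema:mpd:2011"

# Hypothetical, heavily trimmed manifest; a real one carries more attributes.
SAMPLE_MPD = """<MPD xmlns="urn:mpeg:dash:schema:mpd:2011">
  <Period>
    <AdaptationSet>
      <Representation id="480">
        <BaseURL>example_480_track1_dash.mp4</BaseURL>
      </Representation>
    </AdaptationSet>
  </Period>
</MPD>"""


def base_urls(mpd_text, manifest_url):
    """Return every BaseURL in the manifest, resolved against the manifest URL."""
    root = ET.fromstring(mpd_text)
    return [
        urljoin(manifest_url, element.text.strip())
        for element in root.iter(f"{{{DASH_NS}}}BaseURL")
        if element.text
    ]


# Placeholder manifest location (assumption, not a real vbox7 URL).
manifest_url = "https://media.example.invalid/b3/example/example.mpd"
media_files = base_urls(SAMPLE_MPD, manifest_url)
print(media_files)
```

Since DASH and HLS reportedly point at the same MP4, resolving BaseURL once per video should be enough to know what to download.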
21:07:03 | <nulldata> | Question becomes - is there a nice way to enumerate all valid video ids lol |
21:09:33 | <@JAA> | 424k requests per second could bruteforce the trillion possible [0-9a-f]{10} IDs in 30 days. That's not going to happen, but it's less unreasonable than I expected. |
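The estimate above can be reproduced with a few lines of arithmetic. A minimal sketch, assuming the ID space is exactly [0-9a-f]{10} and a 30-day window as stated; the constant names are illustrative only.

```python
# Back-of-the-envelope check of the brute-force rate quoted above.
ID_LENGTH = 10       # characters per video ID
ALPHABET_SIZE = 16   # hex digits 0-9a-f
WINDOW_DAYS = 30     # time available for the sweep

total_ids = ALPHABET_SIZE ** ID_LENGTH
window_seconds = WINDOW_DAYS * 24 * 60 * 60
required_rate = total_ids / window_seconds

print(f"{total_ids:,} possible IDs")
print(f"{required_rate:,.0f} requests per second needed")
```

This counts one request per candidate ID; any retries or extra per-video requests would raise the figure.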
21:10:46 | <nicolas17> | and how do you go from video ID to mpd URL? |
21:11:14 | | griz joins |
21:13:58 | <nicolas17> | hmm weird |
21:14:08 | <nicolas17> | you posted an mpd/m3u8 for b312ac1a7a |
21:14:23 | <nicolas17> | but https://www.vbox7.com/ajax/video/nextvideo.php?vid=b312ac1a7a returns a direct .mp4 instead of an mpd manifest |
21:14:53 | <griz> | old ones are mostly mp4 |
21:15:15 | <griz> | some are only flv |
21:15:24 | <nulldata> | nicolas17 - I think the only way would be finding someone with a legit connection, or VPN, dedicated to grabbing the URLs from the API to feed to a tracker. The grabbers could be on any connection. |
21:16:03 | <nicolas17> | but where did you even get that mpd if the API returns mp4? |
21:16:11 | <nulldata> | That's because of the geoblock - if you access via Bulgaria it returns the mpd link |
21:16:12 | <h2ibot> | Pokechu22 edited List of website hosts (+471, XFree / Thin Cloud for Free): https://wiki.archiveteam.org/?diff=51544&oldid=51508 |
21:16:23 | <nicolas17> | oh |
21:16:36 | <nicolas17> | I didn't realize that .mp4 was the "not available" placeholder >_> |
21:22:01 | <griz> | direct mp4 over https is fastest route.. then HLS & MPD |
21:28:39 | <@JAA> | I mean, we don't need to refetch the video for HLS and MPD since it's all the same MP4 file behind it. |
21:32:09 | | Island joins |
21:37:45 | | sec^nd quits [Ping timeout: 255 seconds] |
21:40:46 | | sec^nd (second) joins |
21:42:11 | | sec^nd quits [Remote host closed the connection] |
21:42:44 | | sec^nd (second) joins |
21:51:15 | | jasons (jasons) joins |
21:54:37 | | Island quits [Read error: Connection reset by peer] |
21:56:09 | | Island joins |
22:00:44 | | ThetaDev quits [Quit: https://quassel-irc.org - Chat comfortably. Anywhere.] |
22:00:52 | <ThreeHM> | I'm able to bypass the georestriction by adding an X-Forwarded-For header with a Bulgarian IP to the API request |
22:00:56 | | ThetaDev joins |
22:01:07 | <project10> | lol |
22:02:00 | <fireonlive> | 😏 |
22:03:29 | <ThreeHM> | Even works if I use the IP that's behind vbox7.com |
22:03:31 | <ThreeHM> | lol |
22:04:01 | <Barto> | someone screwed their reverse proxy config |
22:04:51 | | nulldata quits [Ping timeout: 272 seconds] |
22:05:43 | <fireonlive> | excellent |
22:06:10 | <griz> | X-Forwarded-For is also used in the yt-dlp script, that's how it works |
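The header trick described above might look like this when building the API request. A sketch only: the endpoint is the player options URL posted earlier in the log, while `PLACEHOLDER_IP` is a documentation address (TEST-NET-3), not a Bulgarian one, and `build_options_request` is a hypothetical helper. The code builds the request but does not send it.

```python
import urllib.request
from urllib.parse import urlencode

# Player options endpoint as posted earlier in the channel.
OPTIONS_ENDPOINT = "https://www.vbox7.com/aj/player/item/options"

# Placeholder from the TEST-NET-3 documentation range; this is NOT a Bulgarian
# address and would have to be swapped for one for the bypass to apply.
PLACEHOLDER_IP = "203.0.113.7"


def build_options_request(video_id, forwarded_ip=PLACEHOLDER_IP):
    """Build (but do not send) an options request carrying X-Forwarded-For."""
    url = f"{OPTIONS_ENDPOINT}?{urlencode({'vid': video_id})}"
    request = urllib.request.Request(url)
    request.add_header("X-Forwarded-For", forwarded_ip)
    return request


request = build_options_request("b312ac1a7a")
print(request.full_url)
print(request.header_items())
```

Passing the result to `urllib.request.urlopen` would perform the actual fetch; that step is left out here on purpose.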
22:07:34 | <Barto> | :-) I wonder if the more 'recent' Forwarded header works |
22:10:21 | <beastbg8> | made an article on the wiki |
22:10:22 | <kiska> | When warrior go brrr? :D |
22:10:34 | <kiska> | Seems like that'll be patched out soon |
22:11:17 | | nulldata (nulldata) joins |
22:13:10 | <beastbg8> | i doubt they're checking their code base that closely |
22:14:15 | <beastbg8> | there's a large hole in their subtitle writing functionality, where videos can be obtained even if hidden, but it needs a user account |
22:14:34 | <@JAA> | Nice |
22:14:39 | <@JAA> | So how do we find video IDs? |
22:14:48 | <kiska> | Bruteforce? |
22:14:52 | <beastbg8> | https://www.vbox7.com/subrec:db089cbeef |
22:14:58 | <@JAA> | There's a trillion of them, kiska. |
22:15:09 | <@JAA> | 16^10 |
22:15:17 | <kiska> | Seems more doable than something like youtube :D |
22:15:34 | <griz> | the android app is interesting too, mobile api is very extensive |
22:15:44 | <kiska> | I suppose we could do discovery if they have "recommended" on the side |
22:15:46 | <beastbg8> | <div class="container"> |
22:16:00 | <@JAA> | We're not going to do 424k req/s for a month though. lol |
22:16:32 | <@JAA> | But yeah, as I wrote above, I would've expected it to be worse. |
22:16:36 | <kiska> | I'd doubt they let us do that for a month :D |
22:17:24 | <beastbg8> | one way is to find all user accounts via search keywords and scrape their channels |
22:18:07 | <beastbg8> | but that's kinda impractical |
22:18:19 | <ThreeHM> | They have recommendations, but we'd have to check if that gives you geoblocked videos |
22:19:55 | <griz> | top 100 videos of all time (v3 & v4 api only): https://api.vbox7.com/v4/?action=r_video_top100&app_token=vbox7_android_1.0.0_gT13mY |
22:19:59 | <kiska> | Do they have a sitemap :D |
22:20:22 | <griz> | https://api.vbox7.com/v5/?action=r_channels&app_token=vbox7_android_1.0.0_gT13mY |
22:20:35 | <griz> | random links from mobile app |
22:20:40 | <beastbg8> | also the subrec method (which will likely stay post feb 22) does not support .flv videos |
22:21:00 | <beastbg8> | it can be used as a last resort |
22:21:14 | <beastbg8> | btw hidden videos still retain their metadata |
22:21:23 | <beastbg8> | assuming you have the URL |
22:23:53 | <beastbg8> | https://www.vbox7.com/play:4a81cafe81 here is one such video. no playback endpoints but title, thumbnail, comments etc are all there |
22:24:29 | <beastbg8> | all videos were like this yesterday |
22:24:39 | <nicolas17> | crawl recommended videos recursively |
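Recursive discovery through recommendations amounts to a breadth-first traversal of the "recommended" graph. A minimal sketch, assuming some function can return the recommended IDs for a given video; `fetch_recommended` is a stand-in for that, and the toy graph and its IDs are invented for the example.

```python
from collections import deque


def discover_ids(seed_ids, fetch_recommended):
    """Breadth-first walk over recommendations, returning every ID reached."""
    seen = set(seed_ids)
    queue = deque(seed_ids)
    while queue:
        current = queue.popleft()
        for recommended in fetch_recommended(current):
            if recommended not in seen:
                seen.add(recommended)
                queue.append(recommended)
    return seen


# Toy recommendation graph with made-up IDs standing in for real lookups.
TOY_GRAPH = {
    "aaaaaaaaaa": ["bbbbbbbbbb", "cccccccccc"],
    "bbbbbbbbbb": ["aaaaaaaaaa", "dddddddddd"],
    "cccccccccc": [],
    "dddddddddd": ["bbbbbbbbbb"],
}

found = discover_ids(["aaaaaaaaaa"], lambda vid: TOY_GRAPH.get(vid, []))
print(sorted(found))
```

The traversal only reaches videos connected to the seeds, so it would complement, not replace, user-page and tag-based discovery.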
22:26:20 | <nstrom|m> | user pages also seem to have all the videos a user posted, so that probably helps w discovery a bit too. eg. https://www.vbox7.com/user:wochit_news has 626 pages of videos |
22:26:39 | <nstrom|m> | search by tag potentially useful https://www.vbox7.com/tag:video?ord=date&period=day&order=rate&page=20 |
22:26:59 | <griz> | mobile api example: https://api.vbox7.com/v5/?action=r_user_videos&username=wochit_news&app_token=vbox7_android_1.0.0_gT13mY |
22:27:17 | <griz> | though the api bugs out after about 1000 results |
22:31:44 | <griz> | mobile api search video by query example: https://api.vbox7.com/v4/?action=r_video_search&query=preslava&page=1&app_token=vbox7_android_1.0.0_gT13mY |
22:50:20 | | jasons quits [Ping timeout: 240 seconds] |
22:50:31 | | griz leaves |
22:54:33 | <beastbg8> | if videos in subrec mode give 404 error, renaming the extension from .mp4 (they're always passed as .mp4) to .flv might work, though not always |
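The extension fallback described above could be expressed as an ordered list of URLs to try. A sketch under the assumption that only the file extension differs between the two variants; `candidate_urls` and the example host are hypothetical, and as noted the .flv variant is not guaranteed to exist.

```python
def candidate_urls(media_url):
    """Return the URL as given, followed by an .flv variant if it ends in .mp4."""
    candidates = [media_url]
    if media_url.endswith(".mp4"):
        candidates.append(media_url[: -len(".mp4")] + ".flv")
    return candidates


# Placeholder URL (assumption); real subrec media URLs will look different.
urls_to_try = candidate_urls("https://media.example.invalid/db/example.mp4")
print(urls_to_try)
```

A downloader would request each candidate in order and stop at the first one that does not return 404.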
23:38:20 | | pabs quits [Ping timeout: 240 seconds] |
23:39:32 | | pabs (pabs) joins |
23:40:40 | | missaustraliana joins |
23:43:41 | <h2ibot> | Missaustraliana edited Deathwatch (+4, Add 'was ' to Studio 10 as the time has passed.): https://wiki.archiveteam.org/?diff=51545&oldid=51543 |
23:43:57 | <missaustraliana> | yuh\ |
23:50:37 | | godane quits [Ping timeout: 272 seconds] |
23:53:47 | | jasons (jasons) joins |
23:59:17 | | nautical_jesus joins |